* [PATCHSET] block: fixes for long-standing issues
From: Tejun Heo @ 2012-04-19 23:29 UTC
  To: axboe; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel

Hello,

This patchset fixes two long-standing issues and one relatively new
css ref leak.

a. elvpriv alloc failure, including ioc and icq failures, fails the
   whole request allocation.  As those allocations aren't mempool
   backed and may fail indefinitely, this can lead to deadlock under
   memory pressure (see the mempool sketch below).

b. blkgs don't have proper indexing.  With a large enough number of
   request_queues and blk-throttle enabled, the block layer can spend
   a considerable number of CPU cycles walking the same list over and
   over again.

c. __blkg_lookup_create() was leaking a css ref on its failure path.
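
To illustrate what "mempool backed" buys, here is a minimal sketch
using the standard mempool API (the names are illustrative only, not
code from this patchset): a mempool keeps a reserve of pre-allocated
elements, so mempool_alloc() with a sleeping gfp mask waits for an
element to be freed back instead of failing - exactly the
forward-progress property the ioc/icq/elvpriv allocations lack.

#include <linux/mempool.h>
#include <linux/slab.h>

struct foo { int data; };		/* hypothetical payload */
static struct kmem_cache *foo_cache;
static mempool_t *foo_pool;

static int __init foo_pool_init(void)
{
	foo_cache = KMEM_CACHE(foo, 0);
	if (!foo_cache)
		return -ENOMEM;
	/* reserve 4 elements to fall back on when slab alloc fails */
	foo_pool = mempool_create_slab_pool(4, foo_cache);
	return foo_pool ? 0 : -ENOMEM;
}

/* may sleep under memory pressure but won't fail indefinitely */
static struct foo *foo_get(void)
{
	return mempool_alloc(foo_pool, GFP_NOIO);
}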

This patchset contains the following four patches.

 0001-block-collapse-blk_alloc_request-into-get_request.patch
 0002-block-fix-elvpriv-allocation-failure-handling.patch
 0003-blkcg-fix-blkcg-css-ref-leak-in-__blkg_lookup_create.patch
 0004-blkcg-use-radix-tree-to-index-blkgs-from-blkcg.patch

0001-0002 fix #a.  0003 fixes #c.  0004 fixes #b.

This patchset is on top of

  block/for-3.5/core 5bc4afb1ec "blkcg: drop BLKCG_STAT_{PRIV|POL|OFF} macros"
+ [1] [PATCHSET] block: per-queue policy activation, take#2
+ [2] [PATCHSET] block: cosmetic updates to blkcg API

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-misc-fixes

diffstat follows.

 block/blk-cgroup.c |   71 ++++++++++++++++++++++++++++++++-----------
 block/blk-cgroup.h |    6 +++
 block/blk-core.c   |   87 ++++++++++++++++++++++++++++-------------------------
 3 files changed, 106 insertions(+), 58 deletions(-)

Thanks.

--
tejun

[1] https://lkml.org/lkml/2012/4/13/380
[2] http://www.spinics.net/lists/cgroups/msg01748.html

* [PATCH 1/4] block: collapse blk_alloc_request() into get_request()
From: Tejun Heo @ 2012-04-19 23:29 UTC
  To: axboe; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel, Tejun Heo

Allocation failure handling in get_request() is about to be updated.
To ease the update, collapse blk_alloc_request() into get_request().

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c |   46 +++++++++++++++++-----------------------------
 1 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3b02ba3..f6f68b0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -719,33 +719,6 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 	mempool_free(rq, q->rq.rq_pool);
 }
 
-static struct request *
-blk_alloc_request(struct request_queue *q, struct bio *bio, struct io_cq *icq,
-		  unsigned int flags, gfp_t gfp_mask)
-{
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
-
-	if (!rq)
-		return NULL;
-
-	blk_rq_init(q, rq);
-
-	rq->cmd_flags = flags | REQ_ALLOCED;
-
-	if (flags & REQ_ELVPRIV) {
-		rq->elv.icq = icq;
-		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
-			return NULL;
-		}
-		/* @rq->elv.icq holds on to io_context until @rq is freed */
-		if (icq)
-			get_io_context(icq->ioc);
-	}
-
-	return rq;
-}
-
 /*
  * ioc_batching returns true if the ioc is a valid batching request and
  * should be given priority access to a request.
@@ -968,10 +941,25 @@ retry:
 			goto fail_alloc;
 	}
 
-	rq = blk_alloc_request(q, bio, icq, rw_flags, gfp_mask);
-	if (unlikely(!rq))
+	/* allocate and init request */
+	rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	if (!rq)
 		goto fail_alloc;
 
+	blk_rq_init(q, rq);
+	rq->cmd_flags = rw_flags | REQ_ALLOCED;
+
+	if (rw_flags & REQ_ELVPRIV) {
+		rq->elv.icq = icq;
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
+			mempool_free(rq, q->rq.rq_pool);
+			goto fail_alloc;
+		}
+		/* @rq->elv.icq holds on to io_context until @rq is freed */
+		if (icq)
+			get_io_context(icq->ioc);
+	}
+
 	/*
 	 * ioc may be NULL here, and ioc_batching will be false. That's
 	 * OK, if the queue is under the request limit then requests need
-- 
1.7.7.3


* [PATCH 2/4] block: fix elvpriv allocation failure handling
From: Tejun Heo @ 2012-04-19 23:29 UTC
  To: axboe; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel, Tejun Heo

Request allocation is mempool backed to guarantee forward progress
under memory pressure; unfortunately, this property got broken while
adding elvpriv data.  Failures during elvpriv allocation, including
ioc and icq creation failures, currently make get_request() fail as a
whole.  There's no forward-progress guarantee for these allocations -
they may fail indefinitely under memory pressure, stalling IO and
deadlocking the system.

This patch updates get_request() such that an elvpriv allocation
failure doesn't make the whole function fail.  If elvpriv allocation
fails, the allocation is degraded into !ELVPRIV.  This forces the
request to ELEVATOR_INSERT_BACK, disturbing scheduling, but elvpriv
alloc failures should be rare (none of these allocations is
per-request) and anything is better than deadlocking.
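
The crux of the fix is the new fail_elvpriv path.  Condensed here
from the diff below, with explanatory comments added:

fail_elvpriv:
	/* ioc/icq/elvpriv aren't mempool backed - degrade the request
	 * rather than fail the whole allocation */
	rq->cmd_flags &= ~REQ_ELVPRIV;	/* no iosched-private data */
	rq->elv.icq = NULL;

	spin_lock_irq(q->queue_lock);
	rl->elvpriv--;			/* undo the earlier accounting */
	spin_unlock_irq(q->queue_lock);
	goto out;			/* continue as a !ELVPRIV request */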

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c |   53 ++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f6f68b0..6cf13df 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -29,6 +29,7 @@
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
+#include <linux/ratelimit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -930,17 +931,6 @@ retry:
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	/* create icq if missing */
-	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
-		create_io_context(gfp_mask, q->node);
-		ioc = rq_ioc(bio);
-		if (!ioc)
-			goto fail_alloc;
-		icq = ioc_create_icq(ioc, q, gfp_mask);
-		if (!icq)
-			goto fail_alloc;
-	}
-
 	/* allocate and init request */
 	rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 	if (!rq)
@@ -949,17 +939,28 @@ retry:
 	blk_rq_init(q, rq);
 	rq->cmd_flags = rw_flags | REQ_ALLOCED;
 
+	/* init elvpriv */
 	if (rw_flags & REQ_ELVPRIV) {
-		rq->elv.icq = icq;
-		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
-			goto fail_alloc;
+		if (unlikely(et->icq_cache && !icq)) {
+			create_io_context(gfp_mask, q->node);
+			ioc = rq_ioc(bio);
+			if (!ioc)
+				goto fail_elvpriv;
+
+			icq = ioc_create_icq(ioc, q, gfp_mask);
+			if (!icq)
+				goto fail_elvpriv;
 		}
-		/* @rq->elv.icq holds on to io_context until @rq is freed */
+
+		rq->elv.icq = icq;
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask)))
+			goto fail_elvpriv;
+
+		/* @rq->elv.icq holds io_context until @rq is freed */
 		if (icq)
 			get_io_context(icq->ioc);
 	}
-
+out:
 	/*
 	 * ioc may be NULL here, and ioc_batching will be false. That's
 	 * OK, if the queue is under the request limit then requests need
@@ -972,6 +973,24 @@ retry:
 	trace_block_getrq(q, bio, rw_flags & 1);
 	return rq;
 
+fail_elvpriv:
+	/*
+	 * elvpriv init failed.  ioc, icq and elvpriv aren't mempool backed
+	 * and may fail indefinitely under memory pressure and thus
+	 * shouldn't stall IO.  Treat this request as !elvpriv.  This will
+	 * disturb iosched and blkcg but weird is better than dead.
+	 */
+	printk_ratelimited(KERN_WARNING "%s: request aux data allocation failed, iosched may be disturbed\n",
+			   dev_name(q->backing_dev_info.dev));
+
+	rq->cmd_flags &= ~REQ_ELVPRIV;
+	rq->elv.icq = NULL;
+
+	spin_lock_irq(q->queue_lock);
+	rl->elvpriv--;
+	spin_unlock_irq(q->queue_lock);
+	goto out;
+
 fail_alloc:
 	/*
 	 * Allocation failed presumably due to memory. Undo anything we
-- 
1.7.7.3


* [PATCH 3/4] blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
From: Tejun Heo @ 2012-04-19 23:29 UTC
  To: axboe; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel, Tejun Heo

__blkg_lookup_create() leaked a blkcg->css reference when blkg
allocation failed.  Fix it.
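
The invariant being restored, condensed from the diff below: every
successful css_tryget() must be paired with a css_put() on every
exit path, including allocation failure.

	/* blkg holds a reference to blkcg */
	if (!css_tryget(&blkcg->css))
		return ERR_PTR(-EINVAL);

	blkg = blkg_alloc(blkcg, q);
	if (unlikely(!blkg)) {
		css_put(&blkcg->css);	/* the put that was leaking */
		return ERR_PTR(-ENOMEM);
	}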

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   19 +++++++++----------
 1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8228385..30a7a9c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -174,6 +174,7 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct blkcg_gq *blkg;
+	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -186,24 +187,22 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	if (!css_tryget(&blkcg->css))
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * Allocate and initialize.
-	 */
+	/* allocate */
+	ret = -ENOMEM;
 	blkg = blkg_alloc(blkcg, q);
-
-	/* did alloc fail? */
-	if (unlikely(!blkg)) {
-		blkg = ERR_PTR(-ENOMEM);
-		goto out;
-	}
+	if (unlikely(!blkg))
+		goto err_put;
 
 	/* insert */
 	spin_lock(&blkcg->lock);
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
 	spin_unlock(&blkcg->lock);
-out:
 	return blkg;
+
+err_put:
+	css_put(&blkcg->css);
+	return ERR_PTR(ret);
 }
 
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-- 
1.7.7.3


* [PATCH 4/4] blkcg: use radix tree to index blkgs from blkcg
From: Tejun Heo @ 2012-04-19 23:29 UTC
  To: axboe; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel, Tejun Heo

blkg lookup is currently performed by traversing a linked list
anchored at blkcg->blkg_list.  This doesn't scale at all; with
blk-throttle enabled and enough request_queues on the system, it can
get very ugly quickly (blk-throttle performs a lookup on every bio
submission).

This patch makes blkcg use a radix tree to index blkgs, combined with
a simple last-looked-up hint.  This is mostly identical to how icqs
are indexed from an ioc.

Note that because __blkg_lookup() may be invoked without holding the
queue lock, the hint is only updated from __blkg_lookup_create().
Due to cfq's cfqq caching, this makes hint updates overly lazy.  This
will be improved with the scheduled blkcg-aware request allocation.
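
The resulting lookup fast path, condensed from __blkg_lookup() in the
diff below (the function name here is just a label for the outline):

static struct blkcg_gq *lookup_outline(struct blkcg *blkcg,
				       struct request_queue *q)
{
	struct blkcg_gq *blkg;

	/* 1. last-looked-up hint: a hit avoids the tree walk entirely */
	blkg = rcu_dereference(blkcg->blkg_hint);
	if (blkg && blkg->q == q)
		return blkg;

	/* 2. radix tree keyed by queue id: a bounded descent instead
	 *    of a scan over every blkg attached to the blkcg */
	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
	if (blkg && blkg->q == q)
		return blkg;

	return NULL;
}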

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   52 ++++++++++++++++++++++++++++++++++++++++++++--------
 block/blk-cgroup.h |    6 ++++++
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 30a7a9c..02cf633 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -142,11 +142,21 @@ static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 				      struct request_queue *q)
 {
 	struct blkcg_gq *blkg;
-	struct hlist_node *n;
 
-	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node)
-		if (blkg->q == q)
-			return blkg;
+	blkg = rcu_dereference(blkcg->blkg_hint);
+	if (blkg && blkg->q == q)
+		return blkg;
+
+	/*
+	 * Hint didn't match.  Look up from the radix tree.  Note that we
+	 * may not be holding queue_lock and thus are not sure whether
+	 * @blkg from blkg_tree has already been removed or not, so we
+	 * can't update hint to the lookup result.  Leave it to the caller.
+	 */
+	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
+	if (blkg && blkg->q == q)
+		return blkg;
+
 	return NULL;
 }
 
@@ -179,9 +189,12 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
+	/* lookup and update hint on success, see __blkg_lookup() for details */
 	blkg = __blkg_lookup(blkcg, q);
-	if (blkg)
+	if (blkg) {
+		rcu_assign_pointer(blkcg->blkg_hint, blkg);
 		return blkg;
+	}
 
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css))
@@ -194,12 +207,24 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 		goto err_put;
 
 	/* insert */
+	ret = radix_tree_preload(GFP_ATOMIC);
+	if (ret)
+		goto err_free;
+
 	spin_lock(&blkcg->lock);
-	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
-	list_add(&blkg->q_node, &q->blkg_list);
+	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
+	if (likely(!ret)) {
+		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+		list_add(&blkg->q_node, &q->blkg_list);
+	}
 	spin_unlock(&blkcg->lock);
-	return blkg;
 
+	radix_tree_preload_end();
+
+	if (!ret)
+		return blkg;
+err_free:
+	blkg_free(blkg);
 err_put:
 	css_put(&blkcg->css);
 	return ERR_PTR(ret);
@@ -229,10 +254,20 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	/* Something wrong if we are trying to remove same group twice */
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
+
+	radix_tree_delete(&blkcg->blkg_tree, blkg->q->id);
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
 	/*
+	 * Both setting lookup hint to and clearing it from @blkg are done
+	 * under queue_lock.  If it's not pointing to @blkg now, it never
+	 * will.  Hint assignment itself can race safely.
+	 */
+	if (rcu_dereference_raw(blkcg->blkg_hint) == blkg)
+		rcu_assign_pointer(blkcg->blkg_hint, NULL);
+
+	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
 	 */
@@ -593,6 +628,7 @@ static struct cgroup_subsys_state *blkcg_create(struct cgroup *cgroup)
 	blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
 done:
 	spin_lock_init(&blkcg->lock);
+	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 
 	return &blkcg->css;
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 44cb908..8ac457c 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -16,6 +16,7 @@
 #include <linux/cgroup.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/seq_file.h>
+#include <linux/radix-tree.h>
 
 /* Max limits for throttle policy */
 #define THROTL_IOPS_MAX		UINT_MAX
@@ -37,9 +38,14 @@ enum blkg_rwstat_type {
 	BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
 };
 
+struct blkcg_gq;
+
 struct blkcg {
 	struct cgroup_subsys_state	css;
 	spinlock_t			lock;
+
+	struct radix_tree_root		blkg_tree;
+	struct blkcg_gq			*blkg_hint;
 	struct hlist_head		blkg_list;
 
 	/* for policies to test whether associated blkcg has changed */
-- 
1.7.7.3


* Re: [PATCHSET] block: fixes for long-standing issues
From: Jens Axboe @ 2012-04-20  8:10 UTC
  To: Tejun Heo; +Cc: vgoyal, ctalbott, rni, cgroups, containers, linux-kernel

On Thu, Apr 19 2012, Tejun Heo wrote:
> Hello,
> 
> This patchset fixes two long-standing issues and one relatively new
> css ref leak.
> 
> a. elvpriv alloc failure, including ioc and icq failures, fails the
>    whole request allocation.  As those allocations aren't mempool
>    backed and may fail indefinitely, this can lead to deadlock under
>    memory pressure (see the mempool sketch below).
> 
> b. blkgs don't have proper indexing.  With a large enough number of
>    request_queues and blk-throttle enabled, the block layer can spend
>    a considerable number of CPU cycles walking the same list over and
>    over again.
> 
> c. __blkg_lookup_create() was leaking a css ref on its failure path.
> 
> This patchset contains the following four patches.
> 
>  0001-block-collapse-blk_alloc_request-into-get_request.patch
>  0002-block-fix-elvpriv-allocation-failure-handling.patch
>  0003-blkcg-fix-blkcg-css-ref-leak-in-__blkg_lookup_create.patch
>  0004-blkcg-use-radix-tree-to-index-blkgs-from-blkcg.patch
> 
> 0001-0002 fix #a.  0003 fixes #c.  0004 fixes #b.
> 
> This patchset is on top of
> 
>   block/for-3.5/core 5bc4afb1ec "blkcg: drop BLKCG_STAT_{PRIV|POL|OFF} macros"
> + [1] [PATCHSET] block: per-queue policy activation, take#2
> + [2] [PATCHSET] block: cosmetic updates to blkcg API

Applied, thanks Tejun.

-- 
Jens Axboe


* Re: [PATCH 4/4] blkcg: use radix tree to index blkgs from blkcg
From: Vivek Goyal @ 2012-04-20 17:26 UTC
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, cgroups, containers, linux-kernel

On Thu, Apr 19, 2012 at 04:29:24PM -0700, Tejun Heo wrote:
> blkg lookup is currently performed by traversing a linked list
> anchored at blkcg->blkg_list.  This doesn't scale at all; with
> blk-throttle enabled and enough request_queues on the system, it can
> get very ugly quickly (blk-throttle performs a lookup on every bio
> submission).
> 
> This patch makes blkcg use a radix tree to index blkgs, combined with
> a simple last-looked-up hint.  This is mostly identical to how icqs
> are indexed from an ioc.
> 
> Note that because __blkg_lookup() may be invoked without holding the
> queue lock, the hint is only updated from __blkg_lookup_create().
> Due to cfq's cfqq caching, this makes hint updates overly lazy.  This
> will be improved with the scheduled blkcg-aware request allocation.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>

Nice to see the radix tree lookup and caching hint.  Helps with scalability.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek
