* [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
@ 2011-12-22 21:45 Tejun Heo
  2011-12-22 21:45 ` [PATCH 1/7] mempool: fix and document synchronization and memory barrier usage Tejun Heo
                   ` (7 more replies)
  0 siblings, 8 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel

Hello, guys.

This patchset implements a mempool for percpu memory and uses it to solve
an allocation deadlock problem in the block cgroup paths.  The percpu
memory allocator can't be called from the memory reclaim path, mostly
because its on-demand chunk filling hooks into vmalloc area management,
which in turn hooks into arch pagetable code.

This usually isn't a problem, but the block cgroup code wants to do
opportunistic percpu allocation to track statistics in the IO issue path.
Currently, it directly calls alloc_percpu() from the IO path, triggering a
lockdep warning and falling into deadlocks under the right conditions.

This patchset adds a percpu mempool which supports async refilling and
uses it as an allocation buffer to solve the above problem.  It solves
the problem for the next merge window, but I'm not sure what to do about
this window and -stable.  The mempool updates seem a bit too invasive.
Maybe implement a stripped down allocation buffer in block-cgroup?
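
Purely for illustration, such a stripped down buffer could be little more
than a small array of pre-allocated percpu pointers which the IO path
consumes and a work item tops up from GFP_KERNEL context.  This is only a
sketch; the names, the size and the refill policy below are made up and
not part of this series:

/* hypothetical NOIO-safe allocation buffer, not part of this patchset */
#define STATS_BUF_NR	8

static void __percpu *stats_buf[STATS_BUF_NR];
static int stats_buf_nr;		/* slots currently filled */
static DEFINE_SPINLOCK(stats_buf_lock);
static void stats_buf_refill_fn(struct work_struct *work);
static DECLARE_WORK(stats_buf_refill_work, stats_buf_refill_fn);

/* runs from a workqueue, ie. GFP_KERNEL context, so alloc_percpu() is ok */
static void stats_buf_refill_fn(struct work_struct *work)
{
        while (true) {
                void __percpu *p = alloc_percpu(struct blkio_group_stats_cpu);
                unsigned long flags;

                if (!p)
                        return;
                spin_lock_irqsave(&stats_buf_lock, flags);
                if (stats_buf_nr < STATS_BUF_NR) {
                        stats_buf[stats_buf_nr++] = p;
                        p = NULL;		/* consumed by the buffer */
                }
                spin_unlock_irqrestore(&stats_buf_lock, flags);
                if (p) {			/* buffer was already full */
                        free_percpu(p);
                        return;
                }
        }
}

/* IO path: dip into the buffer, never call the percpu allocator directly */
static void __percpu *stats_buf_get(void)
{
        void __percpu *p = NULL;
        unsigned long flags;

        spin_lock_irqsave(&stats_buf_lock, flags);
        if (stats_buf_nr)
                p = stats_buf[--stats_buf_nr];
        spin_unlock_irqrestore(&stats_buf_lock, flags);

        schedule_work(&stats_buf_refill_work);	/* async top-up */
        return p;
}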

This patchset contains the following seven patches.

 0001-mempool-fix-and-document-synchronization-and-memory-.patch
 0002-mempool-drop-unnecessary-and-incorrect-BUG_ON-from-m.patch
 0003-mempool-fix-first-round-failure-behavior.patch
 0004-mempool-factor-out-mempool_fill.patch
 0005-mempool-separate-out-__mempool_create.patch
 0006-mempool-percpu-implement-percpu-mempool.patch
 0007-block-fix-deadlock-through-percpu-allocation-in-blk-.patch

0001-0003 are general mempool updates and are already in -mm.  Note that
Oleg has an alternative patch for 0003 which seems better to me.

0004-0006 prepare for and implement percpu mempool.

0007 uses it in block cgroup to fix the deadlock problem.

This patchset is on top of block/for-3.3/core but applies to mainline
too.

If people agree with this approach, it would probably be easier to
route everything through the block tree.  Andrew, Jens, how do you guys
want to route these?

Oleg, if you can send a patch with a proper description and SOB, I'll
replace 0003 with yours.

The patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git block-blkcg-pcpu-fix

and contains the following changes.

 block/blk-cgroup.c      |   37 +++++
 block/blk-cgroup.h      |    2 
 block/blk-throttle.c    |    2 
 block/cfq-iosched.c     |    2 
 include/linux/mempool.h |   80 ++++++++++++
 mm/mempool.c            |  310 +++++++++++++++++++++++++++++++++++++-----------
 6 files changed, 363 insertions(+), 70 deletions(-)

Thanks.

--
tejun


* [PATCH 1/7] mempool: fix and document synchronization and memory barrier usage
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 2/7] mempool: drop unnecessary and incorrect BUG_ON() from mempool_destroy() Tejun Heo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal
  Cc: linux-kernel, Tejun Heo, Linus Torvalds, Paul E. McKenney, David Howells

mempool_alloc/free() use undocumented smp_mb()s.  The code is slightly
broken and misleading.

The lockless part is in mempool_free().  It wants to determine whether
the item being freed needs to be returned to the pool or backing
allocator without grabbing pool->lock.  Two things need to be
guaranteed for correct operation.

1. pool->curr_nr + #allocated should never dip below pool->min_nr.
2. Waiters shouldn't be left dangling.

For #1, the only necessary condition is that the curr_nr visible at free
is from after the allocation of the element being freed (details in the
comment).  For most cases, this is true without any barrier, but there
can be fringe cases where the allocated pointer is passed to the freeing
task without going through memory barriers.  To cover this case, a wmb is
necessary before returning from allocation and an rmb is necessary before
reading curr_nr.  IOW,

	ALLOCATING TASK			FREEING TASK

	update pool state after alloc;
	wmb();
	pass pointer to freeing task;
					read pointer;
					rmb();
					read pool state to free;

The current code doesn't have a wmb after the pool update during
allocation and may theoretically, on machines where unlock doesn't behave
as a full wmb, lead to pool depletion and deadlock.  smp_wmb() needs to
be added after successful allocation from the reserved elements and the
smp_mb() in mempool_free() can be replaced with smp_rmb().

For #2, the waiter needs to add itself to the waitqueue and then check
the wait condition, and the waker needs to update the wait condition and
then wake up.  Because waitqueue operations always go through full
spinlock synchronization, there is no need for extra memory barriers.

Furthermore, mempool_alloc() is already holding pool->lock when it
decides that it needs to wait.  There is no reason to do
unlock - add to waitqueue - test condition again.  It can simply add
itself to the waitqueue while holding pool->lock and then unlock and
sleep.

This patch adds smp_wmb() after successful allocation from the reserved
pool, replaces the smp_mb() in mempool_free() with smp_rmb() and extends
pool->lock over the waitqueue addition.  More importantly, it explains
what the memory barriers do and how the lockless testing is correct.

-v2: Oleg pointed out that unlock doesn't imply wmb.  Added explicit
     smp_wmb() after successful allocation from reserved pool and
     updated comments accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
---
 mm/mempool.c |   61 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index e73641b..11f0d0a 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -224,28 +224,31 @@ repeat_alloc:
 	if (likely(pool->curr_nr)) {
 		element = remove_element(pool);
 		spin_unlock_irqrestore(&pool->lock, flags);
+		/* paired with rmb in mempool_free(), read comment there */
+		smp_wmb();
 		return element;
 	}
-	spin_unlock_irqrestore(&pool->lock, flags);
 
 	/* We must not sleep in the GFP_ATOMIC case */
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!(gfp_mask & __GFP_WAIT)) {
+		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
+	}
 
-	/* Now start performing page reclaim */
+	/* Let's wait for someone else to return an element to @pool */
 	gfp_temp = gfp_mask;
 	init_wait(&wait);
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-	smp_mb();
-	if (!pool->curr_nr) {
-		/*
-		 * FIXME: this should be io_schedule().  The timeout is there
-		 * as a workaround for some DM problems in 2.6.18.
-		 */
-		io_schedule_timeout(5*HZ);
-	}
-	finish_wait(&pool->wait, &wait);
 
+	spin_unlock_irqrestore(&pool->lock, flags);
+
+	/*
+	 * FIXME: this should be io_schedule().  The timeout is there as a
+	 * workaround for some DM problems in 2.6.18.
+	 */
+	io_schedule_timeout(5*HZ);
+
+	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
 }
 EXPORT_SYMBOL(mempool_alloc);
@@ -265,7 +268,39 @@ void mempool_free(void *element, mempool_t *pool)
 	if (unlikely(element == NULL))
 		return;
 
-	smp_mb();
+	/*
+	 * Paired with the wmb in mempool_alloc().  The preceding read is
+	 * for @element and the following @pool->curr_nr.  This ensures
+	 * that the visible value of @pool->curr_nr is from after the
+	 * allocation of @element.  This is necessary for fringe cases
+	 * where @element was passed to this task without going through
+	 * barriers.
+	 *
+	 * For example, assume @p is %NULL at the beginning and one task
+	 * performs "p = mempool_alloc(...);" while another task is doing
+	 * "while (!p) cpu_relax(); mempool_free(p, ...);".  This function
+	 * may end up using curr_nr value which is from before allocation
+	 * of @p without the following rmb.
+	 */
+	smp_rmb();
+
+	/*
+	 * For correctness, we need a test which is guaranteed to trigger
+	 * if curr_nr + #allocated == min_nr.  Testing curr_nr < min_nr
+	 * without locking achieves that and refilling as soon as possible
+	 * is desirable.
+	 *
+	 * Because curr_nr visible here is always a value after the
+	 * allocation of @element, any task which decremented curr_nr below
+	 * min_nr is guaranteed to see curr_nr < min_nr unless curr_nr gets
+	 * incremented to min_nr afterwards.  If curr_nr gets incremented
+	 * to min_nr after the allocation of @element, the elements
+	 * allocated after that are subject to the same guarantee.
+	 *
+	 * Waiters happen iff curr_nr is 0 and the above guarantee also
+	 * ensures that there will be frees which return elements to the
+	 * pool waking up the waiters.
+	 */
 	if (pool->curr_nr < pool->min_nr) {
 		spin_lock_irqsave(&pool->lock, flags);
 		if (pool->curr_nr < pool->min_nr) {
-- 
1.7.3.1



* [PATCH 2/7] mempool: drop unnecessary and incorrect BUG_ON() from mempool_destroy()
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
  2011-12-22 21:45 ` [PATCH 1/7] mempool: fix and document synchronization and memory barrier usage Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 3/7] mempool: fix first round failure behavior Tejun Heo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

mempool_destroy() is a thin wrapper around free_pool().  The only
thing it adds is BUG_ON(pool->curr_nr != pool->min_nr).  The intention
seems to be to enforce that all allocated elements are freed; however,
the BUG_ON() can't achieve that (it doesn't know anything about
objects above min_nr) and is incorrect, as mempool_resize() is allowed
to leave the pool extended but not filled.  Furthermore, panicking is
way worse than any memory leak and there are better debug tools to track
memory leaks.

Drop the BUG_ON() from mempool_destroy() and, as that leaves the
function identical to free_pool(), replace free_pool() with it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/mempool.c |   30 +++++++++++-------------------
 1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 11f0d0a..e3a802a 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -27,7 +27,15 @@ static void *remove_element(mempool_t *pool)
 	return pool->elements[--pool->curr_nr];
 }
 
-static void free_pool(mempool_t *pool)
+/**
+ * mempool_destroy - deallocate a memory pool
+ * @pool:      pointer to the memory pool which was allocated via
+ *             mempool_create().
+ *
+ * Free all reserved elements in @pool and @pool itself.  This function
+ * only sleeps if the free_fn() function sleeps.
+ */
+void mempool_destroy(mempool_t *pool)
 {
 	while (pool->curr_nr) {
 		void *element = remove_element(pool);
@@ -36,6 +44,7 @@ static void free_pool(mempool_t *pool)
 	kfree(pool->elements);
 	kfree(pool);
 }
+EXPORT_SYMBOL(mempool_destroy);
 
 /**
  * mempool_create - create a memory pool
@@ -86,7 +95,7 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 
 		element = pool->alloc(GFP_KERNEL, pool->pool_data);
 		if (unlikely(!element)) {
-			free_pool(pool);
+			mempool_destroy(pool);
 			return NULL;
 		}
 		add_element(pool, element);
@@ -172,23 +181,6 @@ out:
 EXPORT_SYMBOL(mempool_resize);
 
 /**
- * mempool_destroy - deallocate a memory pool
- * @pool:      pointer to the memory pool which was allocated via
- *             mempool_create().
- *
- * this function only sleeps if the free_fn() function sleeps. The caller
- * has to guarantee that all elements have been returned to the pool (ie:
- * freed) prior to calling mempool_destroy().
- */
-void mempool_destroy(mempool_t *pool)
-{
-	/* Check for outstanding elements */
-	BUG_ON(pool->curr_nr != pool->min_nr);
-	free_pool(pool);
-}
-EXPORT_SYMBOL(mempool_destroy);
-
-/**
  * mempool_alloc - allocate an element from a specific memory pool
  * @pool:      pointer to the memory pool which was allocated via
  *             mempool_create().
-- 
1.7.3.1



* [PATCH 3/7] mempool: fix first round failure behavior
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
  2011-12-22 21:45 ` [PATCH 1/7] mempool: fix and document synchronization and memory barrier usage Tejun Heo
  2011-12-22 21:45 ` [PATCH 2/7] mempool: drop unnecessary and incorrect BUG_ON() from mempool_destroy() Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 4/7] mempool: factor out mempool_fill() Tejun Heo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

mempool modifies gfp_mask so that the backing allocator doesn't try too
hard or trigger a warning message when there's a pool to fall back on.
In addition, for the first try, it removes __GFP_WAIT and __GFP_IO, so
that it doesn't trigger reclaim or wait when the allocation can be
fulfilled from the pool; however, when that allocation fails and the
pool is empty too, it waits for the pool to be replenished before
retrying.
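
For reference, the mask handling at the top of mempool_alloc() currently
looks roughly like this (paraphrased from mm/mempool.c of this era, not
part of this patch):

        gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
        gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
        gfp_mask |= __GFP_NOWARN;	/* failures are OK */

        gfp_temp = gfp_mask & ~(__GFP_WAIT | __GFP_IO);	/* first round w/o reclaim */

repeat_alloc:
        element = pool->alloc(gfp_temp, pool->pool_data);
        if (likely(element != NULL))
                return element;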

An allocation which could have succeeded after a bit of reclaim then has
to wait on the reserved items, and it's not like mempool doesn't retry
with __GFP_WAIT and __GFP_IO.  It just does that *after* someone returns
an element, pointlessly delaying things.

Fix it by retrying immediately if the first round of allocation
attempts without __GFP_WAIT and __GFP_IO fails.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/mempool.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index e3a802a..6326cc6 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -221,14 +221,23 @@ repeat_alloc:
 		return element;
 	}
 
-	/* We must not sleep in the GFP_ATOMIC case */
+	/*
+	 * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
+	 * alloc failed with that and @pool was empty, retry immediately.
+	 */
+	if (gfp_temp != gfp_mask) {
+		gfp_temp = gfp_mask;
+		spin_unlock_irqrestore(&pool->lock, flags);
+		goto repeat_alloc;
+	}
+
+	/* We must not sleep if !__GFP_WAIT */
 	if (!(gfp_mask & __GFP_WAIT)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
 	}
 
 	/* Let's wait for someone else to return an element to @pool */
-	gfp_temp = gfp_mask;
 	init_wait(&wait);
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-- 
1.7.3.1



* [PATCH 4/7] mempool: factor out mempool_fill()
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
                   ` (2 preceding siblings ...)
  2011-12-22 21:45 ` [PATCH 3/7] mempool: fix first round failure behavior Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 5/7] mempool: separate out __mempool_create() Tejun Heo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

Factor out mempool_fill() from mempool_create() and mempool_resize().
When called, it fills the reservoir up to the configured min_nr
elements.

After the change, the gotos in mempool_resize()'s exit paths don't make
a whole lot of sense; unlock and return directly instead.

Note that mempool_create() repeatedly locks and unlocks pool->lock
while filling the pool initially, and mempool_resize() may go through
one extra locking cycle.  This shouldn't result in a noticeable
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempool.c |   75 ++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 6326cc6..4747283 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -47,6 +47,43 @@ void mempool_destroy(mempool_t *pool)
 EXPORT_SYMBOL(mempool_destroy);
 
 /**
+ * mempool_fill - fill a memory pool
+ * @pool:	memory pool to fill
+ * @gfp_mask:	allocation mask to use
+ *
+ * Allocate new elements with @gfp_mask and fill @pool so that it has
+ * @pool->min_nr elements.  Returns 0 on success, -errno on failure.
+ */
+static int mempool_fill(mempool_t *pool, gfp_t gfp_mask)
+{
+	/*
+	 * If curr_nr == min_nr is visible, we're correct regardless of
+	 * locking.
+	 */
+	while (pool->curr_nr < pool->min_nr) {
+		void *elem;
+		unsigned long flags;
+
+		elem = pool->alloc(gfp_mask, pool->pool_data);
+		if (unlikely(!elem))
+			return -ENOMEM;
+
+		spin_lock_irqsave(&pool->lock, flags);
+		if (pool->curr_nr < pool->min_nr) {
+			add_element(pool, elem);
+			elem = NULL;
+		}
+		spin_unlock_irqrestore(&pool->lock, flags);
+
+		if (elem) {
+			pool->free(elem, pool->pool_data);
+			return 0;
+		}
+	}
+	return 0;
+}
+
+/**
  * mempool_create - create a memory pool
  * @min_nr:    the minimum number of elements guaranteed to be
  *             allocated for this pool.
@@ -90,16 +127,11 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 	/*
 	 * First pre-allocate the guaranteed number of buffers.
 	 */
-	while (pool->curr_nr < pool->min_nr) {
-		void *element;
-
-		element = pool->alloc(GFP_KERNEL, pool->pool_data);
-		if (unlikely(!element)) {
-			mempool_destroy(pool);
-			return NULL;
-		}
-		add_element(pool, element);
+	if (mempool_fill(pool, GFP_KERNEL)) {
+		mempool_destroy(pool);
+		return NULL;
 	}
+
 	return pool;
 }
 EXPORT_SYMBOL(mempool_create_node);
@@ -137,7 +169,8 @@ int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
 			spin_lock_irqsave(&pool->lock, flags);
 		}
 		pool->min_nr = new_min_nr;
-		goto out_unlock;
+		spin_unlock_irqrestore(&pool->lock, flags);
+		return 0;
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
@@ -151,31 +184,17 @@ int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
 		/* Raced, other resize will do our work */
 		spin_unlock_irqrestore(&pool->lock, flags);
 		kfree(new_elements);
-		goto out;
+		return 0;
 	}
 	memcpy(new_elements, pool->elements,
 			pool->curr_nr * sizeof(*new_elements));
 	kfree(pool->elements);
 	pool->elements = new_elements;
 	pool->min_nr = new_min_nr;
-
-	while (pool->curr_nr < pool->min_nr) {
-		spin_unlock_irqrestore(&pool->lock, flags);
-		element = pool->alloc(gfp_mask, pool->pool_data);
-		if (!element)
-			goto out;
-		spin_lock_irqsave(&pool->lock, flags);
-		if (pool->curr_nr < pool->min_nr) {
-			add_element(pool, element);
-		} else {
-			spin_unlock_irqrestore(&pool->lock, flags);
-			pool->free(element, pool->pool_data);	/* Raced */
-			goto out;
-		}
-	}
-out_unlock:
 	spin_unlock_irqrestore(&pool->lock, flags);
-out:
+
+	mempool_fill(pool, gfp_mask);
+
 	return 0;
 }
 EXPORT_SYMBOL(mempool_resize);
-- 
1.7.3.1



* [PATCH 5/7] mempool: separate out __mempool_create()
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
                   ` (3 preceding siblings ...)
  2011-12-22 21:45 ` [PATCH 4/7] mempool: factor out mempool_fill() Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 6/7] mempool, percpu: implement percpu mempool Tejun Heo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

Separate out __mempool_create(), which handles creation of the mempool
without filling it, from mempool_create_node().  This will be used to
implement percpu_mempool.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempool.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 4747283..85e2c28 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -104,11 +104,13 @@ mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 }
 EXPORT_SYMBOL(mempool_create);
 
-mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
-			mempool_free_t *free_fn, void *pool_data, int node_id)
+static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
+				   mempool_free_t *free_fn, void *pool_data,
+				   int node_id, size_t mempool_bytes)
 {
 	mempool_t *pool;
-	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL | __GFP_ZERO, node_id);
+
+	pool = kmalloc_node(mempool_bytes, GFP_KERNEL | __GFP_ZERO, node_id);
 	if (!pool)
 		return NULL;
 	pool->elements = kmalloc_node(min_nr * sizeof(void *),
@@ -124,9 +126,19 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 	pool->alloc = alloc_fn;
 	pool->free = free_fn;
 
-	/*
-	 * First pre-allocate the guaranteed number of buffers.
-	 */
+	return pool;
+}
+
+mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+			       mempool_free_t *free_fn, void *pool_data,
+			       int node_id)
+{
+	mempool_t *pool;
+
+	pool = __mempool_create(min_nr, alloc_fn, free_fn, pool_data, node_id,
+				sizeof(*pool));
+
+	/* Pre-allocate the guaranteed number of buffers */
 	if (mempool_fill(pool, GFP_KERNEL)) {
 		mempool_destroy(pool);
 		return NULL;
-- 
1.7.3.1



* [PATCH 6/7] mempool, percpu: implement percpu mempool
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
                   ` (4 preceding siblings ...)
  2011-12-22 21:45 ` [PATCH 5/7] mempool: separate out __mempool_create() Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-22 21:45 ` [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
  2011-12-22 21:59 ` [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Andrew Morton
  7 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

This patch implements a mempool for percpu memory areas.  Percpu mempool
is mostly identical to regular mempool and shares most of the code, but
it has some peculiarities.

The percpu memory allocator requires %GFP_KERNEL during allocation,
which comes from its on-demand nature and vmalloc area usage.  In most
cases, it's not a good idea to allocate percpu memory from a more
constricted context, and this doesn't cause a problem; however, there
are rare cases where opportunistic allocation from a NOIO path makes
sense.

To ease such use cases, percpu mempool comes with a refill mechanism
which can behave either synchronously or asynchronously depending on the
specified gfp mask.
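
As a rough usage sketch (the consumer, its "struct foo_stats" and the
thresholds below are hypothetical; the functions are the ones added by
this patch):

/* set up at init time from GFP_KERNEL context */
static struct percpu_mempool *foo_stats_pool;

static int __init foo_init(void)
{
        foo_stats_pool = percpu_mempool_create(8, sizeof(struct foo_stats),
                                               __alignof__(struct foo_stats));
        return foo_stats_pool ? 0 : -ENOMEM;
}

/* called from a NOIO path */
static void __percpu *foo_get_stats(void)
{
        /* keep the reserve topped up; w/o __GFP_IO this queues a work item */
        if (percpu_mempool_nr_elems(foo_stats_pool) <= 4)
                percpu_mempool_refill(foo_stats_pool, GFP_NOWAIT);

        /* dips into the reserved elements, never into the percpu allocator */
        return percpu_mempool_alloc(foo_stats_pool, GFP_NOWAIT);
}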

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 include/linux/mempool.h |   80 ++++++++++++++++++++++++++++++++++
 mm/mempool.c            |  111 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+), 0 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 7c08052..129acbe 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,7 @@
 #define _LINUX_MEMPOOL_H
 
 #include <linux/wait.h>
+#include <linux/workqueue.h>
 
 struct kmem_cache;
 
@@ -70,4 +71,83 @@ static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
 			      (void *)(long)order);
 }
 
+/*
+ * Percpu mempool - mempool backed by percpu memory allocator.
+ *
+ * Along with the usual mempool role, because percpu allocator doesn't
+ * support NOIO allocations, percpu mempool is useful as allocation buffer
+ * which is filled from IO context and consumed from atomic or non-IO one.
+ * To help this usage, percpu_mempool has built-in mechanism to refill the
+ * pool which supports both sync and async operations.  Refer to
+ * percpu_mempool_refill() for details.
+ */
+struct percpu_mempool {
+	mempool_t		pool;
+	size_t			size;		/* size of elements */
+	size_t			align;		/* align of elements */
+	struct work_struct	refill_work;	/* work item for async refill */
+};
+
+struct percpu_mempool *percpu_mempool_create(int min_nr, size_t size,
+					     size_t align);
+int percpu_mempool_refill(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask);
+void percpu_mempool_destroy(struct percpu_mempool *pcpu_pool);
+
+/**
+ * percpu_mempool_resize - resize an existing percpu mempool
+ * @pcpu_pool:	percpu mempool to resize
+ * @new_min_nr:	new minimum number of elements guaranteed to be allocated
+ * @gfp_mask:	allocation mask to use
+ *
+ * Counterpart of mempool_resize().  If @gfp_mask doesn't contain
+ * %__GFP_IO, resizing itself may succeed but the implied filling (if
+ * necessary) will fail.
+ */
+static inline int percpu_mempool_resize(struct percpu_mempool *pcpu_pool,
+					int new_min_nr, gfp_t gfp_mask)
+{
+	return mempool_resize(&pcpu_pool->pool, new_min_nr, gfp_mask);
+}
+
+/**
+ * percpu_mempool_alloc - allocate an element from a percpu mempool
+ * @pcpu_pool:	percpu mempool to allocate from
+ * @gfp_mask:	allocation mask to use
+ *
+ * Counterpart of mempool_alloc().  If @gfp_mask doesn't contain %__GFP_IO,
+ * allocation is always from the reserved pool.
+ */
+static inline void __percpu *
+percpu_mempool_alloc(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask)
+{
+	void *p = mempool_alloc(&pcpu_pool->pool, gfp_mask);
+
+	return (void __percpu __force *)p;
+}
+
+/**
+ * percpu_mempool_free - free an element to a percpu mempool
+ * @elem:	element being freed
+ * @pcpu_pool:	percpu mempool to free to
+ */
+static inline void percpu_mempool_free(void __percpu *elem,
+				       struct percpu_mempool *pcpu_pool)
+{
+	void *p = (void __kernel __force *)elem;
+
+	mempool_free(p, &pcpu_pool->pool);
+}
+
+/**
+ * percpu_mempool_nr_elems - return nr of reserved elems in a percpu mempool
+ * @pcpu_pool:	percpu mempool of interest
+ *
+ * Returns the number of reserved elements in @pcpu_pool.  Mostly useful
+ * for deciding when to refill.
+ */
+static inline int percpu_mempool_nr_elems(struct percpu_mempool *pcpu_pool)
+{
+	return pcpu_pool->pool.curr_nr;
+}
+
 #endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/mempool.c b/mm/mempool.c
index 85e2c28..f25f731 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include <linux/percpu.h>
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -398,3 +399,113 @@ void mempool_free_pages(void *element, void *pool_data)
 	__free_pages(element, order);
 }
 EXPORT_SYMBOL(mempool_free_pages);
+
+/*
+ * Mempool for percpu memory.
+ */
+static void *percpu_mempool_alloc_fn(gfp_t gfp_mask, void *data)
+{
+	struct percpu_mempool *pcpu_pool = data;
+	void __percpu *p;
+
+	/*
+	 * Percpu allocator doesn't do NOIO.  This makes percpu mempool
+	 * always try reserved elements first, which isn't such a bad idea
+	 * given that percpu allocator is pretty heavy and percpu areas are
+	 * expensive.
+	 */
+	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return NULL;
+
+	p = __alloc_percpu(pcpu_pool->size, pcpu_pool->align);
+	return (void __kernel __force *)p;
+}
+
+static void percpu_mempool_free_fn(void *elem, void *data)
+{
+	void __percpu *p = (void __percpu __force *)elem;
+
+	free_percpu(p);
+}
+
+static void percpu_mempool_refill_workfn(struct work_struct *work)
+{
+	struct percpu_mempool *pcpu_pool =
+		container_of(work, struct percpu_mempool, refill_work);
+
+	percpu_mempool_refill(pcpu_pool, GFP_KERNEL);
+}
+
+/**
+ * percpu_mempool_create - create mempool for percpu memory
+ * @min_nr:	the minimum number of elements guaranteed to be
+ *		allocated for this pool.
+ * @size:	size of percpu memory areas in this pool
+ * @align:	alignment of percpu memory areas in this pool
+ *
+ * This is counterpart of mempool_create() for percpu memory areas.
+ * Allocations from the pool will return @size bytes percpu memory areas
+ * aligned at @align bytes.
+ */
+struct percpu_mempool *percpu_mempool_create(int min_nr, size_t size,
+					     size_t align)
+{
+	struct percpu_mempool *pcpu_pool;
+	mempool_t *pool;
+
+	BUILD_BUG_ON(offsetof(struct percpu_mempool, pool));
+
+	pool = __mempool_create(min_nr, percpu_mempool_alloc_fn,
+				percpu_mempool_free_fn, NULL, NUMA_NO_NODE,
+				sizeof(*pcpu_pool));
+	if (!pool)
+		return NULL;
+
+	/* fill in pcpu_pool part and set pool_data to self */
+	pcpu_pool = container_of(pool, struct percpu_mempool, pool);
+	pcpu_pool->size = size;
+	pcpu_pool->align = align;
+	INIT_WORK(&pcpu_pool->refill_work, percpu_mempool_refill_workfn);
+	pcpu_pool->pool.pool_data = pcpu_pool;
+
+	/* Pre-allocate the guaranteed number of buffers */
+	if (mempool_fill(&pcpu_pool->pool, GFP_KERNEL)) {
+		mempool_destroy(&pcpu_pool->pool);
+		return NULL;
+	}
+
+	return pcpu_pool;
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_create);
+
+/**
+ * percpu_mempool_refill - refill a percpu mempool
+ * @pcpu_pool:	percpu mempool to refill
+ * @gfp_mask:	allocation mask to use
+ *
+ * Refill @pcpu_pool upto the configured min_nr using @gfp_mask.
+ *
+ * Percpu memory allocation depends on %GFP_KERNEL.  If @gfp_mask doesn't
+ * contain it, this function will schedule a work item to refill the pool
+ * and return -%EAGAIN indicating refilling is in progress.
+ */
+int percpu_mempool_refill(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask)
+{
+	if ((gfp_mask & GFP_KERNEL) == GFP_KERNEL)
+		return mempool_fill(&pcpu_pool->pool, gfp_mask);
+
+	schedule_work(&pcpu_pool->refill_work);
+	return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_refill);
+
+/**
+ * percpu_mempool_destroy - destroy a percpu mempool
+ * @pcpu_pool:	percpu mempool to destroy
+ */
+void percpu_mempool_destroy(struct percpu_mempool *pcpu_pool)
+{
+	cancel_work_sync(&pcpu_pool->refill_work);
+	mempool_destroy(&pcpu_pool->pool);
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_destroy);
-- 
1.7.3.1



* [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
                   ` (5 preceding siblings ...)
  2011-12-22 21:45 ` [PATCH 6/7] mempool, percpu: implement percpu mempool Tejun Heo
@ 2011-12-22 21:45 ` Tejun Heo
  2011-12-23  1:00   ` Vivek Goyal
  2011-12-22 21:59 ` [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Andrew Morton
  7 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 21:45 UTC (permalink / raw)
  To: akpm, avi, nate, cl, oleg, axboe, vgoyal; +Cc: linux-kernel, Tejun Heo

The percpu allocator depends on %GFP_KERNEL allocation and can't be used
for NOIO or NOFS allocations; unfortunately, blk-cgroup allocates its
percpu stats structure in the IO path, which triggers the following
lockdep warning and also leads to actual deadlocks under certain
conditions.

 inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
 qemu-kvm/8774 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff8110c8be>] pcpu_alloc+0x6f/0x835
 {RECLAIM_FS-ON-W} state was registered at:
   [<ffffffff8108f85a>] mark_held_locks+0x6d/0x95
   [<ffffffff8108fe69>] lockdep_trace_alloc+0x9f/0xc2
   [<ffffffff8112d781>] slab_pre_alloc_hook+0x1e/0x54
   [<ffffffff8112fcf9>] __kmalloc+0x64/0x10c
   [<ffffffff8110bec8>] pcpu_mem_alloc+0x5e/0x67
   [<ffffffff8110c169>] pcpu_extend_area_map+0x2b/0xd4
   [<ffffffff8110ca0d>] pcpu_alloc+0x1be/0x835
   [<ffffffff8110d094>] __alloc_percpu+0x10/0x12
   [<ffffffff8113167a>] kmem_cache_open+0x2cc/0x2d6
   [<ffffffff8113185d>] kmem_cache_create+0x1d9/0x281
   [<ffffffff812971a5>] acpi_os_create_cache+0x1d/0x2d
   [<ffffffff812bf742>] acpi_ut_create_caches+0x26/0xb0
   [<ffffffff812c2106>] acpi_ut_init_globals+0xe/0x244
   [<ffffffff81d82ca9>] acpi_initialize_subsystem+0x35/0xae
   [<ffffffff81d819c2>] acpi_early_init+0x5c/0xf7
   [<ffffffff81d51be9>] start_kernel+0x3ce/0x3ea
   [<ffffffff81d512c4>] x86_64_start_reservations+0xaf/0xb3
   [<ffffffff81d51412>] x86_64_start_kernel+0x14a/0x159
 irq event stamp: 48683649
 hardirqs last  enabled at (48683649): [<ffffffff814fd547>] __slab_alloc+0x413/0x434
 hardirqs last disabled at (48683648): [<ffffffff814fd17c>] __slab_alloc+0x48/0x434
 softirqs last  enabled at (48683630): [<ffffffff810630bf>] __do_softirq+0x200/0x25a
 softirqs last disabled at (48683625): [<ffffffff8150debc>] call_softirq+0x1c/0x30

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(pcpu_alloc_mutex);
   <Interrupt>
     lock(pcpu_alloc_mutex);

  *** DEADLOCK ***

 3 locks held by qemu-kvm/8774:
  #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffa005d979>] vcpu_load+0x1d/0x90 [kvm]
  #1:  (&kvm->srcu){.+.+.+}, at: [<ffffffffa00663f7>] srcu_read_lock+0x0/0x49 [kvm]
  #2:  (&mm->mmap_sem){++++++}, at: [<ffffffffa005ee6f>] hva_to_pfn+0xbb/0x2d1 [kvm]

 stack backtrace:
 Pid: 8774, comm: qemu-kvm Not tainted 3.1.4 #2
 Call Trace:
  [<ffffffff814fa906>] print_usage_bug+0x1e7/0x1f8
  [<ffffffff8108e232>] mark_lock+0x106/0x220
  [<ffffffff8108e6ee>] __lock_acquire+0x3a2/0xd0c
  [<ffffffff8108f55b>] lock_acquire+0xf3/0x13e
  [<ffffffff815030ad>] __mutex_lock_common+0x5d/0x39a
  [<ffffffff815034f9>] mutex_lock_nested+0x40/0x45
  [<ffffffff8110c8be>] pcpu_alloc+0x6f/0x835
  [<ffffffff8110d094>] __alloc_percpu+0x10/0x12
  [<ffffffff8123d64c>] blkio_alloc_blkg_stats+0x1d/0x31
  [<ffffffff8123ed87>] throtl_alloc_tg+0x3a/0xdf
  [<ffffffff8123f79f>] blk_throtl_bio+0x154/0x39c
  [<ffffffff81231c52>] generic_make_request+0x2e4/0x415
  [<ffffffff81231e61>] submit_bio+0xde/0xfd
  [<ffffffff8111e15d>] swap_writepage+0x94/0x9f
  [<ffffffff811041d1>] shmem_writepage+0x192/0x1d8
  [<ffffffff8110167b>] shrink_page_list+0x402/0x79a
  [<ffffffff81101e1a>] shrink_inactive_list+0x22c/0x3df
  [<ffffffff811026b2>] shrink_zone+0x43f/0x582
  [<ffffffff81102b5e>] do_try_to_free_pages+0x107/0x318
  [<ffffffff811030a6>] try_to_free_pages+0xd5/0x175
  [<ffffffff810f9021>] __alloc_pages_nodemask+0x535/0x7eb
  [<ffffffff81127f02>] alloc_pages_vma+0xf5/0xfa
  [<ffffffff81136bfb>] do_huge_pmd_anonymous_page+0xb3/0x274
  [<ffffffff811112f2>] handle_mm_fault+0xfd/0x1b8
  [<ffffffff8111173c>] __get_user_pages+0x2fa/0x465
  [<ffffffffa005edb2>] get_user_page_nowait+0x37/0x39 [kvm]
  [<ffffffffa005ee8a>] hva_to_pfn+0xd6/0x2d1 [kvm]
  [<ffffffffa005f108>] __gfn_to_pfn+0x83/0x8c [kvm]
  [<ffffffffa005f1c9>] gfn_to_pfn_async+0x1a/0x1c [kvm]
  [<ffffffffa007799e>] try_async_pf+0x3f/0x248 [kvm]
  [<ffffffffa0079140>] tdp_page_fault+0xe8/0x1a5 [kvm]
  [<ffffffffa0077c83>] kvm_mmu_page_fault+0x2b/0x83 [kvm]
  [<ffffffffa00cda71>] handle_ept_violation+0xe1/0xea [kvm_intel]
  [<ffffffffa00d2129>] vmx_handle_exit+0x5ee/0x638 [kvm_intel]
  [<ffffffffa007153f>] kvm_arch_vcpu_ioctl_run+0xb30/0xe00 [kvm]
  [<ffffffffa005db4f>] kvm_vcpu_ioctl+0x120/0x59c [kvm]
  [<ffffffff811502a5>] do_vfs_ioctl+0x472/0x4b3
  [<ffffffff8115033c>] sys_ioctl+0x56/0x7a
  [<ffffffff8150bbc2>] system_call_fastpath+0x16/0x1b

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.comp.emulators.kvm.devel/82755

This patch fixes the bug by using a percpu_mempool to allocate the blkg
percpu stats.  As the allocations are only per cgroup - request_queue
pair, they aren't expected to be frequent.  Use a smallish pool of 8
elements which is refilled once half of it is consumed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Avi Kivity <avi@redhat.com>
Reported-by: Nate Custer <nate@cpanel.net>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |   37 +++++++++++++++++++++++++++++++++++--
 block/blk-cgroup.h   |    2 ++
 block/blk-throttle.c |    2 +-
 block/cfq-iosched.c  |    2 +-
 4 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2788693..43bf5e8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -17,17 +17,33 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
+#include <linux/mempool.h>
 #include "blk-cgroup.h"
 #include <linux/genhd.h>
 
 #define MAX_KEY_LEN 100
 
+/*
+ * blkg_stats_cpu_pool parameters.  These allocations aren't frequent, can
+ * be opportunistic and percpu memory is expensive.
+ */
+#define BLKG_STATS_CPU_POOL_SIZE	8
+#define BLKG_STATS_CPU_POOL_REFILL	4
+
 static DEFINE_SPINLOCK(blkio_list_lock);
 static LIST_HEAD(blkio_list);
 
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
+/*
+ * Percpu mempool for blkgio_group_stats_cpu which are allocated on-demand
+ * on IO path.  As percpu doesn't support NOIO allocations, we need to
+ * buffer them through mempool.
+ */
+struct percpu_mempool *blkg_stats_cpu_pool;
+EXPORT_SYMBOL_GPL(blkg_stats_cpu_pool);
+
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
 static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
@@ -469,8 +485,13 @@ EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
  */
 int blkio_alloc_blkg_stats(struct blkio_group *blkg)
 {
+	/* Schedule refill if necessary */
+	if (percpu_mempool_nr_elems(blkg_stats_cpu_pool) <=
+	    BLKG_STATS_CPU_POOL_REFILL)
+		percpu_mempool_refill(blkg_stats_cpu_pool, GFP_NOWAIT);
+
 	/* Allocate memory for per cpu stats */
-	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
+	blkg->stats_cpu = percpu_mempool_alloc(blkg_stats_cpu_pool, GFP_NOWAIT);
 	if (!blkg->stats_cpu)
 		return -ENOMEM;
 	return 0;
@@ -1671,12 +1692,24 @@ EXPORT_SYMBOL_GPL(blkio_policy_unregister);
 
 static int __init init_cgroup_blkio(void)
 {
-	return cgroup_load_subsys(&blkio_subsys);
+	int ret;
+
+	blkg_stats_cpu_pool = percpu_mempool_create(BLKG_STATS_CPU_POOL_SIZE,
+				sizeof(struct blkio_group_stats_cpu),
+				__alignof__(struct blkio_group_stats_cpu));
+	if (!blkg_stats_cpu_pool)
+		return -ENOMEM;
+
+	ret = cgroup_load_subsys(&blkio_subsys);
+	if (ret)
+		percpu_mempool_destroy(blkg_stats_cpu_pool);
+	return ret;
 }
 
 static void __exit exit_cgroup_blkio(void)
 {
 	cgroup_unload_subsys(&blkio_subsys);
+	percpu_mempool_destroy(blkg_stats_cpu_pool);
 }
 
 module_init(init_cgroup_blkio);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 6f3ace7..7659564 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -32,6 +32,8 @@ extern struct cgroup_subsys blkio_subsys;
 #define blkio_subsys_id blkio_subsys.subsys_id
 #endif
 
+extern struct percpu_mempool *blkg_stats_cpu_pool;
+
 enum stat_type {
 	/* Total time spent (in ns) between request dispatch to the driver and
 	 * request completion for IOs doen by this cgroup. This may not be
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5eed6a7..0f0805e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -159,7 +159,7 @@ static void throtl_free_tg(struct rcu_head *head)
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
-	free_percpu(tg->blkg.stats_cpu);
+	percpu_mempool_free(tg->blkg.stats_cpu, blkg_stats_cpu_pool);
 	kfree(tg);
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 163263d..44d374a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1204,7 +1204,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
-	free_percpu(cfqg->blkg.stats_cpu);
+	percpu_mempool_free(cfqg->blkg.stats_cpu, blkg_stats_cpu_pool);
 	kfree(cfqg);
 }
 
-- 
1.7.3.1



* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
                   ` (6 preceding siblings ...)
  2011-12-22 21:45 ` [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
@ 2011-12-22 21:59 ` Andrew Morton
  2011-12-22 22:09   ` Tejun Heo
  7 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-22 21:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 13:45:19 -0800
Tejun Heo <tj@kernel.org> wrote:

> This patchset implements a mempool for percpu memory and uses it to solve
> an allocation deadlock problem in the block cgroup paths.  The percpu
> memory allocator can't be called from the memory reclaim path, mostly
> because its on-demand chunk filling hooks into vmalloc area management,
> which in turn hooks into arch pagetable code.
> 
> This usually isn't a problem, but the block cgroup code wants to do
> opportunistic percpu allocation to track statistics in the IO issue path.
> Currently, it directly calls alloc_percpu() from the IO path, triggering a
> lockdep warning and falling into deadlocks under the right conditions.
> 
> This patchset adds a percpu mempool which supports async refilling and
> uses it as an allocation buffer to solve the above problem.  It solves
> the problem for the next merge window, but I'm not sure what to do about
> this window and -stable.  The mempool updates seem a bit too invasive.
> Maybe implement a stripped down allocation buffer in block-cgroup?

This is taking the kernel in exactly the wrong direction.  More code,
more complexity.  Let's think of an alternative which takes us in the
right direction.


How about we just delete those statistics and then this patchset?

Or how about we change those statistics to not do percpu allocations,
then delete this patchset?

Or how about we fix the percpu memory allocation code so that it
propagates the gfp flags, then delete this patchset?


All of those things would simplify and/or improve the kernel.  Whereas
this patchset complicates the kernel while working around known
shortcomings.




* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 21:59 ` [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Andrew Morton
@ 2011-12-22 22:09   ` Tejun Heo
  2011-12-22 22:20     ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 22:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

Hello,

On Thu, Dec 22, 2011 at 01:59:25PM -0800, Andrew Morton wrote:
> How about we just delete those statistics and then this patchset?
> 
> Or how about we change those statistics to not do percpu allocations,
> then delete this patchset?

I'm not against either of the above, but apparently those percpu stats
reduced CPU overhead significantly.

> Or how about we fix the percpu memory allocation code so that it
> propagates the gfp flags, then delete this patchset?

Oh, no, this is gonna make things *way* more complex.  I tried.  If
we're gonna have many more NOIO percpu users, which I don't think we
would or should, that might make sense but, for fringe cases,
extending mempool to cover percpu is a much better sized solution.

Thanks.

-- 
tejun


* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 22:09   ` Tejun Heo
@ 2011-12-22 22:20     ` Andrew Morton
  2011-12-22 22:41       ` Tejun Heo
  2011-12-23  1:40       ` Vivek Goyal
  0 siblings, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2011-12-22 22:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 14:09:11 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, Dec 22, 2011 at 01:59:25PM -0800, Andrew Morton wrote:
> > How about we just delete those statistics and then this patchset?
> > 
> > Or how about we change those statistics to not do percpu allocations,
> > then delete this patchset?
> 
> I'm not against either of the above,

Don't just consider my suggestions - please try to come up with your own
alternatives too!  If all else fails then this patch is a last resort.

> but apparently those percpu stats reduced
> CPU overhead significantly.

Deleting them would save even more CPU.

Or make them runtime or compile-time configurable, so only the
developers see the impact.

Some specifics on which counters are causing the problems would help here.

> > Or how about we fix the percpu memory allocation code so that it
> > propagates the gfp flags, then delete this patchset?
> 
> Oh, no, this is gonna make things *way* more complex.  I tried.

But there's a difference between fixing a problem and working around it.

>  If
> we're gonna have many more NOIO percpu users, which I don't think we
> would or should, that might make sense but, for fringe cases,
> extending mempool to cover percpu is a much better sized solution.

I've long felt that we goofed with the gfp_flags thing and that it
should be a field in the task_struct.  Now *that* would be a large
patch!




* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 22:20     ` Andrew Morton
@ 2011-12-22 22:41       ` Tejun Heo
  2011-12-22 22:54         ` Andrew Morton
  2011-12-23  1:40       ` Vivek Goyal
  1 sibling, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 22:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

Hello, Andrew.

On Thu, Dec 22, 2011 at 02:20:58PM -0800, Andrew Morton wrote:
> Don't just consider my suggestions - please try to come up with your own
> alternatives too!  If all else fails then this patch is a last resort.

Umm... this is my alternative.

> > but apparently those percpu stats reduced
> > CPU overhead significantly.
> 
> Deleting them would save even more CPU.
> 
> Or make them runtime or compile-time configurable, so only the
> developers see the impact.
> 
> Some specifics on which counters are causing the problems would help here.

These stats are userland visible and quite useful ones if blkcg is in
use.  I don't really see how these can be removed.

> > > Or how about we fix the percpu memory allocation code so that it
> > > propagates the gfp flags, then delete this patchset?
> > 
> > Oh, no, this is gonna make things *way* more complex.  I tried.
> 
> But there's a difference between fixing a problem and working around it.

Yeah, that was my first direction too.  The reason why percpu can't do
NOIO is the same one why vmalloc can't do it.  It reaches pretty deep
into page table code and I don't think doing all that churning is
worthwhile or even desirable.  An alternative approach would be
implementing transparent front buffer to percpu allocator, which I
*might* do if there really are more of these users, but I think
keeping percpu allocator painful to use from reclaim context isn't
such a bad idea.

There have been multiple requests for atomic allocation and they all
have been successfully pushed back, but IMHO this is a valid one and I
don't see a better way around the problem, so while I agree using
mempool for this is a workaround, I think it is a right choice, for
now, anyway.

> > If we're gonna have many more NOIO percpu users, which I don't
> > think we would or should, that might make sense but, for fringe
> > cases, extending mempool to cover percpu is a much better sized
> > solution.
> 
> I've long felt that we goofed with the gfp_flags thing and that it
> should be a field in the task_struct.  Now *that* would be a large
> patch!

Yeah, some of PF_* flags already carry related role information.  I'm
not too sure how much pushing the whole thing into task_struct would
change tho.  We would need push/popping.  It could be simpler in some
cases but in essence wouldn't we have just relocated the position of
parameter?

Thanks.

-- 
tejun


* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 22:41       ` Tejun Heo
@ 2011-12-22 22:54         ` Andrew Morton
  2011-12-22 23:00           ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-22 22:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 14:41:17 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello, Andrew.
> 
> On Thu, Dec 22, 2011 at 02:20:58PM -0800, Andrew Morton wrote:
> > Don't just consider my suggestions - please try to come up with your own
> > alternatives too!  If all else fails then this patch is a last resort.
> 
> Umm... this is my alternative.

We're beyond the point where any additional kernel complexity should
be considered a regression.

> > > but apparently those percpu stats reduced
> > > CPU overhead significantly.
> > 
> > Deleting them would save even more CPU.
> > 
> > Or make them runtime or compile-time configurable, so only the
> > developers see the impact.
> > 
> > Some specifics on which counters are causing the problems would help here.
> 
> These stats are userland visible and quite useful ones if blkcg is in
> use.  I don't really see how these can be removed.

What stats?

And why are we doing percpu *allocation* so deep in the code?  You mean
we're *creating* stats counters on an IO path?  Sounds odd.  Where is
this code?

> > > > Or how about we fix the percpu memory allocation code so that it
> > > > propagates the gfp flags, then delete this patchset?
> > > 
> > > Oh, no, this is gonna make things *way* more complex.  I tried.
> > 
> > But there's a difference between fixing a problem and working around it.
> 
> Yeah, that was my first direction too.  The reason why percpu can't do
> NOIO is the same one why vmalloc can't do it.  It reaches pretty deep
> into page table code and I don't think doing all that churning is
> worthwhile or even desirable.  An alternative approach would be
> implementing transparent front buffer to percpu allocator, which I
> *might* do if there really are more of these users, but I think
> keeping percpu allocator painful to use from reclaim context isn't
> such a bad idea.
> 
> There have been multiple requests for atomic allocation and they all
> have been successfully pushed back, but IMHO this is a valid one and I
> don't see a better way around the problem, so while I agree using
> mempool for this is a workaround, I think it is a right choice, for
> now, anyway.

For starters, doing pagetable allocation on the I/O path sounds nutty.

Secondly, GFP_NOIO is a *weaker* allocation mode than GFP_KERNEL.  By
permitting it with this patchset, we have a kernel which is more likely
to get oom failures.  Fixing the kernel to not perform GFP_NOIO
allocations for these counters will result in a more robust kernel. 
This is a good thing, which improves the kernel while avoiding adding
more complexity elsewhere.

This patchset is the worst option and we should try much harder to avoid
applying it!

> > > If we're gonna have many more NOIO percpu users, which I don't
> > > think we would or should, that might make sense but, for fringe
> > > cases, extending mempool to cover percpu is a much better sized
> > > solution.
> > 
> > I've long felt that we goofed with the gfp_flags thing and that it
> > should be a field in the task_struct.  Now *that* would be a large
> > patch!
> 
> Yeah, some of PF_* flags already carry related role information.  I'm
> not too sure how much pushing the whole thing into task_struct would
> change tho.  We would need push/popping.  It could be simpler in some
> cases but in essence wouldn't we have just relocated the position of
> parameter?

The code would get considerably simpler.  The big benefit comes when
you have deep call stacks - we're presently passing a gfp_t down five
layers of function call while none of the intermediate functions even
use the thing - they just pass it on to the next guy.  Pass it via the
task_struct and all that goes away.  It would make maintenance a lot
easier - at present if you want to add a new kmalloc() to a leaf
function you need to edit all five layers of caller functions.
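
Purely as a hypothetical sketch (neither the field nor the helpers nor
the callers below exist anywhere), the task_struct variant could look
something like this, with a caller bracketing a constrained region
instead of threading gfp_t through every layer:

/* hypothetical: a per-task allocation mask with push/pop helpers */
static inline gfp_t task_gfp_push(gfp_t clear)
{
        gfp_t old = current->gfp_allowed;	/* hypothetical field */

        current->gfp_allowed &= ~clear;
        return old;
}

static inline void task_gfp_pop(gfp_t old)
{
        current->gfp_allowed = old;
}

/* a NOIO region then brackets itself: */
static void some_noio_path(void)
{
        gfp_t old = task_gfp_push(__GFP_IO | __GFP_FS);

        do_lots_of_nested_calls();	/* leaf GFP_KERNEL allocations narrowed */
        task_gfp_pop(old);
}

with the page allocator applying gfp_mask &= current->gfp_allowed on
entry so that every leaf allocation picks up the constraint for free.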


* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 22:54         ` Andrew Morton
@ 2011-12-22 23:00           ` Tejun Heo
  2011-12-22 23:16             ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

Hello, Andrew.

On Thu, Dec 22, 2011 at 02:54:26PM -0800, Andrew Morton wrote:
> > These stats are userland visible and quite useful ones if blkcg is in
> > use.  I don't really see how these can be removed.
> 
> What stats?

The ones allocated in the last patch.  blk_group_cpu_stats.

> And why are we doing percpu *allocation* so deep in the code?  You mean
> we're *creating* stats counters on an IO path?  Sounds odd.  Where is
> this code?

Please read below.

> > > > > Or how about we fix the percpu memory allocation code so that it
> > > > > propagates the gfp flags, then delete this patchset?
> > > > 
> > > > Oh, no, this is gonna make things *way* more complex.  I tried.
> > > 
> > > But there's a difference between fixing a problem and working around it.
> > 
> > Yeah, that was my first direction too.  The reason why percpu can't do
> > NOIO is the same one why vmalloc can't do it.  It reaches pretty deep
> > into page table code and I don't think doing all that churning is
> > worthwhile or even desirable.  An alternative approach would be
> > implementing transparent front buffer to percpu allocator, which I
> > *might* do if there really are more of these users, but I think
> > keeping percpu allocator painful to use from reclaim context isn't
> > such a bad idea.
> > 
> > There have been multiple requests for atomic allocation and they all
> > have been successfully pushed back, but IMHO this is a valid one and I
> > don't see a better way around the problem, so while I agree using
> > mempool for this is a workaround, I think it is a right choice, for
> > now, anyway.
> 
> For starters, doing pagetable allocation on the I/O path sounds nutty.
> 
> Secondly, GFP_NOIO is a *weaker* allocation mode than GFP_KERNEL.  By
> permitting it with this patchset, we have a kernel which is more likely
> to get oom failures.  Fixing the kernel to not perform GFP_NOIO
> allocations for these counters will result in a more robust kernel. 
> This is a good thing, which improves the kernel while avoiding adding
> more complexity elsewhere.
> 
> This patchset is the worst option and we should try much harder to avoid
> applying it!

The stats are per cgroup - request_queue pair.  We don't want to
allocate them up front for every combination, as there are
configurations with a stupid number of request_queues and silly many
cgroups, and #cgroups * #request_queues * #cpus can be huge.  So, we
want on-demand allocation.  While the stats are important, they are
not critical and the allocations can be opportunistic.  If the
allocation fails this time, we can try again the next time.

So, yeah, the suggested solution fits the problem.  If you have a
better idea, please don't be shy.

> > Yeah, some of PF_* flags already carry related role information.  I'm
> > not too sure how much pushing the whole thing into task_struct would
> > change tho.  We would need push/popping.  It could be simpler in some
> > cases but in essence wouldn't we have just relocated the position of
> > parameter?
> 
> The code would get considerably simpler.  The big benefit comes when
> you have deep call stacks - we're presently passing a gfp_t down five
> layers of function call while none of the intermediate functions even
> use the thing - they just pass it on to the next guy.  Pass it via the
> task_struct and all that goes away.  It would make maintenance a lot
> easier - at present if you want to add a new kmalloc() to a leaf
> function you need to edit all five layers of caller functions.

Hmmm... yeah, the relocation could save a lot of hassle, I suppose.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:00           ` Tejun Heo
@ 2011-12-22 23:16             ` Andrew Morton
  2011-12-22 23:24               ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-22 23:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 15:00:47 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello, Andrew.
> 
> On Thu, Dec 22, 2011 at 02:54:26PM -0800, Andrew Morton wrote:
> > > These stats are userland visible and quite useful ones if blkcg is in
> > > use.  I don't really see how these can be removed.
> > 
> > What stats?
> 
> The ones allocated in the last patch.  blk_group_cpu_stats.

What last patch.

I can find no occurrence of "blk_group_cpu_stats" on linux-kernel or in
the kernel tree.

> > For starters, doing pagetable allocation on the I/O path sounds nutty.
> > 
> > Secondly, GFP_NOIO is a *weaker* allocation mode than GFP_KERNEL.  By
> > permitting it with this patchset, we have a kernel which is more likely
> > to get oom failures.  Fixing the kernel to not perform GFP_NOIO
> > allocations for these counters will result in a more robust kernel. 
> > This is a good thing, which improves the kernel while avoiding adding
> > more compexity elsewhere.
> > 
> > This patchset is the worst option and we should try much harder to avoid
> > applying it!
> 
> The stats are per cgroup - request_queue pair.  We don't want to
> allocate for all of them for each combination as there are
> configurations with stupid number of request_queues and silly many
> cgroups and #cgroups * #request_queue * #cpus can be huge.  So, we
> want on-demand allocation.  While the stats are important, they are
> not critical and allocations can be opportunistic.  If the allocation
> fails this time, we can try it for the next time.

Without code to look at I am at a loss.

request_queues are allocated in blk_alloc_queue_node(), which uses
GFP_KERNEL (and also mysteriously takes a gfp_t arg).

> So, yeah, the suggested solution fits the problem.  If you have a
> better idea, please don't be shy.

Unsure which solution you're referring to here.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:16             ` Andrew Morton
@ 2011-12-22 23:24               ` Tejun Heo
  2011-12-22 23:41                 ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 23:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

Hello,

On Thu, Dec 22, 2011 at 03:16:49PM -0800, Andrew Morton wrote:
> > The ones allocated in the last patch.  blk_group_cpu_stats.
> 
> What last patch.
> 
> I can find no occurence of "blk_group_cpu_stats" on linux-kernel or in
> the kernel tree.

Sorry, it's blkio_group_stats_cpu.  It's in the seventh patch in this
series.

> > The stats are per cgroup - request_queue pair.  We don't want to
> > allocate for all of them for each combination as there are
> > configurations with stupid number of request_queues and silly many
> > cgroups and #cgroups * #request_queue * #cpus can be huge.  So, we
> > want on-demand allocation.  While the stats are important, they are
> > not critical and allocations can be opportunistic.  If the allocation
> > fails this time, we can try it for the next time.
> 
> Without code to look at I am at a loss.

block/blk-cgroup.c blk-throttle.c cfq-iosched.c.  Have fun.

> request_queues are allocated in blk_alloc_queue_node(), which uses
> GFP_KERNEL (and also mysteriously takes a gfp_t arg).

Yeah, sure, we *can* allocate everything for every combination when
either request_queue or cgroup comes up.  That's the thing I tried to
explain in the above quoted paragraph.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:24               ` Tejun Heo
@ 2011-12-22 23:41                 ` Andrew Morton
  2011-12-22 23:54                   ` Tejun Heo
  2011-12-23  1:21                   ` Vivek Goyal
  0 siblings, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2011-12-22 23:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 15:24:33 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, Dec 22, 2011 at 03:16:49PM -0800, Andrew Morton wrote:
> > > The ones allocated in the last patch.  blk_group_cpu_stats.
> > 
> > What last patch.
> > 
> > I can find no occurence of "blk_group_cpu_stats" on linux-kernel or in
> > the kernel tree.
> 
> Sorry it's blkio_group_stats_cpu.  It's in the seventh path in this
> series.
> 
> > > The stats are per cgroup - request_queue pair.  We don't want to
> > > allocate for all of them for each combination as there are
> > > configurations with stupid number of request_queues and silly many
> > > cgroups and #cgroups * #request_queue * #cpus can be huge.  So, we
> > > want on-demand allocation.  While the stats are important, they are
> > > not critical and allocations can be opportunistic.  If the allocation
> > > fails this time, we can try it for the next time.
> > 
> > Without code to look at I am at a loss.
> 
> block/blk-cgroup.c blk-throttle.c cfq-iosched.c.  Have fun.
> 
> > request_queues are allocated in blk_alloc_queue_node(), which uses
> > GFP_KERNEL (and also mysteriously takes a gfp_t arg).
> 
> Yeah, sure, we *can* allocate everything for every combination when
> either request_queue or cgroup comes up.  That's the thing I tried to
> explain in the above quoted paragraph.
> 

All the code I'm looking at assumes that blkio_group.stats_cpu is
non-zero.  Won't the kernel just go splat if that allocation failed?

If the code *does* correctly handle ->stats_cpu == NULL then we have
options.

a) Give userspace a new procfs/debugfs file to start stats gathering
   on a particular cgroup/request_queue pair.  Allocate the stats
   memory in that.

b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
   and return zeroes for this first call.

c) Or change the low-level code to do
   blkio_group.want_stats_cpu=true, then test that at the top level
   after we've determined that blkio_group.stats_cpu is NULL.

d) Or, worse, punt the allocation into a workqueue thread.

Note that all these option will permit us to use GFP_KERNEL, which is
better.

Note that a) and b) means that users get control over whether these
stats are accumulated at all, so many won't incur needless memory and
CPU consumption.

I think I like b).  Fix the code so it doesn't oops when ->stats_cpu is
NULL, then turn on stats gathering the first time someone tries to read
the stats.
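
A minimal sketch of (b), assuming the fast path has first been fixed to
tolerate ->stats_cpu == NULL and that the stats read side runs in
process context (the helper name here is made up):

static int blkio_ensure_stats_cpu(struct blkio_group *blkg)
{
	struct blkio_group_stats_cpu __percpu *stats;

	if (blkg->stats_cpu)
		return 0;

	/* read side is process context, so a plain GFP_KERNEL percpu alloc */
	stats = alloc_percpu(struct blkio_group_stats_cpu);
	if (!stats)
		return -ENOMEM;

	/* install once; a concurrent reader may have beaten us to it */
	if (cmpxchg(&blkg->stats_cpu, NULL, stats) != NULL)
		free_percpu(stats);
	return 0;
}

blkio_read_stat_cpu() would call this before summing and simply return
zeroes on that first read.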

(Someone appears to have misspelled "throttle" as "throtl" for no
apparent reason about 1000 times.  Sigh.)



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:41                 ` Andrew Morton
@ 2011-12-22 23:54                   ` Tejun Heo
  2011-12-23  1:14                     ` Andrew Morton
  2011-12-23  1:21                   ` Vivek Goyal
  1 sibling, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-22 23:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

Hello,

On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> All the code I'm looking at assumes that blkio_group.stats_cpu is
> non-zero.  Won't the kerenl just go splat if that allocation failed?
> 
> If the code *does* correctly handle ->stats_cpu == NULL then we have
> options.

I think it's supposed to just skip creating whole blk_group if percpu
allocation fails, so ->stats_cpu of existing groups are guaranteed to
be !%NULL.

> a) Give userspace a new procfs/debugfs file to start stats gathering
>    on a particular cgroup/request_queue pair.  Allocate the stats
>    memory in that.
> 
> b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
>    and return zeroes for this first call.

Hmmm... IIRC, the stats aren't exported per cgroup-request_queue pair,
so reads are issued per cgroup.  We can't tell which request_queues
userland is actually interested in.

> c) Or change the low-level code to do
>    blkio_group.want_stats_cpu=true, then test that at the top level
>    after we've determined that blkio_group.stats_cpu is NULL.

Not following.  Where's the "top level"?

> d) Or, worse, punt the allocation into a workqueue thread.

I would much prefer using mempool to this.  They are essentially the
same approach.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup
  2011-12-22 21:45 ` [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
@ 2011-12-23  1:00   ` Vivek Goyal
  2011-12-23 22:54     ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23  1:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 01:45:26PM -0800, Tejun Heo wrote:

[..]
> +/*
> + * Percpu mempool for blkio_group_stats_cpu which are allocated on-demand
> + * on IO path.  As percpu doesn't support NOIO allocations, we need to
> + * buffer them through mempool.
> + */
> +struct percpu_mempool *blkg_stats_cpu_pool;
> +EXPORT_SYMBOL_GPL(blkg_stats_cpu_pool);
> +

Thanks for this work Tejun.

So this pool is global and shared by all devices (by both the CFQ and
throttling logic). I am not sure if there is any advantage to having
this pool per queue. In one of the mails you mentioned that dm has one
pool shared across many devices, which defeats the purpose of having a
mempool.

**********************************************************************
It seemed the problem was dm sharing the same pool from multiple devices
(which defeats the purpose of using mempool BTW)
*********************************************************************

Not sure why devices sharing a mempool is a bad idea. Having a global pool
definitely simplifies the implementation a lot in this case.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:54                   ` Tejun Heo
@ 2011-12-23  1:14                     ` Andrew Morton
  2011-12-23 15:17                       ` Vivek Goyal
  2011-12-27 18:34                       ` Tejun Heo
  0 siblings, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2011-12-23  1:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel

On Thu, 22 Dec 2011 15:54:55 -0800 Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > All the code I'm looking at assumes that blkio_group.stats_cpu is
> > non-zero.  Won't the kerenl just go splat if that allocation failed?
> > 
> > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > options.
> 
> I think it's supposed to just skip creating whole blk_group if percpu
> allocation fails, so ->stats_cpu of existing groups are guaranteed to
> be !%NULL.

What is the role of ->elevator_set_req_fn()?  And when is it called?

It seems that we allocate the blkio_group within the
elevator_set_req_fn() context?

(Your stack trace in the "block: fix deadlock through percpu allocation
in blk-cgroup" changelog is some unhelpful ACPI thing.  It would be
better if it showed the offending trace down into the block code.)

> > a) Give userspace a new procfs/debugfs file to start stats gathering
> >    on a particular cgroup/request_queue pair.  Allocate the stats
> >    memory in that.
> > 
> > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> >    and return zeroes for this first call.
> 
> Hmmm... IIRC, the stats aren't exported per cgroup-request_queue pair,
> so reads are issued per cgroup.  We can't tell which request_queues
> userland is actually interested in.

Doesn't matter.  The stats are allocated on a per-blkio_group basis. 
blkio_read_stat_cpu() is passed the blkio_group.  Populate ->stats_cpu
there.

Advantages:

- performs allocation with the more reliable GFP_KERNEL

- avoids burdening users with the space and CPU overhead when they're
  not using the stats 

- avoids adding more code into the mempool code.

> > c) Or change the low-level code to do
> >    blkio_group.want_stats_cpu=true, then test that at the top level
> >    after we've determined that blkio_group.stats_cpu is NULL.
> 
> Not following.  Where's the "top level"?

Somewhere appropriate where we can use GFP_KERNEL.  ie: the correct
context for percpu_alloc().


Separately...

Mixing mempools and percpu_alloc() in the proposed fashion seems a
pretty poor fit.  mempools are for high-frequency low-level allocations
which have key characteristics: there are typically a finite number of
elements in flight and we *know* that elements are being freed in a
timely manner.

This doesn't fit with percpu_alloc(), which is a very heavyweight
operation requiring GFP_KERNEL and it doesn't fit with
blkio_group_stats_cpu because blkio_group_stats_cpu does not have the
"freed in a timely manner" behaviour.

To resolve these things you've added the workqueue to keep the pool
populated, which turns percpu_mempool into a quite different concept
which happens to borrow some mempool code (not necessarily a bad thing).
This will result in some memory wastage, keeping that pool full.

More significantly, it's pretty unreliable: if the allocations outpace
the kernel thread's ability to refill the pool, all we can do is to
wait for the kernel thread to do some work.  But we're holding
low-level locks while doing that wait, which will block the kernel
thread.  Deadlock.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 23:41                 ` Andrew Morton
  2011-12-22 23:54                   ` Tejun Heo
@ 2011-12-23  1:21                   ` Vivek Goyal
  2011-12-23  1:38                     ` Andrew Morton
  1 sibling, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23  1:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 15:24:33 -0800
> Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Thu, Dec 22, 2011 at 03:16:49PM -0800, Andrew Morton wrote:
> > > > The ones allocated in the last patch.  blk_group_cpu_stats.
> > > 
> > > What last patch.
> > > 
> > > I can find no occurence of "blk_group_cpu_stats" on linux-kernel or in
> > > the kernel tree.
> > 
> > Sorry it's blkio_group_stats_cpu.  It's in the seventh path in this
> > series.
> > 
> > > > The stats are per cgroup - request_queue pair.  We don't want to
> > > > allocate for all of them for each combination as there are
> > > > configurations with stupid number of request_queues and silly many
> > > > cgroups and #cgroups * #request_queue * #cpus can be huge.  So, we
> > > > want on-demand allocation.  While the stats are important, they are
> > > > not critical and allocations can be opportunistic.  If the allocation
> > > > fails this time, we can try it for the next time.
> > > 
> > > Without code to look at I am at a loss.
> > 
> > block/blk-cgroup.c blk-throttle.c cfq-iosched.c.  Have fun.
> > 
> > > request_queues are allocated in blk_alloc_queue_node(), which uses
> > > GFP_KERNEL (and also mysteriously takes a gfp_t arg).
> > 
> > Yeah, sure, we *can* allocate everything for every combination when
> > either request_queue or cgroup comes up.  That's the thing I tried to
> > explain in the above quoted paragraph.
> > 
> 
> All the code I'm looking at assumes that blkio_group.stats_cpu is
> non-zero.  Won't the kerenl just go splat if that allocation failed?

If per cpu stat allocation fails, we fail the whole group allocation;
the IO is accounted to the root group and the allocation is tried again
when a new IO request comes in.

Look at throtl_alloc_tg() in block/blk-throttle.c

> 
> If the code *does* correctly handle ->stats_cpu == NULL then we have
> options.
> 
> a) Give userspace a new procfs/debugfs file to start stats gathering
>    on a particular cgroup/request_queue pair.  Allocate the stats
>    memory in that.
> 
> b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
>    and return zeroes for this first call.

But the purpose of stats is that they are gathered even if somebody
has not read them even once? So if I create a cgroup and put some
task into it which does some IO, I think stat collection should start
immediately without user taking any action. Forcing the user to first
read a stat before the collection starts is kind of odd to me.

> 
> c) Or change the low-level code to do
>    blkio_group.want_stats_cpu=true, then test that at the top level
>    after we've determined that blkio_group.stats_cpu is NULL.
> 
> d) Or, worse, punt the allocation into a workqueue thread.

I implemented a patch to punt the allocation using a worker thread. Tejun
did not like it. I personally think that it is less intrusive to fix this
specific problem.

https://lkml.org/lkml/2011/12/19/291

> 
> Note that all these option will permit us to use GFP_KERNEL, which is
> better.
> 
> Note that a) and b) means that users get control over whether these
> stats are accumulated at all, so many won't incur needless memory and
> CPU consumption.
> 
> I think I like b).  Fix the code so it doesn't oops when ->stats_cpu is
> NULL, then turn on stats gathering the first time someone tries to read
> the stats.
> 
> (Someone appears to have misspelled "throttle" as "throtl" for no
> apparent reason about 1000 times.  Sigh.)

That someone would be me. I thought that throtl communicates the meaning
and keeps the length of all the strings relatively short. But if it does
not look good, I can change it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:21                   ` Vivek Goyal
@ 2011-12-23  1:38                     ` Andrew Morton
  2011-12-23  2:54                       ` Vivek Goyal
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-23  1:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:

> On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > 
> > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > options.
> > 
> > a) Give userspace a new procfs/debugfs file to start stats gathering
> >    on a particular cgroup/request_queue pair.  Allocate the stats
> >    memory in that.
> > 
> > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> >    and return zeroes for this first call.
> 
> But the purpose of stats is that they are gathered even if somebody
> has not read them even once?

That's not a useful way of using stats.  The normal usage would be to
record the stats then start the workload then monitor how the stats
have changed as work proceeds.

> So if I create a cgroup and put some
> task into it which does some IO, I think stat collection should start
> immediately without user taking any action.

If you really want to know the stats since cgroup creation then trigger
the stats initialisation from userspace when creating the blkio_cgroup.

> Forcing the user to first
> read a stat before the collection starts is kind of odd to me.

Well one could add a separate stats_enable knob.  Doing it
automatically from read() would be for approximate-back-compatibility
with existing behaviour.

Plus (again) this way we also avoid burdening non-stats-users with the
overhead of stats.

> > 
> > c) Or change the low-level code to do
> >    blkio_group.want_stats_cpu=true, then test that at the top level
> >    after we've determined that blkio_group.stats_cpu is NULL.
> > 
> > d) Or, worse, punt the allocation into a workqueue thread.
> 
> I implemented a patch to punt the allocation using a worker thread. Tejun
> did not like it. I personally think that it is less intrusive to fix this
> specific problem.
> 
> https://lkml.org/lkml/2011/12/19/291

hm.

Look, the basic problem here is a highish-level design one.  We're
attempting to do a high-level heavyweight initialization operation in a
low-level place.  How do we fix that *properly*?  It has to be by
performing the heavyweight operation in an appropriate place.

> > 
> > Note that all these option will permit us to use GFP_KERNEL, which is
> > better.
> > 
> > Note that a) and b) means that users get control over whether these
> > stats are accumulated at all, so many won't incur needless memory and
> > CPU consumption.
> > 
> > I think I like b).  Fix the code so it doesn't oops when ->stats_cpu is
> > NULL, then turn on stats gathering the first time someone tries to read
> > the stats.
> > 
> > (Someone appears to have misspelled "throttle" as "throtl" for no
> > apparent reason about 1000 times.  Sigh.)
> 
> That someone would be me. I thought that throtl communicates the meaning
> and keeps the length of all the strings relatively short. But if it does
> not look good, I can change it.

It's unconventional.  We do usually avoid the odd abbreviations and
just spell the whole thing out.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-22 22:20     ` Andrew Morton
  2011-12-22 22:41       ` Tejun Heo
@ 2011-12-23  1:40       ` Vivek Goyal
  2011-12-23  1:58         ` Andrew Morton
  1 sibling, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23  1:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 02:20:58PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 14:09:11 -0800
> Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Thu, Dec 22, 2011 at 01:59:25PM -0800, Andrew Morton wrote:
> > > How about we just delete those statistics and then this patchset?
> > > 
> > > Or how about we change those statistics to not do percpu allocations,
> > > then delete this patchset?
> > 
> > I'm not against above both
> 
> Don't just consider my suggestions - please try to come up with your own
> alternatives too!  If all else fails then this patch is a last resort.
> 
> > but apparently those percpu stats reduced
> > CPU overhead significantly.
> 
> Deleting them would save even more CPU.
> 

[..]
> Or make them runtime or compile-time configurable, so only the
> developers see the impact.

Some of the stats are already under a debug option (DEBUG_BLK_CGROUP),
but the rest seem too useful to be hidden behind a debug option.

Making them runtime configurable is an option. I am assuming that would
be a global option and not a per cgroup per device option. If so, then
again you have the same problem: after enabling the stats, any new
group creation or new device creation will require allocation of the
per cpu stats.

So I think we need to figure out a way to allocate the per cpu stats
dynamically.

> 
> Some specifics on which counters are causing the problems would help here.

Various kinds of stats are collected. The current per cpu stats are:

- Number of sectors transferred.
- Number of bytes transferred.
- Number of IOs transferred.
- Number of IOs merged 

If a user has not put any throttling rules in the cgroup, then we do want
to collect the stats but don't want to take any locks. Otherwise, on fast
devices, e.g. PCI-E based flash, the locking becomes a bottleneck.

So far we have been taking the request queue lock. I guess if we fall
back to non-per-cpu stats, we should be able to get away with the
group's stat lock (blkg->stats_lock) and access the group under RCU.
That would be an improvement, as the lock would be per group and not
per device, but I think it is still a problem for most users because
the most contended group is the root group.

Distributions are now shipping with the throttling logic enabled; by
default all IO goes through the root group, and we don't want to take
blkg->stats_lock on every IO submission just to collect the stats.

That's why the need of per cpu data structures to make stat collection
lockless.
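
For reference, this is roughly the lockless update path the per cpu
stats buy us (field names simplified, illustrative only):

static void blkio_add_stat_cpu(struct blkio_group *blkg, unsigned int bytes)
{
	struct blkio_group_stats_cpu *stats_cpu;

	/* no queue or group lock; just stay on this CPU for the update */
	stats_cpu = get_cpu_ptr(blkg->stats_cpu);
	u64_stats_update_begin(&stats_cpu->syncp);
	stats_cpu->sectors += bytes >> 9;
	stats_cpu->service_bytes += bytes;
	stats_cpu->serviced++;
	u64_stats_update_end(&stats_cpu->syncp);
	put_cpu_ptr(blkg->stats_cpu);
}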

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:40       ` Vivek Goyal
@ 2011-12-23  1:58         ` Andrew Morton
  2011-12-23  2:56           ` Vivek Goyal
  2011-12-23 14:46           ` Vivek Goyal
  0 siblings, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2011-12-23  1:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, 22 Dec 2011 20:40:43 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:

> That's why the need of per cpu data structures to make stat collection
> lockless.

btw, (and this is a common refrain): was there any reason for avoiding
using percpu_counters here?

afacit these stats don't have the transient-negative-number problem
because they're purely upcounters, but percpu_counters would be helpful
when num_online_cpus is much less than num_possible_cpus.  Plus it would
avoid implementing yet another hand-coded percpu_counter...
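
For reference, toy usage of the existing percpu_counter API (nothing
blkcg-specific, just what the interface looks like):

static s64 percpu_counter_example(unsigned long nr)
{
	struct percpu_counter nr_sectors;
	s64 total;

	if (percpu_counter_init(&nr_sectors, 0))	/* wants GFP_KERNEL context */
		return -ENOMEM;
	percpu_counter_add(&nr_sectors, nr);		/* cheap, per-cpu batched */
	total = percpu_counter_sum(&nr_sectors);	/* slower, exact readout */
	percpu_counter_destroy(&nr_sectors);
	return total;
}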




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:38                     ` Andrew Morton
@ 2011-12-23  2:54                       ` Vivek Goyal
  2011-12-23  3:11                         ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23  2:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > 
> > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > options.
> > > 
> > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > >    on a particular cgroup/request_queue pair.  Allocate the stats
> > >    memory in that.
> > > 
> > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > >    and return zeroes for this first call.
> > 
> > But the purpose of stats is that they are gathered even if somebody
> > has not read them even once?
> 
> That's not a useful way of using stats.  The normal usage would be to
> record the stats then start the workload then monitor how the stats
> have changed as work proceeds.

I have at least one example "iostat" which does not follow this. Its
first report shows the total stats since the system boot and each 
subsequent report covers time since previous report. With stats being
available since the cgroup creation time, one can think of extending
iostat tool to display per IO cgroup stats too.

Also we have a knob "reset_stats" to reset all the stats to zero. So
one can first reset stats, start the workload and then monitor it (if one
does not like stats since the cgroup creation time).

> 
> > So if I create a cgroup and put some
> > task into it which does some IO, I think stat collection should start
> > immediately without user taking any action.
> 
> If you really want to know the stats since cgroup creation then trigger
> the stats initialisation from userspace when creating the blkio_cgroup.

These per cpu stats are per cgroup per device. So if a workload in a
cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas for
stats. So at cgroup creation time we just don't know how many of these
to create and also it does not cover the case of device hotplug after
cgroup creation.

> 
> > Forcing the user to first
> > read a stat before the collection starts is kind of odd to me.
> 
> Well one could add a separate stats_enable knob.  Doing it
> automatically from read() would be for approximate-back-compatibility
> with existing behaviour.
> 
> Plus (again) this way we also avoid burdening non-stats-users with the
> overhead of stats.

Even if we do that we have the problem of hotplugged devices. Assume a
cgroup is created and stats are enabled; now a new device shows up and
some task in the group does IO on that device. Now we need to create the
percpu data area for that cgroup-device pair dynamically in the IO path
and we are back to the same problem.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:58         ` Andrew Morton
@ 2011-12-23  2:56           ` Vivek Goyal
  2011-12-26  6:05             ` KAMEZAWA Hiroyuki
  2011-12-23 14:46           ` Vivek Goyal
  1 sibling, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23  2:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 05:58:34PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 20:40:43 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > That's why the need of per cpu data structures to make stat collection
> > lockless.
> 
> btw, (and this is a common refrain): was there any reason for avoiding
> using percpu_counters here?

Frankly speaking I did not consider that. I will look into it (/me needs
to read up a bit on per cpu counter and look at code examples).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  2:54                       ` Vivek Goyal
@ 2011-12-23  3:11                         ` Andrew Morton
  2011-12-23 14:58                           ` Vivek Goyal
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-23  3:11 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, 22 Dec 2011 21:54:11 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:

> On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > > 
> > > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > > options.
> > > > 
> > > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > > >    on a particular cgroup/request_queue pair.  Allocate the stats
> > > >    memory in that.
> > > > 
> > > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > > >    and return zeroes for this first call.
> > > 
> > > But the purpose of stats is that they are gathered even if somebody
> > > has not read them even once?
> > 
> > That's not a useful way of using stats.  The normal usage would be to
> > record the stats then start the workload then monitor how the stats
> > have changed as work proceeds.
> 
> I have atleast one example "iostat" which does not follow this. Its
> first report shows the total stats since the system boot and each 
> subsequent report covers time since previous report. With stats being
> available since the cgroup creation time, one can think of extending
> iostat tool to display per IO cgroup stats too.

If that's useful (dubious) then it can be addressed by creating the
stats when a device is bound to the cgroup (below).

> Also we have a knob "reset_stats" to reset all the stats to zero. So
> one can first reset stats, starts workload and then monitor it (if one
> does not like stats since the cgroup creation time).
> 
> > 
> > > So if I create a cgroup and put some
> > > task into it which does some IO, I think stat collection should start
> > > immediately without user taking any action.
> > 
> > If you really want to know the stats since cgroup creation then trigger
> > the stats initialisation from userspace when creating the blkio_cgroup.
> 
> These per cpu stats are per cgroup per device. So if a workload in a
> cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas for
> stats. So at cgroup creation time we just don't know how many of these
> to create and also it does not cover the case of device hotplug after
> cgroup creation.

Mark the cgroup as "needing stats" then allocate the stats (if needed)
when a device is bound to the cgroup.  Rather than on first I/O.

> > 
> > > Forcing the user to first
> > > read a stat before the collection starts is kind of odd to me.
> > 
> > Well one could add a separate stats_enable knob.  Doing it
> > automatically from read() would be for approximate-back-compatibility
> > with existing behaviour.
> > 
> > Plus (again) this way we also avoid burdening non-stats-users with the
> > overhead of stats.
> 
> Even if we do that we have the problem with hoplugged device. Assume a
> cgroup created, stats enabled now a new devices shows up and some task
> in the group does IO on that device. Now we need to create percpu data
> area for that cgroup-device pair dynamically in IO path and we are back
> to the same problem. 

Why do the allocation during I/O?  Can't it be done in the hotplug handler?


Please, don't expend brain cells thinking of reasons why this all can't
be done.  Instead expend them on finding a way in which it *can* be
done!

Maybe doing all this cgroups stuff within ->elevator_set_req_fn() was
the wrong design decision.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:58         ` Andrew Morton
  2011-12-23  2:56           ` Vivek Goyal
@ 2011-12-23 14:46           ` Vivek Goyal
  1 sibling, 0 replies; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23 14:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 05:58:34PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 20:40:43 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > That's why the need of per cpu data structures to make stat collection
> > lockless.
> 
> btw, (and this is a common refrain): was there any reason for avoiding
> using percpu_counters here?

IIUC, per cpu counters also call alloc_percpu() during initialization.

int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
                          struct lock_class_key *key)
{
        raw_spin_lock_init(&fbc->lock);
        lockdep_set_class(&fbc->lock, key);
        fbc->count = amount;
        fbc->counters = alloc_percpu(s32);
....
....
}

So they are no good either for allocation and initialization in the IO path.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  3:11                         ` Andrew Morton
@ 2011-12-23 14:58                           ` Vivek Goyal
  2011-12-27 21:25                             ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23 14:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 07:11:44PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 21:54:11 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> > > On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> > > 
> > > > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > > > 
> > > > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > > > options.
> > > > > 
> > > > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > > > >    on a particular cgroup/request_queue pair.  Allocate the stats
> > > > >    memory in that.
> > > > > 
> > > > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > > > >    and return zeroes for this first call.
> > > > 
> > > > But the purpose of stats is that they are gathered even if somebody
> > > > has not read them even once?
> > > 
> > > That's not a useful way of using stats.  The normal usage would be to
> > > record the stats then start the workload then monitor how the stats
> > > have changed as work proceeds.
> > 
> > I have atleast one example "iostat" which does not follow this. Its
> > first report shows the total stats since the system boot and each 
> > subsequent report covers time since previous report. With stats being
> > available since the cgroup creation time, one can think of extending
> > iostat tool to display per IO cgroup stats too.
> 
> If that's useful (dubious) then it can be addressed by creating the
> stats when a device is bound to the cgroup (below).
> 
> > Also we have a knob "reset_stats" to reset all the stats to zero. So
> > one can first reset stats, starts workload and then monitor it (if one
> > does not like stats since the cgroup creation time).
> > 
> > > 
> > > > So if I create a cgroup and put some
> > > > task into it which does some IO, I think stat collection should start
> > > > immediately without user taking any action.
> > > 
> > > If you really want to know the stats since cgroup creation then trigger
> > > the stats initialisation from userspace when creating the blkio_cgroup.
> > 
> > These per cpu stats are per cgroup per device. So if a workload in a
> > cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas for
> > stats. So at cgroup creation time we just don't know how many of these
> > to create and also it does not cover the case of device hotplug after
> > cgroup creation.
> 
> Mark the cgroup as "needing stats" then allocate the stats (if needed)
> when a device is bound to the cgroup.  Rather than on first I/O.

This will work for the throttling case, where a user has to specifically
put throttling rules in place for each device and that can be considered
binding the device to the cgroup. In that case we will not be collecting
stats for devices that have no rules. I guess I can live with that.

But it still does not work for the CFQ case, where there might not be
any user-initiated device binding to the cgroup. The user might just
specify a cgroup weight (like a task ioprio), and the binding to a device
is created automatically on the first IO to that device from the cgroup.
The user does not initiate any specific binding.

> 
> > > 
> > > > Forcing the user to first
> > > > read a stat before the collection starts is kind of odd to me.
> > > 
> > > Well one could add a separate stats_enable knob.  Doing it
> > > automatically from read() would be for approximate-back-compatibility
> > > with existing behaviour.
> > > 
> > > Plus (again) this way we also avoid burdening non-stats-users with the
> > > overhead of stats.
> > 
> > Even if we do that we have the problem with hoplugged device. Assume a
> > cgroup created, stats enabled now a new devices shows up and some task
> > in the group does IO on that device. Now we need to create percpu data
> > area for that cgroup-device pair dynamically in IO path and we are back
> > to the same problem. 
> 
> Why do the allocation during I/O?  Can't it be done in the hotplug handler?
> 

Even if we can do it in the hotplug handler it will be very wasteful of
memory. If there are 100 IO cgroups in the system, then upon every block
device hotplug we will allocate per cpu memory for all 100 cgroups,
irrespective of whether they are doing IO to the device or not.

Now expand this to a system with 100 cgroups and 100 LUNs: 10000
allocations for no reason. (Even doing it only for cgroups needing stats
does not help much.) The current scheme allocates memory for the group
only if a specific cgroup is doing IO to a specific block device.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:14                     ` Andrew Morton
@ 2011-12-23 15:17                       ` Vivek Goyal
  2011-12-27 18:34                       ` Tejun Heo
  1 sibling, 0 replies; 51+ messages in thread
From: Vivek Goyal @ 2011-12-23 15:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, Dec 22, 2011 at 05:14:32PM -0800, Andrew Morton wrote:

[..]
> Separately...
> 
> Mixing mempools and percpu_alloc() in the proposed fashion seems a
> pretty poor fit.  mempools are for high-frequency low-level allocations
> which have key characteristics: there are typically a finite number of
> elements in flight and we *know* that elements are being freed in a
> timely manner.
> 
> This doesn't fit with percpu_alloc(), which is a very heavyweight
> operation requiring GFP_KERNEL and it doesn't fit with
> blkio_group_stats_cpu because blkio_group_stats_cpu does not have the
> "freed in a timely manner" behaviour.
> 
> To resolve these things you've added the workqueue to keep the pool
> populated, which turns percpu_mempool into a quite different concept
> which happens to borrow some mempool code (not necessarily a bad thing).
> This will result in some memory wastage, keeping that pool full.

Tejun seems to have created one global pool for all the devices with a
minimum of 8 objects. So keeping just 8 objects aside for the whole
system is not a significant memory waste, IMHO. This is much better
than allocating for every possible cgroup-device pair irrespective of
whether IO is happening on that pair or not.

> 
> More significantly, it's pretty unreliable: if the allocations outpace
> the kernel thread's ability to refill the pool, all we can do is to
> wait for the kernel thread to do some work.  But we're holding
> low-level locks while doing that wait, which will block the kernel
> thread.  Deadlock.

Sorry, I did not understand this point. In the IO path we don't wait
for the kernel thread to do any work. If there are no elements in the
pool, we simply return without creating the group and the IO is
accounted to the root group. Once new IO comes in, it tries again, and
if by that time the worker thread has succeeded in refilling the pool,
we allocate the group.

So in a tight memory situation we will not account the IO to the actual
cgroup but to the root cgroup. Once the memory situation eases, IO will
be accounted to the right cgroup.
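
Roughly, the allocation path looks like this (simplified from the
throtl_alloc_tg() flow; percpu_mempool_alloc() just stands in for
whatever the final non-sleeping pool interface ends up being called):

static struct throtl_grp *throtl_alloc_tg_sketch(struct throtl_data *td)
{
	struct throtl_grp *tg;

	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
	if (!tg)
		return NULL;

	/* never sleeps; just grabs a pre-filled element if one is there */
	tg->blkg.stats_cpu = percpu_mempool_alloc(blkg_stats_cpu_pool);
	if (!tg->blkg.stats_cpu) {
		/*
		 * Pool is empty: give up on the group for now.  The
		 * caller falls back to the root group for this IO while
		 * the refill worker runs in the background; the next IO
		 * simply tries again.  Nothing ever waits on the worker,
		 * hence no deadlock.
		 */
		kfree(tg);
		return NULL;
	}

	/* ... init limits, link the group into the lists, etc ... */
	return tg;
}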

Can't see any deadlock happening. Did I miss something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup
  2011-12-23  1:00   ` Vivek Goyal
@ 2011-12-23 22:54     ` Tejun Heo
  0 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-23 22:54 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: akpm, avi, nate, cl, oleg, axboe, linux-kernel

Hello,

On Thu, Dec 22, 2011 at 5:00 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> So this pool is global and shared by all devices (both by CFQ and
> throttling logic). I am not sure if there are any advantages of having
> this pool per queue. In one of the mails you mentioned that dm is having
> one pool across many devices which defeats the purpose of having mempool.
>
> **********************************************************************
> It seemed the problem was dm sharing the same pool from multiple devices
> (which defeats the purpose of using mempool BTW)
> *********************************************************************
>
> Not sure why devices sharing a mempool is a bad idea. Having a global pool
> definitely simplifies the implementation a lot in this case.

Ooh, sharing breaks the forward-progress guarantee provided by mempool.
blkcg isn't using the mempool for that; it's just using it as an
allocation buffer, so sharing isn't relevant here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  2:56           ` Vivek Goyal
@ 2011-12-26  6:05             ` KAMEZAWA Hiroyuki
  2011-12-27 17:52               ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-26  6:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Thu, 22 Dec 2011 21:56:29 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Thu, Dec 22, 2011 at 05:58:34PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 20:40:43 -0500 Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > That's why the need of per cpu data structures to make stat collection
> > > lockless.
> > 
> > btw, (and this is a common refrain): was there any reason for avoiding
> > using percpu_counters here?
> 
> Frankly speaking I did not consider that. I will look into it (/me needs
> to read up a bit on per cpu counter and look at code examples).
> 

Hm, how about this kind of delayed initialization?

== this patch is untested. just for sharing idea ==

From 433b56fd5644d4b1e695bc16bbf8dd78842999fd Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 26 Dec 2011 15:06:21 +0900
Subject: [PATCH] percpu_counter: add lazy init

percpu_counter calls alloc_percpu().  This means percpu_counter_init()
assumes GFP_KERNEL context: the percpu allocation may end up in
vmalloc() and take locks.

If a caller doesn't want to assume GFP_KERNEL, we need some tricks.
This patch adds percpu_counter_lazy_init().

With lazy init, the function leaves fbc->counters as NULL and
initializes fbc->counters later from a workqueue.  The work item
handling reuses fbc->list, so the struct size of 'fbc' will not
increase if a user enables CONFIG_HOTPLUG_CPU.
---
 include/linux/percpu_counter.h |   32 +++++++-----
 lib/percpu_counter.c           |  110 +++++++++++++++++++++++++++++----------
 2 files changed, 100 insertions(+), 42 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index b9df9ed..1ab6d9a 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -18,22 +18,31 @@
 struct percpu_counter {
 	raw_spinlock_t lock;
 	s64 count;
-#ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
-#endif
 	s32 __percpu *counters;
 };
 
 extern int percpu_counter_batch;
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key);
+			  bool lazy, struct lock_class_key *key);
 
+/* percpu_counter_init() requires GFP_KERNEL context. */
 #define percpu_counter_init(fbc, value)					\
 	({								\
 		static struct lock_class_key __key;			\
 									\
-		__percpu_counter_init(fbc, value, &__key);		\
+		__percpu_counter_init(fbc, value, false, &__key);	\
+	})
+/*
+ * percpu_counter_lazy_init() doesn't require GFP_KERNEL but cannot be
+ * called in interrupt context. This function may sleep.
+ */
+#define percpu_counter_lazy_init(fbc, value)				\
+	({								\
+		static struct lock_class_key __key;			\
+									\
+		__percpu_counter_init(fbc, value, true, &__key);	\
 	})
 
 void percpu_counter_destroy(struct percpu_counter *fbc);
@@ -78,11 +87,6 @@ static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 	return 0;
 }
 
-static inline int percpu_counter_initialized(struct percpu_counter *fbc)
-{
-	return (fbc->counters != NULL);
-}
-
 #else
 
 struct percpu_counter {
@@ -94,6 +98,11 @@ static inline int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 	fbc->count = amount;
 	return 0;
 }
+static inline int
+percpu_counter_lazy_init(struct percpu_counter *fbc, s64 amount)
+{
+	return percpu_counter_init(fbc, amount);
+}
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
@@ -152,11 +161,6 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return percpu_counter_read(fbc);
 }
 
-static inline int percpu_counter_initialized(struct percpu_counter *fbc)
-{
-	return 1;
-}
-
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index f8a3f1a..ae19dcc 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -10,9 +10,10 @@
 #include <linux/module.h>
 #include <linux/debugobjects.h>
 
+static DEFINE_MUTEX(percpu_counters_lock);
+static LIST_HEAD(lazy_percpu_counters);
 #ifdef CONFIG_HOTPLUG_CPU
 static LIST_HEAD(percpu_counters);
-static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
 #ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER
@@ -62,9 +63,11 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 	int cpu;
 
 	raw_spin_lock(&fbc->lock);
-	for_each_possible_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		*pcount = 0;
+	if (likely(fbc->counters)) {
+		for_each_possible_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			*pcount = 0;
+		}
 	}
 	fbc->count = amount;
 	raw_spin_unlock(&fbc->lock);
@@ -76,14 +79,20 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 	s64 count;
 
 	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
+	if (likely(fbc->counters)) {
+		count = __this_cpu_read(*fbc->counters) + amount;
+		if (count >= batch || count <= -batch) {
+			raw_spin_lock(&fbc->lock);
+			fbc->count += count;
+			__this_cpu_write(*fbc->counters, 0);
+			raw_spin_unlock(&fbc->lock);
+		} else {
+			__this_cpu_write(*fbc->counters, count);
+		}
+	} else {
 		raw_spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
+		fbc->count += amount;
 		raw_spin_unlock(&fbc->lock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
 	}
 	preempt_enable();
 }
@@ -100,50 +109,95 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 
 	raw_spin_lock(&fbc->lock);
 	ret = fbc->count;
-	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
+	if (fbc->counters) {
+		for_each_online_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			ret += *pcount;
+		}
 	}
 	raw_spin_unlock(&fbc->lock);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+
+
+static void percpu_counter_do_lazy_init(struct work_struct *work)
+{
+	struct percpu_counter *fbc, *next;
+	struct delayed_work *dwork = to_delayed_work(work);
+	bool need_retry = false;
+
+	get_online_cpus();
+
+	mutex_lock(&percpu_counters_lock);
+
+	list_for_each_entry_safe(fbc, next, &lazy_percpu_counters, list) {
+		fbc->counters = alloc_percpu(s32);
+		if (!fbc->counters)
+			break;
+#ifdef CONFIG_HOTPLUG_CPU
+		list_move(&fbc->list, &percpu_counters);
+#else
+		list_del(&fbc->list);
+#endif
+	}
+	if (!list_empty(&lazy_percpu_counters))
+		need_retry = true;
+	mutex_unlock(&percpu_counters_lock);
+	put_online_cpus();
+	/* we don't want to be cpu hog at memory shortage */
+	if (need_retry)
+		schedule_delayed_work(dwork, HZ/10);
+}
+DECLARE_DELAYED_WORK(percpu_counter_lazy_init_work, percpu_counter_do_lazy_init);
+
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key)
+			  bool lazy, struct lock_class_key *key)
 {
 	raw_spin_lock_init(&fbc->lock);
 	lockdep_set_class(&fbc->lock, key);
 	fbc->count = amount;
-	fbc->counters = alloc_percpu(s32);
-	if (!fbc->counters)
-		return -ENOMEM;
-
+	if (!lazy) {
+		fbc->counters = alloc_percpu(s32);
+		if (!fbc->counters)
+			return -ENOMEM;
+	} else
+		fbc->counters = NULL;
 	debug_percpu_counter_activate(fbc);
 
-#ifdef CONFIG_HOTPLUG_CPU
+	/*
+	 * If lazy == true, we delay allocation of percpu counter.
+	 * This is useful when the caller want to avoid GFP_KERNEL
+	 * at creation.
+	 */
 	INIT_LIST_HEAD(&fbc->list);
-	mutex_lock(&percpu_counters_lock);
-	list_add(&fbc->list, &percpu_counters);
-	mutex_unlock(&percpu_counters_lock);
+	if (lazy) {
+		mutex_lock(&percpu_counters_lock);
+		list_add(&fbc->list, &lazy_percpu_counters);
+		mutex_unlock(&percpu_counters_lock);
+		schedule_delayed_work(&percpu_counter_lazy_init_work, 0);
+	} else {
+#ifdef CONFIG_HOTPLUG_CPU
+		mutex_lock(&percpu_counters_lock);
+		list_add(&fbc->list, &percpu_counters);
+		mutex_unlock(&percpu_counters_lock);
 #endif
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(__percpu_counter_init);
 
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
-	if (!fbc->counters)
-		return;
-
 	debug_percpu_counter_deactivate(fbc);
 
-#ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
 	list_del(&fbc->list);
 	mutex_unlock(&percpu_counters_lock);
-#endif
-	free_percpu(fbc->counters);
+	if (fbc->counters)
+		free_percpu(fbc->counters);
 	fbc->counters = NULL;
 }
 EXPORT_SYMBOL(percpu_counter_destroy);
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-26  6:05             ` KAMEZAWA Hiroyuki
@ 2011-12-27 17:52               ` Tejun Heo
  2011-12-28  0:14                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 17:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel

Hello, KAMEZAWA.

On Mon, Dec 26, 2011 at 03:05:31PM +0900, KAMEZAWA Hiroyuki wrote:
> From 433b56fd5644d4b1e695bc16bbf8dd78842999fd Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 26 Dec 2011 15:06:21 +0900
> Subject: [PATCH] percpu_counter: add lazy init
> 
> percpu_counter calls alloc_percpu(). This means percpu_counter_init()
> assumes GFP_KERNEL context. This may call vmalloc() in percpu_counter
> allocation and will have locks.
> 
> If a caller doesn't want to assume GFP_KERNEL, we need some tricks.
> This patch adds percpu_counter_init_lazy().
> 
> At lazy allocation, the function leaves fbc->counters as NULL and
> init fbc->counters by workqueue. This work item handling is done
> by fbc->list, so, the struct size of 'fbc' will not increase if
> a user configs CONFIG_HOTPLUG_CPU.

This is essentially a more specialized form of the mempool approach.  It
doesn't seem any simpler to me while being less generic.  I don't see
what the upside would be.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23  1:14                     ` Andrew Morton
  2011-12-23 15:17                       ` Vivek Goyal
@ 2011-12-27 18:34                       ` Tejun Heo
  2011-12-27 21:20                         ` Andrew Morton
  1 sibling, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 18:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel, jaxboe

(cc'ing Jens)

Hello, Andrew.

On Thu, Dec 22, 2011 at 05:14:32PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 15:54:55 -0800 Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > All the code I'm looking at assumes that blkio_group.stats_cpu is
> > > non-zero.  Won't the kernel just go splat if that allocation failed?
> > > 
> > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > options.
> > 
> > I think it's supposed to just skip creating whole blk_group if percpu
> > allocation fails, so ->stats_cpu of existing groups are guaranteed to
> > be !%NULL.
> 
> What is the role of ->elevator_set_req_fn()?  And when is it called?

It allocates elevator-specific data for a request and is called during
request construction, IOW, on the IO issue path.

> It seems that we allocate the blkio_group within the
> elevator_set_req_fn() context?

The allocation may happen through any subsystem which implements blkcg
policy - at the moment, either blk-throttle or cfq-iosched.
elevator_set_req_fn() is for cfq-iosched.  In most cases, the user is
expected to be in the IO submission path as that's the only place we can
identify active blkcg - request_queue pairs and we don't want to
create blk_groups for all possible combinations.

> > Hmmm... IIRC, the stats aren't exported per cgroup-request_queue pair,
> > so reads are issued per cgroup.  We can't tell which request_queues
> > userland is actually interested in.
> 
> Doesn't matter.  The stats are allocated on a per-blkio_group basis. 
> blkio_read_stat_cpu() is passed the blkio_group.  Populate ->stats_cpu
> there.

Hmmm.... so you wanna break up blkg and its per-cpu stats allocation.

One problem I see is that there's no reliable way to tell when stat
gathering for a specific pair has started.  ie. Reading the stats file from
userland shouldn't create blkgs for all matching combinations.  It
should only walk the blkgs which were created by actual IOs.  So, blkg
stat counting would start on the first stats read after IOs happen for
that specific pair and there wouldn't be any way for userland to
discover whether stat gathering is in progress or not.  Ugh... that's
just nasty.  Jens, what do you think?

> > > c) Or change the low-level code to do
> > >    blkio_group.want_stats_cpu=true, then test that at the top level
> > >    after we've determined that blkio_group.stats_cpu is NULL.
> > 
> > Not following.  Where's the "top level"?
> 
> Somewhere appropriate where we can use GFP_KERNEL.  ie: the correct
> context for percpu_alloc().

There's no natural "top level" for the block layer.  It either has to hook
off from userland access as you suggested above or we should
explicitly defer to a work item.  That of course is possible but it is
worse than making use of existing infrastructure to solve the problem.
Note that explicitly deferring to a wq is basically a more specialized
version of using a wq-backed mempool and would take more code.

> Separately...
> 
> Mixing mempools and percpu_alloc() in the proposed fashion seems a
> pretty poor fit.  mempools are for high-frequency low-level allocations
> which have key characteristics: there are typically a finite number of
> elements in flight and we *know* that elements are being freed in a
> timely manner.

mempool is a pool of memory allocations.  Nothing more, nothing less.

> This doesn't fit with percpu_alloc(), which is a very heavyweight
> operation requiring GFP_KERNEL and it doesn't fit with
> blkio_group_stats_cpu because blkio_group_stats_cpu does not have the
> "freed in a timely manner" behaviour.

And, yes, it is a different usage of the same mechanism.  Nothing
changes for the mechanism itself (other than the silly GFP_WAIT
behavior which is buggy anyway).

> To resolve these things you've added the workqueue to keep the pool
> populated, which turns percpu_mempool into a quite different concept
> which happens to borrow some mempool code (not necessarily a bad thing).
> This will result in some memory wastage, keeping that pool full.

The wq thing can be moved into the block layer if that's the bothersome
part, but given that this mode of usage would be the prevalent one for
percpu mempool, I think it would be better to put it there for obvious
reasons.

It's the same pool of allocations.  If you don't have inter-item
dependencies and allocated items are guaranteed to be returned in a finite
amount of time, it guarantees allocation attempts will succeed in a
finite amount of time.  If you put items in from one context and take them
out in a different one, it serves as an allocation buffer.

> More significantly, it's pretty unreliable: if the allocations outpace
> the kernel thread's ability to refill the pool, all we can do is to
> wait for the kernel thread to do some work.  But we're holding
> low-level locks while doing that wait, which will block the kernel
> thread.  Deadlock.

Ummm... if you try to allocate with GFP_KERNEL from the IO path, deadlock
of course is possible.  That's the *whole* reason why allocation
buffering is used there.  It's filled from GFP_KERNEL context and
consumed from GFP_NOIO, and as pointed out multiple times the allocation
there is infrequent and can be opportunistic.  Sans the use of small
buffering items, this isn't any different from deferring allocation to a
different context.  There's no guarantee when that allocation would
happen but in practice both will be reliable enough for the given use
case.
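
To make that concrete, the consuming side would look roughly like the
sketch below.  The percpu_mempool_*() names and blkg_stats_pool are
illustrative stand-ins, not necessarily the exact interface in this
patchset:

	/* rough sketch only; helper and pool names are hypothetical */
	static void blkg_try_alloc_stats_cpu(struct blkio_group *blkg)
	{
		if (blkg->stats_cpu)
			return;

		/* IO issue path: GFP_NOWAIT, never recurse into reclaim */
		blkg->stats_cpu = percpu_mempool_alloc(blkg_stats_pool,
						       GFP_NOWAIT);
		if (!blkg->stats_cpu)
			return;	/* skip stats for now, retry on a later IO */

		/* kick async refill so the pool is topped up from GFP_KERNEL */
		percpu_mempool_refill(blkg_stats_pool);
	}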

I don't necessarily insist on using mempool here but all the given
objections seem bogus to me.  The amount of code added is minimal and
straight-forward.  It doesn't change what mempool is or what it does
at all and the usage fits the problem to be solved.  I can't really
understand what the objection is about.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 18:34                       ` Tejun Heo
@ 2011-12-27 21:20                         ` Andrew Morton
  2011-12-27 21:44                           ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-27 21:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel, jaxboe

On Tue, 27 Dec 2011 10:34:02 -0800 Tejun Heo <tj@kernel.org> wrote:

> > Separately...
> > 
> > Mixing mempools and percpu_alloc() in the proposed fashion seems a
> > pretty poor fit.  mempools are for high-frequency low-level allocations
> > which have key characteristics: there are typically a finite number of
> > elements in flight and we *know* that elements are being freed in a
> > timely manner.
> 
> mempool is a pool of memory allocations.  Nothing more, nothing less.

No, there is more.  An important concept in mempool is that it is safe
to indefinitely wait for items because it is guaranteed that the
in-flight ones will be returned in a timely fashion.

> > More significantly, it's pretty unreliable: if the allocations outpace
> > the kernel thread's ability to refill the pool, all we can do is to
> > wait for the kernel thread to do some work.  But we're holding
> > low-level locks while doing that wait, which will block the kernel
> > thread.  Deadlock.
> 
> Ummm... if you try to allocate with GFP_KERNEL from IO path, deadlock
> of course is possible.

It is deadlockable when used with GFP_NOFS, or GFP_NOIO.  Anything with
__GFP_WAIT.

The only "safe" usage of the proposed interface is when the caller
performs the per-cpu allocation with !__GFP_WAIT (ie: GFP_ATOMIC) and
when the caller is prepared to repeatedly poll until the allocation
succeeds.

That's a narrow and rare calling pattern which doesn't merit being
formalised into a core kernel API.  It is one which encourages poor
code in callers, which should instead be reworked to use GFP_KERNEL, in
which case the interface isn't needed.

It's also a dangerous API because people could call it with GFP_NOIO
(for example) and risk deadlocks.  Waiting for a kernel thread to
perform a GFP_KERNEL allocation is equivalent to directly performing
that allocation, with GFP_KERNEL.  The only real effect of handing over
to a kernel thread to do the work is to hide your bug from lockdep.

>  That's the *whole* reason why allocation
> buffering is used there.  It's filled from GFP_KERNEL context and
> consumed from GFP_NOIO, and as pointed out multiple times the allocation
> there is infrequent and can be opportunistic.  Sans the use of small
> buffering items, this isn't any different from deferring allocation to
> different context.  There's no guarantee when that allocation would
> happen but in practice both will be reliable enough for the given use
> case.

"reliable enough" is not the standard to which we aspire.

Look, the core problem here is that this block code is trying to
allocate from an inappropriate context.  Rather than hacking around
adding stuff to make this work most-of-the-time, how about fixing the
block code to perform the allocation from the correct context?

> I don't necessarily insist on using mempool here but all the given
> objections seem bogus to me.  The amount of code added is minimal and
> straight-forward.  It doesn't change what mempool is or what it does
> at all and the usage fits the problem to be solved.  I can't really
> understand what the objection is about.

It's adding complexity to core kernel.  It's adding a weak and
dangerous interface which will encourage poor code in callers.

And why are we doing all of this?  So we can continue to use what is
now poor code at one particular site in the block layer!  If the block
code wants to run alloc_percpu() (which requires GFP_KERNEL context)
then it should be reworked to do so from GFP_KERNEL context.  For that
is the *best* solution, no?


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-23 14:58                           ` Vivek Goyal
@ 2011-12-27 21:25                             ` Andrew Morton
  2011-12-27 22:07                               ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-27 21:25 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, avi, nate, cl, oleg, axboe, linux-kernel

On Fri, 23 Dec 2011 09:58:56 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> > Why do the allocation during I/O?  Can't it be done in the hotplug handler?
> > 
> 
> Even if we can do it in the hotplug handler it will be very wasteful of
> memory. So if there are 100 IO cgroups in the system, upon every block
> device hotplug, we will allocate per cpu memory for all the 100 cgroups,
> irrespective of whether they are doing IO to the device or not.
> 
> Now expand this to a system with 100 cgroups and 100 LUNs. 10000
> allocations for no reason. (Even doing it only for cgroups needing stats
> does not help much). The current scheme allocates memory for the group
> only if a specific cgroup is doing IO to a specific block device.

umm, we've already declared that it is OK to completely waste this
memory for the users (probably a majority) who will not be using
these stats.

Also, has this stuff been tested at that scale?  I fear to think what
10000 allocations will do to fragmentation of the vmalloc() arena.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 21:20                         ` Andrew Morton
@ 2011-12-27 21:44                           ` Tejun Heo
  2011-12-27 21:58                             ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 21:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel, jaxboe

Hello, Andrew.

On Tue, Dec 27, 2011 at 01:20:56PM -0800, Andrew Morton wrote:
> >  That's the *whole* reason why allocation
> > buffering is used there.  It's filled from GFP_KERNEL context and
> > consumed from GFP_NOIO, and as pointed out multiple times the allocation
> > there is infrequent and can be opportunistic.  Sans the use of small
> > buffering items, this isn't any different from deferring allocation to
> > different context.  There's no guarantee when that allocation would
> > happen but in practice both will be reliable enough for the given use
> > case.
> 
> "reliable enough" is not the standard to which we aspire.
> 
> Look, the core problem here is that this block code is trying to
> allocate from an inappropriate context.  Rather than hacking around
> adding stuff to make this work most-of-the-time, how about fixing the
> block code to perform the allocation from the correct context?

Yeah, sure, that would be nice.  I'm just not convinced that would be
possible in a way which would, in the end, be better than buffering
memory allocations in some way.  For better or for worse, the whole blkcg
/ ioc / request_queue association mechanism is based on essentially
opportunistic allocation on the expectation that the number of used
combinations would be low and allocation frequency would be low too.

> > I don't necessarily insist on using mempool here but all the given
> > objections seem bogus to me.  The amount of code added is minimal and
> > straight-forward.  It doesn't change what mempool is or what it does
> > at all and the usage fits the problem to be solved.  I can't really
> > understand what the objection is about.
> 
> It's adding complexity to core kernel.  It's adding a weak and
> dangerous interface which will encourage poor code in callers.
>
> And why are we doing all of this?  So we can continue to use what is
> now poor code at one particular site in the block layer!  If the block
> code wants to run alloc_percpu() (which requires GFP_KERNEL context)
> then it should be reworked to do so from GFP_KERNEL context.  For that
> is the *best* solution, no?

Ummm... I'm not arguing this is the best solution in the world.

I'm not convinced trying to put this into GFP_KERNEL context would
work.  Short of that, the next best thing would be making the percpu
allocator usable from the memory reclaim path, right?  But that would
involve a lot more churn and complexity without much added benefit,
given that this type of use case isn't expected to be common - and
I'm fairly sure it isn't, given the track record of the past few years.

IMHO, core code is there to help its users.  In this case, if we
choose to buffer allocations somehow, we sure can add that to the block
layer, but that would be a pretty silly thing to do.  It's not keeping
anything simple.  It's just hiding things.

Anyways, let's stop this abstract discussion.  I think the crux of
disagreement is about whether this can be moved into GFP_KERNEL
context or not.  Let's continue with Vivek's subthread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 21:44                           ` Tejun Heo
@ 2011-12-27 21:58                             ` Andrew Morton
  2011-12-27 22:22                               ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-27 21:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel, jaxboe

On Tue, 27 Dec 2011 13:44:21 -0800
Tejun Heo <tj@kernel.org> wrote:

> I'm not convinced trying to put this into GFP_KERNEL context would
> work.  Short of that, the next best thing would be making percpu
> allocator useable from memory reclaim path, right?

Well..  All allocations which are weaker than GFP_KERNEL are to be
discouraged.  That being said...

>  But that would
> involve a lot more churn and complexity without much added benefit,
> given that this type of use case isn't expected to be common - and
> I'm fairly sure it isn't, given the track record of the past few years.

I don't think it would be too hard to add an alloc_percpu_gfp().  Add
the gfp_t to a small number of functions (two or three?) then change
pcpu_mem_zalloc() to always use kzalloc() if ((flags & GFP_KERNEL) !=
GFP_KERNEL).  And that's it?
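
(For concreteness, the entry point would presumably look something like
the sketch below; to be clear, no such API exists in mainline today,
this is just the shape of it:

	/* sketch of a gfp-aware percpu allocation entry point */
	void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);

	#define alloc_percpu_gfp(type, gfp)				\
		(typeof(type) __percpu *)__alloc_percpu_gfp(		\
			sizeof(type), __alignof__(type), (gfp))

so the blkcg code could try it with GFP_NOWAIT from the IO path and
simply skip the stats on failure.)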

But the question is: is this a *good* thing to do?  It would be nice if
kernel developers understood that GFP_KERNEL is strongly preferred and
that they should put in effort to use it.  But there's a strong
tendency for people to get themselves into a sticky corner then take
the easy way out, resulting in less robust code.  Maybe calling the
function alloc_percpu_i_really_suck() would convey the hint.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 21:25                             ` Andrew Morton
@ 2011-12-27 22:07                               ` Tejun Heo
  2011-12-27 22:21                                 ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 22:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vivek Goyal, avi, nate, cl, oleg, axboe, linux-kernel

Hello,

On Tue, Dec 27, 2011 at 01:25:01PM -0800, Andrew Morton wrote:
> umm, we've already declared that it is OK to completely waste this
> memory for the users (probably a majority) who will not be using
> these stats.

We're talking about combinatorial combinations where only a small subset
is usually expected to be used and, in addition to the absolute usage,
there's a big advantage in showing behavior which users would expect.
If 1000 cgroups are doing IOs to 1000 devices, it's expected to
consume some amount of resource.

The whole io_context / blk_cgroup - request_queue association
mechanism is based on opportunistic allocation.  It might not be the
prettiest thing in the world but given the circumstances IMHO the
approach fits the constraints defined by the problem.

Given the restricted nature of percpu allocation, it would be nice to
punt it to GFP_KERNEL context *somewhere*, and for the block layer that
somewhere probably can only be userland access.  I just don't see that
fitting better here.  The suggested alternative seems much nastier,
with userland-visible side effects and the possibility of a combinatorial
increase in memory usage for something as benign as a single cat of the
stat files.

Also, such erratic userland-visible behavior is a deviation from the
current one, and at the same time we would be bound to these
idiosyncrasies later when we can improve the implementation.

I can't see how that can be a better tradeoff.  It shifts the problem
to an even more cumbersome corner.

> Also, has this stuff been tested at that scale?  I fear to think what
> 10000 allocations will do to fragmetnation of the vmalloc() arena.

The percpu allocator doesn't use vmalloc directly.  It maps address ranges
(which are at least 32k and usually much larger) from vmalloc space and
allocates from them using a simplistic extent allocator.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 22:07                               ` Tejun Heo
@ 2011-12-27 22:21                                 ` Andrew Morton
  2011-12-27 22:30                                   ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-27 22:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, avi, nate, cl, oleg, axboe, linux-kernel

On Tue, 27 Dec 2011 14:07:53 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Dec 27, 2011 at 01:25:01PM -0800, Andrew Morton wrote:
> > umm, we've already declared that it is OK to completely waste this
> > memory for the users (probably a majority) who will not be using
> > these stats.
> 
> We're talking about combinatorial combinations where only a small subset
> is usually expected to be used and, in addition to the absolute usage,
> there's a big advantage in showing behavior which users would expect.
> If 1000 cgroups are doing IOs to 1000 devices, it's expected to
> consume some amount of resource.

<autorepeat>For those users who don't want the stats, stats shouldn't
consume any resources at all.

And I bet that the majority of the minority who want stats simply want
to know "how much IO is this cgroup doing", and don't need per-cgroup,
per-device accounting.

And it could be that the minority of the minority who want per-device,
per-cgroup stats only want those for a minority of the time.

IOW, what happens if we give 'em atomic_add() and be done with it?
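
(i.e. something as dumb as the sketch below, with no percpu allocation
and no stats->lock at all; the type and function names are made up for
illustration:

	struct blkg_simple_stats {
		atomic64_t	serviced_bytes;
		atomic64_t	serviced_ios;
	};

	static inline void blkg_simple_account(struct blkg_simple_stats *st,
					       unsigned int bytes)
	{
		/* one atomic RMW per IO instead of percpu counters */
		atomic64_add(bytes, &st->serviced_bytes);
		atomic64_inc(&st->serviced_ios);
	}

at the cost of a shared cacheline bouncing on every IO.)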


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 21:58                             ` Andrew Morton
@ 2011-12-27 22:22                               ` Tejun Heo
  0 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 22:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: avi, nate, cl, oleg, axboe, vgoyal, linux-kernel, jaxboe

Hello, Andrew.

On Tue, Dec 27, 2011 at 01:58:36PM -0800, Andrew Morton wrote:
> >  But that would
> > involve a lot more churn and complexity without much added benefit,
> > given that this type of use case isn't expected to be common - and
> > I'm fairly sure it isn't, given the track record of the past few years.
> 
> I don't think it would be too hard to add an alloc_percpu_gfp().  Add
> the gfp_t to a small number of functions (two or three?) then change
> pcpu_mem_zalloc() to always use kzalloc() if ((flags & GFP_KERNEL) !=
> GFP_KERNEL).  And that's it?

Hmmm... What do you mean "always use kzalloc()"?  Percpu memory can't
be served by kzalloc(); the per-CPU areas need to be congruent.

Currently, percpu allocation happens in two stages.  The first stage
is locating congruent address areas in vmalloc space (so that static
percpu offsets among different CPUs can be maintained for dynamic
percpu areas).  The second stage is actually allocating pages and
populating those areas.  Because percpu memory is expensive, the
allocator performs both steps on demand.  When it runs out of address
space, it grabs more, and the addresses are populated only when the memory
areas are actually handed out.

Unfortunately, both steps involve the vmalloc machinery, and the GFP_KERNEL
assumption reaches into arch-specific code for page table allocation.
So, making all those steps reclaim-path friendly would be quite a
churn.

*If* we're gonna solve this at the allocator level, we'll probably end
up with a front buffer in the percpu allocator which caches populated
percpu areas to hand out for reclaim-path allocations.  That's doable
but would involve a lot more complexity and would probably need to
keep more percpu memory reserved to stay generic.

> But the question is: is this a *good* thing to do?  It would be nice if
> kernel developers understood that GFP_KERNEL is strongly preferred and
> that they should put in effort to use it.  But there's a strong
> tendency for people to get themselves into a sticky corner then take
> the easy way out, resulting in less robust code.  Maybe calling the
> function alloc_percpu_i_really_suck() would convey the hint.

I fully agree and have been pushing back pretty hard against people
trying to allocate percpu memory from !GFP_KERNEL context and I want
to keep it that way.  But, to me, this one seems like a valid use
case.  If we want to track dynamic associations in the IO path, short of
allocating for all combinatorial combinations, we can't avoid
performing NOIO allocations.  Of course, the code path should be
careful so that it's not too demanding on the allocator and must be
able to degrade operation gracefully on allocation failures.  It's a
special case but a valid one at that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 22:21                                 ` Andrew Morton
@ 2011-12-27 22:30                                   ` Tejun Heo
  2012-01-16 15:26                                     ` Vivek Goyal
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-27 22:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vivek Goyal, avi, nate, cl, oleg, axboe, linux-kernel

Hello, Andrew.

On Tue, Dec 27, 2011 at 02:21:56PM -0800, Andrew Morton wrote:
> <autorepeat>For those users who don't want the stats, stats shouldn't
> consume any resources at all.

Hmmm.... For common use cases - a few cgroups doing IOs to most likely a
single physical device and maybe a couple of virtual ones - I don't think
this would show up anywhere, either in terms of memory or processing
overhead.  While avoiding it would be nice, I don't think that should
be the focus of optimization or design decisions.

> And I bet that the majority of the minority who want stats simply want
> to know "how much IO is this cgroup doing", and don't need per-cgroup,
> per-device accounting.
> 
> And it could be that the minority of the minority who want per-device,
> per-cgroup stats only want those for a minority of the time.
> 
> IOW, what happens if we give 'em atomic_add() and be done with it?

I really don't know.  That surely is an enticing idea tho.  Jens,
Vivek, can you guys chime in?  Is gutting out (or drastically
simplifying) cgroup-dev stats an option?  Are there users who are
actually interested in this stuff?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 17:52               ` Tejun Heo
@ 2011-12-28  0:14                 ` KAMEZAWA Hiroyuki
  2011-12-28  0:41                   ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-28  0:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel

On Tue, 27 Dec 2011 09:52:49 -0800
Tejun Heo <tj@kernel.org> wrote:

> Hello, KAMEZAWA.
> 
> On Mon, Dec 26, 2011 at 03:05:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > From 433b56fd5644d4b1e695bc16bbf8dd78842999fd Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 26 Dec 2011 15:06:21 +0900
> > Subject: [PATCH] percpu_counter: add lazy init
> > 
> > percpu_counter calls alloc_percpu(). This means percpu_counter_init()
> > assumes GFP_KERNEL context. This may call vmalloc() in percpu_counter
> > allocation and will have locks.
> > 
> > If a caller doesn't want to assume GFP_KERNEL, we need some tricks.
> > This patch adds percpu_counter_init_lazy().
> > 
> > At lazy allocation, the function leaves fbc->counters as NULL and
> > init fbc->counters by workqueue. This work item handling is done
> > by fbc->list, so, the struct size of 'fbc' will not increase if
> > a user configs CONFIG_HOTPLUG_CPU.
> 
> This is essentially more specialized form of the mempool approach.  It
> doesn't seem any simpler to me while being less generic.  I don't see
> what the upside would be.
> 

Hm, but this never causes an -ENOMEM error at all.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-28  0:14                 ` KAMEZAWA Hiroyuki
@ 2011-12-28  0:41                   ` Tejun Heo
  2012-01-05  1:28                     ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-28  0:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel

Hello,

On Wed, Dec 28, 2011 at 09:14:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > This is essentially more specialized form of the mempool approach.  It
> > doesn't seem any simpler to me while being less generic.  I don't see
> > what the upside would be.
> 
> Hm, but this never causes -ENOMEM error, at all.

Ooh, I missed the part where it falls back to the global counter if percpu
counters aren't allocated yet.  Yeah, this is an interesting approach.
I'll think more about it.
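
(If I read it right, the add path does something roughly like the helper
below; this is my paraphrase of the idea, not the actual patch:

	/* returns true if the add was absorbed by the global counter */
	static bool percpu_counter_add_lazy_fallback(struct percpu_counter *fbc,
						     s64 amount)
	{
		if (ACCESS_ONCE(fbc->counters))
			return false;	/* percpu area exists, use fast path */

		/* lazy init hasn't run yet: account in the global count */
		raw_spin_lock(&fbc->lock);
		fbc->count += amount;
		raw_spin_unlock(&fbc->lock);
		return true;
	}

so nothing is lost even before the workqueue gets around to allocating
the percpu area.)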

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-28  0:41                   ` Tejun Heo
@ 2012-01-05  1:28                     ` Tejun Heo
  2012-01-16 15:28                       ` Vivek Goyal
  2012-02-09 23:58                       ` Tejun Heo
  0 siblings, 2 replies; 51+ messages in thread
From: Tejun Heo @ 2012-01-05  1:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel

On Tue, Dec 27, 2011 at 04:41:02PM -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 28, 2011 at 09:14:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > > This is essentially more specialized form of the mempool approach.  It
> > > doesn't seem any simpler to me while being less generic.  I don't see
> > > what the upside would be.
> > 
> > Hm, but this never causes -ENOMEM error, at all.
> 
> Ooh, I missed the part it falls back to the global counter if percpu
> counters aren't allocated yet.  Yeah, this is an interesting approach.
> I'll think more about it.

I've been staring at the blkcg stats code and commit logs and am
wondering whether we can just scrap the percpu counters there.  It seems
the reason they were introduced in the first place is to avoid
stats->lock overhead in fast paths (stats->lock is, BTW, extremely
heavy-handed for gathering stats), and I think there can be easier ways
to avoid stats->lock.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2011-12-27 22:30                                   ` Tejun Heo
@ 2012-01-16 15:26                                     ` Vivek Goyal
  0 siblings, 0 replies; 51+ messages in thread
From: Vivek Goyal @ 2012-01-16 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel, Divyesh Shah

On Tue, Dec 27, 2011 at 02:30:12PM -0800, Tejun Heo wrote:
> Hello, Andrew.
> 
> On Tue, Dec 27, 2011 at 02:21:56PM -0800, Andrew Morton wrote:
> > <autorepeat>For those users who don't want the stats, stats shouldn't
> > consume any resources at all.
> 
> Hmmm.... For common use cases - a few cgroups doing IOs to most likely
> single physical device and maybe a couple virtual ones, I don't think
> this would show up anywhere both in terms of memory and process
> overhead.  While avoiding it would be nice, I don't think that should
> be the focus of optimization or design decisions.
> 
> > And I bet that the majority of the minority who want stats simply want
> > to know "how much IO is this cgroup doing", and don't need per-cgroup,
> > per-device accounting.
> > 
> > And it could be that the minority of the minority who want per-device,
> > per-cgroup stats only want those for a minority of the time.
> > 
> > IOW, what happens if we give 'em atomic_add() and be done with it?
> 
> I really don't know.  That surely is an enticing idea tho.  Jens,
> Vivek, can you guys chime in?  Is gutting out (or drastically
> simplifying) cgroup-dev stats an option?  Are there users who are
> actually interested in this stuff?

Ok, I am back after a break of 3 weeks. So time to restart the discussion.

So we seem to be talking about two things.

- Use atomic_add() for stats.
- Do not keep stats per cgroup, per device; instead just keep a global
  per-cgroup stat.

For the first point, is an atomic operation really that much cheaper than
taking a spin lock?  The whole point of introducing the per cpu data
structure was to make the fast path lockless.  My understanding is that an
atomic operation in the IO submission path is expensive, so to me it does
not really solve the overhead problem.

Initially the google folks (Divyesh Shah) introduced additional files to
display additional stats which are per cgroup, per device. I am assuming
they are making use of them. To me, knowing how IO from a cgroup is
distributed to different devices is a good thing to know.

Keeping the stats per device also helps in that aggregation of stats happens
from process context and we reduce contention on stat updates from
various devices. So to me it is a good thing to keep stats per device and
then display them however users find them useful (either per cgroup or per
cgroup per device).

So to me none of the above options really solves the issue of
reducing the cost/overhead of atomic operations in the IO submission path.
Please correct me if I missed something here.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2012-01-05  1:28                     ` Tejun Heo
@ 2012-01-16 15:28                       ` Vivek Goyal
  2012-02-09 23:58                       ` Tejun Heo
  1 sibling, 0 replies; 51+ messages in thread
From: Vivek Goyal @ 2012-01-16 15:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, avi, nate, cl, oleg, axboe,
	linux-kernel

On Wed, Jan 04, 2012 at 05:28:42PM -0800, Tejun Heo wrote:
> On Tue, Dec 27, 2011 at 04:41:02PM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Dec 28, 2011 at 09:14:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > This is essentially more specialized form of the mempool approach.  It
> > > > doesn't seem any simpler to me while being less generic.  I don't see
> > > > what the upside would be.
> > > 
> > > Hm, but this never causes -ENOMEM error, at all.
> > 
> > Ooh, I missed the part it falls back to the global counter if percpu
> > counters aren't allocated yet.  Yeah, this is an interesting approach.
> > I'll think more about it.
> 
> I've been staring at the blkcg stats code and commit logs and am
> wondering whether we can just scrap percpu counters there.  It seems
> the reason why it was introduced in the first place is to avoid
> stats->lock, which BTW is extremely heavy handed for gathering stats,
> overhead in fast paths and I think there can be easier ways to avoid
> stats->lock.

Tejun, 

If you get rid of stats->lock and also don't use per cpu data structures,
how would we avoid races in case of concurrent updates (IO submissions)?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2012-01-05  1:28                     ` Tejun Heo
  2012-01-16 15:28                       ` Vivek Goyal
@ 2012-02-09 23:58                       ` Tejun Heo
  2012-02-10 16:26                         ` Vivek Goyal
  1 sibling, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2012-02-09 23:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Vivek Goyal, Andrew Morton, avi, nate, cl, oleg, axboe, linux-kernel

Hello, guys.

On Wed, Jan 04, 2012 at 05:28:42PM -0800, Tejun Heo wrote:
> On Tue, Dec 27, 2011 at 04:41:02PM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Dec 28, 2011 at 09:14:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > This is essentially more specialized form of the mempool approach.  It
> > > > doesn't seem any simpler to me while being less generic.  I don't see
> > > > what the upside would be.
> > > 
> > > Hm, but this never causes -ENOMEM error, at all.
> > 
> > Ooh, I missed the part it falls back to the global counter if percpu
> > counters aren't allocated yet.  Yeah, this is an interesting approach.
> > I'll think more about it.
> 
> I've been staring at the blkcg stats code and commit logs and am
> wondering whether we can just scrap percpu counters there.  It seems
> the reason why it was introduced in the first place is to avoid
> stats->lock, which BTW is extremely heavy handed for gathering stats,
> overhead in fast paths and I think there can be easier ways to avoid
> stats->lock.

I could remove stats_lock without much trouble but couldn't get the
blk-throtl fast path stats working. :(

I think I'll try to get KAMEZAWA's percpu counter working.  Any
objections?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2012-02-09 23:58                       ` Tejun Heo
@ 2012-02-10 16:26                         ` Vivek Goyal
  2012-02-13 22:31                           ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Vivek Goyal @ 2012-02-10 16:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, avi, nate, cl, oleg, axboe,
	linux-kernel

On Thu, Feb 09, 2012 at 03:58:45PM -0800, Tejun Heo wrote:
> Hello, guys.
> 
> On Wed, Jan 04, 2012 at 05:28:42PM -0800, Tejun Heo wrote:
> > On Tue, Dec 27, 2011 at 04:41:02PM -0800, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Dec 28, 2011 at 09:14:02AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > This is essentially more specialized form of the mempool approach.  It
> > > > > doesn't seem any simpler to me while being less generic.  I don't see
> > > > > what the upside would be.
> > > > 
> > > > Hm, but this never causes -ENOMEM error, at all.
> > > 
> > > Ooh, I missed the part it falls back to the global counter if percpu
> > > counters aren't allocated yet.  Yeah, this is an interesting approach.
> > > I'll think more about it.
> > 
> > I've been staring at the blkcg stats code and commit logs and am
> > wondering whether we can just scrap percpu counters there.  It seems
> > the reason why it was introduced in the first place is to avoid
> > stats->lock, which BTW is extremely heavy handed for gathering stats,
> > overhead in fast paths and I think there can be easier ways to avoid
> > stats->lock.
> 
> I could remove stats_lock without much trouble but couldn't get
> blk-throtl fast path stat working. :(
> 
> I think I'll try to get KAMEZAWA's percpu counter working.  Any
> objections?

So IIUC, that patch essentially does the same thing my patch was doing
for blk-throttle.c and cfq-iosched.c: do the allocation of the counters
from a worker thread, and until the allocation has happened, do not
record the stats.
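
Roughly, the pattern both approaches share is the following; the work
item and helper names below are made up for illustration:

	/* worker: runs in process context where GFP_KERNEL is fine */
	static void blkg_stats_alloc_fn(struct work_struct *work)
	{
		struct blkio_group *blkg = container_of(work,
				struct blkio_group, stats_alloc_work);

		blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
	}

	/* IO path: never allocate here, just poke the worker and move on */
	static void blkg_account_io(struct blkio_group *blkg, u64 bytes)
	{
		if (unlikely(!blkg->stats_cpu)) {
			schedule_work(&blkg->stats_alloc_work);
			return;		/* stats silently skipped until then */
		}
		this_cpu_add(blkg->stats_cpu->sectors, bytes >> 9);
	}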

The only difference is that by putting this logic in per cpu counters,
we make it somewhat generic so that other users who can't do GFP_KERNEL
allocation of per cpu data can use it. I can live with that.

Ideally I would have thought that if passing a gfp flag to alloc_percpu()
is needed, then we should fix that instead of trying to create generic
infrastructure around it. For code which is currently broken, like
blkcg, we can fix it by doing the allocation from worker context so that
the workaround is limited to a specific piece of code and is not generic
enough that other subsystems latch on to it.

But if you don't think that fixing alloc_percpu() is possible in the long
term and users should use per cpu counters for any kind of non-GFP_KERNEL
needs, then it probably is fine to continue developing this patch.

Personally, I liked my old patch restricting the worker thread allocation
logic to blk-throttle.c and cfq-iosched.c. If you don't have an objection
to that approach, I can brush it up, fix a pending issue and post it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2012-02-10 16:26                         ` Vivek Goyal
@ 2012-02-13 22:31                           ` Tejun Heo
  2012-02-15 15:43                             ` Vivek Goyal
  0 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2012-02-13 22:31 UTC (permalink / raw)
  To: Vivek Goyal, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, avi, nate, cl, oleg, axboe, linux-kernel

Hello, Vivek.

On Fri, Feb 10, 2012 at 11:26:58AM -0500, Vivek Goyal wrote:
> The only difference is that by putting this logic in per cpu counters,
> we make it somewhat generic so that other users who can't do GFP_KERNEL
> allocation of per cpu data, can use it. I can live with that.

Also, it has a fallback mechanism for while the percpu data isn't there, so
the counts are guaranteed to be right.  Probably doesn't matter all that
much for blkcg stats.

> But if you don't think that fixing alloc_percpu() is possible in long
> term and users should use per cpu counters for any kind of non GFP_KERNEL
> needs, then it probably is fine to continue developing this patch.

Updating alloc_percpu() to support full @gfp_mask is rather complex
and likely to generate a lot of churn and fragility.

> Personally, I liked my old patch of restricting worker thread allocation
> logic to blk-throttle.c and cfq-iosched.c. If you don't have objection
> to that approach, I can brush it up, fix a pending issue and post it?

I don't know.  If we're gonna do that, I think doing that in mempool
is better.  Andrew, are you still dead against using mempool for
percpu pooling?  That logic has to go somewhere, and it probably is
better to put it somewhere common rather than shoving it into the block
cgroup code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock
  2012-02-13 22:31                           ` Tejun Heo
@ 2012-02-15 15:43                             ` Vivek Goyal
  0 siblings, 0 replies; 51+ messages in thread
From: Vivek Goyal @ 2012-02-15 15:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, avi, nate, cl, oleg, axboe,
	linux-kernel

On Mon, Feb 13, 2012 at 02:31:56PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Fri, Feb 10, 2012 at 11:26:58AM -0500, Vivek Goyal wrote:
> > The only difference is that by putting this logic in per cpu counters,
> > we make it somewhat generic so that other users who can't do GFP_KERNEL
> > allocation of per cpu data, can use it. I can live with that.
> 
> Also, it has fallback mechanism while percpu data isn't there, so the
> counts are guaranteed to be right.  Probably doesn't matter all that
> much for blkcg stats.

Ok, I had missed that part. Nice.

I have no objections to making this lazy-allocation-in-percpu-counter
idea work and migrating the blkcg per cpu stats to use per cpu counters. If
this patchset is not too intrusive, it might even be backportable to
stable kernels.

Andrew does not seem to like the mempool idea, so instead of not making
any progress, it is better to go with this one. We need to fix this
issue; more people might run into it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2012-02-15 15:43 UTC | newest]

Thread overview: 51+ messages
2011-12-22 21:45 [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Tejun Heo
2011-12-22 21:45 ` [PATCH 1/7] mempool: fix and document synchronization and memory barrier usage Tejun Heo
2011-12-22 21:45 ` [PATCH 2/7] mempool: drop unnecessary and incorrect BUG_ON() from mempool_destroy() Tejun Heo
2011-12-22 21:45 ` [PATCH 3/7] mempool: fix first round failure behavior Tejun Heo
2011-12-22 21:45 ` [PATCH 4/7] mempool: factor out mempool_fill() Tejun Heo
2011-12-22 21:45 ` [PATCH 5/7] mempool: separate out __mempool_create() Tejun Heo
2011-12-22 21:45 ` [PATCH 6/7] mempool, percpu: implement percpu mempool Tejun Heo
2011-12-22 21:45 ` [PATCH 7/7] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
2011-12-23  1:00   ` Vivek Goyal
2011-12-23 22:54     ` Tejun Heo
2011-12-22 21:59 ` [PATCHSET] block, mempool, percpu: implement percpu mempool and fix blkcg percpu alloc deadlock Andrew Morton
2011-12-22 22:09   ` Tejun Heo
2011-12-22 22:20     ` Andrew Morton
2011-12-22 22:41       ` Tejun Heo
2011-12-22 22:54         ` Andrew Morton
2011-12-22 23:00           ` Tejun Heo
2011-12-22 23:16             ` Andrew Morton
2011-12-22 23:24               ` Tejun Heo
2011-12-22 23:41                 ` Andrew Morton
2011-12-22 23:54                   ` Tejun Heo
2011-12-23  1:14                     ` Andrew Morton
2011-12-23 15:17                       ` Vivek Goyal
2011-12-27 18:34                       ` Tejun Heo
2011-12-27 21:20                         ` Andrew Morton
2011-12-27 21:44                           ` Tejun Heo
2011-12-27 21:58                             ` Andrew Morton
2011-12-27 22:22                               ` Tejun Heo
2011-12-23  1:21                   ` Vivek Goyal
2011-12-23  1:38                     ` Andrew Morton
2011-12-23  2:54                       ` Vivek Goyal
2011-12-23  3:11                         ` Andrew Morton
2011-12-23 14:58                           ` Vivek Goyal
2011-12-27 21:25                             ` Andrew Morton
2011-12-27 22:07                               ` Tejun Heo
2011-12-27 22:21                                 ` Andrew Morton
2011-12-27 22:30                                   ` Tejun Heo
2012-01-16 15:26                                     ` Vivek Goyal
2011-12-23  1:40       ` Vivek Goyal
2011-12-23  1:58         ` Andrew Morton
2011-12-23  2:56           ` Vivek Goyal
2011-12-26  6:05             ` KAMEZAWA Hiroyuki
2011-12-27 17:52               ` Tejun Heo
2011-12-28  0:14                 ` KAMEZAWA Hiroyuki
2011-12-28  0:41                   ` Tejun Heo
2012-01-05  1:28                     ` Tejun Heo
2012-01-16 15:28                       ` Vivek Goyal
2012-02-09 23:58                       ` Tejun Heo
2012-02-10 16:26                         ` Vivek Goyal
2012-02-13 22:31                           ` Tejun Heo
2012-02-15 15:43                             ` Vivek Goyal
2011-12-23 14:46           ` Vivek Goyal
