linux-kernel.vger.kernel.org archive mirror
* [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
@ 2012-02-23 22:30 Tejun Heo
  2012-02-23 22:30 ` [PATCH 1/8] mempool: factor out mempool_fill() Tejun Heo
                   ` (8 more replies)
  0 siblings, 9 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni

Hello, guys.

This patchset is a combination of the patchset "block, mempool, percpu:
implement percpu mempool and fix blkcg percpu alloc deadlock" [1] and
patches to remove blkg->stats_lock.

* percpu mempool and percpu stat allocation

Andrew, Hugh, other than the fourth patch, which has been updated for
the current blkcg branch, all the mempool changes are the same as
before.  I tried several different approaches to remove the percpu
stats but failed to do so without introducing something even sillier,
so unless we're gonna break userland-visible stats, we need some form
of NOIO percpu allocation mechanism.

I don't think implementing a private allocation queue in blkcg proper
is a good option.  Doing it from percpu counter has the advantage of
not losing any counts while a percpu allocation is in progress, but it
feels weird, like a spoon which can also be used as a screwdriver.  So
it still seems to me that utilizing the mempool code is the least of
all evils.

The plan is to isolate these percpu stats into blk-throttle and
implement soft failure on stat allocation errors, so simple buffering
with opportunistic refilling should do.
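
As a concrete sketch, the consumer side then looks roughly like the
following (this is essentially what patch 4 in this series does; the
names are the ones used there):

  /* kick an async refill once half of the reserve is consumed */
  if (percpu_mempool_nr_elems(blkg_stats_cpu_pool) <=
      BLKG_STATS_CPU_POOL_REFILL)
          percpu_mempool_refill(blkg_stats_cpu_pool, GFP_NOWAIT);

  /* NOIO allocation, served from the reserved elements */
  pd->stats_cpu = percpu_mempool_alloc(blkg_stats_cpu_pool, GFP_NOWAIT);
  if (!pd->stats_cpu) {
          blkg_free(blkg);
          return NULL;
  }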

If people agree with this, I think it would be best to route these
changes through block/core along with other blkcg updates.  If
somebody still disagrees, scream.

* removal of blkg->stats_lock

After the recent plug merge updates, all non-percpu stats are updated
under queue_lock, so u64_stats_sync can be used instead of a spinlock.
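
Concretely, writers (which already hold queue_lock) wrap their updates
in u64_stats_update_begin/end() and readers retry with
u64_stats_fetch_begin/retry().  A minimal sketch of the pattern patch 8
switches to:

  /* writer, called with queue_lock held */
  u64_stats_update_begin(&stats->syncp);
  stats->time += time;
  u64_stats_update_end(&stats->syncp);

  /* reader */
  do {
          start = u64_stats_fetch_begin(&stats->syncp);
          v = stats->time;
  } while (u64_stats_fetch_retry(&stats->syncp, start));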

 0001-mempool-factor-out-mempool_fill.patch
 0002-mempool-separate-out-__mempool_create.patch
 0003-mempool-percpu-implement-percpu-mempool.patch
 0004-block-fix-deadlock-through-percpu-allocation-in-blk-.patch
 0005-blkcg-don-t-use-percpu-for-merged-stats.patch
 0006-blkcg-simplify-stat-reset.patch
 0007-blkcg-restructure-blkio_get_stat.patch
 0008-blkcg-remove-blkio_group-stats_lock.patch

0001-0003 implement the percpu mempool.

0004 makes blk-cgroup use it, fixing the bug of GFP_KERNEL allocation
from the IO path.

0005-0008 replace blkg->stats_lock with u64_stats_sync.

This patchset is on top of

  block/for-linus 621032ad6e "block: exit_io_context() should call eleva..."
+ [2] blkcg: accumulated blkcg updates

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blkcg-stats

Thanks.

 block/blk-cgroup.c      |  399 ++++++++++++++++++++++--------------------------
 block/blk-cgroup.h      |   29 ++-
 include/linux/mempool.h |   80 +++++++++
 mm/mempool.c            |  208 +++++++++++++++++++++----
 4 files changed, 462 insertions(+), 254 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1232735
[2] http://thread.gmane.org/gmane.linux.kernel/1256355

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 1/8] mempool: factor out mempool_fill()
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 2/8] mempool: separate out __mempool_create() Tejun Heo
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo,
	Oleg Nesterov

Factor out mempool_fill() from mempool_create() and mempool_resize().
When called, it fills the reservoir up to the configured min_nr
elements.

After the change, the gotos in mempool_resize()'s exit paths don't
make a whole lot of sense; unlock and return directly instead.

Note that mempool_create() now repeatedly locks and unlocks pool->lock
while filling the pool initially, and mempool_resize() may go through
one extra locking cycle.  This shouldn't result in a noticeable
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempool.c |   75 ++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index d904981..e0403d0 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -47,6 +47,43 @@ void mempool_destroy(mempool_t *pool)
 EXPORT_SYMBOL(mempool_destroy);
 
 /**
+ * mempool_fill - fill a memory pool
+ * @pool:	memory pool to fill
+ * @gfp_mask:	allocation mask to use
+ *
+ * Allocate new elements with @gfp_mask and fill @pool so that it has
+ * @pool->min_nr elements.  Returns 0 on success, -errno on failure.
+ */
+static int mempool_fill(mempool_t *pool, gfp_t gfp_mask)
+{
+	/*
+	 * If curr_nr == min_nr is visible, we're correct regardless of
+	 * locking.
+	 */
+	while (pool->curr_nr < pool->min_nr) {
+		void *elem;
+		unsigned long flags;
+
+		elem = pool->alloc(gfp_mask, pool->pool_data);
+		if (unlikely(!elem))
+			return -ENOMEM;
+
+		spin_lock_irqsave(&pool->lock, flags);
+		if (pool->curr_nr < pool->min_nr) {
+			add_element(pool, elem);
+			elem = NULL;
+		}
+		spin_unlock_irqrestore(&pool->lock, flags);
+
+		if (elem) {
+			pool->free(elem, pool->pool_data);
+			return 0;
+		}
+	}
+	return 0;
+}
+
+/**
  * mempool_create - create a memory pool
  * @min_nr:    the minimum number of elements guaranteed to be
  *             allocated for this pool.
@@ -90,16 +127,11 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 	/*
 	 * First pre-allocate the guaranteed number of buffers.
 	 */
-	while (pool->curr_nr < pool->min_nr) {
-		void *element;
-
-		element = pool->alloc(GFP_KERNEL, pool->pool_data);
-		if (unlikely(!element)) {
-			mempool_destroy(pool);
-			return NULL;
-		}
-		add_element(pool, element);
+	if (mempool_fill(pool, GFP_KERNEL)) {
+		mempool_destroy(pool);
+		return NULL;
 	}
+
 	return pool;
 }
 EXPORT_SYMBOL(mempool_create_node);
@@ -137,7 +169,8 @@ int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
 			spin_lock_irqsave(&pool->lock, flags);
 		}
 		pool->min_nr = new_min_nr;
-		goto out_unlock;
+		spin_unlock_irqrestore(&pool->lock, flags);
+		return 0;
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
@@ -151,31 +184,17 @@ int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
 		/* Raced, other resize will do our work */
 		spin_unlock_irqrestore(&pool->lock, flags);
 		kfree(new_elements);
-		goto out;
+		return 0;
 	}
 	memcpy(new_elements, pool->elements,
 			pool->curr_nr * sizeof(*new_elements));
 	kfree(pool->elements);
 	pool->elements = new_elements;
 	pool->min_nr = new_min_nr;
-
-	while (pool->curr_nr < pool->min_nr) {
-		spin_unlock_irqrestore(&pool->lock, flags);
-		element = pool->alloc(gfp_mask, pool->pool_data);
-		if (!element)
-			goto out;
-		spin_lock_irqsave(&pool->lock, flags);
-		if (pool->curr_nr < pool->min_nr) {
-			add_element(pool, element);
-		} else {
-			spin_unlock_irqrestore(&pool->lock, flags);
-			pool->free(element, pool->pool_data);	/* Raced */
-			goto out;
-		}
-	}
-out_unlock:
 	spin_unlock_irqrestore(&pool->lock, flags);
-out:
+
+	mempool_fill(pool, gfp_mask);
+
 	return 0;
 }
 EXPORT_SYMBOL(mempool_resize);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 2/8] mempool: separate out __mempool_create()
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
  2012-02-23 22:30 ` [PATCH 1/8] mempool: factor out mempool_fill() Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 3/8] mempool, percpu: implement percpu mempool Tejun Heo
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo,
	Oleg Nesterov

Separate out __mempool_create(), which handles creation of the mempool
without filling it, from mempool_create_node().  This will be used to
implement percpu_mempool.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 mm/mempool.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index e0403d0..1ed8d5e 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -104,11 +104,13 @@ mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 }
 EXPORT_SYMBOL(mempool_create);
 
-mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
-			mempool_free_t *free_fn, void *pool_data, int node_id)
+static mempool_t *__mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
+				   mempool_free_t *free_fn, void *pool_data,
+				   int node_id, size_t mempool_bytes)
 {
 	mempool_t *pool;
-	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL | __GFP_ZERO, node_id);
+
+	pool = kmalloc_node(mempool_bytes, GFP_KERNEL | __GFP_ZERO, node_id);
 	if (!pool)
 		return NULL;
 	pool->elements = kmalloc_node(min_nr * sizeof(void *),
@@ -124,9 +126,19 @@ mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
 	pool->alloc = alloc_fn;
 	pool->free = free_fn;
 
-	/*
-	 * First pre-allocate the guaranteed number of buffers.
-	 */
+	return pool;
+}
+
+mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+			       mempool_free_t *free_fn, void *pool_data,
+			       int node_id)
+{
+	mempool_t *pool;
+
+	pool = __mempool_create(min_nr, alloc_fn, free_fn, pool_data, node_id,
+				sizeof(*pool));
+
+	/* Pre-allocate the guaranteed number of buffers */
 	if (mempool_fill(pool, GFP_KERNEL)) {
 		mempool_destroy(pool);
 		return NULL;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 3/8] mempool, percpu: implement percpu mempool
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
  2012-02-23 22:30 ` [PATCH 1/8] mempool: factor out mempool_fill() Tejun Heo
  2012-02-23 22:30 ` [PATCH 2/8] mempool: separate out __mempool_create() Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 4/8] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo,
	Oleg Nesterov

This patch implements a mempool for percpu memory areas.  Percpu
mempool is mostly identical to regular mempool and shares most of the
code, but has some peculiarities.

The percpu memory allocator requires %GFP_KERNEL during allocation,
which comes from its on-demand nature and vmalloc area usage.  In most
cases, it's not a good idea to allocate percpu memory from a more
constricted context and this doesn't cause a problem; however, there
are rare cases where opportunistic allocation from the NOIO path makes
sense.

To ease such use cases, percpu mempool comes with a refill mechanism
which can behave either synchronously or asynchronously depending on
the specified gfp mask.
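
As a usage sketch (foo_pool and struct foo_stats are made-up names for
illustration; blk-cgroup's actual usage follows in the next patch):

  /* set up from process context, e.g. during init */
  foo_pool = percpu_mempool_create(16, sizeof(struct foo_stats),
                                   __alignof__(struct foo_stats));

  /* GFP_KERNEL refill fills the pool synchronously */
  percpu_mempool_refill(foo_pool, GFP_KERNEL);

  /*
   * A gfp mask without GFP_KERNEL (e.g. GFP_NOWAIT) schedules an async
   * refill and returns -EAGAIN; allocations in the meantime are served
   * from the reserved elements.
   */
  percpu_mempool_refill(foo_pool, GFP_NOWAIT);
  stats = percpu_mempool_alloc(foo_pool, GFP_NOWAIT);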

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 include/linux/mempool.h |   80 ++++++++++++++++++++++++++++++++++
 mm/mempool.c            |  111 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+), 0 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 7c08052..129acbe 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,7 @@
 #define _LINUX_MEMPOOL_H
 
 #include <linux/wait.h>
+#include <linux/workqueue.h>
 
 struct kmem_cache;
 
@@ -70,4 +71,83 @@ static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
 			      (void *)(long)order);
 }
 
+/*
+ * Percpu mempool - mempool backed by percpu memory allocator.
+ *
+ * Along with the usual mempool role, because percpu allocator doesn't
+ * support NOIO allocations, percpu mempool is useful as allocation buffer
+ * which is filled from IO context and consumed from atomic or non-IO one.
+ * To help this usage, percpu_mempool has built-in mechanism to refill the
+ * pool which supports both sync and async operations.  Refer to
+ * percpu_mempool_refill() for details.
+ */
+struct percpu_mempool {
+	mempool_t		pool;
+	size_t			size;		/* size of elements */
+	size_t			align;		/* align of elements */
+	struct work_struct	refill_work;	/* work item for async refill */
+};
+
+struct percpu_mempool *percpu_mempool_create(int min_nr, size_t size,
+					     size_t align);
+int percpu_mempool_refill(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask);
+void percpu_mempool_destroy(struct percpu_mempool *pcpu_pool);
+
+/**
+ * percpu_mempool_resize - resize an existing percpu mempool
+ * @pcpu_pool:	percpu mempool to resize
+ * @new_min_nr:	new minimum number of elements guaranteed to be allocated
+ * @gfp_mask:	allocation mask to use
+ *
+ * Counterpart of mempool_resize().  If @gfp_mask doesn't contain
+ * %__GFP_IO, resizing itself may succeed but the implied filling (if
+ * necessary) will fail.
+ */
+static inline int percpu_mempool_resize(struct percpu_mempool *pcpu_pool,
+					int new_min_nr, gfp_t gfp_mask)
+{
+	return mempool_resize(&pcpu_pool->pool, new_min_nr, gfp_mask);
+}
+
+/**
+ * percpu_mempool_alloc - allocate an element from a percpu mempool
+ * @pcpu_pool:	percpu mempool to allocate from
+ * @gfp_mask:	allocation mask to use
+ *
+ * Counterpart of mempool_alloc().  If @gfp_mask doesn't contain %__GFP_IO,
+ * allocation is always from the reserved pool.
+ */
+static inline void __percpu *
+percpu_mempool_alloc(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask)
+{
+	void *p = mempool_alloc(&pcpu_pool->pool, gfp_mask);
+
+	return (void __percpu __force *)p;
+}
+
+/**
+ * percpu_mempool_free - free an element to a percpu mempool
+ * @elem:	element being freed
+ * @pcpu_pool:	percpu mempool to free to
+ */
+static inline void percpu_mempool_free(void __percpu *elem,
+				       struct percpu_mempool *pcpu_pool)
+{
+	void *p = (void __kernel __force *)elem;
+
+	mempool_free(p, &pcpu_pool->pool);
+}
+
+/**
+ * percpu_mempool_nr_elems - return nr of reserved elems in a percpu mempool
+ * @pcpu_pool:	percpu mempool of interest
+ *
+ * Returns the number of reserved elements in @pcpu_pool.  Mostly useful
+ * for deciding when to refill.
+ */
+static inline int percpu_mempool_nr_elems(struct percpu_mempool *pcpu_pool)
+{
+	return pcpu_pool->pool.curr_nr;
+}
+
 #endif /* _LINUX_MEMPOOL_H */
diff --git a/mm/mempool.c b/mm/mempool.c
index 1ed8d5e..75e01c4 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include <linux/percpu.h>
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -398,3 +399,113 @@ void mempool_free_pages(void *element, void *pool_data)
 	__free_pages(element, order);
 }
 EXPORT_SYMBOL(mempool_free_pages);
+
+/*
+ * Mempool for percpu memory.
+ */
+static void *percpu_mempool_alloc_fn(gfp_t gfp_mask, void *data)
+{
+	struct percpu_mempool *pcpu_pool = data;
+	void __percpu *p;
+
+	/*
+	 * Percpu allocator doesn't do NOIO.  This makes percpu mempool
+	 * always try reserved elements first, which isn't such a bad idea
+	 * given that percpu allocator is pretty heavy and percpu areas are
+	 * expensive.
+	 */
+	if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+		return NULL;
+
+	p = __alloc_percpu(pcpu_pool->size, pcpu_pool->align);
+	return (void __kernel __force *)p;
+}
+
+static void percpu_mempool_free_fn(void *elem, void *data)
+{
+	void __percpu *p = (void __percpu __force *)elem;
+
+	free_percpu(p);
+}
+
+static void percpu_mempool_refill_workfn(struct work_struct *work)
+{
+	struct percpu_mempool *pcpu_pool =
+		container_of(work, struct percpu_mempool, refill_work);
+
+	percpu_mempool_refill(pcpu_pool, GFP_KERNEL);
+}
+
+/**
+ * percpu_mempool_create - create mempool for percpu memory
+ * @min_nr:	the minimum number of elements guaranteed to be
+ *		allocated for this pool.
+ * @size:	size of percpu memory areas in this pool
+ * @align:	alignment of percpu memory areas in this pool
+ *
+ * This is counterpart of mempool_create() for percpu memory areas.
+ * Allocations from the pool will return @size bytes percpu memory areas
+ * aligned at @align bytes.
+ */
+struct percpu_mempool *percpu_mempool_create(int min_nr, size_t size,
+					     size_t align)
+{
+	struct percpu_mempool *pcpu_pool;
+	mempool_t *pool;
+
+	BUILD_BUG_ON(offsetof(struct percpu_mempool, pool));
+
+	pool = __mempool_create(min_nr, percpu_mempool_alloc_fn,
+				percpu_mempool_free_fn, NULL, NUMA_NO_NODE,
+				sizeof(*pcpu_pool));
+	if (!pool)
+		return NULL;
+
+	/* fill in pcpu_pool part and set pool_data to self */
+	pcpu_pool = container_of(pool, struct percpu_mempool, pool);
+	pcpu_pool->size = size;
+	pcpu_pool->align = align;
+	INIT_WORK(&pcpu_pool->refill_work, percpu_mempool_refill_workfn);
+	pcpu_pool->pool.pool_data = pcpu_pool;
+
+	/* Pre-allocate the guaranteed number of buffers */
+	if (mempool_fill(&pcpu_pool->pool, GFP_KERNEL)) {
+		mempool_destroy(&pcpu_pool->pool);
+		return NULL;
+	}
+
+	return pcpu_pool;
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_create);
+
+/**
+ * percpu_mempool_refill - refill a percpu mempool
+ * @pcpu_pool:	percpu mempool to refill
+ * @gfp_mask:	allocation mask to use
+ *
+ * Refill @pcpu_pool upto the configured min_nr using @gfp_mask.
+ *
+ * Percpu memory allocation depends on %GFP_KERNEL.  If @gfp_mask doesn't
+ * contain it, this function will schedule a work item to refill the pool
+ * and return -%EAGAIN indicating refilling is in progress.
+ */
+int percpu_mempool_refill(struct percpu_mempool *pcpu_pool, gfp_t gfp_mask)
+{
+	if ((gfp_mask & GFP_KERNEL) == GFP_KERNEL)
+		return mempool_fill(&pcpu_pool->pool, gfp_mask);
+
+	schedule_work(&pcpu_pool->refill_work);
+	return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_refill);
+
+/**
+ * percpu_mempool_destroy - destroy a percpu mempool
+ * @pcpu_pool:	percpu mempool to destroy
+ */
+void percpu_mempool_destroy(struct percpu_mempool *pcpu_pool)
+{
+	cancel_work_sync(&pcpu_pool->refill_work);
+	mempool_destroy(&pcpu_pool->pool);
+}
+EXPORT_SYMBOL_GPL(percpu_mempool_destroy);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 4/8] block: fix deadlock through percpu allocation in blk-cgroup
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (2 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 3/8] mempool, percpu: implement percpu mempool Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 5/8] blkcg: don't use percpu for merged stats Tejun Heo
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo,
	Oleg Nesterov

The percpu allocator depends on %GFP_KERNEL allocation and can't be
used for NOIO or NOFS allocations; unfortunately, blk-cgroup allocates
the percpu stats structure in the IO path, which triggers the
following lockdep warning and can also lead to actual deadlocks under
certain conditions.

 inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
 qemu-kvm/8774 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff8110c8be>] pcpu_alloc+0x6f/0x835
 {RECLAIM_FS-ON-W} state was registered at:
   [<ffffffff8108f85a>] mark_held_locks+0x6d/0x95
   [<ffffffff8108fe69>] lockdep_trace_alloc+0x9f/0xc2
   [<ffffffff8112d781>] slab_pre_alloc_hook+0x1e/0x54
   [<ffffffff8112fcf9>] __kmalloc+0x64/0x10c
   [<ffffffff8110bec8>] pcpu_mem_alloc+0x5e/0x67
   [<ffffffff8110c169>] pcpu_extend_area_map+0x2b/0xd4
   [<ffffffff8110ca0d>] pcpu_alloc+0x1be/0x835
   [<ffffffff8110d094>] __alloc_percpu+0x10/0x12
   [<ffffffff8113167a>] kmem_cache_open+0x2cc/0x2d6
   [<ffffffff8113185d>] kmem_cache_create+0x1d9/0x281
   [<ffffffff812971a5>] acpi_os_create_cache+0x1d/0x2d
   [<ffffffff812bf742>] acpi_ut_create_caches+0x26/0xb0
   [<ffffffff812c2106>] acpi_ut_init_globals+0xe/0x244
   [<ffffffff81d82ca9>] acpi_initialize_subsystem+0x35/0xae
   [<ffffffff81d819c2>] acpi_early_init+0x5c/0xf7
   [<ffffffff81d51be9>] start_kernel+0x3ce/0x3ea
   [<ffffffff81d512c4>] x86_64_start_reservations+0xaf/0xb3
   [<ffffffff81d51412>] x86_64_start_kernel+0x14a/0x159
 irq event stamp: 48683649
 hardirqs last  enabled at (48683649): [<ffffffff814fd547>] __slab_alloc+0x413/0x434
 hardirqs last disabled at (48683648): [<ffffffff814fd17c>] __slab_alloc+0x48/0x434
 softirqs last  enabled at (48683630): [<ffffffff810630bf>] __do_softirq+0x200/0x25a
 softirqs last disabled at (48683625): [<ffffffff8150debc>] call_softirq+0x1c/0x30

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(pcpu_alloc_mutex);
   <Interrupt>
     lock(pcpu_alloc_mutex);

  *** DEADLOCK ***

 3 locks held by qemu-kvm/8774:
  #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffa005d979>] vcpu_load+0x1d/0x90 [kvm]
  #1:  (&kvm->srcu){.+.+.+}, at: [<ffffffffa00663f7>] srcu_read_lock+0x0/0x49 [kvm]
  #2:  (&mm->mmap_sem){++++++}, at: [<ffffffffa005ee6f>] hva_to_pfn+0xbb/0x2d1 [kvm]

 stack backtrace:
 Pid: 8774, comm: qemu-kvm Not tainted 3.1.4 #2
 Call Trace:
  [<ffffffff814fa906>] print_usage_bug+0x1e7/0x1f8
  [<ffffffff8108e232>] mark_lock+0x106/0x220
  [<ffffffff8108e6ee>] __lock_acquire+0x3a2/0xd0c
  [<ffffffff8108f55b>] lock_acquire+0xf3/0x13e
  [<ffffffff815030ad>] __mutex_lock_common+0x5d/0x39a
  [<ffffffff815034f9>] mutex_lock_nested+0x40/0x45
  [<ffffffff8110c8be>] pcpu_alloc+0x6f/0x835
  [<ffffffff8110d094>] __alloc_percpu+0x10/0x12
  [<ffffffff8123d64c>] blkio_alloc_blkg_stats+0x1d/0x31
  [<ffffffff8123ed87>] throtl_alloc_tg+0x3a/0xdf
  [<ffffffff8123f79f>] blk_throtl_bio+0x154/0x39c
  [<ffffffff81231c52>] generic_make_request+0x2e4/0x415
  [<ffffffff81231e61>] submit_bio+0xde/0xfd
  [<ffffffff8111e15d>] swap_writepage+0x94/0x9f
  [<ffffffff811041d1>] shmem_writepage+0x192/0x1d8
  [<ffffffff8110167b>] shrink_page_list+0x402/0x79a
  [<ffffffff81101e1a>] shrink_inactive_list+0x22c/0x3df
  [<ffffffff811026b2>] shrink_zone+0x43f/0x582
  [<ffffffff81102b5e>] do_try_to_free_pages+0x107/0x318
  [<ffffffff811030a6>] try_to_free_pages+0xd5/0x175
  [<ffffffff810f9021>] __alloc_pages_nodemask+0x535/0x7eb
  [<ffffffff81127f02>] alloc_pages_vma+0xf5/0xfa
  [<ffffffff81136bfb>] do_huge_pmd_anonymous_page+0xb3/0x274
  [<ffffffff811112f2>] handle_mm_fault+0xfd/0x1b8
  [<ffffffff8111173c>] __get_user_pages+0x2fa/0x465
  [<ffffffffa005edb2>] get_user_page_nowait+0x37/0x39 [kvm]
  [<ffffffffa005ee8a>] hva_to_pfn+0xd6/0x2d1 [kvm]
  [<ffffffffa005f108>] __gfn_to_pfn+0x83/0x8c [kvm]
  [<ffffffffa005f1c9>] gfn_to_pfn_async+0x1a/0x1c [kvm]
  [<ffffffffa007799e>] try_async_pf+0x3f/0x248 [kvm]
  [<ffffffffa0079140>] tdp_page_fault+0xe8/0x1a5 [kvm]
  [<ffffffffa0077c83>] kvm_mmu_page_fault+0x2b/0x83 [kvm]
  [<ffffffffa00cda71>] handle_ept_violation+0xe1/0xea [kvm_intel]
  [<ffffffffa00d2129>] vmx_handle_exit+0x5ee/0x638 [kvm_intel]
  [<ffffffffa007153f>] kvm_arch_vcpu_ioctl_run+0xb30/0xe00 [kvm]
  [<ffffffffa005db4f>] kvm_vcpu_ioctl+0x120/0x59c [kvm]
  [<ffffffff811502a5>] do_vfs_ioctl+0x472/0x4b3
  [<ffffffff8115033c>] sys_ioctl+0x56/0x7a
  [<ffffffff8150bbc2>] system_call_fastpath+0x16/0x1b

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.comp.emulators.kvm.devel/82755

This patch fixes the bug by using percpu_mempool to allocate blkg
percpu stats.  As the allocations happen only once per cgroup -
request_queue pair, they aren't expected to be frequent.  Use a
smallish pool of 16 elements which is refilled once half of it is
consumed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Avi Kivity <avi@redhat.com>
Reported-by: Nate Custer <nate@cpanel.net>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   58 +++++++++++++++++++++++++++++++++------------------
 1 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4624558..c7e0de5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -17,6 +17,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
+#include <linux/mempool.h>
 #include <linux/genhd.h>
 #include <linux/delay.h>
 #include "blk-cgroup.h"
@@ -24,6 +25,13 @@
 
 #define MAX_KEY_LEN 100
 
+/*
+ * blkg_stats_cpu_pool parameters.  These allocations aren't frequent, can
+ * be opportunistic and percpu memory is expensive.
+ */
+#define BLKG_STATS_CPU_POOL_SIZE	16
+#define BLKG_STATS_CPU_POOL_REFILL	8
+
 static DEFINE_SPINLOCK(blkio_list_lock);
 static LIST_HEAD(blkio_list);
 
@@ -35,6 +43,13 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct blkio_policy_type *blkio_policy[BLKIO_NR_POLICIES];
 
+/*
+ * Percpu mempool for blkgio_group_stats_cpu which are allocated on-demand
+ * on IO path.  As percpu doesn't support NOIO allocations, we need to
+ * buffer them through mempool.
+ */
+static struct percpu_mempool *blkg_stats_cpu_pool;
+
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
 static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
@@ -477,7 +492,7 @@ static void blkg_free(struct blkio_group *blkg)
 		struct blkg_policy_data *pd = blkg->pd[i];
 
 		if (pd) {
-			free_percpu(pd->stats_cpu);
+			percpu_mempool_free(pd->stats_cpu, blkg_stats_cpu_pool);
 			kfree(pd);
 		}
 	}
@@ -491,9 +506,6 @@ static void blkg_free(struct blkio_group *blkg)
  * @q: request_queue the new blkg is associated with
  *
  * Allocate a new blkg assocating @blkcg and @q.
- *
- * FIXME: Should be called with queue locked but currently isn't due to
- *        percpu stat breakage.
  */
 static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 				      struct request_queue *q)
@@ -531,8 +543,14 @@ static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
 
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
+		/* schedule percpu pool refill if necessary */
+		if (percpu_mempool_nr_elems(blkg_stats_cpu_pool) <=
+		    BLKG_STATS_CPU_POOL_REFILL)
+			percpu_mempool_refill(blkg_stats_cpu_pool, GFP_NOWAIT);
+
+		/* allocate memory for per cpu stats */
+		pd->stats_cpu = percpu_mempool_alloc(blkg_stats_cpu_pool,
+						     GFP_NOWAIT);
 		if (!pd->stats_cpu) {
 			blkg_free(blkg);
 			return NULL;
@@ -578,23 +596,9 @@ struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
 	if (!css_tryget(&blkcg->css))
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
-	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
+	/* allocate and initialize */
 	new_blkg = blkg_alloc(blkcg, q);
 
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
 	/* did bypass get turned on inbetween? */
 	if (unlikely(blk_queue_bypass(q)) && !for_root) {
 		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
@@ -1789,3 +1793,15 @@ void blkio_policy_unregister(struct blkio_policy_type *blkiop)
 	blkcg_bypass_end();
 }
 EXPORT_SYMBOL_GPL(blkio_policy_unregister);
+
+static int __init blkg_init(void)
+{
+	blkg_stats_cpu_pool = percpu_mempool_create(BLKG_STATS_CPU_POOL_SIZE,
+				sizeof(struct blkio_group_stats_cpu),
+				__alignof__(struct blkio_group_stats_cpu));
+	if (!blkg_stats_cpu_pool)
+		return -ENOMEM;
+	return 0;
+}
+
+subsys_initcall(blkg_init);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 5/8] blkcg: don't use percpu for merged stats
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (3 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 4/8] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 6/8] blkcg: simplify stat reset Tejun Heo
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo

With the recent plug merge updates, the merged-stats update is no
longer called for plug merges and is now done only while holding
queue_lock.  As stats_lock is scheduled to be removed, there's no
reason to keep using percpu memory for merged stats.  Don't use percpu
for merged stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   22 ++++++----------------
 block/blk-cgroup.h |    6 +++---
 2 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c7e0de5..ced9acd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -455,23 +455,13 @@ void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
 				    bool direction, bool sync)
 {
 	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	struct blkio_group_stats_cpu *stats_cpu;
+	struct blkio_group_stats *stats;
 	unsigned long flags;
 
-	/*
-	 * Disabling interrupts to provide mutual exclusion between two
-	 * writes on same cpu. It probably is not needed for 64bit. Not
-	 * optimizing that case yet.
-	 */
-	local_irq_save(flags);
-
-	stats_cpu = this_cpu_ptr(pd->stats_cpu);
-
-	u64_stats_update_begin(&stats_cpu->syncp);
-	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_MERGED], 1,
-				direction, sync);
-	u64_stats_update_end(&stats_cpu->syncp);
-	local_irq_restore(flags);
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	stats = &pd->stats;
+	blkio_add_stat(stats->stat_arr[BLKIO_STAT_MERGED], 1, direction, sync);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
@@ -1287,7 +1277,7 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 						BLKIO_STAT_WAIT_TIME, 1, 0);
 		case BLKIO_PROP_io_merged:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_CPU_MERGED, 1, 1);
+						BLKIO_STAT_MERGED, 1, 0);
 		case BLKIO_PROP_io_queued:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
 						BLKIO_STAT_QUEUED, 1, 0);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 4bf4c7b..af6e00a 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -29,10 +29,12 @@ enum blkio_policy_id {
 #ifdef CONFIG_BLK_CGROUP
 
 enum stat_type {
+	/* Number of IOs merged */
+	BLKIO_STAT_MERGED,
 	/* Total time spent (in ns) between request dispatch to the driver and
 	 * request completion for IOs doen by this cgroup. This may not be
 	 * accurate when NCQ is turned on. */
-	BLKIO_STAT_SERVICE_TIME = 0,
+	BLKIO_STAT_SERVICE_TIME,
 	/* Total time spent waiting in scheduler queue in ns */
 	BLKIO_STAT_WAIT_TIME,
 	/* Number of IOs queued up */
@@ -57,8 +59,6 @@ enum stat_type_cpu {
 	BLKIO_STAT_CPU_SERVICE_BYTES,
 	/* Total IOs serviced, post merge */
 	BLKIO_STAT_CPU_SERVICED,
-	/* Number of IOs merged */
-	BLKIO_STAT_CPU_MERGED,
 	BLKIO_STAT_CPU_NR
 };
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 6/8] blkcg: simplify stat reset
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (4 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 5/8] blkcg: don't use percpu for merged stats Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 7/8] blkcg: restructure blkio_get_stat() Tejun Heo
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo

blkiocg_reset_stats() implements stat reset for the blkio.reset_stats
cgroupfs file.  This feature is very unconventional and something
which shouldn't have been merged.  It's only useful when there's a
single user or tool looking at the stats.  As soon as multiple users
and/or tools are involved, it becomes useless as resetting disrupts
other usages.  There are very good reasons why all other stats expect
readers to read values at the start and end of a period and subtract
them to determine the delta over the period.

The implementation is rather complex - some fields shouldn't be
cleared, so it saves those fields, resets the whole structure and then
restores them for some reason.  Reset of percpu stats is also racy.
The comment points to 64bit store atomicity as the reason, but even
without that, stores of zero can simply race with other CPUs doing RMW
and get clobbered.

Simplify reset by

* Clearing selectively instead of resetting and restoring.

* Grouping debug stat fields to be reset and using memset() over them.

* Not caring about stats_lock.

* Using memset() to reset percpu stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   80 ++++++++++++++++-----------------------------------
 block/blk-cgroup.h |   14 ++++++++-
 2 files changed, 37 insertions(+), 57 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ced9acd..614b7b3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -747,80 +747,50 @@ EXPORT_SYMBOL_GPL(__blkg_release);
 static void blkio_reset_stats_cpu(struct blkio_group *blkg, int plid)
 {
 	struct blkg_policy_data *pd = blkg->pd[plid];
-	struct blkio_group_stats_cpu *stats_cpu;
-	int i, j, k;
-	/*
-	 * Note: On 64 bit arch this should not be an issue. This has the
-	 * possibility of returning some inconsistent value on 32bit arch
-	 * as 64bit update on 32bit is non atomic. Taking care of this
-	 * corner case makes code very complicated, like sending IPIs to
-	 * cpus, taking care of stats of offline cpus etc.
-	 *
-	 * reset stats is anyway more of a debug feature and this sounds a
-	 * corner case. So I am not complicating the code yet until and
-	 * unless this becomes a real issue.
-	 */
-	for_each_possible_cpu(i) {
-		stats_cpu = per_cpu_ptr(pd->stats_cpu, i);
-		stats_cpu->sectors = 0;
-		for(j = 0; j < BLKIO_STAT_CPU_NR; j++)
-			for (k = 0; k < BLKIO_STAT_TOTAL; k++)
-				stats_cpu->stat_arr_cpu[j][k] = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct blkio_group_stats_cpu *sc =
+			per_cpu_ptr(pd->stats_cpu, cpu);
+
+		sc->sectors = 0;
+		memset(sc->stat_arr_cpu, 0, sizeof(sc->stat_arr_cpu));
 	}
 }
 
 static int
 blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 {
-	struct blkio_cgroup *blkcg;
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
 	struct blkio_group *blkg;
-	struct blkio_group_stats *stats;
 	struct hlist_node *n;
-	uint64_t queued[BLKIO_STAT_TOTAL];
 	int i;
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-	bool idling, waiting, empty;
-	unsigned long long now = sched_clock();
-#endif
 
-	blkcg = cgroup_to_blkio_cgroup(cgroup);
 	spin_lock(&blkio_list_lock);
 	spin_lock_irq(&blkcg->lock);
+
+	/*
+	 * Note that stat reset is racy - it doesn't synchronize against
+	 * stat updates.  This is a debug feature which shouldn't exist
+	 * anyway.  If you get hit by a race, retry.
+	 */
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		struct blkio_policy_type *pol;
 
 		list_for_each_entry(pol, &blkio_list, list) {
 			struct blkg_policy_data *pd = blkg->pd[pol->plid];
-
-			spin_lock(&blkg->stats_lock);
-			stats = &pd->stats;
+			struct blkio_group_stats *stats = &pd->stats;
+
+			/* queued stats shouldn't be cleared */
+			for (i = 0; i < ARRAY_SIZE(stats->stat_arr); i++)
+				if (i != BLKIO_STAT_QUEUED)
+					memset(stats->stat_arr[i], 0,
+					       sizeof(stats->stat_arr[i]));
+			stats->time = 0;
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-			idling = blkio_blkg_idling(stats);
-			waiting = blkio_blkg_waiting(stats);
-			empty = blkio_blkg_empty(stats);
+			memset((void *)stats + BLKG_STATS_DEBUG_CLEAR_START, 0,
+			       BLKG_STATS_DEBUG_CLEAR_SIZE);
 #endif
-			for (i = 0; i < BLKIO_STAT_TOTAL; i++)
-				queued[i] = stats->stat_arr[BLKIO_STAT_QUEUED][i];
-			memset(stats, 0, sizeof(struct blkio_group_stats));
-			for (i = 0; i < BLKIO_STAT_TOTAL; i++)
-				stats->stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-			if (idling) {
-				blkio_mark_blkg_idling(stats);
-				stats->start_idle_time = now;
-			}
-			if (waiting) {
-				blkio_mark_blkg_waiting(stats);
-				stats->start_group_wait_time = now;
-			}
-			if (empty) {
-				blkio_mark_blkg_empty(stats);
-				stats->start_empty_time = now;
-			}
-#endif
-			spin_unlock(&blkg->stats_lock);
-
-			/* Reset Per cpu stats which don't take blkg->stats_lock */
 			blkio_reset_stats_cpu(blkg, pol->plid);
 		}
 	}
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index af6e00a..3937a9e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -131,21 +131,31 @@ struct blkio_group_stats {
 
 	/* Total time spent waiting for it to be assigned a timeslice. */
 	uint64_t group_wait_time;
-	uint64_t start_group_wait_time;
 
 	/* Time spent idling for this blkio_group */
 	uint64_t idle_time;
-	uint64_t start_idle_time;
 	/*
 	 * Total time when we have requests queued and do not contain the
 	 * current active queue.
 	 */
 	uint64_t empty_time;
+
+	/* fields after this shouldn't be cleared on stat reset */
+	uint64_t start_group_wait_time;
+	uint64_t start_idle_time;
 	uint64_t start_empty_time;
 	uint16_t flags;
 #endif
 };
 
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+#define BLKG_STATS_DEBUG_CLEAR_START	\
+	offsetof(struct blkio_group_stats, unaccounted_time)
+#define BLKG_STATS_DEBUG_CLEAR_SIZE	\
+	(offsetof(struct blkio_group_stats, start_group_wait_time) - \
+	 BLKG_STATS_DEBUG_CLEAR_START)
+#endif
+
 /* Per cpu blkio group stats */
 struct blkio_group_stats_cpu {
 	uint64_t sectors;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 7/8] blkcg: restructure blkio_get_stat()
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (5 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 6/8] blkcg: simplify stat reset Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:30 ` [PATCH 8/8] blkcg: remove blkio_group->stats_lock Tejun Heo
  2012-02-23 22:43 ` [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Andrew Morton
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo

Restructure blkio_get_stat() to prepare for removal of stats_lock.

* Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
  subtypes instead of using BLKIO_STAT_QUEUED.

* Separate out stat acquisition and printing.  After this,
  blkio_fill_stat() has only two users, so just open code it.

* The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1.  There's no
  need to subtract one.  Use MAX_KEY_LEN consistently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  100 ++++++++++++++++++++++++++-------------------------
 block/blk-cgroup.h |    6 +++-
 2 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 614b7b3..bc2d1f3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -833,15 +833,6 @@ static void blkio_get_key_name(enum stat_sub_type type, const char *dname,
 	}
 }
 
-static uint64_t blkio_fill_stat(char *str, int chars_left, uint64_t val,
-				struct cgroup_map_cb *cb, const char *dname)
-{
-	blkio_get_key_name(0, dname, str, chars_left, true);
-	cb->fill(cb, str, val);
-	return val;
-}
-
-
 static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg, int plid,
 			enum stat_type_cpu type, enum stat_sub_type sub_type)
 {
@@ -878,8 +869,9 @@ static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg, int plid,
 
 	if (type == BLKIO_STAT_CPU_SECTORS) {
 		val = blkio_read_stat_cpu(blkg, plid, type, 0);
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, val, cb,
-				       dname);
+		blkio_get_key_name(0, dname, key_str, MAX_KEY_LEN, true);
+		cb->fill(cb, key_str, val);
+		return val;
 	}
 
 	for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
@@ -904,50 +896,60 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg, int plid,
 			       struct cgroup_map_cb *cb, const char *dname,
 			       enum stat_type type)
 {
-	struct blkg_policy_data *pd = blkg->pd[plid];
-	uint64_t disk_total;
+	struct blkio_group_stats *stats = &blkg->pd[plid]->stats;
+	uint64_t v = 0, disk_total = 0;
 	char key_str[MAX_KEY_LEN];
-	enum stat_sub_type sub_type;
+	int st;
 
-	if (type == BLKIO_STAT_TIME)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					pd->stats.time, cb, dname);
+	if (type >= BLKIO_STAT_ARR_NR) {
+		switch (type) {
+		case BLKIO_STAT_TIME:
+			v = stats->time;
+			break;
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-	if (type == BLKIO_STAT_UNACCOUNTED_TIME)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       pd->stats.unaccounted_time, cb, dname);
-	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
-		uint64_t sum = pd->stats.avg_queue_size_sum;
-		uint64_t samples = pd->stats.avg_queue_size_samples;
-		if (samples)
-			do_div(sum, samples);
-		else
-			sum = 0;
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       sum, cb, dname);
-	}
-	if (type == BLKIO_STAT_GROUP_WAIT_TIME)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       pd->stats.group_wait_time, cb, dname);
-	if (type == BLKIO_STAT_IDLE_TIME)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       pd->stats.idle_time, cb, dname);
-	if (type == BLKIO_STAT_EMPTY_TIME)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       pd->stats.empty_time, cb, dname);
-	if (type == BLKIO_STAT_DEQUEUE)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-				       pd->stats.dequeue, cb, dname);
+		case BLKIO_STAT_UNACCOUNTED_TIME:
+			v = stats->unaccounted_time;
+			break;
+		case BLKIO_STAT_AVG_QUEUE_SIZE: {
+			uint64_t samples = stats->avg_queue_size_samples;
+
+			if (samples) {
+				v = stats->avg_queue_size_sum;
+				do_div(v, samples);
+			}
+			break;
+		}
+		case BLKIO_STAT_IDLE_TIME:
+			v = stats->idle_time;
+			break;
+		case BLKIO_STAT_EMPTY_TIME:
+			v = stats->empty_time;
+			break;
+		case BLKIO_STAT_DEQUEUE:
+			v = stats->dequeue;
+			break;
+		case BLKIO_STAT_GROUP_WAIT_TIME:
+			v = stats->group_wait_time;
+			break;
 #endif
+		default:
+			WARN_ON_ONCE(1);
+		}
 
-	for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
-			sub_type++) {
-		blkio_get_key_name(sub_type, dname, key_str, MAX_KEY_LEN,
-				   false);
-		cb->fill(cb, key_str, pd->stats.stat_arr[type][sub_type]);
+		blkio_get_key_name(0, dname, key_str, MAX_KEY_LEN, true);
+		cb->fill(cb, key_str, v);
+		return v;
 	}
-	disk_total = pd->stats.stat_arr[type][BLKIO_STAT_READ] +
-			pd->stats.stat_arr[type][BLKIO_STAT_WRITE];
+
+	for (st = BLKIO_STAT_READ; st < BLKIO_STAT_TOTAL; st++) {
+		v = stats->stat_arr[type][st];
+
+		blkio_get_key_name(st, dname, key_str, MAX_KEY_LEN, false);
+		cb->fill(cb, key_str, v);
+		if (st == BLKIO_STAT_READ || st == BLKIO_STAT_WRITE)
+			disk_total += v;
+	}
+
 	blkio_get_key_name(BLKIO_STAT_TOTAL, dname, key_str, MAX_KEY_LEN,
 			   false);
 	cb->fill(cb, key_str, disk_total);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 3937a9e..1425de1 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -39,6 +39,7 @@ enum stat_type {
 	BLKIO_STAT_WAIT_TIME,
 	/* Number of IOs queued up */
 	BLKIO_STAT_QUEUED,
+
 	/* All the single valued stats go below this */
 	BLKIO_STAT_TIME,
 #ifdef CONFIG_DEBUG_BLK_CGROUP
@@ -52,6 +53,9 @@ enum stat_type {
 #endif
 };
 
+/* Types lower than this live in stat_arr and have subtypes */
+#define BLKIO_STAT_ARR_NR	(BLKIO_STAT_QUEUED + 1)
+
 /* Per cpu stats */
 enum stat_type_cpu {
 	BLKIO_STAT_CPU_SECTORS,
@@ -117,7 +121,7 @@ struct blkio_cgroup {
 struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
-	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
+	uint64_t stat_arr[BLKIO_STAT_ARR_NR][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* Time not charged to this cgroup */
 	uint64_t unaccounted_time;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 8/8] blkcg: remove blkio_group->stats_lock
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (6 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 7/8] blkcg: restructure blkio_get_stat() Tejun Heo
@ 2012-02-23 22:30 ` Tejun Heo
  2012-02-23 22:43 ` [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Andrew Morton
  8 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 22:30 UTC (permalink / raw)
  To: axboe, vgoyal, akpm, hughd
  Cc: avi, nate, cl, linux-kernel, dpshah, ctalbott, rni, Tejun Heo

With the recent plug merge updates, all non-percpu stat updates happen
under queue_lock, making stats_lock unnecessary for synchronizing stat
updates.  The only synchronization still necessary is for stat
reading, which can be done using u64_stats_sync instead.

This patch removes blkio_group->stats_lock and adds
blkio_group_stats->syncp for reader synchronization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  209 +++++++++++++++++++++++++--------------------------
 block/blk-cgroup.h |    3 +-
 2 files changed, 103 insertions(+), 109 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bc2d1f3..a0bd275 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -164,7 +164,7 @@ static inline void blkio_update_group_iops(struct blkio_group *blkg,
 
 /*
  * Add to the appropriate stat variable depending on the request type.
- * This should be called with the blkg->stats_lock held.
+ * This should be called with queue_lock held.
  */
 static void blkio_add_stat(uint64_t *stat, uint64_t add, bool direction,
 				bool sync)
@@ -182,7 +182,7 @@ static void blkio_add_stat(uint64_t *stat, uint64_t add, bool direction,
 /*
  * Decrements the appropriate stat variable if non-zero depending on the
  * request type. Panics on value being zero.
- * This should be called with the blkg->stats_lock held.
+ * This should be called with the queue_lock held.
  */
 static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
 {
@@ -203,7 +203,7 @@ static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
 }
 
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-/* This should be called with the blkg->stats_lock held. */
+/* This should be called with the queue_lock held. */
 static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
 					    struct blkio_policy_type *pol,
 					    struct blkio_group *curr_blkg)
@@ -218,7 +218,7 @@ static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
 	blkio_mark_blkg_waiting(&pd->stats);
 }
 
-/* This should be called with the blkg->stats_lock held. */
+/* This should be called with the queue_lock held. */
 static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
 {
 	unsigned long long now;
@@ -232,7 +232,7 @@ static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
 	blkio_clear_blkg_waiting(stats);
 }
 
-/* This should be called with the blkg->stats_lock held. */
+/* This should be called with the queue_lock held. */
 static void blkio_end_empty_time(struct blkio_group_stats *stats)
 {
 	unsigned long long now;
@@ -249,84 +249,74 @@ static void blkio_end_empty_time(struct blkio_group_stats *stats)
 void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg,
 					struct blkio_policy_type *pol)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	BUG_ON(blkio_blkg_idling(&pd->stats));
-	pd->stats.start_idle_time = sched_clock();
-	blkio_mark_blkg_idling(&pd->stats);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	lockdep_assert_held(blkg->q->queue_lock);
+	BUG_ON(blkio_blkg_idling(stats));
+
+	stats->start_idle_time = sched_clock();
+	blkio_mark_blkg_idling(stats);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_set_idle_time_stats);
 
 void blkiocg_update_idle_time_stats(struct blkio_group *blkg,
 				    struct blkio_policy_type *pol)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
-	unsigned long long now;
-	struct blkio_group_stats *stats;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
+
+	lockdep_assert_held(blkg->q->queue_lock);
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &pd->stats;
 	if (blkio_blkg_idling(stats)) {
-		now = sched_clock();
-		if (time_after64(now, stats->start_idle_time))
+		unsigned long long now = sched_clock();
+
+		if (time_after64(now, stats->start_idle_time)) {
+			u64_stats_update_begin(&stats->syncp);
 			stats->idle_time += now - stats->start_idle_time;
+			u64_stats_update_end(&stats->syncp);
+		}
 		blkio_clear_blkg_idling(stats);
 	}
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_idle_time_stats);
 
 void blkiocg_update_avg_queue_size_stats(struct blkio_group *blkg,
 					 struct blkio_policy_type *pol)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
-	struct blkio_group_stats *stats;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &pd->stats;
+	lockdep_assert_held(blkg->q->queue_lock);
+
+	u64_stats_update_begin(&stats->syncp);
 	stats->avg_queue_size_sum +=
 			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
 			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
 	stats->avg_queue_size_samples++;
 	blkio_update_group_wait_time(stats);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	u64_stats_update_end(&stats->syncp);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_avg_queue_size_stats);
 
 void blkiocg_set_start_empty_time(struct blkio_group *blkg,
 				  struct blkio_policy_type *pol)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
-	struct blkio_group_stats *stats;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &pd->stats;
+	lockdep_assert_held(blkg->q->queue_lock);
 
 	if (stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] ||
-			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE]) {
-		spin_unlock_irqrestore(&blkg->stats_lock, flags);
+			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE])
 		return;
-	}
 
 	/*
 	 * group is already marked empty. This can happen if cfqq got new
 	 * request in parent group and moved to this group while being added
 	 * to service tree. Just ignore the event and move on.
 	 */
-	if(blkio_blkg_empty(stats)) {
-		spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	if (blkio_blkg_empty(stats))
 		return;
-	}
 
 	stats->start_empty_time = sched_clock();
 	blkio_mark_blkg_empty(stats);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
 
@@ -336,6 +326,8 @@ void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 {
 	struct blkg_policy_data *pd = blkg->pd[pol->plid];
 
+	lockdep_assert_held(blkg->q->queue_lock);
+
 	pd->stats.dequeue += dequeue;
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_dequeue_stats);
@@ -351,15 +343,16 @@ void blkiocg_update_io_add_stats(struct blkio_group *blkg,
 				 struct blkio_group *curr_blkg, bool direction,
 				 bool sync)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
+
+	lockdep_assert_held(blkg->q->queue_lock);
+
+	u64_stats_update_begin(&stats->syncp);
+	blkio_add_stat(stats->stat_arr[BLKIO_STAT_QUEUED], 1, direction, sync);
+	blkio_end_empty_time(stats);
+	u64_stats_update_end(&stats->syncp);
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	blkio_add_stat(pd->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
-			sync);
-	blkio_end_empty_time(&pd->stats);
 	blkio_set_start_group_wait_time(blkg, pol, curr_blkg);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_add_stats);
 
@@ -367,13 +360,14 @@ void blkiocg_update_io_remove_stats(struct blkio_group *blkg,
 				    struct blkio_policy_type *pol,
 				    bool direction, bool sync)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	blkio_check_and_dec_stat(pd->stats.stat_arr[BLKIO_STAT_QUEUED],
-					direction, sync);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	lockdep_assert_held(blkg->q->queue_lock);
+
+	u64_stats_update_begin(&stats->syncp);
+	blkio_check_and_dec_stat(stats->stat_arr[BLKIO_STAT_QUEUED], direction,
+				 sync);
+	u64_stats_update_end(&stats->syncp);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_remove_stats);
 
@@ -382,15 +376,16 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 				   unsigned long time,
 				   unsigned long unaccounted_time)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
+
+	lockdep_assert_held(blkg->q->queue_lock);
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	pd->stats.time += time;
+	u64_stats_update_begin(&stats->syncp);
+	stats->time += time;
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-	pd->stats.unaccounted_time += unaccounted_time;
+	stats->unaccounted_time += unaccounted_time;
 #endif
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	u64_stats_update_end(&stats->syncp);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
 
@@ -432,20 +427,19 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
 				     uint64_t io_start_time, bool direction,
 				     bool sync)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	struct blkio_group_stats *stats;
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 	unsigned long long now = sched_clock();
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &pd->stats;
+	lockdep_assert_held(blkg->q->queue_lock);
+
+	u64_stats_update_begin(&stats->syncp);
 	if (time_after64(now, io_start_time))
 		blkio_add_stat(stats->stat_arr[BLKIO_STAT_SERVICE_TIME],
 				now - io_start_time, direction, sync);
 	if (time_after64(io_start_time, start_time))
 		blkio_add_stat(stats->stat_arr[BLKIO_STAT_WAIT_TIME],
 				io_start_time - start_time, direction, sync);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	u64_stats_update_end(&stats->syncp);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
 
@@ -454,14 +448,13 @@ void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
 				    struct blkio_policy_type *pol,
 				    bool direction, bool sync)
 {
-	struct blkg_policy_data *pd = blkg->pd[pol->plid];
-	struct blkio_group_stats *stats;
-	unsigned long flags;
+	struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &pd->stats;
+	lockdep_assert_held(blkg->q->queue_lock);
+
+	u64_stats_update_begin(&stats->syncp);
 	blkio_add_stat(stats->stat_arr[BLKIO_STAT_MERGED], 1, direction, sync);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	u64_stats_update_end(&stats->syncp);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
@@ -508,7 +501,6 @@ static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 	if (!blkg)
 		return NULL;
 
-	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
@@ -891,7 +883,6 @@ static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg, int plid,
 	return disk_total;
 }
 
-/* This should be called with blkg->stats_lock held */
 static uint64_t blkio_get_stat(struct blkio_group *blkg, int plid,
 			       struct cgroup_map_cb *cb, const char *dname,
 			       enum stat_type type)
@@ -899,42 +890,46 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg, int plid,
 	struct blkio_group_stats *stats = &blkg->pd[plid]->stats;
 	uint64_t v = 0, disk_total = 0;
 	char key_str[MAX_KEY_LEN];
+	unsigned int sync_start;
 	int st;
 
 	if (type >= BLKIO_STAT_ARR_NR) {
-		switch (type) {
-		case BLKIO_STAT_TIME:
-			v = stats->time;
-			break;
+		do {
+			sync_start = u64_stats_fetch_begin(&stats->syncp);
+			switch (type) {
+			case BLKIO_STAT_TIME:
+				v = stats->time;
+				break;
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-		case BLKIO_STAT_UNACCOUNTED_TIME:
-			v = stats->unaccounted_time;
-			break;
-		case BLKIO_STAT_AVG_QUEUE_SIZE: {
-			uint64_t samples = stats->avg_queue_size_samples;
+			case BLKIO_STAT_UNACCOUNTED_TIME:
+				v = stats->unaccounted_time;
+				break;
+			case BLKIO_STAT_AVG_QUEUE_SIZE: {
+				uint64_t samples = stats->avg_queue_size_samples;
 
-			if (samples) {
-				v = stats->avg_queue_size_sum;
-				do_div(v, samples);
+				if (samples) {
+					v = stats->avg_queue_size_sum;
+					do_div(v, samples);
+				}
+				break;
 			}
-			break;
-		}
-		case BLKIO_STAT_IDLE_TIME:
-			v = stats->idle_time;
-			break;
-		case BLKIO_STAT_EMPTY_TIME:
-			v = stats->empty_time;
-			break;
-		case BLKIO_STAT_DEQUEUE:
-			v = stats->dequeue;
-			break;
-		case BLKIO_STAT_GROUP_WAIT_TIME:
-			v = stats->group_wait_time;
-			break;
+			case BLKIO_STAT_IDLE_TIME:
+				v = stats->idle_time;
+				break;
+			case BLKIO_STAT_EMPTY_TIME:
+				v = stats->empty_time;
+				break;
+			case BLKIO_STAT_DEQUEUE:
+				v = stats->dequeue;
+				break;
+			case BLKIO_STAT_GROUP_WAIT_TIME:
+				v = stats->group_wait_time;
+				break;
 #endif
-		default:
-			WARN_ON_ONCE(1);
-		}
+			default:
+				WARN_ON_ONCE(1);
+			}
+		} while (u64_stats_fetch_retry(&stats->syncp, sync_start));
 
 		blkio_get_key_name(0, dname, key_str, MAX_KEY_LEN, true);
 		cb->fill(cb, key_str, v);
@@ -942,7 +937,10 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg, int plid,
 	}
 
 	for (st = BLKIO_STAT_READ; st < BLKIO_STAT_TOTAL; st++) {
-		v = stats->stat_arr[type][st];
+		do {
+			sync_start = u64_stats_fetch_begin(&stats->syncp);
+			v = stats->stat_arr[type][st];
+		} while (u64_stats_fetch_retry(&stats->syncp, sync_start));
 
 		blkio_get_key_name(st, dname, key_str, MAX_KEY_LEN, false);
 		cb->fill(cb, key_str, v);
@@ -1199,15 +1197,12 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
 		int plid = BLKIOFILE_POLICY(cft->private);
 
-		if (pcpu) {
+		if (pcpu)
 			cgroup_total += blkio_get_stat_cpu(blkg, plid,
 							   cb, dname, type);
-		} else {
-			spin_lock(&blkg->stats_lock);
+		else
 			cgroup_total += blkio_get_stat(blkg, plid,
 						       cb, dname, type);
-			spin_unlock(&blkg->stats_lock);
-		}
 	}
 	if (show_total)
 		cb->fill(cb, "Total", cgroup_total);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 1425de1..7d92d3f 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -119,6 +119,7 @@ struct blkio_cgroup {
 };
 
 struct blkio_group_stats {
+	struct u64_stats_sync syncp;
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
 	uint64_t stat_arr[BLKIO_STAT_ARR_NR][BLKIO_STAT_TOTAL];
@@ -200,8 +201,6 @@ struct blkio_group {
 	/* reference count */
 	int refcnt;
 
-	/* Need to serialize the stats in the case of reset/update */
-	spinlock_t stats_lock;
 	struct blkg_policy_data *pd[BLKIO_NR_POLICIES];
 
 	struct rcu_head rcu_head;
-- 
1.7.7.3
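
For readers unfamiliar with the primitive this patch switches to, the
u64_stats_sync writer/reader idiom looks like the following in isolation.
This is a generic sketch with made-up names (my_stats, my_stats_add,
my_stats_read); it is not code from the patch, just the same pattern
reduced to its essentials:

#include <linux/u64_stats_sync.h>

struct my_stats {
	struct u64_stats_sync syncp;
	u64 bytes;
};

/* writer: already serialized externally (here, by the queue_lock) */
static void my_stats_add(struct my_stats *s, u64 delta)
{
	u64_stats_update_begin(&s->syncp);
	s->bytes += delta;
	u64_stats_update_end(&s->syncp);
}

/* reader: retries if a writer was active, so 64-bit values stay
 * consistent on 32-bit SMP without taking any lock */
static u64 my_stats_read(struct my_stats *s)
{
	unsigned int start;
	u64 v;

	do {
		start = u64_stats_fetch_begin(&s->syncp);
		v = s->bytes;
	} while (u64_stats_fetch_retry(&s->syncp, start));

	return v;
}

On 64-bit the begin/end pair is a no-op, which is why this ends up
cheaper than the spinlock it replaces.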


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
                   ` (7 preceding siblings ...)
  2012-02-23 22:30 ` [PATCH 8/8] blkcg: remove blkio_group->stats_lock Tejun Heo
@ 2012-02-23 22:43 ` Andrew Morton
  2012-02-23 23:01   ` Tejun Heo
  8 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-02-23 22:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, vgoyal, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni

On Thu, 23 Feb 2012 14:30:38 -0800
Tejun Heo <tj@kernel.org> wrote:

> This patchset is combination of the patchset "block, mempool, percpu:
> implement percpu mempool and fix blkcg percpu alloc deadlock" [1] and
> patches to remove blkg->stats_lock.

What's changed since last time?  I scanned the changelogs to see how
earlier issues were addressed and saw no mention of any of it.  eg, is
the code still deadlockable if the allocator is called with __GFP_WAIT?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 22:43 ` [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Andrew Morton
@ 2012-02-23 23:01   ` Tejun Heo
  2012-02-23 23:12     ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 23:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: axboe, vgoyal, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni

On Thu, Feb 23, 2012 at 02:43:36PM -0800, Andrew Morton wrote:
> On Thu, 23 Feb 2012 14:30:38 -0800
> Tejun Heo <tj@kernel.org> wrote:
> 
> > This patchset is combination of the patchset "block, mempool, percpu:
> > implement percpu mempool and fix blkcg percpu alloc deadlock" [1] and
> > patches to remove blkg->stats_lock.
> 
> What's changed since last time?  I scanned the changelogs to see how
> earlier issues were addressed and saw no mention of any of it.  eg, is
> the code still deadlockable if the allocator is called with __GFP_WAIT?

Hmmm... going through the thread again, ah, okay, I forgot about that
completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
building async percpu pool in blkcg then.  Great.  :(

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:01   ` Tejun Heo
@ 2012-02-23 23:12     ` Tejun Heo
  2012-02-23 23:22       ` Andrew Morton
                         ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 23:12 UTC (permalink / raw)
  To: vgoyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> Hmmm... going through the thread again, ah, okay, I forgot about that
> completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> building async percpu pool in blkcg then.  Great.  :(

Vivek, you win. :) Can you please refresh the async alloc patch on top
of blkcg-stacking branch?  I'll roll that into this series and drop
the mempool stuff.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:12     ` Tejun Heo
@ 2012-02-23 23:22       ` Andrew Morton
  2012-02-23 23:24         ` Tejun Heo
  2012-02-24 14:20       ` Vivek Goyal
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-02-23 23:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: vgoyal, axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni

On Thu, 23 Feb 2012 15:12:04 -0800
Tejun Heo <tj@kernel.org> wrote:

> On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> > Hmmm... going through the thread again, ah, okay, I forgot about that
> > completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> > isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> > building async percpu pool in blkcg then.  Great.  :(
> 
> Vivek, you win. :) Can you please refresh the async alloc patch on top
> of blkcg-stacking branch?  I'll roll that into this series and drop
> the mempool stuff.

I forget how those patches work, but they might be vulnerable to the
same issue.  If the block layer can handle the failed allocation
attempt and retry at the next I/O event then I guess that would be
acceptable; we'd lose a bit of statistical info occasionally, but who
cares.

But ISTR that we can't handle allocation failures here?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:22       ` Andrew Morton
@ 2012-02-23 23:24         ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-02-23 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: vgoyal, axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni

Hello, Andrew.

On Thu, Feb 23, 2012 at 03:22:23PM -0800, Andrew Morton wrote:
> I forget how those patches work, but they might be vulnerable to the
> same issue.  If the block layer can handle the failed allocation
> attempt and retry at the next I/O event then I guess that would be
> acceptable; we'd lose a bit of statistical info occasionally, but who
> cares.
>
> But ISTR that we can't handle allocation failures here?

It's just gonna skip percpu stats.  The problem with percpu mempool is
more about what __GFP_WAIT means in the interface.  Handling alloc
failure isn't difficult at all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:12     ` Tejun Heo
  2012-02-23 23:22       ` Andrew Morton
@ 2012-02-24 14:20       ` Vivek Goyal
  2012-02-25 21:44         ` Tejun Heo
  2012-02-25  3:44       ` Vivek Goyal
  2012-02-27 18:22       ` Vivek Goyal
  3 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-24 14:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Thu, Feb 23, 2012 at 03:12:04PM -0800, Tejun Heo wrote:
> On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> > Hmmm... going through the thread again, ah, okay, I forgot about that
> > completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> > isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> > building async percpu pool in blkcg then.  Great.  :(
> 
> Vivek, you win. :) Can you please refresh the async alloc patch on top
> of blkcg-stacking branch?  I'll roll that into this series and drop
> the mempool stuff.
> 

Ok. I will write a patch. Things have changed a lot since last time.
I think there is only one tricky part and that is waiting for any
scheduled work to finish during blkg destruction.  Because group destruction
happens under both queue and blkcg spin locks, I think I will have to take
the group off list, drop locks, wait for worker thread to finish and then
take locks again and walk through list again to kill remaining groups.

/me goes to try it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:12     ` Tejun Heo
  2012-02-23 23:22       ` Andrew Morton
  2012-02-24 14:20       ` Vivek Goyal
@ 2012-02-25  3:44       ` Vivek Goyal
  2012-02-25 21:46         ` Tejun Heo
  2012-02-27 18:22       ` Vivek Goyal
  3 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-25  3:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Thu, Feb 23, 2012 at 03:12:04PM -0800, Tejun Heo wrote:
> On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> > Hmmm... going through the thread again, ah, okay, I forgot about that
> > completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> > isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> > building async percpu pool in blkcg then.  Great.  :(
> 
> Vivek, you win. :) Can you please refresh the async alloc patch on top
> of blkcg-stacking branch?  I'll roll that into this series and drop
> the mempool stuff.

Hi Tejun,

Booting with blkcg-stacking branch and changing io scheduler from cfq to
deadline oopsed.

Thanks
Vivek


login: [   67.382768] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[   67.383037] CPU 1 
[   67.383037] Modules linked in: floppy [last unloaded: scsi_wait_scan]
[   67.383037] 
[   67.383037] Pid: 4763, comm: bash Not tainted 3.3.0-rc3-tejun-misc+ #6 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   67.383037] RIP: 0010:[<ffffffff81311793>]  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0
[   67.383037] RSP: 0018:ffff8801315edd48  EFLAGS: 00010046
[   67.383037] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 00000001001d000e
[   67.383037] RDX: 0000000000000000 RSI: ffffea0004db6800 RDI: ffffffff8114442d
[   67.383037] RBP: ffff8801315edd68 R08: 0000000000000000 R09: 00000001001d000d
[   67.383037] R10: 0000000000000230 R11: 0000000000000000 R12: ffff880137fe44a8
[   67.383037] R13: ffff880137fe3078 R14: ffff880137cf17e0 R15: 0000000000000020
[   67.383037] FS:  00007fce2dc73720(0000) GS:ffff88013fc40000(0000) knlGS:0000000000000000
[   67.383037] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.383037] CR2: 00000000006d8dc8 CR3: 0000000138a6c000 CR4: 00000000000006e0
[   67.383037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   67.383037] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   67.383037] Process bash (pid: 4763, threadinfo ffff8801315ec000, task ffff8801386fa2c0)
[   67.383037] Stack:
[   67.383037]  ffffffff81311eb7 ffff880137fe44a8 ffff880137fe45e8 ffff880137fe4578
[   67.383037]  ffff8801315edda8 ffffffff81311ef4 0000000000000000 ffff8801378e4490
[   67.383037]  ffff8801378e44e0 ffffffff81e46d60 ffff8801378e4490 0000000000000001
[   67.383037] Call Trace:
[   67.383037]  [<ffffffff81311eb7>] ? cfq_exit_queue+0x47/0xe0
[   67.383037]  [<ffffffff81311ef4>] cfq_exit_queue+0x84/0xe0
[   67.383037]  [<ffffffff812ef19a>] elevator_exit+0x3a/0x60
[   67.383037]  [<ffffffff812efe88>] elevator_change+0x138/0x200
[   67.383037]  [<ffffffff81837c3c>] ? mutex_lock_nested+0x28c/0x350
[   67.383037]  [<ffffffff812f07db>] elv_iosched_store+0x2b/0x60
[   67.383037]  [<ffffffff812f9556>] queue_attr_store+0x66/0xc0
[   67.383037]  [<ffffffff811c5876>] sysfs_write_file+0xe6/0x170
[   67.383037]  [<ffffffff8114ecf3>] vfs_write+0xb3/0x180
[   67.383037]  [<ffffffff8114f01a>] sys_write+0x4a/0x90
[   67.383037]  [<ffffffff818438d2>] system_call_fastpath+0x16/0x1b
[   67.383037] Code: 00 00 48 8b 3d 2f 21 58 01 48 89 de 31 db e8 95 3a e3 ff 4d 85 ed 74 07 49 8b 9d 10 ff ff ff 44 8b 05 72 67 b3 00 45 85 c0 75 4d <8b> 83 b0 00 00 00 85 c0 0f 8e 94 00 00 00 83 e8 01 85 c0 89 83 
[   67.383037] RIP  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0
[   67.383037]  RSP <ffff8801315edd48>
[   67.383037] ---[ end trace 07c5b04a4c80feda ]---


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-24 14:20       ` Vivek Goyal
@ 2012-02-25 21:44         ` Tejun Heo
  2012-02-27  3:11           ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-02-25 21:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

Hello, Vivek.

On Fri, Feb 24, 2012 at 09:20:33AM -0500, Vivek Goyal wrote:
> Ok. I will write a patch. Things have changed a lot since last time.
> I think there is only one tricky part and that is waiting for any
> scheduled work to finish during blkg destruction.  Because group destruction
> happens under both queue and blkcg spin locks, I think I will have to take
> the group off list, drop locks, wait for worker thread to finish and then
> take locks again and walk through list again to kill remaining groups.

Rather than embedding work_struct into blkg and tying work and blkg
destruction, why not just create a spinlock protected alloc-pending
list?  On blkg creation, link blkg onto that list and kick shared work
item and on destruction, just unlink it.  Worker function can walk the
list and try to alloc for all and reschedule if alloc fails.
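
A minimal sketch of that shape, simplified to a single percpu stats
pointer per blkg (the real code keeps one per policy in blkg->pd[]); the
alloc_node and stats_cpu fields and the helper below are placeholders,
not the eventual patch:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/percpu.h>

static DEFINE_SPINLOCK(alloc_list_lock);
static LIST_HEAD(alloc_list);		/* blkgs still missing percpu stats */

static void blkg_stat_alloc_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(blkg_stat_alloc_work, blkg_stat_alloc_fn);

static void blkg_stat_alloc_fn(struct work_struct *work)
{
	struct blkio_group_stats_cpu __percpu *stats;
	bool more;

	/* GFP_KERNEL percpu allocation must happen outside the spinlock */
	stats = alloc_percpu(struct blkio_group_stats_cpu);

	spin_lock_irq(&alloc_list_lock);
	if (stats && !list_empty(&alloc_list)) {
		struct blkio_group *blkg =
			list_first_entry(&alloc_list, struct blkio_group,
					 alloc_node);

		blkg->stats_cpu = stats;	/* hand the allocation over */
		stats = NULL;
		list_del_init(&blkg->alloc_node);
	}
	more = !list_empty(&alloc_list);
	spin_unlock_irq(&alloc_list_lock);

	free_percpu(stats);		/* NULL-safe; only set if unused */
	if (more)			/* alloc failed or more blkgs pending */
		queue_delayed_work(system_nrt_wq, &blkg_stat_alloc_work,
				   HZ / 10);
}

Creation would list_add() the blkg and kick the work item under the same
lock; destruction would list_del_init() it, so a dying blkg simply never
gets filled in.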

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-25  3:44       ` Vivek Goyal
@ 2012-02-25 21:46         ` Tejun Heo
  2012-02-25 22:21           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-02-25 21:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

Hello,

On Fri, Feb 24, 2012 at 10:44:32PM -0500, Vivek Goyal wrote:
> Booting with blkcg-stacking branch and changing io scheduler from cfq to
> deadline oopsed.
> 
> login: [   67.382768] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> [   67.383037] CPU 1 
> [   67.383037] Modules linked in: floppy [last unloaded: scsi_wait_scan]
> [   67.383037] 
> [   67.383037] Pid: 4763, comm: bash Not tainted 3.3.0-rc3-tejun-misc+ #6 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> [   67.383037] RIP: 0010:[<ffffffff81311793>]  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0

Hmmm... weird.  Looking into it.  I'm away from office for a week and
will probably be slow.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-25 21:46         ` Tejun Heo
@ 2012-02-25 22:21           ` Tejun Heo
  2012-02-27 14:25             ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-02-25 22:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Sun, Feb 26, 2012 at 06:46:41AM +0900, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 24, 2012 at 10:44:32PM -0500, Vivek Goyal wrote:
> > Booting with blkcg-stacking branch and changing io scheduler from cfq to
> > deadline oopsed.
> > 
> > login: [   67.382768] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> > [   67.383037] CPU 1 
> > [   67.383037] Modules linked in: floppy [last unloaded: scsi_wait_scan]
> > [   67.383037] 
> > [   67.383037] Pid: 4763, comm: bash Not tainted 3.3.0-rc3-tejun-misc+ #6 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> > [   67.383037] RIP: 0010:[<ffffffff81311793>]  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0
> 
> Hmmm... weird.  Looking into it.  I'm away from office for a week and
> will probably be slow.

It won't reproduce here.  Can you please explain how to trigger it?
Can you please also run addr2line on the oops address?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-25 21:44         ` Tejun Heo
@ 2012-02-27  3:11           ` Vivek Goyal
  2012-02-27  9:11             ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-27  3:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Sun, Feb 26, 2012 at 06:44:21AM +0900, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Fri, Feb 24, 2012 at 09:20:33AM -0500, Vivek Goyal wrote:
> > Ok. I will write a patch. Things have changed a lot since last time.
> > I think there is only one tricky part and that is waiting for any
> > scheduled work to finish during blkg destruction.  Because group destruction
> > happens under both queue and blkcg spin locks, I think I will have to take
> > the group off list, drop locks, wait for worker thread to finish and then
> > take locks again and walk through list again to kill remaining groups.
> 
> Rather than embedding work_struct into blkg and tying work and blkg
> destruction, why not just create a spinlock protected alloc-pending
> list?  On blkg creation, link blkg onto that list and kick shared work
> item and on destruction, just unlink it.  Worker function can walk the
> list and try to alloc for all and reschedule if alloc fails.

Ok. This sounds better than embedding work_struct in blkg, I can embed it
in request_queue and make the worker walk the list of blkg pending 
alloc of stats. Will try that. Thanks for the idea.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27  3:11           ` Vivek Goyal
@ 2012-02-27  9:11             ` Tejun Heo
  2012-02-27 19:43               ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-02-27  9:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Sun, Feb 26, 2012 at 10:11:46PM -0500, Vivek Goyal wrote:
> Ok. This sounds better than embedding work_struct in blkg, I can embed it
> in request_queue and make the worker walk the list of blkg pending 
> alloc of stats. Will try that. Thanks for the idea.

We might not need to make it even per-queue.  Simple global list of
pending blkgs and single work item should work fine, I think.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-25 22:21           ` Tejun Heo
@ 2012-02-27 14:25             ` Vivek Goyal
  2012-02-27 14:40               ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-27 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Sun, Feb 26, 2012 at 07:21:13AM +0900, Tejun Heo wrote:
> On Sun, Feb 26, 2012 at 06:46:41AM +0900, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Feb 24, 2012 at 10:44:32PM -0500, Vivek Goyal wrote:
> > > Booting with blkcg-stacking branch and changing io scheduler from cfq to
> > > deadline oopsed.
> > > 
> > > login: [   67.382768] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> > > [   67.383037] CPU 1 
> > > [   67.383037] Modules linked in: floppy [last unloaded: scsi_wait_scan]
> > > [   67.383037] 
> > > [   67.383037] Pid: 4763, comm: bash Not tainted 3.3.0-rc3-tejun-misc+ #6 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> > > [   67.383037] RIP: 0010:[<ffffffff81311793>]  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0
> > 
> > Hmmm... weird.  Looking into it.  I'm away from office for a week and
> > will probably be slow.
> 
> It won't reproduce here.  Can you please explain how to trigger it?
> Can you please also run addr2line on the oops address?

I have BLK_CGROUP enabled. CFQ is the default scheduler. I boot the system and
just change the scheduler to deadline on sda and crash happens. It is
consistently reproducible on my machine.

addr2line points to blk-cgroup.h:

blkg_put() {
	WARN_ON_ONCE(blkg->refcnt <= 0);
}

I put more printk and we are putting down async queues when crash happens.

cfq_put_async_queues().

So it looks like a group might have already been freed. Maybe it is a group
refcount issue. I see 6b6b6b... pattern in RBX. Sounds like a use after
free thing.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27 14:25             ` Vivek Goyal
@ 2012-02-27 14:40               ` Vivek Goyal
  2012-03-05 17:45                 ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-27 14:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Feb 27, 2012 at 09:25:29AM -0500, Vivek Goyal wrote:
> On Sun, Feb 26, 2012 at 07:21:13AM +0900, Tejun Heo wrote:
> > On Sun, Feb 26, 2012 at 06:46:41AM +0900, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Fri, Feb 24, 2012 at 10:44:32PM -0500, Vivek Goyal wrote:
> > > > Booting with blkcg-stacking branch and changing io scheduler from cfq to
> > > > deadline oopsed.
> > > > 
> > > > login: [   67.382768] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> > > > [   67.383037] CPU 1 
> > > > [   67.383037] Modules linked in: floppy [last unloaded: scsi_wait_scan]
> > > > [   67.383037] 
> > > > [   67.383037] Pid: 4763, comm: bash Not tainted 3.3.0-rc3-tejun-misc+ #6 Hewlett-Packard HP xw6600 Workstation/0A9Ch
> > > > [   67.383037] RIP: 0010:[<ffffffff81311793>]  [<ffffffff81311793>] cfq_put_queue+0xb3/0x1d0
> > > 
> > > Hmmm... weird.  Looking into it.  I'm away from office for a week and
> > > will probably be slow.
> > 
> > It won't reproduce here.  Can you please explain how to trigger it?
> > Can you please also run addr2line on the oops address?
> 
> I have BLK_CGROUP enabled. CFQ is the default scheduler. I boot the system and
> just change the scheduler to deadline on sda and crash happens. It is
> consistently reproducible on my machine.
> 
> addr2line points to blk-cgroup.h:
> 
> blkg_put() {
> 	WARN_ON_ONCE(blkg->refcnt <= 0);
> }
> 
> I put more printk and we are putting down async queues when crash happens.
> 
> cfq_put_async_queues().
> 
> So it looks like a group might have already been freed. Maybe it is a group
> refcount issue. I see 6b6b6b... pattern in RBX. Sounds like a use after
> free thing.

I think the problem might be that we have destroyed policy data (cfqg also)
early and later we access it.

So we call following.

elevator_switch()
  blkg_destroy_all()
    update_root_blkg();

Here update_root_blkg() will free up the blkg->pd for cfq. 

And later we call.

elevator_exit()
  cfq_exit_queue()
    cfq_put_async_queues()
      cfq_put_queue()
        blkg_put(cfqg_to_blkg(cfqg)); <------ trying to reach blkg through
already freed policy data.

So we should not free up root group policy data till old elevator has
exited.

Thanks
Vivek

> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-23 23:12     ` Tejun Heo
                         ` (2 preceding siblings ...)
  2012-02-25  3:44       ` Vivek Goyal
@ 2012-02-27 18:22       ` Vivek Goyal
  2012-02-29 19:03         ` Vivek Goyal
  3 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-27 18:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Thu, Feb 23, 2012 at 03:12:04PM -0800, Tejun Heo wrote:
> On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> > Hmmm... going through the thread again, ah, okay, I forgot about that
> > completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> > isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> > building async percpu pool in blkcg then.  Great.  :(
> 
> Vivek, you win. :) Can you please refresh the async alloc patch on top
> of blkcg-stacking branch?  I'll roll that into this series and drop
> the mempool stuff.

Hi Tejun,

Looks like there is one more issue. After booting kernel, I tried to read
one of the cgroup files.

cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device and it oopses.

Basically it looks like we are getting a null dev pointer while we
do the following.

dev_name(blkg->q->backing_dev_info.dev);

I put some printk to verify backing_dev_info.dev is null. That means
somehow we have a request queue and we did not call add_disk() on that?

I checked that queue is not in bypass mode or dead mode. So nobody called
blk_cleanup_queue() on that.

Thanks
Vivek

chilli.lab.bos.redhat.com login: [   62.090238] Vivek: q = ffff880136d647a0 bypass=0 dead=0 
[   62.091001] Vivek: dev = ffff880138959520
[   62.091001] Vivek: q = ffff880136d623d0 bypass=0 dead=0 
[   62.091001] Vivek: dev =           (null)
[   62.091001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[   62.091001] IP: [<ffffffff813091b5>] blkio_print_group_conf+0x85/0x160
[   62.091001] PGD 131757067 PUD 130915067 PMD 0 
[   62.091001] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   62.091001] CPU 2 
[   62.091001] Modules linked in: floppy [last unloaded: scsi_wait_scan]
[   62.091001] 
[   62.091001] Pid: 4787, comm: cat Tainted: G        W 3.3.0-rc3-tejun-misc+ #13 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[   62.091001] RIP: 0010:[<ffffffff813091b5>]  [<ffffffff813091b5>] blkio_print_group_conf+0x85/0x160
[   62.091001] RSP: 0018:ffff8801315a9df8  EFLAGS: 00010096
[   62.091001] RAX: 0000000000000030 RBX: 0000000000000000 RCX: 0000000000006867
[   62.091001] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81038af5
[   62.091001] RBP: ffff8801315a9e28 R08: 0000000000000002 R09: 0000000000000000
[   62.091001] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[   62.091001] R13: 0000000000010000 R14: ffff880137a9b480 R15: ffff880130a4c248
[   62.091001] FS:  00007ff3882f6720(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
[   62.091001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.091001] CR2: 0000000000000050 CR3: 0000000138b0a000 CR4: 00000000000006e0
[   62.091001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.091001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   62.091001] Process cat (pid: 4787, threadinfo ffff8801315a8000, task ffff8801386ce840)
[   62.091001] Stack:
[   62.091001]  ffff8801315a9e28 ffff880137a9b7e0 ffffffff81e45fc0 ffff880130a4c248
[   62.091001]  ffffffff81e46aa8 0000000000000000 ffff8801315a9e58 ffffffff813092ff
[   62.091001]  ffff88013167a580 ffff880130a4c248 0000000000000001 ffff88013167a580
[   62.091001] Call Trace:
[   62.091001]  [<ffffffff813092ff>] blkiocg_file_read+0x6f/0xd0
[   62.091001]  [<ffffffff810aa240>] cgroup_seqfile_show+0x50/0x60
[   62.091001]  [<ffffffff81174741>] seq_read+0x131/0x3f0
[   62.091001]  [<ffffffff8114ee70>] vfs_read+0xb0/0x180
[   62.091001]  [<ffffffff8114ef8a>] sys_read+0x4a/0x90
[   62.091001]  [<ffffffff81843952>] system_call_fastpath+0x16/0x1b
[   62.091001] Code: 05 48 c1 ea 06 83 e2 01 83 e1 01 e8 dd 51 52 00 48 8b 03 48 c7 c7 57 bc c9 81 48 8b 98 a8 05 00 00 31 c0 48 89 de e8 c2 51 52 00 <48> 8b 53 50 48 85 d2 74 7a 45 85 e4 75 25 41 8b 4e 08 85 c9 75 
[   62.091001] RIP  [<ffffffff813091b5>] blkio_print_group_conf+0x85/0x160
[   62.091001]  RSP <ffff8801315a9df8>
[   62.091001] CR2: 0000000000000050
[   62.091001] ---[ end trace be53af9706c0e89c ]---


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27  9:11             ` Tejun Heo
@ 2012-02-27 19:43               ` Vivek Goyal
  2012-02-29 17:36                 ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-27 19:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Feb 27, 2012 at 06:11:41PM +0900, Tejun Heo wrote:
> On Sun, Feb 26, 2012 at 10:11:46PM -0500, Vivek Goyal wrote:
> > Ok. This sounds better than embedding work_struct in blkg, I can embed it
> > in request_queue and make the worker walk the list of blkg pending 
> > alloc of stats. Will try that. Thanks for the idea.
> 
> We might not need to make it even per-queue.  Simple global list of
> pending blkgs and single work item should work fine, I think.

Thanks for the suggestion Tejun. I have implemented it and below is the
patch. I have done basic testing of boot and cgroup creation. Yet to test
it over elevator switch path. Will do that once it is fixed. I will sign
it after testing.

Do let me know if you want some changes in the patch.

Thanks
Vivek

Allocate blkg per cpu stat from a worker thread.

Yet-to-be-signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  135 +++++++++++++++++++++++++++++++++++++++--------------
 block/blk-cgroup.h |    2 
 2 files changed, 102 insertions(+), 35 deletions(-)

Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-02-28 01:29:09.238256494 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-02-28 01:29:12.000000000 -0500
@@ -180,6 +180,8 @@ struct blkio_group {
 	struct request_queue *q;
 	struct list_head q_node;
 	struct hlist_node blkcg_node;
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head pending_alloc_node;
 	struct blkio_cgroup *blkcg;
 	/* Store cgroup path */
 	char path[128];
Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-02-28 01:29:09.239256494 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-02-28 01:32:38.153263325 -0500
@@ -30,6 +30,12 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+static DEFINE_SPINLOCK(pending_alloc_list_lock);
+static LIST_HEAD(pending_alloc_list);
+
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +397,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +452,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +472,73 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	void *stat_ptr;
+	struct blkio_group *blkg, *n;
+	int i;
+
+alloc_stats:
+	spin_lock_irq(&pending_alloc_list_lock);
+		if (list_empty(&pending_alloc_list)) {
+			/* Nothing to do */
+			spin_unlock_irq(&pending_alloc_list_lock);
+			return;
+		}
+	spin_unlock_irq(&pending_alloc_list_lock);
+
+	stat_ptr = alloc_percpu(struct blkio_group_stats_cpu);
+
+	/* Retry. Should there be an upper limit on number of retries */
+	if (stat_ptr == NULL)
+		goto alloc_stats;
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&pending_alloc_list_lock);
+
+	list_for_each_entry_safe(blkg, n, &pending_alloc_list,
+		pending_alloc_node) {
+		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+			struct blkio_policy_type *pol = blkio_policy[i];
+			struct blkg_policy_data *pd;
+
+			if (!pol)
+				continue;
+
+			if (!blkg->pd[i])
+				continue;
+
+			pd = blkg->pd[i];
+			if (pd->stats_cpu)
+				continue;
+
+			if (stat_ptr) {
+				pd->stats_cpu = stat_ptr;
+				stat_ptr = NULL;
+				break;
+			}
+		}
+
+		if (stat_ptr != NULL || i == BLKIO_NR_POLICIES - 1) {
+			/* We are done with this group */
+			list_del_init(&blkg->pending_alloc_node);
+			continue;
+		} else
+			/* Go allocate more memory */
+			break;
+	}
+	spin_unlock(&pending_alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+
+	if (stat_ptr != NULL) {
+		/* Nobody needs memory anymore */
+		free_percpu(stat_ptr);
+		return;
+	} else
+		goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -509,6 +588,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->pending_alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +610,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +629,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +653,29 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
+	spin_lock(&pending_alloc_list_lock);
 
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
+	list_add(&blkg->pending_alloc_node, &pending_alloc_list);
 
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_work(system_nrt_wq, &blkio_stat_alloc_work);
+
+	spin_unlock(&pending_alloc_list_lock);
 	spin_unlock(&blkcg->lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -648,11 +702,16 @@ static void blkg_destroy(struct blkio_gr
 	lockdep_assert_held(q->queue_lock);
 	lockdep_assert_held(&blkcg->lock);
 
+	spin_lock(&pending_alloc_list_lock);
+
 	/* Something wrong if we are trying to remove same group twice */
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
+	list_del_init(&blkg->pending_alloc_node);
+
+	spin_unlock(&pending_alloc_list_lock);
 
 	/*
 	 * Put the reference taken at the time of creation so that when all
@@ -755,6 +814,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -886,6 +948,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27 19:43               ` Vivek Goyal
@ 2012-02-29 17:36                 ` Vivek Goyal
  2012-03-05 22:13                   ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-29 17:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Feb 27, 2012 at 02:43:21PM -0500, Vivek Goyal wrote:
> On Mon, Feb 27, 2012 at 06:11:41PM +0900, Tejun Heo wrote:
> > On Sun, Feb 26, 2012 at 10:11:46PM -0500, Vivek Goyal wrote:
> > > Ok. This sounds better than embedding work_struct in blkg, I can embed it
> > > in request_queue and make the worker walk the list of blkg pending 
> > > alloc of stats. Will try that. Thanks for the idea.
> > 
> > We might not need to make it even per-queue.  Simple global list of
> > pending blkgs and single work item should work fine, I think.
> 
> Thanks for the suggestion Tejun. I have implemented it and below is the
> patch. I have done basic testing of boot and cgroup creation. Yet to test
> it over elevator switch path. Will do that once it is fixed. I will sign
> it after testing.
> 
> Do let me know if you want some changes in the patch.
> 
> Thanks
> Vivek
> 
> Allocate blkg per cpu stat from a worker thread.
> 
> Yet-to-be-signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Came up with second version of patch. Minor cleanups. There were couple
of redundant condition checks.

Thanks
Vivek


Allocate blkg per cpu stats asynchronously from a worker thread.

---
 block/blk-cgroup.c |  134 +++++++++++++++++++++++++++++++++++++++--------------
 block/blk-cgroup.h |    2 
 2 files changed, 101 insertions(+), 35 deletions(-)

Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-02-28 01:29:09.238256494 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-02-28 01:29:12.000000000 -0500
@@ -180,6 +180,8 @@ struct blkio_group {
 	struct request_queue *q;
 	struct list_head q_node;
 	struct hlist_node blkcg_node;
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head pending_alloc_node;
 	struct blkio_cgroup *blkcg;
 	/* Store cgroup path */
 	char path[128];
Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-02-28 01:29:09.239256494 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-02-29 23:02:00.279293289 -0500
@@ -30,6 +30,12 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+static DEFINE_SPINLOCK(pending_alloc_list_lock);
+static LIST_HEAD(pending_alloc_list);
+
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +397,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +452,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +472,72 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	void *stat_ptr = NULL;
+	struct blkio_group *blkg, *n;
+	int i;
+
+alloc_stats:
+	spin_lock_irq(&pending_alloc_list_lock);
+		if (list_empty(&pending_alloc_list)) {
+			/* Nothing to do */
+			spin_unlock_irq(&pending_alloc_list_lock);
+			return;
+		}
+	spin_unlock_irq(&pending_alloc_list_lock);
+
+	WARN_ON(stat_ptr != NULL);
+	stat_ptr = alloc_percpu(struct blkio_group_stats_cpu);
+
+	/* Retry. Should there be an upper limit on number of retries */
+	if (stat_ptr == NULL)
+		goto alloc_stats;
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&pending_alloc_list_lock);
+
+	list_for_each_entry_safe(blkg, n, &pending_alloc_list,
+		pending_alloc_node) {
+		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+			struct blkio_policy_type *pol = blkio_policy[i];
+			struct blkg_policy_data *pd;
+
+			if (!pol)
+				continue;
+
+			if (!blkg->pd[i])
+				continue;
+
+			pd = blkg->pd[i];
+			if (pd->stats_cpu)
+				continue;
+
+			pd->stats_cpu = stat_ptr;
+			stat_ptr = NULL;
+			break;
+		}
+
+		if (i == BLKIO_NR_POLICIES - 1) {
+			/* We are done with this group */
+			list_del_init(&blkg->pending_alloc_node);
+			continue;
+		} else
+			/* Go allocate more memory */
+			break;
+	}
+	spin_unlock(&pending_alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+
+	if (stat_ptr != NULL) {
+		/* Nobody needs memory anymore */
+		free_percpu(stat_ptr);
+		return;
+	} else
+		goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -509,6 +587,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->pending_alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +609,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +628,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +652,29 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
+	spin_lock(&pending_alloc_list_lock);
 
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
+	list_add(&blkg->pending_alloc_node, &pending_alloc_list);
 
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_work(system_nrt_wq, &blkio_stat_alloc_work);
+
+	spin_unlock(&pending_alloc_list_lock);
 	spin_unlock(&blkcg->lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -648,11 +701,16 @@ static void blkg_destroy(struct blkio_gr
 	lockdep_assert_held(q->queue_lock);
 	lockdep_assert_held(&blkcg->lock);
 
+	spin_lock(&pending_alloc_list_lock);
+
 	/* Something wrong if we are trying to remove same group twice */
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
+	list_del_init(&blkg->pending_alloc_node);
+
+	spin_unlock(&pending_alloc_list_lock);
 
 	/*
 	 * Put the reference taken at the time of creation so that when all
@@ -755,6 +813,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -886,6 +947,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27 18:22       ` Vivek Goyal
@ 2012-02-29 19:03         ` Vivek Goyal
  2012-03-05 17:20           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-02-29 19:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Feb 27, 2012 at 01:22:20PM -0500, Vivek Goyal wrote:
> On Thu, Feb 23, 2012 at 03:12:04PM -0800, Tejun Heo wrote:
> > On Thu, Feb 23, 2012 at 03:01:23PM -0800, Tejun Heo wrote:
> > > Hmmm... going through the thread again, ah, okay, I forgot about that
> > > completely.  Yeah, that is an actual problem.  Both __GFP_WAIT which
> > > isn't GFP_KERNEL and GFP_KERNEL are valid use cases.  I guess we'll be
> > > building async percpu pool in blkcg then.  Great.  :(
> > 
> > Vivek, you win. :) Can you please refresh the async alloc patch on top
> > of blkcg-stacking branch?  I'll roll that into this series and drop
> > the mempool stuff.
> 
> Hi Tejun,
> 
> Looks like there is one more issue. After booting kernel, I tried to read
> one of the cgroup files.
> 
> cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device and it oopses.
> 
> Basically it looks like we are getting a null dev pointer while we
> do the following.
> 
> dev_name(blkg->q->backing_dev_info.dev);
> 
> I put some printk to verify backing_dev_info.dev is null. That means
> somehow we have a request queue and we did not call add_disk() on that?
> 
> I checked that queue is not in bypass mode or dead mode. So nobody called
> blk_cleanup_queue() on that.

Ok, more debugging revealed that the culprit here is the "floppy" driver module.
It instantiates a queue but never registers a disk against it. Better
skip such queues/groups. Here is the patch.


blk-cgroup: Skip a group if there is no disk associated with the queue

blk-cgroup code currently assumes that there is a device/disk associated
with every queue in the system. But modules like floppy can instantiate
a bunch of request queues and not register a disk against all of them.
This can lead to blkcg crashing.

Skip the queue/blkg which don't have any dev/disk associated with them.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-02-28 00:33:36.000000000 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-01 00:51:15.276346953 -0500
@@ -1136,9 +1136,22 @@ static void blkio_print_group_conf(struc
 	int plid = BLKIOFILE_POLICY(cft->private);
 	int fileid = BLKIOFILE_ATTR(cft->private);
 	struct blkg_policy_data *pd = blkg->pd[plid];
-	const char *dname = dev_name(blkg->q->backing_dev_info.dev);
+	const char *dname;
+	struct device *dev;
 	int rw = WRITE;
 
+	dev = blkg->q->backing_dev_info.dev;
+
+	/*
+	 * Some drivers (like floppy) might instantiate a request queue
+	 * and not register any disk against that queue. So dev can be
+	 * null
+	 */
+	if (!dev)
+		return;
+
+	dname = dev_name(dev);
+
 	switch (plid) {
 		case BLKIO_POLICY_PROP:
 			if (pd->conf.weight)
@@ -1230,9 +1243,16 @@ static int blkio_read_blkg_stats(struct 
 	spin_lock_irq(&blkcg->lock);
 
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
+		struct device *dev;
+		const char *dname;
+
 		int plid = BLKIOFILE_POLICY(cft->private);
 
+		dev = blkg->q->backing_dev_info.dev;
+		if (!dev)
+			continue;
+		dname = dev_name(dev);
+
 		if (pcpu) {
 			cgroup_total += blkio_get_stat_cpu(blkg, plid,
 							   cb, dname, type);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-29 19:03         ` Vivek Goyal
@ 2012-03-05 17:20           ` Tejun Heo
  2012-03-05 18:03             ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-05 17:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

Hello, Vivek.

Sorry about the delay.  Was off last week.

On Wed, Feb 29, 2012 at 02:03:54PM -0500, Vivek Goyal wrote:
> Ok, more debugging revealed that the culprit here is the "floppy" driver module.
> It instantiates a queue but never registers a disk against it. Better
> skip such queues/groups. Here is the patch.

Urgh.. I hate that driver.  I ended up folding the following patch in
the series right after "blkcg: kill the mind-bending blkg->dev".

Thanks.

Subject: blkcg: skip blkg printing if q isn't associated with disk

blk-cgroup printing code currently assumes that there is a device/disk
associated with every queue in the system, but modules like floppy
can instantiate request queues without registering a disk, which can
lead to an oops.

Skip the queue/blkg which don't have dev/disk associated with them.

-tj: Factored out backing_dev_info check into blkg_dev_name().

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

Index: work/block/blk-cgroup.c
===================================================================
--- work.orig/block/blk-cgroup.c
+++ work/block/blk-cgroup.c
@@ -951,13 +951,24 @@ static int blkiocg_file_write(struct cgr
 	return ret;
 }
 
+static const char *blkg_dev_name(struct blkio_group *blkg)
+{
+	/* some drivers (floppy) instantiate a queue w/o disk registered */
+	if (blkg->q->backing_dev_info.dev)
+		return dev_name(blkg->q->backing_dev_info.dev);
+	return NULL;
+}
+
 static void blkio_print_group_conf(struct cftype *cft, struct blkio_group *blkg,
 				   struct seq_file *m)
 {
-	const char *dname = dev_name(blkg->q->backing_dev_info.dev);
+	const char *dname = blkg_dev_name(blkg);
 	int fileid = BLKIOFILE_ATTR(cft->private);
 	int rw = WRITE;
 
+	if (!dname)
+		return;
+
 	switch (blkg->plid) {
 		case BLKIO_POLICY_PROP:
 			if (blkg->conf.weight)
@@ -1049,9 +1060,9 @@ static int blkio_read_blkg_stats(struct 
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
+		const char *dname = blkg_dev_name(blkg);
 
-		if (BLKIOFILE_POLICY(cft->private) != blkg->plid)
+		if (!dname || BLKIOFILE_POLICY(cft->private) != blkg->plid)
 			continue;
 		if (pcpu)
 			cgroup_total += blkio_get_stat_cpu(blkg, cb, dname,

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-27 14:40               ` Vivek Goyal
@ 2012-03-05 17:45                 ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-03-05 17:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

Hello,

On Mon, Feb 27, 2012 at 09:40:45AM -0500, Vivek Goyal wrote:
> So we call following.
> 
> elevator_switch()
>   blkg_destroy_all()
>     update_root_blkg();
> 
> Here update_root_blkg() will free up the blkg->pd for cfq. 
> 
> And later we call.
> 
> elevator_exit()
>   cfq_exit_queue()
>     cfq_put_async_queues()
>       cfq_put_queue()
>         blkg_put(cfqg_to_blkg(cfqg)); <------ trying to reach blkg through
> already freed policy data.
> 
> So we should not free up root group policy data till old elevator has
> exited.

Yeah, this analysis seems correct.  For some reason, I still can't
reproduce it even with slab debugging turned on.  I modified the free
path to fill *pd with a pattern w/o actually freeing it and can now see
similar crashes.  Will post a fix soon.
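
A minimal sketch of that kind of poison-instead-of-free debugging hack,
with a made-up helper name (not part of the posted series):

	static void blkg_poison_pd(struct blkg_policy_data *pd)
	{
		if (!pd)
			return;
		/* mimic SLAB poisoning so a later dereference reads 0x6b6b... */
		memset(pd, 0x6b, sizeof(*pd));
		/* kfree(pd) intentionally skipped while chasing the use-after-free */
	}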

Thanks a lot.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-05 17:20           ` Tejun Heo
@ 2012-03-05 18:03             ` Vivek Goyal
  0 siblings, 0 replies; 56+ messages in thread
From: Vivek Goyal @ 2012-03-05 18:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Mar 05, 2012 at 09:20:00AM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> Sorry about the delay.  Was off last week.
> 
> On Wed, Feb 29, 2012 at 02:03:54PM -0500, Vivek Goyal wrote:
> > Ok, more debugging revealed that the culprit here is the "floppy" driver module.
> > It instantiates a queue but never registers a disk against it. Better to
> > skip such queues/groups. Here is the patch.
> 
> Urgh.. I hate that driver.  I ended up folding the following patch in
> the series right after "blkcg: kill the mind-bending blkg->dev".

Looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-02-29 17:36                 ` Vivek Goyal
@ 2012-03-05 22:13                   ` Tejun Heo
  2012-03-06 21:09                     ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-05 22:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

Hello, Vivek.

On Wed, Feb 29, 2012 at 12:36:39PM -0500, Vivek Goyal wrote:
> Index: tejun-misc/block/blk-cgroup.h
> ===================================================================
> --- tejun-misc.orig/block/blk-cgroup.h	2012-02-28 01:29:09.238256494 -0500
> +++ tejun-misc/block/blk-cgroup.h	2012-02-28 01:29:12.000000000 -0500
> @@ -180,6 +180,8 @@ struct blkio_group {
>  	struct request_queue *q;
>  	struct list_head q_node;
>  	struct hlist_node blkcg_node;
> +	/* List of blkg waiting for per cpu stats memory to be allocated */
> +	struct list_head pending_alloc_node;

Can we move this right on top of rcu_head?  It's one of the coldest
entries.  Also, long field names tend to be a bit painful.  How about
just alloc_node?

> +static void blkio_stat_alloc_fn(struct work_struct *work)
> +{
> +
> +	void *stat_ptr = NULL;
> +	struct blkio_group *blkg, *n;
> +	int i;
> +
> +alloc_stats:
> +	spin_lock_irq(&pending_alloc_list_lock);
> +		if (list_empty(&pending_alloc_list)) {
> +			/* Nothing to do */
> +			spin_unlock_irq(&pending_alloc_list_lock);
> +			return;
> +		}
> +	spin_unlock_irq(&pending_alloc_list_lock);
> +
> +	WARN_ON(stat_ptr != NULL);
> +	stat_ptr = alloc_percpu(struct blkio_group_stats_cpu);

There will only be one of these work items and, if queued on the nrt wq,
only one instance will be running.  Why not just create a static
ps[NR_POLS] array and fill it here?

> +	/* Retry. Should there be an upper limit on number of retries */
> +	if (stat_ptr == NULL)
> +		goto alloc_stats;
> +
> +	spin_lock_irq(&blkio_list_lock);
> +	spin_lock(&pending_alloc_list_lock);
> +
> +	list_for_each_entry_safe(blkg, n, &pending_alloc_list,
> +		pending_alloc_node) {
> +		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +			struct blkio_policy_type *pol = blkio_policy[i];
> +			struct blkg_policy_data *pd;
> +
> +			if (!pol)
> +				continue;
> +
> +			if (!blkg->pd[i])
> +				continue;
> +
> +			pd = blkg->pd[i];
> +			if (pd->stats_cpu)
> +				continue;
> +
> +			pd->stats_cpu = stat_ptr;
> +			stat_ptr = NULL;
> +			break;

and install everything here at one go.

> +		}
> +
> +		if (i == BLKIO_NR_POLICIES - 1) {
> +			/* We are done with this group */
> +			list_del_init(&blkg->pending_alloc_node);
> +			continue;
> +		} else
> +			/* Go allocate more memory */
> +			break;
> +	}

Remove it from the alloc list while holding the alloc lock, unlock, and
then either retry or exit.  Don't worry about stats_cpu left in ps[] as
we're gonna be using those again later anyway.

>  	/* insert */
>  	spin_lock(&blkcg->lock);
> -	swap(blkg, new_blkg);
> +	spin_lock(&pending_alloc_list_lock);

Do we need this nested inside blkcg->lock?  What's wrong with doing it
after releasing blkcg->lock?

> @@ -648,11 +701,16 @@ static void blkg_destroy(struct blkio_gr
>  	lockdep_assert_held(q->queue_lock);
>  	lockdep_assert_held(&blkcg->lock);
>  
> +	spin_lock(&pending_alloc_list_lock);
> +
>  	/* Something wrong if we are trying to remove same group twice */
>  	WARN_ON_ONCE(list_empty(&blkg->q_node));
>  	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
>  	list_del_init(&blkg->q_node);
>  	hlist_del_init_rcu(&blkg->blkcg_node);
> +	list_del_init(&blkg->pending_alloc_node);
> +
> +	spin_unlock(&pending_alloc_list_lock);

Why put the whole thing inside the alloc lock?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-05 22:13                   ` Tejun Heo
@ 2012-03-06 21:09                     ` Vivek Goyal
  2012-03-06 21:20                       ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-06 21:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, hughd, avi, nate, cl, linux-kernel, dpshah, ctalbott, rni,
	Andrew Morton

On Mon, Mar 05, 2012 at 02:13:21PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Wed, Feb 29, 2012 at 12:36:39PM -0500, Vivek Goyal wrote:
> > Index: tejun-misc/block/blk-cgroup.h
> > ===================================================================
> > --- tejun-misc.orig/block/blk-cgroup.h	2012-02-28 01:29:09.238256494 -0500
> > +++ tejun-misc/block/blk-cgroup.h	2012-02-28 01:29:12.000000000 -0500
> > @@ -180,6 +180,8 @@ struct blkio_group {
> >  	struct request_queue *q;
> >  	struct list_head q_node;
> >  	struct hlist_node blkcg_node;
> > +	/* List of blkg waiting for per cpu stats memory to be allocated */
> > +	struct list_head pending_alloc_node;
> 
> Can we move this right on top of rcu_head?  It's one of the coldest
> entries.  Also, long field names tend to be a bit painful.  How about
> just alloc_node?

Hi Tejun,

Here is the new version (let's call it v2) of the patch. Please let me
know if more changes are required.

Thanks
Vivek

blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner

Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.

v2-> tejun suggested following changes. Changed the patch accordingly.
	- move alloc_node location in structure
	- reduce the size of names of some of the fields
	- Reduce the scope of locking of alloc_list_lock
	- Simplified stat_alloc_fn() by allocating stats for all
	  policies in one go and then assigning these to a group.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  128 ++++++++++++++++++++++++++++++++++++-----------------
 block/blk-cgroup.h |    2 
 2 files changed, 90 insertions(+), 40 deletions(-)

Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-03-07 02:33:14.761787970 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-07 02:33:51.042788674 -0500
@@ -30,6 +30,15 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+/* List of groups pending per cpu stats allocation */
+static DEFINE_SPINLOCK(alloc_list_lock);
+static LIST_HEAD(alloc_list);
+
+/* Array of per cpu stat pointers allocated for blk groups */
+static void *pcpu_stats[BLKIO_NR_POLICIES];
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +455,9 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +475,59 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	struct blkio_group *blkg, *n;
+	int i;
+
+alloc_stats:
+	spin_lock_irq(&alloc_list_lock);
+		if (list_empty(&alloc_list)) {
+			/* Nothing to do */
+			spin_unlock_irq(&alloc_list_lock);
+			return;
+		}
+	spin_unlock_irq(&alloc_list_lock);
+
+	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+		if (pcpu_stats[i] != NULL)
+			continue;
+
+		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
+		if (pcpu_stats[i] == NULL)
+			goto alloc_stats;
+	}
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&alloc_list_lock);
+
+	list_for_each_entry_safe(blkg, n, &alloc_list, alloc_node) {
+		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+			struct blkio_policy_type *pol = blkio_policy[i];
+			struct blkg_policy_data *pd;
+
+			if (!pol)
+				continue;
+
+			if (!blkg->pd[i])
+				continue;
+
+			pd = blkg->pd[i];
+			if (pd->stats_cpu)
+				continue;
+
+			pd->stats_cpu = pcpu_stats[i];
+			pcpu_stats[i] = NULL;
+		}
+		list_del_init(&blkg->alloc_node);
+		break;
+	}
+	spin_unlock(&alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+	goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -491,9 +559,6 @@ static void blkg_free(struct blkio_group
  * @q: request_queue the new blkg is associated with
  *
  * Allocate a new blkg assocating @blkcg and @q.
- *
- * FIXME: Should be called with queue locked but currently isn't due to
- *        percpu stat breakage.
  */
 static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 				      struct request_queue *q)
@@ -509,6 +574,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +596,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +615,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +639,27 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
-
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-
 	spin_unlock(&blkcg->lock);
+
+	spin_lock(&alloc_list_lock);
+	list_add(&blkg->alloc_node, &alloc_list);
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_work(system_nrt_wq, &blkio_stat_alloc_work);
+	spin_unlock(&alloc_list_lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -654,6 +692,10 @@ static void blkg_destroy(struct blkio_gr
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
+	spin_lock(&alloc_list_lock);
+	list_del_init(&blkg->alloc_node);
+	spin_unlock(&alloc_list_lock);
+
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -752,6 +794,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -883,6 +928,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);
Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-03-07 02:33:14.761787970 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-03-07 02:33:51.042788674 -0500
@@ -190,6 +190,8 @@ struct blkio_group {
 	spinlock_t stats_lock;
 	struct blkg_policy_data *pd[BLKIO_NR_POLICIES];
 
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head alloc_node;
 	struct rcu_head rcu_head;
 };
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-06 21:09                     ` Vivek Goyal
@ 2012-03-06 21:20                       ` Andrew Morton
  2012-03-06 21:34                         ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-03-06 21:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Tue, 6 Mar 2012 16:09:55 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

>
> ...
>
> blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner
> 
> Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
> IO path there are times when we want GFP_NOIO semantics. As there is no
> way to pass the allocation flags to alloc_percpu(), this patch delays the
> allocation of stats using a worker thread.
> 
> v2-> tejun suggested following changes. Changed the patch accordingly.
> 	- move alloc_node location in structure
> 	- reduce the size of names of some of the fields
> 	- Reduce the scope of locking of alloc_list_lock
> 	- Simplified stat_alloc_fn() by allocating stats for all
> 	  policies in one go and then assigning these to a group.

<takes a look to see if he can understand some block stuff>

<decides he can't>

>
> ...
>
> @@ -30,6 +30,15 @@ static LIST_HEAD(blkio_list);
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
>  
> +/* List of groups pending per cpu stats allocation */
> +static DEFINE_SPINLOCK(alloc_list_lock);
> +static LIST_HEAD(alloc_list);
> +
> +/* Array of per cpu stat pointers allocated for blk groups */
> +static void *pcpu_stats[BLKIO_NR_POLICIES];
> +static void blkio_stat_alloc_fn(struct work_struct *);
> +static DECLARE_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
> +
>  struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
>  EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>  
> @@ -391,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
>  	struct blkio_group_stats_cpu *stats_cpu;
>  	unsigned long flags;
>  
> +	if (pd->stats_cpu == NULL)
> +		return;

Maybe add a comment explaining how this comes about?  It isn't very
obvious..

>  	/*
>  	 * Disabling interrupts to provide mutual exclusion between two
>  	 * writes on same cpu. It probably is not needed for 64bit. Not
> @@ -443,6 +455,9 @@ void blkiocg_update_io_merged_stats(stru
>  	struct blkio_group_stats_cpu *stats_cpu;
>  	unsigned long flags;
>  
> +	if (pd->stats_cpu == NULL)
> +		return;
> +
>  	/*
>  	 * Disabling interrupts to provide mutual exclusion between two
>  	 * writes on same cpu. It probably is not needed for 64bit. Not
> @@ -460,6 +475,59 @@ void blkiocg_update_io_merged_stats(stru
>  }
>  EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
>  
> +static void blkio_stat_alloc_fn(struct work_struct *work)
> +{
> +
> +	struct blkio_group *blkg, *n;
> +	int i;
> +
> +alloc_stats:
> +	spin_lock_irq(&alloc_list_lock);
> +		if (list_empty(&alloc_list)) {
> +			/* Nothing to do */

That's not a very helpful comment, given that we weren't told what the
function is supposed to do in the first place.

> +			spin_unlock_irq(&alloc_list_lock);
> +			return;
> +		}
> +	spin_unlock_irq(&alloc_list_lock);

Interesting code layout - I rather like it!

> +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +		if (pcpu_stats[i] != NULL)
> +			continue;
> +
> +		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
> +		if (pcpu_stats[i] == NULL)
> +			goto alloc_stats;

hoo boy that looks like an infinite loop.  What's going on here?

> +	}
> +
> +	spin_lock_irq(&blkio_list_lock);
> +	spin_lock(&alloc_list_lock);
> +
> +	list_for_each_entry_safe(blkg, n, &alloc_list, alloc_node) {
> +		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +			struct blkio_policy_type *pol = blkio_policy[i];
> +			struct blkg_policy_data *pd;
> +
> +			if (!pol)
> +				continue;
> +
> +			if (!blkg->pd[i])
> +				continue;
> +
> +			pd = blkg->pd[i];
> +			if (pd->stats_cpu)
> +				continue;
> +
> +			pd->stats_cpu = pcpu_stats[i];
> +			pcpu_stats[i] = NULL;
> +		}
> +		list_del_init(&blkg->alloc_node);
> +		break;
> +	}
> +	spin_unlock(&alloc_list_lock);
> +	spin_unlock_irq(&blkio_list_lock);
> +	goto alloc_stats;
> +}

So the function runs until alloc_list is empty.  Very mysterious.

>
> ...
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-06 21:20                       ` Andrew Morton
@ 2012-03-06 21:34                         ` Vivek Goyal
  2012-03-06 21:55                           ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-06 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Tue, Mar 06, 2012 at 01:20:34PM -0800, Andrew Morton wrote:

[..]
> > @@ -391,6 +400,9 @@ void blkiocg_update_dispatch_stats(struc
> >  	struct blkio_group_stats_cpu *stats_cpu;
> >  	unsigned long flags;
> >  
> > +	if (pd->stats_cpu == NULL)
> > +		return;
> 
> Maybe add a comment explaining how this comes about?  It isn't very
> obvious..
> 

Will do. This basically says that if stats are not allocated yet, return
and don't try to account anything.

[..]
> > +static void blkio_stat_alloc_fn(struct work_struct *work)
> > +{
> > +
> > +	struct blkio_group *blkg, *n;
> > +	int i;
> > +
> > +alloc_stats:
> > +	spin_lock_irq(&alloc_list_lock);
> > +		if (list_empty(&alloc_list)) {
> > +			/* Nothing to do */
> 
> That's not a very helpful comment, given that we weren't told what the
> function is supposed to do in the first place.

Will add some comments. Once we create a new group, we add it to the
linked list and wake up the worker thread. This is the work function which
is run by the worker thread. Once it finds that the list is empty, it knows
there are no groups waiting for allocation and the worker exits.

> > +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> > +		if (pcpu_stats[i] != NULL)
> > +			continue;
> > +
> > +		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
> > +		if (pcpu_stats[i] == NULL)
> > +			goto alloc_stats;
> 
> hoo boy that looks like an infinite loop.  What's going on here?

If the allocation fails, I am trying to allocate it again in an infinite
loop. What should I do? Try again after sleeping a bit? Or give up after a
certain number of tries? This is in worker thread context though, so the
main IO path is not impacted.

[..]
> > +		}
> > +		list_del_init(&blkg->alloc_node);
> > +		break;
> > +	}
> > +	spin_unlock(&alloc_list_lock);
> > +	spin_unlock_irq(&blkio_list_lock);
> > +	goto alloc_stats;
> > +}
> 
> So the function runs until alloc_list is empty.  Very mysterious.

Yes. Once alloc_list is empty, we know there are no groups needing
per cpu stat allocation and the worker exits. Once a new group is created,
it will be added to the list and the work will be scheduled again.
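
Stripped of the blkcg specifics, the handshake looks roughly like this
(all identifiers below are illustrative; locking/irq details and error
handling are simplified compared to the real patch):

	/* struct pending_item just needs a struct list_head node member */
	static LIST_HEAD(pending_list);
	static DEFINE_SPINLOCK(pending_lock);
	static void pending_work_fn(struct work_struct *work);
	static DECLARE_WORK(pending_work, pending_work_fn);

	/* producer side: called when a new group is created */
	static void queue_pending(struct pending_item *it)
	{
		spin_lock(&pending_lock);
		list_add(&it->node, &pending_list);
		/* no-op if the work item is already queued */
		queue_work(system_nrt_wq, &pending_work);
		spin_unlock(&pending_lock);
	}

	/* consumer side: drains the list, sleeping allocations allowed here */
	static void pending_work_fn(struct work_struct *work)
	{
		spin_lock(&pending_lock);
		while (!list_empty(&pending_list)) {
			struct pending_item *it =
				list_first_entry(&pending_list,
						 struct pending_item, node);

			list_del_init(&it->node);
			spin_unlock(&pending_lock);

			allocate_stats_for(it);		/* may sleep */

			spin_lock(&pending_lock);
		}
		spin_unlock(&pending_lock);
	}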

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-06 21:34                         ` Vivek Goyal
@ 2012-03-06 21:55                           ` Andrew Morton
  2012-03-07 14:55                             ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-03-06 21:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Tue, 6 Mar 2012 16:34:37 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Mar 06, 2012 at 01:20:34PM -0800, Andrew Morton wrote:
> 
> > > +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> > > +		if (pcpu_stats[i] != NULL)
> > > +			continue;
> > > +
> > > +		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
> > > +		if (pcpu_stats[i] == NULL)
> > > +			goto alloc_stats;
> > 
> > hoo boy that looks like an infinite loop.  What's going on here?
> 
> If allocation fails, I am trying to allocate it again in infinite loop.
> What should I do? Try it after sleeping a bit? Or give up after certain
> number of tries? This is in worker thread context though, so main IO path
> is not impacted.

On a non-preemptible uniprocessor kernel it's game over, isn't it?
Unless someone frees some memory from interrupt context it is time for
the Big Red Button.

I'm not sure what to suggest, really - if an allocation failed then
there's nothing the caller can reliably do to fix that.  The best
approach is to fail all the way back to userspace with -ENOMEM.

In this context I suppose you could drop a warning into the logs then
bale out and retry on the next IO attempt.
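
Roughly, that "warn, bail out, retry on the next IO" idea would look
something like the sketch below (identifiers are made up; this is the
shape of the idea, not a patch):

	static void stat_alloc_work_fn(struct work_struct *work)
	{
		void *stats = alloc_percpu(struct blkio_group_stats_cpu);

		if (!stats) {
			pr_warn_once("blkcg: percpu stat allocation failed, deferring\n");
			return;		/* bail out; a later IO re-kicks us */
		}
		/* ... install stats into the waiting group(s) ... */
	}
	static DECLARE_WORK(stat_alloc_work, stat_alloc_work_fn);

	/* called from the stat-update path while stats are still missing */
	static void note_missing_stats(struct blkg_policy_data *pd)
	{
		if (pd->stats_cpu == NULL)
			queue_work(system_nrt_wq, &stat_alloc_work);
	}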


> [..]
> > > +		}
> > > +		list_del_init(&blkg->alloc_node);
> > > +		break;
> > > +	}
> > > +	spin_unlock(&alloc_list_lock);
> > > +	spin_unlock_irq(&blkio_list_lock);
> > > +	goto alloc_stats;
> > > +}
> > 
> > So the function runs until alloc_list is empty.  Very mysterious.
> 
> Yes. Once alloc_list is empty, we know there are no groups needing
> per cpu stat allocation and the worker exits. Once a new group is created,
> it will be added to the list and the work will be scheduled again.
> 

Oh, is that what alloc_list does ;)


btw, speaking of uniprocessor: please do perform a uniprocessor build
and see what impact the patch has upon the size(1) output for the .o
files.  We should try to minimize the pointless bloat for the UP
kernel.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-06 21:55                           ` Andrew Morton
@ 2012-03-07 14:55                             ` Vivek Goyal
  2012-03-07 17:05                               ` Tejun Heo
  2012-03-07 23:05                               ` Andrew Morton
  0 siblings, 2 replies; 56+ messages in thread
From: Vivek Goyal @ 2012-03-07 14:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Tue, Mar 06, 2012 at 01:55:31PM -0800, Andrew Morton wrote:

[..]
> > > hoo boy that looks like an infinite loop.  What's going on here?
> > 
> > If the allocation fails, I am trying to allocate it again in an infinite
> > loop. What should I do? Try again after sleeping a bit? Or give up after a
> > certain number of tries? This is in worker thread context though, so the
> > main IO path is not impacted.
> 
> On a non-preemptible uniprocessor kernel it's game over, isn't it?
> Unless someone frees some memory from interrupt context it is time for
> the Big Red Button.

Yes.  It's an issue on non-preemptible UP kernels. I changed the logic to
msleep(10) before retrying. Tested on a UP non-preemptible kernel with an
always-failing allocation and things are fine.

> 
> I'm not sure what to suggest, really - if an allocation failed then
> there's nothing the caller can reliably do to fix that.  The best
> approach is to fail all the way back to userspace with -ENOMEM.

As user space is not waiting for this allocation, -ENOMEM is really
not an option.

> 
> In this context I suppose you could drop a warning into the logs then
> bale out and retry on the next IO attempt.

Yes, that can also be done. I found msleep(10) to be an easier solution than
removing the group from the list and trying again when new IO comes in. Is
this acceptable?
 
[..]
> 
> btw, speaking of uniprocessor: please do perform a uniprocessor build
> and see what impact the patch has upon the size(1) output for the .o
> files.  We should try to minimize the pointless bloat for the UP
> kernel.

But this logic is required for both UP and SMP kernels. So the bloat on UP
is not unnecessary?

I ran size(1) on block/blk-cgroup.o with and without the patch and I can
see some bloat.

Without patch(UP kernel)
------------------------
# size block/blk-cgroup.o
   text    data     bss     dec     hex filename
  12950    5248      50   18248    4748 block/blk-cgroup.o

With patch(UP kernel)
------------------------
# size block/blk-cgroup.o
   text    data     bss     dec     hex filename
  13316    5376      58   18750    493e block/blk-cgroup.o

Here is V3 of the patch.


blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner

Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.

v2-> tejun suggested following changes. Changed the patch accordingly.
	- move alloc_node location in structure
	- reduce the size of names of some of the fields
	- Reduce the scope of locking of alloc_list_lock
	- Simplified stat_alloc_fn() by allocating stats for all
	  policies in one go and then assigning these to a group.

v3 -> Andrew suggested to put some comments in the code. Also raised
      concerns about trying to allocate infinitely in case of allocation
      failure. I have changed the logic to sleep for 10ms before retrying.
      That should take care of non-preemptible UP kernels.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  139 +++++++++++++++++++++++++++++++++++++----------------
 block/blk-cgroup.h |    2 
 2 files changed, 101 insertions(+), 40 deletions(-)

Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-03-07 20:36:44.019949136 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-07 20:37:51.188951195 -0500
@@ -30,6 +30,15 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+/* List of groups pending per cpu stats allocation */
+static DEFINE_SPINLOCK(alloc_list_lock);
+static LIST_HEAD(alloc_list);
+
+/* Array of per cpu stat pointers allocated for blk groups */
+static void *pcpu_stats[BLKIO_NR_POLICIES];
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +400,10 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +456,10 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +477,68 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+/*
+ * Worker for allocating per cpu stat for blk groups. This is scheduled
+ * once there are some groups on the alloc_list waiting for allocation
+ */
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	struct blkio_group *blkg, *n;
+	int i;
+
+alloc_stats:
+	spin_lock_irq(&alloc_list_lock);
+		if (list_empty(&alloc_list)) {
+			/* No more groups needing per cpu stat allocation */
+			spin_unlock_irq(&alloc_list_lock);
+			return;
+		}
+	spin_unlock_irq(&alloc_list_lock);
+
+	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+		if (pcpu_stats[i] != NULL)
+			continue;
+
+		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
+		/* Allocatoin failed. Try again after some time. */
+		if (pcpu_stats[i] == NULL) {
+			msleep(10);
+			goto alloc_stats;
+		}
+	}
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&alloc_list_lock);
+
+	list_for_each_entry_safe(blkg, n, &alloc_list, alloc_node) {
+		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+			struct blkio_policy_type *pol = blkio_policy[i];
+			struct blkg_policy_data *pd;
+
+			if (!pol)
+				continue;
+
+			if (!blkg->pd[i])
+				continue;
+
+			pd = blkg->pd[i];
+			if (pd->stats_cpu)
+				continue;
+
+			pd->stats_cpu = pcpu_stats[i];
+			pcpu_stats[i] = NULL;
+		}
+		list_del_init(&blkg->alloc_node);
+		break;
+	}
+	spin_unlock(&alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+
+	/* Check if there are more groups needing per cpu stat allocation. */
+	goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -491,9 +570,6 @@ static void blkg_free(struct blkio_group
  * @q: request_queue the new blkg is associated with
  *
  * Allocate a new blkg assocating @blkcg and @q.
- *
- * FIXME: Should be called with queue locked but currently isn't due to
- *        percpu stat breakage.
  */
 static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 				      struct request_queue *q)
@@ -509,6 +585,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +607,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +626,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +650,27 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
-
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-
 	spin_unlock(&blkcg->lock);
+
+	spin_lock(&alloc_list_lock);
+	list_add(&blkg->alloc_node, &alloc_list);
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_work(system_nrt_wq, &blkio_stat_alloc_work);
+	spin_unlock(&alloc_list_lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -654,6 +703,10 @@ static void blkg_destroy(struct blkio_gr
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
+	spin_lock(&alloc_list_lock);
+	list_del_init(&blkg->alloc_node);
+	spin_unlock(&alloc_list_lock);
+
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -752,6 +805,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -883,6 +939,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);
Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-03-07 20:36:44.007949131 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-03-07 20:37:51.189951195 -0500
@@ -190,6 +190,8 @@ struct blkio_group {
 	spinlock_t stats_lock;
 	struct blkg_policy_data *pd[BLKIO_NR_POLICIES];
 
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head alloc_node;
 	struct rcu_head rcu_head;
 };
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 14:55                             ` Vivek Goyal
@ 2012-03-07 17:05                               ` Tejun Heo
  2012-03-07 19:13                                 ` Vivek Goyal
  2012-03-07 23:05                               ` Andrew Morton
  1 sibling, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-07 17:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, Mar 07, 2012 at 09:55:56AM -0500, Vivek Goyal wrote:
> +/*
> + * Worker for allocating per cpu stat for blk groups. This is scheduled
> + * once there are some groups on the alloc_list waiting for allocation
> + */
> +static void blkio_stat_alloc_fn(struct work_struct *work)
> +{
> +
> +	struct blkio_group *blkg, *n;
> +	int i;
> +
> +alloc_stats:
> +	spin_lock_irq(&alloc_list_lock);
> +		if (list_empty(&alloc_list)) {
> +			/* No more groups needing per cpu stat allocation */
> +			spin_unlock_irq(&alloc_list_lock);
> +			return;
> +		}
> +	spin_unlock_irq(&alloc_list_lock);

I don't think we really need the above.  Just proceed with allocation.

> +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +		if (pcpu_stats[i] != NULL)
> +			continue;
> +
> +		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
> +		/* Allocatoin failed. Try again after some time. */
		          ^^ typo
> +		if (pcpu_stats[i] == NULL) {
> +			msleep(10);
> +			goto alloc_stats;
> +		}
> +	}

Why not queue_delayed_work(system_nrt_wq, work, msecs_to_jiffies(10))?
Why hog the worker thread?
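
For reference, a self-rearming delayed work item (generic names here,
not the actual patch) looks roughly like:

	static void retry_work_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(retry_work, retry_work_fn);

	static void retry_work_fn(struct work_struct *work)
	{
		void *p = alloc_percpu(struct blkio_group_stats_cpu);

		if (!p) {
			/* back off; the workqueue re-runs us in ~10ms */
			queue_delayed_work(system_nrt_wq, to_delayed_work(work),
					   msecs_to_jiffies(10));
			return;
		}
		/* ... hand p off ... */
	}

That way the worker thread is free between retries instead of sleeping
in msleep().
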

> +
> +	spin_lock_irq(&blkio_list_lock);
> +	spin_lock(&alloc_list_lock);
> +
> +	list_for_each_entry_safe(blkg, n, &alloc_list, alloc_node) {
> +		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +			struct blkio_policy_type *pol = blkio_policy[i];
> +			struct blkg_policy_data *pd;
> +
> +			if (!pol)
> +				continue;
> +
> +			if (!blkg->pd[i])
> +				continue;
> +
> +			pd = blkg->pd[i];
> +			if (pd->stats_cpu)
> +				continue;
> +
> +			pd->stats_cpu = pcpu_stats[i];
> +			pcpu_stats[i] = NULL;

I think this can be slightly more compact.  How about the following?

	struct blkg_policy_data *pd = blkg->pd[i];

	if (blkio_policy[i] && pd && !pd->stats_cpu)
		swap(pd->stats_cpu, pcpu_stats[i]);

> +		}
> +		list_del_init(&blkg->alloc_node);
> +		break;

If we're breaking after the first iteration, why are we using
list_for_each at all?

> +	}

and we can test whether we need to loop here, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 17:05                               ` Tejun Heo
@ 2012-03-07 19:13                                 ` Vivek Goyal
  2012-03-07 19:22                                   ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-07 19:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, Mar 07, 2012 at 09:05:44AM -0800, Tejun Heo wrote:

[..]
> > +alloc_stats:
> > +	spin_lock_irq(&alloc_list_lock);
> > +		if (list_empty(&alloc_list)) {
> > +			/* No more groups needing per cpu stat allocation */
> > +			spin_unlock_irq(&alloc_list_lock);
> > +			return;
> > +		}
> > +	spin_unlock_irq(&alloc_list_lock);
> 
> I don't think we really need the above.  Just proceed with allocation.

Hi Tejun,

Here is V4 of the patch which takes care of your suggestions. 

Thanks
Vivek

blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner

Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.

v2-> tejun suggested following changes. Changed the patch accordingly.
	- move alloc_node location in structure
	- reduce the size of names of some of the fields
	- Reduce the scope of locking of alloc_list_lock
	- Simplified stat_alloc_fn() by allocating stats for all
	  policies in one go and then assigning these to a group.

v3 -> Andrew suggested to put some comments in the code. Also raised
      concerns about trying to allocate infinitely in case of allocation
      failure. I have changed the logic to sleep for 10ms before retrying.
      That should take care of non-preemptible UP kernels.

v4 -> Tejun had more suggestions.
	- drop list_for_each_entry_all()
	- instead of msleep() use queue_delayed_work()
	- Some cleanups related to more compact coding.
 
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  136 +++++++++++++++++++++++++++++++++++++----------------
 block/blk-cgroup.h |    2 
 2 files changed, 98 insertions(+), 40 deletions(-)

Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-03-07 20:36:44.019949136 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-08 00:56:03.374961752 -0500
@@ -30,6 +30,15 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+/* List of groups pending per cpu stats allocation */
+static DEFINE_SPINLOCK(alloc_list_lock);
+static LIST_HEAD(alloc_list);
+
+/* Array of per cpu stat pointers allocated for blk groups */
+static void *pcpu_stats[BLKIO_NR_POLICIES];
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_DELAYED_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +400,10 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +456,10 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +477,65 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+/*
+ * Worker for allocating per cpu stat for blk groups. This is scheduled
+ * once there are some groups on the alloc_list waiting for allocation
+ */
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct blkio_group *blkg;
+	int i;
+	bool alloc_more = false;
+
+alloc_stats:
+	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+		if (pcpu_stats[i] != NULL)
+			continue;
+
+		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
+
+		/* Allocation failed. Try again after some time. */
+		if (pcpu_stats[i] == NULL) {
+			queue_delayed_work(system_nrt_wq, dwork,
+						msecs_to_jiffies(10));
+			return;
+		}
+	}
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&alloc_list_lock);
+
+	/* cgroup got deleted or queue exited. */
+	if (list_empty(&alloc_list)) {
+		alloc_more = false;
+		goto unlock;
+	}
+
+	blkg = list_first_entry(&alloc_list, struct blkio_group, alloc_node);
+
+	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+		struct blkg_policy_data *pd = blkg->pd[i];
+
+		if (blkio_policy[i] && pd && !pd->stats_cpu)
+			swap(pd->stats_cpu, pcpu_stats[i]);
+	}
+
+	list_del_init(&blkg->alloc_node);
+
+	if (list_empty(&alloc_list))
+		alloc_more = false;
+	else
+		alloc_more = true;
+unlock:
+	spin_unlock(&alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+
+	if (alloc_more)
+		goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -491,9 +567,6 @@ static void blkg_free(struct blkio_group
  * @q: request_queue the new blkg is associated with
  *
  * Allocate a new blkg assocating @blkcg and @q.
- *
- * FIXME: Should be called with queue locked but currently isn't due to
- *        percpu stat breakage.
  */
 static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 				      struct request_queue *q)
@@ -509,6 +582,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +604,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +623,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +647,27 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
-
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-
 	spin_unlock(&blkcg->lock);
+
+	spin_lock(&alloc_list_lock);
+	list_add(&blkg->alloc_node, &alloc_list);
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_delayed_work(system_nrt_wq, &blkio_stat_alloc_work, 0);
+	spin_unlock(&alloc_list_lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -654,6 +700,10 @@ static void blkg_destroy(struct blkio_gr
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
+	spin_lock(&alloc_list_lock);
+	list_del_init(&blkg->alloc_node);
+	spin_unlock(&alloc_list_lock);
+
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -752,6 +802,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -883,6 +936,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);
Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-03-07 20:36:44.007949131 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-03-07 20:37:51.189951195 -0500
@@ -190,6 +190,8 @@ struct blkio_group {
 	spinlock_t stats_lock;
 	struct blkg_policy_data *pd[BLKIO_NR_POLICIES];
 
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head alloc_node;
 	struct rcu_head rcu_head;
 };
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 19:13                                 ` Vivek Goyal
@ 2012-03-07 19:22                                   ` Tejun Heo
  2012-03-07 19:42                                     ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-07 19:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

Hello, Vivek.

On Wed, Mar 07, 2012 at 02:13:34PM -0500, Vivek Goyal wrote:
> +static void blkio_stat_alloc_fn(struct work_struct *work)
> +{
> +
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct blkio_group *blkg;
> +	int i;
> +	bool alloc_more = false;
> +
> +alloc_stats:
> +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +		if (pcpu_stats[i] != NULL)
> +			continue;
> +
> +		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
> +
> +		/* Allocation failed. Try again after some time. */
> +		if (pcpu_stats[i] == NULL) {
> +			queue_delayed_work(system_nrt_wq, dwork,
> +						msecs_to_jiffies(10));
> +			return;
> +		}
> +	}
> +
> +	spin_lock_irq(&blkio_list_lock);
> +	spin_lock(&alloc_list_lock);
> +
> +	/* cgroup got deleted or queue exited. */
> +	if (list_empty(&alloc_list)) {
> +		alloc_more = false;
> +		goto unlock;
> +	}
> +
> +	blkg = list_first_entry(&alloc_list, struct blkio_group, alloc_node);
> +
> +	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
> +		struct blkg_policy_data *pd = blkg->pd[i];
> +
> +		if (blkio_policy[i] && pd && !pd->stats_cpu)
> +			swap(pd->stats_cpu, pcpu_stats[i]);
> +	}
> +
> +	list_del_init(&blkg->alloc_node);
> +
> +	if (list_empty(&alloc_list))
> +		alloc_more = false;
> +	else
> +		alloc_more = true;

Sorry about being a pain in the ass for small stuff but how about the
following?

	lock_stuff()

	if (!list_empty()) {
		struct blkio_group *blkg = list_first_entry();

		for (i = 0; ..) {
			set stuff;
		}
		list_del_init();
	}
	empty = list_empty();

	unlock_stuff();

	if (!empty)
		goto repeat;

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 19:22                                   ` Tejun Heo
@ 2012-03-07 19:42                                     ` Vivek Goyal
  2012-03-07 22:56                                       ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-07 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, Mar 07, 2012 at 11:22:19AM -0800, Tejun Heo wrote:

[..]
> Sorry about being a pain in the ass for small stuff but how about the
> following?

Not a problem. Here you go.

Thanks
Vivek

blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner

Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
IO path there are times when we want GFP_NOIO semantics. As there is no
way to pass the allocation flags to alloc_percpu(), this patch delays the
allocation of stats using a worker thread.

v2-> tejun suggested following changes. Changed the patch accordingly.
	- move alloc_node location in structure
	- reduce the size of names of some of the fields
	- Reduce the scope of locking of alloc_list_lock
	- Simplified stat_alloc_fn() by allocating stats for all
	  policies in one go and then assigning these to a group.

v3 -> Andrew suggested to put some comments in the code. Also raised
      concerns about trying to allocate infinitely in case of allocation
      failure. I have changed the logic to sleep for 10ms before retrying.
      That should take care of non-preemptible UP kernels.

v4 -> Tejun had more suggestions.
	- drop list_for_each_entry_all()
	- instead of msleep() use queue_delayed_work()
	- Some cleanups related to more compact coding.
 
v5-> tejun suggested more cleanups leading to more compact code.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  130 ++++++++++++++++++++++++++++++++++++-----------------
 block/blk-cgroup.h |    2 
 2 files changed, 92 insertions(+), 40 deletions(-)

Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-03-07 20:36:44.019949136 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-08 01:25:07.436928409 -0500
@@ -30,6 +30,15 @@ static LIST_HEAD(blkio_list);
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
 
+/* List of groups pending per cpu stats allocation */
+static DEFINE_SPINLOCK(alloc_list_lock);
+static LIST_HEAD(alloc_list);
+
+/* Array of per cpu stat pointers allocated for blk groups */
+static void *pcpu_stats[BLKIO_NR_POLICIES];
+static void blkio_stat_alloc_fn(struct work_struct *);
+static DECLARE_DELAYED_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
@@ -391,6 +400,10 @@ void blkiocg_update_dispatch_stats(struc
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -443,6 +456,10 @@ void blkiocg_update_io_merged_stats(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	unsigned long flags;
 
+	/* If per cpu stats are not allocated yet, don't do any accounting. */
+	if (pd->stats_cpu == NULL)
+		return;
+
 	/*
 	 * Disabling interrupts to provide mutual exclusion between two
 	 * writes on same cpu. It probably is not needed for 64bit. Not
@@ -460,6 +477,59 @@ void blkiocg_update_io_merged_stats(stru
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+/*
+ * Worker for allocating per cpu stat for blk groups. This is scheduled
+ * once there are some groups on the alloc_list waiting for allocation
+ */
+static void blkio_stat_alloc_fn(struct work_struct *work)
+{
+
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct blkio_group *blkg;
+	int i;
+	bool empty = false;
+
+alloc_stats:
+	for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+		if (pcpu_stats[i] != NULL)
+			continue;
+
+		pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
+
+		/* Allocation failed. Try again after some time. */
+		if (pcpu_stats[i] == NULL) {
+			queue_delayed_work(system_nrt_wq, dwork,
+						msecs_to_jiffies(10));
+			return;
+		}
+	}
+
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&alloc_list_lock);
+
+	/* cgroup got deleted or queue exited. */
+	if (!list_empty(&alloc_list)) {
+		blkg = list_first_entry(&alloc_list, struct blkio_group,
+						alloc_node);
+		for (i = 0; i < BLKIO_NR_POLICIES; i++) {
+			struct blkg_policy_data *pd = blkg->pd[i];
+
+			if (blkio_policy[i] && pd && !pd->stats_cpu)
+				swap(pd->stats_cpu, pcpu_stats[i]);
+		}
+
+		list_del_init(&blkg->alloc_node);
+	}
+
+	empty = list_empty(&alloc_list);
+
+	spin_unlock(&alloc_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
+
+	if (!empty)
+		goto alloc_stats;
+}
+
 /**
  * blkg_free - free a blkg
  * @blkg: blkg to free
@@ -491,9 +561,6 @@ static void blkg_free(struct blkio_group
  * @q: request_queue the new blkg is associated with
  *
  * Allocate a new blkg assocating @blkcg and @q.
- *
- * FIXME: Should be called with queue locked but currently isn't due to
- *        percpu stat breakage.
  */
 static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 				      struct request_queue *q)
@@ -509,6 +576,7 @@ static struct blkio_group *blkg_alloc(st
 	spin_lock_init(&blkg->stats_lock);
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	INIT_LIST_HEAD(&blkg->alloc_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
 	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
@@ -530,13 +598,6 @@ static struct blkio_group *blkg_alloc(st
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
-
-		/* broken, read comment in the callsite */
-		pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-		if (!pd->stats_cpu) {
-			blkg_free(blkg);
-			return NULL;
-		}
 	}
 
 	/* invoke per-policy init */
@@ -556,7 +617,7 @@ struct blkio_group *blkg_lookup_create(s
 				       bool for_root)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	struct blkio_group *blkg, *new_blkg;
+	struct blkio_group *blkg;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -580,48 +641,27 @@ struct blkio_group *blkg_lookup_create(s
 
 	/*
 	 * Allocate and initialize.
-	 *
-	 * FIXME: The following is broken.  Percpu memory allocation
-	 * requires %GFP_KERNEL context and can't be performed from IO
-	 * path.  Allocation here should inherently be atomic and the
-	 * following lock dancing can be removed once the broken percpu
-	 * allocation is fixed.
 	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	new_blkg = blkg_alloc(blkcg, q);
-
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-
-	/* did bypass get turned on inbetween? */
-	if (unlikely(blk_queue_bypass(q)) && !for_root) {
-		blkg = ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-		goto out;
-	}
-
-	/* did someone beat us to it? */
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(blkg))
-		goto out;
+	blkg = blkg_alloc(blkcg, q);
 
 	/* did alloc fail? */
-	if (unlikely(!new_blkg)) {
+	if (unlikely(!blkg)) {
 		blkg = ERR_PTR(-ENOMEM);
 		goto out;
 	}
 
 	/* insert */
 	spin_lock(&blkcg->lock);
-	swap(blkg, new_blkg);
-
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-
 	spin_unlock(&blkcg->lock);
+
+	spin_lock(&alloc_list_lock);
+	list_add(&blkg->alloc_node, &alloc_list);
+	/* Queue per cpu stat allocation from worker thread. */
+	queue_delayed_work(system_nrt_wq, &blkio_stat_alloc_work, 0);
+	spin_unlock(&alloc_list_lock);
 out:
-	blkg_free(new_blkg);
 	return blkg;
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
@@ -654,6 +694,10 @@ static void blkg_destroy(struct blkio_gr
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
+	spin_lock(&alloc_list_lock);
+	list_del_init(&blkg->alloc_node);
+	spin_unlock(&alloc_list_lock);
+
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -752,6 +796,9 @@ static void blkio_reset_stats_cpu(struct
 	struct blkg_policy_data *pd = blkg->pd[plid];
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
+
+	if (pd->stats_cpu == NULL)
+		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
 	 * possibility of returning some inconsistent value on 32bit arch
@@ -883,6 +930,9 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
+	if (pd->stats_cpu == NULL)
+		return val;
+
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
 		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);
Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-03-07 20:36:44.007949131 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-03-07 20:37:51.189951195 -0500
@@ -190,6 +190,8 @@ struct blkio_group {
 	spinlock_t stats_lock;
 	struct blkg_policy_data *pd[BLKIO_NR_POLICIES];
 
+	/* List of blkg waiting for per cpu stats memory to be allocated */
+	struct list_head alloc_node;
 	struct rcu_head rcu_head;
 };
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 19:42                                     ` Vivek Goyal
@ 2012-03-07 22:56                                       ` Tejun Heo
  2012-03-07 23:08                                         ` Andrew Morton
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-07 22:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, Mar 07, 2012 at 02:42:29PM -0500, Vivek Goyal wrote:
> blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner
> 
> Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
> IO path there are times when we want GFP_NOIO semantics. As there is no
> way to pass the allocation flags to alloc_percpu(), this patch delays the
> allocation of stats using a worker thread.
> 
> v2-> tejun suggested following changes. Changed the patch accordingly.
> 	- move alloc_node location in structure
> 	- reduce the size of names of some of the fields
> 	- Reduce the scope of locking of alloc_list_lock
> 	- Simplified stat_alloc_fn() by allocating stats for all
> 	  policies in one go and then assigning these to a group.
> 
> v3 -> Andrew suggested to put some comments in the code. Also raised
>       concerns about trying to allocate infinitely in case of allocation
>       failure. I have changed the logic to sleep for 10ms before retrying.
>       That should take care of non-preemptible UP kernels.
> 
> v4 -> Tejun had more suggestions.
> 	- drop list_for_each_entry_all()
> 	- instead of msleep() use queue_delayed_work()
> 	- Some cleanups related to more compact coding.
>  
> v5-> tejun suggested more cleanups leading to more compact code.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

Looks good to me.  I'll roll it into the stats series.  Andrew, if you
still have objections, please scream.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 14:55                             ` Vivek Goyal
  2012-03-07 17:05                               ` Tejun Heo
@ 2012-03-07 23:05                               ` Andrew Morton
  2012-03-08 17:57                                 ` Vivek Goyal
  1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-03-07 23:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, 7 Mar 2012 09:55:56 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Mar 06, 2012 at 01:55:31PM -0800, Andrew Morton wrote:
> 
> [..]
> > > > hoo boy that looks like an infinite loop.  What's going on here?
> > > 
> > > If allocation fails, I am trying to allocate it again in an infinite loop.
> > > What should I do? Try it after sleeping a bit? Or give up after a certain
> > > number of tries? This is in worker thread context though, so the main IO
> > > path is not impacted.
> > 
> > On a non-preemptible uniprocessor kernel it's game over, isn't it? 
> > Unless someone frees some memory from interrupt context it is time for
> > the Big Red Button.
> 
> Yes.  It's an issue on non-preemptible UP kernels. I changed the logic to
> msleep(10) before retrying. Tested on a non-preemptible UP kernel with an
> always-failing allocation and things are fine.
> 
> > 
> > I'm not sure what to suggest, really - if an allocation failed then
> > there's nothing the caller can reliably do to fix that.  The best
> > approach is to fail all the way back to userspace with -ENOMEM.
> 
> As user space is not waiting for this allocation, -ENOMEM is really
> not an option.

Well, it would have to be -EIO, because the block layer is stupid about
errnos.

> > 
> > In this context I suppose you could drop a warning into the logs then
> > bale out and retry on the next IO attempt.
> 
> Yes, that can also be done. I found msleep(10) to be an easier solution than
> removing the group from the list and trying again when new IO comes in. Is
> this acceptable?

Seems a bit sucky to me.  That allocation isn't *needed* for the kernel
to be able to complete the IO operation.  It's just that we
(mis)designed things so that we're dependent upon it succeeding.  Sigh.

msleep() will cause that kernel thread to contribute to load average
when it is in this state.  Intentional?
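
(The v5 patch posted earlier in the thread drops the msleep() and instead
requeues the delayed work item on failure.  A minimal standalone sketch of
that pattern -- the foo_* names are made up for illustration, this is not
the actual blk-cgroup code:)

#include <linux/workqueue.h>
#include <linux/percpu.h>
#include <linux/jiffies.h>
#include <linux/types.h>

struct foo_stats { u64 sectors; };		/* hypothetical per-cpu payload */
static struct foo_stats __percpu *foo_pcpu;

static void foo_alloc_fn(struct work_struct *work)
{
        struct delayed_work *dwork = to_delayed_work(work);

        foo_pcpu = alloc_percpu(struct foo_stats);
        if (!foo_pcpu) {
                /* retry in 10ms; returning instead of sleeping means no
                 * kernel thread sits around inflating the load average */
                queue_delayed_work(system_nrt_wq, dwork, msecs_to_jiffies(10));
                return;
        }
        /* success: publish foo_pcpu to whoever is waiting for it */
}
static DECLARE_DELAYED_WORK(foo_alloc_work, foo_alloc_fn);

/* initial kick, e.g. from the group creation path:
 *	queue_delayed_work(system_nrt_wq, &foo_alloc_work, 0);
 */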

> [..]
> > 
> > btw, speaking of uniprocessor: please do perform a uniprocessor build
> > and see what impact the patch has upon the size(1) output for the .o
> > files.  We should try to minimize the pointless bloat for the UP
> > kernel.
> 
> But this logic is required both for UP and SMP kernels. So bloat on UP
> is not unnecessary?

UP doesn't need a per-cpu variable, hence it doesn't need to run
alloc_percpu() at all.  For UP all we need to do is to aggregate a
`struct blkio_group_stats' within `struct blkg_policy_data'?

This could still be done with suitable abstraction and wrappers. 
Whether that's desirable depends on how fat the API ends up, I guess.
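
Something along these lines, say -- a rough sketch of the wrapper idea with
made-up *_sketch names (Vivek's patch further down the thread implements
essentially this in blk-cgroup.h/.c):

#include <linux/types.h>
#include <linux/percpu.h>
#include "blk-cgroup.h"		/* for struct blkio_group_stats_cpu */

struct blkg_pd_sketch {
#ifdef CONFIG_SMP
        struct blkio_group_stats_cpu __percpu *stats_cpu; /* allocated lazily */
#else
        struct blkio_group_stats_cpu stats_cpu;	/* embedded, no allocation */
#endif
};

static inline struct blkio_group_stats_cpu *
pd_sketch_cpu_stats(struct blkg_pd_sketch *pd, int cpu)
{
#ifdef CONFIG_SMP
        return per_cpu_ptr(pd->stats_cpu, cpu);
#else
        return &pd->stats_cpu;
#endif
}

static inline bool pd_sketch_stats_ready(struct blkg_pd_sketch *pd)
{
#ifdef CONFIG_SMP
        return pd->stats_cpu != NULL;	/* worker may not have allocated yet */
#else
        return true;			/* embedded stats always exist */
#endif
}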

> I ran size(1) on block/blk-cgroup.o with and without the patch and I can
> see some bloat.
> 
> Without patch(UP kernel)
> ------------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   12950    5248      50   18248    4748 block/blk-cgroup.o
> 
> With patch(UP kernel)
> ------------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   13316    5376      58   18750    493e block/blk-cgroup.o

Yeah.

The additional text imposes runtime overhead, but there's also
additional cost from things like the extra pointer hops to access the
per-cpu data.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 22:56                                       ` Tejun Heo
@ 2012-03-07 23:08                                         ` Andrew Morton
  2012-03-07 23:15                                           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2012-03-07 23:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, 7 Mar 2012 14:56:29 -0800
Tejun Heo <tj@kernel.org> wrote:

> On Wed, Mar 07, 2012 at 02:42:29PM -0500, Vivek Goyal wrote:
> > blk-cgroup: Alloc per cpu stats from worker thread in a delayed manner
> > 
> > Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
> > IO path there are times when we want GFP_NOIO semantics. As there is no
> > way to pass the allocation flags to alloc_percpu(), this patch delays the
> > allocation of stats using a worker thread.
> > 
> > v2-> tejun suggested following changes. Changed the patch accordingly.
> > 	- move alloc_node location in structure
> > 	- reduce the size of names of some of the fields
> > 	- Reduce the scope of locking of alloc_list_lock
> > 	- Simplified stat_alloc_fn() by allocating stats for all
> > 	  policies in one go and then assigning these to a group.
> > 
> > v3 -> Andrew suggested to put some comments in the code. Also raised
> >       concerns about trying to allocate infinitely in case of allocation
> >       failure. I have changed the logic to sleep for 10ms before retrying.
> >       That should take care of non-preemptible UP kernels.
> > 
> > v4 -> Tejun had more suggestions.
> > 	- drop list_for_each_entry_all()
> > 	- instead of msleep() use queue_delayed_work()
> > 	- Some cleanups related to more compact coding.
> >  
> > v5-> tejun suggested more cleanups leading to more compact code.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> Looks good to me.  I'll roll it into the stats series.  Andrew, if you
> still have objections, please scream.

Just the "msleep adds to load average" issue needs thinking about.

It's rather a miserable patch :(

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 23:08                                         ` Andrew Morton
@ 2012-03-07 23:15                                           ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-03-07 23:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vivek Goyal, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

Hello, Andrew.

On Wed, Mar 07, 2012 at 03:08:06PM -0800, Andrew Morton wrote:
> Just the "msleep adds to load average" issue needs thinking about.

Oh, it's now requeueing the delayed work item, so that's solved.

> It's rather a miserable patch :(

Agreed.  Well, I'd like to think we tried hard enough for
alternatives.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-07 23:05                               ` Andrew Morton
@ 2012-03-08 17:57                                 ` Vivek Goyal
  2012-03-08 18:08                                   ` Tejun Heo
  2012-03-08 19:06                                   ` Andrew Morton
  0 siblings, 2 replies; 56+ messages in thread
From: Vivek Goyal @ 2012-03-08 17:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Wed, Mar 07, 2012 at 03:05:49PM -0800, Andrew Morton wrote:

[..]
> > > btw, speaking of uniprocessor: please do perform a uniprocessor build
> > > and see what impact the patch has upon the size(1) output for the .o
> > > files.  We should try to minimize the pointless bloat for the UP
> > > kernel.
> > 
> > But this logic is required both for UP and SMP kernels. So bloat on UP
> > is not unnecessary?
> 
> UP doesn't need a per-cpu variable, hence it doesn't need to run
> alloc_percpu() at all.  For UP all we need to do is to aggregate a
> `struct blkio_group_stats' within `struct blkg_policy_data'?
> 
> This could still be done with suitable abstraction and wrappers. 
> Whether that's desirable depends on how fat the API ends up, I guess.
> 

Ok, here is the patch which gets rid of allocating per cpu stats in case
of UP. Here are the size stats with and without patch.

Without patch (UP kernel)
-------------------------
# size block/blk-cgroup.o
   text    data     bss     dec     hex filename
  13377    5504      58   18939    49fb block/blk-cgroup.o

With patch (UP kernel)
----------------------
# size block/blk-cgroup.o
   text    data     bss     dec     hex filename
  12678    5312      49   18039    4677 block/blk-cgroup.o

Do you think it is worth introducing these ifdefs.

Anyway, I am assuming that the optimization for the UP issue is not blocking
the acceptance of the other alloc-per-cpu patch. I have replaced msleep()
with queue_delayed_work(). Hopefully it is a little less miserable now.

Tejun, I noticed that in UP case, once in a while cgroup removal is
hanging. Looks like it is hung in cgroup_rmdir() somewhere. I will debug
more to find out what is happening. May be blkcg->refcount issue.

Thanks
Vivek

---
 block/blk-cgroup.c |   65 +++++++++++++++++++++++++++++++++++++++++++++--------
 block/blk-cgroup.h |    6 ++++
 2 files changed, 61 insertions(+), 10 deletions(-)

Index: tejun-misc/block/blk-cgroup.h
===================================================================
--- tejun-misc.orig/block/blk-cgroup.h	2012-03-08 23:28:17.000000000 -0500
+++ tejun-misc/block/blk-cgroup.h	2012-03-08 23:37:25.310887020 -0500
@@ -168,9 +168,13 @@ struct blkg_policy_data {
 	struct blkio_group_conf conf;
 
 	struct blkio_group_stats stats;
+
+#ifdef CONFIG_SMP
 	/* Per cpu stats pointer */
 	struct blkio_group_stats_cpu __percpu *stats_cpu;
-
+#else
+	struct blkio_group_stats_cpu stats_cpu;
+#endif
 	/* pol->pdata_size bytes of private data used by policy impl */
 	char pdata[] __aligned(__alignof__(unsigned long long));
 };
Index: tejun-misc/block/blk-cgroup.c
===================================================================
--- tejun-misc.orig/block/blk-cgroup.c	2012-03-08 23:37:14.079886676 -0500
+++ tejun-misc/block/blk-cgroup.c	2012-03-08 23:37:55.104888008 -0500
@@ -35,9 +35,11 @@ static DEFINE_SPINLOCK(alloc_list_lock);
 static LIST_HEAD(alloc_list);
 
 /* Array of per cpu stat pointers allocated for blk groups */
+#ifdef CONFIG_SMP
 static void *pcpu_stats[BLKIO_NR_POLICIES];
 static void blkio_stat_alloc_fn(struct work_struct *);
 static DECLARE_DELAYED_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
+#endif
 
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
@@ -73,6 +75,42 @@ struct cgroup_subsys blkio_subsys = {
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
+#ifdef CONFIG_SMP
+static inline struct blkio_group_stats_cpu *
+pd_this_cpu_stat_ptr(struct blkg_policy_data *pd)
+{
+	return this_cpu_ptr(pd->stats_cpu);
+}
+
+static inline struct blkio_group_stats_cpu *
+pd_cpu_stat_ptr(struct blkg_policy_data *pd, int cpu)
+{
+	return per_cpu_ptr(pd->stats_cpu, cpu);
+}
+
+static inline bool pd_stat_cpu_valid(struct blkg_policy_data *pd)
+{
+	return pd->stats_cpu ? true : false;
+}
+#else /* UP */
+static inline struct blkio_group_stats_cpu *
+pd_this_cpu_stat_ptr(struct blkg_policy_data *pd)
+{
+	return &pd->stats_cpu;
+}
+
+static inline struct blkio_group_stats_cpu *
+pd_cpu_stat_ptr(struct blkg_policy_data *pd, int cpu)
+{
+	return &pd->stats_cpu;
+}
+
+static inline bool pd_stat_cpu_valid(struct blkg_policy_data *pd)
+{
+	return true;
+}
+#endif /* CONFIG_SMP */
+
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 {
 	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
@@ -401,7 +439,7 @@ void blkiocg_update_dispatch_stats(struc
 	unsigned long flags;
 
 	/* If per cpu stats are not allocated yet, don't do any accounting. */
-	if (pd->stats_cpu == NULL)
+	if (!pd_stat_cpu_valid(pd))
 		return;
 
 	/*
@@ -411,7 +449,7 @@ void blkiocg_update_dispatch_stats(struc
 	 */
 	local_irq_save(flags);
 
-	stats_cpu = this_cpu_ptr(pd->stats_cpu);
+	stats_cpu = pd_this_cpu_stat_ptr(pd);
 
 	u64_stats_update_begin(&stats_cpu->syncp);
 	stats_cpu->sectors += bytes >> 9;
@@ -457,7 +495,7 @@ void blkiocg_update_io_merged_stats(stru
 	unsigned long flags;
 
 	/* If per cpu stats are not allocated yet, don't do any accounting. */
-	if (pd->stats_cpu == NULL)
+	if (!pd_stat_cpu_valid(pd))
 		return;
 
 	/*
@@ -467,7 +505,7 @@ void blkiocg_update_io_merged_stats(stru
 	 */
 	local_irq_save(flags);
 
-	stats_cpu = this_cpu_ptr(pd->stats_cpu);
+	stats_cpu = pd_this_cpu_stat_ptr(pd);
 
 	u64_stats_update_begin(&stats_cpu->syncp);
 	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_MERGED], 1,
@@ -481,6 +519,7 @@ EXPORT_SYMBOL_GPL(blkiocg_update_io_merg
  * Worker for allocating per cpu stat for blk groups. This is scheduled
  * once there are some groups on the alloc_list waiting for allocation
  */
+#ifdef CONFIG_SMP
 static void blkio_stat_alloc_fn(struct work_struct *work)
 {
 
@@ -530,6 +569,7 @@ alloc_stats:
 	if (!empty)
 		goto alloc_stats;
 }
+#endif
 
 /**
  * blkg_free - free a blkg
@@ -548,7 +588,9 @@ static void blkg_free(struct blkio_group
 		struct blkg_policy_data *pd = blkg->pd[i];
 
 		if (pd) {
+#ifdef CONFIG_SMP
 			free_percpu(pd->stats_cpu);
+#endif
 			kfree(pd);
 		}
 	}
@@ -657,11 +699,15 @@ struct blkio_group *blkg_lookup_create(s
 	list_add(&blkg->q_node, &q->blkg_list);
 	spin_unlock(&blkcg->lock);
 
+	/* In case of UP, stats are not per cpu */
+#ifdef CONFIG_SMP
 	spin_lock(&alloc_list_lock);
 	list_add(&blkg->alloc_node, &alloc_list);
 	/* Queue per cpu stat allocation from worker thread. */
 	queue_delayed_work(system_nrt_wq, &blkio_stat_alloc_work, 0);
 	spin_unlock(&alloc_list_lock);
+#endif
+
 out:
 	return blkg;
 }
@@ -730,9 +776,10 @@ void update_root_blkg_pd(struct request_
 	pd = kzalloc(sizeof(*pd) + pol->pdata_size, GFP_KERNEL);
 	WARN_ON_ONCE(!pd);
 
+#ifdef CONFIG_SMP
 	pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
 	WARN_ON_ONCE(!pd->stats_cpu);
-
+#endif
 	blkg->pd[plid] = pd;
 	pd->blkg = blkg;
 	pol->ops.blkio_init_group_fn(blkg);
@@ -798,7 +845,7 @@ static void blkio_reset_stats_cpu(struct
 	struct blkio_group_stats_cpu *stats_cpu;
 	int i, j, k;
 
-	if (pd->stats_cpu == NULL)
+	if (!pd_stat_cpu_valid(pd))
 		return;
 	/*
 	 * Note: On 64 bit arch this should not be an issue. This has the
@@ -812,7 +859,7 @@ static void blkio_reset_stats_cpu(struct
 	 * unless this becomes a real issue.
 	 */
 	for_each_possible_cpu(i) {
-		stats_cpu = per_cpu_ptr(pd->stats_cpu, i);
+		stats_cpu = pd_cpu_stat_ptr(pd, i);
 		stats_cpu->sectors = 0;
 		for(j = 0; j < BLKIO_STAT_CPU_NR; j++)
 			for (k = 0; k < BLKIO_STAT_TOTAL; k++)
@@ -931,12 +978,12 @@ static uint64_t blkio_read_stat_cpu(stru
 	struct blkio_group_stats_cpu *stats_cpu;
 	u64 val = 0, tval;
 
-	if (pd->stats_cpu == NULL)
+	if (!pd_stat_cpu_valid(pd))
 		return val;
 
 	for_each_possible_cpu(cpu) {
 		unsigned int start;
-		stats_cpu = per_cpu_ptr(pd->stats_cpu, cpu);
+		stats_cpu = pd_cpu_stat_ptr(pd, cpu);
 
 		do {
 			start = u64_stats_fetch_begin(&stats_cpu->syncp);



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 17:57                                 ` Vivek Goyal
@ 2012-03-08 18:08                                   ` Tejun Heo
  2012-03-08 18:11                                     ` Tejun Heo
  2012-03-08 20:16                                     ` Vivek Goyal
  2012-03-08 19:06                                   ` Andrew Morton
  1 sibling, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2012-03-08 18:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

Hello, Vivek.

On Thu, Mar 08, 2012 at 12:57:08PM -0500, Vivek Goyal wrote:
> Ok, here is the patch which gets rid of allocating per cpu stats in case
> of UP. Here are the size stats with and without patch.
> 
> Without patch (UP kernel)
> -------------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   13377    5504      58   18939    49fb block/blk-cgroup.o
> 
> With patch (UP kernel)
> ----------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   12678    5312      49   18039    4677 block/blk-cgroup.o
> 
> Do you think it is worth introducing these ifdefs.

I'd rather not.  The whole thing goes away if blkio cgroup is
configured out so it's not like we're adding an unconditional burden
to minimal configurations.  Let's keep things simple.

> Anyway, I am assuming that the optimization for the UP issue is not blocking
> the acceptance of the other alloc-per-cpu patch. I have replaced msleep()
> with queue_delayed_work(). Hopefully it is a little less miserable now.

Yeah, I was gonna send the merged patchset yesterday.  Just got caught
up in something else.

> Tejun, I noticed that in UP case, once in a while cgroup removal is
> hanging. Looks like it is hung in cgroup_rmdir() somewhere. I will debug
> more to find out what is happening. May be blkcg->refcount issue.

It's probably from something forgetting to put cgroup and pre_destroy
waiting for it.  Such bugs would have been masked before but now show
up as stalls during rmdir.  I'm not too happy with how cgroup is
handling cgroup file additions and removals and hoping to implement
proper 'sever' semantics similar to that of sysfs.  In the longer
term, such behavior should go away but for now we'll just have to hunt
down the actual bug to avoid stalls (which we have to do anyway but
it's more visible now).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 18:08                                   ` Tejun Heo
@ 2012-03-08 18:11                                     ` Tejun Heo
  2012-03-08 18:22                                       ` Vivek Goyal
  2012-03-08 20:16                                     ` Vivek Goyal
  1 sibling, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-08 18:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 10:08:33AM -0800, Tejun Heo wrote:
> It's probably from something forgetting to put cgroup and pre_destroy
> waiting for it.  Such bugs would have been masked before but now show
> up as stalls during rmdir.  I'm not too happy with how cgroup is
> handling cgroup file additions and removals and hoping to implement
> proper 'sever' semantics similar to that of sysfs.  In the longer
> term, such behavior should go away but for now we'll just have to hunt
> down the actual bug to avoid stalls (which we have to do anyway but
> it's more visible now).

Hmm... I'll probably need it for dynamic file additions and removals
for dynamic policy updates && the blkcg changes are scheduled for
3.5-rc1 window (not 3.4-rc1), so we should have enough time to resolve
this.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 18:11                                     ` Tejun Heo
@ 2012-03-08 18:22                                       ` Vivek Goyal
  2012-03-08 18:27                                         ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-08 18:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 10:11:25AM -0800, Tejun Heo wrote:
> On Thu, Mar 08, 2012 at 10:08:33AM -0800, Tejun Heo wrote:
> > It's probably from something forgetting to put cgroup and pre_destroy
> > waiting for it.  Such bugs would have been masked before but now show
> > up as stalls during rmdir.  I'm not too happy with how cgroup is
> > handling cgroup file additions and removals and hoping to implement
> > proper 'sever' semantics similar to that of sysfs.  In the longer
> > term, such behavior should go away but for now we'll just have to hunt
> > down the actual bug to avoid stalls (which we have to do anyway but
> > it's more visible now).
> 
> Hmm... I'll probably need it for dynamic file additions and removals
> for dynamic policy updates && the blkcg changes are scheduled for
> 3.5-rc1 window (not 3.4-rc1), so we should have enough time to resolve
> this.

Does that mean in 3.4 we will continue to kill throttling groups and lose
all the policies and stats upon elevator change? So 3.4 will be one odd
release w.r.t this user visible change.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 18:22                                       ` Vivek Goyal
@ 2012-03-08 18:27                                         ` Tejun Heo
  2012-03-15 16:48                                           ` Vivek Goyal
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-08 18:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 01:22:45PM -0500, Vivek Goyal wrote:
> Does that mean in 3.4 we will continue to kill throttling groups and lose
> all the policies and stats upon elevator change? So 3.4 will be one odd
> release w.r.t this user visible change.

The whole thing is scheduled for 3.5, so no worries.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 17:57                                 ` Vivek Goyal
  2012-03-08 18:08                                   ` Tejun Heo
@ 2012-03-08 19:06                                   ` Andrew Morton
  1 sibling, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2012-03-08 19:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, 8 Mar 2012 12:57:08 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Wed, Mar 07, 2012 at 03:05:49PM -0800, Andrew Morton wrote:
> 
> [..]
> > > > btw, speaking of uniprocessor: please do perform a uniprocessor build
> > > > and see what impact the patch has upon the size(1) output for the .o
> > > > files.  We should try to minimize the pointless bloat for the UP
> > > > kernel.
> > > 
> > > But this logic is required both for UP and SMP kernels. So bloat on UP
> > > is not unnecessary?
> > 
> > UP doesn't need a per-cpu variable, hence it doesn't need to run
> > alloc_percpu() at all.  For UP all we need to do is to aggregate a
> > `struct blkio_group_stats' within `struct blkg_policy_data'?
> > 
> > This could still be done with suitable abstraction and wrappers. 
> > Whether that's desirable depends on how fat the API ends up, I guess.
> > 
> 
> Ok, here is the patch which gets rid of allocating per cpu stats in case
> of UP. Here are the size stats with and without patch.
> 
> Without patch (UP kernel)
> -------------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   13377    5504      58   18939    49fb block/blk-cgroup.o
> 
> With patch (UP kernel)
> ----------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   12678    5312      49   18039    4677 block/blk-cgroup.o
> 
> Do you think it is worth introducing these ifdefs.

That's a fairly nice improvement and the ifdeffery isn't excessive.

I'd be concerned about having two such different code paths though: it
weakens our performance testing coverage and I expect that most core
kernel developers only test SMP.

Perhaps a better approach would be to make the percpu stuff separately
configurable, so that SMP people can test it and so that small SMP can
avoid it as well.  (Do we know that the percpu feature is a net win on
a 2-way?).

It's quite common for us to degrade UP in favor of SMP, often in the
area of percpuification.  This is bad.  But blkcg surely isn't the most
appropriate place to start addressing this :(
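
For the "separately configurable" idea above, the shape could be something
like the sketch below.  CONFIG_BLK_CGROUP_PERCPU_STATS is a made-up Kconfig
symbol and the toy_* types are placeholders; the point is only that the
split would hinge on its own option rather than on CONFIG_SMP, so SMP
builds can exercise either path:

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/string.h>
#include <linux/errno.h>

struct toy_stats { u64 sectors; };	/* stand-in for blkio_group_stats_cpu */

struct toy_pd {
#ifdef CONFIG_BLK_CGROUP_PERCPU_STATS	/* hypothetical option */
        struct toy_stats __percpu *stats;
#else
        struct toy_stats stats;
#endif
};

static int toy_pd_init_stats(struct toy_pd *pd)
{
#ifdef CONFIG_BLK_CGROUP_PERCPU_STATS
        pd->stats = alloc_percpu(struct toy_stats); /* GFP_KERNEL context only */
        return pd->stats ? 0 : -ENOMEM;
#else
        memset(&pd->stats, 0, sizeof(pd->stats));   /* plain counters, no alloc */
        return 0;
#endif
}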


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 18:08                                   ` Tejun Heo
  2012-03-08 18:11                                     ` Tejun Heo
@ 2012-03-08 20:16                                     ` Vivek Goyal
  2012-03-08 20:33                                       ` Vivek Goyal
  1 sibling, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-08 20:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 10:08:33AM -0800, Tejun Heo wrote:

[..]
> > Tejun, I noticed that in UP case, once in a while cgroup removal is
> > hanging. Looks like it is hung in cgroup_rmdir() somewhere. I will debug
> > more to find out what is happening. May be blkcg->refcount issue.
> 
> It's probably from something forgetting to put cgroup and pre_destroy
> waiting for it.  Such bugs would have been masked before but now show
> up as stalls during rmdir.

I am not sure what is happening here yet. What I have noticed is that
somebody is holding a reference on blkg->refcnt, and that's why css->refcnt
is not zero and hence rmdir is hanging.

I suspect it is the cfqq refcount on the blkg, which is not released till the
cfqq is reclaimed.

Looking at the code, in general it seems to be a problem. If a task
issues a bunch of IO, changes the cgroup and does not issue IO any more
for some time, the old cfqq will still be linked to the task's cic and
still be holding a reference to the blkg, and one can't remove the
cgroup.

We had this discussion in the past. So it looks like, to get rid of this
problem, you will have to drop the old cic->cfqq association on
cgroup change to avoid hanging rmdir.
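
In toy form the chain looks like this (plain kref and made-up toy_* names,
not the actual blkcg/cfq code): the object is only released, and the cgroup
reference only dropped, at the last put, so a long-lived cfqq keeps the
whole chain pinned:

#include <linux/kref.h>
#include <linux/kernel.h>
#include <linux/slab.h>

struct toy_blkg {
        struct kref refcnt;	/* held at creation and by each cfqq using it */
};

static void toy_blkg_release(struct kref *kref)
{
        struct toy_blkg *blkg = container_of(kref, struct toy_blkg, refcnt);

        /* only here would the css/cgroup reference finally be put,
         * allowing rmdir to complete */
        kfree(blkg);
}

static void toy_blkg_put(struct toy_blkg *blkg)
{
        /* rmdir path drops its ref; nothing happens while a cfqq holds one */
        kref_put(&blkg->refcnt, toy_blkg_release);
}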

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 20:16                                     ` Vivek Goyal
@ 2012-03-08 20:33                                       ` Vivek Goyal
  2012-03-08 20:35                                         ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-08 20:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 03:16:16PM -0500, Vivek Goyal wrote:
> On Thu, Mar 08, 2012 at 10:08:33AM -0800, Tejun Heo wrote:
> 
> [..]
> > > Tejun, I noticed that in UP case, once in a while cgroup removal is
> > > hanging. Looks like it is hung in cgroup_rmdir() somewhere. I will debug
> > > more to find out what is happening. May be blkcg->refcount issue.
> > 
> > It's probably from something forgetting to put cgroup and pre_destroy
> > waiting for it.  Such bugs would have been masked before but now show
> > up as stalls during rmdir.
> 
> I am not sure what is happening here yet. What I have noticed is that
> somebody is holding a reference on blkg->refcnt, and that's why css->refcnt
> is not zero and hence rmdir is hanging.
> 
> I suspect it is the cfqq refcount on the blkg, which is not released till the
> cfqq is reclaimed.
> 
> Looking at the code, in general it seems to be a problem. If a task
> issues a bunch of IO, changes the cgroup and does not issue IO any more
> for some time, the old cfqq will still be linked to the task's cic and
> still be holding a reference to the blkg, and one can't remove the
> cgroup.
> 
> We had this discussion in the past. So it looks like, to get rid of this
> problem, you will have to drop the old cic->cfqq association on
> cgroup change to avoid hanging rmdir.

Ok, I can confirm that it is the cfqq reference on the blkg which is the
issue. If I move my shell to a child cgroup and do some operations in the
context of the shell (like autocompletion or reading an uncached dir), IO is
issued in the shell's context; if I then move the shell out of the cgroup
and try to delete the cgroup, it hangs. Once I exit the shell, the blkg
reference is dropped and the cgroup is deleted.

So we do need to clean up the cic->cfqq association upon cgroup change synchronously.

That will still not solve the issue of a process dumping tons of
IO on device (large nr_requests) and then moving out of cgroup. Now
cgroup deletion will still hang till all the IO in the cgroup
completes.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 20:33                                       ` Vivek Goyal
@ 2012-03-08 20:35                                         ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2012-03-08 20:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

Hello, Vivek.

On Thu, Mar 08, 2012 at 03:33:31PM -0500, Vivek Goyal wrote:
> Ok, I can confirm that it is the cfqq reference on the blkg which is the
> issue. If I move my shell to a child cgroup and do some operations in the
> context of the shell (like autocompletion or reading an uncached dir), IO is
> issued in the shell's context; if I then move the shell out of the cgroup
> and try to delete the cgroup, it hangs. Once I exit the shell, the blkg
> reference is dropped and the cgroup is deleted.
> 
> So we do need to clean up the cic->cfqq association upon cgroup change synchronously.

I see.  Thanks for confirming the problem.

> That will still not solve the issue of a process dumping tons of
> IO on device (large nr_requests) and then moving out of cgroup. Now
> cgroup deletion will still hang till all the IO in the cgroup
> completes.

I don't think the latter is too much of a problem.  I've been meaning
to decouple destruction from ref putting anyway, so let's see whether
that's possible.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-08 18:27                                         ` Tejun Heo
@ 2012-03-15 16:48                                           ` Vivek Goyal
  2012-03-15 16:59                                             ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Vivek Goyal @ 2012-03-15 16:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

On Thu, Mar 08, 2012 at 10:27:12AM -0800, Tejun Heo wrote:
> On Thu, Mar 08, 2012 at 01:22:45PM -0500, Vivek Goyal wrote:
> > Does that mean in 3.4 we will continue to kill throttling groups and lose
> > all the policies and stats upon elevator change? So 3.4 will be one odd
> > release w.r.t this user visible change.
> 
> The whole thing is scheduled for 3.5, so no worries.

Tejun,

Your patches are in for-3.4/core branch of Jens's tree. So these indeed
are going in 3.4 and not 3.5. I am confused.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-15 16:48                                           ` Vivek Goyal
@ 2012-03-15 16:59                                             ` Tejun Heo
  2012-03-20 11:50                                               ` Jens Axboe
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2012-03-15 16:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, axboe, hughd, avi, nate, cl, linux-kernel, dpshah,
	ctalbott, rni

Hello,

On Thu, Mar 15, 2012 at 12:48:00PM -0400, Vivek Goyal wrote:
> Your patches are in for-3.4/core branch of Jens's tree. So these indeed
> are going in 3.4 and not 3.5. I am confused.

Yeah, that's a bit confusing but they really aren't scheduled for the
upcoming merge window.  I suppose Jens just didn't want to create
for-3.5 branch at the moment.  Jens?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock
  2012-03-15 16:59                                             ` Tejun Heo
@ 2012-03-20 11:50                                               ` Jens Axboe
  0 siblings, 0 replies; 56+ messages in thread
From: Jens Axboe @ 2012-03-20 11:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, Andrew Morton, hughd, avi, nate, cl, linux-kernel,
	dpshah, ctalbott, rni

On Thu, Mar 15 2012, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 15, 2012 at 12:48:00PM -0400, Vivek Goyal wrote:
> > Your patches are in for-3.4/core branch of Jens's tree. So these indeed
> > are going in 3.4 and not 3.5. I am confused.
> 
> Yeah, that's a bit confusing but they really aren't scheduled for the
> upcoming merge window.  I suppose Jens just didn't want to create
> for-3.5 branch at the moment.  Jens?

They are going into 3.5, I just need to shuffle the branches a bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2012-03-20 11:50 UTC | newest]

Thread overview: 56+ messages
2012-02-23 22:30 [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Tejun Heo
2012-02-23 22:30 ` [PATCH 1/8] mempool: factor out mempool_fill() Tejun Heo
2012-02-23 22:30 ` [PATCH 2/8] mempool: separate out __mempool_create() Tejun Heo
2012-02-23 22:30 ` [PATCH 3/8] mempool, percpu: implement percpu mempool Tejun Heo
2012-02-23 22:30 ` [PATCH 4/8] block: fix deadlock through percpu allocation in blk-cgroup Tejun Heo
2012-02-23 22:30 ` [PATCH 5/8] blkcg: don't use percpu for merged stats Tejun Heo
2012-02-23 22:30 ` [PATCH 6/8] blkcg: simplify stat reset Tejun Heo
2012-02-23 22:30 ` [PATCH 7/8] blkcg: restructure blkio_get_stat() Tejun Heo
2012-02-23 22:30 ` [PATCH 8/8] blkcg: remove blkio_group->stats_lock Tejun Heo
2012-02-23 22:43 ` [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats_lock Andrew Morton
2012-02-23 23:01   ` Tejun Heo
2012-02-23 23:12     ` Tejun Heo
2012-02-23 23:22       ` Andrew Morton
2012-02-23 23:24         ` Tejun Heo
2012-02-24 14:20       ` Vivek Goyal
2012-02-25 21:44         ` Tejun Heo
2012-02-27  3:11           ` Vivek Goyal
2012-02-27  9:11             ` Tejun Heo
2012-02-27 19:43               ` Vivek Goyal
2012-02-29 17:36                 ` Vivek Goyal
2012-03-05 22:13                   ` Tejun Heo
2012-03-06 21:09                     ` Vivek Goyal
2012-03-06 21:20                       ` Andrew Morton
2012-03-06 21:34                         ` Vivek Goyal
2012-03-06 21:55                           ` Andrew Morton
2012-03-07 14:55                             ` Vivek Goyal
2012-03-07 17:05                               ` Tejun Heo
2012-03-07 19:13                                 ` Vivek Goyal
2012-03-07 19:22                                   ` Tejun Heo
2012-03-07 19:42                                     ` Vivek Goyal
2012-03-07 22:56                                       ` Tejun Heo
2012-03-07 23:08                                         ` Andrew Morton
2012-03-07 23:15                                           ` Tejun Heo
2012-03-07 23:05                               ` Andrew Morton
2012-03-08 17:57                                 ` Vivek Goyal
2012-03-08 18:08                                   ` Tejun Heo
2012-03-08 18:11                                     ` Tejun Heo
2012-03-08 18:22                                       ` Vivek Goyal
2012-03-08 18:27                                         ` Tejun Heo
2012-03-15 16:48                                           ` Vivek Goyal
2012-03-15 16:59                                             ` Tejun Heo
2012-03-20 11:50                                               ` Jens Axboe
2012-03-08 20:16                                     ` Vivek Goyal
2012-03-08 20:33                                       ` Vivek Goyal
2012-03-08 20:35                                         ` Tejun Heo
2012-03-08 19:06                                   ` Andrew Morton
2012-02-25  3:44       ` Vivek Goyal
2012-02-25 21:46         ` Tejun Heo
2012-02-25 22:21           ` Tejun Heo
2012-02-27 14:25             ` Vivek Goyal
2012-02-27 14:40               ` Vivek Goyal
2012-03-05 17:45                 ` Tejun Heo
2012-02-27 18:22       ` Vivek Goyal
2012-02-29 19:03         ` Vivek Goyal
2012-03-05 17:20           ` Tejun Heo
2012-03-05 18:03             ` Vivek Goyal
