linux-kernel.vger.kernel.org archive mirror
* [patch 0/4 v2] blk-mq: use percpu_ida to manage tags
@ 2013-10-15  1:05 Shaohua Li
  2013-10-15  1:05 ` [patch 1/4 v2] percpu_ida: make percpu_ida percpu size/batch configurable Shaohua Li
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Shaohua Li @ 2013-10-15  1:05 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: kmo

Hi,
blk-mq and percpu_ida use a similar algorithm to manage tags. The difference
is that when a cpu can't allocate tags, blk-mq uses an IPI to purge remote cpu
caches, while percpu_ida purges remote cpu caches directly. In practice, the
percpu_ida approach is much faster when we can't allocate enough tags for the
percpu cache.
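
For orientation, here is a minimal sketch (not part of the patches; the pool
and function names are made up) of how a tag pool built on the existing
percpu_ida API is driven:

	/* illustrative sketch only -- not part of this series */
	#include <linux/percpu_ida.h>

	static struct percpu_ida example_pool;

	static int example_setup(unsigned long nr_tags)
	{
		/* sets up the global freelist plus per-cpu tag caches */
		return percpu_ida_init(&example_pool, nr_tags);
	}

	static void example_io(void)
	{
		int tag;

		/*
		 * With __GFP_WAIT this sleeps until a tag is available; tags
		 * cached on other cpus are stolen directly, no IPI involved.
		 */
		tag = percpu_ida_alloc(&example_pool, GFP_KERNEL);
		if (tag < 0)
			return;

		/* ... issue the request identified by 'tag' ... */

		percpu_ida_free(&example_pool, tag);
	}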

This patchset makes blk-mq use percpu_ida to manage tags, which avoids some
duplicate code and, as mentioned, performs better. To test the patches, rebase
the blk-mq tree onto the latest kernel first.

The v2 series updates some log descriptions, fixes one issue Kent pointed out,
and adds performance data.

Thanks,
Shaohua


* [patch 1/4 v2] percpu_ida: make percpu_ida percpu size/batch configurable
  2013-10-15  1:05 [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Shaohua Li
@ 2013-10-15  1:05 ` Shaohua Li
  2013-10-15  1:05 ` [patch 2/4 v2] percpu_ida: add percpu_ida_for_each_free Shaohua Li
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2013-10-15  1:05 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: kmo

[-- Attachment #1: 0001-percpu_ida-make-percpu_ida-percpu-size-batch-configu.patch --]
[-- Type: text/plain, Size: 5099 bytes --]

Make the percpu_ida percpu cache size and batch size configurable. blk-mq-tag
will use it.
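
For illustration only (the helper and the thresholds below are made up, not
from the patch), the new entry point would be used roughly like this when the
default percpu cache is too large for a shallow pool:

	/* illustrative sketch only -- not part of this patch */
	#include <linux/kernel.h>
	#include <linux/percpu_ida.h>

	static struct percpu_ida example_pool;

	static int example_pool_setup(unsigned long nr_tags)
	{
		/* e.g. cache ~1/8 of the tags per cpu, move half a cache at a time */
		unsigned long cache = max(1UL, nr_tags / 8);
		unsigned long batch = max(1UL, cache / 2);

		if (nr_tags >= 4 * IDA_DEFAULT_PCPU_SIZE)
			/* deep pool: the defaults (48/32) are fine */
			return percpu_ida_init(&example_pool, nr_tags);

		return __percpu_ida_init(&example_pool, nr_tags, cache, batch);
	}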

After blk-mq switches to percpu_ida to manage tags, performance improves. My
test was done on a 2-socket machine with 12 processes spread across the two
sockets, so any lock contention or IPI path is stressed heavily. Testing was
done with null_blk.

hw_queue_depth	nopatch iops	patch iops
64		~800k/s		~1470k/s
2048		~4470k/s	~4340k/s

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/percpu_ida.h |   18 +++++++++++++++++-
 lib/percpu_ida.c           |   28 +++++++++++-----------------
 2 files changed, 28 insertions(+), 18 deletions(-)

Index: master/include/linux/percpu_ida.h
===================================================================
--- master.orig/include/linux/percpu_ida.h	2013-10-11 14:34:40.367149789 +0800
+++ master/include/linux/percpu_ida.h	2013-10-15 08:24:39.000000000 +0800
@@ -16,6 +16,8 @@ struct percpu_ida {
 	 * percpu_ida_init()
 	 */
 	unsigned			nr_tags;
+	unsigned			percpu_max_size;
+	unsigned			percpu_batch_size;
 
 	struct percpu_ida_cpu __percpu	*tag_cpu;
 
@@ -51,10 +53,24 @@ struct percpu_ida {
 	} ____cacheline_aligned_in_smp;
 };
 
+/*
+ * Number of tags we move between the percpu freelist and the global freelist at
+ * a time
+ */
+#define IDA_DEFAULT_PCPU_BATCH_MOVE	32U
+/* Max size of percpu freelist */
+#define IDA_DEFAULT_PCPU_SIZE	((IDA_DEFAULT_PCPU_BATCH_MOVE * 3) / 2)
+
 int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
 void percpu_ida_free(struct percpu_ida *pool, unsigned tag);
 
 void percpu_ida_destroy(struct percpu_ida *pool);
-int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags);
+int __percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags,
+	unsigned long max_size, unsigned long batch_size);
+static inline int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
+{
+	return __percpu_ida_init(pool, nr_tags, IDA_DEFAULT_PCPU_SIZE,
+		IDA_DEFAULT_PCPU_BATCH_MOVE);
+}
 
 #endif /* __PERCPU_IDA_H__ */
Index: master/lib/percpu_ida.c
===================================================================
--- master.orig/lib/percpu_ida.c	2013-10-11 14:34:40.367149789 +0800
+++ master/lib/percpu_ida.c	2013-10-15 08:24:39.000000000 +0800
@@ -30,15 +30,6 @@
 #include <linux/spinlock.h>
 #include <linux/percpu_ida.h>
 
-/*
- * Number of tags we move between the percpu freelist and the global freelist at
- * a time
- */
-#define IDA_PCPU_BATCH_MOVE	32U
-
-/* Max size of percpu freelist, */
-#define IDA_PCPU_SIZE		((IDA_PCPU_BATCH_MOVE * 3) / 2)
-
 struct percpu_ida_cpu {
 	/*
 	 * Even though this is percpu, we need a lock for tag stealing by remote
@@ -78,7 +69,7 @@ static inline void steal_tags(struct per
 	struct percpu_ida_cpu *remote;
 
 	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
-	     cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2;
+	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
 	     cpus_have_tags--) {
 		cpu = cpumask_next(cpu, &pool->cpus_have_tags);
 
@@ -123,7 +114,7 @@ static inline void alloc_global_tags(str
 {
 	move_tags(tags->freelist, &tags->nr_free,
 		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
+		  min(pool->nr_free, pool->percpu_batch_size));
 }
 
 static inline unsigned alloc_local_tag(struct percpu_ida *pool,
@@ -245,17 +236,17 @@ void percpu_ida_free(struct percpu_ida *
 		wake_up(&pool->wait);
 	}
 
-	if (nr_free == IDA_PCPU_SIZE) {
+	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
 
 		/*
 		 * Global lock held and irqs disabled, don't need percpu
 		 * lock
 		 */
-		if (tags->nr_free == IDA_PCPU_SIZE) {
+		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,
 				  tags->freelist, &tags->nr_free,
-				  IDA_PCPU_BATCH_MOVE);
+				  pool->percpu_batch_size);
 
 			wake_up(&pool->wait);
 		}
@@ -292,7 +283,8 @@ EXPORT_SYMBOL_GPL(percpu_ida_destroy);
  * Allocation is percpu, but sharding is limited by nr_tags - for best
  * performance, the workload should not span more cpus than nr_tags / 128.
  */
-int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
+int __percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags,
+	unsigned long max_size, unsigned long batch_size)
 {
 	unsigned i, cpu, order;
 
@@ -301,6 +293,8 @@ int percpu_ida_init(struct percpu_ida *p
 	init_waitqueue_head(&pool->wait);
 	spin_lock_init(&pool->lock);
 	pool->nr_tags = nr_tags;
+	pool->percpu_max_size = max_size;
+	pool->percpu_batch_size = batch_size;
 
 	/* Guard against overflow */
 	if (nr_tags > (unsigned) INT_MAX + 1) {
@@ -319,7 +313,7 @@ int percpu_ida_init(struct percpu_ida *p
 	pool->nr_free = nr_tags;
 
 	pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
-				       IDA_PCPU_SIZE * sizeof(unsigned),
+				       pool->percpu_max_size * sizeof(unsigned),
 				       sizeof(unsigned));
 	if (!pool->tag_cpu)
 		goto err;
@@ -332,4 +326,4 @@ err:
 	percpu_ida_destroy(pool);
 	return -ENOMEM;
 }
-EXPORT_SYMBOL_GPL(percpu_ida_init);
+EXPORT_SYMBOL_GPL(__percpu_ida_init);



* [patch 2/4 v2] percpu_ida: add percpu_ida_for_each_free
  2013-10-15  1:05 [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Shaohua Li
  2013-10-15  1:05 ` [patch 1/4 v2] percpu_ida: make percpu_ida percpu size/batch configurable Shaohua Li
@ 2013-10-15  1:05 ` Shaohua Li
  2013-10-15  1:05 ` [patch 3/4 v2] percpu_ida: add an API to return free tags Shaohua Li
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2013-10-15  1:05 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: kmo

[-- Attachment #1: 0002-percpu_ida-add-percpu_ida_for_each_free.patch --]
[-- Type: text/plain, Size: 2394 bytes --]

Add a new API to iterate free ids. blk-mq-tag will use it.

Note that this doesn't guarantee to strictly iterate all free ids. Callers
should be aware of this. blk-mq uses it to do a sanity check for timed-out
requests, so it can tolerate the limitation.
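
A minimal sketch of the intended use (hypothetical caller, not part of the
patch): collect the ids that look free into a bitmap, much like blk-mq-tag
does in patch 4:

	/* illustrative sketch only -- not part of this patch */
	#include <linux/bitmap.h>
	#include <linux/percpu_ida.h>

	static int mark_free_id(unsigned id, void *data)
	{
		unsigned long *bitmap = data;

		__set_bit(id, bitmap);
		return 0;	/* returning non-zero stops the iteration */
	}

	static void snapshot_free_ids(struct percpu_ida *pool,
				      unsigned long *bitmap, unsigned nr_tags)
	{
		bitmap_zero(bitmap, nr_tags);
		/* best effort: ids may be missed or seen twice, see above */
		percpu_ida_for_each_free(pool, mark_free_id, bitmap);
	}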

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/percpu_ida.h |    4 ++++
 lib/percpu_ida.c           |   44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

Index: master/include/linux/percpu_ida.h
===================================================================
--- master.orig/include/linux/percpu_ida.h	2013-10-15 08:28:36.198031207 +0800
+++ master/include/linux/percpu_ida.h	2013-10-15 08:30:19.996733244 +0800
@@ -73,4 +73,8 @@ static inline int percpu_ida_init(struct
 		IDA_DEFAULT_PCPU_BATCH_MOVE);
 }
 
+typedef int (*percpu_ida_cb)(unsigned, void *);
+int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
+	void *data);
+
 #endif /* __PERCPU_IDA_H__ */
Index: master/lib/percpu_ida.c
===================================================================
--- master.orig/lib/percpu_ida.c	2013-10-15 08:28:36.198031207 +0800
+++ master/lib/percpu_ida.c	2013-10-15 08:40:06.937353699 +0800
@@ -327,3 +327,47 @@ err:
 	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(__percpu_ida_init);
+
+/**
+ * percpu_ida_for_each_free - iterate free ids of a pool
+ * @pool: pool to iterate
+ * @fn: iteration callback function
+ * @data: parameter for @fn
+ *
+ * Note that this doesn't guarantee to strictly iterate all free ids. Some
+ * free ids might be missed, some might be iterated more than once, and some
+ * might be iterated and then not stay free for long.
+ */
+int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
+	void *data)
+{
+	unsigned long flags;
+	struct percpu_ida_cpu *remote;
+	unsigned cpu, i, err = 0;
+
+	local_irq_save(flags);
+	for_each_possible_cpu(cpu) {
+		remote = per_cpu_ptr(pool->tag_cpu, cpu);
+		spin_lock(&remote->lock);
+		for (i = 0; i < remote->nr_free; i++) {
+			err = fn(remote->freelist[i], data);
+			if (err)
+				break;
+		}
+		spin_unlock(&remote->lock);
+		if (err)
+			goto out;
+	}
+
+	spin_lock(&pool->lock);
+	for (i = 0; i < pool->nr_free; i++) {
+		err = fn(pool->freelist[i], data);
+		if (err)
+			break;
+	}
+	spin_unlock(&pool->lock);
+out:
+	local_irq_restore(flags);
+	return err;
+}
+EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);



* [patch 3/4 v2] percpu_ida: add an API to return free tags
  2013-10-15  1:05 [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Shaohua Li
  2013-10-15  1:05 ` [patch 1/4 v2] percpu_ida: make percpu_ida percpu size/batch configurable Shaohua Li
  2013-10-15  1:05 ` [patch 2/4 v2] percpu_ida: add percpu_ida_for_each_free Shaohua Li
@ 2013-10-15  1:05 ` Shaohua Li
  2013-10-15  1:05 ` [patch 4/4 v2] blk-mq: switch to percpu-ida for tag management Shaohua Li
  2013-10-15 15:16 ` [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2013-10-15  1:05 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: kmo

[-- Attachment #1: 0003-percpu_ida-add-an-API-to-return-free-tags.patch --]
[-- Type: text/plain, Size: 1815 bytes --]

Add an API to return the number of free tags; blk-mq-tag will use it.
Note that this just returns a snapshot of the number of free tags. blk-mq-tag
has two uses for it. One is info output for diagnosis. The other is a quick
check for free tags when dispatching requests. Neither requires high
precision.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 include/linux/percpu_ida.h |    1 +
 lib/percpu_ida.c           |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Index: master/include/linux/percpu_ida.h
===================================================================
--- master.orig/include/linux/percpu_ida.h	2013-10-15 08:47:40.179650593 +0800
+++ master/include/linux/percpu_ida.h	2013-10-15 08:47:40.175650783 +0800
@@ -77,4 +77,5 @@ typedef int (*percpu_ida_cb)(unsigned, v
 int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
 	void *data);
 
+unsigned percpu_ida_free_tags(struct percpu_ida *pool, int cpu);
 #endif /* __PERCPU_IDA_H__ */
Index: master/lib/percpu_ida.c
===================================================================
--- master.orig/lib/percpu_ida.c	2013-10-15 08:47:40.179650593 +0800
+++ master/lib/percpu_ida.c	2013-10-15 08:47:40.175650783 +0800
@@ -371,3 +371,20 @@ out:
 	return err;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);
+
+/**
+ * percpu_ida_free_tags - return the number of free tags of a cpu or the global pool
+ * @pool: pool to check
+ * @cpu: specific cpu, or the global pool if @cpu == nr_cpu_ids
+ *
+ * Note: this just returns a snapshot of the number of free tags.
+ */
+unsigned percpu_ida_free_tags(struct percpu_ida *pool, int cpu)
+{
+	struct percpu_ida_cpu *remote;
+	if (cpu == nr_cpu_ids)
+		return pool->nr_free;
+	remote = per_cpu_ptr(pool->tag_cpu, cpu);
+	return remote->nr_free;
+}
+EXPORT_SYMBOL_GPL(percpu_ida_free_tags);



* [patch 4/4 v2] blk-mq: switch to percpu-ida for tag management
  2013-10-15  1:05 [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Shaohua Li
                   ` (2 preceding siblings ...)
  2013-10-15  1:05 ` [patch 3/4 v2] percpu_ida: add an API to return free tags Shaohua Li
@ 2013-10-15  1:05 ` Shaohua Li
  2013-10-15 15:16 ` [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2013-10-15  1:05 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: kmo

[-- Attachment #1: 0004-blk-mq-switch-to-percpu-ida-for-tag-menagement.patch --]
[-- Type: text/plain, Size: 15381 bytes --]

Use percpu_ida to manage blk-mq tags. percpu_ida uses an algorithm similar to
blk-mq-tag's. The difference is that when a cpu can't allocate tags,
blk-mq-tag uses an IPI to purge remote cpu caches, while percpu_ida purges
remote cpu caches directly. In practice (testing null_blk), the percpu_ida
approach is much faster when the total number of tags isn't enough.

My test was done on a 2-socket machine with 12 processes spread across the
two sockets, so any lock contention or IPI path is stressed heavily. Testing
was done with null_blk.

hw_queue_depth	nopatch iops	patch iops
64		~800k/s		~1470k/s
2048		~4470k/s	~4340k/s

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 block/blk-mq-tag.c |  470 +++++++----------------------------------------------
 1 file changed, 68 insertions(+), 402 deletions(-)

Index: master/block/blk-mq-tag.c
===================================================================
--- master.orig/block/blk-mq-tag.c	2013-10-15 08:55:03.054076357 +0800
+++ master/block/blk-mq-tag.c	2013-10-15 09:03:03.212045568 +0800
@@ -1,9 +1,6 @@
 #include <linux/kernel.h>
-#include <linux/threads.h>
 #include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
+#include <linux/percpu_ida.h>
 
 #include <linux/blk-mq.h>
 #include "blk.h"
@@ -11,291 +8,53 @@
 #include "blk-mq-tag.h"
 
 /*
- * Per-cpu cache entries
- */
-struct blk_mq_tag_map {
-	unsigned int nr_free;
-	unsigned int freelist[];
-};
-
-/*
  * Per tagged queue (tag address space) map
  */
 struct blk_mq_tags {
 	unsigned int nr_tags;
-	unsigned int reserved_tags;
-	unsigned int batch_move;
-	unsigned int max_cache;
-
-	struct {
-		spinlock_t lock;
-		unsigned int nr_free;
-		unsigned int *freelist;
-		unsigned int nr_reserved;
-		unsigned int *reservelist;
-		struct list_head wait;
-	} ____cacheline_aligned_in_smp;
-
-	struct blk_mq_tag_map __percpu *free_maps;
-
-	struct blk_mq_cpu_notifier cpu_notifier;
-};
+	unsigned int nr_reserved_tags;
+	unsigned int nr_batch_move;
+	unsigned int nr_max_cache;
 
-struct blk_mq_tag_wait {
-	struct list_head list;
-	struct task_struct *task;
+	struct percpu_ida free_tags;
+	struct percpu_ida reserved_tags;
 };
 
-#define DEFINE_TAG_WAIT(name)						\
-	struct blk_mq_tag_wait name = {					\
-		.list		= LIST_HEAD_INIT((name).list),		\
-		.task		= current,				\
-	}
-
-static unsigned int move_tags(unsigned int *dst, unsigned int *dst_nr,
-			      unsigned int *src, unsigned int *src_nr,
-			      unsigned int nr_to_move)
-{
-	nr_to_move = min(nr_to_move, *src_nr);
-	*src_nr -= nr_to_move;
-	memcpy(dst + *dst_nr, src + *src_nr, sizeof(int) * nr_to_move);
-	*dst_nr += nr_to_move;
-
-	return nr_to_move;
-}
-
-static void __wake_waiters(struct blk_mq_tags *tags, unsigned int nr)
-{
-	while (nr && !list_empty(&tags->wait)) {
-		struct blk_mq_tag_wait *waiter;
-
-		waiter = list_entry(tags->wait.next, struct blk_mq_tag_wait,
-					list);
-		list_del_init(&waiter->list);
-		wake_up_process(waiter->task);
-		nr--;
-	}
-}
-
-static void __blk_mq_tag_return(struct blk_mq_tags *tags,
-				struct blk_mq_tag_map *map, unsigned int nr)
-{
-	unsigned int waiters;
-
-	lockdep_assert_held(&tags->lock);
-
-	waiters = move_tags(tags->freelist, &tags->nr_free, map->freelist,
-				&map->nr_free, nr);
-	if (!list_empty(&tags->wait))
-		__wake_waiters(tags, waiters);
-}
-
-static void blk_mq_tag_return(struct blk_mq_tags *tags,
-			      struct blk_mq_tag_map *map, unsigned int nr)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&tags->lock, flags);
-	__blk_mq_tag_return(tags, map, nr);
-	spin_unlock_irqrestore(&tags->lock, flags);
-}
-
-#if NR_CPUS != 1
-static void prune_cache(void *data)
-{
-	struct blk_mq_tags *tags = data;
-	struct blk_mq_tag_map *map;
-
-	map = per_cpu_ptr(tags->free_maps, smp_processor_id());
-
-	spin_lock(&tags->lock);
-	__blk_mq_tag_return(tags, map, tags->batch_move);
-	spin_unlock(&tags->lock);
-}
-#endif
-
-static void ipi_local_caches(struct blk_mq_tags *tags, unsigned int this_cpu)
-{
-#if NR_CPUS != 1
-	cpumask_var_t ipi_mask;
-	unsigned int i, total;
-
-	/*
-	 * We could per-cpu cache this things, but overhead is probably not
-	 * large enough to care about it. If we fail, just punt to doing a
-	 * prune on all CPUs.
-	 */
-	if (!alloc_cpumask_var(&ipi_mask, GFP_ATOMIC)) {
-		smp_call_function(prune_cache, tags, 0);
-		return;
-	}
-
-	cpumask_clear(ipi_mask);
-
-	total = 0;
-	for_each_online_cpu(i) {
-		struct blk_mq_tag_map *map = per_cpu_ptr(tags->free_maps, i);
-
-		if (!map->nr_free)
-			continue;
-
-		total += map->nr_free;
-		cpumask_set_cpu(i, ipi_mask);
-
-		if (total > tags->batch_move)
-			break;
-	}
-
-	if (total) {
-		preempt_disable();
-		smp_call_function_many(ipi_mask, prune_cache, tags, 0);
-		preempt_enable();
-	}
-
-	free_cpumask_var(ipi_mask);
-#endif
-}
-
-/*
- * Wait on a free tag, move batch to map when we have it. Returns with
- * local CPU irq flags saved in 'flags'.
- */
-static void wait_on_tags(struct blk_mq_tags *tags, struct blk_mq_tag_map **map,
-			 unsigned long *flags)
-{
-	DEFINE_TAG_WAIT(wait);
-
-	do {
-		spin_lock_irqsave(&tags->lock, *flags);
-
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-
-		if (list_empty(&wait.list))
-			list_add_tail(&wait.list, &tags->wait);
-
-		*map = this_cpu_ptr(tags->free_maps);
-		if ((*map)->nr_free || tags->nr_free) {
-			if (!(*map)->nr_free) {
-				move_tags((*map)->freelist, &(*map)->nr_free,
-						tags->freelist, &tags->nr_free,
-						tags->batch_move);
-			}
-
-			if (!list_empty(&wait.list))
-				list_del(&wait.list);
-
-			spin_unlock(&tags->lock);
-			break;
-		}
-
-		spin_unlock_irqrestore(&tags->lock, *flags);
-		ipi_local_caches(tags, raw_smp_processor_id());
-		io_schedule();
-	} while (1);
-
-	__set_current_state(TASK_RUNNING);
-}
-
 void blk_mq_wait_for_tags(struct blk_mq_tags *tags)
 {
-	struct blk_mq_tag_map *map;
-	unsigned long flags;
-
-	ipi_local_caches(tags, raw_smp_processor_id());
-	wait_on_tags(tags, &map, &flags);
-	local_irq_restore(flags);
+	int tag = blk_mq_get_tag(tags, __GFP_WAIT, false);
+	blk_mq_put_tag(tags, tag);
 }
 
 bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 {
-	return !tags || tags->nr_free != 0;
+	return !tags ||
+		percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0;
 }
 
 static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
 {
-	struct blk_mq_tag_map *map;
-	unsigned int this_cpu;
-	unsigned long flags;
-	unsigned int tag;
-
-	local_irq_save(flags);
-	this_cpu = smp_processor_id();
-	map = per_cpu_ptr(tags->free_maps, this_cpu);
-
-	/*
-	 * Grab from local per-cpu cache, if we can
-	 */
-	do {
-		if (map->nr_free) {
-			map->nr_free--;
-			tag = map->freelist[map->nr_free];
-			local_irq_restore(flags);
-			return tag;
-		}
-
-		/*
-		 * Grab from device map, if we can
-		 */
-		if (tags->nr_free) {
-			spin_lock(&tags->lock);
-			move_tags(map->freelist, &map->nr_free, tags->freelist,
-					&tags->nr_free, tags->batch_move);
-			spin_unlock(&tags->lock);
-			continue;
-		}
-
-		local_irq_restore(flags);
-
-		if (!(gfp & __GFP_WAIT))
-			break;
-
-		ipi_local_caches(tags, this_cpu);
-
-		/*
-		 * All are busy, wait. Returns with irqs disabled again
-		 * and potentially new 'map' pointer.
-		 */
-		wait_on_tags(tags, &map, &flags);
-	} while (1);
+	int tag;
 
-	return BLK_MQ_TAG_FAIL;
+	tag = percpu_ida_alloc(&tags->free_tags, gfp);
+	if (tag < 0)
+		return BLK_MQ_TAG_FAIL;
+	return tag + tags->nr_reserved_tags;
 }
 
 static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
 					      gfp_t gfp)
 {
-	unsigned int tag = BLK_MQ_TAG_FAIL;
-	DEFINE_TAG_WAIT(wait);
+	int tag;
 
-	if (unlikely(!tags->reserved_tags)) {
+	if (unlikely(!tags->nr_reserved_tags)) {
 		WARN_ON_ONCE(1);
 		return BLK_MQ_TAG_FAIL;
 	}
 
-	do {
-		spin_lock_irq(&tags->lock);
-		if (tags->nr_reserved) {
-			tags->nr_reserved--;
-			tag = tags->reservelist[tags->nr_reserved];
-			break;
-		}
-
-		if (!(gfp & __GFP_WAIT))
-			break;
-
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-
-		if (list_empty(&wait.list))
-			list_add_tail(&wait.list, &tags->wait);
-
-		spin_unlock_irq(&tags->lock);
-		io_schedule();
-	} while (1);
-
-	if (!list_empty(&wait.list))
-		list_del(&wait.list);
-
-	spin_unlock_irq(&tags->lock);
+	tag = percpu_ida_alloc(&tags->reserved_tags, gfp);
+	if (tag < 0)
+		return BLK_MQ_TAG_FAIL;
 	return tag;
 }
 
@@ -309,57 +68,38 @@ unsigned int blk_mq_get_tag(struct blk_m
 
 static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)
 {
-	struct blk_mq_tag_map *map;
-	unsigned long flags;
-
 	BUG_ON(tag >= tags->nr_tags);
 
-	local_irq_save(flags);
-	map = this_cpu_ptr(tags->free_maps);
-
-	map->freelist[map->nr_free] = tag;
-	map->nr_free++;
-
-	if (map->nr_free >= tags->max_cache ||
-	    !list_empty_careful(&tags->wait)) {
-		spin_lock(&tags->lock);
-		__blk_mq_tag_return(tags, map, tags->batch_move);
-		spin_unlock(&tags->lock);
-	}
-
-	local_irq_restore(flags);
+	percpu_ida_free(&tags->free_tags, tag - tags->nr_reserved_tags);
 }
 
 static void __blk_mq_put_reserved_tag(struct blk_mq_tags *tags,
 				      unsigned int tag)
 {
-	unsigned long flags;
-
-	BUG_ON(tag >= tags->reserved_tags);
-
-	spin_lock_irqsave(&tags->lock, flags);
-	tags->reservelist[tags->nr_reserved] = tag;
-	tags->nr_reserved++;
+	BUG_ON(tag >= tags->nr_reserved_tags);
 
-	if (!list_empty(&tags->wait))
-		__wake_waiters(tags, 1);
-
-	spin_unlock_irqrestore(&tags->lock, flags);
+	percpu_ida_free(&tags->reserved_tags, tag);
 }
 
 void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)
 {
-	if (tag >= tags->reserved_tags)
+	if (tag >= tags->nr_reserved_tags)
 		__blk_mq_put_tag(tags, tag);
 	else
 		__blk_mq_put_reserved_tag(tags, tag);
 }
 
+static int __blk_mq_tag_iter(unsigned id, void *data)
+{
+	unsigned long *tag_map = data;
+	__set_bit(id, tag_map);
+	return 0;
+}
+
 void blk_mq_tag_busy_iter(struct blk_mq_tags *tags,
 			  void (*fn)(void *, unsigned long *), void *data)
 {
-	unsigned long flags, *tag_map;
-	unsigned int i, j;
+	unsigned long *tag_map;
 	size_t map_size;
 
 	map_size = ALIGN(tags->nr_tags, BITS_PER_LONG) / BITS_PER_LONG;
@@ -367,56 +107,21 @@ void blk_mq_tag_busy_iter(struct blk_mq_
 	if (!tag_map)
 		return;
 
-	local_irq_save(flags);
-
-	for_each_online_cpu(i) {
-		struct blk_mq_tag_map *map = per_cpu_ptr(tags->free_maps, i);
-
-		for (j = 0; j < map->nr_free; j++)
-			__set_bit(map->freelist[j], tag_map);
-	}
-
-	if (tags->nr_free || tags->nr_reserved) {
-		spin_lock(&tags->lock);
-
-		if (tags->nr_reserved)
-			for (j = 0; j < tags->nr_reserved; j++)
-				__set_bit(tags->reservelist[j], tag_map);
-
-		if (tags->nr_free)
-			for (j = 0; j < tags->nr_free; j++)
-				__set_bit(tags->freelist[j], tag_map);
-
-		spin_unlock(&tags->lock);
-	}
-
-	local_irq_restore(flags);
+	percpu_ida_for_each_free(&tags->free_tags, __blk_mq_tag_iter, tag_map);
+	if (tags->nr_reserved_tags)
+		percpu_ida_for_each_free(&tags->reserved_tags, __blk_mq_tag_iter,
+			tag_map);
 
 	fn(data, tag_map);
 	kfree(tag_map);
 }
 
-static void blk_mq_tag_notify(void *data, unsigned long action,
-			      unsigned int cpu)
-{
-	/*
-	 * Move entries from this CPU to global pool
-	 */
-	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
-		struct blk_mq_tags *tags = data;
-		struct blk_mq_tag_map *map = per_cpu_ptr(tags->free_maps, cpu);
-
-		if (map->nr_free)
-			blk_mq_tag_return(tags, map, map->nr_free);
-	}
-}
-
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 				     unsigned int reserved_tags, int node)
 {
 	unsigned int nr_tags, nr_cache;
 	struct blk_mq_tags *tags;
-	size_t map_size;
+	int ret;
 
 	if (total_tags > BLK_MQ_TAG_MAX) {
 		pr_err("blk-mq: tag depth too large\n");
@@ -435,104 +140,65 @@ struct blk_mq_tags *blk_mq_init_tags(uns
 	else if (nr_cache > BLK_MQ_TAG_CACHE_MAX)
 		nr_cache = BLK_MQ_TAG_CACHE_MAX;
 
-	map_size = sizeof(struct blk_mq_tag_map) + nr_cache * sizeof(int);
-	tags->free_maps = __alloc_percpu(map_size, sizeof(void *));
-	if (!tags->free_maps)
-		goto err_free_maps;
-
-	tags->freelist = kmalloc_node(sizeof(int) * nr_tags, GFP_KERNEL, node);
-	if (!tags->freelist)
-		goto err_freelist;
-
-	if (reserved_tags) {
-		tags->reservelist = kmalloc_node(sizeof(int) * reserved_tags,
-							GFP_KERNEL, node);
-		if (!tags->reservelist)
-			goto err_reservelist;
-	}
-
-	spin_lock_init(&tags->lock);
-	INIT_LIST_HEAD(&tags->wait);
 	tags->nr_tags = total_tags;
-	tags->reserved_tags = reserved_tags;
-	tags->max_cache = nr_cache;
-	tags->batch_move = max(1u, nr_cache / 2);
-
-	/*
-	 * Reserved tags are first
-	 */
-	if (reserved_tags) {
-		tags->nr_reserved = 0;
-		while (reserved_tags--) {
-			tags->reservelist[tags->nr_reserved] =
-							tags->nr_reserved;
-			tags->nr_reserved++;
-		}
-	}
+	tags->nr_reserved_tags = reserved_tags;
+	tags->nr_max_cache = nr_cache;
+	tags->nr_batch_move = max(1u, nr_cache / 2);
+
+	ret = __percpu_ida_init(&tags->free_tags, tags->nr_tags -
+				tags->nr_reserved_tags,
+				tags->nr_max_cache,
+				tags->nr_batch_move);
+	if (ret)
+		goto err_free_tags;
 
-	/*
-	 * Rest of the tags start at the queue list
-	 */
-	tags->nr_free = 0;
-	while (nr_tags--) {
-		tags->freelist[tags->nr_free] = tags->nr_free +
-							tags->nr_reserved;
-		tags->nr_free++;
+	if (reserved_tags) {
+		/*
+		 * With max_cache and batch set to 1, the allocator falls back to
+		 * uncached allocation. It's fine that reserved tag allocation is slow.
+		 */
+		ret = __percpu_ida_init(&tags->reserved_tags, reserved_tags,
+				1, 1);
+		if (ret)
+			goto err_reserved_tags;
 	}
 
-	blk_mq_init_cpu_notifier(&tags->cpu_notifier, blk_mq_tag_notify, tags);
-	blk_mq_register_cpu_notifier(&tags->cpu_notifier);
 	return tags;
 
-err_reservelist:
-	kfree(tags->freelist);
-err_freelist:
-	free_percpu(tags->free_maps);
-err_free_maps:
+err_reserved_tags:
+	percpu_ida_destroy(&tags->free_tags);
+err_free_tags:
 	kfree(tags);
 	return NULL;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	blk_mq_unregister_cpu_notifier(&tags->cpu_notifier);
-	free_percpu(tags->free_maps);
-	kfree(tags->freelist);
-	kfree(tags->reservelist);
+	percpu_ida_destroy(&tags->free_tags);
+	percpu_ida_destroy(&tags->reserved_tags);
 	kfree(tags);
 }
 
 ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page)
 {
 	char *orig_page = page;
-	unsigned long flags;
-	struct list_head *tmp;
-	int waiters;
 	int cpu;
 
 	if (!tags)
 		return 0;
 
-	spin_lock_irqsave(&tags->lock, flags);
-
 	page += sprintf(page, "nr_tags=%u, reserved_tags=%u, batch_move=%u,"
-			" max_cache=%u\n", tags->nr_tags, tags->reserved_tags,
-			tags->batch_move, tags->max_cache);
-
-	waiters = 0;
-	list_for_each(tmp, &tags->wait)
-		waiters++;
-
-	page += sprintf(page, "nr_free=%u, nr_reserved=%u, waiters=%u\n",
-			tags->nr_free, tags->nr_reserved, waiters);
+			" max_cache=%u\n", tags->nr_tags, tags->nr_reserved_tags,
+			tags->nr_batch_move, tags->nr_max_cache);
 
-	for_each_online_cpu(cpu) {
-		struct blk_mq_tag_map *map = per_cpu_ptr(tags->free_maps, cpu);
+	page += sprintf(page, "nr_free=%u, nr_reserved=%u\n",
+			percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids),
+			percpu_ida_free_tags(&tags->reserved_tags, nr_cpu_ids));
 
+	for_each_possible_cpu(cpu) {
 		page += sprintf(page, "  cpu%02u: nr_free=%u\n", cpu,
-					map->nr_free);
+				percpu_ida_free_tags(&tags->free_tags, cpu));
 	}
 
-	spin_unlock_irqrestore(&tags->lock, flags);
 	return page - orig_page;
 }



* Re: [patch 0/4 v2] blk-mq: use percpu_ida to manage tags
  2013-10-15  1:05 [patch 0/4 v2] blk-mq: use percpu_ida to manage tags Shaohua Li
                   ` (3 preceding siblings ...)
  2013-10-15  1:05 ` [patch 4/4 v2] blk-mq: switch to percpu-ida for tag management Shaohua Li
@ 2013-10-15 15:16 ` Jens Axboe
  4 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2013-10-15 15:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, kmo

On Tue, Oct 15 2013, Shaohua Li wrote:
> Hi,
> blk-mq and percpu_ida use a similar algorithm to manage tags. The difference
> is that when a cpu can't allocate tags, blk-mq uses an IPI to purge remote cpu
> caches, while percpu_ida purges remote cpu caches directly. In practice, the
> percpu_ida approach is much faster when we can't allocate enough tags for the
> percpu cache.
> 
> This patchset makes blk-mq use percpu_ida to manage tags, which avoids some
> duplicate code and, as mentioned, performs better. To test the patches, rebase
> the blk-mq tree onto the latest kernel first.
> 
> The v2 series updates some log descriptions, fixes one issue Kent pointed out,
> and adds performance data.

Thanks, I've rebased it to 3.12-rc5 and applied the series.

-- 
Jens Axboe


