* [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask
@ 2014-01-20  3:44 Nicholas A. Bellinger
  2014-01-20  3:44 ` [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers " Nicholas A. Bellinger
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-20  3:44 UTC (permalink / raw)
  To: target-devel
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Kent Overstreet, Jens Axboe, Nicholas Bellinger

From: Nicholas Bellinger <nab@linux-iscsi.org>

Hi Linus,

Here is the -v2 series for converting percpu_ida_alloc() + consumer
usage to accept the task state bitmask parameter, w/o the extra
legacy gfp_t wrapper.

As requested, the first patch contains only the parameter change
to percpu_ida_alloc() + existing consumers, and makes no semantic
or behavior change.  This patch is CC'ed for stable, and will need
some light massaging to apply back to v3.12.y.
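
For reference, the caller-side convention used throughout the series
is the following mapping (a minimal sketch; gfp_to_state() is a
hypothetical helper shown for illustration only, the patches open-code
the ternary at each call site):

    #include <linux/gfp.h>
    #include <linux/sched.h>

    /* map the legacy gfp semantics onto the new task state parameter */
    static inline int gfp_to_state(gfp_t gfp)
    {
            return (gfp & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE : TASK_RUNNING;
    }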

The second patch is a blk-mq cleanup to propagate the task state
bitmask usage up to the blk-mq vs. legacy split in blk_get_request().

The last patch fixes the original iscsi-target session reset bug
by passing TASK_INTERRUPTIBLE, and adds the signal_pending_state()
bit required in percpu_ida_alloc() code.  This is also CC'ed for
v3.12.y.

CC'ing Ingo + Peter for TASK_RUNNING + prepare_to_wait() bit.

Thank you,

--nab

Kent Overstreet (1):
  percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask

Nicholas Bellinger (2):
  blk-mq: Convert gfp_t parameters to task state bitmask
  iscsi-target: Fix connection reset hang with percpu_ida_alloc

 block/blk-core.c                         |    4 +++-
 block/blk-flush.c                        |    2 +-
 block/blk-mq-tag.c                       |   14 +++++++-------
 block/blk-mq-tag.h                       |    2 +-
 block/blk-mq.c                           |   28 ++++++++++++++--------------
 drivers/target/iscsi/iscsi_target_util.c |    8 ++++++--
 drivers/target/tcm_fc/tfc_cmd.c          |    2 +-
 drivers/vhost/scsi.c                     |    2 +-
 include/linux/blk-mq.h                   |    4 ++--
 include/linux/percpu_ida.h               |    3 ++-
 lib/percpu_ida.c                         |   17 +++++++++++------
 11 files changed, 49 insertions(+), 37 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-20  3:44 [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
@ 2014-01-20  3:44 ` Nicholas A. Bellinger
  2014-01-20 11:34   ` Peter Zijlstra
  2014-01-20  3:44 ` [PATCH-v2 2/3] blk-mq: Convert gfp_t parameters to " Nicholas A. Bellinger
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-20  3:44 UTC (permalink / raw)
  To: target-devel
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Kent Overstreet, Jens Axboe

From: Kent Overstreet <kmo@daterainc.com>

This patch changes percpu_ida_alloc() + callers to accept task state
bitmask for prepare_to_wait() for code like target/iscsi that needs
it for interruptible sleep, that is provided in a subsequent patch.

It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
and is forced to return a negative value when no tags are available.
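
Concretely, the two modes look like this at a call site (sketch only,
with error handling cut down to the minimum):

    /* process context: may block until another thread frees a tag */
    tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_UNINTERRUPTIBLE);

    /* atomic context: never sleeps, fails with -ENOSPC when empty */
    tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
    if (tag < 0)
            return NULL;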

v2 changes:
  - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
  - Drop signal_pending_state() call

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: <stable@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 block/blk-mq-tag.c                       |    6 ++++--
 drivers/target/iscsi/iscsi_target_util.c |    8 ++++++--
 drivers/target/tcm_fc/tfc_cmd.c          |    2 +-
 drivers/vhost/scsi.c                     |    2 +-
 include/linux/percpu_ida.h               |    3 ++-
 lib/percpu_ida.c                         |   12 ++++++------
 6 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d64a02f..5d70edc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -36,7 +36,8 @@ static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
 {
 	int tag;
 
-	tag = percpu_ida_alloc(&tags->free_tags, gfp);
+	tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
+			       TASK_UNINTERRUPTIBLE : TASK_RUNNING);
 	if (tag < 0)
 		return BLK_MQ_TAG_FAIL;
 	return tag + tags->nr_reserved_tags;
@@ -52,7 +53,8 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
 		return BLK_MQ_TAG_FAIL;
 	}
 
-	tag = percpu_ida_alloc(&tags->reserved_tags, gfp);
+	tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
+			       TASK_UNINTERRUPTIBLE : TASK_RUNNING);
 	if (tag < 0)
 		return BLK_MQ_TAG_FAIL;
 	return tag;
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 0819e68..9b8e1db 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -156,9 +156,13 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
 {
 	struct iscsi_cmd *cmd;
 	struct se_session *se_sess = conn->sess->se_sess;
-	int size, tag;
+	int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE :
+				TASK_RUNNING;
+
+	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
+	if (tag < 0)
+		return NULL;
 
-	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, gfp_mask);
 	size = sizeof(struct iscsi_cmd) + conn->conn_transport->priv_size;
 	cmd = (struct iscsi_cmd *)(se_sess->sess_cmd_map + (tag * size));
 	memset(cmd, 0, size);
diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index 479ec56..8b2c1aa 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -438,7 +438,7 @@ static void ft_recv_cmd(struct ft_sess *sess, struct fc_frame *fp)
 	struct se_session *se_sess = sess->se_sess;
 	int tag;
 
-	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, GFP_ATOMIC);
+	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
 	if (tag < 0)
 		goto busy;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 84488a8..2d084fb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -728,7 +728,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq,
 	}
 	se_sess = tv_nexus->tvn_se_sess;
 
-	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, GFP_ATOMIC);
+	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
 	if (tag < 0) {
 		pr_err("Unable to obtain tag for tcm_vhost_cmd\n");
 		return ERR_PTR(-ENOMEM);
diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
index 1900bd0..f5cfdd6 100644
--- a/include/linux/percpu_ida.h
+++ b/include/linux/percpu_ida.h
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 #include <linux/bitops.h>
 #include <linux/init.h>
+#include <linux/sched.h>
 #include <linux/spinlock_types.h>
 #include <linux/wait.h>
 #include <linux/cpumask.h>
@@ -61,7 +62,7 @@ struct percpu_ida {
 /* Max size of percpu freelist, */
 #define IDA_DEFAULT_PCPU_SIZE	((IDA_DEFAULT_PCPU_BATCH_MOVE * 3) / 2)
 
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
+int percpu_ida_alloc(struct percpu_ida *pool, int state);
 void percpu_ida_free(struct percpu_ida *pool, unsigned tag);
 
 void percpu_ida_destroy(struct percpu_ida *pool);
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf..a48ce2e 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -132,22 +132,22 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
- * @gfp: gfp flags
+ * @state: task state for prepare_to_wait
  *
  * Returns a tag - an integer in the range [0..nr_tags) (passed to
  * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
  *
  * Safe to be called from interrupt context (assuming it isn't passed
- * __GFP_WAIT, of course).
+ * TASK_UNINTERRUPTIBLE, of course).
  *
  * @gfp indicates whether or not to wait until a free id is available (it's not
  * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
  * however long it takes until another thread frees an id (same semantics as a
  * mempool).
  *
- * Will not fail if passed __GFP_WAIT.
+ * Will not fail if passed TASK_UNINTERRUPTIBLE.
  */
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
 	DEFINE_WAIT(wait);
 	struct percpu_ida_cpu *tags;
@@ -174,7 +174,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
 		 *
 		 * global lock held and irqs disabled, don't need percpu lock
 		 */
-		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(&pool->wait, &wait, state);
 
 		if (!tags->nr_free)
 			alloc_global_tags(pool, tags);
@@ -191,7 +191,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
 		spin_unlock(&pool->lock);
 		local_irq_restore(flags);
 
-		if (tag >= 0 || !(gfp & __GFP_WAIT))
+		if (tag >= 0 || state == TASK_RUNNING)
 			break;
 
 		schedule();
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH-v2 2/3] blk-mq: Convert gfp_t parameters to task state bitmask
  2014-01-20  3:44 [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
  2014-01-20  3:44 ` [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers " Nicholas A. Bellinger
@ 2014-01-20  3:44 ` Nicholas A. Bellinger
  2014-01-20  3:44 ` [PATCH-v2 3/3] iscsi-target: Fix connection reset hang with percpu_ida_alloc Nicholas A. Bellinger
  2014-01-22 19:58 ` [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
  3 siblings, 0 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-20  3:44 UTC (permalink / raw)
  To: target-devel
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Kent Overstreet, Jens Axboe, Nicholas Bellinger

From: Nicholas Bellinger <nab@linux-iscsi.org>

This patch propagates the use of the task state bitmask for
percpu_ida_alloc() up the blk-mq callchain, to the point in
blk_get_request() where the blk-mq vs. blk-old split occurs.

Along with the obvious parameter changes, there are two cases in
mq_flush_work() + blk_mq_make_request() where the original code
used __GFP_WAIT|GFP_ATOMIC and always expects a tag; both have
been converted to TASK_UNINTERRUPTIBLE.
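
For orientation, the task state now flows down the whole blk-mq
allocation path (chain reconstructed from the hunks below):

    blk_get_request(q, rw, gfp_mask)
      -> blk_mq_alloc_request(q, rw, state, false)
        -> blk_mq_alloc_request_pinned(q, rw, state, reserved)
          -> __blk_mq_alloc_request(hctx, state, reserved)
            -> blk_mq_get_tag(hctx->tags, state, reserved)
              -> percpu_ida_alloc(&tags->free_tags, state)

with state = (gfp_mask & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE : TASK_RUNNING
computed once at the split.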

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 block/blk-core.c       |    4 +++-
 block/blk-flush.c      |    2 +-
 block/blk-mq-tag.c     |   16 +++++++---------
 block/blk-mq-tag.h     |    2 +-
 block/blk-mq.c         |   28 ++++++++++++++--------------
 include/linux/blk-mq.h |    4 ++--
 6 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8bdd012..ab0dc9a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1112,7 +1112,9 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	if (q->mq_ops)
-		return blk_mq_alloc_request(q, rw, gfp_mask, false);
+		return blk_mq_alloc_request(q, rw, (gfp_mask & __GFP_WAIT) ?
+					    TASK_UNINTERRUPTIBLE : TASK_RUNNING,
+					    false);
 	else
 		return blk_old_get_request(q, rw, gfp_mask);
 }
diff --git a/block/blk-flush.c b/block/blk-flush.c
index fb6f3c0..8dd6ff8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -286,7 +286,7 @@ static void mq_flush_work(struct work_struct *work)
 
 	/* We don't need set REQ_FLUSH_SEQ, it's for consistency */
 	rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ,
-		__GFP_WAIT|GFP_ATOMIC, true);
+				  TASK_UNINTERRUPTIBLE, true);
 	rq->cmd_type = REQ_TYPE_FS;
 	rq->end_io = flush_end_io;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5d70edc..20777bd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -32,19 +32,18 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 		percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0;
 }
 
-static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp)
+static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, int state)
 {
 	int tag;
 
-	tag = percpu_ida_alloc(&tags->free_tags, (gfp & __GFP_WAIT) ?
-			       TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+	tag = percpu_ida_alloc(&tags->free_tags, state);
 	if (tag < 0)
 		return BLK_MQ_TAG_FAIL;
 	return tag + tags->nr_reserved_tags;
 }
 
 static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
-					      gfp_t gfp)
+					      int state)
 {
 	int tag;
 
@@ -53,19 +52,18 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags,
 		return BLK_MQ_TAG_FAIL;
 	}
 
-	tag = percpu_ida_alloc(&tags->reserved_tags, (gfp & __GFP_WAIT) ?
-			       TASK_UNINTERRUPTIBLE : TASK_RUNNING);
+	tag = percpu_ida_alloc(&tags->reserved_tags, state);
 	if (tag < 0)
 		return BLK_MQ_TAG_FAIL;
 	return tag;
 }
 
-unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved)
+unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, int state, bool reserved)
 {
 	if (!reserved)
-		return __blk_mq_get_tag(tags, gfp);
+		return __blk_mq_get_tag(tags, state);
 
-	return __blk_mq_get_reserved_tag(tags, gfp);
+	return __blk_mq_get_reserved_tag(tags, state);
 }
 
 static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 947ba2c..b3c1487 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -6,7 +6,7 @@ struct blk_mq_tags;
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
-extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved);
+extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, int state, bool reserved);
 extern void blk_mq_wait_for_tags(struct blk_mq_tags *tags);
 extern void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag);
 extern void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, void (*fn)(void *data, unsigned long *), void *data);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c79126e..80bbfbd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -75,13 +75,13 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
 		set_bit(ctx->index_hw, hctx->ctx_map);
 }
 
-static struct request *blk_mq_alloc_rq(struct blk_mq_hw_ctx *hctx, gfp_t gfp,
+static struct request *blk_mq_alloc_rq(struct blk_mq_hw_ctx *hctx, int state,
 				       bool reserved)
 {
 	struct request *rq;
 	unsigned int tag;
 
-	tag = blk_mq_get_tag(hctx->tags, gfp, reserved);
+	tag = blk_mq_get_tag(hctx->tags, state, reserved);
 	if (tag != BLK_MQ_TAG_FAIL) {
 		rq = hctx->rqs[tag];
 		rq->tag = tag;
@@ -183,13 +183,13 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 }
 
 static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx,
-					      gfp_t gfp, bool reserved)
+					      int state, bool reserved)
 {
-	return blk_mq_alloc_rq(hctx, gfp, reserved);
+	return blk_mq_alloc_rq(hctx, state, reserved);
 }
 
 static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
-						   int rw, gfp_t gfp,
+						   int rw, int state,
 						   bool reserved)
 {
 	struct request *rq;
@@ -198,14 +198,14 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
 		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
 		struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
-		rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved);
+		rq = __blk_mq_alloc_request(hctx, state, reserved);
 		if (rq) {
 			blk_mq_rq_ctx_init(q, ctx, rq, rw);
 			break;
 		}
 
 		blk_mq_put_ctx(ctx);
-		if (!(gfp & __GFP_WAIT))
+		if (state == TASK_RUNNING)
 			break;
 
 		__blk_mq_run_hw_queue(hctx);
@@ -216,28 +216,28 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
-		gfp_t gfp, bool reserved)
+		int state, bool reserved)
 {
 	struct request *rq;
 
 	if (blk_mq_queue_enter(q))
 		return NULL;
 
-	rq = blk_mq_alloc_request_pinned(q, rw, gfp, reserved);
+	rq = blk_mq_alloc_request_pinned(q, rw, state, reserved);
 	if (rq)
 		blk_mq_put_ctx(rq->mq_ctx);
 	return rq;
 }
 
 struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw,
-					      gfp_t gfp)
+					      int state)
 {
 	struct request *rq;
 
 	if (blk_mq_queue_enter(q))
 		return NULL;
 
-	rq = blk_mq_alloc_request_pinned(q, rw, gfp, true);
+	rq = blk_mq_alloc_request_pinned(q, rw, state, true);
 	if (rq)
 		blk_mq_put_ctx(rq->mq_ctx);
 	return rq;
@@ -928,14 +928,14 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	trace_block_getrq(q, bio, rw);
-	rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false);
+	rq = __blk_mq_alloc_request(hctx, TASK_RUNNING, false);
 	if (likely(rq))
 		blk_mq_rq_ctx_init(q, ctx, rq, rw);
 	else {
 		blk_mq_put_ctx(ctx);
 		trace_block_sleeprq(q, bio, rw);
-		rq = blk_mq_alloc_request_pinned(q, rw, __GFP_WAIT|GFP_ATOMIC,
-							false);
+		rq = blk_mq_alloc_request_pinned(q, rw, TASK_UNINTERRUPTIBLE,
+						 false);
 		ctx = rq->mq_ctx;
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ab0e9b2..91ee75a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -124,8 +124,8 @@ void blk_mq_insert_request(struct request_queue *, struct request *, bool);
 void blk_mq_run_queues(struct request_queue *q, bool async);
 void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp, bool reserved);
-struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, gfp_t gfp);
+struct request *blk_mq_alloc_request(struct request_queue *q, int rw, int state, bool reserved);
+struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, int state);
 struct request *blk_mq_rq_from_tag(struct request_queue *q, unsigned int tag);
 
 struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH-v2 3/3] iscsi-target: Fix connection reset hang with percpu_ida_alloc
  2014-01-20  3:44 [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
  2014-01-20  3:44 ` [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers " Nicholas A. Bellinger
  2014-01-20  3:44 ` [PATCH-v2 2/3] blk-mq: Convert gfp_t parameters to " Nicholas A. Bellinger
@ 2014-01-20  3:44 ` Nicholas A. Bellinger
  2014-01-22 19:58 ` [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
  3 siblings, 0 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-20  3:44 UTC (permalink / raw)
  To: target-devel
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Kent Overstreet, Jens Axboe, Nicholas Bellinger

From: Nicholas Bellinger <nab@linux-iscsi.org>

This patch addresses a bug where connection reset would hang
indefinitely once percpu_ida_alloc() was starved for tags, due
to the fact that it always assumed uninterruptible sleep mode.

So now make percpu_ida_alloc() check signal_pending_state() in
order to make interruptible sleep optional, and convert
iscsit_allocate_cmd() to set TASK_INTERRUPTIBLE for GFP_KERNEL,
or TASK_RUNNING for GFP_ATOMIC.
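
The resulting slow path follows the standard signal-aware wait pattern
(condensed sketch of percpu_ida_alloc() after this patch; try_alloc()
stands in for the real lock/move/steal body):

    DEFINE_WAIT(wait);
    int tag;

    while (1) {
            prepare_to_wait(&pool->wait, &wait, state);
            tag = try_alloc(pool);          /* stands in for lock/move/steal */
            if (tag >= 0 || state == TASK_RUNNING)
                    break;
            if (signal_pending_state(state, current)) {
                    tag = -ERESTARTSYS;     /* a pending signal ends the wait */
                    break;
            }
            schedule();
    }
    finish_wait(&pool->wait, &wait);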

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: <stable@vger.kernel.org> #3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 drivers/target/iscsi/iscsi_target_util.c |    2 +-
 lib/percpu_ida.c                         |    9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 9b8e1db..5477eca 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -156,7 +156,7 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, gfp_t gfp_mask)
 {
 	struct iscsi_cmd *cmd;
 	struct se_session *se_sess = conn->sess->se_sess;
-	int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_UNINTERRUPTIBLE :
+	int size, tag, state = (gfp_mask & __GFP_WAIT) ? TASK_INTERRUPTIBLE :
 				TASK_RUNNING;
 
 	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index a48ce2e..a667110 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -138,14 +138,14 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
  *
  * Safe to be called from interrupt context (assuming it isn't passed
- * TASK_UNINTERRUPTIBLE, of course).
+ * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
  *
  * @gfp indicates whether or not to wait until a free id is available (it's not
  * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
  * however long it takes until another thread frees an id (same semantics as a
  * mempool).
  *
- * Will not fail if passed TASK_UNINTERRUPTIBLE.
+ * Will not fail if passed TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE.
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
@@ -194,6 +194,11 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
 		if (tag >= 0 || state == TASK_RUNNING)
 			break;
 
+		if (signal_pending_state(state, current)) {
+			tag = -ERESTARTSYS;
+			break;
+		}
+
 		schedule();
 
 		local_irq_save(flags);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-20  3:44 ` [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers " Nicholas A. Bellinger
@ 2014-01-20 11:34   ` Peter Zijlstra
  2014-01-21 22:09     ` Nicholas A. Bellinger
  2014-01-21 22:18     ` Kent Overstreet
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-20 11:34 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: target-devel, linux-kernel, Linus Torvalds, Ingo Molnar,
	Kent Overstreet, Jens Axboe

On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> From: Kent Overstreet <kmo@daterainc.com>
> 
> This patch changes percpu_ida_alloc() + callers to accept task state
> bitmask for prepare_to_wait() for code like target/iscsi that needs
> it for interruptible sleep, that is provided in a subsequent patch.
> 
> It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> and is forced to return a negative value when no tags are available.
> 
> v2 changes:
>   - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
>   - Drop signal_pending_state() call

Urgh, you made me look at percpu_ida... steal_tags() does a
for_each_cpus() with IRQs disabled. This means you'll disable IRQs for
multiple ticks on SGI class hardware. That is a _very_ long time indeed.

Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
actual tag, while the other only moves tags about -- semantic mismatch.

I do not get the comment near prepare to wait -- why does it matter if
percpu_ida_free() flips a cpus_have_tags bit?

Given I don't understand this comment, its hard for me to properly
review the proposed patch series.

Help?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-20 11:34   ` Peter Zijlstra
@ 2014-01-21 22:09     ` Nicholas A. Bellinger
  2014-01-21 22:18     ` Kent Overstreet
  1 sibling, 0 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-21 22:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: target-devel, linux-kernel, Linus Torvalds, Ingo Molnar,
	Kent Overstreet, Jens Axboe

On Mon, 2014-01-20 at 12:34 +0100, Peter Zijlstra wrote:
> On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> > From: Kent Overstreet <kmo@daterainc.com>
> > 
> > This patch changes percpu_ida_alloc() + callers to accept task state
> > bitmask for prepare_to_wait() for code like target/iscsi that needs
> > it for interruptible sleep, that is provided in a subsequent patch.
> > 
> > It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> > waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> > and is forced to return a negative value when no tags are available.
> > 
> > v2 changes:
> >   - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> >   - Drop signal_pending_state() call
> 
> Urgh, you made me look at percpu_ida... steal_tags() does a
> for_each_cpus() with IRQs disabled. This means you'll disable IRQs for
> multiple ticks on SGI class hardware. That is a _very_ long time indeed.
> 

So given the performance penalties involved in the steal tag slow path,
consumers should typically be pre-allocating a larger number of
percpu_ida tags than necessary to (ideally) avoid this logic completely.
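
E.g. a fabric driver would size its pool with some slack over the
expected queue depth (sketch; queue_depth and the amount of headroom
are illustrative only):

    /* extra tags keep the steal_tags() slow path unlikely */
    unsigned long nr_tags = queue_depth + queue_depth / 4;

    if (percpu_ida_init(&se_sess->sess_tag_pool, nr_tags))
            return -ENOMEM;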

> Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
> actual tag, while the other only moves tags about -- semantic mismatch.
> 

How about just in-lining alloc_global_tags() into percpu_ida_alloc()..?

> I do not get the comment near prepare to wait -- why does it matter if
> percpu_ida_free() flips a cpus_have_tags bit?
> 

Mmm, not sure on that one.

> Given I don't understand this comment, its hard for me to properly
> review the proposed patch series.
> 

Kent..?

--nab


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-20 11:34   ` Peter Zijlstra
  2014-01-21 22:09     ` Nicholas A. Bellinger
@ 2014-01-21 22:18     ` Kent Overstreet
  2014-01-22 19:53       ` Nicholas A. Bellinger
  2014-01-23 12:47       ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Kent Overstreet @ 2014-01-21 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Mon, Jan 20, 2014 at 12:34:15PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> > From: Kent Overstreet <kmo@daterainc.com>
> > 
> > This patch changes percpu_ida_alloc() + callers to accept task state
> > bitmask for prepare_to_wait() for code like target/iscsi that needs
> > it for interruptible sleep, that is provided in a subsequent patch.
> > 
> > It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> > waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> > and is forced to return a negative value when no tags are available.
> > 
> > v2 changes:
> >   - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> >   - Drop signal_pending_state() call
> 
> Urgh, you made me look at percpu_ida... steal_tags() does a
> for_each_cpus() with IRQs disabled. This means you'll disable IRQs for
> multiple ticks on SGI class hardware. That is a _very_ long time indeed.

It's not that bad in practice - the looping is limited by the number of other
CPUs that actually have tags on their freelists - i.e. the CPUs that have
recently been using that block device or whatever the percpu_ida is for. And we
loop while cpu_have_tags is greater than some threshold (there's another debate
about that) - the intention is not to steal tags unless too many other CPUs have
tags on their local freelists.

That said, for huge SGI class hardware I think you'd want the freelists to not
be percpu, but rather be per core or something - that's probably a reasonable
optimization for most hardware anyways.

> Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
> actual tag, while the other only moves tags about -- semantic mismatch.

Yeah, kind of. It is doing allocation, but not the same sort of allocation.

> I do not get the comment near prepare to wait -- why does it matter if
> percpu_ida_free() flips a cpus_have_tags bit?

Did I write that comment? It is a crappy comment...

Ok, in userspace we'd be using condition variables here, but this is the kernel
so we need to carefully order putting ourselves on a waitlist, and checking the
condition that determines whether we wait, and on the wakeup end changing things
that affect that condition and doing the wakeup. steal_tags() is checking the
condition that goes with the prepare_to_wait(), that's all.
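
Spelled out, that's the usual waitqueue handshake (generic sketch, not
the percpu_ida code verbatim; tags_available()/publish_tag() are
stand-ins for the real condition and its update):

    /* waiter */
    prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
    if (!tags_available(pool))      /* check the condition after queueing */
            schedule();
    finish_wait(&pool->wait, &wait);

    /* waker, e.g. percpu_ida_free() */
    publish_tag(pool);              /* make the condition true first... */
    wake_up(&pool->wait);           /* ...then wake the waiter */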

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-21 22:18     ` Kent Overstreet
@ 2014-01-22 19:53       ` Nicholas A. Bellinger
  2014-01-23 18:40         ` Nicholas A. Bellinger
  2014-01-23 12:47       ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-22 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

Hi Peter,

Does this satisfy your questions..?

Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
usage in percpu_ida_alloc() that need to be addressed before I can drop
this series into target-pending/for-next to address the original bug..?

Thank you,

--nab

On Tue, 2014-01-21 at 14:18 -0800, Kent Overstreet wrote:
> On Mon, Jan 20, 2014 at 12:34:15PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 20, 2014 at 03:44:44AM +0000, Nicholas A. Bellinger wrote:
> > > From: Kent Overstreet <kmo@daterainc.com>
> > > 
> > > This patch changes percpu_ida_alloc() + callers to accept task state
> > > bitmask for prepare_to_wait() for code like target/iscsi that needs
> > > it for interruptible sleep, that is provided in a subsequent patch.
> > > 
> > > It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep
> > > waiting for a new tag, or TASK_RUNNING when the caller cannot sleep,
> > > and is forced to return a negative value when no tags are available.
> > > 
> > > v2 changes:
> > >   - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes
> > >   - Drop signal_pending_state() call
> > 
> > Urgh, you made me look at percpu_ida... steal_tags() does a
> > for_each_cpus() with IRQs disabled. This means you'll disable IRQs for
> > multiple ticks on SGI class hardware. That is a _very_ long time indeed.
> 
> It's not that bad in practice - the looping is limited by the number of other
> CPUs that actually have tags on their freelists - i.e. the CPUs that have
> recently been using that block device or whatever the percpu_ida is for. And we
> loop while cpus_have_tags is greater than some threshold (there's another debate
> about that) - the intention is not to steal tags unless too many other CPUs have
> tags on their local freelists.
> 
> That said, for huge SGI class hardware I think you'd want the freelists to not
> be percpu, but rather be per core or something - that's probably a reasonable
> optimization for most hardware anyways.
> 
> > Then there's alloc_global_tags() vs alloc_local_tags(), one gets an
> > actual tag, while the other only moves tags about -- semantic mismatch.
> 
> Yeah, kind of. It is doing allocation, but not the same sort of allocation.
> 
> > I do not get the comment near prepare to wait -- why does it matter if
> > percpu_ida_free() flips a cpus_have_tags bit?
> 
> Did I write that comment? It is a crappy comment...
> 
> Ok, in userspace we'd be using condition variables here, but this is the kernel
> so we need to carefully order putting ourselves on a waitlist, and checking the
> condition that determines whether we wait, and on the wakeup end changing things
> that affect that condition and doing the wakeup. steal_tags() is checking the
> condition that goes with the prepare_to_wait(), that's all.
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask
  2014-01-20  3:44 [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
                   ` (2 preceding siblings ...)
  2014-01-20  3:44 ` [PATCH-v2 3/3] iscsi-target: Fix connection reset hang with percpu_ida_alloc Nicholas A. Bellinger
@ 2014-01-22 19:58 ` Nicholas A. Bellinger
  3 siblings, 0 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-22 19:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	Kent Overstreet, target-devel

Hey Jens,

On Mon, 2014-01-20 at 03:44 +0000, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <nab@linux-iscsi.org>
> 
> Hi Linus,
> 
> Here is the -v2 series for converting percpu_ida_alloc() + consumer
> usage to accept the task state bitmask parameter, w/o the extra
> legacy gfp_t wrapper.
> 
> As requested, the first patch contains only the parameter change
> to percpu_ida_alloc() + existing consumers, and makes no semantic
> or behavior change.  This patch is CC'ed for stable, and will need
> some light massaging to apply back to v3.12.y.
> 

Any objections to the blk-mq changes in patch #1..?  Please ACK.  :)

> The second patch is a blk-mq cleanup to propagate the task state
> bitmask usage up to the blk-mq vs. legacy split in blk_get_request().
> 

How about this one..?  Do you want to pick it up for v3.15, or have it
included in target-pending/for-next now..?

Thanks,

--nab

> The last patch fixes the original iscsi-target session reset bug
> by passing TASK_INTERRUPTIBLE, and adds the signal_pending_state()
> bit required in percpu_ida_alloc() code.  This is also CC'ed for
> v3.12.y.
> 
> CC'ing Ingo + Peter for TASK_RUNNING + prepare_to_wait() bit.
> 
> Thank you,
> 
> --nab
> 
> Kent Overstreet (1):
>   percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
> 
> Nicholas Bellinger (2):
>   blk-mq: Convert gfp_t parameters to task state bitmask
>   iscsi-target: Fix connection reset hang with percpu_ida_alloc
> 
>  block/blk-core.c                         |    4 +++-
>  block/blk-flush.c                        |    2 +-
>  block/blk-mq-tag.c                       |   14 +++++++-------
>  block/blk-mq-tag.h                       |    2 +-
>  block/blk-mq.c                           |   28 ++++++++++++++--------------
>  drivers/target/iscsi/iscsi_target_util.c |    8 ++++++--
>  drivers/target/tcm_fc/tfc_cmd.c          |    2 +-
>  drivers/vhost/scsi.c                     |    2 +-
>  include/linux/blk-mq.h                   |    4 ++--
>  include/linux/percpu_ida.h               |    3 ++-
>  lib/percpu_ida.c                         |   17 +++++++++++------
>  11 files changed, 49 insertions(+), 37 deletions(-)
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-21 22:18     ` Kent Overstreet
  2014-01-22 19:53       ` Nicholas A. Bellinger
@ 2014-01-23 12:47       ` Peter Zijlstra
  2014-01-23 13:28         ` Kent Overstreet
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 12:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Tue, Jan 21, 2014 at 02:18:52PM -0800, Kent Overstreet wrote:
> > I do not get the comment near prepare to wait -- why does it matter if
> > percpu_ida_free() flips a cpus_have_tags bit?
> 
> Did I write that comment? It is a crappy comment...
> 
> Ok, in userspace we'd be using condition variables here, but this is the kernel
> so we need to carefully order putting ourselves on a waitlist, and checking the
> condition that determines whether we wait, and on the wakeup end changing things
> that affect that condition and doing the wakeup. steal_tags() is checking the
> condition that goes with the prepare_to_wait(), that's all.

How about something like so?

---
 lib/percpu_ida.c | 126 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 70 insertions(+), 56 deletions(-)

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf91d0f..fc13d70b5b5e 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
 	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
 
+	lockdep_assert_held(&pool->lock);
+
 	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
 	     cpus_have_tags--) {
@@ -105,19 +107,7 @@ static inline void steal_tags(struct percpu_ida *pool,
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
-{
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
-}
-
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida_cpu *tags)
 {
 	int tag = -ENOSPC;
 
@@ -129,6 +119,50 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 	return tag;
 }
 
+static inline int
+__alloc_global_tag(struct percpu_ida *pool, struct percpu_ida_cpu *tags)
+{
+	int tag = -ENOSPC;
+
+	spin_lock(&pool->lock);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
+		tag = tags->freelist[--tags->nr_free];
+		if (tags->nr_free) {
+			cpumask_set_cpu(smp_processor_id(),
+					&pool->cpus_have_tags);
+		}
+	}
+
+	spin_unlock(&pool->lock);
+
+	return tag;
+}
+
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+	unsigned long flags;
+	int tag;
+
+	local_irq_save(flags);
+	tag = __alloc_global_tag(pool, this_cpu_ptr(pool->tag_cpu));
+	local_irq_restore(flags);
+
+	return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -147,61 +181,34 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  *
  * Will not fail if passed __GFP_WAIT.
  */
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
 	struct percpu_ida_cpu *tags;
 	unsigned long flags;
-	int tag;
+	int tag, ret = -ENOSPC;
 
 	local_irq_save(flags);
 	tags = this_cpu_ptr(pool->tag_cpu);
 
-	/* Fastpath */
 	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
+	if (unlikely(tag < 0))
+		tag = __alloc_global_tag(pool, tags);
 
-	while (1) {
-		spin_lock(&pool->lock);
-
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-
-		if (!tags->nr_free)
-			alloc_global_tags(pool, tags);
-		if (!tags->nr_free)
-			steal_tags(pool, tags);
-
-		if (tags->nr_free) {
-			tag = tags->freelist[--tags->nr_free];
-			if (tags->nr_free)
-				cpumask_set_cpu(smp_processor_id(),
-						&pool->cpus_have_tags);
-		}
-
-		spin_unlock(&pool->lock);
-		local_irq_restore(flags);
+	local_irq_restore(flags);
 
-		if (tag >= 0 || !(gfp & __GFP_WAIT))
-			break;
+	if (likely(tag >= 0))
+		return tag;
 
-		schedule();
+	if (state != TASK_RUNNING) {
+		ret = ___wait_event(pool->wait,
+				(tag = alloc_global_tag(pool)) >= 0,
+				state, 0, 0, schedule());
 
-		local_irq_save(flags);
-		tags = this_cpu_ptr(pool->tag_cpu);
+		if (tag >= 0)
+			ret = tag;
 	}
 
-	finish_wait(&pool->wait, &wait);
-	return tag;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_alloc);
 
@@ -235,12 +242,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
 		wake_up(&pool->wait);
 	}
 
+	/*
+	 * No point in waking when 1 < n < max_size free, because steal_tags()
+	 * will assume max_size per set bit, having more free will not make it
+	 * more or less likely it will visit this cpu.
+	 */
+
 	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
 
 		/*
-		 * Global lock held and irqs disabled, don't need percpu
-		 * lock
+		 * Global lock held and irqs disabled, don't need percpu lock
+		 * because everybody accessing remote @tags will hold
+		 * pool->lock -- steal_tags().
 		 */
 		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 12:47       ` Peter Zijlstra
@ 2014-01-23 13:28         ` Kent Overstreet
  2014-01-23 13:50           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2014-01-23 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 01:47:53PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 02:18:52PM -0800, Kent Overstreet wrote:
> > > I do not get the comment near prepare to wait -- why does it matter if
> > > percpu_ida_free() flips a cpus_have_tags bit?
> > 
> > Did I write that comment? It is a crappy comment...
> > 
> > Ok, in userspace we'd be using condition variables here, but this is the kernel
> > so we need to carefully order putting ourselves on a waitlist, and checking the
> > condition that determines whether we wait, and on the wakeup end changing things
> > that affect that condition and doing the wakeup. steal_tags() is checking the
> > condition that goes with the prepare_to_wait(), that's all.
> 
> How about something like so?

I like it - my only concern is that your patch has the effect of calling
__alloc_global_tag() twice before sleeping on alloc failure - given that
we're also doing a prepare_to_wait() I'm not concerned about touching
the global freelist twice, but we're also calling steal_tags() twice and
that's potentially more expensive.
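
(In the patch above the first call is inline and the second is
___wait_event()'s condition, which is evaluated once more before the
first schedule() - sketch:)

    tag = __alloc_global_tag(pool, tags);           /* call #1 */
    ...
    ret = ___wait_event(pool->wait,
                    (tag = alloc_global_tag(pool)) >= 0,    /* call #2 */
                    state, 0, 0, schedule());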

It should be ok, because I expect that when steal_tags() is going to
fail, most of the time it'll just check the bitmap and not run the
loop; but I think there's enough room for pathological behaviour here
to sleep on it.

pool->lock is also going to be fairly badly contended in the worst case,
and that can get real bad real fast... now that I think about it we
probably want to avoid the __alloc_global_tag() double call just because
of that, pool->lock is going to be quite a bit more contended than the
waitlist lock just because of the amount of work done under it.

though my old code was also calling prepare_to_wait() with pool->lock
held, which was dumb.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 13:28         ` Kent Overstreet
@ 2014-01-23 13:50           ` Peter Zijlstra
  2014-01-23 13:55             ` Kent Overstreet
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 13:50 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> pool->lock is also going to be fairly badly contended in the worst case,
> and that can get real bad real fast... now that I think about it we
> probably want to avoid the __alloc_global_tag() double call just because
> of that, pool->lock is going to be quite a bit more contended than the
> waitlist lock just because of the amount of work done under it.

OK, how about this then.. Not quite as pretty though

---
 lib/percpu_ida.c | 128 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 63 deletions(-)

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 9d054bf91d0f..a1279622436a 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
 	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
 
+	lockdep_assert_held(&pool->lock);
+
 	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
 	     cpus_have_tags--) {
@@ -105,26 +107,52 @@ static inline void steal_tags(struct percpu_ida *pool,
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
+	int tag = -ENOSPC;
+
+	local_irq_save(flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock_irqrestore(&tags->lock, flags);
+
+	return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_global_tag(struct percpu_ida *pool)
 {
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
 	int tag = -ENOSPC;
 
-	spin_lock(&tags->lock);
-	if (tags->nr_free)
+	spin_lock_irqsave(&pool->lock, flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
 		tag = tags->freelist[--tags->nr_free];
-	spin_unlock(&tags->lock);
+		if (tags->nr_free) {
+			cpumask_set_cpu(smp_processor_id(),
+					&pool->cpus_have_tags);
+		}
+	}
+
+	spin_unlock_irqrestore(&pool->lock, flags);
 
 	return tag;
 }
@@ -147,61 +175,28 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  *
  * Will not fail if passed __GFP_WAIT.
  */
-int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
-	struct percpu_ida_cpu *tags;
-	unsigned long flags;
-	int tag;
-
-	local_irq_save(flags);
-	tags = this_cpu_ptr(pool->tag_cpu);
-
-	/* Fastpath */
-	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
-
-	while (1) {
-		spin_lock(&pool->lock);
+	int ret;
 
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+	ret = alloc_local_tag(pool);
+	if (likely(ret >= 0))
+		return ret;
 
-		if (!tags->nr_free)
-			alloc_global_tags(pool, tags);
-		if (!tags->nr_free)
-			steal_tags(pool, tags);
+	if (state != TASK_RUNNING) {
+		int tag;
 
-		if (tags->nr_free) {
-			tag = tags->freelist[--tags->nr_free];
-			if (tags->nr_free)
-				cpumask_set_cpu(smp_processor_id(),
-						&pool->cpus_have_tags);
-		}
+		ret = ___wait_event(pool->wait,
+				(tag = alloc_global_tag(pool)) >= 0,
+				state, 0, 0, schedule());
 
-		spin_unlock(&pool->lock);
-		local_irq_restore(flags);
-
-		if (tag >= 0 || !(gfp & __GFP_WAIT))
-			break;
-
-		schedule();
-
-		local_irq_save(flags);
-		tags = this_cpu_ptr(pool->tag_cpu);
+		if (tag >= 0)
+			ret = tag;
+	} else {
+		ret = alloc_global_tag(pool);
 	}
 
-	finish_wait(&pool->wait, &wait);
-	return tag;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_alloc);
 
@@ -235,12 +230,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
 		wake_up(&pool->wait);
 	}
 
+	/*
+	 * No point in waking when 1 < n < max_size free, because steal_tags()
+	 * will assume max_size per set bit, having more free will not make it
+	 * more or less likely it will visit this cpu.
+	 */
+
 	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
 
 		/*
-		 * Global lock held and irqs disabled, don't need percpu
-		 * lock
+		 * Global lock held and irqs disabled, don't need percpu lock
+		 * because everybody accessing remote @tags will hold
+		 * pool->lock -- steal_tags().
 		 */
 		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 13:50           ` Peter Zijlstra
@ 2014-01-23 13:55             ` Kent Overstreet
  2014-01-23 15:43               ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Kent Overstreet @ 2014-01-23 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 02:50:03PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because fo the amount of work done under it.
> 
> OK, how about this then.. Not quite as pretty though

Heh, I suppose that is a solution. Let me screw around to see what I can
come up with tomorrow, but if I can't come up with anything I like I'm
not opposed to this.

> ---
>  lib/percpu_ida.c | 128 ++++++++++++++++++++++++++++---------------------------
>  1 file changed, 65 insertions(+), 63 deletions(-)
> 
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 9d054bf91d0f..a1279622436a 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -68,6 +68,8 @@ static inline void steal_tags(struct percpu_ida *pool,
>  	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
>  	struct percpu_ida_cpu *remote;
>  
> +	lockdep_assert_held(&pool->lock);
> +
>  	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
>  	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
>  	     cpus_have_tags--) {
> @@ -105,26 +107,52 @@ static inline void steal_tags(struct percpu_ida *pool,
>  	}
>  }
>  
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -				     struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> -	move_tags(tags->freelist, &tags->nr_free,
> -		  pool->freelist, &pool->nr_free,
> -		  min(pool->nr_free, pool->percpu_batch_size));
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag = -ENOSPC;
> +
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock_irqrestore(&tags->lock, flags);
> +
> +	return tag;
>  }
>  
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_global_tag(struct percpu_ida *pool)
>  {
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
>  	int tag = -ENOSPC;
>  
> -	spin_lock(&tags->lock);
> -	if (tags->nr_free)
> +	spin_lock_irqsave(&pool->lock, flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
> +
> +	if (!tags->nr_free) {
> +		/*
> +		 * Move tags from the global-, onto our percpu-freelist.
> +		 */
> +		move_tags(tags->freelist, &tags->nr_free,
> +			  pool->freelist, &pool->nr_free,
> +			  min(pool->nr_free, pool->percpu_batch_size));
> +	}
> +
> +	if (!tags->nr_free)
> +		steal_tags(pool, tags);
> +
> +	if (tags->nr_free) {
>  		tag = tags->freelist[--tags->nr_free];
> -	spin_unlock(&tags->lock);
> +		if (tags->nr_free) {
> +			cpumask_set_cpu(smp_processor_id(),
> +					&pool->cpus_have_tags);
> +		}
> +	}
> +
> +	spin_unlock_irqrestore(&pool->lock, flags);
>  
>  	return tag;
>  }
> @@ -147,61 +175,28 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>   *
>   * Will not fail if passed __GFP_WAIT.
>   */
> -int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
> +int percpu_ida_alloc(struct percpu_ida *pool, int state)
>  {
> -	DEFINE_WAIT(wait);
> -	struct percpu_ida_cpu *tags;
> -	unsigned long flags;
> -	int tag;
> -
> -	local_irq_save(flags);
> -	tags = this_cpu_ptr(pool->tag_cpu);
> -
> -	/* Fastpath */
> -	tag = alloc_local_tag(tags);
> -	if (likely(tag >= 0)) {
> -		local_irq_restore(flags);
> -		return tag;
> -	}
> -
> -	while (1) {
> -		spin_lock(&pool->lock);
> +	int ret;
>  
> -		/*
> -		 * prepare_to_wait() must come before steal_tags(), in case
> -		 * percpu_ida_free() on another cpu flips a bit in
> -		 * cpus_have_tags
> -		 *
> -		 * global lock held and irqs disabled, don't need percpu lock
> -		 */
> -		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> +	ret = alloc_local_tag(pool);
> +	if (likely(ret >= 0))
> +		return ret;
>  
> -		if (!tags->nr_free)
> -			alloc_global_tags(pool, tags);
> -		if (!tags->nr_free)
> -			steal_tags(pool, tags);
> +	if (state != TASK_RUNNING) {
> +		int tag;
>  
> -		if (tags->nr_free) {
> -			tag = tags->freelist[--tags->nr_free];
> -			if (tags->nr_free)
> -				cpumask_set_cpu(smp_processor_id(),
> -						&pool->cpus_have_tags);
> -		}
> +		ret = ___wait_event(pool->wait,
> +				(tag = alloc_global_tag(pool)) >= 0,
> +				state, 0, 0, schedule());
>  
> -		spin_unlock(&pool->lock);
> -		local_irq_restore(flags);
> -
> -		if (tag >= 0 || !(gfp & __GFP_WAIT))
> -			break;
> -
> -		schedule();
> -
> -		local_irq_save(flags);
> -		tags = this_cpu_ptr(pool->tag_cpu);
> +		if (tag >= 0)
> +			ret = tag;
> +	} else {
> +		ret = alloc_global_tag(pool);
>  	}
>  
> -	finish_wait(&pool->wait, &wait);
> -	return tag;
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_alloc);
>  
> @@ -235,12 +230,19 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
>  		wake_up(&pool->wait);
>  	}
>  
> +	/*
> +	 * No point in waking when 1 < n < max_size free, because steal_tags()
> +	 * will assume max_size per set bit, having more free will not make it
> +	 * more or less likely it will visit this cpu.
> +	 */
> +
>  	if (nr_free == pool->percpu_max_size) {
>  		spin_lock(&pool->lock);
>  
>  		/*
> -		 * Global lock held and irqs disabled, don't need percpu
> -		 * lock
> +		 * Global lock held and irqs disabled, don't need percpu lock
> +		 * because everybody accessing remote @tags will hold
> +		 * pool->lock -- steal_tags().
>  		 */
>  		if (tags->nr_free == pool->percpu_max_size) {
>  			move_tags(pool->freelist, &pool->nr_free,

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 13:55             ` Kent Overstreet
@ 2014-01-23 15:43               ` Peter Zijlstra
  2014-01-23 16:22                 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 15:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:55:39AM -0800, Kent Overstreet wrote:
> On Thu, Jan 23, 2014 at 02:50:03PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > > pool->lock is also going to be fairly badly contended in the worst case,
> > > and that can get real bad real fast... now that I think about it we
> > > probably want to avoid the __alloc_global_tag() double call just because
> > > of that, pool->lock is going to be quite a bit more contended than the
> > > waitlist lock just because of the amount of work done under it.
> > 
> OK, how about this then.. Not quite as pretty though
> 
> Heh, I suppose that is a solution. Let me screw around to see what I can
> come up with tomorrow, but if I can't come up with anything I like I'm
> not opposed to this.

Please also consider the below patch.

---
Subject: percpu-ida: Reduce IRQ held duration

It's bad manners to hold IRQs disabled over a full cpumask iteration.
Change it so that IRQs are disabled only where strictly required to
avoid lock-inversion issues.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -342,33 +342,31 @@ EXPORT_SYMBOL_GPL(__percpu_ida_init);
 int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
 	void *data)
 {
-	unsigned long flags;
 	struct percpu_ida_cpu *remote;
-	unsigned cpu, i, err = 0;
+	unsigned long flags;
+	int cpu, i, err = 0;
 
-	local_irq_save(flags);
 	for_each_possible_cpu(cpu) {
 		remote = per_cpu_ptr(pool->tag_cpu, cpu);
-		spin_lock(&remote->lock);
+		spin_lock_irqsave(&remote->lock, flags);
 		for (i = 0; i < remote->nr_free; i++) {
 			err = fn(remote->freelist[i], data);
 			if (err)
 				break;
 		}
-		spin_unlock(&remote->lock);
+		spin_unlock_irqrestore(&remote->lock, flags);
 		if (err)
-			goto out;
+			return err;
 	}
 
-	spin_lock(&pool->lock);
+	spin_lock_irqsave(&pool->lock, flags);
 	for (i = 0; i < pool->nr_free; i++) {
 		err = fn(pool->freelist[i], data);
 		if (err)
 			break;
 	}
-	spin_unlock(&pool->lock);
-out:
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&pool->lock, flags);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);

^ permalink raw reply	[flat|nested] 25+ messages in thread
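
For context on the function above: percpu_ida_for_each_free() visits
every free tag, per-cpu freelists first and then the global freelist. A
hypothetical caller (sketch only, not from this thread) showing the
callback contract, where a nonzero return stops the walk and is
propagated to the caller:

	static int count_free_cb(unsigned tag, void *data)
	{
		unsigned *count = data;

		(*count)++;	/* tally each currently-free tag */
		return 0;	/* nonzero would abort the iteration */
	}

	static unsigned count_free_tags(struct percpu_ida *pool)
	{
		unsigned count = 0;

		percpu_ida_for_each_free(pool, count_free_cb, &count);
		return count;
	}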

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 15:43               ` Peter Zijlstra
@ 2014-01-23 16:22                 ` Peter Zijlstra
  2014-01-23 16:46                   ` Peter Zijlstra
                                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 16:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> pool->lock is also going to be fairly badly contended in the worst case,
> and that can get real bad real fast... now that I think about it we
> probably want to avoid the __alloc_global_tag() double call just because
> of that, pool->lock is going to be quite a bit more contended than the
> waitlist lock just because of the amount of work done under it.

On top of the two previous patches: I think we can reduce pool->lock contention
by not holding it while doing steal_tags().

By dropping pool->lock around steal_tags() we lose serialization over:

  pool->cpus_have_tags, which is an atomic bitmask anyway, and
  pool->cpu_last_stolen, which is only a heuristic, so sod it.

We further lose the guarantee relied upon by percpu_ida_free(), so have
it also acquire the tags->lock, which should be a far less contended
resource.

Now everything modifying percpu_ida_cpu state holds
percpu_ida_cpu::lock, everything that modifies the actual percpu_ida
freelists holds percpu_ida::lock, and percpu_ida_cpu::lock nests inside
percpu_ida::lock.


The only annoying thing is that we're still holding IRQs over
steal_tags(); we should be able to make that a preempt_disable() without
too much effort, or very much cheat and drop even that, relying on the
percpu_ida_cpu::lock to serialize everything and just hoping that we don't
migrate too often.

But that's for another patch.
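
Concretely, any path that needs both locks must take them in this order
(a sketch for clarity, not taken from the patch below):

	spin_lock(&pool->lock);		/* percpu_ida::lock, outer */
	spin_lock(&tags->lock);		/* percpu_ida_cpu::lock, inner */
	/* ... move tags between the global and per-cpu freelists ... */
	spin_unlock(&tags->lock);
	spin_unlock(&pool->lock);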

---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -68,8 +68,6 @@ static inline void steal_tags(struct per
 	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
 
-	lockdep_assert_held(&pool->lock);
-
 	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
 	     cpus_have_tags--) {
@@ -141,18 +139,24 @@ static inline int alloc_global_tag(struc
 			  min(pool->nr_free, pool->percpu_batch_size));
 	}
 
+	spin_unlock(&pool->lock);
+
 	if (!tags->nr_free)
 		steal_tags(pool, tags);
 
 	if (tags->nr_free) {
-		tag = tags->freelist[--tags->nr_free];
+		spin_lock(&tags->lock);
 		if (tags->nr_free) {
-			cpumask_set_cpu(smp_processor_id(),
-					&pool->cpus_have_tags);
+			tag = tags->freelist[--tags->nr_free];
+			if (tags->nr_free) {
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
+			}
 		}
+		spin_unlock(&tags->lock);
 	}
 
-	spin_unlock_irqrestore(&pool->lock, flags);
+	local_irq_restore(flags);
 
 	return tag;
 }
@@ -238,12 +242,8 @@ void percpu_ida_free(struct percpu_ida *
 
 	if (nr_free == pool->percpu_max_size) {
 		spin_lock(&pool->lock);
+		spin_lock(&tags->lock);
 
-		/*
-		 * Global lock held and irqs disabled, don't need percpu lock
-		 * because everybody accessing remote @tags will hold
-		 * pool->lock -- steal_tags().
-		 */
 		if (tags->nr_free == pool->percpu_max_size) {
 			move_tags(pool->freelist, &pool->nr_free,
 				  tags->freelist, &tags->nr_free,
@@ -251,6 +251,8 @@ void percpu_ida_free(struct percpu_ida *
 
 			wake_up(&pool->wait);
 		}
+
+		spin_unlock(&tags->lock);
 		spin_unlock(&pool->lock);
 	}
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 16:22                 ` Peter Zijlstra
@ 2014-01-23 16:46                   ` Peter Zijlstra
  2014-01-23 19:31                   ` Kent Overstreet
  2014-02-10  9:30                   ` Alexander Gordeev
  2 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 16:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because of the amount of work done under it.

> Now everything modifying percpu_ida_cpu state holds
> percpu_ida_cpu::lock

Almost that.

---
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *d
 	*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -87,7 +96,7 @@ static inline void steal_tags(struct per
 		if (remote == tags)
 			continue;
 
-		spin_lock(&remote->lock);
+		double_lock(&tags->lock, &remote->lock);
 
 		if (remote->nr_free) {
 			memcpy(tags->freelist,
@@ -99,6 +108,7 @@ static inline void steal_tags(struct per
 		}
 
 		spin_unlock(&remote->lock);
+		spin_unlock(&tags->lock);
 
 		if (tags->nr_free)
 			break;

^ permalink raw reply	[flat|nested] 25+ messages in thread
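
The address ordering in double_lock() is the standard way to take two
locks of the same class: without it, two CPUs stealing from each other
could deadlock. A sketch of the interleaving it rules out (illustrative
only, not from the patch):

	/*
	 *   CPU0: spin_lock(&tagsA->lock);    CPU1: spin_lock(&tagsB->lock);
	 *   CPU0: spin_lock(&tagsB->lock);    CPU1: spin_lock(&tagsA->lock);
	 *
	 * Each CPU now waits forever on the lock the other holds.  Always
	 * acquiring the lower-addressed lock first makes such a cycle
	 * impossible; spin_lock_nested() merely tells lockdep that the
	 * same-class nesting is intentional.
	 */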

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-22 19:53       ` Nicholas A. Bellinger
@ 2014-01-23 18:40         ` Nicholas A. Bellinger
  2014-01-23 19:12           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-23 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> Hi Peter,
> 
> Does this satisfy your questions..?
> 
> Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> usage in percpu_ida_alloc() that need to be addressed before I can drop
> this series into target-pending/for-next to address the original bug..?
> 

Given the silence, I'll assume you're OK with the initial TASK_RUNNING +
prepare_to_wait() bit, right..?

--nab


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 18:40         ` Nicholas A. Bellinger
@ 2014-01-23 19:12           ` Peter Zijlstra
  2014-01-23 19:31             ` Nicholas A. Bellinger
  2014-01-23 19:34             ` Kent Overstreet
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-23 19:12 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > Hi Peter,
> > 
> > Does this satisfy your questions..?
> > 
> > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > this series into target-pending/for-next to address the original bug..?
> > 
> 
> Given the silence, 

You mean the silence in which I sent 4+ emails earlier today?

> I'll assume you're OK with the initial TASK_RUNNING +
> prepare_to_wait() bit, right..?

No, I would prefer not to do that. While it does work it's awkward at
best.

^ permalink raw reply	[flat|nested] 25+ messages in thread
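
For reference, prepare_to_wait() at the time was roughly the following
(lightly simplified from kernel/wait.c), which is why TASK_RUNNING
"works" but is awkward: the task is enqueued on the waitqueue
regardless, while setting the state is effectively a no-op, so the
subsequent schedule() degrades into a yield rather than a sleep.

	void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags &= ~WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue(q, wait);	/* enqueued regardless */
		set_current_state(state);	/* no-op for TASK_RUNNING */
		spin_unlock_irqrestore(&q->lock, flags);
	}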

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 19:12           ` Peter Zijlstra
@ 2014-01-23 19:31             ` Nicholas A. Bellinger
  2014-01-23 19:38               ` Nicholas A. Bellinger
  2014-01-23 19:34             ` Kent Overstreet
  1 sibling, 1 reply; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-23 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, 2014-01-23 at 20:12 +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > Hi Peter,
> > > 
> > > Does this satisfy your questions..?
> > > 
> > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > this series into target-pending/for-next to address the original bug..?
> > > 
> > 
> > Given the silence, 
> 
> You mean the silence in which I sent 4+ emails earlier today?
> 
> > I'll assume you're OK with the initial TASK_RUNNING +
> > prepare_to_wait() bit, right..?
> 
> No, I would prefer not to do that. While it does work it's awkward at
> best.

AFAICT, those changes don't address the original bug that the series was
trying to address, allowing the percpu_ida_alloc() tag stealing slow
path to be interrupted by a signal..

Also, keep in mind this change needs to be backported to >= v3.12, which
is why the percpu_ida changes have been kept to a minimum.

--nab

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 16:22                 ` Peter Zijlstra
  2014-01-23 16:46                   ` Peter Zijlstra
@ 2014-01-23 19:31                   ` Kent Overstreet
  2014-02-10  9:30                   ` Alexander Gordeev
  2 siblings, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2014-01-23 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because of the amount of work done under it.
> 
> On top of the two previous patches: I think we can reduce pool->lock contention
> by not holding it while doing steal_tags().
> 
> By dropping pool->lock around steal_tags() we lose serialization over:
> 
>   pool->cpus_have_tags, which is an atomic bitmask anyway, and
>   pool->cpu_last_stolen, which is only a heuristic, so sod it.
> 
> We further lose the guarantee relied upon by percpu_ida_free(), so have
> it also acquire the tags->lock, which should be a far less contended
> resource.
> 
> Now everything modifying percpu_ida_cpu state holds
> percpu_ida_cpu::lock, everything that modifies the actual percpu_ida
> freelists holds percpu_ida::lock, and percpu_ida_cpu::lock nests inside
> percpu_ida::lock.

I find myself wondering why I didn't originally do this - I'm a bit worried
there was some subtle race I've forgotten about with this approach - but
probably I just got tired of trying out, and reasoning about, different
subtle optimizations is all :)

So yeah, assuming we can't find anything wrong with this approach - this should
be great.

> The only annoying thing is that we're still holding IRQs over
> steal_tags(); we should be able to make that a preempt_disable() without
> too much effort, or very much cheat and drop even that, relying on the
> percpu_ida_cpu::lock to serialize everything and just hoping that we don't
> migrate too often.

I don't think that's an issue - it's pretty hard to come up with a scenario
where most/all of a large number of cpus are marked in the bit vector as having
tags, and _then_ none of them actually have tags (because as soon as one of them
does, we'll succeed and break out of the loop).

And then that would only be able to happen once, because we clear bits as we go.

And we need interrupts disabled while we're holding any of the spinlocks (as
freeing can certainly happen in atomic context), so the only alternative would
be to save/restore interrupts on every loop iteration, and that'll get _real_
expensive fast. (last I profiled it, nested push/pop of the flags register was
_ridiculous_, sigh)

^ permalink raw reply	[flat|nested] 25+ messages in thread
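
The "freeing can certainly happen in atomic context" constraint above is
the classic reason these locks must be taken with IRQs off. A sketch of
the self-deadlock that is otherwise possible (illustrative only):

	/* Task context, interrupts still enabled:
	 *
	 *	spin_lock(&tags->lock);
	 *	<interrupt arrives on the same cpu>
	 *	    percpu_ida_free()
	 *		spin_lock(&tags->lock);	// spins forever: the lock
	 *					// holder sits beneath us
	 *
	 * Hence spin_lock_irqsave(), or an outer local_irq_save(), around
	 * every acquisition of these locks.
	 */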

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 19:12           ` Peter Zijlstra
  2014-01-23 19:31             ` Nicholas A. Bellinger
@ 2014-01-23 19:34             ` Kent Overstreet
  1 sibling, 0 replies; 25+ messages in thread
From: Kent Overstreet @ 2014-01-23 19:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas A. Bellinger, target-devel, linux-kernel,
	Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 08:12:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > Hi Peter,
> > > 
> > > Does this satisfy your questions..?
> > > 
> > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > this series into target-pending/for-next to address the original bug..?
> > > 
> > 
> > Given the silence, 
> 
> You mean the silence in which I sent 4+ emails earlier today?
> 
> > I'll assume you're OK with the initial TASK_RUNNING +
> > prepare_to_wait() bit, right..?
> 
> No, I would prefer not to do that. While it does work it's awkward at
> best.

I do like the improvements, but personally I really don't see anything wrong
with the initial patch, and for backporting that really is what we should do -
this code _is_ subtle enough that backporting our (your) improvements is not
something I'd want to do, having debugged this code when I first wrote it...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 19:31             ` Nicholas A. Bellinger
@ 2014-01-23 19:38               ` Nicholas A. Bellinger
  2014-01-24 15:14                 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-23 19:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, 2014-01-23 at 11:31 -0800, Nicholas A. Bellinger wrote:
> On Thu, 2014-01-23 at 20:12 +0100, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 10:40:04AM -0800, Nicholas A. Bellinger wrote:
> > > On Wed, 2014-01-22 at 11:53 -0800, Nicholas A. Bellinger wrote:
> > > > Hi Peter,
> > > > 
> > > > Does this satisfy your questions..?
> > > > 
> > > > Do you have any more concerns about TASK_RUNNING + prepare_to_wait()
> > > > usage in percpu_ida_alloc() that need to be addressed before I can drop
> > > > this series into target-pending/for-next to address the original bug..?
> > > > 
> > > 
> > > Given the silence, 
> > 
> > You mean the silence in which I sent 4+ emails earlier today?
> > 
> > > I'll assume you're OK with the initial TASK_RUNNING +
> > > prepare_to_wait() bit, right..?
> > 
> > No, I would prefer not to do that. While it does work it's awkward at
> > best.
> 
> AFAICT, those changes don't address the original bug that the series was
> trying to address, allowing the percpu_ida_alloc() tag stealing slow
> path to be interrupted by a signal..
> 
> Also, keep in mind this change needs to be backported to >= v3.12, which
> is why the percpu_ida changes have been kept to a minimum.
> 

<A little too eager to SEND>

So would you prefer the following addition to the original bugfix
instead..?

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index a48ce2e..58b6714 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -174,7 +174,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
                 *
                 * global lock held and irqs disabled, don't need percpu lock
                 */
-               prepare_to_wait(&pool->wait, &wait, state);
+               if (state != TASK_RUNNING)
+                       prepare_to_wait(&pool->wait, &wait, state);
 
                if (!tags->nr_free)
                        alloc_global_tags(pool, tags);
@@ -199,8 +200,9 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
                local_irq_save(flags);
                tags = this_cpu_ptr(pool->tag_cpu);
        }
+       if (state != TASK_RUNNING)
+               finish_wait(&pool->wait, &wait);
 
-       finish_wait(&pool->wait, &wait);
        return tag;
 }
 EXPORT_SYMBOL_GPL(percpu_ida_alloc);



^ permalink raw reply related	[flat|nested] 25+ messages in thread
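
The effect of the guard above, in rough pseudocode (a reconstruction,
with try_allocate() standing in for the global-alloc/steal sequence, not
code from the mail): TASK_RUNNING callers take one pass through the loop
without ever touching the waitqueue, while sleeping callers keep the
original prepare_to_wait()/finish_wait() pairing.

	while (1) {
		if (state != TASK_RUNNING)
			prepare_to_wait(&pool->wait, &wait, state);

		tag = try_allocate(pool);	/* global tags, then steal */

		if (tag >= 0 || state == TASK_RUNNING)
			break;			/* non-sleepers give up here */

		schedule();
	}
	if (state != TASK_RUNNING)
		finish_wait(&pool->wait, &wait);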

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 19:38               ` Nicholas A. Bellinger
@ 2014-01-24 15:14                 ` Peter Zijlstra
  2014-01-25  6:33                   ` Nicholas A. Bellinger
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-01-24 15:14 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 11:38:24AM -0800, Nicholas A. Bellinger wrote:
> > AFAICT, those changes don't address the original bug that the series was
> > trying to address, allowing the percpu_ida_alloc() tag stealing slow
> > path to be interrupted by a signal..
> > 
> > Also, keep in mind this change needs to be backported to >= v3.12, which
> > is why the percpu_ida changes have been kept to a minimum.

Well, the other option is to revert whatever caused the issue in the
first place :-)

I'm not much for making ugly fixes just because it's easier to backport.

> <A little too eager to SEND>
> 
> So would you prefer the following addition to the original bugfix
> instead..?

It'll make a right old mess out of percpu_ida.c, but yeah.

> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index a48ce2e..58b6714 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -174,7 +174,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
>                  *
>                  * global lock held and irqs disabled, don't need percpu lock
>                  */
> -               prepare_to_wait(&pool->wait, &wait, state);
> +               if (state != TASK_RUNNING)
> +                       prepare_to_wait(&pool->wait, &wait, state);
>  
>                 if (!tags->nr_free)
>                         alloc_global_tags(pool, tags);
> @@ -199,8 +200,9 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
>                 local_irq_save(flags);
>                 tags = this_cpu_ptr(pool->tag_cpu);
>         }
> +       if (state != TASK_RUNNING)
> +               finish_wait(&pool->wait, &wait);
>  
> -       finish_wait(&pool->wait, &wait);
>         return tag;
>  }
>  EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-24 15:14                 ` Peter Zijlstra
@ 2014-01-25  6:33                   ` Nicholas A. Bellinger
  0 siblings, 0 replies; 25+ messages in thread
From: Nicholas A. Bellinger @ 2014-01-25  6:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kmo, target-devel, linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Fri, 2014-01-24 at 16:14 +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 11:38:24AM -0800, Nicholas A. Bellinger wrote:
> > > AFAICT, those changes don't address the original bug that the series was
> > > trying to address, allowing the percpu_ida_alloc() tag stealing slow
> > > path to be interrupted by a signal..
> > > 
> > > Also, keep in mind this change needs to be backported to >= v3.12, which
> > > is why the percpu_ida changes have been kept to a minimum.
> 
> Well, the other option is to revert whatever caused the issue in the
> first place :-)
> 
> I'm not much for making ugly fixes just because it's easier to backport.
> 

Me neither, but given Kent's comment on the subtle issues with larger
improvements to the tag-stealing slow path, it's the right approach for
addressing a bug in stable code.

I'd also prefer to avoid introducing new issues for blk-mq in the
first round of stable code, given the amount of testing it's already
undergone with percpu_ida for v3.13.

> > <A little too eager to SEND>
> > 
> > So would you prefer the following addition to the original bugfix
> > instead..?
> 
> It'll make a right old mess out of percpu_ida.c, but yeah.
> 

Kent and I would both like to see your improvements merged upstream,
just not in a manner that would cause Greg-KH to start cursing loudly
over a stable backport.

That said, unless Jens throws a last-minute NACK, I'll be queuing patches
#1 + #3 into target-pending/for-next for Sunday night's build.

Jens, please feel free to pick up Patch #2 for a post-v3.15 merge at your
earliest convenience.

Thanks,

--nab

> > diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> > index a48ce2e..58b6714 100644
> > --- a/lib/percpu_ida.c
> > +++ b/lib/percpu_ida.c
> > @@ -174,7 +174,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
> >                  *
> >                  * global lock held and irqs disabled, don't need percpu lock
> >                  */
> > -               prepare_to_wait(&pool->wait, &wait, state);
> > +               if (state != TASK_RUNNING)
> > +                       prepare_to_wait(&pool->wait, &wait, state);
> >  
> >                 if (!tags->nr_free)
> >                         alloc_global_tags(pool, tags);
> > @@ -199,8 +200,9 @@ int percpu_ida_alloc(struct percpu_ida *pool, int state)
> >                 local_irq_save(flags);
> >                 tags = this_cpu_ptr(pool->tag_cpu);
> >         }
> > +       if (state != TASK_RUNNING)
> > +               finish_wait(&pool->wait, &wait);
> >  
> > -       finish_wait(&pool->wait, &wait);
> >         return tag;
> >  }
> >  EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> > 
> > 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers accept task state bitmask
  2014-01-23 16:22                 ` Peter Zijlstra
  2014-01-23 16:46                   ` Peter Zijlstra
  2014-01-23 19:31                   ` Kent Overstreet
@ 2014-02-10  9:30                   ` Alexander Gordeev
  2 siblings, 0 replies; 25+ messages in thread
From: Alexander Gordeev @ 2014-02-10  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kent Overstreet, Nicholas A. Bellinger, target-devel,
	linux-kernel, Linus Torvalds, Ingo Molnar, Jens Axboe

On Thu, Jan 23, 2014 at 05:22:54PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 05:28:29AM -0800, Kent Overstreet wrote:
> > pool->lock is also going to be fairly badly contended in the worst case,
> > and that can get real bad real fast... now that I think about it we
> > probably want to avoid the __alloc_global_tag() double call just because
> > of that, pool->lock is going to be quite a bit more contended than the
> > waitlist lock just because of the amount of work done under it.
> 
> On top of the two previous patches: I think we can reduce pool->lock contention
> by not holding it while doing steal_tags().
> 
> By dropping pool->lock around steal_tags() we lose serialization over:
> 
>   pool->cpus_have_tags, which is an atomic bitmask anyway, and
>   pool->cpu_last_stolen, which is only a heuristic, so sod it.
> 
> We further lose the guarantee relied upon by percpu_ida_free(), so have
> it also acquire the tags->lock, which should be a far less contended
> resource.
> 
> Now everything modifying percpu_ida_cpu state holds
> percpu_ida_cpu::lock, everything that modifies the actual percpu_ida
> freelists holds percpu_ida::lock, and percpu_ida_cpu::lock nests inside
> percpu_ida::lock.
> 
> 
> The only annoying thing is that we're still holding IRQs over
> steal_tags(); we should be able to make that a preempt_disable() without
> too much effort, or very much cheat and drop even that, relying on the
> percpu_ida_cpu::lock to serialize everything and just hoping that we don't
> migrate too often.
> 
> But that's for another patch.
> 
> ---
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -68,8 +68,6 @@ static inline void steal_tags(struct per
>  	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
>  	struct percpu_ida_cpu *remote;
>  
> -	lockdep_assert_held(&pool->lock);
> -
>  	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
>  	     cpus_have_tags * pool->percpu_max_size > pool->nr_tags / 2;
>  	     cpus_have_tags--) {

Two concurrent threads can find and clear the very same bit a few lines below

		cpu = cpumask_next(cpu, &pool->cpus_have_tags);

		[...]

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);

The second thread's clear races with cpumask_set_cpu() in percpu_ida_free()
or alloc_global_tag()

	if (nr_free == 1) {
		cpumask_set_cpu(smp_processor_id(),
				&pool->cpus_have_tags);
		wake_up(&pool->wait);
	}

As a result, the woken thread does not find the (illegitimately) cleared
bit while looping over the mask in steal_tags(). I suspect this or another
thread might enter an unwakeable sleep once the number of set bits in the
mask and the number of threads in the waitqueue become unbalanced.

Hope I am wrong here.
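
Spelling the suspected interleaving out, with threads A and B in
steal_tags() and C in percpu_ida_free() (a reconstruction for clarity):

	/*
	 *   A: cpu = cpumask_next(...);	// finds bit N set
	 *   B: cpu = cpumask_next(...);	// also finds bit N
	 *   A: cpumask_clear_cpu(N, ...);	// clears N, steals the tags
	 *   C: cpumask_set_cpu(N, ...);	// a free on cpu N sets N again
	 *      wake_up(&pool->wait);
	 *   B: cpumask_clear_cpu(N, ...);	// clears N a second time
	 *
	 * The waiter woken by C can now find bit N clear even though cpu N
	 * has a free tag, and may go back to sleep with no wakeup pending.
	 */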

> @@ -141,18 +139,24 @@ static inline int alloc_global_tag(struc
>  			  min(pool->nr_free, pool->percpu_batch_size));
>  	}
>  
> +	spin_unlock(&pool->lock);
> +
>  	if (!tags->nr_free)
>  		steal_tags(pool, tags);
>  
>  	if (tags->nr_free) {
> -		tag = tags->freelist[--tags->nr_free];
> +		spin_lock(&tags->lock);
>  		if (tags->nr_free) {
> -			cpumask_set_cpu(smp_processor_id(),
> -					&pool->cpus_have_tags);
> +			tag = tags->freelist[--tags->nr_free];
> +			if (tags->nr_free) {
> +				cpumask_set_cpu(smp_processor_id(),
> +						&pool->cpus_have_tags);
> +			}
>  		}
> +		spin_unlock(&tags->lock);
>  	}
>  
> -	spin_unlock_irqrestore(&pool->lock, flags);
> +	local_irq_restore(flags);
>  
>  	return tag;
>  }
> @@ -238,12 +242,8 @@ void percpu_ida_free(struct percpu_ida *
>  
>  	if (nr_free == pool->percpu_max_size) {
>  		spin_lock(&pool->lock);
> +		spin_lock(&tags->lock);
>  
> -		/*
> -		 * Global lock held and irqs disabled, don't need percpu lock
> -		 * because everybody accessing remote @tags will hold
> -		 * pool->lock -- steal_tags().
> -		 */
>  		if (tags->nr_free == pool->percpu_max_size) {
>  			move_tags(pool->freelist, &pool->nr_free,
>  				  tags->freelist, &tags->nr_free,
> @@ -251,6 +251,8 @@ void percpu_ida_free(struct percpu_ida *
>  
>  			wake_up(&pool->wait);
>  		}
> +
> +		spin_unlock(&tags->lock);
>  		spin_unlock(&pool->lock);
>  	}
>  

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-02-10  9:28 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-20  3:44 [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
2014-01-20  3:44 ` [PATCH-v2 1/3] percpu_ida: Make percpu_ida_alloc + callers " Nicholas A. Bellinger
2014-01-20 11:34   ` Peter Zijlstra
2014-01-21 22:09     ` Nicholas A. Bellinger
2014-01-21 22:18     ` Kent Overstreet
2014-01-22 19:53       ` Nicholas A. Bellinger
2014-01-23 18:40         ` Nicholas A. Bellinger
2014-01-23 19:12           ` Peter Zijlstra
2014-01-23 19:31             ` Nicholas A. Bellinger
2014-01-23 19:38               ` Nicholas A. Bellinger
2014-01-24 15:14                 ` Peter Zijlstra
2014-01-25  6:33                   ` Nicholas A. Bellinger
2014-01-23 19:34             ` Kent Overstreet
2014-01-23 12:47       ` Peter Zijlstra
2014-01-23 13:28         ` Kent Overstreet
2014-01-23 13:50           ` Peter Zijlstra
2014-01-23 13:55             ` Kent Overstreet
2014-01-23 15:43               ` Peter Zijlstra
2014-01-23 16:22                 ` Peter Zijlstra
2014-01-23 16:46                   ` Peter Zijlstra
2014-01-23 19:31                   ` Kent Overstreet
2014-02-10  9:30                   ` Alexander Gordeev
2014-01-20  3:44 ` [PATCH-v2 2/3] blk-mq: Convert gfp_t parameters to " Nicholas A. Bellinger
2014-01-20  3:44 ` [PATCH-v2 3/3] iscsi-target: Fix connection reset hang with percpu_ida_alloc Nicholas A. Bellinger
2014-01-22 19:58 ` [PATCH-v2 0/3] percpu_ida+Co: Make percpu_ida_alloc accept task state bitmask Nicholas A. Bellinger
