* [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
@ 2018-02-03  4:21 Ming Lei
  2018-02-03  4:21 ` [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

Hi All,

This patchset adds support for global tags, based on work started by Hannes:

	https://marc.info/?l=linux-block&m=149132580511346&w=2

Also introduce 'force_blk_mq' in 'struct scsi_host_template', so that
drivers can avoid supporting two IO paths (legacy and blk-mq), especially
since recent discussion suggests SCSI_MQ will be enabled by default soon:

	https://marc.info/?l=linux-scsi&m=151727684915589&w=2

With the above two changes, it should be easier to convert SCSI drivers'
reply queues into blk-mq hctxs, and then the automatic irq affinity issue
can be solved easily; please see the detailed descriptions in the commit logs.

Also, drivers may need to complete requests on the submission CPU to
avoid hard/soft lockups, which can be done easily with blk-mq too:

	https://marc.info/?t=151601851400001&r=1&w=2

The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi,
so that the IO hang in the legacy IO path can be avoided. This issue is
somewhat generic; at least HPSA and virtio-scsi are known to be broken with v4.15+.

Thanks
Ming

Ming Lei (5):
  blk-mq: tags: define several fields of tags as pointer
  blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  block: null_blk: introduce module parameter of 'g_global_tags'
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

 block/bfq-iosched.c        |  4 +--
 block/blk-mq-debugfs.c     | 11 ++++----
 block/blk-mq-sched.c       |  2 +-
 block/blk-mq-tag.c         | 67 +++++++++++++++++++++++++++++-----------------
 block/blk-mq-tag.h         | 15 ++++++++---
 block/blk-mq.c             | 31 ++++++++++++++++-----
 block/blk-mq.h             |  3 ++-
 block/kyber-iosched.c      |  2 +-
 drivers/block/null_blk.c   |  6 +++++
 drivers/scsi/hosts.c       |  1 +
 drivers/scsi/virtio_scsi.c | 59 +++-------------------------------------
 include/linux/blk-mq.h     |  2 ++
 include/scsi/scsi_host.h   |  3 +++
 13 files changed, 105 insertions(+), 101 deletions(-)

-- 
2.9.5


* [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
@ 2018-02-03  4:21 ` Ming Lei
  2018-02-05  6:57   ` Hannes Reinecke
  2018-02-08 17:34     ` Bart Van Assche
  2018-02-03  4:21 ` [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

This patch changes tags->breserved_tags, tags->bitmap_tags and
tags->active_queues into pointers, in preparation for supporting global tags.

No functional change.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c    |  4 ++--
 block/blk-mq-debugfs.c | 10 +++++-----
 block/blk-mq-tag.c     | 48 ++++++++++++++++++++++++++----------------------
 block/blk-mq-tag.h     | 10 +++++++---
 block/blk-mq.c         |  2 +-
 block/kyber-iosched.c  |  2 +-
 6 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..1e1211814a57 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 			WARN_ON_ONCE(1);
 			return;
 		}
-		bt = &tags->breserved_tags;
+		bt = tags->breserved_tags;
 	} else
-		bt = &tags->bitmap_tags;
+		bt = tags->bitmap_tags;
 
 	if (unlikely(bfqd->sb_shift != bt->sb.shift))
 		bfq_update_depths(bfqd, bt);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..0dfafa4b655a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
 	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
 	seq_printf(m, "active_queues=%d\n",
-		   atomic_read(&tags->active_queues));
+		   atomic_read(tags->active_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
-	sbitmap_queue_show(&tags->bitmap_tags, m);
+	sbitmap_queue_show(tags->bitmap_tags, m);
 
 	if (tags->nr_reserved_tags) {
 		seq_puts(m, "\nbreserved_tags:\n");
-		sbitmap_queue_show(&tags->breserved_tags, m);
+		sbitmap_queue_show(tags->breserved_tags, m);
 	}
 }
 
@@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->tags)
-		sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->sched_tags)
-		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..571797dc36cb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 	if (!tags)
 		return true;
 
-	return sbitmap_any_bit_clear(&tags->bitmap_tags.sb);
+	return sbitmap_any_bit_clear(&tags->bitmap_tags->sb);
 }
 
 /*
@@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
 	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		atomic_inc(&hctx->tags->active_queues);
+		atomic_inc(hctx->tags->active_queues);
 
 	return true;
 }
@@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-	sbitmap_queue_wake_all(&tags->bitmap_tags);
+	sbitmap_queue_wake_all(tags->bitmap_tags);
 	if (include_reserve)
-		sbitmap_queue_wake_all(&tags->breserved_tags);
+		sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return;
 
-	atomic_dec(&tags->active_queues);
+	atomic_dec(tags->active_queues);
 
 	blk_mq_tag_wakeup_all(tags, false);
 }
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	if (bt->sb.depth == 1)
 		return true;
 
-	users = atomic_read(&hctx->tags->active_queues);
+	users = atomic_read(hctx->tags->active_queues);
 	if (!users)
 		return true;
 
@@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			WARN_ON_ONCE(1);
 			return BLK_MQ_TAG_FAIL;
 		}
-		bt = &tags->breserved_tags;
+		bt = tags->breserved_tags;
 		tag_offset = 0;
 	} else {
-		bt = &tags->bitmap_tags;
+		bt = tags->bitmap_tags;
 		tag_offset = tags->nr_reserved_tags;
 	}
 
@@ -165,9 +165,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
-			bt = &tags->breserved_tags;
+			bt = tags->breserved_tags;
 		else
-			bt = &tags->bitmap_tags;
+			bt = tags->bitmap_tags;
 
 		finish_wait(&ws->wait, &wait);
 		ws = bt_wait_ptr(bt, data->hctx);
@@ -189,10 +189,10 @@ void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
 		const int real_tag = tag - tags->nr_reserved_tags;
 
 		BUG_ON(real_tag >= tags->nr_tags);
-		sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
+		sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
 	} else {
 		BUG_ON(tag >= tags->nr_reserved_tags);
-		sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
+		sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
 	}
 }
 
@@ -283,8 +283,8 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, tags->breserved_tags, fn, priv, true);
+	bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, false);
 }
 
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
@@ -346,8 +346,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 			continue;
 
 		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 	}
 
 }
@@ -365,15 +365,15 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
-	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
+	if (bt_alloc(tags->bitmap_tags, depth, round_robin, node))
 		goto free_tags;
-	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
+	if (bt_alloc(tags->breserved_tags, tags->nr_reserved_tags, round_robin,
 		     node))
 		goto free_bitmap_tags;
 
 	return tags;
 free_bitmap_tags:
-	sbitmap_queue_free(&tags->bitmap_tags);
+	sbitmap_queue_free(tags->bitmap_tags);
 free_tags:
 	kfree(tags);
 	return NULL;
@@ -397,13 +397,17 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
+	tags->bitmap_tags = &tags->__bitmap_tags;
+	tags->breserved_tags = &tags->__breserved_tags;
+	tags->active_queues = &tags->__active_queues;
+
 	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	sbitmap_queue_free(&tags->bitmap_tags);
-	sbitmap_queue_free(&tags->breserved_tags);
+	sbitmap_queue_free(tags->bitmap_tags);
+	sbitmap_queue_free(tags->breserved_tags);
 	kfree(tags);
 }
 
@@ -454,7 +458,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		 * Don't need (or can't) update reserved tags here, they
 		 * remain static and should never need resizing.
 		 */
-		sbitmap_queue_resize(&tags->bitmap_tags, tdepth);
+		sbitmap_queue_resize(tags->bitmap_tags, tdepth);
 	}
 
 	return 0;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..a68323fa0c02 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -11,10 +11,14 @@ struct blk_mq_tags {
 	unsigned int nr_tags;
 	unsigned int nr_reserved_tags;
 
-	atomic_t active_queues;
+	atomic_t *active_queues;
+	atomic_t __active_queues;
 
-	struct sbitmap_queue bitmap_tags;
-	struct sbitmap_queue breserved_tags;
+	struct sbitmap_queue *bitmap_tags;
+	struct sbitmap_queue *breserved_tags;
+
+	struct sbitmap_queue __bitmap_tags;
+	struct sbitmap_queue __breserved_tags;
 
 	struct request **rqs;
 	struct request **static_rqs;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102e2149..69d4534870af 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1136,7 +1136,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 		return false;
 	}
 
-	ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+	ws = bt_wait_ptr(this_hctx->tags->bitmap_tags, this_hctx);
 	add_wait_queue(&ws->wait, wait);
 
 	/*
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f95c60774ce8..4adaced76382 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -281,7 +281,7 @@ static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
 	 * All of the hardware queues have the same depth, so we can just grab
 	 * the shift of the first one.
 	 */
-	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+	return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift;
 }
 
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
-- 
2.9.5


* [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
  2018-02-03  4:21 ` [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer Ming Lei
@ 2018-02-03  4:21 ` Ming Lei
  2018-02-05  6:54   ` Hannes Reinecke
  2018-02-03  4:21 ` [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags' Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
reply queues, but the tag space is often HBA wide.

These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment.

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even if the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests on
a reply queue that has no online CPUs assigned, and HPSA has been broken
with v4.15+:

	https://marc.info/?l=linux-kernel&m=151748144730409&w=2

This issue can be solved generically and easily via blk-mq (scsi-mq)
multiple hw queues, by mapping each reply queue to an hctx, but one tricky
part is the HBA-wide (instead of hw-queue-wide) tags.

This patch is based on the following Hannes's patch:

	https://marc.info/?l=linux-block&m=149132580511346&w=2

One big difference from Hannes's patch is that this one only makes the tags
sbitmap and active_queues data structures HBA wide; the others keep NUMA
locality, such as requests, hctxs, tags, ...

The following patch adds global tags support to null_blk, along with
performance data; no obvious performance loss is observed when the total
hw queue depth is kept the same.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Arun Easi <arun.easi@cavium.com>
Cc: Omar Sandoval <osandov@fb.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Don Brace <don.brace@microsemi.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Peter Rivera <peter.rivera@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq-tag.c     | 23 ++++++++++++++++++-----
 block/blk-mq-tag.h     |  5 ++++-
 block/blk-mq.c         | 29 ++++++++++++++++++++++++-----
 block/blk-mq.h         |  3 ++-
 include/linux/blk-mq.h |  2 ++
 7 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0dfafa4b655a..0f0fafe03f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(SG_MERGE),
+	HCTX_FLAG_NAME(GLOBAL_TAGS),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55c0a745b427..191d4bc95f0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	int ret;
 
 	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
-					       set->reserved_tags);
+					       set->reserved_tags, false);
 	if (!hctx->sched_tags)
 		return -ENOMEM;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 571797dc36cb..66377d09eaeb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	return NULL;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+				     unsigned int total_tags,
 				     unsigned int reserved_tags,
-				     int node, int alloc_policy)
+				     int node, int alloc_policy,
+				     bool global_tag)
 {
 	struct blk_mq_tags *tags;
 
@@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
+	WARN_ON(global_tag && !set->global_tags);
+	if (global_tag && set->global_tags) {
+		tags->bitmap_tags = set->global_tags->bitmap_tags;
+		tags->breserved_tags = set->global_tags->breserved_tags;
+		tags->active_queues = set->global_tags->active_queues;
+		tags->global_tags = true;
+		return tags;
+	}
 	tags->bitmap_tags = &tags->__bitmap_tags;
 	tags->breserved_tags = &tags->__breserved_tags;
 	tags->active_queues = &tags->__active_queues;
@@ -406,8 +416,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	sbitmap_queue_free(tags->bitmap_tags);
-	sbitmap_queue_free(tags->breserved_tags);
+	if (!tags->global_tags) {
+		sbitmap_queue_free(tags->bitmap_tags);
+		sbitmap_queue_free(tags->breserved_tags);
+	}
 	kfree(tags);
 }
 
@@ -441,7 +453,8 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		if (tdepth > 16 * BLKDEV_MAX_RQ)
 			return -EINVAL;
 
-		new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth, 0);
+		new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth, 0,
+				(*tagsptr)->global_tags);
 		if (!new)
 			return -ENOMEM;
 		ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index a68323fa0c02..a87b5cfa5726 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -20,13 +20,16 @@ struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;
 
+	bool	global_tags;
 	struct request **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
 };
 
 
-extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
+extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+		unsigned int nr_tags, unsigned int reserved_tags, int node,
+		int alloc_policy, bool global_tag);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 69d4534870af..a98466dc71b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2023,7 +2023,8 @@ void blk_mq_free_rq_map(struct blk_mq_tags *tags)
 struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 					unsigned int hctx_idx,
 					unsigned int nr_tags,
-					unsigned int reserved_tags)
+					unsigned int reserved_tags,
+					bool global_tags)
 {
 	struct blk_mq_tags *tags;
 	int node;
@@ -2032,8 +2033,9 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
-	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
-				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
+	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
+				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags),
+				global_tags);
 	if (!tags)
 		return NULL;
 
@@ -2336,7 +2338,8 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
 	int ret = 0;
 
 	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
-					set->queue_depth, set->reserved_tags);
+					set->queue_depth, set->reserved_tags,
+					!!(set->flags & BLK_MQ_F_GLOBAL_TAGS));
 	if (!set->tags[hctx_idx])
 		return false;
 
@@ -2891,15 +2894,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
+	if (set->flags & BLK_MQ_F_GLOBAL_TAGS) {
+		ret = -ENOMEM;
+		set->global_tags = blk_mq_init_tags(set, set->queue_depth,
+				set->reserved_tags, set->numa_node,
+				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags),
+				false);
+		if (!set->global_tags)
+			goto out_free_mq_map;
+	}
+
 	ret = blk_mq_alloc_rq_maps(set);
 	if (ret)
-		goto out_free_mq_map;
+		goto out_free_global_tags;
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
 
+out_free_global_tags:
+	if (set->global_tags)
+		blk_mq_free_tags(set->global_tags);
 out_free_mq_map:
 	kfree(set->mq_map);
 	set->mq_map = NULL;
@@ -2914,6 +2930,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
 	int i;
 
+	if (set->global_tags)
+		blk_mq_free_tags(set->global_tags);
+
 	for (i = 0; i < nr_cpu_ids; i++)
 		blk_mq_free_map_and_requests(set, i);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..d1d9a0a8e1fa 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -61,7 +61,8 @@ void blk_mq_free_rq_map(struct blk_mq_tags *tags);
 struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 					unsigned int hctx_idx,
 					unsigned int nr_tags,
-					unsigned int reserved_tags);
+					unsigned int reserved_tags,
+					bool global_tags);
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..8548c72d6b4a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -82,6 +82,7 @@ struct blk_mq_tag_set {
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;
+	struct blk_mq_tags	*global_tags;	/* for BLK_MQ_F_GLOBAL_TAGS */
 
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
@@ -175,6 +176,7 @@ enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
+	BLK_MQ_F_GLOBAL_TAGS	= 1 << 3,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.9.5


* [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags'
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
  2018-02-03  4:21 ` [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer Ming Lei
  2018-02-03  4:21 ` [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS Ming Lei
@ 2018-02-03  4:21 ` Ming Lei
  2018-02-05  6:54   ` Hannes Reinecke
  2018-02-03  4:21 ` [PATCH 4/5] scsi: introduce force_blk_mq Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

This patch introduces the 'g_global_tags' module parameter so that we can
test this feature easily with null_blk.

No obvious performance drop is seen with global_tags when the total hw
queue depth is kept the same:

1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 submit_queues=4 hw_queue_depth=1

2) 'global_tags', global hw queue depth is 4, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 submit_queues=4 hw_queue_depth=4

3) fio test done in above two settings:
   fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3

1M IOPS can be reached in both of the above tests, which were run in one VM.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Arun Easi <arun.easi@cavium.com>
Cc: Omar Sandoval <osandov@fb.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Don Brace <don.brace@microsemi.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Peter Rivera <peter.rivera@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/null_blk.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..ad0834efad42 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -163,6 +163,10 @@ static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
 
+static int g_global_tags = 0;
+module_param_named(global_tags, g_global_tags, int, S_IRUGO);
+MODULE_PARM_DESC(global_tags, "All submission queues share one set of tags");
+
 static int g_home_node = NUMA_NO_NODE;
 module_param_named(home_node, g_home_node, int, S_IRUGO);
 MODULE_PARM_DESC(home_node, "Home node for the device");
@@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
 	set->flags = BLK_MQ_F_SHOULD_MERGE;
 	if (g_no_sched)
 		set->flags |= BLK_MQ_F_NO_SCHED;
+	if (g_global_tags)
+		set->flags |= BLK_MQ_F_GLOBAL_TAGS;
 	set->driver_data = NULL;
 
 	if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.5


* [PATCH 4/5] scsi: introduce force_blk_mq
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
                   ` (2 preceding siblings ...)
  2018-02-03  4:21 ` [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags' Ming Lei
@ 2018-02-03  4:21 ` Ming Lei
  2018-02-05  6:57   ` Hannes Reinecke
  2018-02-03  4:21 ` [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity Ming Lei
  2018-02-05  6:58 ` [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Hannes Reinecke
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

From the SCSI driver's point of view, it is a bit troublesome to support both
blk-mq and non-blk-mq at the same time, especially when drivers need to
support multiple hw queues.

This patch introduces 'force_blk_mq' in scsi_host_template so that drivers
can provide blk-mq-only support, and driver code can avoid the trouble
of supporting both paths.

This may clean up drivers a lot by providing blk-mq-only support; in
particular, it becomes easier to convert multiple reply queues into blk-mq
hw queues for the following purposes:

1) use blk-mq multiple hw queues to deal with irq vectors whose assigned
affinity contains only offline CPUs [1]:

	[1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even if the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests on
a reply queue that has no online CPUs assigned.

This issue can be solved generically and easily via blk-mq (scsi-mq)
multiple hw queues, by mapping each reply queue to an hctx.

2) some drivers [2] need to complete requests on the submission CPU to avoid
hard/soft lockups, which is easily done with blk-mq, so there is no need to
reinvent the wheel to solve the problem:

	[2] https://marc.info/?t=151601851400001&r=1&w=2

Solving the above issues in the non-MQ path may not be easy, or would
introduce unnecessary work, especially as we plan to enable SCSI_MQ by
default soon, as discussed recently [3]:

	[3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2

Cc: Hannes Reinecke <hare@suse.de>
Cc: Arun Easi <arun.easi@cavium.com>
Cc: Omar Sandoval <osandov@fb.com>,
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Don Brace <don.brace@microsemi.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Peter Rivera <peter.rivera@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/hosts.c     | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index fe3a0da3ec97..c75cebd7911d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 		shost->dma_boundary = 0xffffffff;
 
 	shost->use_blk_mq = scsi_use_blk_mq;
+	shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
 	device_initialize(&shost->shost_gendev);
 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..4118760e5c32 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* tell scsi core we support blk-mq only */
+	unsigned force_blk_mq:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
-- 
2.9.5


* [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
                   ` (3 preceding siblings ...)
  2018-02-03  4:21 ` [PATCH 4/5] scsi: introduce force_blk_mq Ming Lei
@ 2018-02-03  4:21 ` Ming Lei
  2018-02-05  6:57   ` Hannes Reinecke
  2018-02-05  6:58 ` [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Hannes Reinecke
  5 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-03  4:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Hannes Reinecke, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Kashyap Desai, Peter Rivera, Paolo Bonzini,
	Laurence Oberman, Ming Lei

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged for v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even if the allocation
is improved.

For example, on an 8-core VM where CPUs 4-7 are not-present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can take the following shape:

	irq 36, cpu list 0-7
	irq 37, cpu list 0-7
	irq 38, cpu list 0-7
	irq 39, cpu list 0-1
	irq 40, cpu list 4,6
	irq 41, cpu list 2-3
	irq 42, cpu list 5,7

An IO hang is then triggered in the non-SCSI_MQ (legacy) path.

Given that storage IO always follows a client/server model, this issue does
not exist with SCSI_MQ (blk-mq), because no IO can be submitted to a hw
queue that has no online CPUs mapped to it.

Fix this issue by forcing virtio-scsi to use blk-mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years and it has
been quite stable, so this change shouldn't add extra risk.
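
The failure mode above can be checked mechanically: a legacy-path reply
queue may be served by an irq vector whose affinity mask contains only
offline CPUs, so its completions never arrive. A minimal userspace sketch
of that check, modelling the example affinity table above (the helper and
mask names are made up for illustration, this is not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Model of the example above: CPUs 0-3 online, 4-7 not present/offline. */
static const bool cpu_online_mask[NR_CPUS] = {
	true, true, true, true, false, false, false, false
};

/*
 * Return true if at least one CPU in the irq's affinity mask is online,
 * i.e. the vector can actually deliver completions to a live CPU.
 */
static bool irq_has_online_cpu(const bool affinity[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (affinity[cpu] && cpu_online_mask[cpu])
			return true;
	return false;
}
```

In the table above, irq 40 (cpus 4,6) and irq 42 (cpus 5,7) fail this
check, so any legacy-path request completed on those reply queues hangs;
blk-mq sidesteps this because a hw queue whose mapped CPUs are all offline
never receives submissions in the first place.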

Cc: Hannes Reinecke <hare@suse.de>
Cc: Arun Easi <arun.easi@cavium.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: James Bottomley <james.bottomley@hansenpartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Don Brace <don.brace@microsemi.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Peter Rivera <peter.rivera@broadcom.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 59 +++-------------------------------------------
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
 	seqcount_t tgt_seq;
 
-	/* Count of outstanding requests. */
-	atomic_t reqs;
-
 	/* Currently active virtqueue for requests sent to this target. */
 	struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	struct virtio_scsi_cmd *cmd = buf;
 	struct scsi_cmnd *sc = cmd->sc;
 	struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-	struct virtio_scsi_target_state *tgt =
-				scsi_target(sc->device)->hostdata;
 
 	dev_dbg(&sc->device->sdev_gendev,
 		"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
 	}
 
 	sc->scsi_done(sc);
-
-	atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 					struct scsi_cmnd *sc)
 {
 	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt =
-				scsi_target(sc->device)->hostdata;
 
-	atomic_inc(&tgt->reqs);
 	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
 	return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-					       struct virtio_scsi_target_state *tgt)
-{
-	struct virtio_scsi_vq *vq;
-	unsigned long flags;
-	u32 queue_num;
-
-	local_irq_save(flags);
-	if (atomic_inc_return(&tgt->reqs) > 1) {
-		unsigned long seq;
-
-		do {
-			seq = read_seqcount_begin(&tgt->tgt_seq);
-			vq = tgt->req_vq;
-		} while (read_seqcount_retry(&tgt->tgt_seq, seq));
-	} else {
-		/* no writes can be concurrent because of atomic_t */
-		write_seqcount_begin(&tgt->tgt_seq);
-
-		/* keep previous req_vq if a reader just arrived */
-		if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-			vq = tgt->req_vq;
-			goto unlock;
-		}
-
-		queue_num = smp_processor_id();
-		while (unlikely(queue_num >= vscsi->num_queues))
-			queue_num -= vscsi->num_queues;
-		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-		write_seqcount_end(&tgt->tgt_seq);
-	}
-	local_irq_restore(flags);
-
-	return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 				       struct scsi_cmnd *sc)
 {
 	struct virtio_scsi *vscsi = shost_priv(sh);
-	struct virtio_scsi_target_state *tgt =
-				scsi_target(sc->device)->hostdata;
-	struct virtio_scsi_vq *req_vq;
-
-	if (shost_use_blk_mq(sh))
-		req_vq = virtscsi_pick_vq_mq(vscsi, sc);
-	else
-		req_vq = virtscsi_pick_vq(vscsi, tgt);
+	struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);
 
 	return virtscsi_queuecommand(vscsi, req_vq, sc);
 }
@@ -775,7 +721,6 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 		return -ENOMEM;
 
 	seqcount_init(&tgt->tgt_seq);
-	atomic_set(&tgt->reqs, 0);
 	tgt->req_vq = &vscsi->req_vqs[0];
 
 	starget->hostdata = tgt;
@@ -823,6 +768,7 @@ static struct scsi_host_template virtscsi_host_template_single = {
 	.target_alloc = virtscsi_target_alloc,
 	.target_destroy = virtscsi_target_destroy,
 	.track_queue_depth = 1,
+	.force_blk_mq = 1,
 };
 
 static struct scsi_host_template virtscsi_host_template_multi = {
@@ -844,6 +790,7 @@ static struct scsi_host_template virtscsi_host_template_multi = {
 	.target_destroy = virtscsi_target_destroy,
 	.map_queues = virtscsi_map_queues,
 	.track_queue_depth = 1,
+	.force_blk_mq = 1,
 };
 
 #define virtscsi_config_get(vdev, fld) \
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  2018-02-03  4:21 ` [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS Ming Lei
@ 2018-02-05  6:54   ` Hannes Reinecke
  2018-02-05 10:35     ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
> reply queues, but the tag space is often HBA wide.
> 
> These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> for automatic affinity assignment.
> 
> Now that commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
> possible CPUs") has been merged into v4.16-rc, some irq vectors can easily
> end up being assigned only offline CPUs, and this can't be avoided even if
> the allocation algorithm is improved.
> 
> So all these drivers have to avoid asking the HBA to complete requests on
> a reply queue that has no online CPUs assigned, and HPSA has been broken
> with v4.15+:
> 
> 	https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> This issue can be solved generically and easily via blk-mq (scsi_mq)
> multiple hw queues, by mapping each reply queue to an hctx; the one tricky
> part is the HBA-wide (instead of hw-queue-wide) tags.
> 
> This patch is based on the following Hannes's patch:
> 
> 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> One big difference from Hannes's patch is that this one only makes the
> tags sbitmap and the active_queues data structure HBA wide; everything
> else (requests, hctxs, tags, ...) keeps its NUMA locality.
> 
> The following patch supports global tags on null_blk and provides
> performance data: no obvious performance loss is observed when the whole
> hw queue depth is kept the same.
> 
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Arun Easi <arun.easi@cavium.com>
> Cc: Omar Sandoval <osandov@fb.com>,
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Peter Rivera <peter.rivera@broadcom.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   |  2 +-
>  block/blk-mq-tag.c     | 23 ++++++++++++++++++-----
>  block/blk-mq-tag.h     |  5 ++++-
>  block/blk-mq.c         | 29 ++++++++++++++++++++++++-----
>  block/blk-mq.h         |  3 ++-
>  include/linux/blk-mq.h |  2 ++
>  7 files changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 0dfafa4b655a..0f0fafe03f5d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
>  	HCTX_FLAG_NAME(SHOULD_MERGE),
>  	HCTX_FLAG_NAME(TAG_SHARED),
>  	HCTX_FLAG_NAME(SG_MERGE),
> +	HCTX_FLAG_NAME(GLOBAL_TAGS),
>  	HCTX_FLAG_NAME(BLOCKING),
>  	HCTX_FLAG_NAME(NO_SCHED),
>  };
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55c0a745b427..191d4bc95f0e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>  	int ret;
>  
>  	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
> -					       set->reserved_tags);
> +					       set->reserved_tags, false);
>  	if (!hctx->sched_tags)
>  		return -ENOMEM;
>  
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 571797dc36cb..66377d09eaeb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  	return NULL;
>  }
>  
> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> +				     unsigned int total_tags,
>  				     unsigned int reserved_tags,
> -				     int node, int alloc_policy)
> +				     int node, int alloc_policy,
> +				     bool global_tag)
>  {
>  	struct blk_mq_tags *tags;
>  
> @@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>  	tags->nr_tags = total_tags;
>  	tags->nr_reserved_tags = reserved_tags;
>  
> +	WARN_ON(global_tag && !set->global_tags);
> +	if (global_tag && set->global_tags) {
> +		tags->bitmap_tags = set->global_tags->bitmap_tags;
> +		tags->breserved_tags = set->global_tags->breserved_tags;
> +		tags->active_queues = set->global_tags->active_queues;
> +		tags->global_tags = true;
> +		return tags;
> +	}
>  	tags->bitmap_tags = &tags->__bitmap_tags;
>  	tags->breserved_tags = &tags->__breserved_tags;
>  	tags->active_queues = &tags->__active_queues;
Do we really need the 'global_tag' flag here?
Can't we just rely on the ->global_tags pointer to be set?
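
Hannes's suggestion amounts to deriving ownership from the pointers
themselves: tags whose bitmap_tags points at the embedded __bitmap_tags own
the bitmap, otherwise it is borrowed from the set-wide instance. A hedged
userspace sketch of that invariant (simplified stand-in structs and
function names, not the actual blk-mq API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-ins for struct sbitmap_queue / struct blk_mq_tags. */
struct bitmap { unsigned long words[4]; };

struct tags {
	struct bitmap *bitmap_tags;   /* either &__bitmap_tags or a shared one */
	struct bitmap  __bitmap_tags; /* embedded, per-instance bitmap */
};

/* Ownership is derivable from the pointer alone; no extra bool needed. */
static bool tags_own_bitmap(const struct tags *t)
{
	return t->bitmap_tags == &t->__bitmap_tags;
}

static struct tags *tags_init(struct bitmap *shared)
{
	struct tags *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	/* Borrow the set-wide bitmap when one is provided (global tags),
	 * otherwise fall back to the embedded per-instance bitmap. */
	t->bitmap_tags = shared ? shared : &t->__bitmap_tags;
	return t;
}

/* Mirror of the blk_mq_free_tags() concern: only free what we own. */
static void tags_free(struct tags *t)
{
	/* A real implementation would free t->__bitmap_tags's resources
	 * only when tags_own_bitmap(t) is true. */
	free(t);
}
```

Whether this is preferable to an explicit bool is a readability trade-off;
the pointer comparison avoids a redundant field but makes the sharing less
obvious at the call sites that free the bitmaps.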

> @@ -406,8 +416,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>  
>  void blk_mq_free_tags(struct blk_mq_tags *tags)
>  {
> -	sbitmap_queue_free(tags->bitmap_tags);
> -	sbitmap_queue_free(tags->breserved_tags);
> +	if (!tags->global_tags) {
> +		sbitmap_queue_free(tags->bitmap_tags);
> +		sbitmap_queue_free(tags->breserved_tags);
> +	}
>  	kfree(tags);
>  }
>  
> @@ -441,7 +453,8 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  		if (tdepth > 16 * BLKDEV_MAX_RQ)
>  			return -EINVAL;
>  
> -		new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth, 0);
> +		new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth, 0,
> +				(*tagsptr)->global_tags);
>  		if (!new)
>  			return -ENOMEM;
>  		ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index a68323fa0c02..a87b5cfa5726 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -20,13 +20,16 @@ struct blk_mq_tags {
>  	struct sbitmap_queue __bitmap_tags;
>  	struct sbitmap_queue __breserved_tags;
>  
> +	bool	global_tags;
>  	struct request **rqs;
>  	struct request **static_rqs;
>  	struct list_head page_list;
>  };
>  
>  
> -extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
> +extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> +		unsigned int nr_tags, unsigned int reserved_tags, int node,
> +		int alloc_policy, bool global_tag);
>  extern void blk_mq_free_tags(struct blk_mq_tags *tags);
>  
>  extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 69d4534870af..a98466dc71b5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2023,7 +2023,8 @@ void blk_mq_free_rq_map(struct blk_mq_tags *tags)
>  struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  					unsigned int hctx_idx,
>  					unsigned int nr_tags,
> -					unsigned int reserved_tags)
> +					unsigned int reserved_tags,
> +					bool global_tags)
>  {
>  	struct blk_mq_tags *tags;
>  	int node;
> @@ -2032,8 +2033,9 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  	if (node == NUMA_NO_NODE)
>  		node = set->numa_node;
>  
> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
> -				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
> +				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags),
> +				global_tags);
>  	if (!tags)
>  		return NULL;
>  
> @@ -2336,7 +2338,8 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>  	int ret = 0;
>  
>  	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
> -					set->queue_depth, set->reserved_tags);
> +					set->queue_depth, set->reserved_tags,
> +					!!(set->flags & BLK_MQ_F_GLOBAL_TAGS));
>  	if (!set->tags[hctx_idx])
>  		return false;
>  
> @@ -2891,15 +2894,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  	if (ret)
>  		goto out_free_mq_map;
>  
> +	if (set->flags & BLK_MQ_F_GLOBAL_TAGS) {
> +		ret = -ENOMEM;
> +		set->global_tags = blk_mq_init_tags(set, set->queue_depth,
> +				set->reserved_tags, set->numa_node,
> +				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags),
> +				false);
> +		if (!set->global_tags)
> +			goto out_free_mq_map;
> +	}
> +
>  	ret = blk_mq_alloc_rq_maps(set);
>  	if (ret)
> -		goto out_free_mq_map;
> +		goto out_free_global_tags;
>  
>  	mutex_init(&set->tag_list_lock);
>  	INIT_LIST_HEAD(&set->tag_list);
>  
>  	return 0;
>  
> +out_free_global_tags:
> +	if (set->global_tags)
> +		blk_mq_free_tags(set->global_tags);
>  out_free_mq_map:
>  	kfree(set->mq_map);
>  	set->mq_map = NULL;
> @@ -2914,6 +2930,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>  {
>  	int i;
>  
> +	if (set->global_tags)
> +		blk_mq_free_tags(set->global_tags);
> +
>  	for (i = 0; i < nr_cpu_ids; i++)
>  		blk_mq_free_map_and_requests(set, i);
>  
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 88c558f71819..d1d9a0a8e1fa 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -61,7 +61,8 @@ void blk_mq_free_rq_map(struct blk_mq_tags *tags);
>  struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  					unsigned int hctx_idx,
>  					unsigned int nr_tags,
> -					unsigned int reserved_tags);
> +					unsigned int reserved_tags,
> +					bool global_tags);
>  int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		     unsigned int hctx_idx, unsigned int depth);
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 8efcf49796a3..8548c72d6b4a 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -82,6 +82,7 @@ struct blk_mq_tag_set {
>  	void			*driver_data;
>  
>  	struct blk_mq_tags	**tags;
> +	struct blk_mq_tags	*global_tags;	/* for BLK_MQ_F_GLOBAL_TAGS */
>  
>  	struct mutex		tag_list_lock;
>  	struct list_head	tag_list;
> @@ -175,6 +176,7 @@ enum {
>  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
>  	BLK_MQ_F_SG_MERGE	= 1 << 2,
> +	BLK_MQ_F_GLOBAL_TAGS	= 1 << 3,
>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> 
Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags'
  2018-02-03  4:21 ` [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags' Ming Lei
@ 2018-02-05  6:54   ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> This patch introduces the 'g_global_tags' parameter so that this feature
> can be tested easily via null_blk.
> 
> No obvious performance drop is seen with global_tags when the whole hw
> queue depth is kept the same:
> 
> 1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 submit_queues=4 hw_queue_depth=1
> 
> 2) 'global_tags', global hw queue depth is 4, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 submit_queues=4 hw_queue_depth=4
> 
> 3) fio test done in above two settings:
>    fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> 
> 1M IOPS can be reached in both above tests which is done in one VM.
> 
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Arun Easi <arun.easi@cavium.com>
> Cc: Omar Sandoval <osandov@fb.com>,
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Peter Rivera <peter.rivera@broadcom.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/null_blk.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
> index 287a09611c0f..ad0834efad42 100644
> --- a/drivers/block/null_blk.c
> +++ b/drivers/block/null_blk.c
> @@ -163,6 +163,10 @@ static int g_submit_queues = 1;
>  module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
>  MODULE_PARM_DESC(submit_queues, "Number of submission queues");
>  
> +static int g_global_tags = 0;
> +module_param_named(global_tags, g_global_tags, int, S_IRUGO);
> +MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
> +
>  static int g_home_node = NUMA_NO_NODE;
>  module_param_named(home_node, g_home_node, int, S_IRUGO);
>  MODULE_PARM_DESC(home_node, "Home node for the device");
> @@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
>  	set->flags = BLK_MQ_F_SHOULD_MERGE;
>  	if (g_no_sched)
>  		set->flags |= BLK_MQ_F_NO_SCHED;
> +	if (g_global_tags)
> +		set->flags |= BLK_MQ_F_GLOBAL_TAGS;
>  	set->driver_data = NULL;
>  
>  	if ((nullb && nullb->dev->blocking) || g_blocking)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/5] scsi: introduce force_blk_mq
  2018-02-03  4:21 ` [PATCH 4/5] scsi: introduce force_blk_mq Ming Lei
@ 2018-02-05  6:57   ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> From the SCSI driver's point of view, it is a bit troublesome to support
> both blk-mq and non-blk-mq at the same time, especially when drivers need
> to support multiple hw queues.
> 
> This patch introduces 'force_blk_mq' in scsi_host_template so that drivers
> can provide blk-mq-only support and avoid the trouble of supporting both
> paths.
> 
> Providing blk-mq-only support may clean drivers up a lot; in particular,
> it becomes easier to convert multiple reply queues into blk-mq hw queues,
> for the following purposes:
> 
> 1) use blk-mq's multiple hw queues to deal with irq vectors whose assigned
> affinity covers only offline CPUs[1]:
> 
> 	[1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> Now that commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
> possible CPUs") has been merged into v4.16-rc, some irq vectors can easily
> end up being assigned only offline CPUs, and this can't be avoided even if
> the allocation algorithm is improved.
> 
> So all these drivers have to avoid asking the HBA to complete requests on
> a reply queue that has no online CPUs assigned.
> 
> This issue can be solved generically and easily via blk-mq (scsi_mq)
> multiple hw queues, by mapping each reply queue to an hctx.
> 
> 2) some drivers[2] require requests to be completed on the submission CPU
> to avoid hard/soft lockups; this is easily done with blk-mq, so there is
> no need to reinvent the wheel.
> 
> 	[2] https://marc.info/?t=151601851400001&r=1&w=2
> 
> Solving the above issues for the non-MQ path may not be easy, or may
> introduce unnecessary work, especially since we plan to enable SCSI_MQ by
> default soon, as discussed recently[3]:
> 
> 	[3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> 
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Arun Easi <arun.easi@cavium.com>
> Cc: Omar Sandoval <osandov@fb.com>,
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Peter Rivera <peter.rivera@broadcom.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/hosts.c     | 1 +
>  include/scsi/scsi_host.h | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index fe3a0da3ec97..c75cebd7911d 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>  		shost->dma_boundary = 0xffffffff;
>  
>  	shost->use_blk_mq = scsi_use_blk_mq;
> +	shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
>  
>  	device_initialize(&shost->shost_gendev);
>  	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index a8b7bf879ced..4118760e5c32 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -452,6 +452,9 @@ struct scsi_host_template {
>  	/* True if the controller does not support WRITE SAME */
>  	unsigned no_write_same:1;
>  
> +	/* tell scsi core we support blk-mq only */
> +	unsigned force_blk_mq:1;
> +
>  	/*
>  	 * Countdown for host blocking with no commands outstanding.
>  	 */
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
  2018-02-03  4:21 ` [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity Ming Lei
@ 2018-02-05  6:57   ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> Now that commit 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all
> possible CPUs") has been merged into v4.16-rc, some irq vectors can easily
> end up being assigned only offline CPUs, and this can't be avoided even if
> the allocation algorithm is improved.
> 
> For example, on an 8-core VM where CPUs 4-7 are not present or offline and
> virtio-scsi has 4 queues, the assigned irq affinity can take the following
> shape:
> 
> 	irq 36, cpu list 0-7
> 	irq 37, cpu list 0-7
> 	irq 38, cpu list 0-7
> 	irq 39, cpu list 0-1
> 	irq 40, cpu list 4,6
> 	irq 41, cpu list 2-3
> 	irq 42, cpu list 5,7
> 
> An IO hang is then triggered in the non-SCSI_MQ (legacy) path.
> 
> Given that storage IO always follows a client/server model, this issue
> does not exist with SCSI_MQ (blk-mq), because no IO can be submitted to a
> hw queue that has no online CPUs mapped to it.
> 
> Fix this issue by forcing virtio-scsi to use blk-mq.
> 
> BTW, I have been using virtio-scsi (scsi_mq) for several years and it has
> been quite stable, so this change shouldn't add extra risk.
> 
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Arun Easi <arun.easi@cavium.com>
> Cc: Omar Sandoval <osandov@fb.com>,
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Peter Rivera <peter.rivera@broadcom.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/virtio_scsi.c | 59 +++-------------------------------------------
>  1 file changed, 3 insertions(+), 56 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer
  2018-02-03  4:21 ` [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer Ming Lei
@ 2018-02-05  6:57   ` Hannes Reinecke
  2018-02-08 17:34     ` Bart Van Assche
  1 sibling, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> This patch changes tags->breserved_tags, tags->bitmap_tags and
> tags->active_queues as pointer, and prepares for supporting global tags.
> 
> No functional change.
> 
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bfq-iosched.c    |  4 ++--
>  block/blk-mq-debugfs.c | 10 +++++-----
>  block/blk-mq-tag.c     | 48 ++++++++++++++++++++++++++----------------------
>  block/blk-mq-tag.h     | 10 +++++++---
>  block/blk-mq.c         |  2 +-
>  block/kyber-iosched.c  |  2 +-
>  6 files changed, 42 insertions(+), 34 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
                   ` (4 preceding siblings ...)
  2018-02-03  4:21 ` [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity Ming Lei
@ 2018-02-05  6:58 ` Hannes Reinecke
  2018-02-05  7:05     ` Kashyap Desai
  2018-02-05 10:23   ` Ming Lei
  5 siblings, 2 replies; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-05  6:58 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On 02/03/2018 05:21 AM, Ming Lei wrote:
> Hi All,
> 
> This patchset supports global tags which was started by Hannes originally:
> 
> 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> Also introduce 'force_blk_mq' to 'struct scsi_host_template', so that
> drivers can avoid supporting two IO paths (legacy and blk-mq), especially
> since recent discussion mentioned that SCSI_MQ will be enabled by default soon.
> 
> 	https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> 
> With the above two changes, it should be easier to convert SCSI drivers'
> reply queues into blk-mq hctxs, and then the automatic irq affinity issue
> can be solved easily; please see the detailed description in the commit logs.
> 
> Also, drivers may require requests to be completed on the submission CPU
> to avoid hard/soft lockups, which can be done easily with blk-mq too.
> 
> 	https://marc.info/?t=151601851400001&r=1&w=2
> 
> The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi so
> that the IO hang issue can be avoided in the legacy IO path; this issue is
> fairly generic, and at least HPSA and virtio-scsi are broken with v4.15+.
> 
> Thanks
> Ming
> 
> Ming Lei (5):
>   blk-mq: tags: define several fields of tags as pointer
>   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
>   block: null_blk: introduce module parameter of 'g_global_tags'
>   scsi: introduce force_blk_mq
>   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> 
>  block/bfq-iosched.c        |  4 +--
>  block/blk-mq-debugfs.c     | 11 ++++----
>  block/blk-mq-sched.c       |  2 +-
>  block/blk-mq-tag.c         | 67 +++++++++++++++++++++++++++++-----------------
>  block/blk-mq-tag.h         | 15 ++++++++---
>  block/blk-mq.c             | 31 ++++++++++++++++-----
>  block/blk-mq.h             |  3 ++-
>  block/kyber-iosched.c      |  2 +-
>  drivers/block/null_blk.c   |  6 +++++
>  drivers/scsi/hosts.c       |  1 +
>  drivers/scsi/virtio_scsi.c | 59 +++-------------------------------------
>  include/linux/blk-mq.h     |  2 ++
>  include/scsi/scsi_host.h   |  3 +++
>  13 files changed, 105 insertions(+), 101 deletions(-)
> 
Thanks Ming for picking this up.

I'll give it a shot and see how it behaves on other hardware.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-05  6:58 ` [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Hannes Reinecke
@ 2018-02-05  7:05     ` Kashyap Desai
  2018-02-05 10:23   ` Ming Lei
  1 sibling, 0 replies; 39+ messages in thread
From: Kashyap Desai @ 2018-02-05  7:05 UTC (permalink / raw)
  To: Hannes Reinecke, Ming Lei, Jens Axboe, linux-block,
	Christoph Hellwig, Mike Snitzer
  Cc: linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Peter Rivera,
	Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Hannes Reinecke [mailto:hare@suse.de]
> Sent: Monday, February 5, 2018 12:28 PM
> To: Ming Lei; Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig;
> Mike Snitzer
> Cc: linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; Martin K .
> Petersen;
> James Bottomley; Christoph Hellwig; Don Brace; Kashyap Desai; Peter
> Rivera;
> Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Hi All,
> >
> > This patchset supports global tags which was started by Hannes
> > originally:
> >
> > 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> >
> > Also introduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > driver can avoid supporting two IO paths (legacy and blk-mq),
> > especially recent discussion mentioned that SCSI_MQ will be enabled at
> > default soon.
> >
> > 	https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> >
> > With the above two changes, it should be easier to convert SCSI drivers'
> > reply queue into blk-mq's hctx, then the automatic irq affinity issue
> > can be solved easily, please see detailed description in commit log.
> >
> > Also drivers may require to complete request on the submission CPU for
> > avoiding hard/soft deadlock, which can be done easily with blk_mq too.
> >
> > 	https://marc.info/?t=151601851400001&r=1&w=2
> >
> > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > so that IO hang issue can be avoided inside legacy IO path, this issue
> > is a bit generic, at least HPSA/virtio-scsi are found broken with
> > v4.15+.
> >
> > Thanks
> > Ming
> >
> > Ming Lei (5):
> >   blk-mq: tags: define several fields of tags as pointer
> >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> >   block: null_blk: introduce module parameter of 'g_global_tags'
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> >
> >  block/bfq-iosched.c        |  4 +--
> >  block/blk-mq-debugfs.c     | 11 ++++----
> >  block/blk-mq-sched.c       |  2 +-
> >  block/blk-mq-tag.c         | 67
> > +++++++++++++++++++++++++++++-----------------
> >  block/blk-mq-tag.h         | 15 ++++++++---
> >  block/blk-mq.c             | 31 ++++++++++++++++-----
> >  block/blk-mq.h             |  3 ++-
> >  block/kyber-iosched.c      |  2 +-
> >  drivers/block/null_blk.c   |  6 +++++
> >  drivers/scsi/hosts.c       |  1 +
> >  drivers/scsi/virtio_scsi.c | 59
> > +++-------------------------------------
> >  include/linux/blk-mq.h     |  2 ++
> >  include/scsi/scsi_host.h   |  3 +++
> >  13 files changed, 105 insertions(+), 101 deletions(-)
> >
> Thanks Ming for picking this up.
>
> I'll give it a shot and see how it behaves on other hardware.

Ming -

There is no way we can enable global tags from the SCSI stack with this
patch series.   I still think this series has no solution for the issue
described below:
https://marc.info/?t=151601851400001&r=1&w=2

What we will be doing is just using a global tag HBA-wide instead of
per-hw-queue tags. We can still have more than one reply queue ending up
completing on one CPU.
Try reducing the MSI-x vector count of the megaraid_sas or mpt3sas driver
via a module parameter to simulate the issue. We need more online CPUs
than reply queues.
We may see completion redirected to the original CPU because of
"QUEUE_FLAG_SAME_FORCE", but the ISR of the low level driver can keep one
CPU busy in its local ISR routine.
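Kashyap's point can be illustrated with a toy model (a hypothetical Python sketch, not kernel code): even when rq_affinity=2 / QUEUE_FLAG_SAME_FORCE redirects the completion work back to the submitting CPU, the hard interrupt itself still fires on whichever CPU the interrupt controller picked, and that one CPU can stay pinned in the ISR.

```python
# Hypothetical model of rq_affinity=2 / QUEUE_FLAG_SAME_FORCE behaviour:
# completion *work* is steered back to the submitting CPU, but the hard
# IRQ always fires on the CPU the interrupt controller selected.
from collections import Counter

def complete_requests(submit_cpus, irq_cpu):
    irq_work = Counter()   # where the ISR itself runs
    soft_work = Counter()  # where the completion is processed
    for cpu in submit_cpus:
        irq_work[irq_cpu] += 1   # ISR always on the irq's CPU
        soft_work[cpu] += 1      # redirected to the submission CPU
    return irq_work, soft_work

# 8 requests submitted round-robin from CPUs 0-3, one reply queue on CPU 0:
irq_work, soft_work = complete_requests([0, 1, 2, 3, 0, 1, 2, 3], irq_cpu=0)
print(irq_work[0])        # 8: CPU 0 services every hard interrupt
print(sorted(soft_work))  # [0, 1, 2, 3]: completions spread back out
```

In this model CPU 0 services every hard interrupt no matter where the completion work is steered, which is the overload Kashyap describes.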


Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-05  7:05     ` Kashyap Desai
  (?)
@ 2018-02-05 10:17     ` Ming Lei
  2018-02-06  6:03       ` Kashyap Desai
  -1 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-05 10:17 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Mon, Feb 05, 2018 at 12:35:13PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Hannes Reinecke [mailto:hare@suse.de]
> > Sent: Monday, February 5, 2018 12:28 PM
> > To: Ming Lei; Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig;
> > Mike Snitzer
> > Cc: linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; Martin K .
> > Petersen;
> > James Bottomley; Christoph Hellwig; Don Brace; Kashyap Desai; Peter
> > Rivera;
> > Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On 02/03/2018 05:21 AM, Ming Lei wrote:
> > > Hi All,
> > >
> > > This patchset supports global tags which was started by Hannes
> > > originally:
> > >
> > > 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> > >
> > > Also introduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > > driver can avoid supporting two IO paths (legacy and blk-mq),
> > > especially recent discussion mentioned that SCSI_MQ will be enabled at
> > > default soon.
> > >
> > > 	https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> > >
> > > With the above two changes, it should be easier to convert SCSI drivers'
> > > reply queue into blk-mq's hctx, then the automatic irq affinity issue
> > > can be solved easily, please see detailed description in commit log.
> > >
> > > Also drivers may require to complete request on the submission CPU for
> > > avoiding hard/soft deadlock, which can be done easily with blk_mq too.
> > >
> > > 	https://marc.info/?t=151601851400001&r=1&w=2
> > >
> > > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > > so that IO hang issue can be avoided inside legacy IO path, this issue
> > > is a bit generic, at least HPSA/virtio-scsi are found broken with
> > > v4.15+.
> > >
> > > Thanks
> > > Ming
> > >
> > > Ming Lei (5):
> > >   blk-mq: tags: define several fields of tags as pointer
> > >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> > >   block: null_blk: introduce module parameter of 'g_global_tags'
> > >   scsi: introduce force_blk_mq
> > >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > >
> > >  block/bfq-iosched.c        |  4 +--
> > >  block/blk-mq-debugfs.c     | 11 ++++----
> > >  block/blk-mq-sched.c       |  2 +-
> > >  block/blk-mq-tag.c         | 67
> > > +++++++++++++++++++++++++++++-----------------
> > >  block/blk-mq-tag.h         | 15 ++++++++---
> > >  block/blk-mq.c             | 31 ++++++++++++++++-----
> > >  block/blk-mq.h             |  3 ++-
> > >  block/kyber-iosched.c      |  2 +-
> > >  drivers/block/null_blk.c   |  6 +++++
> > >  drivers/scsi/hosts.c       |  1 +
> > >  drivers/scsi/virtio_scsi.c | 59
> > > +++-------------------------------------
> > >  include/linux/blk-mq.h     |  2 ++
> > >  include/scsi/scsi_host.h   |  3 +++
> > >  13 files changed, 105 insertions(+), 101 deletions(-)
> > >
> > Thanks Ming for picking this up.
> >
> > I'll give it a shot and see how it behaves on other hardware.
> 
> Ming -
> 
> There is no way we can enable global tags from SCSI stack in this patch

It has been done in V2 of this patchset, which will be posted out after
HPSA's issue is fixed:

	https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Global tags will be enabled easily via .host_tagset of scsi_host_template.

> series.   I still think we have no solution for issue described below in
> this patch series.
> https://marc.info/?t=151601851400001&r=1&w=2
> 
> What we will be doing is just use global tag HBA wide instead of h/w queue
> based.

Right, that is just what the 1st three patches are doing.

> We still have more than one reply queue ending up completion one CPU.

pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, which means
smp_affinity_enable has to be set to 1, but that seems to be the default
setting.

Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors(),
which figures out an optimal number of vectors; the computation is
based on cpumask_weight(cpu_possible_mask) now. If some reply queues are
mapped only to offline CPUs, these queues won't be active (no requests are
submitted to them). The mechanism of PCI_IRQ_AFFINITY basically makes sure
that more than one irq vector won't be handled by one and the same CPU,
and the irq vector spread is done in irq_create_affinity_masks().
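The vector spread Ming refers to can be illustrated with a toy model (a simplified sketch, not the kernel's actual NUMA-aware algorithm in kernel/irq/affinity.c): the possible CPUs are split into contiguous chunks, one per vector, so a vector whose chunk contains only offline CPUs never sees submissions.

```python
# Simplified sketch of spreading irq vectors over *possible* CPUs; the
# real, NUMA-aware logic lives in irq_create_affinity_masks().
def spread_vectors(nr_vectors, possible_cpus):
    """Split the possible-CPU list into one contiguous chunk per vector."""
    n = len(possible_cpus)
    masks, start = [], 0
    for v in range(nr_vectors):
        cnt = n // nr_vectors + (1 if v < n % nr_vectors else 0)
        masks.append(possible_cpus[start:start + cnt])
        start += cnt
    return masks

# 4 vectors, 8 possible CPUs, but only CPUs 0-3 are online:
masks = spread_vectors(4, list(range(8)))
online = {0, 1, 2, 3}
for vec, cpus in enumerate(masks):
    print(f"vector {vec}: cpus {cpus} active={bool(online & set(cpus))}")
# vectors 2 and 3 map only offline CPUs, so they stay inactive
```

This is why, after commit 84676c1f21e8ff5, some reply queues can end up with no online CPU at all.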

> Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via module
> parameter to simulate the issue. We need more number of Online CPU than
> reply-queue.

IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
irq vector number, num_possible_cpus()/num_online_cpus() and each
irq vector's affinity assignment.

> We may see completion redirected to original CPU because of
> "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU busy
> in local ISR routine.

Could you dump each irq vector's affinity assignment of your megaraid
in your test?

And the following script can do it easily; the pci path
(the 1st column of lspci output) needs to be passed, such as: 00:1c.4.

#!/bin/sh
if [ $# -ge 1 ]; then
    PCID=$1
else
    PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7`
fi
# quote the glob so the shell doesn't expand it before find does
PCIP=`find /sys/devices -name "*$PCID" | grep pci`
IRQS=`ls $PCIP/msi_irqs`

echo "kernel version: "
uname -a

echo "PCI name is $PCID, dump its irq affinity:"
for IRQ in $IRQS; do
    CPUS=`cat /proc/irq/$IRQ/smp_affinity_list`
    # printf is portable; plain sh echo may not expand \t
    printf "\tirq %s, cpu list %s\n" "$IRQ" "$CPUS"
done


Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-05  6:58 ` [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Hannes Reinecke
  2018-02-05  7:05     ` Kashyap Desai
@ 2018-02-05 10:23   ` Ming Lei
  1 sibling, 0 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-05 10:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On Mon, Feb 05, 2018 at 07:58:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Hi All,
> > 
> > This patchset supports global tags which was started by Hannes originally:
> > 
> > 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> > 
> > Also introduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > driver can avoid supporting two IO paths (legacy and blk-mq), especially
> > recent discussion mentioned that SCSI_MQ will be enabled at default soon.
> > 
> > 	https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> > 
> > With the above two changes, it should be easier to convert SCSI drivers'
> > reply queue into blk-mq's hctx, then the automatic irq affinity issue can
> > be solved easily, please see detailed description in commit log.
> > 
> > Also drivers may require to complete request on the submission CPU
> > for avoiding hard/soft deadlock, which can be done easily with blk_mq
> > too.
> > 
> > 	https://marc.info/?t=151601851400001&r=1&w=2
> > 
> > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > so that IO hang issue can be avoided inside legacy IO path, this issue is
> > a bit generic, at least HPSA/virtio-scsi are found broken with v4.15+.
> > 
> > Thanks
> > Ming
> > 
> > Ming Lei (5):
> >   blk-mq: tags: define several fields of tags as pointer
> >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> >   block: null_blk: introduce module parameter of 'g_global_tags'
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > 
> >  block/bfq-iosched.c        |  4 +--
> >  block/blk-mq-debugfs.c     | 11 ++++----
> >  block/blk-mq-sched.c       |  2 +-
> >  block/blk-mq-tag.c         | 67 +++++++++++++++++++++++++++++-----------------
> >  block/blk-mq-tag.h         | 15 ++++++++---
> >  block/blk-mq.c             | 31 ++++++++++++++++-----
> >  block/blk-mq.h             |  3 ++-
> >  block/kyber-iosched.c      |  2 +-
> >  drivers/block/null_blk.c   |  6 +++++
> >  drivers/scsi/hosts.c       |  1 +
> >  drivers/scsi/virtio_scsi.c | 59 +++-------------------------------------
> >  include/linux/blk-mq.h     |  2 ++
> >  include/scsi/scsi_host.h   |  3 +++
> >  13 files changed, 105 insertions(+), 101 deletions(-)
> > 
> Thanks Ming for picking this up.
> 
> I'll give it a shot and see how it behaves on other hardware.

Hi Hannes,

Thanks for looking at it.

I am working on V2, which has fixed some issues and added your patch
'scsi: Add template flag host_tagset', but it causes an HPSA kernel
oops. Once that is fixed, I will post V2 out, then there will be one
real example of how to use global tags for converting reply queues
to blk-mq hctxs.

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  2018-02-05  6:54   ` Hannes Reinecke
@ 2018-02-05 10:35     ` Ming Lei
  0 siblings, 0 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-05 10:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Kashyap Desai,
	Peter Rivera, Paolo Bonzini, Laurence Oberman

On Mon, Feb 05, 2018 at 07:54:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ..) support multiple
> > reply queues, but tags are often HBA wide.
> > 
> > These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> > for automatic affinity assignment.
> > 
> > Now 84676c1f21e8ff5 (genirq/affinity: assign vectors to all possible CPUs)
> > has been merged to v4.16-rc, and it is easy for some irq vectors to be
> > allocated only offline CPUs; this can't be avoided even though the
> > allocation is improved.
> > 
> > So all these drivers have to avoid asking the HBA to complete requests
> > on a reply queue which has no online CPUs assigned, and HPSA has been
> > broken with v4.15+:
> > 
> > 	https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> > 
> > This issue can be solved generically and easily via blk-mq (scsi-mq)
> > multiple hw queues by mapping each reply queue to a hctx, but one tricky
> > thing is the HBA-wide (instead of hw-queue-wide) tags.
> > 
> > This patch is based on the following Hannes's patch:
> > 
> > 	https://marc.info/?l=linux-block&m=149132580511346&w=2
> > 
> > One big difference from Hannes's is that this patch only makes the tags
> > sbitmap and active_queues data structures HBA wide; the others keep
> > their NUMA locality, such as request, hctx, tags, ...
> > 
> > The following patch will support global tags on null_blk, also the performance
> > data is provided, no obvious performance loss is observed when the whole
> > hw queue depth is same.
> > 
> > Cc: Hannes Reinecke <hare@suse.de>
> > Cc: Arun Easi <arun.easi@cavium.com>
> > Cc: Omar Sandoval <osandov@fb.com>,
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> > Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: Don Brace <don.brace@microsemi.com>
> > Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> > Cc: Peter Rivera <peter.rivera@broadcom.com>
> > Cc: Laurence Oberman <loberman@redhat.com>
> > Cc: Mike Snitzer <snitzer@redhat.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-debugfs.c |  1 +
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c     | 23 ++++++++++++++++++-----
> >  block/blk-mq-tag.h     |  5 ++++-
> >  block/blk-mq.c         | 29 ++++++++++++++++++++++++-----
> >  block/blk-mq.h         |  3 ++-
> >  include/linux/blk-mq.h |  2 ++
> >  7 files changed, 52 insertions(+), 13 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 0dfafa4b655a..0f0fafe03f5d 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
> >  	HCTX_FLAG_NAME(SHOULD_MERGE),
> >  	HCTX_FLAG_NAME(TAG_SHARED),
> >  	HCTX_FLAG_NAME(SG_MERGE),
> > +	HCTX_FLAG_NAME(GLOBAL_TAGS),
> >  	HCTX_FLAG_NAME(BLOCKING),
> >  	HCTX_FLAG_NAME(NO_SCHED),
> >  };
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55c0a745b427..191d4bc95f0e 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
> >  	int ret;
> >  
> >  	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
> > -					       set->reserved_tags);
> > +					       set->reserved_tags, false);
> >  	if (!hctx->sched_tags)
> >  		return -ENOMEM;
> >  
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 571797dc36cb..66377d09eaeb 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
> >  	return NULL;
> >  }
> >  
> > -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> > +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> > +				     unsigned int total_tags,
> >  				     unsigned int reserved_tags,
> > -				     int node, int alloc_policy)
> > +				     int node, int alloc_policy,
> > +				     bool global_tag)
> >  {
> >  	struct blk_mq_tags *tags;
> >  
> > @@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> >  	tags->nr_tags = total_tags;
> >  	tags->nr_reserved_tags = reserved_tags;
> >  
> > +	WARN_ON(global_tag && !set->global_tags);
> > +	if (global_tag && set->global_tags) {
> > +		tags->bitmap_tags = set->global_tags->bitmap_tags;
> > +		tags->breserved_tags = set->global_tags->breserved_tags;
> > +		tags->active_queues = set->global_tags->active_queues;
> > +		tags->global_tags = true;
> > +		return tags;
> > +	}
> >  	tags->bitmap_tags = &tags->__bitmap_tags;
> >  	tags->breserved_tags = &tags->__breserved_tags;
> >  	tags->active_queues = &tags->__active_queues;
> Do we really need the 'global_tag' flag here?
> Can't we just rely on the ->global_tags pointer to be set?

The same code is used for creating scheduler tags too, so the parameter
of 'global_tag' has to be introduced.

Thanks,
Ming
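What the patch does can be modeled in a few lines (a hypothetical Python sketch of the idea, not the kernel data structures): every per-hctx tags object points at one shared bitmap, so allocations from any hw queue draw from the same HBA-wide pool, mirroring how tags->bitmap_tags is pointed at set->global_tags->bitmap_tags above.

```python
# Hypothetical sketch of HBA-wide ("global") tags: each per-hctx Tags
# object shares one bitmap, so the tag space is HBA wide while the Tags
# structs themselves stay per hw queue (keeping NUMA locality).
class Bitmap:
    def __init__(self, depth):
        self.free = set(range(depth))
    def get(self):
        return self.free.pop() if self.free else None
    def put(self, tag):
        self.free.add(tag)

class Tags:
    def __init__(self, shared_bitmap):
        self.bitmap_tags = shared_bitmap  # shared across all hctxs

shared = Bitmap(depth=4)
hctx_tags = [Tags(shared) for _ in range(2)]  # two "hw queues"

t0 = hctx_tags[0].bitmap_tags.get()
t1 = hctx_tags[1].bitmap_tags.get()
print(t0 != t1, len(shared.free))  # True 2: both came from one pool
```

Allocating from either "hw queue" shrinks the same pool, which is exactly the HBA-wide tag space property the flag enables.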

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-05 10:17     ` Ming Lei
@ 2018-02-06  6:03       ` Kashyap Desai
  2018-02-06  8:04         ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Kashyap Desai @ 2018-02-06  6:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> > We still have more than one reply queue ending up completion one CPU.
>
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> smp_affinity_enable has to be set as 1, but seems it is the default
> setting.
>
> Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
> which figures out an optimal number of vectors, and the computation is
> based on cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> mapped to some of reply queues, these queues won't be active(no request
> submitted to these queues). The mechanism of PCI_IRQ_AFFINITY basically
> makes sure that more than one irq vector won't be handled by one same
> CPU, and the irq vector spread is done in irq_create_affinity_masks().
>
> > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > module parameter to simulate the issue. We need more number of Online
> > CPU than reply-queue.
>
> IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
> irq vector number, num_possible_cpus()/num_online_cpus() and each irq
> vector's affinity assignment.
>
> > We may see completion redirected to original CPU because of
> > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU
> > busy in local ISR routine.
>
> Could you dump each irq vector's affinity assignment of your megaraid in
> your test?

To quickly reproduce, I restricted megaraid_sas to a single MSI-x vector.
The system has a total of 16 online CPUs.

Output of affinity hints.
kernel version:
Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
x86_64 GNU/Linux
PCI name is 83:00.0, dump its irq affinity:
irq 105, cpu list 0-3,8-11

Affinity mask is created properly, but only CPU-0 is overloaded with
interrupt processing.

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 47861 MB
node 0 free: 46516 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 64491 MB
node 1 free: 62805 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Output of  system activities (sar).  (gnice is 100% and it is consumed in
megaraid_sas ISR routine.)


12:44:40 PM  CPU  %usr  %nice  %sys   %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
12:44:41 PM  all  6.03  0.00   29.98  0.00     0.00    0.00  0.00   0.00    0.00    63.99
12:44:41 PM  0    0.00  0.00   0.00   0.00     0.00    0.00  0.00   0.00    100.00  0.00


In my test, rq_affinity is set to 2 (QUEUE_FLAG_SAME_FORCE). I also used
the "host_tagset" V2 patch set for megaraid_sas.

Using the RFC requested in
"https://marc.info/?l=linux-scsi&m=151601833418346&w=2", the lockup is
avoided (you can notice that gnice is shifted to softirq). Even though it
is 100% consumed, there is always an exit from the completion loop due to
the irqpoll_weight passed to irq_poll_init().

Average:     CPU  %usr  %nice  %sys   %iowait  %steal  %irq  %soft   %guest  %gnice  %idle
Average:     all  4.25  0.00   21.61  0.00     0.00    0.00  6.61    0.00    0.00    67.54
Average:     0    0.00  0.00   0.00   0.00     0.00    0.00  100.00  0.00    0.00    0.00


Hope this clarifies. We need a different fix to avoid lockups. Can we
consider using the irq poll interface if the number of CPUs is more than
the number of reply queues/MSI-x vectors?
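The irq poll idea can be sketched as a budgeted completion loop (illustrative only, not the kernel's lib/irq_poll.c implementation): capping the work done per poll guarantees an exit from the loop, so one CPU is never monopolized by completion processing.

```python
# Illustrative budgeted completion loop in the spirit of irq_poll: at
# most `budget` completions are processed before control is yielded back.
def poll_completions(queue, budget):
    done = 0
    while queue and done < budget:
        queue.pop(0)   # "complete" one request
        done += 1
    # when the queue drains, the real code would re-enable the interrupt
    return done

pending = list(range(10))
handled = poll_completions(pending, budget=4)
print(handled, len(pending))  # 4 completed, 6 still pending
```

The leftover work is picked up on the next poll, which is what shifts the 100% load from hard-irq (gnice) context into softirq context in the sar output above.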

>
> And the following script can do it easily, and the pci path (the 1st
> column of lspci output) need to be passed, such as: 00:1c.4,
>
> #!/bin/sh
> if [ $# -ge 1 ]; then
>     PCID=$1
> else
>     PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7`
> fi
> PCIP=`find /sys/devices -name *$PCID | grep pci`
> IRQS=`ls $PCIP/msi_irqs`
>
> echo "kernel version: "
> uname -a
>
> echo "PCI name is $PCID, dump its irq affinity:"
> for IRQ in $IRQS; do
>     CPUS=`cat /proc/irq/$IRQ/smp_affinity_list`
>     echo "\tirq $IRQ, cpu list $CPUS"
> done
>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06  6:03       ` Kashyap Desai
@ 2018-02-06  8:04         ` Ming Lei
  2018-02-06 11:29           ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-06  8:04 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > We still have more than one reply queue ending up completion one CPU.
> >
> > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > smp_affinity_enable has to be set as 1, but seems it is the default
> > setting.
> >
> > Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
> > which figures out an optimal number of vectors, and the computation is
> > based on cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > mapped to some of reply queues, these queues won't be active(no request
> > submitted to these queues). The mechanism of PCI_IRQ_AFFINITY basically
> > makes sure that more than one irq vector won't be handled by one same
> > CPU, and the irq vector spread is done in irq_create_affinity_masks().
> >
> > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > module parameter to simulate the issue. We need more number of Online
> > > CPU than reply-queue.
> >
> > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
> > irq vector number, num_possible_cpus()/num_online_cpus() and each irq
> > vector's affinity assignment.
> >
> > > We may see completion redirected to original CPU because of
> > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU
> > > busy in local ISR routine.
> >
> > Could you dump each irq vector's affinity assignment of your megaraid in
> > your test?
> 
> To quickly reproduce, I restricted to single MSI-x vector on megaraid_sas
> driver.  System has total 16 online CPUs.

I suggest you don't restrict the driver to a single MSI-x vector, and
just use the device's supported number of MSI-x vectors.

> 
> Output of affinity hints.
> kernel version:
> Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
> x86_64 GNU/Linux
> PCI name is 83:00.0, dump its irq affinity:
> irq 105, cpu list 0-3,8-11

In this case, which CPU is selected for handling the interrupt is
decided by the interrupt controller, and it is easy to overload a CPU
if the interrupt controller always selects the same CPU to handle the irq.

> 
> Affinity mask is created properly, but only CPU-0 is overloaded with
> interrupt processing.
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 8 9 10 11
> node 0 size: 47861 MB
> node 0 free: 46516 MB
> node 1 cpus: 4 5 6 7 12 13 14 15
> node 1 size: 64491 MB
> node 1 free: 62805 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Output of  system activities (sar).  (gnice is 100% and it is consumed in
> megaraid_sas ISR routine.)
> 
> 
> 12:44:40 PM  CPU  %usr  %nice  %sys   %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
> 12:44:41 PM  all  6.03  0.00   29.98  0.00     0.00    0.00  0.00   0.00    0.00    63.99
> 12:44:41 PM  0    0.00  0.00   0.00   0.00     0.00    0.00  0.00   0.00    100.00  0.00
> 
> 
> In my test, I used rq_affinity is set to 2. (QUEUE_FLAG_SAME_FORCE). I
> also used " host_tagset" V2 patch set for megaraid_sas.
> 
> Using RFC requested in -
> "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is avoided
> (you can noticed that gnice is shifted to softirq. Even though it is 100%
> consumed, There is always exit for existing completion loop due to
> irqpoll_weight  @irq_poll_init().
> 
> Average:        CPU      %usr     %nice      %sys   %iowait    %steal
> %irq     %soft    %guest    %gnice     %idle
> Average:        all          4.25      0.00        21.61      0.00
> 0.00      0.00         6.61           0.00      0.00     67.54
> Average:          0           0.00      0.00         0.00      0.00
> 0.00      0.00       100.00        0.00      0.00      0.00
> 
> 
> Hope this clarifies. We need different fix to avoid lockups. Can we
> consider using irq poll interface if #CPU is more than Reply queue/MSI-x.
> ?

Please use the device's supported number of MSI-x vectors, and see if
the issue is still there. If it is, you can use irq poll too, which isn't
contradictory with the blk-mq approach taken by this patchset.
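
The irq-poll behavior discussed above (a budgeted completion loop with a guaranteed exit at irqpoll_weight) can be sketched as a toy model; this is not the kernel's irq_poll code, just the budget idea:

```python
def irq_poll_drain(pending, weight):
    """Drain `pending` completions, at most `weight` per poll
    invocation; returns the number of poll rounds used.  Between
    rounds the softirq returns, so one CPU is never stuck in a
    single unbounded completion loop."""
    rounds = 0
    while pending:
        pending -= min(pending, weight)  # budgeted batch
        rounds += 1                      # yield point between batches
    return rounds

print(irq_poll_drain(100, 32))  # → 4
```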

Hope this clarifies too, :-)


Thanks, 
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06  8:04         ` Ming Lei
@ 2018-02-06 11:29           ` Kashyap Desai
  2018-02-06 12:31             ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Kashyap Desai @ 2018-02-06 11:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Tuesday, February 6, 2018 1:35 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > We still have more than one reply queue ending up completion one
CPU.
> > >
> > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > > smp_affinity_enable has to be set as 1, but seems it is the default
> > setting.
> > >
> > > Please see kernel/irq/affinity.c, especially
> > > irq_calc_affinity_vectors()
> > which
> > > figures out an optimal number of vectors, and the computation is
> > > based
> > on
> > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > mapped to some of reply queues, these queues won't be active(no
> > > request submitted
> > to
> > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > sure
> > that
> > > more than one irq vector won't be handled by one same CPU, and the
> > > irq vector spread is done in irq_create_affinity_masks().
> > >
> > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > > module parameter to simulate the issue. We need more number of
> > > > Online CPU than reply-queue.
> > >
> > > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > returned
> > irq
> > > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > > vector's affinity assignment.
> > >
> > > > We may see completion redirected to original CPU because of
> > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one
> > > > CPU busy in local ISR routine.
> > >
> > > Could you dump each irq vector's affinity assignment of your
> > > megaraid in
> > your
> > > test?
> >
> > To quickly reproduce, I restricted to single MSI-x vector on
> > megaraid_sas driver.  System has total 16 online CPUs.
>
> I suggest you don't do the restriction of single MSI-x vector, and just
use the
> device supported number of msi-x vector.

Hi Ming, CPU lockup is seen even when it is not a single MSI-x vector.
The actual scenario needs a specific topology and server for an overnight
test. The issue can be seen on servers which have more than 16 logical CPUs
and a Thunderbolt series MR controller, which supports at most 16 MSI-x
vectors.
>
> >
> > Output of affinity hints.
> > kernel version:
> > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64
> > x86_64
> > x86_64 GNU/Linux
> > PCI name is 83:00.0, dump its irq affinity:
> > irq 105, cpu list 0-3,8-11
>
> In this case, which CPU is selected for handling the interrupt is
decided by
> interrupt controller, and it is easy to cause CPU overload if interrupt
controller
> always selects one same CPU to handle the irq.
>
> >
> > Affinity mask is created properly, but only CPU-0 is overloaded with
> > interrupt processing.
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 8 9 10 11
> > node 0 size: 47861 MB
> > node 0 free: 46516 MB
> > node 1 cpus: 4 5 6 7 12 13 14 15
> > node 1 size: 64491 MB
> > node 1 free: 62805 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > Output of  system activities (sar).  (gnice is 100% and it is consumed
> > in megaraid_sas ISR routine.)
> >
> >
> > 12:44:40 PM     CPU      %usr     %nice      %sys   %iowait    %steal
> > %irq     %soft    %guest    %gnice     %idle
> > 12:44:41 PM     all         6.03      0.00        29.98      0.00
> > 0.00         0.00        0.00        0.00        0.00         63.99
> > 12:44:41 PM       0         0.00      0.00         0.00        0.00
> > 0.00         0.00        0.00        0.00       100.00         0
> >
> >
> > In my test, I used rq_affinity is set to 2. (QUEUE_FLAG_SAME_FORCE). I
> > also used " host_tagset" V2 patch set for megaraid_sas.
> >
> > Using RFC requested in -
> > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is
> > avoided (you can noticed that gnice is shifted to softirq. Even though
> > it is 100% consumed, There is always exit for existing completion loop
> > due to irqpoll_weight  @irq_poll_init().
> >
> > Average:        CPU      %usr     %nice      %sys   %iowait    %steal
> > %irq     %soft    %guest    %gnice     %idle
> > Average:        all          4.25      0.00        21.61      0.00
> > 0.00      0.00         6.61           0.00      0.00     67.54
> > Average:          0           0.00      0.00         0.00      0.00
> > 0.00      0.00       100.00        0.00      0.00      0.00
> >
> >
> > Hope this clarifies. We need different fix to avoid lockups. Can we
> > consider using irq poll interface if #CPU is more than Reply
queue/MSI-x.
> > ?
>
> Please use the device's supported msi-x vectors number, and see if there
is this
> issue. If there is, you can use irq poll too, which isn't contradictory
with the
> blk-mq approach taken by this patchset.

The device-supported scenario needs more time to reproduce; the quicker
method is to just use a single MSI-x vector and try to create the
worst-case IO completion loop.
Using irq poll, my test runs without any CPU lockup. I tried your latest V2
series as well, and it behaves the same.

BTW - I am seeing a drastic performance drop using the V2 patch series on
megaraid_sas. Those who are testing HPSA can also verify whether that is
generic behavior.
See the perf top data below. "bt_iter" is consuming 4 times more CPU.

    35.94%  [kernel]        [k] bt_iter
     8.41%  [kernel]        [k] blk_mq_queue_tag_busy_iter
     6.37%  [kernel]        [k] _find_next_bit
     4.19%  [kernel]        [k] native_queued_spin_lock_slowpath
     1.81%  [kernel]        [k] sbitmap_any_bit_set
     1.46%  [megaraid_sas]  [k] megasas_build_io_fusion
     1.26%  [megaraid_sas]  [k] complete_cmd_fusion

Without V2 patch -

     7.90%  [kernel]        [k] bt_iter
     4.96%  [megaraid_sas]  [k] megasas_build_io_fusion
     3.72%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     3.09%  [megaraid_sas]  [k] complete_cmd_fusion
     2.59%  [kernel]        [k] scsi_softirq_done
     2.11%  [kernel]        [k] switch_mm_irqs_off
     1.79%  [kernel]        [k] _raw_spin_lock
     1.68%  [kernel]        [k] _find_next_bit
     1.52%  [kernel]        [k] blk_mq_free_request
     1.46%  [kernel]        [k] scsi_queue_rq
     1.39%  [megaraid_sas]  [k] megasas_queue_command
     1.36%  [megaraid_sas]  [k] megasas_build_ldio_fusion


numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 31555 MB
node 0 free: 28781 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 32229 MB
node 1 free: 29419 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10


IOPS drop by more than half if I use the host-wide shared tagging V2
patch set.
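
One plausible explanation for "bt_iter" climbing in the profile (an assumption on my part, not confirmed in this thread): with a host-wide tag set and nr_hw_queues > 1, a busy-tag iterator such as blk_mq_queue_tag_busy_iter() walks the full host-wide tag space once per hw queue, so the scan cost grows with the hctx count. A toy cost model:

```python
def busy_iter_bits(nr_hw_queues, host_depth):
    """Tag bits examined when a busy-tag iterator walks the
    host-wide tag space once for every hw queue."""
    return nr_hw_queues * host_depth

# one hctx (the pre-patch shape) vs. 16 hctxs sharing host-wide tags
print(busy_iter_bits(1, 1024), busy_iter_bits(16, 1024))  # → 1024 16384
```

If that model holds, iterating the shared tags once instead of once per hctx would recover the cost without abandoning host-wide tags.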


>
> Hope I clarifies too, :-)
>
>
> Thanks,
> Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06 11:29           ` Kashyap Desai
@ 2018-02-06 12:31             ` Ming Lei
  2018-02-06 14:27               ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-06 12:31 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Tuesday, February 6, 2018 1:35 PM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > Hi Kashyap,
> >
> > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > We still have more than one reply queue ending up completion one
> CPU.
> > > >
> > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > > > smp_affinity_enable has to be set as 1, but seems it is the default
> > > setting.
> > > >
> > > > Please see kernel/irq/affinity.c, especially
> > > > irq_calc_affinity_vectors()
> > > which
> > > > figures out an optimal number of vectors, and the computation is
> > > > based
> > > on
> > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > mapped to some of reply queues, these queues won't be active(no
> > > > request submitted
> > > to
> > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > sure
> > > that
> > > > more than one irq vector won't be handled by one same CPU, and the
> > > > irq vector spread is done in irq_create_affinity_masks().
> > > >
> > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > > > module parameter to simulate the issue. We need more number of
> > > > > Online CPU than reply-queue.
> > > >
> > > > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > returned
> > > irq
> > > > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > > > vector's affinity assignment.
> > > >
> > > > > We may see completion redirected to original CPU because of
> > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one
> > > > > CPU busy in local ISR routine.
> > > >
> > > > Could you dump each irq vector's affinity assignment of your
> > > > megaraid in
> > > your
> > > > test?
> > >
> > > To quickly reproduce, I restricted to single MSI-x vector on
> > > megaraid_sas driver.  System has total 16 online CPUs.
> >
> > I suggest you don't do the restriction of single MSI-x vector, and just
> use the
> > device supported number of msi-x vector.
> 
> Hi Ming,  CPU lock up is seen even though it is not single msi-x vector.
> Actual scenario need some specific topology and server for overnight test.
> Issue can be seen on servers which has more than 16 logical CPUs and
> Thunderbolt series MR controller which supports at max 16 MSIx vectors.
> >
> > >
> > > Output of affinity hints.
> > > kernel version:
> > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64
> > > x86_64
> > > x86_64 GNU/Linux
> > > PCI name is 83:00.0, dump its irq affinity:
> > > irq 105, cpu list 0-3,8-11
> >
> > In this case, which CPU is selected for handling the interrupt is
> decided by
> > interrupt controller, and it is easy to cause CPU overload if interrupt
> controller
> > always selects one same CPU to handle the irq.
> >
> > >
> > > Affinity mask is created properly, but only CPU-0 is overloaded with
> > > interrupt processing.
> > >
> > > # numactl --hardware
> > > available: 2 nodes (0-1)
> > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > node 0 size: 47861 MB
> > > node 0 free: 46516 MB
> > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > node 1 size: 64491 MB
> > > node 1 free: 62805 MB
> > > node distances:
> > > node   0   1
> > >   0:  10  21
> > >   1:  21  10
> > >
> > > Output of  system activities (sar).  (gnice is 100% and it is consumed
> > > in megaraid_sas ISR routine.)
> > >
> > >
> > > 12:44:40 PM     CPU      %usr     %nice      %sys   %iowait    %steal
> > > %irq     %soft    %guest    %gnice     %idle
> > > 12:44:41 PM     all         6.03      0.00        29.98      0.00
> > > 0.00         0.00        0.00        0.00        0.00         63.99
> > > 12:44:41 PM       0         0.00      0.00         0.00        0.00
> > > 0.00         0.00        0.00        0.00       100.00         0
> > >
> > >
> > > In my test, I used rq_affinity is set to 2. (QUEUE_FLAG_SAME_FORCE). I
> > > also used " host_tagset" V2 patch set for megaraid_sas.
> > >
> > > Using RFC requested in -
> > > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is
> > > avoided (you can noticed that gnice is shifted to softirq. Even though
> > > it is 100% consumed, There is always exit for existing completion loop
> > > due to irqpoll_weight  @irq_poll_init().
> > >
> > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal
> > > %irq     %soft    %guest    %gnice     %idle
> > > Average:        all          4.25      0.00        21.61      0.00
> > > 0.00      0.00         6.61           0.00      0.00     67.54
> > > Average:          0           0.00      0.00         0.00      0.00
> > > 0.00      0.00       100.00        0.00      0.00      0.00
> > >
> > >
> > > Hope this clarifies. We need different fix to avoid lockups. Can we
> > > consider using irq poll interface if #CPU is more than Reply
> queue/MSI-x.
> > > ?
> >
> > Please use the device's supported msi-x vectors number, and see if there
> is this
> > issue. If there is, you can use irq poll too, which isn't contradictory
> with the
> > blk-mq approach taken by this patchset.
> 
> Device supported scenario need more time to reproduce, but it is more
> quick method is to just use single MSI-x vector and try to create worst
> case IO completion loop.
> Using irq poll, my test run without any CPU lockup. I tried your latest V2
> series as well and that is also behaving the same.

Again, you can use irq poll, which isn't contradictory with blk-mq.

> 
> BTW - I am seeing drastically performance drop using V2 series of patch on
> megaraid_sas. Those who is testing HPSA, can also verify if that is a
> generic behavior.

OK, I will see if I can find a megaraid_sas to look into the performance
drop issue. If I can't, I will try to run a performance test on HPSA.

Could you share your patch for enabling global_tags/MQ on megaraid_sas so
that I can reproduce your test?

> See below perf top data. "bt_iter" is consuming 4 times more CPU.

Could you share what the IOPS/CPU utilization effect is after
applying the V2 patch? And your test script?

In theory, it shouldn't hurt, because the HBA only supports HBA-wide tags;
that means the allocation has to share an HBA-wide sbitmap no matter
whether global tags is used or not.
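
The HBA-wide sharing described here can be illustrated with a toy tag pool (hypothetical names; the real code uses a shared sbitmap): every hw queue allocates from the same pool, so in-flight commands are bounded by the host depth whether or not the tags are presented as global:

```python
class HostWideTags:
    """Toy model of an HBA-wide tag pool shared by all hw queues."""
    def __init__(self, depth):
        self.free = list(range(depth))

    def get(self, hctx_idx):
        # every hctx draws from the one pool; hctx_idx doesn't matter
        return self.free.pop() if self.free else None

    def put(self, tag):
        self.free.append(tag)

pool = HostWideTags(depth=8)
tags = [pool.get(h % 4) for h in range(12)]  # 4 hctxs over-allocating
print(sum(t is not None for t in tags))      # → 8
```

Only the host depth's worth of allocations succeed, regardless of how many hctxs are asking.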

Anyway, I will take a look at the performance test and data.


Thanks,
Ming


* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06 12:31             ` Ming Lei
@ 2018-02-06 14:27               ` Kashyap Desai
  2018-02-06 15:46                 ` Ming Lei
  2018-02-07  6:50                 ` Hannes Reinecke
  0 siblings, 2 replies; 39+ messages in thread
From: Kashyap Desai @ 2018-02-06 14:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Tuesday, February 6, 2018 6:02 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Tuesday, February 6, 2018 1:35 PM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > Hi Kashyap,
> > >
> > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > > We still have more than one reply queue ending up completion
> > > > > > one
> > CPU.
> > > > >
> > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that
> > > > > means smp_affinity_enable has to be set as 1, but seems it is
> > > > > the default
> > > > setting.
> > > > >
> > > > > Please see kernel/irq/affinity.c, especially
> > > > > irq_calc_affinity_vectors()
> > > > which
> > > > > figures out an optimal number of vectors, and the computation is
> > > > > based
> > > > on
> > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > > mapped to some of reply queues, these queues won't be active(no
> > > > > request submitted
> > > > to
> > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > > sure
> > > > that
> > > > > more than one irq vector won't be handled by one same CPU, and
> > > > > the irq vector spread is done in irq_create_affinity_masks().
> > > > >
> > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver
> > > > > > via module parameter to simulate the issue. We need more
> > > > > > number of Online CPU than reply-queue.
> > > > >
> > > > > IMO, you don't need to simulate the issue,
> > > > > pci_alloc_irq_vectors(
> > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > > returned
> > > > irq
> > > > > vector number, num_possible_cpus()/num_online_cpus() and each
> > > > > irq vector's affinity assignment.
> > > > >
> > > > > > We may see completion redirected to original CPU because of
> > > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep
> > > > > > one CPU busy in local ISR routine.
> > > > >
> > > > > Could you dump each irq vector's affinity assignment of your
> > > > > megaraid in
> > > > your
> > > > > test?
> > > >
> > > > To quickly reproduce, I restricted to single MSI-x vector on
> > > > megaraid_sas driver.  System has total 16 online CPUs.
> > >
> > > I suggest you don't do the restriction of single MSI-x vector, and
> > > just
> > use the
> > > device supported number of msi-x vector.
> >
> > Hi Ming,  CPU lock up is seen even though it is not single msi-x
vector.
> > Actual scenario need some specific topology and server for overnight
test.
> > Issue can be seen on servers which has more than 16 logical CPUs and
> > Thunderbolt series MR controller which supports at max 16 MSIx
vectors.
> > >
> > > >
> > > > Output of affinity hints.
> > > > kernel version:
> > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018
> > > > x86_64
> > > > x86_64
> > > > x86_64 GNU/Linux
> > > > PCI name is 83:00.0, dump its irq affinity:
> > > > irq 105, cpu list 0-3,8-11
> > >
> > > In this case, which CPU is selected for handling the interrupt is
> > decided by
> > > interrupt controller, and it is easy to cause CPU overload if
> > > interrupt
> > controller
> > > always selects one same CPU to handle the irq.
> > >
> > > >
> > > > Affinity mask is created properly, but only CPU-0 is overloaded
> > > > with interrupt processing.
> > > >
> > > > # numactl --hardware
> > > > available: 2 nodes (0-1)
> > > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > > node 0 size: 47861 MB
> > > > node 0 free: 46516 MB
> > > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > > node 1 size: 64491 MB
> > > > node 1 free: 62805 MB
> > > > node distances:
> > > > node   0   1
> > > >   0:  10  21
> > > >   1:  21  10
> > > >
> > > > Output of  system activities (sar).  (gnice is 100% and it is
> > > > consumed in megaraid_sas ISR routine.)
> > > >
> > > >
> > > > 12:44:40 PM     CPU      %usr     %nice      %sys   %iowait
%steal
> > > > %irq     %soft    %guest    %gnice     %idle
> > > > 12:44:41 PM     all         6.03      0.00        29.98      0.00
> > > > 0.00         0.00        0.00        0.00        0.00
63.99
> > > > 12:44:41 PM       0         0.00      0.00         0.00
0.00
> > > > 0.00         0.00        0.00        0.00       100.00         0
> > > >
> > > >
> > > > In my test, I used rq_affinity is set to 2.
> > > > (QUEUE_FLAG_SAME_FORCE). I also used " host_tagset" V2 patch set
for
> megaraid_sas.
> > > >
> > > > Using RFC requested in -
> > > > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is
> > > > avoided (you can noticed that gnice is shifted to softirq. Even
> > > > though it is 100% consumed, There is always exit for existing
> > > > completion loop due to irqpoll_weight  @irq_poll_init().
> > > >
> > > > Average:        CPU      %usr     %nice      %sys   %iowait
%steal
> > > > %irq     %soft    %guest    %gnice     %idle
> > > > Average:        all          4.25      0.00        21.61      0.00
> > > > 0.00      0.00         6.61           0.00      0.00     67.54
> > > > Average:          0           0.00      0.00         0.00
0.00
> > > > 0.00      0.00       100.00        0.00      0.00      0.00
> > > >
> > > >
> > > > Hope this clarifies. We need different fix to avoid lockups. Can
> > > > we consider using irq poll interface if #CPU is more than Reply
> > queue/MSI-x.
> > > > ?
> > >
> > > Please use the device's supported msi-x vectors number, and see if
> > > there
> > is this
> > > issue. If there is, you can use irq poll too, which isn't
> > > contradictory
> > with the
> > > blk-mq approach taken by this patchset.
> >
> > Device supported scenario need more time to reproduce, but it is more
> > quick method is to just use single MSI-x vector and try to create
> > worst case IO completion loop.
> > Using irq poll, my test run without any CPU lockup. I tried your
> > latest V2 series as well and that is also behaving the same.
>
> Again, you can use irq poll, which isn't contradictory with blk-mq.

Just wanted to explain that the CPU lockup issue is different. Thanks
for the clarification.
>
> >
> > BTW - I am seeing drastically performance drop using V2 series of
> > patch on megaraid_sas. Those who is testing HPSA, can also verify if
> > that is a generic behavior.
>
> OK, I will see if I can find a megaraid_sas to see the performance drop
issue. If I
> can't, I will try to run performance test on HPSA.

Patch is appended.

>
> Could you share us your patch for enabling global_tags/MQ on
megaraid_sas
> so that I can reproduce your test?
>
> > See below perf top data. "bt_iter" is consuming 4 times more CPU.
>
> Could you share us what the IOPS/CPU utilization effect is after
applying the
> patch V2? And your test script?
Regarding CPU utilization, I need to test one more time. Currently the
system is in use.

I ran the below fio test on a total of 24 SSDs attached via expander.

numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
--ioengine=libaio

Performance dropped from 1.6M IOPS to 770K IOPS.

>
> In theory, it shouldn't, because the HBA only supports HBA wide tags,
that
> means the allocation has to share a HBA wide sbitmap no matter if global
tags
> is used or not.
>
> Anyway, I will take a look at the performance test and data.
>
>
> Thanks,
> Ming


megaraid_sas version of the shared tag set patch:


diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 0f1d88f..75ea86b 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -50,6 +50,7 @@
 #include <linux/mutex.h>
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq-pci.h>

 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -220,6 +221,15 @@ static int megasas_get_ld_vf_affiliation(struct megasas_instance *instance,
 static inline void
 megasas_init_ctrl_params(struct megasas_instance *instance);

+
+static int megaraid_sas_map_queues(struct Scsi_Host *shost)
+{
+	struct megasas_instance *instance;
+	instance = (struct megasas_instance *)shost->hostdata;
+
+	return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev);
+}
+
 /**
  * megasas_set_dma_settings -	Populate DMA address, length and flags for DCMDs
  * @instance:			Adapter soft state
@@ -3177,6 +3187,8 @@ struct device_attribute *megaraid_host_attrs[] = {
 	.use_clustering = ENABLE_CLUSTERING,
 	.change_queue_depth = scsi_change_queue_depth,
 	.no_write_same = 1,
+	.map_queues = megaraid_sas_map_queues,
+	.host_tagset = 1,
 };

 /**
@@ -5965,6 +5977,9 @@ static int megasas_io_attach(struct megasas_instance *instance)
 	host->max_lun = MEGASAS_MAX_LUN;
 	host->max_cmd_len = 16;

+	/* map reply queue to blk_mq hw queue */
+	host->nr_hw_queues = instance->msix_vectors;
+
 	/*
 	 * Notify the mid-layer about the new controller
 	 */
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 073ced0..034d976 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2655,11 +2655,15 @@ static void megasas_stream_detect(struct megasas_instance *instance,
 			fp_possible = (io_info.fpOkForIo > 0) ? true : false;
 	}

+#if 0
 	/* Use raw_smp_processor_id() for now until cmd->request->cpu is CPU
 	   id by default, not CPU group id, otherwise all MSI-X queues won't
 	   be utilized */
 	cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ?
 		raw_smp_processor_id() % instance->msix_vectors : 0;
+#endif
+
+	cmd->request_desc->SCSIIO.MSIxIndex =
+		blk_mq_unique_tag_to_hwq(scp->request->tag);

 	praid_context = &io_request->RaidContext;

@@ -2985,9 +2989,13 @@ static void megasas_build_ld_nonrw_fusion(struct megasas_instance *instance,
 	}

 	cmd->request_desc->SCSIIO.DevHandle = io_request->DevHandle;
+
+#if 0
 	cmd->request_desc->SCSIIO.MSIxIndex =
 		instance->msix_vectors ?
 		(raw_smp_processor_id() % instance->msix_vectors) : 0;
+#endif
+	cmd->request_desc->SCSIIO.MSIxIndex =
+		blk_mq_unique_tag_to_hwq(scmd->request->tag);

 	if (!fp_possible) {
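
A note on the helper used above: in mainline, blk_mq_unique_tag_to_hwq() expects the packed value produced by blk_mq_unique_tag(), which stores the hw queue index in the upper 16 bits and the per-queue tag in the lower 16; passing a raw request->tag would always decode to hwq 0. A quick sketch of the encoding (the real blk_mq_unique_tag() takes a request, not a (hwq, tag) pair):

```python
BLK_MQ_UNIQUE_TAG_BITS = 16
BLK_MQ_UNIQUE_TAG_MASK = (1 << BLK_MQ_UNIQUE_TAG_BITS) - 1

def blk_mq_unique_tag(hwq, tag):
    # mirrors the kernel's packing: hwq index | per-queue tag
    return (hwq << BLK_MQ_UNIQUE_TAG_BITS) | (tag & BLK_MQ_UNIQUE_TAG_MASK)

def blk_mq_unique_tag_to_hwq(unique_tag):
    return unique_tag >> BLK_MQ_UNIQUE_TAG_BITS

def blk_mq_unique_tag_to_tag(unique_tag):
    return unique_tag & BLK_MQ_UNIQUE_TAG_MASK

ut = blk_mq_unique_tag(3, 42)
print(blk_mq_unique_tag_to_hwq(ut), blk_mq_unique_tag_to_tag(ut))  # → 3 42
```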


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06 14:27               ` Kashyap Desai
@ 2018-02-06 15:46                 ` Ming Lei
  2018-02-07  6:50                 ` Hannes Reinecke
  1 sibling, 0 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-06 15:46 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Tue, Feb 06, 2018 at 07:57:35PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Tuesday, February 6, 2018 6:02 PM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > Hi Kashyap,
> >
> > On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > > > -----Original Message-----
> > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > Sent: Tuesday, February 6, 2018 1:35 PM
> > > > To: Kashyap Desai
> > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > Easi; Omar
> > > Sandoval;
> > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > Peter
> > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > introduce force_blk_mq
> > > >
> > > > Hi Kashyap,
> > > >
> > > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > > > We still have more than one reply queue ending up completion
> > > > > > > one
> > > CPU.
> > > > > >
> > > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that
> > > > > > means smp_affinity_enable has to be set as 1, but seems it is
> > > > > > the default
> > > > > setting.
> > > > > >
> > > > > > Please see kernel/irq/affinity.c, especially
> > > > > > irq_calc_affinity_vectors()
> > > > > which
> > > > > > figures out an optimal number of vectors, and the computation is
> > > > > > based
> > > > > on
> > > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > > > mapped to some of reply queues, these queues won't be active(no
> > > > > > request submitted
> > > > > to
> > > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > > > sure
> > > > > that
> > > > > > more than one irq vector won't be handled by one same CPU, and
> > > > > > the irq vector spread is done in irq_create_affinity_masks().
> > > > > >
> > > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver
> > > > > > > via module parameter to simulate the issue. We need more
> > > > > > > number of Online CPU than reply-queue.
> > > > > >
> > > > > > IMO, you don't need to simulate the issue,
> > > > > > pci_alloc_irq_vectors(
> > > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > > > returned
> > > > > irq
> > > > > > vector number, num_possible_cpus()/num_online_cpus() and each
> > > > > > irq vector's affinity assignment.
> > > > > >
> > > > > > > We may see completions redirected to the original CPU because
> > > > > > > of "QUEUE_FLAG_SAME_FORCE", but the ISR of the low-level
> > > > > > > driver can keep one CPU busy in its local ISR routine.
> > > > > >
> > > > > > Could you dump each irq vector's affinity assignment of your
> > > > > > megaraid in your test?
> > > > >
> > > > > To quickly reproduce, I restricted the megaraid_sas driver to a
> > > > > single MSI-x vector. The system has 16 online CPUs in total.
> > > >
> > > > I suggest you don't restrict to a single MSI-x vector, and just use
> > > > the device-supported number of MSI-x vectors.
> > >
> > > Hi Ming, CPU lockup is seen even when it is not a single MSI-x
> > > vector. The actual scenario needs a specific topology and server for
> > > an overnight test. The issue can be seen on servers which have more
> > > than 16 logical CPUs and a Thunderbolt-series MR controller, which
> > > supports at most 16 MSI-x vectors.
> > > >
> > > > >
> > > > > Output of affinity hints.
> > > > > kernel version:
> > > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018
> > > > > x86_64
> > > > > x86_64
> > > > > x86_64 GNU/Linux
> > > > > PCI name is 83:00.0, dump its irq affinity:
> > > > > irq 105, cpu list 0-3,8-11
> > > >
> > > > In this case, which CPU is selected for handling the interrupt is
> > > > decided by the interrupt controller, and it is easy to cause CPU
> > > > overload if the interrupt controller always selects the same CPU to
> > > > handle the irq.
> > > >
> > > > >
> > > > > Affinity mask is created properly, but only CPU-0 is overloaded
> > > > > with interrupt processing.
> > > > >
> > > > > # numactl --hardware
> > > > > available: 2 nodes (0-1)
> > > > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > > > node 0 size: 47861 MB
> > > > > node 0 free: 46516 MB
> > > > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > > > node 1 size: 64491 MB
> > > > > node 1 free: 62805 MB
> > > > > node distances:
> > > > > node   0   1
> > > > >   0:  10  21
> > > > >   1:  21  10
> > > > >
> > > > > Output of system activities (sar). (gnice is 100% and it is
> > > > > consumed in the megaraid_sas ISR routine.)
> > > > >
> > > > >
> > > > > 12:44:40 PM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > > 12:44:41 PM     all      6.03      0.00     29.98      0.00      0.00      0.00      0.00      0.00      0.00     63.99
> > > > > 12:44:41 PM       0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00    100.00      0.00
> > > > >
> > > > > In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I
> > > > > also used the "host_tagset" V2 patch set for megaraid_sas.
> > > > >
> > > > > Using the RFC requested in
> > > > > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2", the
> > > > > lockup is avoided (you can notice that gnice shifts to softirq;
> > > > > even though it is 100% consumed, there is always an exit from the
> > > > > completion loop due to the irqpoll_weight set at irq_poll_init()).
> > > > >
> > > > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > > Average:        all      4.25      0.00     21.61      0.00      0.00      0.00      6.61      0.00      0.00     67.54
> > > > > Average:          0      0.00      0.00      0.00      0.00      0.00      0.00    100.00      0.00      0.00      0.00
> > > > >
> > > > >
> > > > > Hope this clarifies. We need a different fix to avoid lockups.
> > > > > Can we consider using the irq poll interface if the number of
> > > > > CPUs is more than the number of reply queues/MSI-x vectors?
> > > >
> > > > Please use the device's supported number of MSI-x vectors, and see
> > > > if this issue still shows up. If it does, you can use irq poll too,
> > > > which isn't contradictory with the blk-mq approach taken by this
> > > > patchset.
> > >
> > > The device-supported scenario needs more time to reproduce; a quicker
> > > method is to just use a single MSI-x vector and try to create the
> > > worst-case IO completion loop. Using irq poll, my test runs without
> > > any CPU lockup. I tried your latest V2 series as well, and it behaves
> > > the same.
> >
> > Again, you can use irq poll, which isn't contradictory with blk-mq.
> 
> Just wanted to explain that the issue of CPU lockup is different. Thanks
> for the clarification.
> >
> > >
> > > BTW, I am seeing a drastic performance drop using the V2 patch
> > > series on megaraid_sas. Those who are testing HPSA can also verify
> > > whether that is generic behavior.
> >
> > OK, I will see if I can find a megaraid_sas to look into the
> > performance drop issue. If I can't, I will try to run the performance
> > test on HPSA.
> 
> Patch is appended.
> 
> >
> > Could you share us your patch for enabling global_tags/MQ on
> megaraid_sas
> > so that I can reproduce your test?
> >
> > > See below perf top data. "bt_iter" is consuming 4 times more CPU.
> >
> > Could you share us what the IOPS/CPU utilization effect is after
> applying the
> > patch V2? And your test script?
> Regarding CPU utilization, I need to test one more time. Currently the
> system is in use.
> 
> I run below fio test on total 24 SSDs expander attached.
> 
> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> --ioengine=libaio --rw=randread
> 
> Performance dropped from 1.6 M IOPs to 770K IOPs.
> 
> >
> > In theory, it shouldn't, because the HBA only supports HBA-wide tags;
> > that means the allocation has to share an HBA-wide sbitmap no matter if
> > global tags is used or not.
> >
> > Anyway, I will take a look at the performance test and data.
> >
> >
> > Thanks,
> > Ming
> 
> 
> Megaraid_sas version of shared tag set.

Thanks for providing the patch.

> 
> 
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 0f1d88f..75ea86b 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -50,6 +50,7 @@
>  #include <linux/mutex.h>
>  #include <linux/poll.h>
>  #include <linux/vmalloc.h>
> +#include <linux/blk-mq-pci.h>
> 
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
> @@ -220,6 +221,15 @@ static int megasas_get_ld_vf_affiliation(struct
> megasas_instance *instance,
>  static inline void
>  megasas_init_ctrl_params(struct megasas_instance *instance);
> 
> +
> +static int megaraid_sas_map_queues(struct Scsi_Host *shost)
> +{
> +	struct megasas_instance *instance;
> +	instance = (struct megasas_instance *)shost->hostdata;
> +
> +        return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev);
> +}
> +
>  /**
>   * megasas_set_dma_settings -	Populate DMA address, length and flags for
> DCMDs
>   * @instance:			Adapter soft state
> @@ -3177,6 +3187,8 @@ struct device_attribute *megaraid_host_attrs[] = {
>  	.use_clustering = ENABLE_CLUSTERING,
>  	.change_queue_depth = scsi_change_queue_depth,
>  	.no_write_same = 1,
> +	.map_queues = megaraid_sas_map_queues,
> +	.host_tagset = 1,
>  };
> 
>  /**
> @@ -5965,6 +5977,9 @@ static int megasas_io_attach(struct megasas_instance
> *instance)
>  	host->max_lun = MEGASAS_MAX_LUN;
>  	host->max_cmd_len = 16;
> 
> +	/* map reply queue to blk_mq hw queue */
> +	host->nr_hw_queues = instance->msix_vectors;
> +
>  	/*
>  	 * Notify the mid-layer about the new controller
>  	 */
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 073ced0..034d976 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -2655,11 +2655,15 @@ static void megasas_stream_detect(struct
> megasas_instance *instance,
>  			fp_possible = (io_info.fpOkForIo > 0) ? true :
> false;
>  	}
> 
> +#if 0
>  	/* Use raw_smp_processor_id() for now until cmd->request->cpu is
> CPU
>  	   id by default, not CPU group id, otherwise all MSI-X queues
> won't
>  	   be utilized */
>  	cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ?
>  		raw_smp_processor_id() % instance->msix_vectors : 0;
> +#endif
> +
> +	cmd->request_desc->SCSIIO.MSIxIndex =
> blk_mq_unique_tag_to_hwq(scp->request->tag);

It looks like the above line is wrong: this way you always use reply queue
0 to complete the request.

The correct usage follows:

	cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(scp->request));

> 
>  	praid_context = &io_request->RaidContext;
> 
> @@ -2985,9 +2989,13 @@ static void megasas_build_ld_nonrw_fusion(struct
> megasas_instance *instance,
>  	}
> 
>  	cmd->request_desc->SCSIIO.DevHandle = io_request->DevHandle;
> +
> +#if 0
>  	cmd->request_desc->SCSIIO.MSIxIndex =
>  		instance->msix_vectors ?
>  		(raw_smp_processor_id() % instance->msix_vectors) : 0;
> +#endif
> +	cmd->request_desc->SCSIIO.MSIxIndex =
> blk_mq_unique_tag_to_hwq(scmd->request->tag);

Same as above; could you fix the patch and run your performance test
again?

Thanks
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-06 14:27               ` Kashyap Desai
  2018-02-06 15:46                 ` Ming Lei
@ 2018-02-07  6:50                 ` Hannes Reinecke
  2018-02-07 12:23                   ` Ming Lei
  1 sibling, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-07  6:50 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Peter Rivera,
	Paolo Bonzini, Laurence Oberman

Hi all,

[ .. ]
>>
>> Could you share us your patch for enabling global_tags/MQ on
> megaraid_sas
>> so that I can reproduce your test?
>>
>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
>>
>> Could you share us what the IOPS/CPU utilization effect is after
> applying the
>> patch V2? And your test script?
> Regarding CPU utilization, I need to test one more time. Currently system
> is in used.
> 
> I run below fio test on total 24 SSDs expander attached.
> 
> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> --ioengine=libaio --rw=randread
> 
> Performance dropped from 1.6 M IOPs to 770K IOPs.
> 
This is basically what we've seen with earlier iterations.

>>
>> In theory, it shouldn't, because the HBA only supports HBA wide tags,
>> that means the allocation has to share a HBA wide sbitmap no matter
>> if global tags is used or not.
>>
>> Anyway, I will take a look at the performance test and data.
>>
>>
>> Thanks,
>> Ming
> 
> 
> Megaraid_sas version of shared tag set.
> 
Whee; thanks for that.

I've just finished a patchset moving megaraid_sas_fusion to embedded
commands (and cutting down the size of 'struct megasas_cmd_fusion' by
half :-), so that will come in handy.

Will give it a spin.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-07  6:50                 ` Hannes Reinecke
@ 2018-02-07 12:23                   ` Ming Lei
  2018-02-07 14:14                     ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-07 12:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kashyap Desai, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> [ .. ]
> >>
> >> Could you share us your patch for enabling global_tags/MQ on
> > megaraid_sas
> >> so that I can reproduce your test?
> >>
> >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> >>
> >> Could you share us what the IOPS/CPU utilization effect is after
> > applying the
> >> patch V2? And your test script?
> > Regarding CPU utilization, I need to test one more time. Currently system
> > is in used.
> > 
> > I run below fio test on total 24 SSDs expander attached.
> > 
> > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > --ioengine=libaio --rw=randread
> > 
> > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > 
> This is basically what we've seen with earlier iterations.

Hi Hannes,

As I mentioned in another mail[1], Kashyap's patch has a big issue, which
causes only reply queue 0 to be used.

[1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2

So could you guys run your performance test again after fixing the patch?


Thanks,
Ming


* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-07 12:23                   ` Ming Lei
@ 2018-02-07 14:14                     ` Kashyap Desai
  2018-02-08  1:23                       ` Ming Lei
  2018-02-08  7:00                       ` Hannes Reinecke
  0 siblings, 2 replies; 39+ messages in thread
From: Kashyap Desai @ 2018-02-07 14:14 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Peter Rivera,
	Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, February 7, 2018 5:53 PM
> To: Hannes Reinecke
> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > Hi all,
> >
> > [ .. ]
> > >>
> > >> Could you share us your patch for enabling global_tags/MQ on
> > > megaraid_sas
> > >> so that I can reproduce your test?
> > >>
> > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > >>
> > >> Could you share us what the IOPS/CPU utilization effect is after
> > > applying the
> > >> patch V2? And your test script?
> > > Regarding CPU utilization, I need to test one more time. Currently
> > > system is in used.
> > >
> > > I run below fio test on total 24 SSDs expander attached.
> > >
> > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > --ioengine=libaio --rw=randread
> > >
> > > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > >
> > This is basically what we've seen with earlier iterations.
>
> Hi Hannes,
>
> As I mentioned in another mail[1], Kashyap's patch has a big issue,
which
> causes only reply queue 0 used.
>
> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
>
> So could you guys run your performance test again after fixing the
patch?

Ming -

I tried after the change you requested. The performance drop is still
unresolved: from 1.6 M IOPS to 770K IOPS.

See the data below. All 24 reply queues are in use correctly.

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1  NAME
 360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
 364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
 362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
 345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
 341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
 369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
 359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
 358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
 350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
 342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
 344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
 346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
 361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
 348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
 368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
 354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
 351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
 352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
 356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
 337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
 343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
 355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
 335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
 363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas


Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
..
Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
Perf top -

  64.33%  [kernel]            [k] bt_iter
   4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
   4.23%  [kernel]            [k] _find_next_bit
   2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
   1.09%  [kernel]            [k] sbitmap_any_bit_set
   0.71%  [kernel]            [k] sbitmap_queue_clear
   0.63%  [kernel]            [k] find_next_bit
   0.54%  [kernel]            [k] _raw_spin_lock_irqsave

>
>
> Thanks,
> Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-07 14:14                     ` Kashyap Desai
@ 2018-02-08  1:23                       ` Ming Lei
  2018-02-08  7:00                       ` Hannes Reinecke
  1 sibling, 0 replies; 39+ messages in thread
From: Ming Lei @ 2018-02-08  1:23 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Wed, Feb 07, 2018 at 07:44:04PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, February 7, 2018 5:53 PM
> > To: Hannes Reinecke
> > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > Hi all,
> > >
> > > [ .. ]
> > > >>
> > > >> Could you share us your patch for enabling global_tags/MQ on
> > > > megaraid_sas
> > > >> so that I can reproduce your test?
> > > >>
> > > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > >>
> > > >> Could you share us what the IOPS/CPU utilization effect is after
> > > > applying the
> > > >> patch V2? And your test script?
> > > > Regarding CPU utilization, I need to test one more time. Currently
> > > > system is in used.
> > > >
> > > > I run below fio test on total 24 SSDs expander attached.
> > > >
> > > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > --ioengine=libaio --rw=randread
> > > >
> > > > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > >
> > > This is basically what we've seen with earlier iterations.
> >
> > Hi Hannes,
> >
> > As I mentioned in another mail[1], Kashyap's patch has a big issue,
> which
> > causes only reply queue 0 used.
> >
> > [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> >
> > So could you guys run your performance test again after fixing the
> patch?
> 
> Ming -
> 
> I tried after change you requested.  Performance drop is still unresolved.
> From 1.6 M IOPS to 770K IOPS.
> 
> See below data. All 24 reply queue is in used correctly.
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1  NAME
>  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
>  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
>  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
>  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
>  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
>  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
>  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
>  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
>  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
>  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
>  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
>  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
>  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
>  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
>  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
>  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
>  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
>  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
>  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
>  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
>  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
>  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
>  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
>  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> 
> 
> Average:        CPU      %usr     %nice      %sys   %iowait    %steal
> %irq     %soft    %guest    %gnice     %idle
> Average:         18      3.80      0.00     14.78     10.08      0.00
> 0.00      4.01      0.00      0.00     67.33
> Average:         19      3.26      0.00     15.35     10.62      0.00
> 0.00      4.03      0.00      0.00     66.74
> Average:         20      3.42      0.00     14.57     10.67      0.00
> 0.00      3.84      0.00      0.00     67.50
> Average:         21      3.19      0.00     15.60     10.75      0.00
> 0.00      4.16      0.00      0.00     66.30
> Average:         22      3.58      0.00     15.15     10.66      0.00
> 0.00      3.51      0.00      0.00     67.11
> Average:         23      3.34      0.00     15.36     10.63      0.00
> 0.00      4.17      0.00      0.00     66.50
> Average:         24      3.50      0.00     14.58     10.93      0.00
> 0.00      3.85      0.00      0.00     67.13
> Average:         25      3.20      0.00     14.68     10.86      0.00
> 0.00      4.31      0.00      0.00     66.95
> Average:         26      3.27      0.00     14.80     10.70      0.00
> 0.00      3.68      0.00      0.00     67.55
> Average:         27      3.58      0.00     15.36     10.80      0.00
> 0.00      3.79      0.00      0.00     66.48
> Average:         28      3.46      0.00     15.17     10.46      0.00
> 0.00      3.32      0.00      0.00     67.59
> Average:         29      3.34      0.00     14.42     10.72      0.00
> 0.00      3.34      0.00      0.00     68.18
> Average:         30      3.34      0.00     15.08     10.70      0.00
> 0.00      3.89      0.00      0.00     66.99
> Average:         31      3.26      0.00     15.33     10.47      0.00
> 0.00      3.33      0.00      0.00     67.61
> Average:         32      3.21      0.00     14.80     10.61      0.00
> 0.00      3.70      0.00      0.00     67.67
> Average:         33      3.40      0.00     13.88     10.55      0.00
> 0.00      4.02      0.00      0.00     68.15
> Average:         34      3.74      0.00     17.41     10.61      0.00
> 0.00      4.51      0.00      0.00     63.73
> Average:         35      3.35      0.00     14.37     10.74      0.00
> 0.00      3.84      0.00      0.00     67.71
> Average:         36      0.54      0.00      1.77      0.00      0.00
> 0.00      0.00      0.00      0.00     97.69
> ..
> Average:         54      3.60      0.00     15.17     10.39      0.00
> 0.00      4.22      0.00      0.00     66.62
> Average:         55      3.33      0.00     14.85     10.55      0.00
> 0.00      3.96      0.00      0.00     67.31
> Average:         56      3.40      0.00     15.19     10.54      0.00
> 0.00      3.74      0.00      0.00     67.13
> Average:         57      3.41      0.00     13.98     10.78      0.00
> 0.00      4.10      0.00      0.00     67.73
> Average:         58      3.32      0.00     15.16     10.52      0.00
> 0.00      4.01      0.00      0.00     66.99
> Average:         59      3.17      0.00     15.80     10.35      0.00
> 0.00      3.86      0.00      0.00     66.80
> Average:         60      3.00      0.00     14.63     10.59      0.00
> 0.00      3.97      0.00      0.00     67.80
> Average:         61      3.34      0.00     14.70     10.66      0.00
> 0.00      4.32      0.00      0.00     66.97
> Average:         62      3.34      0.00     15.29     10.56      0.00
> 0.00      3.89      0.00      0.00     66.92
> Average:         63      3.29      0.00     14.51     10.72      0.00
> 0.00      3.85      0.00      0.00     67.62
> Average:         64      3.48      0.00     15.31     10.65      0.00
> 0.00      3.97      0.00      0.00     66.60
> Average:         65      3.34      0.00     14.36     10.80      0.00
> 0.00      4.11      0.00      0.00     67.39
> Average:         66      3.13      0.00     14.94     10.70      0.00
> 0.00      4.10      0.00      0.00     67.13
> Average:         67      3.06      0.00     15.56     10.69      0.00
> 0.00      3.82      0.00      0.00     66.88
> Average:         68      3.33      0.00     14.98     10.61      0.00
> 0.00      3.81      0.00      0.00     67.27
> Average:         69      3.20      0.00     15.43     10.70      0.00
> 0.00      3.82      0.00      0.00     66.85
> Average:         70      3.34      0.00     17.14     10.59      0.00
> 0.00      3.00      0.00      0.00     65.92
> Average:         71      3.41      0.00     14.94     10.56      0.00
> 0.00      3.41      0.00      0.00     67.69
> 
> Perf top -
> 
>   64.33%  [kernel]            [k] bt_iter
>    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
>    4.23%  [kernel]            [k] _find_next_bit
>    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
>    1.09%  [kernel]            [k] sbitmap_any_bit_set
>    0.71%  [kernel]            [k] sbitmap_queue_clear
>    0.63%  [kernel]            [k] find_next_bit
>    0.54%  [kernel]            [k] _raw_spin_lock_irqsave

The above trace says nothing about the performance drop by itself; it just
means some disk stat utilities are reading /proc/diskstats or
/sys/block/sda/stat at a crazy rate (see below), and the performance drop
might be related to this heavy reading too.

bt_iter
    <-bt_for_each
        <-blk_mq_queue_tag_busy_iter
            <-blk_mq_in_flight
                <-part_in_flight
                    <-part_stat_show
                    <-diskstats_show
                    <-part_round_stats
            <-blk_mq_timeout_work

If you are using fio to run the test, could you show us the fio logs (with
and without the patchset), and not start any disk stat utilities in the
meantime?

Also, it seems "none" is the default scheduler after this patchset is
applied; could you run the same test with mq-deadline?

Thanks,
Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-07 14:14                     ` Kashyap Desai
  2018-02-08  1:23                       ` Ming Lei
@ 2018-02-08  7:00                       ` Hannes Reinecke
  2018-02-08 16:53                         ` Ming Lei
  1 sibling, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2018-02-08  7:00 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Peter Rivera,
	Paolo Bonzini, Laurence Oberman

On 02/07/2018 03:14 PM, Kashyap Desai wrote:
>> -----Original Message-----
>> From: Ming Lei [mailto:ming.lei@redhat.com]
>> Sent: Wednesday, February 7, 2018 5:53 PM
>> To: Hannes Reinecke
>> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
>> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
>> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
>> Rivera; Paolo Bonzini; Laurence Oberman
>> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
>> force_blk_mq
>>
>> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> [ .. ]
>>>>>
>>>>> Could you share us your patch for enabling global_tags/MQ on
>>>> megaraid_sas
>>>>> so that I can reproduce your test?
>>>>>
>>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
>>>>>
>>>>> Could you share us what the IOPS/CPU utilization effect is after
>>>> applying the
>>>>> patch V2? And your test script?
>>>> Regarding CPU utilization, I need to test one more time. Currently
>>>> system is in used.
>>>>
>>>> I run below fio test on total 24 SSDs expander attached.
>>>>
>>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
>>>> --ioengine=libaio --rw=randread
>>>>
>>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
>>>>
>>> This is basically what we've seen with earlier iterations.
>>
>> Hi Hannes,
>>
>> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> which
>> causes only reply queue 0 used.
>>
>> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
>>
>> So could you guys run your performance test again after fixing the
> patch?
> 
> Ming -
> 
> I tried after change you requested.  Performance drop is still unresolved.
> From 1.6 M IOPS to 770K IOPS.
> 
> See below data. All 24 reply queue is in used correctly.
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1  NAME
>  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
>  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
>  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
>  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
>  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
>  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
>  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
>  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
>  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
>  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
>  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
>  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
>  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
>  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
>  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
>  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
>  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
>  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
>  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
>  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
>  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
>  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
>  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
>  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> 
> 
> Average:        CPU      %usr     %nice      %sys   %iowait    %steal     %irq     %soft    %guest    %gnice     %idle
> Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
> Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
> Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
> Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
> Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> ..
> Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> 
> Perf top -
> 
>   64.33%  [kernel]            [k] bt_iter
>    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
>    4.23%  [kernel]            [k] _find_next_bit
>    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
>    1.09%  [kernel]            [k] sbitmap_any_bit_set
>    0.71%  [kernel]            [k] sbitmap_queue_clear
>    0.63%  [kernel]            [k] find_next_bit
>    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> 
Ah. So we're spending quite some time trying to find a free tag.
I guess this is due to every queue starting at the same position when
searching for a free tag, which inevitably leads to contention.

Can't we lay out the pointers so that each queue starts looking for free
bits at a _different_ location?

I.e. if we evenly spread the initial position for each queue and use a
round-robin algorithm we should get better results, methinks.

I'll give it a go once the hiccups with converting megaraid_sas to
embedded commands are done with :-(

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-08  7:00                       ` Hannes Reinecke
@ 2018-02-08 16:53                         ` Ming Lei
  2018-02-09  4:58                           ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-08 16:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kashyap Desai, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> >> -----Original Message-----
> >> From: Ming Lei [mailto:ming.lei@redhat.com]
> >> Sent: Wednesday, February 7, 2018 5:53 PM
> >> To: Hannes Reinecke
> >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> >> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > Sandoval;
> >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> >> Rivera; Paolo Bonzini; Laurence Oberman
> >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> >> force_blk_mq
> >>
> >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> >>> Hi all,
> >>>
> >>> [ .. ]
> >>>>>
> >>>>> Could you share us your patch for enabling global_tags/MQ on
> >>>> megaraid_sas
> >>>>> so that I can reproduce your test?
> >>>>>
> >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> >>>>>
> >>>>> Could you share us what the IOPS/CPU utilization effect is after
> >>>> applying the
> >>>>> patch V2? And your test script?
> >>>> Regarding CPU utilization, I need to test one more time. Currently
> >>>> system is in used.
> >>>>
> >>>> I run below fio test on total 24 SSDs expander attached.
> >>>>
> >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> >>>> --ioengine=libaio --rw=randread
> >>>>
> >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> >>>>
> >>> This is basically what we've seen with earlier iterations.
> >>
> >> Hi Hannes,
> >>
> >> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> > which
> >> causes only reply queue 0 used.
> >>
> >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> >>
> >> So could you guys run your performance test again after fixing the
> > patch?
> > 
> > Ming -
> > 
> > I tried after change you requested.  Performance drop is still unresolved.
> > From 1.6 M IOPS to 770K IOPS.
> > 
> > See below data. All 24 reply queue is in used correctly.
> > 
> > IRQs / 1 second(s)
> > IRQ#  TOTAL  NODE0   NODE1  NAME
> >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > 
> > 
> > Average:        CPU      %usr     %nice      %sys   %iowait    %steal
> > %irq     %soft    %guest    %gnice     %idle
> > Average:         18      3.80      0.00     14.78     10.08      0.00
> > 0.00      4.01      0.00      0.00     67.33
> > Average:         19      3.26      0.00     15.35     10.62      0.00
> > 0.00      4.03      0.00      0.00     66.74
> > Average:         20      3.42      0.00     14.57     10.67      0.00
> > 0.00      3.84      0.00      0.00     67.50
> > Average:         21      3.19      0.00     15.60     10.75      0.00
> > 0.00      4.16      0.00      0.00     66.30
> > Average:         22      3.58      0.00     15.15     10.66      0.00
> > 0.00      3.51      0.00      0.00     67.11
> > Average:         23      3.34      0.00     15.36     10.63      0.00
> > 0.00      4.17      0.00      0.00     66.50
> > Average:         24      3.50      0.00     14.58     10.93      0.00
> > 0.00      3.85      0.00      0.00     67.13
> > Average:         25      3.20      0.00     14.68     10.86      0.00
> > 0.00      4.31      0.00      0.00     66.95
> > Average:         26      3.27      0.00     14.80     10.70      0.00
> > 0.00      3.68      0.00      0.00     67.55
> > Average:         27      3.58      0.00     15.36     10.80      0.00
> > 0.00      3.79      0.00      0.00     66.48
> > Average:         28      3.46      0.00     15.17     10.46      0.00
> > 0.00      3.32      0.00      0.00     67.59
> > Average:         29      3.34      0.00     14.42     10.72      0.00
> > 0.00      3.34      0.00      0.00     68.18
> > Average:         30      3.34      0.00     15.08     10.70      0.00
> > 0.00      3.89      0.00      0.00     66.99
> > Average:         31      3.26      0.00     15.33     10.47      0.00
> > 0.00      3.33      0.00      0.00     67.61
> > Average:         32      3.21      0.00     14.80     10.61      0.00
> > 0.00      3.70      0.00      0.00     67.67
> > Average:         33      3.40      0.00     13.88     10.55      0.00
> > 0.00      4.02      0.00      0.00     68.15
> > Average:         34      3.74      0.00     17.41     10.61      0.00
> > 0.00      4.51      0.00      0.00     63.73
> > Average:         35      3.35      0.00     14.37     10.74      0.00
> > 0.00      3.84      0.00      0.00     67.71
> > Average:         36      0.54      0.00      1.77      0.00      0.00
> > 0.00      0.00      0.00      0.00     97.69
> > ..
> > Average:         54      3.60      0.00     15.17     10.39      0.00
> > 0.00      4.22      0.00      0.00     66.62
> > Average:         55      3.33      0.00     14.85     10.55      0.00
> > 0.00      3.96      0.00      0.00     67.31
> > Average:         56      3.40      0.00     15.19     10.54      0.00
> > 0.00      3.74      0.00      0.00     67.13
> > Average:         57      3.41      0.00     13.98     10.78      0.00
> > 0.00      4.10      0.00      0.00     67.73
> > Average:         58      3.32      0.00     15.16     10.52      0.00
> > 0.00      4.01      0.00      0.00     66.99
> > Average:         59      3.17      0.00     15.80     10.35      0.00
> > 0.00      3.86      0.00      0.00     66.80
> > Average:         60      3.00      0.00     14.63     10.59      0.00
> > 0.00      3.97      0.00      0.00     67.80
> > Average:         61      3.34      0.00     14.70     10.66      0.00
> > 0.00      4.32      0.00      0.00     66.97
> > Average:         62      3.34      0.00     15.29     10.56      0.00
> > 0.00      3.89      0.00      0.00     66.92
> > Average:         63      3.29      0.00     14.51     10.72      0.00
> > 0.00      3.85      0.00      0.00     67.62
> > Average:         64      3.48      0.00     15.31     10.65      0.00
> > 0.00      3.97      0.00      0.00     66.60
> > Average:         65      3.34      0.00     14.36     10.80      0.00
> > 0.00      4.11      0.00      0.00     67.39
> > Average:         66      3.13      0.00     14.94     10.70      0.00
> > 0.00      4.10      0.00      0.00     67.13
> > Average:         67      3.06      0.00     15.56     10.69      0.00
> > 0.00      3.82      0.00      0.00     66.88
> > Average:         68      3.33      0.00     14.98     10.61      0.00
> > 0.00      3.81      0.00      0.00     67.27
> > Average:         69      3.20      0.00     15.43     10.70      0.00
> > 0.00      3.82      0.00      0.00     66.85
> > Average:         70      3.34      0.00     17.14     10.59      0.00
> > 0.00      3.00      0.00      0.00     65.92
> > Average:         71      3.41      0.00     14.94     10.56      0.00
> > 0.00      3.41      0.00      0.00     67.69
> > 
> > Perf top -
> > 
> >   64.33%  [kernel]            [k] bt_iter
> >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> >    4.23%  [kernel]            [k] _find_next_bit
> >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> >    0.71%  [kernel]            [k] sbitmap_queue_clear
> >    0.63%  [kernel]            [k] find_next_bit
> >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > 
> Ah. So we're spending quite some time in trying to find a free tag.
> I guess this is due to every queue starting at the same position trying
> to find a free tag, which inevitably leads to a contention.

IMO, the above trace means that blk_mq_in_flight() may be the
bottleneck; it looks unrelated to tag allocation.

Kashyap, could you run your performance test again after disabling
iostats on all test devices via the command below, and after killing
all utilities which may read iostats (/proc/diskstats, ...)?

	echo 0 > /sys/block/sdN/queue/iostats

Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer
@ 2018-02-08 17:34     ` Bart Van Assche
  0 siblings, 0 replies; 39+ messages in thread
From: Bart Van Assche @ 2018-02-08 17:34 UTC (permalink / raw)
  To: hch, linux-block, snitzer, ming.lei, axboe
  Cc: hch, martin.petersen, hare, linux-scsi, don.brace,
	james.bottomley, pbonzini, arun.easi, osandov, loberman,
	kashyap.desai, peter.rivera

On Sat, 2018-02-03 at 12:21 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 61deab0b5a5a..a68323fa0c02 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -11,10 +11,14 @@ struct blk_mq_tags {
>  	unsigned int nr_tags;
>  	unsigned int nr_reserved_tags;
>  
> -	atomic_t active_queues;
> +	atomic_t *active_queues;
> +	atomic_t __active_queues;
>  
> -	struct sbitmap_queue bitmap_tags;
> -	struct sbitmap_queue breserved_tags;
> +	struct sbitmap_queue *bitmap_tags;
> +	struct sbitmap_queue *breserved_tags;
> +
> +	struct sbitmap_queue __bitmap_tags;
> +	struct sbitmap_queue __breserved_tags;
>  
>  	struct request **rqs;
>  	struct request **static_rqs;

This is getting ugly: multiple pointers that either all point at an element
inside struct blk_mq_tags or all point at a member outside blk_mq_tags. Have
you considered to introduce a new structure for these members (active_queues,
bitmap_tags and breserved_tags) such that only a single new pointer has to
be introduced in struct blk_mq_tags? Additionally, why does every
blk_mq_tags instance have its own static_rqs array if tags are shared
across hardware queues? blk_mq_get_tag() allocates a tag either from
bitmap_tags or from breserved_tags. So if these tag sets are shared I
don't think that it is necessary to have one static_rqs array per struct
blk_mq_tags instance.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-08 16:53                         ` Ming Lei
@ 2018-02-09  4:58                           ` Kashyap Desai
  2018-02-09  5:31                             ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Kashyap Desai @ 2018-02-09  4:58 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	linux-scsi, Arun Easi, Omar Sandoval, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Don Brace, Peter Rivera,
	Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Thursday, February 8, 2018 10:23 PM
> To: Hannes Reinecke
> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > >> -----Original Message-----
> > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > >> To: Hannes Reinecke
> > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > >> Easi; Omar
> > > Sandoval;
> > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > Peter
> > >> Rivera; Paolo Bonzini; Laurence Oberman
> > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > >> introduce force_blk_mq
> > >>
> > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > >>> Hi all,
> > >>>
> > >>> [ .. ]
> > >>>>>
> > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > >>>> megaraid_sas
> > >>>>> so that I can reproduce your test?
> > >>>>>
> > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more
CPU.
> > >>>>>
> > >>>>> Could you share us what the IOPS/CPU utilization effect is after
> > >>>> applying the
> > >>>>> patch V2? And your test script?
> > >>>> Regarding CPU utilization, I need to test one more time.
> > >>>> Currently system is in used.
> > >>>>
> > >>>> I run below fio test on total 24 SSDs expander attached.
> > >>>>
> > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > >>>> --ioengine=libaio --rw=randread
> > >>>>
> > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > >>>>
> > >>> This is basically what we've seen with earlier iterations.
> > >>
> > >> Hi Hannes,
> > >>
> > >> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> > > which
> > >> causes only reply queue 0 used.
> > >>
> > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > >>
> > >> So could you guys run your performance test again after fixing the
> > > patch?
> > >
> > > Ming -
> > >
> > > I tried after change you requested.  Performance drop is still
unresolved.
> > > From 1.6 M IOPS to 770K IOPS.
> > >
> > > See below data. All 24 reply queue is in used correctly.
> > >
> > > IRQs / 1 second(s)
> > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > >
> > >
> > > Average:        CPU      %usr     %nice      %sys   %iowait
%steal
> > > %irq     %soft    %guest    %gnice     %idle
> > > Average:         18      3.80      0.00     14.78     10.08
0.00
> > > 0.00      4.01      0.00      0.00     67.33
> > > Average:         19      3.26      0.00     15.35     10.62
0.00
> > > 0.00      4.03      0.00      0.00     66.74
> > > Average:         20      3.42      0.00     14.57     10.67
0.00
> > > 0.00      3.84      0.00      0.00     67.50
> > > Average:         21      3.19      0.00     15.60     10.75
0.00
> > > 0.00      4.16      0.00      0.00     66.30
> > > Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> > > Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> > > Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> > > Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> > > Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> > > Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> > > Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> > > Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> > > Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> > > Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> > > Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> > > Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> > > Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> > > Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > ..
> > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > >
> > > Perf top -
> > >
> > >   64.33%  [kernel]            [k] bt_iter
> > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > >    4.23%  [kernel]            [k] _find_next_bit
> > >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > >    0.63%  [kernel]            [k] find_next_bit
> > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > >
> > Ah. So we're spending quite some time in trying to find a free tag.
> > I guess this is due to every queue starting at the same position
> > trying to find a free tag, which inevitably leads to a contention.
>
> IMO, the above trace means that blk_mq_in_flight() may be the bottleneck,
> and looks not related with tag allocation.
>
> Kashyap, could you run your performance test again after disabling iostats by
> the following command on all test devices and killing all utilities which may
> read iostats (/proc/diskstats, ...)?
>
> 	echo 0 > /sys/block/sdN/queue/iostats
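The per-device toggle above can be applied to all test devices in one go. The loop below is a sketch, assuming the mainline sysfs attribute name `iostats`; devices that do not expose a writable attribute are skipped:

```shell
# Sketch: disable per-device I/O accounting on every sd* disk.
# Assumes the sysfs attribute is named "iostats" (mainline naming);
# devices without a writable attribute are silently skipped.
count=0
for q in /sys/block/sd*/queue/iostats; do
    [ -w "$q" ] || continue
    echo 0 > "$q"
    count=$((count + 1))
done
echo "iostats disabled on $count queue(s)"
```

Writing 1 back to the same files restores /proc/diskstats accounting after the test.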

Ming - After changing iostats = 0, I see the performance issue is resolved.

Below is the perf top output after iostats = 0


  23.45%  [kernel]             [k] bt_iter
   2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
   2.18%  [kernel]             [k] _find_next_bit
   2.06%  [megaraid_sas]       [k] complete_cmd_fusion
   1.87%  [kernel]             [k] clflush_cache_range
   1.70%  [kernel]             [k] dma_pte_clear_level
   1.56%  [kernel]             [k] __domain_mapping
   1.55%  [kernel]             [k] sbitmap_queue_clear
   1.30%  [kernel]             [k] gup_pgd_range


>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-09  4:58                           ` Kashyap Desai
@ 2018-02-09  5:31                             ` Ming Lei
  2018-02-09  8:42                               ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-09  5:31 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Thursday, February 8, 2018 10:23 PM
> > To: Hannes Reinecke
> > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > >> -----Original Message-----
> > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > >> To: Hannes Reinecke
> > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > >> Easi; Omar
> > > > Sandoval;
> > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > > Peter
> > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > >> introduce force_blk_mq
> > > >>
> > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > >>> Hi all,
> > > >>>
> > > >>> [ .. ]
> > > >>>>>
> > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > >>>> megaraid_sas
> > > >>>>> so that I can reproduce your test?
> > > >>>>>
> > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > >>>>>
> > > >>>>> Could you share us what the IOPS/CPU utilization effect is after applying the patch V2? And your test script?
> > > >>>> Regarding CPU utilization, I need to test one more time.
> > > >>>> Currently system is in used.
> > > >>>>
> > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > >>>>
> > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > >>>> --ioengine=libaio --rw=randread
> > > >>>>
> > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > >>>>
> > > >>> This is basically what we've seen with earlier iterations.
> > > >>
> > > >> Hi Hannes,
> > > >>
> > > >> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> > > > which
> > > >> causes only reply queue 0 used.
> > > >>
> > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > >>
> > > >> So could you guys run your performance test again after fixing the
> > > > patch?
> > > >
> > > > Ming -
> > > >
> > > > I tried after change you requested.  Performance drop is still unresolved.
> > > > From 1.6 M IOPS to 770K IOPS.
> > > >
> > > > See below data. All 24 reply queue is in used correctly.
> > > >
> > > > IRQs / 1 second(s)
> > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > > >
> > > >
> > > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
> > > > Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
> > > > Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
> > > > Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
> > > > Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> > > > Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> > > > Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> > > > Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> > > > Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> > > > Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> > > > Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> > > > Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> > > > Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> > > > Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> > > > Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> > > > Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> > > > Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> > > > Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> > > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > > ..
> > > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > > >
> > > > Perf top -
> > > >
> > > >   64.33%  [kernel]            [k] bt_iter
> > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > >    4.23%  [kernel]            [k] _find_next_bit
> > > >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > >    0.63%  [kernel]            [k] find_next_bit
> > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > >
> > > Ah. So we're spending quite some time in trying to find a free tag.
> > > I guess this is due to every queue starting at the same position
> > > trying to find a free tag, which inevitably leads to a contention.
> >
> > IMO, the above trace means that blk_mq_in_flight() may be the bottleneck,
> > and looks not related with tag allocation.
> >
> > Kashyap, could you run your performance test again after disabling iostats by
> > the following command on all test devices and killing all utilities which may
> > read iostats (/proc/diskstats, ...)?
> >
> > 	echo 0 > /sys/block/sdN/queue/iostats
> 
> Ming - After changing iostats = 0, I see the performance issue is resolved.
> 
> Below is perf top output after iostats = 0
> 
> 
>   23.45%  [kernel]             [k] bt_iter
>    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
>    2.18%  [kernel]             [k] _find_next_bit
>    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
>    1.87%  [kernel]             [k] clflush_cache_range
>    1.70%  [kernel]             [k] dma_pte_clear_level
>    1.56%  [kernel]             [k] __domain_mapping
>    1.55%  [kernel]             [k] sbitmap_queue_clear
>    1.30%  [kernel]             [k] gup_pgd_range

Hi Kashyap,

Thanks for your test and update.

It looks like blk_mq_queue_tag_busy_iter() is still being sampled by perf even
though iostats is disabled, so I guess there may be utilities which are still
reading iostats (/proc/diskstats, ...) fairly frequently.

Either an issue was introduced in part_round_stats() recently, since I
remember this counter should be read at most once per jiffy in the IO path,
or the implementation of blk_mq_in_flight() simply becomes heavy in your
environment. Jens may have an idea about this issue.

And I guess the lockup issue may be avoided by this approach now?


Thanks,
Ming


* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-09  5:31                             ` Ming Lei
@ 2018-02-09  8:42                               ` Kashyap Desai
  2018-02-10  1:01                                 ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Kashyap Desai @ 2018-02-09  8:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Friday, February 9, 2018 11:01 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Thursday, February 8, 2018 10:23 PM
> > > To: Hannes Reinecke
> > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > >> -----Original Message-----
> > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > >> To: Hannes Reinecke
> > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > >> Arun Easi; Omar
> > > > > Sandoval;
> > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > >> Brace;
> > > > > Peter
> > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > >> introduce force_blk_mq
> > > > >>
> > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > > >>> Hi all,
> > > > >>>
> > > > >>> [ .. ]
> > > > >>>>>
> > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > >>>> megaraid_sas
> > > > >>>>> so that I can reproduce your test?
> > > > >>>>>
> > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > > >>>>>
> > > > >>>>> Could you share us what the IOPS/CPU utilization effect is after applying the patch V2? And your test script?
> > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > >>>> Currently system is in used.
> > > > >>>>
> > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > >>>>
> > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > >>>> --ioengine=libaio --rw=randread
> > > > >>>>
> > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > >>>>
> > > > >>> This is basically what we've seen with earlier iterations.
> > > > >>
> > > > >> Hi Hannes,
> > > > >>
> > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > >> issue,
> > > > > which
> > > > >> causes only reply queue 0 used.
> > > > >>
> > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > >>
> > > > >> So could you guys run your performance test again after fixing
> > > > >> the
> > > > > patch?
> > > > >
> > > > > Ming -
> > > > >
> > > > > I tried after change you requested.  Performance drop is still unresolved.
> > > > > From 1.6 M IOPS to 770K IOPS.
> > > > >
> > > > > See below data. All 24 reply queue is in used correctly.
> > > > >
> > > > > IRQs / 1 second(s)
> > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > > > >
> > > > >
> > > > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > > Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
> > > > > Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
> > > > > Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
> > > > > Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
> > > > > Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> > > > > Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> > > > > Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> > > > > Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> > > > > Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> > > > > Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> > > > > Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> > > > > Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> > > > > Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> > > > > Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> > > > > Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> > > > > Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> > > > > Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> > > > > Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> > > > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > > > ..
> > > > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > > > >
> > > > > Perf top -
> > > > >
> > > > >   64.33%  [kernel]            [k] bt_iter
> > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > >    2.40%  [kernel]            [k]
native_queued_spin_lock_slowpath
> > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > >
> > > > Ah. So we're spending quite some time in trying to find a free tag.
> > > > I guess this is due to every queue starting at the same position
> > > > trying to find a free tag, which inevitably leads to a contention.
> > >
> > > IMO, the above trace means that blk_mq_in_flight() may be the bottleneck,
> > > and looks not related with tag allocation.
> > >
> > > Kashyap, could you run your performance test again after disabling iostats by
> > > the following command on all test devices and killing all utilities which may
> > > read iostats (/proc/diskstats, ...)?
> > >
> > > 	echo 0 > /sys/block/sdN/queue/iostats
> >
> > Ming - After changing iostats = 0, I see the performance issue is resolved.
> >
> > Below is perf top output after iostats = 0
> >
> >
> >   23.45%  [kernel]             [k] bt_iter
> >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> >    2.18%  [kernel]             [k] _find_next_bit
> >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> >    1.87%  [kernel]             [k] clflush_cache_range
> >    1.70%  [kernel]             [k] dma_pte_clear_level
> >    1.56%  [kernel]             [k] __domain_mapping
> >    1.55%  [kernel]             [k] sbitmap_queue_clear
> >    1.30%  [kernel]             [k] gup_pgd_range
>
> Hi Kashyap,
>
> Thanks for your test and update.
>
> Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though
> iostats is disabled, and I guess there may be utilities which are reading
> iostats a bit frequently.

I will do some more testing and post my findings.

>
> Either there is issue introduced in part_round_stats() recently since I
> remember that this counter should have been read at most one time during
> one jiffies in IO path, or the implementation of blk_mq_in_flight() can
> become a bit heavy in your environment. Jens may have idea about this issue.
>
> And I guess the lockup issue may be avoided by this approach now?

No. For the CPU lockup we need the irq_poll interface so that the driver can
quit its ISR loop.

>
>
> Thanks,
> Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-09  8:42                               ` Kashyap Desai
@ 2018-02-10  1:01                                 ` Ming Lei
  2018-02-11  5:31                                   ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-10  1:01 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Friday, February 9, 2018 11:01 AM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > -----Original Message-----
> > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > To: Hannes Reinecke
> > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > Easi; Omar
> > > Sandoval;
> > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > Peter
> > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > introduce force_blk_mq
> > > >
> > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > >> -----Original Message-----
> > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > >> To: Hannes Reinecke
> > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > >> Arun Easi; Omar
> > > > > > Sandoval;
> > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > >> Brace;
> > > > > > Peter
> > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > >> introduce force_blk_mq
> > > > > >>
> > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke
> wrote:
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> [ .. ]
> > > > > >>>>>
> > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > > >>>> megaraid_sas
> > > > > >>>>> so that I can reproduce your test?
> > > > > >>>>>
> > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times
> > > > > >>>>>> more
> > > CPU.
> > > > > >>>>>
> > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is
> > > > > >>>>> after
> > > > > >>>> applying the
> > > > > >>>>> patch V2? And your test script?
> > > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > > >>>> Currently system is in used.
> > > > > >>>>
> > > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > > >>>>
> > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > > >>>> --ioengine=libaio --rw=randread
> > > > > >>>>
> > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > >>>>
> > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > >>
> > > > > >> Hi Hannes,
> > > > > >>
> > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > > >> issue,
> > > > > > which
> > > > > >> causes only reply queue 0 used.
> > > > > >>
> > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > >>
> > > > > >> So could you guys run your performance test again after fixing
> > > > > >> the
> > > > > > patch?
> > > > > >
> > > > > > Ming -
> > > > > >
> > > > > > I tried after change you requested.  Performance drop is still
> > > unresolved.
> > > > > > From 1.6 M IOPS to 770K IOPS.
> > > > > >
> > > > > > See below data. All 24 reply queue is in used correctly.
> > > > > >
> > > > > > IRQs / 1 second(s)
> > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > > > > >
> > > > > >
> > > > > > Average:        CPU      %usr     %nice      %sys   %iowait
> > > %steal
> > > > > > %irq     %soft    %guest    %gnice     %idle
> > > > > > Average:         18      3.80      0.00     14.78     10.08
> > > 0.00
> > > > > > 0.00      4.01      0.00      0.00     67.33
> > > > > > Average:         19      3.26      0.00     15.35     10.62
> > > 0.00
> > > > > > 0.00      4.03      0.00      0.00     66.74
> > > > > > Average:         20      3.42      0.00     14.57     10.67
> > > 0.00
> > > > > > 0.00      3.84      0.00      0.00     67.50
> > > > > > Average:         21      3.19      0.00     15.60     10.75
> > > 0.00
> > > > > > 0.00      4.16      0.00      0.00     66.30
> > > > > > Average:         22      3.58      0.00     15.15     10.66
> > > 0.00
> > > > > > 0.00      3.51      0.00      0.00     67.11
> > > > > > Average:         23      3.34      0.00     15.36     10.63
> > > 0.00
> > > > > > 0.00      4.17      0.00      0.00     66.50
> > > > > > Average:         24      3.50      0.00     14.58     10.93
> > > 0.00
> > > > > > 0.00      3.85      0.00      0.00     67.13
> > > > > > Average:         25      3.20      0.00     14.68     10.86
> > > 0.00
> > > > > > 0.00      4.31      0.00      0.00     66.95
> > > > > > Average:         26      3.27      0.00     14.80     10.70
> > > 0.00
> > > > > > 0.00      3.68      0.00      0.00     67.55
> > > > > > Average:         27      3.58      0.00     15.36     10.80
> > > 0.00
> > > > > > 0.00      3.79      0.00      0.00     66.48
> > > > > > Average:         28      3.46      0.00     15.17     10.46
> > > 0.00
> > > > > > 0.00      3.32      0.00      0.00     67.59
> > > > > > Average:         29      3.34      0.00     14.42     10.72
> > > 0.00
> > > > > > 0.00      3.34      0.00      0.00     68.18
> > > > > > Average:         30      3.34      0.00     15.08     10.70
> > > 0.00
> > > > > > 0.00      3.89      0.00      0.00     66.99
> > > > > > Average:         31      3.26      0.00     15.33     10.47
> > > 0.00
> > > > > > 0.00      3.33      0.00      0.00     67.61
> > > > > > Average:         32      3.21      0.00     14.80     10.61
> > > 0.00
> > > > > > 0.00      3.70      0.00      0.00     67.67
> > > > > > Average:         33      3.40      0.00     13.88     10.55
> > > 0.00
> > > > > > 0.00      4.02      0.00      0.00     68.15
> > > > > > Average:         34      3.74      0.00     17.41     10.61
> > > 0.00
> > > > > > 0.00      4.51      0.00      0.00     63.73
> > > > > > Average:         35      3.35      0.00     14.37     10.74
> > > 0.00
> > > > > > 0.00      3.84      0.00      0.00     67.71
> > > > > > Average:         36      0.54      0.00      1.77      0.00
> > > 0.00
> > > > > > 0.00      0.00      0.00      0.00     97.69
> > > > > > ..
> > > > > > Average:         54      3.60      0.00     15.17     10.39
> > > 0.00
> > > > > > 0.00      4.22      0.00      0.00     66.62
> > > > > > Average:         55      3.33      0.00     14.85     10.55
> > > 0.00
> > > > > > 0.00      3.96      0.00      0.00     67.31
> > > > > > Average:         56      3.40      0.00     15.19     10.54
> > > 0.00
> > > > > > 0.00      3.74      0.00      0.00     67.13
> > > > > > Average:         57      3.41      0.00     13.98     10.78
> > > 0.00
> > > > > > 0.00      4.10      0.00      0.00     67.73
> > > > > > Average:         58      3.32      0.00     15.16     10.52
> > > 0.00
> > > > > > 0.00      4.01      0.00      0.00     66.99
> > > > > > Average:         59      3.17      0.00     15.80     10.35
> > > 0.00
> > > > > > 0.00      3.86      0.00      0.00     66.80
> > > > > > Average:         60      3.00      0.00     14.63     10.59
> > > 0.00
> > > > > > 0.00      3.97      0.00      0.00     67.80
> > > > > > Average:         61      3.34      0.00     14.70     10.66
> > > 0.00
> > > > > > 0.00      4.32      0.00      0.00     66.97
> > > > > > Average:         62      3.34      0.00     15.29     10.56
> > > 0.00
> > > > > > 0.00      3.89      0.00      0.00     66.92
> > > > > > Average:         63      3.29      0.00     14.51     10.72
> > > 0.00
> > > > > > 0.00      3.85      0.00      0.00     67.62
> > > > > > Average:         64      3.48      0.00     15.31     10.65
> > > 0.00
> > > > > > 0.00      3.97      0.00      0.00     66.60
> > > > > > Average:         65      3.34      0.00     14.36     10.80
> > > 0.00
> > > > > > 0.00      4.11      0.00      0.00     67.39
> > > > > > Average:         66      3.13      0.00     14.94     10.70
> > > 0.00
> > > > > > 0.00      4.10      0.00      0.00     67.13
> > > > > > Average:         67      3.06      0.00     15.56     10.69
> > > 0.00
> > > > > > 0.00      3.82      0.00      0.00     66.88
> > > > > > Average:         68      3.33      0.00     14.98     10.61
> > > 0.00
> > > > > > 0.00      3.81      0.00      0.00     67.27
> > > > > > Average:         69      3.20      0.00     15.43     10.70
> > > 0.00
> > > > > > 0.00      3.82      0.00      0.00     66.85
> > > > > > Average:         70      3.34      0.00     17.14     10.59
> > > 0.00
> > > > > > 0.00      3.00      0.00      0.00     65.92
> > > > > > Average:         71      3.41      0.00     14.94     10.56
> > > 0.00
> > > > > > 0.00      3.41      0.00      0.00     67.69
> > > > > >
> > > > > > Perf top -
> > > > > >
> > > > > >   64.33%  [kernel]            [k] bt_iter
> > > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > > >    2.40%  [kernel]            [k]
> native_queued_spin_lock_slowpath
> > > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > > >
> > > > > Ah. So we're spending quite some time in trying to find a free
> tag.
> > > > > I guess this is due to every queue starting at the same position
> > > > > trying to find a free tag, which inevitably leads to a contention.
> > > >
> > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > bottleneck,
> > > > and looks not related with tag allocation.
> > > >
> > > > Kashyap, could you run your performance test again after disabling
> > > iostat by
> > > > the following command on all test devices and killing all utilities
> > > which may
> > > > read iostat(/proc/diskstats, ...)?
> > > >
> > > > 	echo 0 > /sys/block/sdN/queue/iostats
> > >
> > > Ming - After changing iostat = 0 , I see performance issue is
> resolved.
> > >
> > > Below is perf top output after iostats = 0
> > >
> > >
> > >   23.45%  [kernel]             [k] bt_iter
> > >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> > >    2.18%  [kernel]             [k] _find_next_bit
> > >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> > >    1.87%  [kernel]             [k] clflush_cache_range
> > >    1.70%  [kernel]             [k] dma_pte_clear_level
> > >    1.56%  [kernel]             [k] __domain_mapping
> > >    1.55%  [kernel]             [k] sbitmap_queue_clear
> > >    1.30%  [kernel]             [k] gup_pgd_range
> >
> > Hi Kashyap,
> >
> > Thanks for your test and update.
> >
> > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though
> > iostats is disabled, and I guess there may be utilities which are
> reading iostats
> > a bit frequently.
> 
> I  will be doing some more testing and post you my findings.

I will find some time this weekend to see if I can cook up a patch to
address this IO accounting issue.
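In the meantime, the per-device iostats toggle quoted above can be applied to every block device with a small helper. This is only a sketch, not part of any patchset; the optional second argument (an alternate sysfs root) is an addition of this sketch so the helper can be exercised without touching the real /sys:

```shell
# Toggle the iostats attribute for every block device under a sysfs root.
# toggle_iostats VALUE [ROOT]  -- ROOT defaults to /sys.
toggle_iostats() {
    root=${2:-/sys}
    for f in "$root"/block/*/queue/iostats; do
        [ -e "$f" ] || continue      # skip when the glob matches nothing
        echo "$1" > "$f"
    done
}
```

Usage: `toggle_iostats 0` disables IO accounting on all block devices (needs root); `toggle_iostats 1` re-enables it.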

> 
> >
> > Either there is issue introduced in part_round_stats() recently since I
> > remember that this counter should have been read at most one time during
> > one jiffies in IO path, or the implementation of blk_mq_in_flight() can
> become
> > a bit heavy in your environment. Jens may have idea about this issue.
> >
> > And I guess the lockup issue may be avoided by this approach now?
> 
> No. For the CPU lockup we need the irq-poll interface to exit the ISR
> loop of the driver.

Actually, once this patchset is in place, request completion is basically
done on the submission CPU, so no CPU should be overloaded, given that
your system has so many MSI-X irq vectors and enough CPU cores.

I am interested in this problem too, but I think we have to fix the IO
accounting issue first. Once the accounting issue (which may cause too
much CPU to be consumed in the interrupt handler) is fixed, let's see if
the lockup issue is still there. If it is, 'perf' may tell us something.
But in your previous perf trace, it looks like only the accounting
symbols show up in the hot path.
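To make the before/after profiles easy to compare, here is a small filter for `perf report --stdio` output that keeps only the accounting-related symbols; the symbol list is an assumption drawn from the perf-top snippets earlier in this thread, not an exhaustive set:

```shell
# Keep only the IO-accounting symbols from a `perf report --stdio` dump;
# symbol list taken from the perf-top output earlier in the thread.
accounting_symbols() {
    grep -E 'bt_iter|blk_mq_queue_tag_busy_iter|part_round_stats|part_in_flight' "$1"
}
```

Usage: `perf report --stdio > perf.txt; accounting_symbols perf.txt` shows how much of the profile the accounting path still accounts for.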

Thanks,
Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-10  1:01                                 ` Ming Lei
@ 2018-02-11  5:31                                   ` Ming Lei
  2018-02-12 18:35                                     ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-11  5:31 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> Hi Kashyap,
> 
> On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Friday, February 9, 2018 11:01 AM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > > force_blk_mq
> > >
> > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > -----Original Message-----
> > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > To: Hannes Reinecke
> > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > > Easi; Omar
> > > > Sandoval;
> > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > > Peter
> > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > introduce force_blk_mq
> > > > >
> > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > >> -----Original Message-----
> > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > >> To: Hannes Reinecke
> > > > > > >> Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > >> Arun Easi; Omar
> > > > > > > Sandoval;
> > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > >> Brace;
> > > > > > > Peter
> > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > >> introduce force_blk_mq
> > > > > > >>
> > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke
> > wrote:
> > > > > > >>> Hi all,
> > > > > > >>>
> > > > > > >>> [ .. ]
> > > > > > >>>>>
> > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > > > >>>> megaraid_sas
> > > > > > >>>>> so that I can reproduce your test?
> > > > > > >>>>>
> > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times
> > > > > > >>>>>> more
> > > > CPU.
> > > > > > >>>>>
> > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is
> > > > > > >>>>> after
> > > > > > >>>> applying the
> > > > > > >>>>> patch V2? And your test script?
> > > > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > > > >>>> Currently system is in used.
> > > > > > >>>>
> > > > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > > > >>>>
> > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > > > >>>> --ioengine=libaio --rw=randread
> > > > > > >>>>
> > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > > >>>>
> > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > >>
> > > > > > >> Hi Hannes,
> > > > > > >>
> > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > > > >> issue,
> > > > > > > which
> > > > > > >> causes only reply queue 0 used.
> > > > > > >>
> > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > >>
> > > > > > >> So could you guys run your performance test again after fixing
> > > > > > >> the
> > > > > > > patch?
> > > > > > >
> > > > > > > Ming -
> > > > > > >
> > > > > > > I tried after change you requested.  Performance drop is still
> > > > unresolved.
> > > > > > > From 1.6 M IOPS to 770K IOPS.
> > > > > > >
> > > > > > > See below data. All 24 reply queue is in used correctly.
> > > > > > >
> > > > > > > IRQs / 1 second(s)
> > > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > > > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > > > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > > > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > > > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > > > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > > > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > > > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > > > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > > > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > > > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > > > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > > > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > > > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > > > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > > > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > > > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > > > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > > > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > > > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > > > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > > > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > > > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > > > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > > > > > >
> > > > > > >
> > > > > > > Average:        CPU      %usr     %nice      %sys   %iowait
> > > > %steal
> > > > > > > %irq     %soft    %guest    %gnice     %idle
> > > > > > > Average:         18      3.80      0.00     14.78     10.08
> > > > 0.00
> > > > > > > 0.00      4.01      0.00      0.00     67.33
> > > > > > > Average:         19      3.26      0.00     15.35     10.62
> > > > 0.00
> > > > > > > 0.00      4.03      0.00      0.00     66.74
> > > > > > > Average:         20      3.42      0.00     14.57     10.67
> > > > 0.00
> > > > > > > 0.00      3.84      0.00      0.00     67.50
> > > > > > > Average:         21      3.19      0.00     15.60     10.75
> > > > 0.00
> > > > > > > 0.00      4.16      0.00      0.00     66.30
> > > > > > > Average:         22      3.58      0.00     15.15     10.66
> > > > 0.00
> > > > > > > 0.00      3.51      0.00      0.00     67.11
> > > > > > > Average:         23      3.34      0.00     15.36     10.63
> > > > 0.00
> > > > > > > 0.00      4.17      0.00      0.00     66.50
> > > > > > > Average:         24      3.50      0.00     14.58     10.93
> > > > 0.00
> > > > > > > 0.00      3.85      0.00      0.00     67.13
> > > > > > > Average:         25      3.20      0.00     14.68     10.86
> > > > 0.00
> > > > > > > 0.00      4.31      0.00      0.00     66.95
> > > > > > > Average:         26      3.27      0.00     14.80     10.70
> > > > 0.00
> > > > > > > 0.00      3.68      0.00      0.00     67.55
> > > > > > > Average:         27      3.58      0.00     15.36     10.80
> > > > 0.00
> > > > > > > 0.00      3.79      0.00      0.00     66.48
> > > > > > > Average:         28      3.46      0.00     15.17     10.46
> > > > 0.00
> > > > > > > 0.00      3.32      0.00      0.00     67.59
> > > > > > > Average:         29      3.34      0.00     14.42     10.72
> > > > 0.00
> > > > > > > 0.00      3.34      0.00      0.00     68.18
> > > > > > > Average:         30      3.34      0.00     15.08     10.70
> > > > 0.00
> > > > > > > 0.00      3.89      0.00      0.00     66.99
> > > > > > > Average:         31      3.26      0.00     15.33     10.47
> > > > 0.00
> > > > > > > 0.00      3.33      0.00      0.00     67.61
> > > > > > > Average:         32      3.21      0.00     14.80     10.61
> > > > 0.00
> > > > > > > 0.00      3.70      0.00      0.00     67.67
> > > > > > > Average:         33      3.40      0.00     13.88     10.55
> > > > 0.00
> > > > > > > 0.00      4.02      0.00      0.00     68.15
> > > > > > > Average:         34      3.74      0.00     17.41     10.61
> > > > 0.00
> > > > > > > 0.00      4.51      0.00      0.00     63.73
> > > > > > > Average:         35      3.35      0.00     14.37     10.74
> > > > 0.00
> > > > > > > 0.00      3.84      0.00      0.00     67.71
> > > > > > > Average:         36      0.54      0.00      1.77      0.00
> > > > 0.00
> > > > > > > 0.00      0.00      0.00      0.00     97.69
> > > > > > > ..
> > > > > > > Average:         54      3.60      0.00     15.17     10.39
> > > > 0.00
> > > > > > > 0.00      4.22      0.00      0.00     66.62
> > > > > > > Average:         55      3.33      0.00     14.85     10.55
> > > > 0.00
> > > > > > > 0.00      3.96      0.00      0.00     67.31
> > > > > > > Average:         56      3.40      0.00     15.19     10.54
> > > > 0.00
> > > > > > > 0.00      3.74      0.00      0.00     67.13
> > > > > > > Average:         57      3.41      0.00     13.98     10.78
> > > > 0.00
> > > > > > > 0.00      4.10      0.00      0.00     67.73
> > > > > > > Average:         58      3.32      0.00     15.16     10.52
> > > > 0.00
> > > > > > > 0.00      4.01      0.00      0.00     66.99
> > > > > > > Average:         59      3.17      0.00     15.80     10.35
> > > > 0.00
> > > > > > > 0.00      3.86      0.00      0.00     66.80
> > > > > > > Average:         60      3.00      0.00     14.63     10.59
> > > > 0.00
> > > > > > > 0.00      3.97      0.00      0.00     67.80
> > > > > > > Average:         61      3.34      0.00     14.70     10.66
> > > > 0.00
> > > > > > > 0.00      4.32      0.00      0.00     66.97
> > > > > > > Average:         62      3.34      0.00     15.29     10.56
> > > > 0.00
> > > > > > > 0.00      3.89      0.00      0.00     66.92
> > > > > > > Average:         63      3.29      0.00     14.51     10.72
> > > > 0.00
> > > > > > > 0.00      3.85      0.00      0.00     67.62
> > > > > > > Average:         64      3.48      0.00     15.31     10.65
> > > > 0.00
> > > > > > > 0.00      3.97      0.00      0.00     66.60
> > > > > > > Average:         65      3.34      0.00     14.36     10.80
> > > > 0.00
> > > > > > > 0.00      4.11      0.00      0.00     67.39
> > > > > > > Average:         66      3.13      0.00     14.94     10.70
> > > > 0.00
> > > > > > > 0.00      4.10      0.00      0.00     67.13
> > > > > > > Average:         67      3.06      0.00     15.56     10.69
> > > > 0.00
> > > > > > > 0.00      3.82      0.00      0.00     66.88
> > > > > > > Average:         68      3.33      0.00     14.98     10.61
> > > > 0.00
> > > > > > > 0.00      3.81      0.00      0.00     67.27
> > > > > > > Average:         69      3.20      0.00     15.43     10.70
> > > > 0.00
> > > > > > > 0.00      3.82      0.00      0.00     66.85
> > > > > > > Average:         70      3.34      0.00     17.14     10.59
> > > > 0.00
> > > > > > > 0.00      3.00      0.00      0.00     65.92
> > > > > > > Average:         71      3.41      0.00     14.94     10.56
> > > > 0.00
> > > > > > > 0.00      3.41      0.00      0.00     67.69
> > > > > > >
> > > > > > > Perf top -
> > > > > > >
> > > > > > >   64.33%  [kernel]            [k] bt_iter
> > > > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > > > >    2.40%  [kernel]            [k]
> > native_queued_spin_lock_slowpath
> > > > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > > > >
> > > > > > Ah. So we're spending quite some time in trying to find a free
> > tag.
> > > > > > I guess this is due to every queue starting at the same position
> > > > > > trying to find a free tag, which inevitably leads to a contention.
> > > > >
> > > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > > bottleneck,
> > > > > and looks not related with tag allocation.
> > > > >
> > > > > Kashyap, could you run your performance test again after disabling
> > > > iostat by
> > > > > the following command on all test devices and killing all utilities
> > > > which may
> > > > > read iostat(/proc/diskstats, ...)?
> > > > >
> > > > > 	echo 0 > /sys/block/sdN/queue/iostats
> > > >
> > > > Ming - After changing iostat = 0 , I see performance issue is
> > resolved.
> > > >
> > > > Below is perf top output after iostats = 0
> > > >
> > > >
> > > >   23.45%  [kernel]             [k] bt_iter
> > > >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> > > >    2.18%  [kernel]             [k] _find_next_bit
> > > >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> > > >    1.87%  [kernel]             [k] clflush_cache_range
> > > >    1.70%  [kernel]             [k] dma_pte_clear_level
> > > >    1.56%  [kernel]             [k] __domain_mapping
> > > >    1.55%  [kernel]             [k] sbitmap_queue_clear
> > > >    1.30%  [kernel]             [k] gup_pgd_range
> > >
> > > Hi Kashyap,
> > >
> > > Thanks for your test and update.
> > >
> > > Looks blk_mq_queue_tag_busy_iter() is still sampled by perf even though
> > > iostats is disabled, and I guess there may be utilities which are
> > reading iostats
> > > a bit frequently.
> > 
> > I  will be doing some more testing and post you my findings.
> 
> I will find some time this weekend to see if I can cook up a patch to
> address this IO accounting issue.

Hi Kashyap,

Please test the top 5 patches in the following tree to see if
megaraid_sas's performance is OK:

	https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2

This tree was made by applying these 5 patches on top of patchset V2.

If possible, please share the performance data both without and with
these patches, together with the perf traces.

The top 5 patches address the IO accounting issue, which IMO should be
the main cause of your performance drop, and possibly even of the lockup
in megaraid_sas's ISR.
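The requested comparison can be scripted; the sketch below only prints the planned commands (a dry run), and the kernel labels as well as the `jbod.fio` job file are assumptions based on earlier messages in the thread:

```shell
# Dry-run plan for the requested comparison: one fio run and one perf
# profile per kernel (without and with the top 5 patches).
plan_runs() {
    for kernel in baseline global-tags-v2-top5; do
        echo "boot kernel: $kernel"
        echo "perf record -a -g -o perf-$kernel.data --" \
             "numactl -N 1 fio jbod.fio --rw=randread --iodepth=64" \
             "--bs=4k --ioengine=libaio"
        echo "perf report --stdio -i perf-$kernel.data > perf-$kernel.txt"
    done
}
plan_runs
```

Running the printed commands on each booted kernel yields the IOPS numbers from fio plus a perf profile per configuration for side-by-side comparison.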

Thanks,
Ming


* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-11  5:31                                   ` Ming Lei
@ 2018-02-12 18:35                                     ` Kashyap Desai
  2018-02-13  0:40                                       ` Ming Lei
  0 siblings, 1 reply; 39+ messages in thread
From: Kashyap Desai @ 2018-02-12 18:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Sunday, February 11, 2018 11:01 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > Hi Kashyap,
> >
> > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > -----Original Message-----
> > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > To: Kashyap Desai
> > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > Easi; Omar
> > > Sandoval;
> > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > Brace;
> > > Peter
> > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > introduce force_blk_mq
> > > >
> > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > -----Original Message-----
> > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > To: Hannes Reinecke
> > > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > Arun Easi; Omar
> > > > > Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace;
> > > > > Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
wrote:
> > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > >> -----Original Message-----
> > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > >> To: Hannes Reinecke
> > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > > Sandoval;
> > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > >> Don Brace;
> > > > > > > > Peter
> > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > >> tags & introduce force_blk_mq
> > > > > > > >>
> > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > > > > > >>> Hi all,
> > > > > > > >>>
> > > > > > > >>> [ .. ]
> > > > > > > >>>>>
> > > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > > > > >>>>> megaraid_sas so that I can reproduce your test?
> > > > > > > >>>>>
> > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > > > > > >>>>>
> > > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is after
> > > > > > > >>>>> applying the patch V2? And your test script?
> > > > > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > > > > >>>> Currently system is in used.
> > > > > > > >>>>
> > > > > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > > > > >>>>
> > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64
> > > > > > > >>>> --bs=4k --ioengine=libaio --rw=randread
> > > > > > > >>>>
> > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > > > >>>>
> > > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > > >>
> > > > > > > >> Hi Hannes,
> > > > > > > >>
> > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> > > > > > > >> which causes only reply queue 0 used.
> > > > > > > >>
> > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > > >>
> > > > > > > >> So could you guys run your performance test again after fixing the
> > > > > > > >> patch?
> > > > > > > >
> > > > > > > > Ming -
> > > > > > > >
> > > > > > > > I tried after change you requested.  Performance drop is still
> > > > > > > > unresolved. From 1.6 M IOPS to 770K IOPS.
> > > > > > > >
> > > > > > > > See below data. All 24 reply queue is in used correctly.
> > > > > > > >
> > > > > > > > IRQs / 1 second(s)
> > > > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge megasas
> > > > > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge megasas
> > > > > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge megasas
> > > > > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge megasas
> > > > > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge megasas
> > > > > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge megasas
> > > > > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge megasas
> > > > > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge megasas
> > > > > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge megasas
> > > > > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge megasas
> > > > > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge megasas
> > > > > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge megasas
> > > > > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge megasas
> > > > > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge megasas
> > > > > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge megasas
> > > > > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge megasas
> > > > > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge megasas
> > > > > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge megasas
> > > > > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge megasas
> > > > > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge megasas
> > > > > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge megasas
> > > > > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge megasas
> > > > > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge megasas
> > > > > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge megasas
> > > > > > > >
> > > > > > > >
> > > > > > > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > > > > > Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
> > > > > > > > Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
> > > > > > > > Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
> > > > > > > > Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
> > > > > > > > Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> > > > > > > > Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> > > > > > > > Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> > > > > > > > Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> > > > > > > > Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> > > > > > > > Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> > > > > > > > Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> > > > > > > > Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> > > > > > > > Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> > > > > > > > Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> > > > > > > > Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> > > > > > > > Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> > > > > > > > Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> > > > > > > > Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> > > > > > > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > > > > > > ..
> > > > > > > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > > > > > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > > > > > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > > > > > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > > > > > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > > > > > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > > > > > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > > > > > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > > > > > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > > > > > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > > > > > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > > > > > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > > > > > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > > > > > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > > > > > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > > > > > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > > > > > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > > > > > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > > > > > > >
> > > > > > > > Perf top -
> > > > > > > >
> > > > > > > >   64.33%  [kernel]            [k] bt_iter
> > > > > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > > > > >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> > > > > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > > > > >
> > > > > > > Ah. So we're spending quite some time in trying to find a free tag.
> > > > > > > I guess this is due to every queue starting at the same position
> > > > > > > trying to find a free tag, which inevitably leads to a contention.
> > > > > >
> > > > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > > > > bottleneck, and looks not related with tag allocation.
> > > > > >
> > > > > > Kashyap, could you run your performance test again after disabling
> > > > > > iostat by the following command on all test devices and killing all
> > > > > > utilities which may read iostat (/proc/diskstats, ...)?
> > > > > >
> > > > > > 	echo 0 > /sys/block/sdN/queue/iostat
> > > > >
> > > > > Ming - After changing iostat = 0, I see the performance issue is
> > > > > resolved.
> > > > >
> > > > > Below is perf top output after iostats = 0
> > > > >
> > > > >
> > > > >   23.45%  [kernel]             [k] bt_iter
> > > > >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> > > > >    2.18%  [kernel]             [k] _find_next_bit
> > > > >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> > > > >    1.87%  [kernel]             [k] clflush_cache_range
> > > > >    1.70%  [kernel]             [k] dma_pte_clear_level
> > > > >    1.56%  [kernel]             [k] __domain_mapping
> > > > >    1.55%  [kernel]             [k] sbitmap_queue_clear
> > > > >    1.30%  [kernel]             [k] gup_pgd_range
> > > >
> > > > Hi Kashyap,
> > > >
> > > > Thanks for your test and update.
> > > >
> > > > Looks like blk_mq_queue_tag_busy_iter() is still sampled by perf even
> > > > though iostats is disabled, and I guess there may be utilities which are
> > > > reading iostats a bit frequently.
> > >
> > > I will be doing some more testing and will post you my findings.
> >
> > I will find some time this weekend to see if I can cook a patch to
> > address this io accounting issue.
>
> Hi Kashyap,
>
> Please test the top 5 patches in the following tree to see if
> megaraid_sas's performance is OK:
>
> 	https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2
>
> This tree is made by adding these 5 patches against patchset V2.
>

Ming -
I applied the 5 patches on top of V2, and the behavior is still unchanged. Below
is the perf top data (1000K IOPS).

  34.58%  [kernel]                 [k] bt_iter
   2.96%  [kernel]                 [k] sbitmap_any_bit_set
   2.77%  [kernel]                 [k] bt_iter_global_tags
   1.75%  [megaraid_sas]           [k] complete_cmd_fusion
   1.62%  [kernel]                 [k] sbitmap_queue_clear
   1.62%  [kernel]                 [k] _raw_spin_lock
   1.51%  [kernel]                 [k] blk_mq_run_hw_queue
   1.45%  [kernel]                 [k] gup_pgd_range
   1.31%  [kernel]                 [k] irq_entries_start
   1.29%  fio                      [.] __fio_gettime
   1.13%  [kernel]                 [k] _raw_spin_lock_irqsave
   0.95%  [kernel]                 [k] native_queued_spin_lock_slowpath
   0.92%  [kernel]                 [k] scsi_queue_rq
   0.91%  [kernel]                 [k] blk_mq_run_hw_queues
   0.85%  [kernel]                 [k] blk_mq_get_request
   0.81%  [kernel]                 [k] switch_mm_irqs_off
   0.78%  [megaraid_sas]           [k] megasas_build_io_fusion
   0.77%  [kernel]                 [k] __schedule
   0.73%  [kernel]                 [k] update_load_avg
   0.69%  [kernel]                 [k] fput
   0.65%  [kernel]                 [k] scsi_dispatch_cmd
   0.64%  fio                      [.] fio_libaio_event
   0.53%  [kernel]                 [k] do_io_submit
   0.52%  [kernel]                 [k] read_tsc
   0.51%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
   0.51%  [kernel]                 [k] scsi_softirq_done
   0.50%  [kernel]                 [k] kobject_put
   0.50%  [kernel]                 [k] cpuidle_enter_state
   0.49%  [kernel]                 [k] native_write_msr
   0.48%  fio                      [.] io_completed

Below is perf top data with iostat=0  (1400K IOPS)

   4.87%  [kernel]                      [k] sbitmap_any_bit_set
   2.93%  [kernel]                      [k] _raw_spin_lock
   2.84%  [megaraid_sas]                [k] complete_cmd_fusion
   2.38%  [kernel]                      [k] irq_entries_start
   2.36%  [kernel]                      [k] gup_pgd_range
   2.35%  [kernel]                      [k] blk_mq_run_hw_queue
   2.30%  [kernel]                      [k] sbitmap_queue_clear
   2.01%  fio                           [.] __fio_gettime
   1.78%  [kernel]                      [k] _raw_spin_lock_irqsave
   1.51%  [kernel]                      [k] scsi_queue_rq
   1.43%  [kernel]                      [k] blk_mq_run_hw_queues
   1.36%  [kernel]                      [k] fput
   1.32%  [kernel]                      [k] __schedule
   1.31%  [kernel]                      [k] switch_mm_irqs_off
   1.29%  [kernel]                      [k] update_load_avg
   1.25%  [megaraid_sas]                [k] megasas_build_io_fusion
   1.22%  [kernel]                      [k] native_queued_spin_lock_slowpath
   1.03%  [kernel]                      [k] scsi_dispatch_cmd
   1.03%  [kernel]                      [k] blk_mq_get_request
   0.91%  fio                           [.] fio_libaio_event
   0.89%  [kernel]                      [k] scsi_softirq_done
   0.87%  [kernel]                      [k] kobject_put
   0.86%  [kernel]                      [k] cpuidle_enter_state
   0.84%  fio                           [.] io_completed
   0.83%  [kernel]                      [k] do_io_submit
   0.83%  [megaraid_sas]                [k] megasas_build_and_issue_cmd_fusion
   0.83%  [kernel]                      [k] __switch_to
   0.82%  [kernel]                      [k] read_tsc
   0.80%  [kernel]                      [k] native_write_msr
   0.76%  [kernel]                      [k] aio_comp


Perf top data without the V2 patchset applied (1600K IOPS)

   5.97%  [megaraid_sas]           [k] complete_cmd_fusion
   5.24%  [kernel]                 [k] bt_iter
   3.28%  [kernel]                 [k] _raw_spin_lock
   2.98%  [kernel]                 [k] irq_entries_start
   2.29%  fio                      [.] __fio_gettime
   2.04%  [kernel]                 [k] scsi_queue_rq
   1.92%  [megaraid_sas]           [k] megasas_build_io_fusion
   1.61%  [kernel]                 [k] switch_mm_irqs_off
   1.59%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
   1.41%  [kernel]                 [k] scsi_dispatch_cmd
   1.33%  [kernel]                 [k] scsi_softirq_done
   1.18%  [kernel]                 [k] gup_pgd_range
   1.18%  [kernel]                 [k] blk_mq_complete_request
   1.13%  [kernel]                 [k] blk_mq_free_request
   1.05%  [kernel]                 [k] do_io_submit
   1.04%  [kernel]                 [k] _find_next_bit
   1.02%  [kernel]                 [k] blk_mq_get_request
   0.95%  [megaraid_sas]           [k] megasas_build_ldio_fusion
   0.95%  [kernel]                 [k] scsi_dec_host_busy
   0.89%  fio                      [.] get_io_u
   0.88%  [kernel]                 [k] entry_SYSCALL_64
   0.84%  [megaraid_sas]           [k] megasas_queue_command
   0.79%  [kernel]                 [k] native_write_msr
   0.77%  [kernel]                 [k] read_tsc
   0.73%  [kernel]                 [k] _raw_spin_lock_irqsave
   0.73%  fio                      [.] fio_libaio_commit
   0.72%  [kernel]                 [k] kmem_cache_alloc
   0.72%  [kernel]                 [k] blkdev_direct_IO
   0.69%  [megaraid_sas]           [k] MR_GetPhyParams
   0.68%  [kernel]                 [k] blk_mq_dequeue_f
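The common thread in the three profiles above is that bt_iter /
blk_mq_queue_tag_busy_iter time shows up whenever iostat accounting is on:
sampling in-flight requests by walking the whole tag bitmap is expensive at
these IOPS rates. As a rough illustration of why, here is a toy model
contrasting scan-based in-flight accounting with simple per-CPU counters. All
names and sizes below (TAG_DEPTH, NR_HCTX, the 72-CPU count) are made up for
the sketch; the real kernel code is far more involved.

```python
# Toy model of two in-flight accounting strategies. Illustration only:
# TAG_DEPTH/NR_HCTX are assumed values, not real driver parameters.

TAG_DEPTH = 4096          # assumed tag space, like a large HBA queue depth
NR_HCTX = 24              # assumed number of hardware queues / reply queues

def inflight_by_bitmap_scan(tag_bitmaps):
    """Scan-based accounting: every sample walks every bit of every
    hctx's tag bitmap, so each sample costs O(NR_HCTX * TAG_DEPTH)."""
    busy = 0
    for bitmap in tag_bitmaps:
        for bit in bitmap:    # this inner loop is the bt_iter-style walk
            busy += bit
    return busy

def inflight_by_counter(counters):
    """Counter-based accounting: start/complete paths bump a per-CPU
    counter, and a sample just sums one value per CPU."""
    return sum(counters)

# 100 requests in flight, spread across the hardware queues.
bitmaps = [[0] * TAG_DEPTH for _ in range(NR_HCTX)]
for i in range(100):
    bitmaps[i % NR_HCTX][i] = 1

counters = [0] * 72           # assumed 72 CPUs, as in the mpstat data above
for i in range(100):
    counters[i % 72] += 1

# Same answer, but the scan touches NR_HCTX * TAG_DEPTH = 98304 bits per
# sample while the counter sum touches only 72 integers.
assert inflight_by_bitmap_scan(bitmaps) == inflight_by_counter(counters) == 100
```

This is only meant to show the asymptotic difference the io accounting
patches are chasing, not how the kernel implements either side.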


> If possible, please provide us the performance data both without and with
> these patches, together with the perf trace.
>
> The top 5 patches address the io accounting issue, which should be the
> main reason for your performance drop, and even for the lockup in
> megaraid_sas's ISR, IMO.

I think the performance drop is a different issue, possibly a side effect of
the patch set. Even if we fix this perf issue, the CPU lockup is a completely
different problem.
Regarding the CPU lockup, there was a similar discussion in which folks found
irq_poll to be a good method for resolving lockups. I am not sure why the NVMe
driver did not opt for irq_poll, but there was extensive discussion there, and
I am also seeing CPU lockups mainly because multiple completion/reply queues
are tied to a single CPU. The weighting method in irq_poll lets us quit the
ISR, and that is how we can avoid the lockup.
http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html
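The irq_poll weighting idea described above can be sketched as follows. This
is a toy model only: it assumes a simple round-robin budget between two reply
queues pinned to one CPU, and it is not the kernel's actual irq_poll API.

```python
from collections import deque

# Toy sketch of budgeted completion: the handler processes at most
# `weight` completions per invocation and then yields, so a CPU serving
# several reply queues cannot be monopolized by one busy queue.

def poll_reply_queue(queue, weight):
    """Complete at most `weight` entries; return how many were done."""
    done = 0
    while queue and done < weight:
        queue.popleft()      # "complete" one command
        done += 1
    return done

# Two reply queues tied to the same CPU, one of them heavily loaded.
q_busy = deque(range(1000))
q_light = deque(range(10))

rounds = 0
while q_busy or q_light:
    # Round-robin with a budget: neither queue can starve the other,
    # and the handler returns to the scheduler between rounds.
    poll_reply_queue(q_busy, weight=64)
    poll_reply_queue(q_light, weight=64)
    rounds += 1
```

Without the budget, a single ISR invocation would sit in the busy queue for
all 1000 completions, which is the lockup pattern described above; with it,
the light queue is drained in the first round and the CPU gets a chance to
yield between rounds.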

>
> Thanks,
> Ming


* Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-12 18:35                                     ` Kashyap Desai
@ 2018-02-13  0:40                                       ` Ming Lei
  2018-02-14  6:28                                         ` Kashyap Desai
  0 siblings, 1 reply; 39+ messages in thread
From: Ming Lei @ 2018-02-13  0:40 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

Hi Kashyap,

On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Sunday, February 11, 2018 11:01 AM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > Hi Kashyap,
> > >
> > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > -----Original Message-----
> > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > > To: Kashyap Desai
> > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > > Easi; Omar
> > > > Sandoval;
> > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > Brace;
> > > > Peter
> > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > introduce force_blk_mq
> > > > >
> > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > > To: Hannes Reinecke
> > > > > > > Cc: Kashyap Desai; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > > Arun Easi; Omar
> > > > > > Sandoval;
> > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > > Brace;
> > > > > > Peter
> > > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > > introduce force_blk_mq
> > > > > > >
> > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > >> -----Original Message-----
> > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > > >> To: Hannes Reinecke
> > > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > > > Sandoval;
> > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > > >> Don Brace;
> > > > > > > > > Peter
> > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > > >> tags & introduce force_blk_mq
> > > > > > > > >>
> > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > > > > > > >>> Hi all,
> > > > > > > > >>>
> > > > > > > > >>> [ .. ]
> > > > > > > > >>>>>
> > > > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > > > > > >>>>> megaraid_sas so that I can reproduce your test?
> > > > > > > > >>>>>
> > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect is
> > > > > > > > >>>>> after applying the patch V2? And your test script?
> > > > > > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > > > > > >>>> Currently system is in used.
> > > > > > > > >>>>
> > > > > > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > > > > > >>>>
> > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64
> > > > > > > > >>>> --bs=4k --ioengine=libaio --rw=randread
> > > > > > > > >>>>
> > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > > > > >>>>
> > > > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > > > >>
> > > > > > > > >> Hi Hannes,
> > > > > > > > >>
> > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > > > > > >> issue, which causes only reply queue 0 used.
> > > > > > > > >>
> > > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > > > >>
> > > > > > > > >> So could you guys run your performance test again after fixing
> > > > > > > > >> the patch?
> > > > > > > > >
> > > > > > > > > Ming -
> > > > > > > > >
> > > > > > > > > I tried after change you requested.  Performance drop is still
> > > > > > > > > unresolved. From 1.6 M IOPS to 770K IOPS.
> > > > > > > > >
> > > > > > > > > See below data. All 24 reply queue is in used correctly.
> > > > > > > > >
> > > > > > > > > IRQs / 1 second(s)
> > > > > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge
> megasas
> > > > > > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge
> megasas
> > > > > > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge
> megasas
> > > > > > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge
> megasas
> > > > > > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge
> megasas
> > > > > > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge
> megasas
> > > > > > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge
> megasas
> > > > > > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge
> megasas
> > > > > > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge
> megasas
> > > > > > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge
> megasas
> > > > > > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge
> megasas
> > > > > > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge
> megasas
> > > > > > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge
> megasas
> > > > > > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge
> megasas
> > > > > > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge
> megasas
> > > > > > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge
> megasas
> > > > > > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge
> megasas
> > > > > > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge
> megasas
> > > > > > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge
> megasas
> > > > > > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge
> megasas
> > > > > > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge
> megasas
> > > > > > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge
> megasas
> > > > > > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge
> megasas
> > > > > > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge
> megasas
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Average:        CPU      %usr     %nice      %sys
> %iowait
> > > > > > %steal
> > > > > > > > > %irq     %soft    %guest    %gnice     %idle
> > > > > > > > > Average:         18      3.80      0.00     14.78
> 10.08
> > > > > > 0.00
> > > > > > > > > 0.00      4.01      0.00      0.00     67.33
> > > > > > > > > Average:         19      3.26      0.00     15.35
> 10.62
> > > > > > 0.00
> > > > > > > > > 0.00      4.03      0.00      0.00     66.74
> > > > > > > > > Average:         20      3.42      0.00     14.57
> 10.67
> > > > > > 0.00
> > > > > > > > > 0.00      3.84      0.00      0.00     67.50
> > > > > > > > > Average:         21      3.19      0.00     15.60
> 10.75
> > > > > > 0.00
> > > > > > > > > 0.00      4.16      0.00      0.00     66.30
> > > > > > > > > Average:         22      3.58      0.00     15.15
> 10.66
> > > > > > 0.00
> > > > > > > > > 0.00      3.51      0.00      0.00     67.11
> > > > > > > > > Average:         23      3.34      0.00     15.36
> 10.63
> > > > > > 0.00
> > > > > > > > > 0.00      4.17      0.00      0.00     66.50
> > > > > > > > > Average:         24      3.50      0.00     14.58
> 10.93
> > > > > > 0.00
> > > > > > > > > 0.00      3.85      0.00      0.00     67.13
> > > > > > > > > Average:         25      3.20      0.00     14.68
> 10.86
> > > > > > 0.00
> > > > > > > > > 0.00      4.31      0.00      0.00     66.95
> > > > > > > > > Average:         26      3.27      0.00     14.80
> 10.70
> > > > > > 0.00
> > > > > > > > > 0.00      3.68      0.00      0.00     67.55
> > > > > > > > > Average:         27      3.58      0.00     15.36
> 10.80
> > > > > > 0.00
> > > > > > > > > 0.00      3.79      0.00      0.00     66.48
> > > > > > > > > Average:         28      3.46      0.00     15.17
> 10.46
> > > > > > 0.00
> > > > > > > > > 0.00      3.32      0.00      0.00     67.59
> > > > > > > > > Average:         29      3.34      0.00     14.42
> 10.72
> > > > > > 0.00
> > > > > > > > > 0.00      3.34      0.00      0.00     68.18
> > > > > > > > > Average:         30      3.34      0.00     15.08
> 10.70
> > > > > > 0.00
> > > > > > > > > 0.00      3.89      0.00      0.00     66.99
> > > > > > > > > Average:         31      3.26      0.00     15.33
> 10.47
> > > > > > 0.00
> > > > > > > > > 0.00      3.33      0.00      0.00     67.61
> > > > > > > > > Average:         32      3.21      0.00     14.80
> 10.61
> > > > > > 0.00
> > > > > > > > > 0.00      3.70      0.00      0.00     67.67
> > > > > > > > > Average:         33      3.40      0.00     13.88
> 10.55
> > > > > > 0.00
> > > > > > > > > 0.00      4.02      0.00      0.00     68.15
> > > > > > > > > Average:         34      3.74      0.00     17.41
> 10.61
> > > > > > 0.00
> > > > > > > > > 0.00      4.51      0.00      0.00     63.73
> > > > > > > > > Average:         35      3.35      0.00     14.37
> 10.74
> > > > > > 0.00
> > > > > > > > > 0.00      3.84      0.00      0.00     67.71
> > > > > > > > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > > > > > > > ..
> > > > > > > > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > > > > > > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > > > > > > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > > > > > > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > > > > > > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > > > > > > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > > > > > > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > > > > > > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > > > > > > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > > > > > > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > > > > > > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > > > > > > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > > > > > > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > > > > > > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > > > > > > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > > > > > > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > > > > > > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > > > > > > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > > > > > > > >
> > > > > > > > > Perf top -
> > > > > > > > >
> > > > > > > > >   64.33%  [kernel]            [k] bt_iter
> > > > > > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > > > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > > > > > >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> > > > > > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > > > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > > > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > > > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > > > > > >
> > > > > > > > Ah. So we're spending quite some time in trying to find a
> > > > > > > > free tag.
> > > > > > > > I guess this is due to every queue starting at the same
> > > > > > > > position trying to find a free tag, which inevitably leads
> > > > > > > > to a contention.
> > > > > > >
> > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > > > > > bottleneck, and it looks unrelated to tag allocation.
> > > > > > >
> > > > > > > Kashyap, could you run your performance test again after
> > > > > > > disabling iostats with the following command on all test
> > > > > > > devices, and after killing all utilities which may read
> > > > > > > iostats (/proc/diskstats, ...)?
> > > > > > >
> > > > > > > 	echo 0 > /sys/block/sdN/queue/iostats
> > > > > >
> > > > > > Ming - After changing iostats to 0, I see the performance issue
> > > > > > is resolved.
> > > > > >
> > > > > > Below is perf top output after iostats = 0
> > > > > >
> > > > > >
> > > > > >   23.45%  [kernel]             [k] bt_iter
> > > > > >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> > > > > >    2.18%  [kernel]             [k] _find_next_bit
> > > > > >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> > > > > >    1.87%  [kernel]             [k] clflush_cache_range
> > > > > >    1.70%  [kernel]             [k] dma_pte_clear_level
> > > > > >    1.56%  [kernel]             [k] __domain_mapping
> > > > > >    1.55%  [kernel]             [k] sbitmap_queue_clear
> > > > > >    1.30%  [kernel]             [k] gup_pgd_range
> > > > >
> > > > > Hi Kashyap,
> > > > >
> > > > > Thanks for your test and update.
> > > > >
> > > > > Looks like blk_mq_queue_tag_busy_iter() is still sampled by perf
> > > > > even though iostats is disabled, and I guess there may be
> > > > > utilities which are reading iostats a bit frequently.
> > > >
> > > > I will be doing some more testing and will post you my findings.
> > >
> > > I will find some time this weekend to see if I can cook a patch to
> > > address this issue of io accounting.
> >
> > Hi Kashyap,
> >
> > Please test the top 5 patches in the following tree to see if
> > megaraid_sas's performance is OK:
> >
> > 	https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2
> >
> > This tree is made by adding these 5 patches against patchset V2.
> >
> 
> Ming -
> I applied 5 patches on top of V2 and behavior is still unchanged. Below is
> perf top data. (1000K IOPS)
> 
>   34.58%  [kernel]                 [k] bt_iter
>    2.96%  [kernel]                 [k] sbitmap_any_bit_set
>    2.77%  [kernel]                 [k] bt_iter_global_tags
>    1.75%  [megaraid_sas]           [k] complete_cmd_fusion
>    1.62%  [kernel]                 [k] sbitmap_queue_clear
>    1.62%  [kernel]                 [k] _raw_spin_lock
>    1.51%  [kernel]                 [k] blk_mq_run_hw_queue
>    1.45%  [kernel]                 [k] gup_pgd_range
>    1.31%  [kernel]                 [k] irq_entries_start
>    1.29%  fio                      [.] __fio_gettime
>    1.13%  [kernel]                 [k] _raw_spin_lock_irqsave
>    0.95%  [kernel]                 [k] native_queued_spin_lock_slowpath
>    0.92%  [kernel]                 [k] scsi_queue_rq
>    0.91%  [kernel]                 [k] blk_mq_run_hw_queues
>    0.85%  [kernel]                 [k] blk_mq_get_request
>    0.81%  [kernel]                 [k] switch_mm_irqs_off
>    0.78%  [megaraid_sas]           [k] megasas_build_io_fusion
>    0.77%  [kernel]                 [k] __schedule
>    0.73%  [kernel]                 [k] update_load_avg
>    0.69%  [kernel]                 [k] fput
>    0.65%  [kernel]                 [k] scsi_dispatch_cmd
>    0.64%  fio                      [.] fio_libaio_event
>    0.53%  [kernel]                 [k] do_io_submit
>    0.52%  [kernel]                 [k] read_tsc
>    0.51%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
>    0.51%  [kernel]                 [k] scsi_softirq_done
>    0.50%  [kernel]                 [k] kobject_put
>    0.50%  [kernel]                 [k] cpuidle_enter_state
>    0.49%  [kernel]                 [k] native_write_msr
>    0.48%  fio                      [.] io_completed
> 
> Below is perf top data with iostat=0  (1400K IOPS)
> 
>    4.87%  [kernel]                      [k] sbitmap_any_bit_set
>    2.93%  [kernel]                      [k] _raw_spin_lock
>    2.84%  [megaraid_sas]                [k] complete_cmd_fusion
>    2.38%  [kernel]                      [k] irq_entries_start
>    2.36%  [kernel]                      [k] gup_pgd_range
>    2.35%  [kernel]                      [k] blk_mq_run_hw_queue
>    2.30%  [kernel]                      [k] sbitmap_queue_clear
>    2.01%  fio                           [.] __fio_gettime
>    1.78%  [kernel]                      [k] _raw_spin_lock_irqsave
>    1.51%  [kernel]                      [k] scsi_queue_rq
>    1.43%  [kernel]                      [k] blk_mq_run_hw_queues
>    1.36%  [kernel]                      [k] fput
>    1.32%  [kernel]                      [k] __schedule
>    1.31%  [kernel]                      [k] switch_mm_irqs_off
>    1.29%  [kernel]                      [k] update_load_avg
>    1.25%  [megaraid_sas]                [k] megasas_build_io_fusion
>    1.22%  [kernel]                      [k] native_queued_spin_lock_slowpath
>    1.03%  [kernel]                      [k] scsi_dispatch_cmd
>    1.03%  [kernel]                      [k] blk_mq_get_request
>    0.91%  fio                           [.] fio_libaio_event
>    0.89%  [kernel]                      [k] scsi_softirq_done
>    0.87%  [kernel]                      [k] kobject_put
>    0.86%  [kernel]                      [k] cpuidle_enter_state
>    0.84%  fio                           [.] io_completed
>    0.83%  [kernel]                      [k] do_io_submit
>    0.83%  [megaraid_sas]                [k] megasas_build_and_issue_cmd_fusion
>    0.83%  [kernel]                      [k] __switch_to
>    0.82%  [kernel]                      [k] read_tsc
>    0.80%  [kernel]                      [k] native_write_msr
>    0.76%  [kernel]                      [k] aio_comp
> 
> 
> Perf data without V2 patch applied.  (1600K IOPS)
> 
>    5.97%  [megaraid_sas]           [k] complete_cmd_fusion
>    5.24%  [kernel]                 [k] bt_iter
>    3.28%  [kernel]                 [k] _raw_spin_lock
>    2.98%  [kernel]                 [k] irq_entries_start
>    2.29%  fio                      [.] __fio_gettime
>    2.04%  [kernel]                 [k] scsi_queue_rq
>    1.92%  [megaraid_sas]           [k] megasas_build_io_fusion
>    1.61%  [kernel]                 [k] switch_mm_irqs_off
>    1.59%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
>    1.41%  [kernel]                 [k] scsi_dispatch_cmd
>    1.33%  [kernel]                 [k] scsi_softirq_done
>    1.18%  [kernel]                 [k] gup_pgd_range
>    1.18%  [kernel]                 [k] blk_mq_complete_request
>    1.13%  [kernel]                 [k] blk_mq_free_request
>    1.05%  [kernel]                 [k] do_io_submit
>    1.04%  [kernel]                 [k] _find_next_bit
>    1.02%  [kernel]                 [k] blk_mq_get_request
>    0.95%  [megaraid_sas]           [k] megasas_build_ldio_fusion
>    0.95%  [kernel]                 [k] scsi_dec_host_busy
>    0.89%  fio                      [.] get_io_u
>    0.88%  [kernel]                 [k] entry_SYSCALL_64
>    0.84%  [megaraid_sas]           [k] megasas_queue_command
>    0.79%  [kernel]                 [k] native_write_msr
>    0.77%  [kernel]                 [k] read_tsc
>    0.73%  [kernel]                 [k] _raw_spin_lock_irqsave
>    0.73%  fio                      [.] fio_libaio_commit
>    0.72%  [kernel]                 [k] kmem_cache_alloc
>    0.72%  [kernel]                 [k] blkdev_direct_IO
>    0.69%  [megaraid_sas]           [k] MR_GetPhyParams
>    0.68%  [kernel]                 [k] blk_mq_dequeue_f

The above data is very helpful for understanding the issue, many thanks!

With patchset V2 plus the 5 patches, IOPS is 1400K when iostats is set
to 0, but 1600K IOPS can be reached without any of these patches even
with iostats set to 1.

BTW, could you share what machine this is? ARM64? I have seen poor ARM64
cache-coherence performance before. On the dual-socket x86 system I tested
(8 CPU cores per socket), only a ~0.5% IOPS drop is observed after the 5
patches are applied on V2 in the null_blk test, as described in the
commit log.

It looks like a single sbitmap can't perform well in the MQ case, where
there are many more concurrent submissions and completions. With a single
hw queue (current linus tree), one hctx->run_work allows only one
__blk_mq_run_hw_queue() to run in 'async' mode, and the reply queues are
used in a round-robin way, which may cause contention on the single
sbitmap too; in particular, io accounting may consume a bit more CPU, and
I guess that may contribute to the CPU lockup.

Could you run your test without the V2 patches but with 'iostats' set to 0?
And could you share what .can_queue is for this HBA?

> 
> 
> > If possible, please provide us the performance data without these
> > patches and with these patches, together with the perf trace.
> >
> > The top 5 patches are for addressing the io accounting issue, which
> > should be the main reason for your performance drop, and even for the
> > lockup in megaraid_sas's ISR, IMO.
> 
I think the performance drop is a different issue, maybe a side effect of
the patch set. Even if we fix this perf issue, the cpu lockup is a
completely different issue.

The performance drop is caused by the global sbitmap data structure,
which is accessed from all CPUs concurrently.

> Regarding the cpu lockup, there was a similar discussion and folks find
> irq poll a good method to resolve lockups. Not sure why the NVMe driver
> did not opt for irq_poll, but there was extensive discussion and I am also

NVMe's hw queues don't use host-wide tags, so it has no such issue.

> seeing cpu lockups mainly because multiple completion/reply queues are
> tied to a single CPU. We have a weighting method in irq poll to quit the
> ISR, and that is how we can avoid the lockup.
> http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html

This patch can make sure that a request is always completed on its
submission CPU, but contention on the global sbitmap is too high and
causes the performance drop.

Now this looks like a really interesting topic for discussion.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
  2018-02-13  0:40                                       ` Ming Lei
@ 2018-02-14  6:28                                         ` Kashyap Desai
  0 siblings, 0 replies; 39+ messages in thread
From: Kashyap Desai @ 2018-02-14  6:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, linux-scsi, Arun Easi, Omar Sandoval,
	Martin K . Petersen, James Bottomley, Christoph Hellwig,
	Don Brace, Peter Rivera, Paolo Bonzini, Laurence Oberman

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Tuesday, February 13, 2018 6:11 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Sunday, February 11, 2018 11:01 AM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar Sandoval; Martin K . Petersen; James Bottomley;
> > > Christoph Hellwig; Don Brace; Peter Rivera; Paolo Bonzini;
> > > Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > > Hi Kashyap,
> > > >
> > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > > -----Original Message-----
> > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > > > To: Kashyap Desai
> > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > Arun Easi; Omar
> > > > > Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace;
> > > > > Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > > > To: Hannes Reinecke
> > > > > > > > Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > Sandoval;
> > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > > Don Brace;
> > > > > > > Peter
> > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > > tags & introduce force_blk_mq
> > > > > > > >
> > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
> > wrote:
> > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > > >> -----Original Message-----
> > > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > > > >> To: Hannes Reinecke
> > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > > > > Sandoval;
> > > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph
> > > > > > > > > >> Hellwig; Don Brace;
> > > > > > > > > > Peter
> > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support
> > > > > > > > > >> global tags & introduce force_blk_mq
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes
> > > > > > > > > >> Reinecke
> > > > > wrote:
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> [ .. ]
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us your patch for enabling
> > > > > > > > > >>>>> global_tags/MQ on
> > > > > > > > > >>>> megaraid_sas
> > > > > > > > > >>>>> so that I can reproduce your test?
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4
> > > > > > > > > >>>>>> times more
> > > > > > > CPU.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization
> > > > > > > > > >>>>> effect is after
> > > > > > > > > >>>> applying the
> > > > > > > > > >>>>> patch V2? And your test script?
> > > > > > > > > >>>> Regarding CPU utilization, I need to test one more
> > time.
> > > > > > > > > >>>> Currently system is in used.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I run below fio test on total 24 SSDs, expander attached.
> > > > > > > > > >>>>
> > > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64
> > > > > > > > > >>>> --bs=4k --ioengine=libaio --rw=randread
> > > > > > > > > >>>>
> > > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > > > > > >>>>
> > > > > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > > > > >>
> > > > > > > > > >> Hi Hannes,
> > > > > > > > > >>
> > > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a
> > > > > > > > > >> big issue, which causes only reply queue 0 to be used.
> > > > > > > > > >>
> > > > > > > > > >> [1]
> > > > > > > > > >> https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > > > > >>
> > > > > > > > > >> So could you guys run your performance test again after
> > > > > > > > > >> fixing the patch?
> > > > > > > > > >
> > > > > > > > > > Ming -
> > > > > > > > > >
> > > > > > > > > > I tried after the change you requested. The performance
> > > > > > > > > > drop is still unresolved: from 1.6 M IOPS to 770K IOPS.
> > > > > > > > > >
> > > > > > > > > > See below data. All 24 reply queues are in use correctly.
> > > > > > > > > >
> > > > > > > > > > IRQs / 1 second(s)
> > > > > > > > > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > > > > > > > > >  360  16422      0   16422  IR-PCI-MSI 70254653-edge  megasas
> > > > > > > > > >  364  15980      0   15980  IR-PCI-MSI 70254657-edge  megasas
> > > > > > > > > >  362  15979      0   15979  IR-PCI-MSI 70254655-edge  megasas
> > > > > > > > > >  345  15696      0   15696  IR-PCI-MSI 70254638-edge  megasas
> > > > > > > > > >  341  15659      0   15659  IR-PCI-MSI 70254634-edge  megasas
> > > > > > > > > >  369  15656      0   15656  IR-PCI-MSI 70254662-edge  megasas
> > > > > > > > > >  359  15650      0   15650  IR-PCI-MSI 70254652-edge  megasas
> > > > > > > > > >  358  15596      0   15596  IR-PCI-MSI 70254651-edge  megasas
> > > > > > > > > >  350  15574      0   15574  IR-PCI-MSI 70254643-edge  megasas
> > > > > > > > > >  342  15532      0   15532  IR-PCI-MSI 70254635-edge  megasas
> > > > > > > > > >  344  15527      0   15527  IR-PCI-MSI 70254637-edge  megasas
> > > > > > > > > >  346  15485      0   15485  IR-PCI-MSI 70254639-edge  megasas
> > > > > > > > > >  361  15482      0   15482  IR-PCI-MSI 70254654-edge  megasas
> > > > > > > > > >  348  15467      0   15467  IR-PCI-MSI 70254641-edge  megasas
> > > > > > > > > >  368  15463      0   15463  IR-PCI-MSI 70254661-edge  megasas
> > > > > > > > > >  354  15420      0   15420  IR-PCI-MSI 70254647-edge  megasas
> > > > > > > > > >  351  15378      0   15378  IR-PCI-MSI 70254644-edge  megasas
> > > > > > > > > >  352  15377      0   15377  IR-PCI-MSI 70254645-edge  megasas
> > > > > > > > > >  356  15348      0   15348  IR-PCI-MSI 70254649-edge  megasas
> > > > > > > > > >  337  15344      0   15344  IR-PCI-MSI 70254630-edge  megasas
> > > > > > > > > >  343  15320      0   15320  IR-PCI-MSI 70254636-edge  megasas
> > > > > > > > > >  355  15266      0   15266  IR-PCI-MSI 70254648-edge  megasas
> > > > > > > > > >  335  15247      0   15247  IR-PCI-MSI 70254628-edge  megasas
> > > > > > > > > >  363  15233      0   15233  IR-PCI-MSI 70254656-edge  megasas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Average:        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
> > > > > > > > > > Average:         18      3.80      0.00     14.78     10.08      0.00      0.00      4.01      0.00      0.00     67.33
> > > > > > > > > > Average:         19      3.26      0.00     15.35     10.62      0.00      0.00      4.03      0.00      0.00     66.74
> > > > > > > > > > Average:         20      3.42      0.00     14.57     10.67      0.00      0.00      3.84      0.00      0.00     67.50
> > > > > > > > > > Average:         21      3.19      0.00     15.60     10.75      0.00      0.00      4.16      0.00      0.00     66.30
> > > > > > > > > > Average:         22      3.58      0.00     15.15     10.66      0.00      0.00      3.51      0.00      0.00     67.11
> > > > > > > > > > Average:         23      3.34      0.00     15.36     10.63      0.00      0.00      4.17      0.00      0.00     66.50
> > > > > > > > > > Average:         24      3.50      0.00     14.58     10.93      0.00      0.00      3.85      0.00      0.00     67.13
> > > > > > > > > > Average:         25      3.20      0.00     14.68     10.86      0.00      0.00      4.31      0.00      0.00     66.95
> > > > > > > > > > Average:         26      3.27      0.00     14.80     10.70      0.00      0.00      3.68      0.00      0.00     67.55
> > > > > > > > > > Average:         27      3.58      0.00     15.36     10.80      0.00      0.00      3.79      0.00      0.00     66.48
> > > > > > > > > > Average:         28      3.46      0.00     15.17     10.46      0.00      0.00      3.32      0.00      0.00     67.59
> > > > > > > > > > Average:         29      3.34      0.00     14.42     10.72      0.00      0.00      3.34      0.00      0.00     68.18
> > > > > > > > > > Average:         30      3.34      0.00     15.08     10.70      0.00      0.00      3.89      0.00      0.00     66.99
> > > > > > > > > > Average:         31      3.26      0.00     15.33     10.47      0.00      0.00      3.33      0.00      0.00     67.61
> > > > > > > > > > Average:         32      3.21      0.00     14.80     10.61      0.00      0.00      3.70      0.00      0.00     67.67
> > > > > > > > > > Average:         33      3.40      0.00     13.88     10.55      0.00      0.00      4.02      0.00      0.00     68.15
> > > > > > > > > > Average:         34      3.74      0.00     17.41     10.61      0.00      0.00      4.51      0.00      0.00     63.73
> > > > > > > > > > Average:         35      3.35      0.00     14.37     10.74      0.00      0.00      3.84      0.00      0.00     67.71
> > > > > > > > > > Average:         36      0.54      0.00      1.77      0.00      0.00      0.00      0.00      0.00      0.00     97.69
> > > > > > > > > > ..
> > > > > > > > > > Average:         54      3.60      0.00     15.17     10.39      0.00      0.00      4.22      0.00      0.00     66.62
> > > > > > > > > > Average:         55      3.33      0.00     14.85     10.55      0.00      0.00      3.96      0.00      0.00     67.31
> > > > > > > > > > Average:         56      3.40      0.00     15.19     10.54      0.00      0.00      3.74      0.00      0.00     67.13
> > > > > > > > > > Average:         57      3.41      0.00     13.98     10.78      0.00      0.00      4.10      0.00      0.00     67.73
> > > > > > > > > > Average:         58      3.32      0.00     15.16     10.52      0.00      0.00      4.01      0.00      0.00     66.99
> > > > > > > > > > Average:         59      3.17      0.00     15.80     10.35      0.00      0.00      3.86      0.00      0.00     66.80
> > > > > > > > > > Average:         60      3.00      0.00     14.63     10.59      0.00      0.00      3.97      0.00      0.00     67.80
> > > > > > > > > > Average:         61      3.34      0.00     14.70     10.66      0.00      0.00      4.32      0.00      0.00     66.97
> > > > > > > > > > Average:         62      3.34      0.00     15.29     10.56      0.00      0.00      3.89      0.00      0.00     66.92
> > > > > > > > > > Average:         63      3.29      0.00     14.51     10.72      0.00      0.00      3.85      0.00      0.00     67.62
> > > > > > > > > > Average:         64      3.48      0.00     15.31     10.65      0.00      0.00      3.97      0.00      0.00     66.60
> > > > > > > > > > Average:         65      3.34      0.00     14.36     10.80      0.00      0.00      4.11      0.00      0.00     67.39
> > > > > > > > > > Average:         66      3.13      0.00     14.94     10.70      0.00      0.00      4.10      0.00      0.00     67.13
> > > > > > > > > > Average:         67      3.06      0.00     15.56     10.69      0.00      0.00      3.82      0.00      0.00     66.88
> > > > > > > > > > Average:         68      3.33      0.00     14.98     10.61      0.00      0.00      3.81      0.00      0.00     67.27
> > > > > > > > > > Average:         69      3.20      0.00     15.43     10.70      0.00      0.00      3.82      0.00      0.00     66.85
> > > > > > > > > > Average:         70      3.34      0.00     17.14     10.59      0.00      0.00      3.00      0.00      0.00     65.92
> > > > > > > > > > Average:         71      3.41      0.00     14.94     10.56      0.00      0.00      3.41      0.00      0.00     67.69
> > > > > > > > > >
> > > > > > > > > > Perf top -
> > > > > > > > > >
> > > > > > > > > >   64.33%  [kernel]            [k] bt_iter
> > > > > > > > > >    4.86%  [kernel]            [k] blk_mq_queue_tag_busy_iter
> > > > > > > > > >    4.23%  [kernel]            [k] _find_next_bit
> > > > > > > > > >    2.40%  [kernel]            [k] native_queued_spin_lock_slowpath
> > > > > > > > > >    1.09%  [kernel]            [k] sbitmap_any_bit_set
> > > > > > > > > >    0.71%  [kernel]            [k] sbitmap_queue_clear
> > > > > > > > > >    0.63%  [kernel]            [k] find_next_bit
> > > > > > > > > >    0.54%  [kernel]            [k] _raw_spin_lock_irqsave
> > > > > > > > > >
> > > > > > > > > Ah. So we're spending quite some time in trying to find a
> > > > > > > > > free tag.
> > > > > > > > > I guess this is due to every queue starting at the same
> > > > > > > > > position trying to find a free tag, which inevitably leads
> > > > > > > > > to a contention.
> > > > > > > >
> > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > > > > > > bottleneck, and it looks unrelated to tag allocation.
> > > > > > > >
> > > > > > > > Kashyap, could you run your performance test again after
> > > > > > > > disabling iostats with the following command on all test
> > > > > > > > devices, and after killing all utilities which may read
> > > > > > > > iostats (/proc/diskstats, ...)?
> > > > > > > >
> > > > > > > > 	echo 0 > /sys/block/sdN/queue/iostats
> > > > > > >
> > > > > > > Ming - After changing iostats to 0, I see the performance
> > > > > > > issue is resolved.
> > > > > > >
> > > > > > > Below is perf top output after iostats = 0
> > > > > > >
> > > > > > >
> > > > > > >   23.45%  [kernel]             [k] bt_iter
> > > > > > >    2.27%  [kernel]             [k] blk_mq_queue_tag_busy_iter
> > > > > > >    2.18%  [kernel]             [k] _find_next_bit
> > > > > > >    2.06%  [megaraid_sas]       [k] complete_cmd_fusion
> > > > > > >    1.87%  [kernel]             [k] clflush_cache_range
> > > > > > >    1.70%  [kernel]             [k] dma_pte_clear_level
> > > > > > >    1.56%  [kernel]             [k] __domain_mapping
> > > > > > >    1.55%  [kernel]             [k] sbitmap_queue_clear
> > > > > > >    1.30%  [kernel]             [k] gup_pgd_range
> > > > > >
> > > > > > Hi Kashyap,
> > > > > >
> > > > > > Thanks for your test and update.
> > > > > >
> > > > > > Looks like blk_mq_queue_tag_busy_iter() is still sampled by
> > > > > > perf even though iostats is disabled, and I guess there may be
> > > > > > utilities which are reading iostats a bit frequently.
> > > > >
> > > > > I will be doing some more testing and will post you my findings.
> > > >
> > > > I will find some time this weekend to see if I can cook a patch to
> > > > address this issue of io accounting.
> > >
> > > Hi Kashyap,
> > >
> > > Please test the top 5 patches in the following tree to see if
> > > megaraid_sas's performance is OK:
> > >
> > > 	https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2
> > >
> > > This tree is made by adding these 5 patches against patchset V2.
> > >
> >
> > Ming -
> > I applied 5 patches on top of V2 and behavior is still unchanged.
> > Below is perf top data. (1000K IOPS)
> >
> >   34.58%  [kernel]                 [k] bt_iter
> >    2.96%  [kernel]                 [k] sbitmap_any_bit_set
> >    2.77%  [kernel]                 [k] bt_iter_global_tags
> >    1.75%  [megaraid_sas]           [k] complete_cmd_fusion
> >    1.62%  [kernel]                 [k] sbitmap_queue_clear
> >    1.62%  [kernel]                 [k] _raw_spin_lock
> >    1.51%  [kernel]                 [k] blk_mq_run_hw_queue
> >    1.45%  [kernel]                 [k] gup_pgd_range
> >    1.31%  [kernel]                 [k] irq_entries_start
> >    1.29%  fio                      [.] __fio_gettime
> >    1.13%  [kernel]                 [k] _raw_spin_lock_irqsave
> >    0.95%  [kernel]                 [k] native_queued_spin_lock_slowpath
> >    0.92%  [kernel]                 [k] scsi_queue_rq
> >    0.91%  [kernel]                 [k] blk_mq_run_hw_queues
> >    0.85%  [kernel]                 [k] blk_mq_get_request
> >    0.81%  [kernel]                 [k] switch_mm_irqs_off
> >    0.78%  [megaraid_sas]           [k] megasas_build_io_fusion
> >    0.77%  [kernel]                 [k] __schedule
> >    0.73%  [kernel]                 [k] update_load_avg
> >    0.69%  [kernel]                 [k] fput
> >    0.65%  [kernel]                 [k] scsi_dispatch_cmd
> >    0.64%  fio                      [.] fio_libaio_event
> >    0.53%  [kernel]                 [k] do_io_submit
> >    0.52%  [kernel]                 [k] read_tsc
> >    0.51%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
> >    0.51%  [kernel]                 [k] scsi_softirq_done
> >    0.50%  [kernel]                 [k] kobject_put
> >    0.50%  [kernel]                 [k] cpuidle_enter_state
> >    0.49%  [kernel]                 [k] native_write_msr
> >    0.48%  fio                      [.] io_completed
> >
> > Below is perf top data with iostat=0  (1400K IOPS)
> >
> >    4.87%  [kernel]                      [k] sbitmap_any_bit_set
> >    2.93%  [kernel]                      [k] _raw_spin_lock
> >    2.84%  [megaraid_sas]                [k] complete_cmd_fusion
> >    2.38%  [kernel]                      [k] irq_entries_start
> >    2.36%  [kernel]                      [k] gup_pgd_range
> >    2.35%  [kernel]                      [k] blk_mq_run_hw_queue
> >    2.30%  [kernel]                      [k] sbitmap_queue_clear
> >    2.01%  fio                           [.] __fio_gettime
> >    1.78%  [kernel]                      [k] _raw_spin_lock_irqsave
> >    1.51%  [kernel]                      [k] scsi_queue_rq
> >    1.43%  [kernel]                      [k] blk_mq_run_hw_queues
> >    1.36%  [kernel]                      [k] fput
> >    1.32%  [kernel]                      [k] __schedule
> >    1.31%  [kernel]                      [k] switch_mm_irqs_off
> >    1.29%  [kernel]                      [k] update_load_avg
> >    1.25%  [megaraid_sas]                [k] megasas_build_io_fusion
> >    1.22%  [kernel]                      [k] native_queued_spin_lock_slowpath
> >    1.03%  [kernel]                      [k] scsi_dispatch_cmd
> >    1.03%  [kernel]                      [k] blk_mq_get_request
> >    0.91%  fio                           [.] fio_libaio_event
> >    0.89%  [kernel]                      [k] scsi_softirq_done
> >    0.87%  [kernel]                      [k] kobject_put
> >    0.86%  [kernel]                      [k] cpuidle_enter_state
> >    0.84%  fio                           [.] io_completed
> >    0.83%  [kernel]                      [k] do_io_submit
> >    0.83%  [megaraid_sas]                [k] megasas_build_and_issue_cmd_fusion
> >    0.83%  [kernel]                      [k] __switch_to
> >    0.82%  [kernel]                      [k] read_tsc
> >    0.80%  [kernel]                      [k] native_write_msr
> >    0.76%  [kernel]                      [k] aio_complete
> >
> >
> > Perf data without V2 patch applied.  (1600K IOPS)
> >
> >    5.97%  [megaraid_sas]           [k] complete_cmd_fusion
> >    5.24%  [kernel]                 [k] bt_iter
> >    3.28%  [kernel]                 [k] _raw_spin_lock
> >    2.98%  [kernel]                 [k] irq_entries_start
> >    2.29%  fio                      [.] __fio_gettime
> >    2.04%  [kernel]                 [k] scsi_queue_rq
> >    1.92%  [megaraid_sas]           [k] megasas_build_io_fusion
> >    1.61%  [kernel]                 [k] switch_mm_irqs_off
> >    1.59%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
> >    1.41%  [kernel]                 [k] scsi_dispatch_cmd
> >    1.33%  [kernel]                 [k] scsi_softirq_done
> >    1.18%  [kernel]                 [k] gup_pgd_range
> >    1.18%  [kernel]                 [k] blk_mq_complete_request
> >    1.13%  [kernel]                 [k] blk_mq_free_request
> >    1.05%  [kernel]                 [k] do_io_submit
> >    1.04%  [kernel]                 [k] _find_next_bit
> >    1.02%  [kernel]                 [k] blk_mq_get_request
> >    0.95%  [megaraid_sas]           [k] megasas_build_ldio_fusion
> >    0.95%  [kernel]                 [k] scsi_dec_host_busy
> >    0.89%  fio                      [.] get_io_u
> >    0.88%  [kernel]                 [k] entry_SYSCALL_64
> >    0.84%  [megaraid_sas]           [k] megasas_queue_command
> >    0.79%  [kernel]                 [k] native_write_msr
> >    0.77%  [kernel]                 [k] read_tsc
> >    0.73%  [kernel]                 [k] _raw_spin_lock_irqsave
> >    0.73%  fio                      [.] fio_libaio_commit
> >    0.72%  [kernel]                 [k] kmem_cache_alloc
> >    0.72%  [kernel]                 [k] blkdev_direct_IO
> >    0.69%  [megaraid_sas]           [k] MR_GetPhyParams
> >    0.68%  [kernel]                 [k] blk_mq_dequeue_from_ctx
>
> The above data is very helpful to understand the issue, great thanks!
>
> With this patchset V2 and the 5 patches, if iostats is set to 0, IOPS is
> 1400K, but 1600K IOPS can be reached without all these patches even with
> iostats set to 1.
>
> BTW, could you share what the machine is? ARM64? I have seen ARM64's
> cache coherence performance be bad before. On the dual-socket system
> (each socket has 8 x86 CPU cores) I tested, only a ~0.5% IOPS drop can
> be observed after the 5 patches are applied on top of V2 in the null_blk
> test, which is described in the commit log.

I am using Intel Skylake/Lewisburg/Purley.

>
> Looks like it means a single sbitmap can't perform well in the MQ case,
> where there are many more concurrent submissions and completions. In the
> single hw queue case (current Linus tree), one hctx->run_work only
> allows one __blk_mq_run_hw_queue() to run in 'async' mode, and reply
> queues are used in a round-robin way, which may cause contention on the
> single sbitmap too; IO accounting in particular may consume a bit more
> CPU, and I guess that may contribute to the CPU lockup.
>
> Could you run your test without V2 patches by setting 'iostats' as 0?

Tested without the V2 patch set. iostats=1. IOPS = 1600K

   5.93%  [megaraid_sas]              [k] complete_cmd_fusion
   5.34%  [kernel]                    [k] bt_iter
   3.23%  [kernel]                    [k] _raw_spin_lock
   2.92%  [kernel]                    [k] irq_entries_start
   2.57%  fio                         [.] __fio_gettime
   2.10%  [kernel]                    [k] scsi_queue_rq
   1.98%  [megaraid_sas]              [k] megasas_build_io_fusion
   1.93%  [kernel]                    [k] switch_mm_irqs_off
   1.79%  [megaraid_sas]              [k] megasas_build_and_issue_cmd_fusion
   1.45%  [kernel]                    [k] scsi_softirq_done
   1.42%  [kernel]                    [k] scsi_dispatch_cmd
   1.23%  [kernel]                    [k] blk_mq_complete_request
   1.11%  [megaraid_sas]              [k] megasas_build_ldio_fusion
   1.11%  [kernel]                    [k] gup_pgd_range
   1.08%  [kernel]                    [k] blk_mq_free_request
   1.03%  [kernel]                    [k] do_io_submit
   1.02%  [kernel]                    [k] _find_next_bit
   1.00%  [kernel]                    [k] scsi_dec_host_busy
   0.94%  [kernel]                    [k] blk_mq_get_request
   0.93%  [megaraid_sas]              [k] megasas_queue_command
   0.92%  [kernel]                    [k] native_write_msr
   0.85%  fio                         [.] get_io_u
   0.83%  [kernel]                    [k] entry_SYSCALL_64
   0.83%  [kernel]                    [k] _raw_spin_lock_irqsave
   0.82%  [kernel]                    [k] read_tsc
   0.81%  [sd_mod]                    [k] sd_init_command
   0.67%  [kernel]                    [k] kmem_cache_alloc
   0.63%  [kernel]                    [k] memset_erms
   0.63%  [kernel]                    [k] aio_read_events
   0.62%  [kernel]                    [k] blkdev_direct_IO


Tested without the V2 patch set. iostats=0. IOPS = 1600K

   5.79%  [megaraid_sas]           [k] complete_cmd_fusion
   3.28%  [kernel]                 [k] _raw_spin_lock
   3.28%  [kernel]                 [k] irq_entries_start
   2.10%  [kernel]                 [k] scsi_queue_rq
   1.96%  fio                      [.] __fio_gettime
   1.85%  [megaraid_sas]           [k] megasas_build_io_fusion
   1.68%  [megaraid_sas]           [k] megasas_build_and_issue_cmd_fusion
   1.36%  [kernel]                 [k] gup_pgd_range
   1.36%  [kernel]                 [k] scsi_dispatch_cmd
   1.28%  [kernel]                 [k] do_io_submit
   1.25%  [kernel]                 [k] switch_mm_irqs_off
   1.20%  [kernel]                 [k] blk_mq_free_request
   1.18%  [megaraid_sas]           [k] megasas_build_ldio_fusion
   1.11%  [kernel]                 [k] dput
   1.07%  [kernel]                 [k] scsi_softirq_done
   1.07%  fio                      [.] get_io_u
   1.07%  [kernel]                 [k] scsi_dec_host_busy
   1.02%  [kernel]                 [k] blk_mq_get_request
   0.96%  [sd_mod]                 [k] sd_init_command
   0.92%  [kernel]                 [k] entry_SYSCALL_64
   0.89%  [kernel]                 [k] blk_mq_make_request
   0.87%  [kernel]                 [k] blkdev_direct_IO
   0.84%  [kernel]                 [k] blk_mq_complete_request
   0.78%  [kernel]                 [k] _raw_spin_lock_irqsave
   0.77%  [kernel]                 [k] lookup_ioctx
   0.76%  [megaraid_sas]           [k] MR_GetPhyParams
   0.75%  [kernel]                 [k] blk_mq_dequeue_from_ctx
   0.75%  [kernel]                 [k] memset_erms
   0.74%  [kernel]                 [k] kmem_cache_alloc
   0.72%  [megaraid_sas]           [k] megasas_queue_command
> And could you share what .can_queue is on this HBA?

can_queue = 8072. In my test I used --iodepth=128 for 12 SCSI devices (R0
volume), so fio will only push 12 * 128 = 1536 outstanding commands.


>
> >
> >
> > > If possible, please provide us the performance data without these
> > > patches and with these patches, together with a perf trace.
> > >
> > > The top 5 patches address the IO accounting issue, which should be
> > > the main reason for your performance drop, and even for the lockup
> > > in megaraid_sas's ISR, IMO.
> >
> > I think the performance drop is a different issue, maybe a side effect
> > of the patch set. Even if we fix this perf issue, the CPU lockup is a
> > completely separate problem.
>
> The performance drop is caused by the global sbitmap data structure,
> which is accessed from all CPUs concurrently.
>
> > Regarding the CPU lockup, there was a similar discussion, and folks
> > found irq_poll to be a good method to resolve lockups. Not sure why
> > the NVMe driver did not opt for irq_poll, but there was extensive
> > discussion, and I am also
>
> NVMe's hw queues won't use host wide tags, so no such issue.
>
> > seeing the CPU lockup mainly because multiple completion/reply queues
> > are tied to a single CPU. We have a weighting method in irq_poll to
> > quit the ISR, and that is how we can avoid the lockup.
> > http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html
>
> This patch can make sure that a request is always completed on the
> submission CPU, but contention on the global sbitmap is too high and
> causes the performance drop.
>
> Now looks this is really an interesting topic for discussion.
>
>
> Thanks,
> Ming



Thread overview: 39+ messages
2018-02-03  4:21 [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Ming Lei
2018-02-03  4:21 ` [PATCH 1/5] blk-mq: tags: define several fields of tags as pointer Ming Lei
2018-02-05  6:57   ` Hannes Reinecke
2018-02-08 17:34   ` Bart Van Assche
2018-02-08 17:34     ` Bart Van Assche
2018-02-03  4:21 ` [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS Ming Lei
2018-02-05  6:54   ` Hannes Reinecke
2018-02-05 10:35     ` Ming Lei
2018-02-03  4:21 ` [PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags' Ming Lei
2018-02-05  6:54   ` Hannes Reinecke
2018-02-03  4:21 ` [PATCH 4/5] scsi: introduce force_blk_mq Ming Lei
2018-02-05  6:57   ` Hannes Reinecke
2018-02-03  4:21 ` [PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity Ming Lei
2018-02-05  6:57   ` Hannes Reinecke
2018-02-05  6:58 ` [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq Hannes Reinecke
2018-02-05  7:05   ` Kashyap Desai
2018-02-05  7:05     ` Kashyap Desai
2018-02-05 10:17     ` Ming Lei
2018-02-06  6:03       ` Kashyap Desai
2018-02-06  8:04         ` Ming Lei
2018-02-06 11:29           ` Kashyap Desai
2018-02-06 12:31             ` Ming Lei
2018-02-06 14:27               ` Kashyap Desai
2018-02-06 15:46                 ` Ming Lei
2018-02-07  6:50                 ` Hannes Reinecke
2018-02-07 12:23                   ` Ming Lei
2018-02-07 14:14                     ` Kashyap Desai
2018-02-08  1:23                       ` Ming Lei
2018-02-08  7:00                       ` Hannes Reinecke
2018-02-08 16:53                         ` Ming Lei
2018-02-09  4:58                           ` Kashyap Desai
2018-02-09  5:31                             ` Ming Lei
2018-02-09  8:42                               ` Kashyap Desai
2018-02-10  1:01                                 ` Ming Lei
2018-02-11  5:31                                   ` Ming Lei
2018-02-12 18:35                                     ` Kashyap Desai
2018-02-13  0:40                                       ` Ming Lei
2018-02-14  6:28                                         ` Kashyap Desai
2018-02-05 10:23   ` Ming Lei
