* [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
@ 2018-06-28  3:19 Ming Lei
  2018-06-28  3:19 ` [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Ming Lei @ 2018-06-28  3:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke

Hi,

The first two patches improve how ctx->lock is used; Kashyap observed
that they may improve IOPS by ~5% in a random IO test on MegaRaid SAS.

The 3rd patch fixes a random IO performance regression in the MegaRaid
SAS test, also reported by Kashyap.

Ming Lei (3):
  blk-mq: use list_splice_tail() to insert requests
  blk-mq: only attempt to merge bio if there is rq in sw queue
  blk-mq: dequeue request one by one from sw queue iff hctx is busy

 block/blk-mq-sched.c   | 14 ++++----------
 block/blk-mq.c         | 38 ++++++++++++++++++++++++++++++--------
 include/linux/blk-mq.h |  1 +
 3 files changed, 35 insertions(+), 18 deletions(-)

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>

-- 
2.9.5

* [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests
  2018-06-28  3:19 [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Ming Lei
@ 2018-06-28  3:19 ` Ming Lei
  2018-06-28  7:27   ` Johannes Thumshirn
  2018-06-28  3:19 ` [PATCH 2/3] blk-mq: only attempt to merge bio if there is rq in sw queue Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-06-28  3:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche

list_splice_tail() is much faster than inserting each request one by
one, given that all requests in 'list' belong to the same sw queue and
ctx->lock is required for inserting them.
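
For illustration only, a compile-only userspace approximation of the
splice helper (made-up name, not the kernel's include/linux/list.h):
the whole chain moves with four pointer writes, so the time spent
holding ctx->lock no longer grows with the number of requests on 'list'.

	struct list_head { struct list_head *next, *prev; };

	/* splice the whole chain of 'list' onto the tail of 'head' */
	static void splice_tail(struct list_head *list, struct list_head *head)
	{
		struct list_head *first = list->next, *last = list->prev;

		if (first == list)	/* source list is empty */
			return;

		/* four pointer writes, independent of the list length */
		last->next = head;
		first->prev = head->prev;
		head->prev->next = first;
		head->prev = last;
	}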

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 70c65bb6c013..20b0519cb3b8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 			    struct list_head *list)
 
 {
+	struct request *rq;
+
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
 	 */
-	spin_lock(&ctx->lock);
-	while (!list_empty(list)) {
-		struct request *rq;
-
-		rq = list_first_entry(list, struct request, queuelist);
+	list_for_each_entry(rq, list, queuelist) {
 		BUG_ON(rq->mq_ctx != ctx);
-		list_del_init(&rq->queuelist);
-		__blk_mq_insert_req_list(hctx, rq, false);
+		trace_block_rq_insert(hctx->queue, rq);
 	}
+
+	spin_lock(&ctx->lock);
+	list_splice_tail(list, &ctx->rq_list);
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 }
-- 
2.9.5

* [PATCH 2/3] blk-mq: only attempt to merge bio if there is rq in sw queue
  2018-06-28  3:19 [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Ming Lei
  2018-06-28  3:19 ` [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests Ming Lei
@ 2018-06-28  3:19 ` Ming Lei
  2018-06-28  3:19 ` [PATCH 3/3] blk-mq: dequeue request one by one from sw queue iff hctx is busy Ming Lei
  2018-06-28  5:42 ` [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Kashyap Desai
  3 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2018-06-28  3:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche

Only attempt to merge a bio if ctx->rq_list isn't empty, because:

1) on high-performance SSDs, dispatch succeeds most of the time, so
ctx->rq_list is often empty; skipping the merge attempt over the sw
queue in that case saves one acquisition of ctx->lock (see the sketch
below)

2) we can't expect good merge performance on the per-cpu sw queue
anyway, and missing an occasional merge there is not a big deal since
tasks can be scheduled from one CPU to another.
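
The saving in 1) is just the usual check-before-lock pattern; roughly,
a hypothetical userspace sketch of that pattern (compile-only, made-up
names, not the blk-mq code; the real check uses list_empty_careful(),
which is meant for this kind of unlocked peek):

	#include <pthread.h>
	#include <stdbool.h>

	struct sw_queue {
		pthread_mutex_t lock;
		int nr_rqs;		/* stand-in for ctx->rq_list */
	};

	static bool try_merge(struct sw_queue *q)
	{
		bool merged = false;

		/* cheap lockless peek; a stale answer only costs a missed merge */
		if (q->nr_rqs == 0)
			return false;

		pthread_mutex_lock(&q->lock);
		if (q->nr_rqs > 0) {
			/* ... attempt the merge against queued requests ... */
			merged = true;	/* pretend the merge succeeded */
		}
		pthread_mutex_unlock(&q->lock);
		return merged;
	}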

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c6cd90..f5745acc2d98 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -339,7 +339,8 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 		return e->type->ops.mq.bio_merge(hctx, bio);
 	}
 
-	if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
+	if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
+			!list_empty_careful(&ctx->rq_list)) {
 		/* default per sw-queue merge */
 		spin_lock(&ctx->lock);
 		ret = blk_mq_attempt_merge(q, ctx, bio);
-- 
2.9.5

* [PATCH 3/3] blk-mq: dequeue request one by one from sw queue iff hctx is busy
  2018-06-28  3:19 [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Ming Lei
  2018-06-28  3:19 ` [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests Ming Lei
  2018-06-28  3:19 ` [PATCH 2/3] blk-mq: only attempt to merge bio if there is rq in sw queue Ming Lei
@ 2018-06-28  3:19 ` Ming Lei
  2018-06-28  5:42 ` [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Kashyap Desai
  3 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2018-06-28  3:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke

Dequeuing requests one by one from the sw queue isn't efficient, but we
have to do it when the queue is busy to get better merge performance.

This patch uses an EWMA to figure out whether the queue is busy, and
only dequeues requests one by one from the sw queue when it is.

Kashyap verified that this patch basically brings back random IO
performance on megaraid_sas with the none io scheduler. Meanwhile I
tried this patch on an HDD and didn't see an obvious performance loss
in a sequential IO test either.
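
For reference, the fixed-point update below is
ewma = (7 * ewma + (busy << 4)) / 8, so a short run of busy dispatch
results quickly makes hctx->busy non-zero, and runs of idle results
decay it back to zero. A standalone userspace demo of the same
arithmetic (not kernel code):

	#include <stdio.h>

	/* same fixed-point EWMA as blk_mq_update_hctx_busy() below */
	static unsigned int ewma_update(unsigned int ewma, unsigned int busy)
	{
		const unsigned int weight = 8;
		const unsigned int factor = 4;

		ewma *= weight - 1;
		ewma += busy << factor;
		ewma /= weight;
		return ewma;
	}

	int main(void)
	{
		unsigned int ewma = 0;
		int i;

		for (i = 0; i < 8; i++) {	/* busy dispatches ramp it up */
			ewma = ewma_update(ewma, 1);
			printf("busy %d -> ewma %u\n", i + 1, ewma);
		}
		for (i = 0; i < 8; i++) {	/* idle dispatches decay it */
			ewma = ewma_update(ewma, 0);
			printf("idle %d -> ewma %u\n", i + 1, ewma);
		}
		return 0;
	}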

Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 11 ++---------
 block/blk-mq.c         | 24 +++++++++++++++++++++++-
 include/linux/blk-mq.h |  1 +
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f5745acc2d98..8fbf3db32666 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -219,15 +219,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 		}
 	} else if (has_sched_dispatch) {
 		blk_mq_do_dispatch_sched(hctx);
-	} else if (q->mq_ops->get_budget) {
-		/*
-		 * If we need to get budget before queuing request, we
-		 * dequeue request one by one from sw queue for avoiding
-		 * to mess up I/O merge when dispatch runs out of resource.
-		 *
-		 * TODO: get more budgets, and dequeue more requests in
-		 * one time.
-		 */
+	} else if (READ_ONCE(hctx->busy)) {
+		/* dequeue request one by one from sw queue if queue is busy */
 		blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 20b0519cb3b8..2f20c9e3efda 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1074,6 +1074,25 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 	return true;
 }
 
+/* update queue busy with EWMA (7/8 * ewma(t)  + 1/8 * busy(t + 1)) */
+static void blk_mq_update_hctx_busy(struct blk_mq_hw_ctx *hctx, unsigned int busy)
+{
+	const unsigned weight = 8;
+	const unsigned factor = 4;
+	unsigned int ewma;
+
+	if (hctx->queue->elevator)
+		return;
+
+	ewma = READ_ONCE(hctx->busy);
+
+	ewma *= weight - 1;
+	ewma += busy << factor;
+	ewma /= weight;
+
+	WRITE_ONCE(hctx->busy, ewma);
+}
+
 #define BLK_MQ_RESOURCE_DELAY	3		/* ms units */
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
@@ -1206,7 +1225,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			blk_mq_run_hw_queue(hctx, true);
 		else if (needs_restart && (ret == BLK_STS_RESOURCE))
 			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
-	}
+
+		blk_mq_update_hctx_busy(hctx, 1);
+	} else
+		blk_mq_update_hctx_busy(hctx, 0);
 
 	return (queued + errors) != 0;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..a5113e22d720 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -34,6 +34,7 @@ struct blk_mq_hw_ctx {
 
 	struct sbitmap		ctx_map;
 
+	unsigned int		busy;
 	struct blk_mq_ctx	*dispatch_from;
 
 	struct blk_mq_ctx	**ctxs;
-- 
2.9.5

* RE: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
  2018-06-28  3:19 [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Ming Lei
                   ` (2 preceding siblings ...)
  2018-06-28  3:19 ` [PATCH 3/3] blk-mq: dequeue request one by one from sw queue iff hctx is busy Ming Lei
@ 2018-06-28  5:42 ` Kashyap Desai
  2018-06-28  6:03   ` jianchao.wang
  3 siblings, 1 reply; 13+ messages in thread
From: Kashyap Desai @ 2018-06-28  5:42 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Laurence Oberman, Omar Sandoval, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke

Ming -

The performance drop is resolved on my setup, but this patch set may be
causing a kernel stability issue. I have not tried without the patch
set, but if you can figure out whether the crash below is due to this
patch set, I can try a potential fix as well. I am seeing a kernel
panic when I use the following three settings in my fio script.

iodepth_batch=8
iodepth_batch_complete_min=4
iodepth_batch_complete_max=16


Here is the panic detail -

[65237.831029] ------------[ cut here ]------------
[65237.831031] list_add corruption. prev->next should be next
(ffffbb1d0a41fdf8), but was ffff94d0f0c57240. (prev=ffff94d0f0c57240).
[65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28
__list_add_valid+0x6a/0x70
[65237.831046] Modules linked in: mpt3sas raid_class ses enclosure
scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun
ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd
cryptd
[65237.831079]  snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt
snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801
joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd
acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe
ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror
dm_region_hash dm_log dm_mod
[65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G
OE     4.18.0-rc1+ #1
[65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0
06/22/2017
[65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70
[65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0
c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff
<0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57
[65237.831130] RSP: 0018:ffffbb1d0a41fdd8 EFLAGS: 00010286
[65237.831131] RAX: 0000000000000000 RBX: ffff94d0ebf90000 RCX:
0000000000000006
[65237.831132] RDX: 0000000000000000 RSI: 0000000000000086 RDI:
ffff94d91ec16970
[65237.831132] RBP: ffffdb1cff59e980 R08: 0000000000000000 R09:
0000000000000663
[65237.831133] R10: 0000000000000004 R11: 0000000000000000 R12:
0000000000000001
[65237.831134] R13: 0000000000000000 R14: ffff94d0f0c57240 R15:
ffff94d0f0f04c40
[65237.831135] FS:  00007fad10d02740(0000) GS:ffff94d91ec00000(0000)
knlGS:0000000000000000
[65237.831135] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[65237.831136] CR2: 00007fad10c2a004 CR3: 000000085600e006 CR4:
00000000007606e0
[65237.831137] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[65237.831138] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[65237.831138] PKRU: 55555554
[65237.831138] Call Trace:
[65237.831145]  blk_mq_flush_plug_list+0x12d/0x260
[65237.831150]  blk_flush_plug_list+0xe7/0x240
[65237.831152]  blk_finish_plug+0x27/0x40
[65237.831156]  __x64_sys_io_submit+0xc9/0x180
[65237.831162]  do_syscall_64+0x5b/0x180
[65237.831166]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


Kashyap

* Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
  2018-06-28  5:42 ` [PATCH 0/3] blk-mq: improve IO perf in case of none io sched Kashyap Desai
@ 2018-06-28  6:03   ` jianchao.wang
  2018-06-28  6:35     ` Kashyap Desai
  2018-06-28  7:08     ` Ming Lei
  0 siblings, 2 replies; 13+ messages in thread
From: jianchao.wang @ 2018-06-28  6:03 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei, Jens Axboe
  Cc: linux-block, Laurence Oberman, Omar Sandoval, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke



On 06/28/2018 01:42 PM, Kashyap Desai wrote:
> Ming -
> 
> The performance drop is resolved on my setup, but this patch set may be
> causing a kernel stability issue. I have not tried without the patch
> set, but if you can figure out whether the crash below is due to this
> patch set, I can try a potential fix as well. I am seeing a kernel
> panic when I use the following three settings in my fio script.
> 
> iodepth_batch=8
> iodepth_batch_complete_min=4
> iodepth_batch_complete_max=16
> 
> 
> Here is the panic detail -
> 
> [65237.831029] ------------[ cut here ]------------
> [65237.831031] list_add corruption. prev->next should be next
> (ffffbb1d0a41fdf8), but was ffff94d0f0c57240. (prev=ffff94d0f0c57240).
> [65237.831046] WARNING: CPU: 18 PID: 13897 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [65237.831046] Modules linked in: mpt3sas raid_class ses enclosure
> scsi_transport_sas megaraid_sas(OE) xt_CHECKSUM ipt_MASQUERADE tun
> ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6
> xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
> ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw ebtable_filter ebtables ip6table_filter
> ip6_tables iptable_filter intel_rapl skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel crypto_simd
> cryptd
> [65237.831079]  snd_hda_codec snd_hda_core glue_helper snd_hwdep iTCO_wdt
> snd_seq iTCO_vendor_support snd_seq_device pcspkr snd_pcm sg i2c_i801
> joydev wmi mei_me ipmi_si mei snd_timer acpi_power_meter ioatdma snd
> acpi_pad ipmi_devintf ipmi_msghandler soundcore lpc_ich nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ast i2c_algo_bit
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe
> ahci libahci crc32c_intel libata mdio i2c_core dca dm_mirror
> dm_region_hash dm_log dm_mod
> [65237.831109] CPU: 18 PID: 13897 Comm: fio Kdump: loaded Tainted: G
> OE     4.18.0-rc1+ #1
> [65237.831110] Hardware name: Supermicro Super Server/X11DPG-QT, BIOS 1.0
> 06/22/2017
> [65237.831112] RIP: 0010:__list_add_valid+0x6a/0x70
> [65237.831112] Code: 31 c0 48 c7 c7 90 08 ab a0 e8 22 b4 cd ff 0f 0b 31 c0
> c3 48 89 d1 48 c7 c7 40 08 ab a0 48 89 f2 48 89 c6 31 c0 e8 06 b4 cd ff
> <0f> 0b 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 57
> [65237.831130] RSP: 0018:ffffbb1d0a41fdd8 EFLAGS: 00010286
> [65237.831131] RAX: 0000000000000000 RBX: ffff94d0ebf90000 RCX:
> 0000000000000006
> [65237.831132] RDX: 0000000000000000 RSI: 0000000000000086 RDI:
> ffff94d91ec16970
> [65237.831132] RBP: ffffdb1cff59e980 R08: 0000000000000000 R09:
> 0000000000000663
> [65237.831133] R10: 0000000000000004 R11: 0000000000000000 R12:
> 0000000000000001
> [65237.831134] R13: 0000000000000000 R14: ffff94d0f0c57240 R15:
> ffff94d0f0f04c40
> [65237.831135] FS:  00007fad10d02740(0000) GS:ffff94d91ec00000(0000)
> knlGS:0000000000000000
> [65237.831135] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [65237.831136] CR2: 00007fad10c2a004 CR3: 000000085600e006 CR4:
> 00000000007606e0
> [65237.831137] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [65237.831138] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [65237.831138] PKRU: 55555554
> [65237.831138] Call Trace:
> [65237.831145]  blk_mq_flush_plug_list+0x12d/0x260
> [65237.831150]  blk_flush_plug_list+0xe7/0x240
> [65237.831152]  blk_finish_plug+0x27/0x40
> [65237.831156]  __x64_sys_io_submit+0xc9/0x180
> [65237.831162]  do_syscall_64+0x5b/0x180
> [65237.831166]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I guess we need to clean the list after list_splice_tail() in patch 1/3, as follows:
@@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 			    struct list_head *list)
 
 {
...
+
+	spin_lock(&ctx->lock);
+	list_splice_tail(list, &ctx->rq_list);
+       INIT_LIST_HEAD(list);          // ---> HERE !
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);

Thanks
Jianchao

* RE: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
  2018-06-28  6:03   ` jianchao.wang
@ 2018-06-28  6:35     ` Kashyap Desai
  2018-06-28  7:08     ` Ming Lei
  1 sibling, 0 replies; 13+ messages in thread
From: Kashyap Desai @ 2018-06-28  6:35 UTC (permalink / raw)
  To: jianchao.wang, Ming Lei, Jens Axboe
  Cc: linux-block, Laurence Oberman, Omar Sandoval, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke

>
> I guess we need to clean the list after list_splice_tail() in patch 1/3, as
> follows:
> @@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct
> blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>  			    struct list_head *list)
>
>  {
> ...
> +
> +	spin_lock(&ctx->lock);
> +	list_splice_tail(list, &ctx->rq_list);
> +       INIT_LIST_HEAD(list);          // ---> HERE !

I will try replacing "list_splice_tail" with "list_splice_tail_init".

Kashyap

>  	blk_mq_hctx_mark_pending(hctx, ctx);
>  	spin_unlock(&ctx->lock);
>
> Thanks
> Jianchao

* Re: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
  2018-06-28  6:03   ` jianchao.wang
  2018-06-28  6:35     ` Kashyap Desai
@ 2018-06-28  7:08     ` Ming Lei
  2018-06-28  7:40       ` Kashyap Desai
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2018-06-28  7:08 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Kashyap Desai, Jens Axboe, linux-block, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke

On Thu, Jun 28, 2018 at 02:03:47PM +0800, jianchao.wang wrote:
> 
> 
> On 06/28/2018 01:42 PM, Kashyap Desai wrote:
> > [ original report, fio settings and panic log snipped; quoted in full above ]
> 
> I guess we need to clean the list after list_splice_tail() in patch 1/3, as follows:
> @@ -1533,19 +1533,19 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>  			    struct list_head *list)
>  
>  {
> ...
> +
> +	spin_lock(&ctx->lock);
> +	list_splice_tail(list, &ctx->rq_list);
> +       INIT_LIST_HEAD(list);          // ---> HERE !
>  	blk_mq_hctx_mark_pending(hctx, ctx);
>  	spin_unlock(&ctx->lock);

Right.

Kashyap, could you test the following patch?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2f20c9e3efda..7d972b1c3153 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1567,7 +1567,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	}
 
 	spin_lock(&ctx->lock);
-	list_splice_tail(list, &ctx->rq_list);
+	list_splice_tail_init(list, &ctx->rq_list);
 	blk_mq_hctx_mark_pending(hctx, ctx);
 	spin_unlock(&ctx->lock);
 }
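
For completeness, the failure mode is easy to reproduce in a tiny
userspace model of the list helpers (an approximation of
include/linux/list.h, not the kernel code): after a plain splice the
source list head still points at requests that now live on
ctx->rq_list, so reusing the plug list corrupts both lists unless the
head is re-initialized, which is exactly what list_splice_tail_init()
adds.

	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

	static void list_add_tail(struct list_head *new, struct list_head *head)
	{
		new->prev = head->prev;
		new->next = head;
		head->prev->next = new;
		head->prev = new;
	}

	static void list_splice_tail(struct list_head *list, struct list_head *head)
	{
		struct list_head *first = list->next, *last = list->prev;

		if (first == list)
			return;
		last->next = head;
		first->prev = head->prev;
		head->prev->next = first;
		head->prev = last;
		/* note: 'list' itself is left pointing at first/last */
	}

	int main(void)
	{
		struct list_head rq_list, plug, a, b;

		INIT_LIST_HEAD(&rq_list);
		INIT_LIST_HEAD(&plug);
		list_add_tail(&a, &plug);
		list_add_tail(&b, &plug);

		list_splice_tail(&plug, &rq_list);

		/* the stale head is the bug: plug still references 'a'/'b' */
		printf("plug.next == &a ? %s\n", plug.next == &a ? "yes" : "no");

		INIT_LIST_HEAD(&plug);	/* what the _init variant adds */
		printf("plug empty again? %s\n",
		       plug.next == &plug ? "yes" : "no");
		return 0;
	}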

Thanks,
Ming

* Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests
  2018-06-28  3:19 ` [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests Ming Lei
@ 2018-06-28  7:27   ` Johannes Thumshirn
  2018-06-28 19:44     ` Jens Axboe
  2018-06-28 19:52     ` Steven Rostedt
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2018-06-28  7:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche,
	Steven Rostedt

Hi Ming,

On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
> +	list_for_each_entry(rq, list, queuelist) {
>  		BUG_ON(rq->mq_ctx != ctx);
> -		list_del_init(&rq->queuelist);
> -		__blk_mq_insert_req_list(hctx, rq, false);
> +		trace_block_rq_insert(hctx->queue, rq);
>  	}

I wonder if we really need the above loop unconditionally. It does
some BUG_ON() sanity checking (which I hate but it was already there
so not your problem) and tracing of the request insertion.

So can we maybe only run this loop if tracing is enabled? Not sure if
this is possible though. Maybe Steven (Cced) can help here.

Byte,
     Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

* RE: [PATCH 0/3] blk-mq: improve IO perf in case of none io sched
  2018-06-28  7:08     ` Ming Lei
@ 2018-06-28  7:40       ` Kashyap Desai
  0 siblings, 0 replies; 13+ messages in thread
From: Kashyap Desai @ 2018-06-28  7:40 UTC (permalink / raw)
  To: Ming Lei, jianchao.wang
  Cc: Jens Axboe, linux-block, Laurence Oberman, Omar Sandoval,
	Christoph Hellwig, Bart Van Assche, Hannes Reinecke

> Right.
>
> Kashyap, could you test the following patch?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2f20c9e3efda..7d972b1c3153 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1567,7 +1567,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx
> *hctx, struct blk_mq_ctx *ctx,
>  	}
>
>  	spin_lock(&ctx->lock);
> -	list_splice_tail(list, &ctx->rq_list);
> +	list_splice_tail_init(list, &ctx->rq_list);
>  	blk_mq_hctx_mark_pending(hctx, ctx);
>  	spin_unlock(&ctx->lock);

I tested the above-mentioned change, and the kernel panic posted in
this discussion is resolved.

Kashyap

>  }
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests
  2018-06-28  7:27   ` Johannes Thumshirn
@ 2018-06-28 19:44     ` Jens Axboe
  2018-06-28 19:52     ` Steven Rostedt
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2018-06-28 19:44 UTC (permalink / raw)
  To: Johannes Thumshirn, Ming Lei
  Cc: linux-block, Kashyap Desai, Laurence Oberman, Omar Sandoval,
	Christoph Hellwig, Bart Van Assche, Steven Rostedt

On 6/28/18 1:27 AM, Johannes Thumshirn wrote:
> Hi Ming,
> 
> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
>> +	list_for_each_entry(rq, list, queuelist) {
>>  		BUG_ON(rq->mq_ctx != ctx);
>> -		list_del_init(&rq->queuelist);
>> -		__blk_mq_insert_req_list(hctx, rq, false);
>> +		trace_block_rq_insert(hctx->queue, rq);
>>  	}
> 
> I wonder if we really need the above loop unconditionally. It does
> some BUG_ON() sanity checking (which I hate but it was already there
> so not your problem) and tracing of the request insertion.

We can probably kill that now, but jfyi, this was introduced because
of a specific issue.

> So can we maybe only run this loop if tracing is enabled? Not sure if
> this is possible though. Maybe Steven (Cced) can help here.

Don't think it is, since tracing can be enabled in a variety of
different ways. I think it's best to leave this as-is. If this
were to be changed, it'd have to be a separate patch anyway;
it should not be mixed up with this change.

It does need the list_splice_tail_init() as others found out,
or you'll leave the source list corrupted.

-- 
Jens Axboe

* Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests
  2018-06-28  7:27   ` Johannes Thumshirn
  2018-06-28 19:44     ` Jens Axboe
@ 2018-06-28 19:52     ` Steven Rostedt
  2018-06-28 19:58       ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2018-06-28 19:52 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Ming Lei, Jens Axboe, linux-block, Kashyap Desai,
	Laurence Oberman, Omar Sandoval, Christoph Hellwig,
	Bart Van Assche

On Thu, 28 Jun 2018 09:27:08 +0200
Johannes Thumshirn <jthumshirn@suse.de> wrote:

> Hi Ming,
> 
> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
> > +	list_for_each_entry(rq, list, queuelist) {
> >  		BUG_ON(rq->mq_ctx != ctx);
> > -		list_del_init(&rq->queuelist);
> > -		__blk_mq_insert_req_list(hctx, rq, false);
> > +		trace_block_rq_insert(hctx->queue, rq);
> >  	}  
> 
> I wonder if we really need the above loop unconditionally. It does
> some BUG_ON() sanity checking (which I hate but it was already there
> so not your problem) and tracing of the request insertion.
> 
> So can we maybe only run this loop if tracing is enabled? Not sure if
> this is possible though. Maybe Steven (Cced) can help here.

Yes:

	if (trace_block_rq_insert_enabled()) {
		list_for_each_entry(rq, list, queuelist) {
			BUG_ON(rq->mq_ctx != ctx);
			list_del_init(&rq->queuelist);
			__blk_mq_insert_req_list(hctx, rq, false);
			trace_block_rq_insert(hctx->queue, rq);
	 	}
	}

This will only run the loop if the trace event "block_rq_insert" has
been activated. It also uses the jump label infrastructure, so that the
if statement is a non-conditional branch (static_key_false()).
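
Under the hood this is the static key API; a rough kernel-context
sketch of the pattern (generic example with made-up names, not code
from this patch set):

	#include <linux/jump_label.h>

	DEFINE_STATIC_KEY_FALSE(my_key);

	void hot_path(void)
	{
		/* patched at runtime: effectively a no-op while disabled */
		if (static_branch_unlikely(&my_key)) {
			/* rarely-enabled work, e.g. emitting a trace event */
		}
	}

	/* flipped when e.g. the trace event is switched on/off */
	void slow_path_enable(void)  { static_branch_enable(&my_key); }
	void slow_path_disable(void) { static_branch_disable(&my_key); }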

-- Steve

* Re: [PATCH 1/3] blk-mq: use list_splice_tail() to insert requests
  2018-06-28 19:52     ` Steven Rostedt
@ 2018-06-28 19:58       ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2018-06-28 19:58 UTC (permalink / raw)
  To: Steven Rostedt, Johannes Thumshirn
  Cc: Ming Lei, linux-block, Kashyap Desai, Laurence Oberman,
	Omar Sandoval, Christoph Hellwig, Bart Van Assche

On 6/28/18 1:52 PM, Steven Rostedt wrote:
> On Thu, 28 Jun 2018 09:27:08 +0200
> Johannes Thumshirn <jthumshirn@suse.de> wrote:
> 
>> Hi Ming,
>>
>> On Thu, Jun 28, 2018 at 11:19:16AM +0800, Ming Lei wrote:
>>> +	list_for_each_entry(rq, list, queuelist) {
>>>  		BUG_ON(rq->mq_ctx != ctx);
>>> -		list_del_init(&rq->queuelist);
>>> -		__blk_mq_insert_req_list(hctx, rq, false);
>>> +		trace_block_rq_insert(hctx->queue, rq);
>>>  	}  
>>
>> I wonder if we really need the above loop unconditionally. It does
>> some BUG_ON() sanity checking (which I hate but it was already there
>> so not your problem) and tracing of the request insertion.
>>
>> So can we maybe only run this loop if tracing is enabled? Not sure if
>> this is possible though. Maybe Steven (Cced) can help here.
> 
> Yes:
> 
> 	if (trace_block_rq_insert_enabled()) {
> 		list_for_each_entry(rq, list, queuelist) {
> 			BUG_ON(rq->mq_ctx != ctx);
> 			list_del_init(&rq->queuelist);
> 			__blk_mq_insert_req_list(hctx, rq, false);
> 			trace_block_rq_insert(hctx->queue, rq);
> 	 	}
> 	}
> 
> This will only call the loop if the trace event "block_rq_insert" has
> been activated. It also uses the jump label infrastructure, so that if
> statement is a non-conditional branch (static_key_false()).

This works for both ftrace and blktrace?

-- 
Jens Axboe
