All of lore.kernel.org
 help / color / mirror / Atom feed
* lk 3.17-rc4 blk_mq large write problems
@ 2014-09-10  3:55 Douglas Gilbert
  2014-09-10 15:41 ` Christoph Hellwig
  2014-09-22 23:14 ` Douglas Gilbert
  0 siblings, 2 replies; 13+ messages in thread
From: Douglas Gilbert @ 2014-09-10  3:55 UTC (permalink / raw)
  To: SCSI development list

[-- Attachment #1: Type: text/plain, Size: 2366 bytes --]

A few days ago I was trying to create a large file
(say 16 GB) of zeros on an ext4 file system:
    dd if=/dev/zero bs=64k count=256k of=zero_16g.bin

After about 5 seconds there was a NULL de-reference that
crashed the machine (shown below). This was with a clean
version of lk 3.17-rc4 (from kernel.org) where the target
was a SATA SSD directly connected to a LSI 9300-4i SAS-3
HBA (mpt3sas). Significantly (IMO) the kernel boot line
contained:
    scsi_mod.use_blk_mq=Y

In all cases changing that to "N" fixed the problem. I tried
many things, including a SAS SSD but the problem persisted when
use_blk_mq=Y. It doesn't always oops as shown in the first
case below. There were also:
   - immediate reboots
   - lock-ups without any oops on the console
   - different oopses of a somewhat stranger nature
     (hard to catch as logging everything on a real
      serial port is fiddly) like double bus errors

Rob Elliott has been unable to replicate this problem.

Today I switched to another machine running Debian 7 (the
first machine was Ubuntu 14.04 based); both x86_64.
Built the same kernel on the second machine, this time
with a LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
directly connected. Roughly speaking it was the same
test case:
   # <create 1 partition on say /dev/sdb>
   # mkfs.ext4 /dev/sdb1
   # mount /dev/sdb1 /mnt/spare
   # cd /mnt/spare
   # dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
   # cd
   # umount /mnt/spare

Usually the dd or the umount would crash. Then after a
crash, following a power cycle, the mount would crash.
Changing to scsi_mod.use_blk_mq=N restored sanity.

Tried some other SAS controllers: couldn't get a MR-9240-4i
(MegaRaid) to work at all on my newer box (doesn't like
PCIe 3 ?). Got a ARC-1882I working and it did not have
problems with the big dd (perhaps the arcmsr driver still
uses the host_lock to serialize commands).

So it could be common, bad code in the mpt2sas and mpt3sas
drivers. Or it could be somewhere else. Perhaps there is
more than one problem.

Testers out there are encouraged to run the above test case.
The SATA and SAS SSDs that I used can consume writes in the
300 to 600 MB/sec range.

Part of the strangeness of this first attached oops is that
blk_mq_timeout_check() appears twice. The second one (typically
from the umount) is a blown stack.

Enjoy.
Doug Gilbert







[-- Attachment #2: dd_lk317rc3_oops.txt --]
[-- Type: text/plain, Size: 2891 bytes --]

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
PGD 2149ec067 PUD 214265067 PMD 0 
Oops: 0000 [#1] SMP 
Modules linked in: x86_pkg_temp_thermal kvm_intel kvm nfsd ehci_pci ehci_hcd crct10dif_pclmul serio_raw parport_pc auth_rpcgss oid_registry exportfs nfs lockd sunrpc binfmt_misc fuse lp parport ext4 crc16 jbd2 usbhid ses xhci_hcd r8169 usbcore usb_common
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.17.0-rc3 #69
Hardware name: Gigabyte Technology Co., Ltd. Z97M-D3H/Z97M-D3H, BIOS F5 05/30/2014
task: ffff88021513e090 ti: ffff88021518c000 task.ti: ffff88021518c000
RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
RSP: 0018:ffff88021fb83e10  EFLAGS: 00010282
RAX: ffffffff8127cd20 RBX: 0000000000000000 RCX: ffff8800d3dc8d40
RDX: ffff88020fe9c0c8 RSI: 0000000000002007 RDI: ffff8800d3dc8c00
RBP: ffff88020fe9c0c8 R08: ffff880037970088 R09: ffff880037970000
R10: ffff88021e8024e8 R11: 0000000000000002 R12: 0000000000000449
R13: ffff880037970000 R14: ffff88021fb83ea8 R15: ffff88021520c000
FS:  0000000000000000(0000) GS:ffff88021fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000214321000 CR4: 00000000001407e0
Stack:
 ffff8800d3dc8c00 ffff88020fe9c0c8 ffffffff8118f1d7 00000000000026fb
 ffff88020fe9d400 ffffffff811905db ffff880214fb33c0 ffff880037970000
 ffffffff81190570 ffff88021fb83ea8 0000000000000020 ffffffff81193430
Call Trace:
 <IRQ> 
 [<ffffffff8118f1d7>] ? blk_rq_timed_out+0x17/0x80
 [<ffffffff811905db>] ? blk_mq_timeout_check+0x6b/0x90
 [<ffffffff81190570>] ? blk_mq_attempt_merge+0xb0/0xb0
 [<ffffffff81193430>] ? blk_mq_tag_busy_iter+0x50/0x80
 [<ffffffff81190684>] ? blk_mq_rq_timer+0x84/0x120
 [<ffffffff81190600>] ? blk_mq_timeout_check+0x90/0x90
 [<ffffffff81076ea2>] ? call_timer_fn.isra.36+0x12/0x70
 [<ffffffff8107709a>] ? run_timer_softirq+0x19a/0x230
 [<ffffffff8103d6e5>] ? __do_softirq+0xd5/0x1f0
 [<ffffffff8103d995>] ? irq_exit+0x45/0x50
 [<ffffffff8102a6bb>] ? smp_apic_timer_interrupt+0x3b/0x50
 [<ffffffff8140dc4a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
 [<ffffffff81327deb>] ? cpuidle_enter_state+0x4b/0xc0
 [<ffffffff81327ddd>] ? cpuidle_enter_state+0x3d/0xc0
 [<ffffffff810655e7>] ? cpu_startup_entry+0x237/0x270
Code: e8 d8 b3 ff ff 85 c0 75 cd e9 54 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 be 07 20 00 00 53 48 8b 9f f8 00 00 00 <48> 8b 03 48 89 df 48 8b 28 e8 24 ac ff ff 83 bd 54 01 00 00 ff 
RIP  [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0
 RSP <ffff88021fb83e10>
CR2: 0000000000000000
---[ end trace 659752a390e3d62e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt

[-- Attachment #3: mount1_lk317rc4_bug.txt --]
[-- Type: text/plain, Size: 2614 bytes --]

BUG: unable to handle kernel paging request at 000000017f6b91a0
IP: [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
PGD 3a77e067 PUD 0 
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP 
Modules linked in: fuse hfsplus hfs minix vfat msdos fat ext4 crc16 jbd2 nfsd auth_rpcgss oid_registry exportfs nfs lockd sunrpc usbhid ohci_pci ehci_pci ohci_hcd ehci_hcd parport_pc k8temp serio_raw parport usbcore usb_common sg mpt2sas sr_mod
CPU: 0 PID: 5005 Comm: mount Not tainted 3.17.0-rc4 #1
Hardware name: ASUSTek Computer INC. K8N-LR/K8N-LR, BIOS 0303 04/14/2006
task: ffff88003d354790 ti: ffff88003d338000 task.ti: ffff88003d338000
RIP: 0010:[<ffffffff8106ab1f>]  [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
RSP: 0018:ffff88003fc03e00  EFLAGS: 00010046
RAX: 000000000000cf28 RBX: ffff88003d3547f8 RCX: 000000003fc18c40
RDX: ffffffff815b5700 RSI: 0000000000047b86 RDI: ffff88003d354790
RBP: 0000000000047b86 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003fc117a0
R13: 00000004842567a4 R14: ffff88003d3547f8 R15: 00000016194179b7
FS:  00007f85e76367e0(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000017f6b91a0 CR3: 0000000033d75000 CR4: 00000000000007f0
Stack:
 ffffffff8105fa1c ffff88003a71c170 ffff88003fc117a0 0000000000000000
 0000000000000000 ffff88003fc11740 ffffffff810618d5 00000000000007fe
 ffffffff8105e905 ffff88003fc12100 ffff88003fc11740 0000000000000000
Call Trace:
 <IRQ> 
 [<ffffffff8105fa1c>] ? update_curr+0x9c/0xf0
 [<ffffffff810618d5>] ? task_tick_fair+0x1f5/0x4c0
 [<ffffffff8105e905>] ? sched_clock_local+0x15/0x80
 [<ffffffff8105a614>] ? scheduler_tick+0x64/0xe0
 [<ffffffff8107cd48>] ? update_process_times+0x58/0x80
 [<ffffffff81089f6d>] ? tick_sched_timer+0x4d/0x150
 [<ffffffff8107d2d9>] ? __run_hrtimer.isra.35+0x49/0xd0
 [<ffffffff8107d907>] ? hrtimer_interrupt+0xf7/0x240
 [<ffffffff8102c2f6>] ? smp_apic_timer_interrupt+0x36/0x50
 [<ffffffff8142374a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
Code: 48 c7 c0 f4 ff ff ff 5b eb d9 66 90 48 8b 47 08 48 63 48 18 48 8b 87 88 06 00 00 48 8b 50 60 0f 1f 44 00 00 48 8b 82 a8 00 00 00 <48> 03 04 cd a0 2f 5f 81 48 01 30 48 8b 52 40 48 85 d2 75 e5 c3 
RIP  [<ffffffff8106ab1f>] cpuacct_charge+0x1f/0x40
 RSP <ffff88003fc03e00>
CR2: 000000017f6b91a0
---[ end trace 18c8bb81a9313bee ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10  3:55 lk 3.17-rc4 blk_mq large write problems Douglas Gilbert
@ 2014-09-10 15:41 ` Christoph Hellwig
  2014-09-10 16:47   ` Jens Axboe
  2014-09-11  0:58   ` Douglas Gilbert
  2014-09-22 23:14 ` Douglas Gilbert
  1 sibling, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-09-10 15:41 UTC (permalink / raw)
  To: Douglas Gilbert; +Cc: SCSI development list, Jens Axboe

While it might not help with a blown stack, can you give the patch below
a try?  I tries to solve a problem where the timeout handler hits
before we've fully set up a command.  While I'd like to understand the
root cause of why we're hitting it as well, I'd also really to fix that
race. It would also be good to get a gdb listing of the exact area in
scsi_times_out listed in the oops.

---
From: Christoph Hellwig <hch@lst.de>
Subject: blk-mq: call blk_mq_start_request from ->queue_rq

When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.

Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.

Signed-off-by: Christoph Hellwig <hch@lst.de>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5189cb1..db9990b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -374,7 +374,7 @@ void blk_mq_complete_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
-static void blk_mq_start_request(struct request *rq, bool last)
+void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
@@ -405,29 +405,18 @@ static void blk_mq_start_request(struct request *rq, bool last)
 		 */
 		rq->nr_phys_segments++;
 	}
-
-	/*
-	 * Flag the last request in the series so that drivers know when IO
-	 * should be kicked off, if they don't do it on a per-request basis.
-	 *
-	 * Note: the flag isn't the only condition drivers should do kick off.
-	 * If drive is busy, the last request might not have the bit set.
-	 */
-	if (last)
-		rq->cmd_flags |= REQ_END;
 }
+EXPORT_SYMBOL(blk_mq_start_request);
 
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-
-	rq->cmd_flags &= ~REQ_END;
-
-	if (q->dma_drain_size && blk_rq_bytes(rq))
-		rq->nr_phys_segments--;
+	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		if (q->dma_drain_size && blk_rq_bytes(rq))
+			rq->nr_phys_segments--;
+	}
 }
 
 void blk_mq_requeue_request(struct request *rq)
@@ -735,9 +724,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		rq = list_first_entry(&rq_list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 
-		blk_mq_start_request(rq, list_empty(&rq_list));
-
-		ret = q->mq_ops->queue_rq(hctx, rq);
+		ret = q->mq_ops->queue_rq(hctx, rq, list_empty(&rq_list));
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
@@ -1177,14 +1164,13 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		int ret;
 
 		blk_mq_bio_to_request(rq, bio);
-		blk_mq_start_request(rq, true);
 
 		/*
 		 * For OK queue, we are done. For error, kill it. Any other
 		 * error (busy), just add it to our list as we previously
 		 * would have done
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, rq);
+		ret = q->mq_ops->queue_rq(data.hctx, rq, true);
 		if (ret == BLK_MQ_RQ_QUEUE_OK)
 			goto done;
 		else {
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index db1e956..9b0127a 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3775,13 +3775,16 @@ static bool mtip_check_unal_depth(struct blk_mq_hw_ctx *hctx,
 	return false;
 }
 
-static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		bool last)
 {
 	int ret;
 
 	if (unlikely(mtip_check_unal_depth(hctx, rq)))
 		return BLK_MQ_RQ_QUEUE_BUSY;
 
+	blk_mq_start_request(rq);
+
 	ret = mtip_submit_request(hctx, rq);
 	if (likely(!ret))
 		return BLK_MQ_RQ_QUEUE_OK;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index a3b042c..d098adfbb 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -313,10 +313,13 @@ static void null_request_fn(struct request_queue *q)
 	}
 }
 
-static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		bool last)
 {
 	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
+	blk_mq_start_request(rq);
+
 	cmd->rq = rq;
 	cmd->nq = hctx->driver_data;
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0a58140..4b08906 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -164,14 +164,14 @@ static void virtblk_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 }
 
-static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
+		bool last)
 {
 	struct virtio_blk *vblk = hctx->queue->queuedata;
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
 	int qid = hctx->queue_num;
-	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
 
@@ -213,6 +213,8 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
+	blk_mq_start_request(req);
+
 	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
 	err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
 	if (err) {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9c44392..dd6e912 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1856,7 +1856,8 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
 	blk_mq_complete_request(cmd->request);
 }
 
-static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req,
+		bool last)
 {
 	struct request_queue *q = req->q;
 	struct scsi_device *sdev = q->queuedata;
@@ -1890,6 +1891,8 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	scsi_init_cmd_errh(cmd);
 	cmd->scsi_done = scsi_mq_done;
 
+	blk_mq_start_request(req);
+
 	reason = scsi_dispatch_cmd(cmd);
 	if (reason) {
 		scsi_set_blocked(cmd, reason);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index eb726b9..aed92d5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -77,7 +77,7 @@ struct blk_mq_tag_set {
 	struct list_head	tag_list;
 };
 
-typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, struct request *);
+typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, struct request *, bool);
 typedef struct blk_mq_hw_ctx *(map_queue_fn)(struct request_queue *, const int);
 typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
 typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -160,6 +160,7 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
 struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);
 struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_tag_set *, unsigned int, int);
 
+void blk_mq_start_request(struct request *rq);
 void blk_mq_end_io(struct request *rq, int error);
 void __blk_mq_end_io(struct request *rq, int error);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 66c2167..bb7d664 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,7 +188,6 @@ enum rq_flag_bits {
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_KERNEL, 		/* direct IO to kernel pages */
 	__REQ_PM,		/* runtime pm request */
-	__REQ_END,		/* last of chain of requests */
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
 	__REQ_NR_BITS,		/* stops here */
@@ -242,7 +241,6 @@ enum rq_flag_bits {
 #define REQ_SECURE		(1ULL << __REQ_SECURE)
 #define REQ_KERNEL		(1ULL << __REQ_KERNEL)
 #define REQ_PM			(1ULL << __REQ_PM)
-#define REQ_END			(1ULL << __REQ_END)
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 15:41 ` Christoph Hellwig
@ 2014-09-10 16:47   ` Jens Axboe
  2014-09-10 18:09     ` Christoph Hellwig
  2014-09-11  0:58   ` Douglas Gilbert
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2014-09-10 16:47 UTC (permalink / raw)
  To: Christoph Hellwig, Douglas Gilbert; +Cc: SCSI development list

On 2014-09-10 09:41, Christoph Hellwig wrote:
> While it might not help with a blown stack, can you give the patch below
> a try?  I tries to solve a problem where the timeout handler hits
> before we've fully set up a command.  While I'd like to understand the
> root cause of why we're hitting it as well, I'd also really to fix that
> race. It would also be good to get a gdb listing of the exact area in
> scsi_times_out listed in the oops.

It's a really long window, but it does exist. I'd be curious if the 
patch makes a difference for this weird case. I have not seen anything 
like it here.

BTW, please don't mix up the REQ_END and ->queue_rq() changes with the 
changed start_request API.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 16:47   ` Jens Axboe
@ 2014-09-10 18:09     ` Christoph Hellwig
  2014-09-10 18:26       ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2014-09-10 18:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, Douglas Gilbert, SCSI development list

On Wed, Sep 10, 2014 at 10:47:49AM -0600, Jens Axboe wrote:
> BTW, please don't mix up the REQ_END and ->queue_rq() changes with the
> changed start_request API.

I have to.  It's set by start_request, so we need to pass down the last
argument to keep the old behavior.  And once we pass the argument we
can just it directly.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 18:09     ` Christoph Hellwig
@ 2014-09-10 18:26       ` Jens Axboe
  2014-09-10 18:40         ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2014-09-10 18:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Douglas Gilbert, SCSI development list

On 09/10/2014 12:09 PM, Christoph Hellwig wrote:
> On Wed, Sep 10, 2014 at 10:47:49AM -0600, Jens Axboe wrote:
>> BTW, please don't mix up the REQ_END and ->queue_rq() changes with the
>> changed start_request API.
> 
> I have to.  It's set by start_request, so we need to pass down the last
> argument to keep the old behavior.  And once we pass the argument we
> can just it directly.

It could still be done in the caller, but arguably, you'd have to do it
twice unless the ->queue_rq() call was rolled into a function.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 18:26       ` Jens Axboe
@ 2014-09-10 18:40         ` Christoph Hellwig
  2014-09-10 18:42           ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2014-09-10 18:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, Douglas Gilbert, SCSI development list

On Wed, Sep 10, 2014 at 12:26:57PM -0600, Jens Axboe wrote:
> > I have to.  It's set by start_request, so we need to pass down the last
> > argument to keep the old behavior.  And once we pass the argument we
> > can just it directly.
> 
> It could still be done in the caller, but arguably, you'd have to do it
> twice unless the ->queue_rq() call was rolled into a function.

At which point we'd better do it the right way and just pass the flag
instead of wasting a request flag for it.  The other benefit is that
there is a clear compile time API break for ->queue_rq that reminds
people that need to start using blk_mq_start_request.

Anyway, still waiting for test reports of the people that can reproduce
the timeouts.  By the time I'll submit the patch it should have a much
better changelog.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 18:40         ` Christoph Hellwig
@ 2014-09-10 18:42           ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2014-09-10 18:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Douglas Gilbert, SCSI development list

On 09/10/2014 12:40 PM, Christoph Hellwig wrote:
> On Wed, Sep 10, 2014 at 12:26:57PM -0600, Jens Axboe wrote:
>>> I have to.  It's set by start_request, so we need to pass down the last
>>> argument to keep the old behavior.  And once we pass the argument we
>>> can just it directly.
>>
>> It could still be done in the caller, but arguably, you'd have to do it
>> twice unless the ->queue_rq() call was rolled into a function.
> 
> At which point we'd better do it the right way and just pass the flag
> instead of wasting a request flag for it.  The other benefit is that
> there is a clear compile time API break for ->queue_rq that reminds
> people that need to start using blk_mq_start_request.

Yeah that's a good point, hopefully they will look up the commit and
figure out what to do. I'm not adverse to doing these two things at
once, but it should at least have a good mention of it in the changelog.
It's not even mentioned.

> Anyway, still waiting for test reports of the people that can reproduce
> the timeouts.  By the time I'll submit the patch it should have a much
> better changelog.

Thanks, sounds good.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10 15:41 ` Christoph Hellwig
  2014-09-10 16:47   ` Jens Axboe
@ 2014-09-11  0:58   ` Douglas Gilbert
  2014-09-11  2:00     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Douglas Gilbert @ 2014-09-11  0:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: SCSI development list, Jens Axboe

On 14-09-10 11:41 AM, Christoph Hellwig wrote:
> While it might not help with a blown stack, can you give the patch below
> a try?  I tries to solve a problem where the timeout handler hits
> before we've fully set up a command.  While I'd like to understand the
> root cause of why we're hitting it as well, I'd also really to fix that
> race. It would also be good to get a gdb listing of the exact area in
> scsi_times_out listed in the oops.

RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>] scsi_times_out+0xe/0x2e0

(gdb) disassemble scsi_times_out
Dump of assembler code for function scsi_times_out:
    0xffffffff8127d030 <+0>:	push   %rbp
    0xffffffff8127d031 <+1>:	mov    $0x2007,%esi
    0xffffffff8127d036 <+6>:	push   %rbx
    0xffffffff8127d037 <+7>:	mov    0xf8(%rdi),%rbx
    0xffffffff8127d03e <+14>:	mov    (%rbx),%rax
    0xffffffff8127d041 <+17>:	mov    %rbx,%rdi
    0xffffffff8127d044 <+20>:	mov    (%rax),%rbp
    0xffffffff8127d047 <+23>:	callq  0xffffffff81277c70 <scsi_log_completion>
    0xffffffff8127d04c <+28>:	cmpl   $0xffffffff,0x154(%rbp)
    0xffffffff8127d053 <+35>:	je     0xffffffff8127d05f <scsi_times_out+47>
...

which seems to agree 'objdump -drS scsi_error.o':

00000000000028b0 <scsi_times_out>:
     28b0:	55                   	push   %rbp
     28b1:	be 07 20 00 00       	mov    $0x2007,%esi
     28b6:	53                   	push   %rbx
     28b7:	48 8b 9f f8 00 00 00 	mov    0xf8(%rdi),%rbx
     28be:	48 8b 03             	mov    (%rbx),%rax
     28c1:	48 89 df             	mov    %rbx,%rdi
     28c4:	48 8b 28             	mov    (%rax),%rbp
     28c7:	e8 00 00 00 00       	callq  28cc <scsi_times_out+0x1c>
			28c8: R_X86_64_PC32	scsi_log_completion-0x4
     28cc:	83 bd 54 01 00 00 ff 	cmpl   $0xffffffff,0x154(%rbp)

> From: Christoph Hellwig <hch@lst.de>
> Subject: blk-mq: call blk_mq_start_request from ->queue_rq
>
> When we call blk_mq_start_request from the core blk-mq code before calling into
> ->queue_rq there is a racy window where the timeout handler can hit before we've
> fully set up the driver specific part of the command.
>
> Move the call to blk_mq_start_request into the driver so the driver can start
> the request only once it is fully set up.

Using my original (newer) machine with a SAS SSD, today
I'm seeing only the "blown stack" oops on umount. And on
the next reboot, if use_blk_mq=Y then doing the mount
(on the SAS SSD) causes an instant reboot.

Same with and without this patch. I'll try again with the
SATA SSD (but I need to archive its contents first) and
maybe I can get back to the scsi_times_out oops.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-11  0:58   ` Douglas Gilbert
@ 2014-09-11  2:00     ` Jens Axboe
  2014-09-11  3:48       ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2014-09-11  2:00 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig; +Cc: SCSI development list

On 2014-09-10 18:58, Douglas Gilbert wrote:
> On 14-09-10 11:41 AM, Christoph Hellwig wrote:
>> While it might not help with a blown stack, can you give the patch below
>> a try?  I tries to solve a problem where the timeout handler hits
>> before we've fully set up a command.  While I'd like to understand the
>> root cause of why we're hitting it as well, I'd also really to fix that
>> race. It would also be good to get a gdb listing of the exact area in
>> scsi_times_out listed in the oops.
>
> RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>]
> scsi_times_out+0xe/0x2e0
>
> (gdb) disassemble scsi_times_out
> Dump of assembler code for function scsi_times_out:
>     0xffffffff8127d030 <+0>:    push   %rbp
>     0xffffffff8127d031 <+1>:    mov    $0x2007,%esi
>     0xffffffff8127d036 <+6>:    push   %rbx
>     0xffffffff8127d037 <+7>:    mov    0xf8(%rdi),%rbx
>     0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
>     0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
>     0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
>     0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70
> <scsi_log_completion>
>     0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
>     0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f
> <scsi_times_out+47>
> ...
>
> which seems to agree 'objdump -drS scsi_error.o':
>
> 00000000000028b0 <scsi_times_out>:
>      28b0:    55                       push   %rbp
>      28b1:    be 07 20 00 00           mov    $0x2007,%esi
>      28b6:    53                       push   %rbx
>      28b7:    48 8b 9f f8 00 00 00     mov    0xf8(%rdi),%rbx
>      28be:    48 8b 03                 mov    (%rbx),%rax
>      28c1:    48 89 df                 mov    %rbx,%rdi
>      28c4:    48 8b 28                 mov    (%rax),%rbp
>      28c7:    e8 00 00 00 00           callq  28cc <scsi_times_out+0x1c>
>              28c8: R_X86_64_PC32    scsi_log_completion-0x4
>      28cc:    83 bd 54 01 00 00 ff     cmpl   $0xffffffff,0x154(%rbp)

This would be more useful if you had DEBUGINFO enabled. At least it 
would save use some time :-)


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: lk 3.17-rc4 blk_mq large write problems
  2014-09-11  2:00     ` Jens Axboe
@ 2014-09-11  3:48       ` Elliott, Robert (Server Storage)
  2014-09-11  5:37         ` Douglas Gilbert
  0 siblings, 1 reply; 13+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-09-11  3:48 UTC (permalink / raw)
  To: Jens Axboe, dgilbert, Christoph Hellwig; +Cc: SCSI development list



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Jens Axboe
> Sent: Wednesday, September 10, 2014 9:00 PM
> To: dgilbert@interlog.com; Christoph Hellwig
> Cc: SCSI development list
> Subject: Re: lk 3.17-rc4 blk_mq large write problems
> 
> On 2014-09-10 18:58, Douglas Gilbert wrote:
> > On 14-09-10 11:41 AM, Christoph Hellwig wrote:
> >> While it might not help with a blown stack, can you give the patch
> below
> >> a try?  I tries to solve a problem where the timeout handler hits
> >> before we've fully set up a command.  While I'd like to understand
> the
> >> root cause of why we're hitting it as well, I'd also really to fix
> that
> >> race. It would also be good to get a gdb listing of the exact area in
> >> scsi_times_out listed in the oops.
> >
> > RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>]
> > scsi_times_out+0xe/0x2e0
> >
> > (gdb) disassemble scsi_times_out
> > Dump of assembler code for function scsi_times_out:
> >     0xffffffff8127d030 <+0>:    push   %rbp
> >     0xffffffff8127d031 <+1>:    mov    $0x2007,%esi
> >     0xffffffff8127d036 <+6>:    push   %rbx
> >     0xffffffff8127d037 <+7>:    mov    0xf8(%rdi),%rbx
> >     0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
> >     0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
> >     0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
> >     0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70
> > <scsi_log_completion>
> >     0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
> >     0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f
> > <scsi_times_out+47>
> > ...
> >
> > which seems to agree 'objdump -drS scsi_error.o':
> >
> > 00000000000028b0 <scsi_times_out>:
> >      28b0:    55                       push   %rbp
> >      28b1:    be 07 20 00 00           mov    $0x2007,%esi
> >      28b6:    53                       push   %rbx
> >      28b7:    48 8b 9f f8 00 00 00     mov    0xf8(%rdi),%rbx
> >      28be:    48 8b 03                 mov    (%rbx),%rax
> >      28c1:    48 89 df                 mov    %rbx,%rdi
> >      28c4:    48 8b 28                 mov    (%rax),%rbp
> >      28c7:    e8 00 00 00 00           callq  28cc
> <scsi_times_out+0x1c>
> >              28c8: R_X86_64_PC32    scsi_log_completion-0x4
> >      28cc:    83 bd 54 01 00 00 ff     cmpl   $0xffffffff,0x154(%rbp)
> 
> This would be more useful if you had DEBUGINFO enabled. At least it
> would save use some time :-)
> 

On my system, that function compiles to:

enum blk_eh_timer_return scsi_times_out(struct request *req)
{
    2580:       55                      push   %rbp
    2581:       48 89 e5                mov    %rsp,%rbp
    2584:       41 55                   push   %r13
    2586:       41 54                   push   %r12
    2588:       53                      push   %rbx
    2589:       48 83 ec 08             sub    $0x8,%rsp
    258d:       e8 00 00 00 00          callq  2592 <scsi_times_out+0x12>
                        258e: R_X86_64_PC32     mcount-0x4
        struct scsi_cmnd *scmd = req->special;
    2592:       48 8b 9f f8 00 00 00    mov    0xf8(%rdi),%rbx
        enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
        struct Scsi_Host *host = scmd->device->host;
    2599:       48 8b 03                mov    (%rbx),%rax
    259c:       4c 8b 20                mov    (%rax),%r12
    259f:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

So, Doug's +0xe corresponds to +0x19 here.

258d-2591 and the extra code preceding them are for stack
overflow checking, which I have enabled, and the location
for the optional ftrace jump if tracing this function is
turned on.

2592 is scmd = req->special.  %rdi is req, the first and
only function argument (per the x86_64 abi).  0xf8(%rdi)
is referencing req->special.  req must be OK, since that's
not the instruction pointer address that failed.

2599 is dereferencing req->special (i.e., scmd) to set host.
req->special must contain NULL, causing the exception
at this IP.

As a short-term workaround, it might be better to just let
this function exit if scmd is NULL, assuming that it was 
already dealt with properly.  That would still be misled
if the race condition, whatever it may be, allows scmd to
be reused before this function runs; this function would
then operate on a future command.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-11  3:48       ` Elliott, Robert (Server Storage)
@ 2014-09-11  5:37         ` Douglas Gilbert
  2014-09-17 17:04           ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Douglas Gilbert @ 2014-09-11  5:37 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), Jens Axboe, Christoph Hellwig
  Cc: SCSI development list

On 14-09-10 11:48 PM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
>> owner@vger.kernel.org] On Behalf Of Jens Axboe
>> Sent: Wednesday, September 10, 2014 9:00 PM
>> To: dgilbert@interlog.com; Christoph Hellwig
>> Cc: SCSI development list
>> Subject: Re: lk 3.17-rc4 blk_mq large write problems
>>
>> On 2014-09-10 18:58, Douglas Gilbert wrote:
>>> On 14-09-10 11:41 AM, Christoph Hellwig wrote:
>>>> While it might not help with a blown stack, can you give the patch
>> below
>>>> a try?  I tries to solve a problem where the timeout handler hits
>>>> before we've fully set up a command.  While I'd like to understand
>> the
>>>> root cause of why we're hitting it as well, I'd also really to fix
>> that
>>>> race. It would also be good to get a gdb listing of the exact area in
>>>> scsi_times_out listed in the oops.
>>>
>>> RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>]
>>> scsi_times_out+0xe/0x2e0
>>>
>>> (gdb) disassemble scsi_times_out
>>> Dump of assembler code for function scsi_times_out:
>>>      0xffffffff8127d030 <+0>:    push   %rbp
>>>      0xffffffff8127d031 <+1>:    mov    $0x2007,%esi
>>>      0xffffffff8127d036 <+6>:    push   %rbx
>>>      0xffffffff8127d037 <+7>:    mov    0xf8(%rdi),%rbx
>>>      0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
>>>      0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
>>>      0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
>>>      0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70
>>> <scsi_log_completion>
>>>      0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
>>>      0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f
>>> <scsi_times_out+47>
>>> ...
>>>
>>> which seems to agree 'objdump -drS scsi_error.o':
>>>
>>> 00000000000028b0 <scsi_times_out>:
>>>       28b0:    55                       push   %rbp
>>>       28b1:    be 07 20 00 00           mov    $0x2007,%esi
>>>       28b6:    53                       push   %rbx
>>>       28b7:    48 8b 9f f8 00 00 00     mov    0xf8(%rdi),%rbx
>>>       28be:    48 8b 03                 mov    (%rbx),%rax
>>>       28c1:    48 89 df                 mov    %rbx,%rdi
>>>       28c4:    48 8b 28                 mov    (%rax),%rbp
>>>       28c7:    e8 00 00 00 00           callq  28cc
>> <scsi_times_out+0x1c>
>>>               28c8: R_X86_64_PC32    scsi_log_completion-0x4
>>>       28cc:    83 bd 54 01 00 00 ff     cmpl   $0xffffffff,0x154(%rbp)
>>
>> This would be more useful if you had DEBUGINFO enabled. At least it
>> would save use some time :-)
>>
>
> On my system, that function compiles to:
>
> enum blk_eh_timer_return scsi_times_out(struct request *req)
> {
>      2580:       55                      push   %rbp
>      2581:       48 89 e5                mov    %rsp,%rbp
>      2584:       41 55                   push   %r13
>      2586:       41 54                   push   %r12
>      2588:       53                      push   %rbx
>      2589:       48 83 ec 08             sub    $0x8,%rsp
>      258d:       e8 00 00 00 00          callq  2592 <scsi_times_out+0x12>
>                          258e: R_X86_64_PC32     mcount-0x4
>          struct scsi_cmnd *scmd = req->special;
>      2592:       48 8b 9f f8 00 00 00    mov    0xf8(%rdi),%rbx
>          enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
>          struct Scsi_Host *host = scmd->device->host;
>      2599:       48 8b 03                mov    (%rbx),%rax
>      259c:       4c 8b 20                mov    (%rax),%r12
>      259f:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>
> So, Doug's +0xe corresponds to +0x19 here.
>
> 258d-2591 and the extra code preceding them are for stack
> overflow checking, which I have enabled, and the location
> for the optional ftrace jump if tracing this function is
> turned on.
>
> 2592 is scmd = req->special.  %rdi is req, the first and
> only function argument (per the x86_64 abi).  0xf8(%rdi)
> is referencing req->special.  req must be OK, since that's
> not the instruction pointer address that failed.
>
> 2599 is dereferencing req->special (i.e., scmd) to set host.
> req->special must contain NULL, causing the exception
> at this IP.
>
> As a short-term workaround, it might be better to just let
> this function exit if scmd is NULL, assuming that it was
> already dealt with properly.  That would still be misled
> if the race condition, whatever it may be, allows scmd to
> be reused before this function runs; this function would
> then operate on a future command.

Switching back to a SATA SSD gives me the same
"scsi_times_out+0xe" NULL de-reference oops shown at the
start of this thread.

(gdb) list *(scsi_times_out+0xe)
0xffffffff812804fe is in scsi_times_out (drivers/scsi/scsi_error.c:277).
272	 */
273	enum blk_eh_timer_return scsi_times_out(struct request *req)
274	{
275		struct scsi_cmnd *scmd = req->special;
276		enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
277		struct Scsi_Host *host = scmd->device->host;
278	
279		trace_scsi_dispatch_cmd_timeout(scmd);
280		scsi_log_completion(scmd, TIMEOUT_ERROR);

As Rob suggested.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-11  5:37         ` Douglas Gilbert
@ 2014-09-17 17:04           ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-09-17 17:04 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Elliott, Robert (Server Storage), Jens Axboe, SCSI development list

Doug,

can you test the latest for-linus branch in Jens' tree
(git://git.kernel.dk/linux-block).  It has a fix for the infinite
recursion with the FUA flag, and a lot of timeout handling fixes.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: lk 3.17-rc4 blk_mq large write problems
  2014-09-10  3:55 lk 3.17-rc4 blk_mq large write problems Douglas Gilbert
  2014-09-10 15:41 ` Christoph Hellwig
@ 2014-09-22 23:14 ` Douglas Gilbert
  1 sibling, 0 replies; 13+ messages in thread
From: Douglas Gilbert @ 2014-09-22 23:14 UTC (permalink / raw)
  To: SCSI development list
  Cc: Christoph Hellwig, Jens Axboe, Elliott, Robert (Server Storage)

On 14-09-09 11:55 PM, Douglas Gilbert wrote:
> A few days ago I was trying to create a large file
> (say 16 GB) of zeros on an ext4 file system:
>     dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
>
> After about 5 seconds there was a NULL de-reference that
> crashed the machine (shown below). This was with a clean
> version of lk 3.17-rc4 (from kernel.org) where the target
> was a SATA SSD directly connected to a LSI 9300-4i SAS-3
> HBA (mpt3sas). Significantly (IMO) the kernel boot line
> contained:
>     scsi_mod.use_blk_mq=Y
>
> In all cases changing that to "N" fixed the problem. I tried
> many things, including a SAS SSD but the problem persisted when
> use_blk_mq=Y. It doesn't always oops as shown in the first
> case below. There were also:
>    - immediate reboots
>    - lock-ups without any oops on the console
>    - different oopses of a somewhat stranger nature
>      (hard to catch as logging everything on a real
>       serial port is fiddly) like double bus errors
>
> Rob Elliott has been unable to replicate this problem.
>
> Today I switched to another machine running Debian 7 (the
> first machine was Ubuntu 14.04 based); both x86_64.
> Built the same kernel on the second machine, this time
> with a LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
> directly connected. Roughly speaking it was the same
> test case:
>    # <create 1 partition on say /dev/sdb>
>    # mkfs.ext4 /dev/sdb1
>    # mount /dev/sdb1 /mnt/spare
>    # cd /mnt/spare
>    # dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
>    # cd
>    # umount /mnt/spare
>
> Usually the dd or the umount would crash. Then after a
> crash, following a power cycle, the mount would crash.
> Changing to scsi_mod.use_blk_mq=N restored sanity.
>
> Tried some other SAS controllers: couldn't get a MR-9240-4i
> (MegaRaid) to work at all on my newer box (doesn't like
> PCIe 3 ?). Got a ARC-1882I working and it did not have
> problems with the big dd (perhaps the arcmsr driver still
> uses the host_lock to serialize commands).
>
> So it could be common, bad code in the mpt2sas and mpt3sas
> drivers. Or it could be somewhere else. Perhaps there is
> more than one problem.
>
> Testers out there are encouraged to run the above test case.
> The SATA and SAS SSDs that I used can consume writes in the
> 300 to 600 MB/sec range.
>
> Part of the strangeness of this first attached oops is that
> blk_mq_timeout_check() appears twice. The second one (typically
> from the umount) is a blown stack.

Using the block/for-linus tree that I built today,
the freeze-during-boot-up problem has gone away as
reported earlier.

That allows me to retest the problem reported in this
thread with the same disk (INTEL SSDSA2M080) and the
same configuration. Just did four cycles of the test
sequence shown above plus a shutdown. No problems seen.

Doug Gilbert


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-09-22 23:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-10  3:55 lk 3.17-rc4 blk_mq large write problems Douglas Gilbert
2014-09-10 15:41 ` Christoph Hellwig
2014-09-10 16:47   ` Jens Axboe
2014-09-10 18:09     ` Christoph Hellwig
2014-09-10 18:26       ` Jens Axboe
2014-09-10 18:40         ` Christoph Hellwig
2014-09-10 18:42           ` Jens Axboe
2014-09-11  0:58   ` Douglas Gilbert
2014-09-11  2:00     ` Jens Axboe
2014-09-11  3:48       ` Elliott, Robert (Server Storage)
2014-09-11  5:37         ` Douglas Gilbert
2014-09-17 17:04           ` Christoph Hellwig
2014-09-22 23:14 ` Douglas Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.