Re: lk 3.17-rc4 blk_mq large write problems

From: Douglas Gilbert <dgilbert@interlog.com>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>,
	Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>
Cc: SCSI development list <linux-scsi@vger.kernel.org>
Subject: Re: lk 3.17-rc4 blk_mq large write problems
Date: Thu, 11 Sep 2014 01:37:03 -0400	[thread overview]
Message-ID: <541134FF.40807@interlog.com> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B402958C72114@G9W0745.americas.hpqcorp.net>

On 14-09-10 11:48 PM, Elliott, Robert (Server Storage) wrote:
>
>
>> -----Original Message-----
>> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
>> owner@vger.kernel.org] On Behalf Of Jens Axboe
>> Sent: Wednesday, September 10, 2014 9:00 PM
>> To: dgilbert@interlog.com; Christoph Hellwig
>> Cc: SCSI development list
>> Subject: Re: lk 3.17-rc4 blk_mq large write problems
>>
>> On 2014-09-10 18:58, Douglas Gilbert wrote:
>>> On 14-09-10 11:41 AM, Christoph Hellwig wrote:
>>>> While it might not help with a blown stack, can you give the patch
>> below
>>>> a try?  I tries to solve a problem where the timeout handler hits
>>>> before we've fully set up a command.  While I'd like to understand
>> the
>>>> root cause of why we're hitting it as well, I'd also really to fix
>> that
>>>> race. It would also be good to get a gdb listing of the exact area in
>>>> scsi_times_out listed in the oops.
>>>
>>> RIP: 0010:[<ffffffff8127cd2e>]  [<ffffffff8127cd2e>]
>>> scsi_times_out+0xe/0x2e0
>>>
>>> (gdb) disassemble scsi_times_out
>>> Dump of assembler code for function scsi_times_out:
>>>      0xffffffff8127d030 <+0>:    push   %rbp
>>>      0xffffffff8127d031 <+1>:    mov    $0x2007,%esi
>>>      0xffffffff8127d036 <+6>:    push   %rbx
>>>      0xffffffff8127d037 <+7>:    mov    0xf8(%rdi),%rbx
>>>      0xffffffff8127d03e <+14>:    mov    (%rbx),%rax
>>>      0xffffffff8127d041 <+17>:    mov    %rbx,%rdi
>>>      0xffffffff8127d044 <+20>:    mov    (%rax),%rbp
>>>      0xffffffff8127d047 <+23>:    callq  0xffffffff81277c70
>>> <scsi_log_completion>
>>>      0xffffffff8127d04c <+28>:    cmpl   $0xffffffff,0x154(%rbp)
>>>      0xffffffff8127d053 <+35>:    je     0xffffffff8127d05f
>>> <scsi_times_out+47>
>>> ...
>>>
>>> which seems to agree 'objdump -drS scsi_error.o':
>>>
>>> 00000000000028b0 <scsi_times_out>:
>>>       28b0:    55                       push   %rbp
>>>       28b1:    be 07 20 00 00           mov    $0x2007,%esi
>>>       28b6:    53                       push   %rbx
>>>       28b7:    48 8b 9f f8 00 00 00     mov    0xf8(%rdi),%rbx
>>>       28be:    48 8b 03                 mov    (%rbx),%rax
>>>       28c1:    48 89 df                 mov    %rbx,%rdi
>>>       28c4:    48 8b 28                 mov    (%rax),%rbp
>>>       28c7:    e8 00 00 00 00           callq  28cc
>> <scsi_times_out+0x1c>
>>>               28c8: R_X86_64_PC32    scsi_log_completion-0x4
>>>       28cc:    83 bd 54 01 00 00 ff     cmpl   $0xffffffff,0x154(%rbp)
>>
>> This would be more useful if you had DEBUGINFO enabled. At least it
>> would save use some time :-)
>>
>
> On my system, that function compiles to:
>
> enum blk_eh_timer_return scsi_times_out(struct request *req)
> {
>      2580:       55                      push   %rbp
>      2581:       48 89 e5                mov    %rsp,%rbp
>      2584:       41 55                   push   %r13
>      2586:       41 54                   push   %r12
>      2588:       53                      push   %rbx
>      2589:       48 83 ec 08             sub    $0x8,%rsp
>      258d:       e8 00 00 00 00          callq  2592 <scsi_times_out+0x12>
>                          258e: R_X86_64_PC32     mcount-0x4
>          struct scsi_cmnd *scmd = req->special;
>      2592:       48 8b 9f f8 00 00 00    mov    0xf8(%rdi),%rbx
>          enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
>          struct Scsi_Host *host = scmd->device->host;
>      2599:       48 8b 03                mov    (%rbx),%rax
>      259c:       4c 8b 20                mov    (%rax),%r12
>      259f:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>
> So, Doug's +0xe corresponds to +0x19 here.
>
> 258d-2591 and the extra code preceding them are for stack
> overflow checking, which I have enabled, and the location
> for the optional ftrace jump if tracing this function is
> turned on.
>
> 2592 is scmd = req->special.  %rdi is req, the first and
> only function argument (per the x86_64 abi).  0xf8(%rdi)
> is referencing req->special.  req must be OK, since that's
> not the instruction pointer address that failed.
>
> 2599 is dereferencing req->special (i.e., scmd) to set host.
> req->special must contain NULL, causing the exception
> at this IP.
>
> As a short-term workaround, it might be better to just let
> this function exit if scmd is NULL, assuming that it was
> already dealt with properly.  That would still be misled
> if the race condition, whatever it may be, allows scmd to
> be reused before this function runs; this function would
> then operate on a future command.

Switching back to a SATA SSD gives me the same
"scsi_times_out+0xe" NULL de-reference oops shown at the
start of this thread.

(gdb) list *(scsi_times_out+0xe)
0xffffffff812804fe is in scsi_times_out (drivers/scsi/scsi_error.c:277).
272	 */
273	enum blk_eh_timer_return scsi_times_out(struct request *req)
274	{
275		struct scsi_cmnd *scmd = req->special;
276		enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
277		struct Scsi_Host *host = scmd->device->host;
278	
279		trace_scsi_dispatch_cmd_timeout(scmd);
280		scsi_log_completion(scmd, TIMEOUT_ERROR);

As Rob suggested.