linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* blktests scsi/007 failure
@ 2023-04-14  7:36 Shin'ichiro Kawasaki
  2023-04-14  8:09 ` Chaitanya Kulkarni
  2023-04-14  8:33 ` John Garry
  0 siblings, 2 replies; 8+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-04-14  7:36 UTC (permalink / raw)
  To: Bart Van Assche, linux-scsi, linux-block

Hello Bart,

Recently, I built a new blktests trial environment on QEMU. With this
environment, I observe scsi/007 failure. FYI, let me share blktests output [1]
and kernel message [2].

I found the failure depends on kernel configs for debug such as KASAN. When I
enable KASAN, the test case fails. When I disable KASAN, the test case passes.
It looks that the failure depends on the slow kernel (and/or slow machine).

The test case sets 1 second to the block layer timeout to trigger the SCSI error
handler. It also sets 3 seconds to scsi_debug delay assuming the error handler
completes before the 3 seconds. From the kernel message, it looks that the error
handler takes longer than the 3 seconds delay, so I/O completes as success
before the error handler completion. This I/O success is not expected then the
test case fails. As a trial, I extended the scsi_debug delay time to 10 seconds,
then I observed the test case passes.

Do you expect the I/O success by slow SCSI error handler? If so, the test case
needs improvement by extending the scsi_debug delay time.


[1] blktests output

scsi/007 (Trigger the SCSI error handler)                    [failed]
    runtime  25.594s  ...  13.646s
    --- tests/scsi/007.out      2023-04-06 10:11:07.926670528 +0900
    +++ /home/shin/Blktests/blktests/results/nodev/scsi/007.out.bad     2023-04-14 16:09:45.447892078 +0900
    @@ -1,3 +1,3 @@
     Running scsi/007
    -Reading from scsi_debug failed
    +Reading from scsi_debug succeeded
     Test complete

[2] kernel message

[ 3714.407999] run blktests scsi/007 at 2023-04-14 16:09:31
[ 3714.523102] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[ 3714.525023] scsi host3: scsi_debug: version 0191 [20210520]
                 dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[ 3714.533733] scsi 3:0:0:0: Direct-Access     Linux    scsi_debug       0191 PQ: 0 ANSI: 7
[ 3714.543198] sd 3:0:0:0: Power-on or device reset occurred
[ 3714.543250] sd 3:0:0:0: Attached scsi generic sg2 type 0
[ 3714.550936] sd 3:0:0:0: [sdc] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[ 3714.554821] sd 3:0:0:0: [sdc] Write Protect is off
[ 3714.558024] sd 3:0:0:0: [sdc] Mode Sense: 73 00 10 08
[ 3714.562414] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 3714.566601] sd 3:0:0:0: [sdc] Preferred minimum I/O size 512 bytes
[ 3714.570045] sd 3:0:0:0: [sdc] Optimal transfer size 524288 bytes
[ 3714.586397] sd 3:0:0:0: [sdc] Attached SCSI disk
[ 3715.999917] sd 3:0:0:0: [sdc] tag#103 abort scheduled
[ 3716.015174] sd 3:0:0:0: [sdc] tag#103 aborting command
[ 3716.019935] sd 3:0:0:0: [sdc] tag#103 retry aborted command
[ 3717.090803] sd 3:0:0:0: [sdc] tag#178 previous abort failed
[ 3717.098780] scsi host3: Waking error handler thread
[ 3717.098917] scsi host3: scsi_eh_3: waking up 0/1/1
[ 3717.106279] scsi host3: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
[ 3717.111212] scsi host3: Total of 1 commands on 1 devices require eh work
[ 3717.116170] sd 3:0:0:0: scsi_eh_3: Sending BDR
[ 3717.120493] sd 3:0:0:0: [sdc] tag#178 scsi_eh_done result: 2
[ 3717.125183] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd timeleft: 10000
[ 3717.130301] sd 3:0:0:0: Power-on or device reset occurred
[ 3717.134935] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
[ 3717.140241] sd 3:0:0:0: [sdc] tag#178 scsi_eh_tur return: 2001
[ 3719.033180] sd 3:0:0:0: [sdc] tag#178 scsi_eh_done result: 0
[ 3719.037656] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd timeleft: 8114
[ 3719.043293] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
[ 3719.049631] sd 3:0:0:0: [sdc] tag#178 scsi_eh_tur return: 2002
[ 3719.055489] sd 3:0:0:0: [sdc] tag#178 scsi_eh_3: flush retry cmd
[ 3719.061420] scsi host3: waking up host to restart
[ 3719.069512] scsi host3: scsi_eh_3: sleeping
[ 3720.175944] sd 3:0:0:0: [sdc] tag#180 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=1s
[ 3720.182709] sd 3:0:0:0: [sdc] tag#180 CDB: Read(10) 28 00 00 00 3f 80 00 00 08 00
[ 3720.189825] I/O error, dev sdc, sector 16256 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[ 3722.342015] sd 3:0:0:0: Power-on or device reset occurred
[ 3725.349863] sd 3:0:0:0: Power-on or device reset occurred
[ 3727.275069] sd 3:0:0:0: [sdc] tag#79 Medium access timeout failure. Offlining disk!
[ 3727.279944] sd 3:0:0:0: [sdc] tag#79 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=7s
[ 3727.285131] sd 3:0:0:0: [sdc] tag#79 CDB: Read(10) 28 00 00 00 3f 80 00 00 08 00
[ 3727.289961] I/O error, dev sdc, sector 16256 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 3727.294862] Buffer I/O error on dev sdc, logical block 2032, async page read
[ 3727.996161] sd 3:0:0:0: [sdc] Synchronizing SCSI cache


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-14  7:36 blktests scsi/007 failure Shin'ichiro Kawasaki
@ 2023-04-14  8:09 ` Chaitanya Kulkarni
  2023-04-14  8:53   ` Shin'ichiro Kawasaki
  2023-04-14  8:33 ` John Garry
  1 sibling, 1 reply; 8+ messages in thread
From: Chaitanya Kulkarni @ 2023-04-14  8:09 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, Bart Van Assche, linux-scsi, linux-block

On 4/14/2023 12:36 AM, Shin'ichiro Kawasaki wrote:
> Hello Bart,
> 
> Recently, I built a new blktests trial environment on QEMU. With this
> environment, I observe scsi/007 failure. FYI, let me share blktests output [1]
> and kernel message [2].
> 
> I found the failure depends on kernel configs for debug such as KASAN. When I
> enable KASAN, the test case fails. When I disable KASAN, the test case passes.
> It looks that the failure depends on the slow kernel (and/or slow machine).
> 
> The test case sets 1 second to the block layer timeout to trigger the SCSI error
> handler. It also sets 3 seconds to scsi_debug delay assuming the error handler
> completes before the 3 seconds. From the kernel message, it looks that the error
> handler takes longer than the 3 seconds delay, so I/O completes as success
> before the error handler completion. This I/O success is not expected then the
> test case fails. As a trial, I extended the scsi_debug delay time to 10 seconds,
> then I observed the test case passes.
> 
> Do you expect the I/O success by slow SCSI error handler? If so, the test case
> needs improvement by extending the scsi_debug delay time.
> 
> 

I faced the same problem, but increased timeout it is passing :-

blktests (master) # ./check scsi/007 
                                                     scsi/007 (Trigger 
the SCSI error handler)                    [failed] 
                                       runtime    ...  38.529s 
 
                    --- tests/scsi/007.out      2022-07-18 
16:14:37.204230234 -0700 
              +++ /root/blktests/results/nodev/scsi/007.out.bad 
2023-04-14 01:03:26.518991337 -0700 
@@ -1,3 +1,3 @@ 
                                                      Running scsi/007 
 
                                  -Reading from scsi_debug failed 
 
               +Reading from scsi_debug succeeded 
 
Test complete 
                                                blktests (master) # vim 
tests/scsi/007
 
 
                                  blktests (master) # git diff 
 
               diff --git a/tests/scsi/007 b/tests/scsi/007 
 
index e7088a1..1aa3e03 100755 
                                                     --- 
a/tests/scsi/007 
                                                 +++ b/tests/scsi/007 
 
                              @@ -41,7 +41,7 @@ test() { 
 
                   echo 1 > "/sys/class/block/$dev/queue/io_timeout" 
 
echo "I/O timeout = $(<"/sys/class/block/$dev/queue/io_timeout")" 
 >>"$FULL"                                                   # Change 
the scsi_debug delay to 3 seconds. 
                                   -       delay_s=3 
 
                +       delay_s=10 
 
      freq=$(config_hz) 
 
jdelay=$((delay_s * "${freq}")) 
                                                     echo 
"CONFIG_HZ=${freq} jdelay=${jdelay}" >>"$FULL" 


blktests (master) # ./check scsi/007 
                                                     scsi/007 (Trigger 
the SCSI error handler)                    [passed] 
                                       runtime  21.264s  ...  21.267s 
 
                blktests (master) # 
 
blktests (master) #

you can give it a try to see if it fixes the problem ..

-ck



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-14  7:36 blktests scsi/007 failure Shin'ichiro Kawasaki
  2023-04-14  8:09 ` Chaitanya Kulkarni
@ 2023-04-14  8:33 ` John Garry
  2023-04-14  8:58   ` Shin'ichiro Kawasaki
  1 sibling, 1 reply; 8+ messages in thread
From: John Garry @ 2023-04-14  8:33 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, Bart Van Assche, linux-scsi, linux-block

On 14/04/2023 08:36, Shin'ichiro Kawasaki wrote:
> Hello Bart,
> 
> Recently, I built a new blktests trial environment on QEMU. With this
> environment, I observe scsi/007 failure. FYI, let me share blktests output [1]
> and kernel message [2].
> 

I did not notice which kernel you are using - did you mention it somewhere?

> I found the failure depends on kernel configs for debug such as KASAN. When I
> enable KASAN, the test case fails. When I disable KASAN, the test case passes.
> It looks that the failure depends on the slow kernel (and/or slow machine).
> 
> The test case sets 1 second to the block layer timeout to trigger the SCSI error
> handler. It also sets 3 seconds to scsi_debug delay assuming the error handler
> completes before the 3 seconds. From the kernel message, it looks that the error
> handler takes longer than the 3 seconds delay, so I/O completes as success
> before the error handler completion. This I/O success is not expected then the
> test case fails. As a trial, I extended the scsi_debug delay time to 10 seconds,
> then I observed the test case passes.
> 
> Do you expect the I/O success by slow SCSI error handler? If so, the test case
> needs improvement by extending the scsi_debug delay time.

The failure may be due to one of my changes. Please see 
https://lore.kernel.org/lkml/5bdbfbbc-bac1-84a1-5f50-33a443e3292a@oracle.com/

> 
> 
> [1] blktests output
> 
> scsi/007 (Trigger the SCSI error handler)                    [failed]
>      runtime  25.594s  ...  13.646s
>      --- tests/scsi/007.out      2023-04-06 10:11:07.926670528 +0900
>      +++ /home/shin/Blktests/blktests/results/nodev/scsi/007.out.bad     2023-04-14 16:09:45.447892078 +0900
>      @@ -1,3 +1,3 @@
>       Running scsi/007
>      -Reading from scsi_debug failed
>      +Reading from scsi_debug succeeded
>       Test complete
> 
> [2] kernel message
> 
> [ 3714.407999] run blktests scsi/007 at 2023-04-14 16:09:31
> [ 3714.523102] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
> [ 3714.525023] scsi host3: scsi_debug: version 0191 [20210520]
>                   dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
> [ 3714.533733] scsi 3:0:0:0: Direct-Access     Linux    scsi_debug       0191 PQ: 0 ANSI: 7
> [ 3714.543198] sd 3:0:0:0: Power-on or device reset occurred
> [ 3714.543250] sd 3:0:0:0: Attached scsi generic sg2 type 0
> [ 3714.550936] sd 3:0:0:0: [sdc] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
> [ 3714.554821] sd 3:0:0:0: [sdc] Write Protect is off
> [ 3714.558024] sd 3:0:0:0: [sdc] Mode Sense: 73 00 10 08
> [ 3714.562414] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
> [ 3714.566601] sd 3:0:0:0: [sdc] Preferred minimum I/O size 512 bytes
> [ 3714.570045] sd 3:0:0:0: [sdc] Optimal transfer size 524288 bytes
> [ 3714.586397] sd 3:0:0:0: [sdc] Attached SCSI disk
> [ 3715.999917] sd 3:0:0:0: [sdc] tag#103 abort scheduled
> [ 3716.015174] sd 3:0:0:0: [sdc] tag#103 aborting command
> [ 3716.019935] sd 3:0:0:0: [sdc] tag#103 retry aborted command
> [ 3717.090803] sd 3:0:0:0: [sdc] tag#178 previous abort failed
> [ 3717.098780] scsi host3: Waking error handler thread
> [ 3717.098917] scsi host3: scsi_eh_3: waking up 0/1/1
> [ 3717.106279] scsi host3: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1
> [ 3717.111212] scsi host3: Total of 1 commands on 1 devices require eh work
> [ 3717.116170] sd 3:0:0:0: scsi_eh_3: Sending BDR
> [ 3717.120493] sd 3:0:0:0: [sdc] tag#178 scsi_eh_done result: 2
> [ 3717.125183] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd timeleft: 10000
> [ 3717.130301] sd 3:0:0:0: Power-on or device reset occurred
> [ 3717.134935] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> [ 3717.140241] sd 3:0:0:0: [sdc] tag#178 scsi_eh_tur return: 2001
> [ 3719.033180] sd 3:0:0:0: [sdc] tag#178 scsi_eh_done result: 0
> [ 3719.037656] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd timeleft: 8114
> [ 3719.043293] sd 3:0:0:0: [sdc] tag#178 scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> [ 3719.049631] sd 3:0:0:0: [sdc] tag#178 scsi_eh_tur return: 2002
> [ 3719.055489] sd 3:0:0:0: [sdc] tag#178 scsi_eh_3: flush retry cmd
> [ 3719.061420] scsi host3: waking up host to restart
> [ 3719.069512] scsi host3: scsi_eh_3: sleeping
> [ 3720.175944] sd 3:0:0:0: [sdc] tag#180 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=1s
> [ 3720.182709] sd 3:0:0:0: [sdc] tag#180 CDB: Read(10) 28 00 00 00 3f 80 00 00 08 00
> [ 3720.189825] I/O error, dev sdc, sector 16256 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
> [ 3722.342015] sd 3:0:0:0: Power-on or device reset occurred
> [ 3725.349863] sd 3:0:0:0: Power-on or device reset occurred
> [ 3727.275069] sd 3:0:0:0: [sdc] tag#79 Medium access timeout failure. Offlining disk!
> [ 3727.279944] sd 3:0:0:0: [sdc] tag#79 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=7s
> [ 3727.285131] sd 3:0:0:0: [sdc] tag#79 CDB: Read(10) 28 00 00 00 3f 80 00 00 08 00
> [ 3727.289961] I/O error, dev sdc, sector 16256 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
> [ 3727.294862] Buffer I/O error on dev sdc, logical block 2032, async page read
> [ 3727.996161] sd 3:0:0:0: [sdc] Synchronizing SCSI cache
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-14  8:09 ` Chaitanya Kulkarni
@ 2023-04-14  8:53   ` Shin'ichiro Kawasaki
  0 siblings, 0 replies; 8+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-04-14  8:53 UTC (permalink / raw)
  To: Chaitanya Kulkarni; +Cc: Bart Van Assche, linux-scsi, linux-block

On Apr 14, 2023 / 08:09, Chaitanya Kulkarni wrote:
> On 4/14/2023 12:36 AM, Shin'ichiro Kawasaki wrote:
> > Hello Bart,
> > 
> > Recently, I built a new blktests trial environment on QEMU. With this
> > environment, I observe scsi/007 failure. FYI, let me share blktests output [1]
> > and kernel message [2].
> > 
> > I found the failure depends on kernel configs for debug such as KASAN. When I
> > enable KASAN, the test case fails. When I disable KASAN, the test case passes.
> > It looks that the failure depends on the slow kernel (and/or slow machine).
> > 
> > The test case sets 1 second to the block layer timeout to trigger the SCSI error
> > handler. It also sets 3 seconds to scsi_debug delay assuming the error handler
> > completes before the 3 seconds. From the kernel message, it looks that the error
> > handler takes longer than the 3 seconds delay, so I/O completes as success
> > before the error handler completion. This I/O success is not expected then the
> > test case fails. As a trial, I extended the scsi_debug delay time to 10 seconds,
> > then I observed the test case passes.
> > 
> > Do you expect the I/O success by slow SCSI error handler? If so, the test case
> > needs improvement by extending the scsi_debug delay time.
> > 
> > 
> 
> I faced the same problem, but increased timeout it is passing :-

Good to know that I'm not the only one seeing this issue :)

[snip]

> 
> echo "I/O timeout = $(<"/sys/class/block/$dev/queue/io_timeout")" 
>  >>"$FULL"                                                   # Change 
> the scsi_debug delay to 3 seconds. 
>                                    -       delay_s=3 
>  
>                 +       delay_s=10 

Yes, I did similar trial and observed the test case passes. While I tried it, I
observed that longer delay time is required when I add more kernel debug options
(KASAN, lock related options, etc). So, delay extension to 10 seconds may not be
enough to cover slower kernels or slower systems. I guess the test case will
need some time measurement to decide the delay time good for the system. But
before discussing such test case fix, I wanted to confirm that the delay time
extension is the right fix approach.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-14  8:33 ` John Garry
@ 2023-04-14  8:58   ` Shin'ichiro Kawasaki
  2023-04-20 12:26     ` Shin'ichiro Kawasaki
  0 siblings, 1 reply; 8+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-04-14  8:58 UTC (permalink / raw)
  To: John Garry; +Cc: Bart Van Assche, linux-scsi, linux-block

On Apr 14, 2023 / 09:33, John Garry wrote:
> On 14/04/2023 08:36, Shin'ichiro Kawasaki wrote:
> > Hello Bart,
> > 
> > Recently, I built a new blktests trial environment on QEMU. With this
> > environment, I observe scsi/007 failure. FYI, let me share blktests output [1]
> > and kernel message [2].
> > 
> 
> I did not notice which kernel you are using - did you mention it somewhere?

I forgot to mention it. Sorry. I observe the failure with kernel versions 6.2
and 6.3-rcX.

> 
> > I found the failure depends on kernel configs for debug such as KASAN. When I
> > enable KASAN, the test case fails. When I disable KASAN, the test case passes.
> > It looks that the failure depends on the slow kernel (and/or slow machine).
> > 
> > The test case sets 1 second to the block layer timeout to trigger the SCSI error
> > handler. It also sets 3 seconds to scsi_debug delay assuming the error handler
> > completes before the 3 seconds. From the kernel message, it looks that the error
> > handler takes longer than the 3 seconds delay, so I/O completes as success
> > before the error handler completion. This I/O success is not expected then the
> > test case fails. As a trial, I extended the scsi_debug delay time to 10 seconds,
> > then I observed the test case passes.
> > 
> > Do you expect the I/O success by slow SCSI error handler? If so, the test case
> > needs improvement by extending the scsi_debug delay time.
> 
> The failure may be due to one of my changes. Please see
> https://lore.kernel.org/lkml/5bdbfbbc-bac1-84a1-5f50-33a443e3292a@oracle.com/

Thanks for the notice. I think your changes were applied to 6.4/scsi-queue,
which I've not yet tried. Then it should not be related to your changes.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-14  8:58   ` Shin'ichiro Kawasaki
@ 2023-04-20 12:26     ` Shin'ichiro Kawasaki
  2023-04-20 12:59       ` John Garry
  0 siblings, 1 reply; 8+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-04-20 12:26 UTC (permalink / raw)
  To: John Garry; +Cc: Bart Van Assche, linux-scsi, linux-block

On Apr 14, 2023 / 17:58, Shin'ichiro Kawasaki wrote:
> On Apr 14, 2023 / 09:33, John Garry wrote:

[...]

> > The failure may be due to one of my changes. Please see
> > https://lore.kernel.org/lkml/5bdbfbbc-bac1-84a1-5f50-33a443e3292a@oracle.com/
> 
> Thanks for the notice. I think your changes were applied to 6.4/scsi-queue,
> which I've not yet tried. Then it should not be related to your changes.

I took a closer look in your changes for kernel v6.4, and noticed that it might
affect the scsi/007 failure I observed with kernel v6.3-rcX. I did some trials
and found these:

- On kernel v6.3-rc7 without your changes, the test case scsi/007 fails with
  unexpected read command success (The failure I found and reported).
- On kernel v6.3-rc7 with your changes until "scsi: scsi_debug: Dynamically
  allocate sdebug_queued_cmd" [1], scsi/007 fails and causes system hang.
  Kernel reported "BUG sdebug_queued_cmd". When I reverte [1] from the kernel,
  the failure symptom is same as v6.3-rc7 (no hang, no BUG).
- On kernel v6.3-rc7 with your changes including [1] and "scsi: scsi_debug:
  Abort commands from scsi_debug_device_reset()" [2], scsi/007 passes.

[1] https://lore.kernel.org/lkml/20230327074310.1862889-7-john.g.garry@oracle.com/
[2] https://lore.kernel.org/linux-scsi/20230416175654.159163-1-john.g.garry@oracle.com/

Your fix [2] intended to fix the BUG that [1] caused, but it also fixed the
scsi/007 failure I found :)


To understand the failure deeper, I added debug prints in scsi_debug, using
kernel v6.3-rc7 with your changes just before [1]. This kernel does not have the
fix [2], then it does not abort commands at device reset. When scsi error
handler does BDR, bus device reset, scsi_debug does not cancel the hrtimer for
the commands issued to the scsi_debug. This hrtimer is alive across the reset.
When that hrtimer expires, scsi_debug completes the command that issued _after_
BDR. The hrtimer for the command before BDR completes the command after BDR
since those two commands use the same scsi_cmnd and rq objects reused. Then the
command issued after BDR completes earlier than expected, and results in the
unexpected read command success and scsi/007 failure.

After applying the fix [2], scsi_debug cancels hrtimers at reset. Then, the
hrtimers started before reset do not affect the commands issued after reset.

These findings mean that the scsi/007 failure I found with kernel v6.3-rc7
indicated the bug in scsi_debug, and the commit [2] fixed it. Now I don't think
blktests side fix for scsi/007 is required. Good :)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-20 12:26     ` Shin'ichiro Kawasaki
@ 2023-04-20 12:59       ` John Garry
  2023-04-21  1:03         ` Shin'ichiro Kawasaki
  0 siblings, 1 reply; 8+ messages in thread
From: John Garry @ 2023-04-20 12:59 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki
  Cc: Bart Van Assche, linux-scsi, linux-block, dgilbert

On 20/04/2023 13:26, Shin'ichiro Kawasaki wrote:
>> Thanks for the notice. I think your changes were applied to 6.4/scsi-queue,
>> which I've not yet tried. Then it should not be related to your changes.
> I took a closer look in your changes for kernel v6.4, and noticed that it might
> affect the scsi/007 failure I observed with kernel v6.3-rcX. I did some trials
> and found these:
> 
> - On kernel v6.3-rc7 without your changes, the test case scsi/007 fails with
>    unexpected read command success (The failure I found and reported).
> - On kernel v6.3-rc7 with your changes until "scsi: scsi_debug: Dynamically
>    allocate sdebug_queued_cmd" [1], scsi/007 fails and causes system hang.
>    Kernel reported "BUG sdebug_queued_cmd". When I reverte [1] from the kernel,
>    the failure symptom is same as v6.3-rc7 (no hang, no BUG).
> - On kernel v6.3-rc7 with your changes including [1] and "scsi: scsi_debug:
>    Abort commands from scsi_debug_device_reset()" [2], scsi/007 passes.
> 
> [1]https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230327074310.1862889-7-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!LO2F4s8nfVkzWonRz3dAAjnNVnWR9BxaU3O5S0eyOQ2LfEvDoYKqox5_uN2SctlIhs5Dzq762TQTlB5jHTJX_WPW$  
> [2]https://urldefense.com/v3/__https://lore.kernel.org/linux-scsi/20230416175654.159163-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!LO2F4s8nfVkzWonRz3dAAjnNVnWR9BxaU3O5S0eyOQ2LfEvDoYKqox5_uN2SctlIhs5Dzq762TQTlB5jHa0ytu5Y$  
> 
> Your fix [2] intended to fix the BUG that [1] caused, but it also fixed the
> scsi/007 failure I found 😄

Great

> 
> 
> To understand the failure deeper, I added debug prints in scsi_debug, using
> kernel v6.3-rc7 with your changes just before [1]. This kernel does not have the
> fix [2], then it does not abort commands at device reset. When scsi error
> handler does BDR, bus device reset, scsi_debug does not cancel the hrtimer for
> the commands issued to the scsi_debug. This hrtimer is alive across the reset.
> When that hrtimer expires, scsi_debug completes the command that issued_after_
> BDR. The hrtimer for the command before BDR completes the command after BDR
> since those two commands use the same scsi_cmnd and rq objects reused. Then the
> command issued after BDR completes earlier than expected, and results in the
> unexpected read command success and scsi/007 failure.
> 
> After applying the fix [2], scsi_debug cancels hrtimers at reset. Then, the
> hrtimers started before reset do not affect the commands issued after reset.
> 
> These findings mean that the scsi/007 failure I found with kernel v6.3-rc7
> indicated the bug in scsi_debug, and the commit [2] fixed it. Now I don't think
> blktests side fix for scsi/007 is required. Good 😄

Do you know why you specifically were seeing this issue for v6.3-rc7? Is 
it just timing related? I seem to remember you mentioning debug configs 
earlier.

It would be nice to see this issue fixed for 6.3 and earlier, but, 
considering the circumstances, it doesn't look straightforward.

Thanks for the info,
John


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: blktests scsi/007 failure
  2023-04-20 12:59       ` John Garry
@ 2023-04-21  1:03         ` Shin'ichiro Kawasaki
  0 siblings, 0 replies; 8+ messages in thread
From: Shin'ichiro Kawasaki @ 2023-04-21  1:03 UTC (permalink / raw)
  To: John Garry; +Cc: Bart Van Assche, linux-scsi, linux-block, dgilbert

On Apr 20, 2023 / 13:59, John Garry wrote:
[...]
> Do you know why you specifically were seeing this issue for v6.3-rc7? Is it
> just timing related? I seem to remember you mentioning debug configs
> earlier.

The test case failure was observed with v6.2 also. I did not try older kernels,
but I guess the issue has been existing for long time. The failure is observed
by chance, and the failure ratio increases when I enable kernel debug options. I
think the issue has been hidden, and unveiled with slow kernel and slow system.
Also I think the issue is timing related based on my observations.

> 
> It would be nice to see this issue fixed for 6.3 and earlier, but,
> considering the circumstances, it doesn't look straightforward.

Agreed. The fix patch can not be applied to kernel v6.3 or older version. This
is scsi_debug issue, and I'm not sure if it's worth back-porting effort to
stable kernels.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-04-21  1:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-14  7:36 blktests scsi/007 failure Shin'ichiro Kawasaki
2023-04-14  8:09 ` Chaitanya Kulkarni
2023-04-14  8:53   ` Shin'ichiro Kawasaki
2023-04-14  8:33 ` John Garry
2023-04-14  8:58   ` Shin'ichiro Kawasaki
2023-04-20 12:26     ` Shin'ichiro Kawasaki
2023-04-20 12:59       ` John Garry
2023-04-21  1:03         ` Shin'ichiro Kawasaki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).