linux-block.vger.kernel.org archive mirror
* for-next hangs on test srp/012
@ 2018-12-11 22:58 Bart Van Assche
  2018-12-11 23:05 ` Mike Snitzer
  2018-12-12  0:02 ` Jens Axboe
  0 siblings, 2 replies; 20+ messages in thread
From: Bart Van Assche @ 2018-12-11 22:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Mike Snitzer

Hi Jens,

If I run the following subset of blktests:

  while :; do ./check -q srp && ./check -q nvmeof-mp; done

against today's for-next branch (commit dd2bf2df85a7) then after some
time the following hang is reported:

INFO: task fio:14869 blocked for more than 120 seconds.
      Not tainted 4.20.0-rc6-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fio             D25272 14869  14195 0x00000000
Call Trace:
 __schedule+0x401/0xe50
 schedule+0x4e/0xd0
 io_schedule+0x21/0x50
 blk_mq_get_tag+0x46d/0x640
 blk_mq_get_request+0x7c0/0xa00
 blk_mq_make_request+0x241/0xa70
 generic_make_request+0x411/0x950
 submit_bio+0x9b/0x250
 blkdev_direct_IO+0x7fb/0x870
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 blkdev_write_iter+0x13c/0x220
 aio_write+0x204/0x310
 io_submit_one+0x9c6/0xe70
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

When that hang occurs my list-pending-block-requests script does not show
any pending requests:

# list-pending-block-requests        
dm-0
loop0
loop1
loop2
loop3
loop4
loop5
loop6
loop7
nullb0
nullb1
sda
sdb
sdc
sdd
vda
vdb

Enabling fail_if_no_path mode did not resolve the hang so I don't think
that the root cause is in any of the dm drivers used in this test:

# dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1

The same test passes against kernel v4.20-rc6.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
@ 2018-12-11 23:05 ` Mike Snitzer
  2018-12-11 23:09   ` Bart Van Assche
  2018-12-12  0:02 ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2018-12-11 23:05 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block, dm-devel

On Tue, Dec 11 2018 at  5:58pm -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> Hi Jens,
> 
> If I run the following subset of blktests:
> 
>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> 
> against today's for-next branch (commit dd2bf2df85a7) then after some
> time the following hang is reported:

Jens has applied a fix from me today that I think is very relevant;
it'll be in tomorrow's for-next:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01

Please try with that and see how things go.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-11 23:05 ` Mike Snitzer
@ 2018-12-11 23:09   ` Bart Van Assche
  2018-12-12  0:22     ` Mike Snitzer
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2018-12-11 23:09 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel

On Tue, 2018-12-11 at 18:05 -0500, Mike Snitzer wrote:
> On Tue, Dec 11 2018 at  5:58pm -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
> 
> > Hi Jens,
> > 
> > If I run the following subset of blktests:
> > 
> >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > 
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
> 
> Jens has applied a fix from me today that I think is very relevant;
> it'll be in tomorrow's for-next:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01
> 
> Please try with that and see how things go.

Hi Mike,

Thank you for your reply. I should have mentioned that I used Jens' for-next
branch as the starting point for my tests, and that that commit is already
present in it:

$ git branch --contains c4576aed8d85d808cd6443bda58393d525207d01    
* axboe-block-for-next

Bart.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
  2018-12-11 23:05 ` Mike Snitzer
@ 2018-12-12  0:02 ` Jens Axboe
  2018-12-12  0:18   ` Bart Van Assche
  2018-12-12  0:19   ` Ming Lei
  1 sibling, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  0:02 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Mike Snitzer

On 12/11/18 3:58 PM, Bart Van Assche wrote:
> Hi Jens,
> 
> If I run the following subset of blktests:
> 
>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> 
> against today's for-next branch (commit dd2bf2df85a7) then after some
> time the following hang is reported:
> 
> INFO: task fio:14869 blocked for more than 120 seconds.
>       Not tainted 4.20.0-rc6-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> fio             D25272 14869  14195 0x00000000
> Call Trace:
>  __schedule+0x401/0xe50
>  schedule+0x4e/0xd0
>  io_schedule+0x21/0x50
>  blk_mq_get_tag+0x46d/0x640
>  blk_mq_get_request+0x7c0/0xa00
>  blk_mq_make_request+0x241/0xa70
>  generic_make_request+0x411/0x950
>  submit_bio+0x9b/0x250
>  blkdev_direct_IO+0x7fb/0x870
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  blkdev_write_iter+0x13c/0x220
>  aio_write+0x204/0x310
>  io_submit_one+0x9c6/0xe70
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> When that hang occurs my list-pending-block-requests script does not show
> any pending requests:
> 
> # list-pending-block-requests        
> dm-0
> loop0
> loop1
> loop2
> loop3
> loop4
> loop5
> loop6
> loop7
> nullb0
> nullb1
> sda
> sdb
> sdc
> sdd
> vda
> vdb
> 
> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> that the root cause is in any of the dm drivers used in this test:
> 
> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> 
> The same test passes against kernel v4.20-rc6.

What device is this being run on?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  0:02 ` Jens Axboe
@ 2018-12-12  0:18   ` Bart Van Assche
  2018-12-12  0:19   ` Ming Lei
  1 sibling, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2018-12-12  0:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Mike Snitzer

On Tue, 2018-12-11 at 17:02 -0700, Jens Axboe wrote:
> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > Hi Jens,
> > 
> > If I run the following subset of blktests:
> > 
> >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > 
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
> > 
> > INFO: task fio:14869 blocked for more than 120 seconds.
> >       Not tainted 4.20.0-rc6-dbg+ #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > fio             D25272 14869  14195 0x00000000
> > Call Trace:
> >  __schedule+0x401/0xe50
> >  schedule+0x4e/0xd0
> >  io_schedule+0x21/0x50
> >  blk_mq_get_tag+0x46d/0x640
> >  blk_mq_get_request+0x7c0/0xa00
> >  blk_mq_make_request+0x241/0xa70
> >  generic_make_request+0x411/0x950
> >  submit_bio+0x9b/0x250
> >  blkdev_direct_IO+0x7fb/0x870
> >  generic_file_direct_write+0x119/0x210
> >  __generic_file_write_iter+0x11c/0x280
> >  blkdev_write_iter+0x13c/0x220
> >  aio_write+0x204/0x310
> >  io_submit_one+0x9c6/0xe70
> >  __x64_sys_io_submit+0x115/0x340
> >  do_syscall_64+0x71/0x210
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > When that hang occurs my list-pending-block-requests script does not show
> > any pending requests:
> > 
> > # list-pending-block-requests        
> > dm-0
> > loop0
> > loop1
> > loop2
> > loop3
> > loop4
> > loop5
> > loop6
> > loop7
> > nullb0
> > nullb1
> > sda
> > sdb
> > sdc
> > sdd
> > vda
> > vdb
> > 
> > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > that the root cause is in any of the dm drivers used in this test:
> > 
> > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > 
> > The same test passes against kernel v4.20-rc6.
> 
> What device is this being run on?

Older versions of the srp and nvmeof-mp tests used the brd block device.
Today these tests use null_blk with memory_backed set to 1; see also
configure_null_blk() in common/multipath-over-rdma. null_blk is accessed
by ib_srpt. The dm-mpath driver is stacked on top of the ib_srp instance
that communicates with the ib_srpt driver. The ib_srp and ib_srpt drivers
communicate with each other over the loopback functionality of the
rdma_rxe driver.
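
In case it helps, this is roughly how such a memory-backed null_blk setup can
be created by hand (the parameters below are illustrative; the exact values
that configure_null_blk() passes may differ):

  modprobe -r null_blk
  modprobe null_blk nr_devices=2 queue_mode=2 memory_backed=1 gb=1
  ls /dev/nullb*    # nullb0 and nullb1, which ib_srpt then exports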

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  0:02 ` Jens Axboe
  2018-12-12  0:18   ` Bart Van Assche
@ 2018-12-12  0:19   ` Ming Lei
  2018-12-12  0:27     ` Mike Snitzer
  1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12  0:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bart Van Assche, linux-block, Mike Snitzer

On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > Hi Jens,
> >
> > If I run the following subset of blktests:
> >
> >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
> >
> > INFO: task fio:14869 blocked for more than 120 seconds.
> >       Not tainted 4.20.0-rc6-dbg+ #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > fio             D25272 14869  14195 0x00000000
> > Call Trace:
> >  __schedule+0x401/0xe50
> >  schedule+0x4e/0xd0
> >  io_schedule+0x21/0x50
> >  blk_mq_get_tag+0x46d/0x640
> >  blk_mq_get_request+0x7c0/0xa00
> >  blk_mq_make_request+0x241/0xa70
> >  generic_make_request+0x411/0x950
> >  submit_bio+0x9b/0x250
> >  blkdev_direct_IO+0x7fb/0x870
> >  generic_file_direct_write+0x119/0x210
> >  __generic_file_write_iter+0x11c/0x280
> >  blkdev_write_iter+0x13c/0x220
> >  aio_write+0x204/0x310
> >  io_submit_one+0x9c6/0xe70
> >  __x64_sys_io_submit+0x115/0x340
> >  do_syscall_64+0x71/0x210
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > When that hang occurs my list-pending-block-requests script does not show
> > any pending requests:
> >
> > # list-pending-block-requests
> > dm-0
> > loop0
> > loop1
> > loop2
> > loop3
> > loop4
> > loop5
> > loop6
> > loop7
> > nullb0
> > nullb1
> > sda
> > sdb
> > sdc
> > sdd
> > vda
> > vdb
> >
> > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > that the root cause is in any of the dm drivers used in this test:
> >
> > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >
> > The same test passes against kernel v4.20-rc6.
>
> What device is this being run on?

I saw this issue on usb storage too.

Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
the sched_tag's sbitmap queue.
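
For context, here is a simplified user-space sketch of the bookkeeping that
commit introduced (names are paraphrased from lib/sbitmap.c, not the exact
kernel code): a freed bit first lands in ->cleared, and until it is folded
back into ->word the allocator keeps seeing the word as full even though
nothing is actually busy.

#include <stdio.h>

/* Simplified model of one sbitmap word after the deferred-clear change. */
struct word {
	unsigned long word;     /* allocated ("busy") bits */
	unsigned long cleared;  /* freed bits not yet folded back into .word */
};

/* Freeing a tag only records it in .cleared; .word still looks full. */
static void deferred_clear_bit(struct word *w, int bit)
{
	w->cleared |= 1UL << bit;
}

/* Allocation scans .word only, so it fails until .cleared is folded back. */
static int get_bit(struct word *w, int depth)
{
	int bit;

	for (bit = 0; bit < depth; bit++) {
		if (!(w->word & (1UL << bit))) {
			w->word |= 1UL << bit;
			return bit;
		}
	}
	return -1;
}

int main(void)
{
	struct word w = { .word = 0x3, .cleared = 0 }; /* depth 2, both in use */

	deferred_clear_bit(&w, 1);               /* tag 1 freed: .cleared == 2 */
	printf("get_bit: %d\n", get_bit(&w, 2)); /* -1: allocator still waits */
	return 0;
}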

thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-11 23:09   ` Bart Van Assche
@ 2018-12-12  0:22     ` Mike Snitzer
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Snitzer @ 2018-12-12  0:22 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block, dm-devel

On Tue, Dec 11 2018 at  6:09pm -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On Tue, 2018-12-11 at 18:05 -0500, Mike Snitzer wrote:
> > On Tue, Dec 11 2018 at  5:58pm -0500,
> > Bart Van Assche <bvanassche@acm.org> wrote:
> > 
> > > Hi Jens,
> > > 
> > > If I run the following subset of blktests:
> > > 
> > >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > > 
> > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > time the following hang is reported:
> > 
> > Jens has applied a fix from me today that I think is very relevant;
> > it'll be in tomorrow's for-next:
> > 
> > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01
> > 
> > Please try with that and see how things go.
> 
> Hi Mike,
> 
> Thank you for your reply. I should have mentioned that I used Jens' for-next
> branch as starting point and that that commit is present in the branch I used
> as a starting point for my tests:
> 
> $ git branch --contains c4576aed8d85d808cd6443bda58393d525207d01    
> * axboe-block-for-next

next-20181211 doesn't contain c4576aed8d85d808cd6443bda58393d525207d01

I'll try to reproduce, but in the meantime can you just try Jens'
for-4.21/block? It is currently at 2c4d5356e64.
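
Something like this should get you onto that branch (remote name and URL are
assumed; adjust to your local setup):

  git remote add axboe-block git://git.kernel.dk/linux-block.git
  git fetch axboe-block for-4.21/block
  git checkout -b for-4.21-block-test axboe-block/for-4.21/block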

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  0:19   ` Ming Lei
@ 2018-12-12  0:27     ` Mike Snitzer
  2018-12-12  0:38       ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2018-12-12  0:27 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, Bart Van Assche, linux-block

On Tue, Dec 11 2018 at  7:19pm -0500,
Ming Lei <tom.leiming@gmail.com> wrote:

> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > Hi Jens,
> > >
> > > If I run the following subset of blktests:
> > >
> > >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > >
> > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > time the following hang is reported:
> > >
> > > INFO: task fio:14869 blocked for more than 120 seconds.
> > >       Not tainted 4.20.0-rc6-dbg+ #1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > fio             D25272 14869  14195 0x00000000
> > > Call Trace:
> > >  __schedule+0x401/0xe50
> > >  schedule+0x4e/0xd0
> > >  io_schedule+0x21/0x50
> > >  blk_mq_get_tag+0x46d/0x640
> > >  blk_mq_get_request+0x7c0/0xa00
> > >  blk_mq_make_request+0x241/0xa70
> > >  generic_make_request+0x411/0x950
> > >  submit_bio+0x9b/0x250
> > >  blkdev_direct_IO+0x7fb/0x870
> > >  generic_file_direct_write+0x119/0x210
> > >  __generic_file_write_iter+0x11c/0x280
> > >  blkdev_write_iter+0x13c/0x220
> > >  aio_write+0x204/0x310
> > >  io_submit_one+0x9c6/0xe70
> > >  __x64_sys_io_submit+0x115/0x340
> > >  do_syscall_64+0x71/0x210
> > >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >
> > > When that hang occurs my list-pending-block-requests script does not show
> > > any pending requests:
> > >
> > > # list-pending-block-requests
> > > dm-0
> > > loop0
> > > loop1
> > > loop2
> > > loop3
> > > loop4
> > > loop5
> > > loop6
> > > loop7
> > > nullb0
> > > nullb1
> > > sda
> > > sdb
> > > sdc
> > > sdd
> > > vda
> > > vdb
> > >
> > > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > > that the root cause is in any of the dm drivers used in this test:
> > >
> > > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > >
> > > The same test passes against kernel v4.20-rc6.
> >
> > What device is this being run on?
> 
> I saw this issue on usb storage too.
> 
> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> the sched_tag's sbitmap queue.

You saw this running the same tests as Bart?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  0:27     ` Mike Snitzer
@ 2018-12-12  0:38       ` Ming Lei
  2018-12-12  1:05         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12  0:38 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, Bart Van Assche, linux-block

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Tue, Dec 11 2018 at  7:19pm -0500,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
> > On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > > On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > > Hi Jens,
> > > >
> > > > If I run the following subset of blktests:
> > > >
> > > >   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > > >
> > > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > > time the following hang is reported:
> > > >
> > > > INFO: task fio:14869 blocked for more than 120 seconds.
> > > >       Not tainted 4.20.0-rc6-dbg+ #1
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > fio             D25272 14869  14195 0x00000000
> > > > Call Trace:
> > > >  __schedule+0x401/0xe50
> > > >  schedule+0x4e/0xd0
> > > >  io_schedule+0x21/0x50
> > > >  blk_mq_get_tag+0x46d/0x640
> > > >  blk_mq_get_request+0x7c0/0xa00
> > > >  blk_mq_make_request+0x241/0xa70
> > > >  generic_make_request+0x411/0x950
> > > >  submit_bio+0x9b/0x250
> > > >  blkdev_direct_IO+0x7fb/0x870
> > > >  generic_file_direct_write+0x119/0x210
> > > >  __generic_file_write_iter+0x11c/0x280
> > > >  blkdev_write_iter+0x13c/0x220
> > > >  aio_write+0x204/0x310
> > > >  io_submit_one+0x9c6/0xe70
> > > >  __x64_sys_io_submit+0x115/0x340
> > > >  do_syscall_64+0x71/0x210
> > > >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > >
> > > > When that hang occurs my list-pending-block-requests script does not show
> > > > any pending requests:
> > > >
> > > > # list-pending-block-requests
> > > > dm-0
> > > > loop0
> > > > loop1
> > > > loop2
> > > > loop3
> > > > loop4
> > > > loop5
> > > > loop6
> > > > loop7
> > > > nullb0
> > > > nullb1
> > > > sda
> > > > sdb
> > > > sdc
> > > > sdd
> > > > vda
> > > > vdb
> > > >
> > > > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > > > that the root cause is in any of the dm drivers used in this test:
> > > >
> > > > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > > > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > > >
> > > > The same test passes against kernel v4.20-rc6.
> > >
> > > What device is this being run on?
> >
> > I saw this issue on usb storage too.
> >
> > Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > the sched_tag's sbitmap queue.
>
> You saw this running the same tests as Bart?

Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
see the attached test script.

Mostly it can be triggered in one run, sometimes it needs more.

Thanks
Ming Lei

[-- Attachment #2: parted --]
[-- Type: application/octet-stream, Size: 953 bytes --]

#!/bin/bash
#
# Usage: ./parted <disk> <count>
# Re-creates the partition table on <disk> <count> times with each available
# I/O scheduler.

parted_disk()
{
	DISK=$1
	CNT=$2

	# Device size in MB (blockdev --getsz reports 512-byte sectors).
	DEV_SIZE=`blockdev --getsz $DISK`
	DEV_SIZE=`expr ${DEV_SIZE} / 2048`MB

	DEV_NAME=`basename $DISK`
	Q_PATH=/sys/block/$DEV_NAME/queue

	# Report the SCSI queue depth if the device exposes one.
	if [ -f $Q_PATH/../device/queue_depth ]; then
		MY_SCSI_QD=`cat $Q_PATH/../device/queue_depth`
	else
		MY_SCSI_QD="N/A"
	fi
	# List of available I/O schedulers, with the [brackets] stripped.
	MQ_IOSCHEDS=`sed 's/[][]//g' $Q_PATH/scheduler`

	for SCHED in $MQ_IOSCHEDS; do
		echo $SCHED > $Q_PATH/scheduler
		MY_SCHED=`cat $Q_PATH/scheduler | sed -n 's/.*\[\(.*\)\].*/\1/p'`
		echo "Dev. $DEV_NAME, SCHED: $MY_SCHED, SCSI QD: $MY_SCSI_QD"
		count=0
		wipefs -a $DISK && blockdev --rereadpt $DISK
		# Repartition the disk CNT times with the current scheduler.
		while [ $count -lt $CNT ]; do
			parted -m -s $DISK mklabel msdos &&
					parted -m -s -a none $DISK mkpart primary 0MB 32MB &&
					parted -m -s -a none $DISK mkpart primary 32MB $DEV_SIZE &&
					parted -s $DISK p
			let count++
			if ! (( count % 10 )); then
				echo $count
			fi
		done
	done
}

DISK=$1
CNT=$2
parted_disk $DISK $CNT

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  0:38       ` Ming Lei
@ 2018-12-12  1:05         ` Jens Axboe
  2018-12-12  1:23           ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  1:05 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block

On 12/11/18 5:38 PM, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>
>> On Tue, Dec 11 2018 at  7:19pm -0500,
>> Ming Lei <tom.leiming@gmail.com> wrote:
>>
>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>> Hi Jens,
>>>>>
>>>>> If I run the following subset of blktests:
>>>>>
>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>
>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>> time the following hang is reported:
>>>>>
>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> fio             D25272 14869  14195 0x00000000
>>>>> Call Trace:
>>>>>  __schedule+0x401/0xe50
>>>>>  schedule+0x4e/0xd0
>>>>>  io_schedule+0x21/0x50
>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>  generic_make_request+0x411/0x950
>>>>>  submit_bio+0x9b/0x250
>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>  generic_file_direct_write+0x119/0x210
>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>  aio_write+0x204/0x310
>>>>>  io_submit_one+0x9c6/0xe70
>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>  do_syscall_64+0x71/0x210
>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>
>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>> any pending requests:
>>>>>
>>>>> # list-pending-block-requests
>>>>> dm-0
>>>>> loop0
>>>>> loop1
>>>>> loop2
>>>>> loop3
>>>>> loop4
>>>>> loop5
>>>>> loop6
>>>>> loop7
>>>>> nullb0
>>>>> nullb1
>>>>> sda
>>>>> sdb
>>>>> sdc
>>>>> sdd
>>>>> vda
>>>>> vdb
>>>>>
>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>
>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>
>>>>> The same test passes against kernel v4.20-rc6.
>>>>
>>>> What device is this being run on?
>>>
>>> I saw this issue on usb storage too.
>>>
>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>> the sched_tag's sbitmap queue.
>>
>> You saw this running the same tests as Bart?
> 
> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> see the attached test script.
> 
> Mostly it can be triggered in one run, sometimes it needs more.

I'll take a look. The ->cleared doesn't make sense for QD=1, or on
one word in general. But I'd like to try and understand why it hangs.

Are you using a scheduler?

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..e6f376287ead 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,50 @@
 #include <linux/sbitmap.h>
 #include <linux/seq_file.h>
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+	unsigned long __maybe_unused flags;
+	bool ret = false;
+
+	if (sb->map_nr <= 1)
+		return false;
+
+	/* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+	local_irq_save(flags);
+#endif
+	spin_lock(&sb->map[index].swap_lock);
+
+	if (!sb->map[index].cleared)
+		goto out_unlock;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	ret = true;
+out_unlock:
+	spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+	local_irq_restore(flags);
+#endif
+	return ret;
+}
+
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      gfp_t flags, int node)
 {
@@ -70,6 +114,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int bits_per_word = 1U << sb->shift;
 	unsigned int i;
 
+	for (i = 0; i < sb->map_nr; i++)
+		sbitmap_deferred_clear(sb, i);
+
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
 
@@ -112,47 +159,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
-	unsigned long mask, val;
-	unsigned long __maybe_unused flags;
-	bool ret = false;
-
-	/* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
-	local_irq_save(flags);
-#endif
-	spin_lock(&sb->map[index].swap_lock);
-
-	if (!sb->map[index].cleared)
-		goto out_unlock;
-
-	/*
-	 * First get a stable cleared mask, setting the old mask to 0.
-	 */
-	do {
-		mask = sb->map[index].cleared;
-	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
-	/*
-	 * Now clear the masked bits in our free word
-	 */
-	do {
-		val = sb->map[index].word;
-	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
-	ret = true;
-out_unlock:
-	spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
-	local_irq_restore(flags);
-#endif
-	return ret;
-}
-
 static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 				     unsigned int alloc_hint, bool round_robin)
 {
@@ -590,7 +596,12 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
 void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
 			 unsigned int cpu)
 {
-	sbitmap_deferred_clear_bit(&sbq->sb, nr);
+	struct sbitmap *sb = &sbq->sb;
+
+	if (sbq->sb.map_nr > 1)
+		sbitmap_deferred_clear_bit(sb, nr);
+	else
+		sbitmap_clear_bit_unlock(sb, nr);
 
 	/*
 	 * Pairs with the memory barrier in set_current_state() to ensure the

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:05         ` Jens Axboe
@ 2018-12-12  1:23           ` Jens Axboe
  2018-12-12  1:36             ` Jens Axboe
  2018-12-12  1:37             ` Ming Lei
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  1:23 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block

On 12/11/18 6:05 PM, Jens Axboe wrote:
> On 12/11/18 5:38 PM, Ming Lei wrote:
>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>> On Tue, Dec 11 2018 at  7:19pm -0500,
>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>
>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> If I run the following subset of blktests:
>>>>>>
>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>
>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>> time the following hang is reported:
>>>>>>
>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> fio             D25272 14869  14195 0x00000000
>>>>>> Call Trace:
>>>>>>  __schedule+0x401/0xe50
>>>>>>  schedule+0x4e/0xd0
>>>>>>  io_schedule+0x21/0x50
>>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>>  generic_make_request+0x411/0x950
>>>>>>  submit_bio+0x9b/0x250
>>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>>  aio_write+0x204/0x310
>>>>>>  io_submit_one+0x9c6/0xe70
>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>  do_syscall_64+0x71/0x210
>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>
>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>> any pending requests:
>>>>>>
>>>>>> # list-pending-block-requests
>>>>>> dm-0
>>>>>> loop0
>>>>>> loop1
>>>>>> loop2
>>>>>> loop3
>>>>>> loop4
>>>>>> loop5
>>>>>> loop6
>>>>>> loop7
>>>>>> nullb0
>>>>>> nullb1
>>>>>> sda
>>>>>> sdb
>>>>>> sdc
>>>>>> sdd
>>>>>> vda
>>>>>> vdb
>>>>>>
>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>
>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>
>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>
>>>>> What device is this being run on?
>>>>
>>>> I saw this issue on usb storage too.
>>>>
>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>> the sched_tag's sbitmap queue.
>>>
>>> You saw this running the same tests as Bart?
>>
>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>> see the attached test script.
>>
>> Mostly it can be triggered in one run, sometimes it needs more.
> 
> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> one word in general. But I'd like to try and understand why it hangs.
> 
> Are you using a scheduler?

OK, I think I see what it is: the shallow path is missing the deferred clear.
On top of this, it's probably worth checking at what depths the deferred
clearing starts to make sense. For QD == 1, definitely not. But that should be
on top of the fix.

Can you try this one?


diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..d98ba7af6bce 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,47 @@
 #include <linux/sbitmap.h>
 #include <linux/seq_file.h>
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+	unsigned long __maybe_unused flags;
+	bool ret = false;
+
+	/* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+	local_irq_save(flags);
+#endif
+	spin_lock(&sb->map[index].swap_lock);
+
+	if (!sb->map[index].cleared)
+		goto out_unlock;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	ret = true;
+out_unlock:
+	spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+	local_irq_restore(flags);
+#endif
+	return ret;
+}
+
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      gfp_t flags, int node)
 {
@@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int bits_per_word = 1U << sb->shift;
 	unsigned int i;
 
+	for (i = 0; i < sb->map_nr; i++)
+		sbitmap_deferred_clear(sb, i);
+
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
 
@@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
-	unsigned long mask, val;
-	unsigned long __maybe_unused flags;
-	bool ret = false;
-
-	/* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
-	local_irq_save(flags);
-#endif
-	spin_lock(&sb->map[index].swap_lock);
-
-	if (!sb->map[index].cleared)
-		goto out_unlock;
-
-	/*
-	 * First get a stable cleared mask, setting the old mask to 0.
-	 */
-	do {
-		mask = sb->map[index].cleared;
-	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
-	/*
-	 * Now clear the masked bits in our free word
-	 */
-	do {
-		val = sb->map[index].word;
-	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
-	ret = true;
-out_unlock:
-	spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
-	local_irq_restore(flags);
-#endif
-	return ret;
-}
-
 static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 				     unsigned int alloc_hint, bool round_robin)
 {
@@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
 	index = SB_NR_TO_INDEX(sb, alloc_hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
+again:
 		nr = __sbitmap_get_word(&sb->map[index].word,
 					min(sb->map[index].depth, shallow_depth),
 					SB_NR_TO_BIT(sb, alloc_hint), true);
@@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
 			break;
 		}
 
+		if (sbitmap_deferred_clear(sb, index))
+			goto again;
+
 		/* Jump to next index. */
 		index++;
 		alloc_hint = index << sb->shift;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:23           ` Jens Axboe
@ 2018-12-12  1:36             ` Jens Axboe
  2018-12-12  1:43               ` Ming Lei
  2018-12-12  1:37             ` Ming Lei
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  1:36 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block

On 12/11/18 6:23 PM, Jens Axboe wrote:
> On 12/11/18 6:05 PM, Jens Axboe wrote:
>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>
>>>> On Tue, Dec 11 2018 at  7:19pm -0500,
>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>
>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> If I run the following subset of blktests:
>>>>>>>
>>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>
>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>> time the following hang is reported:
>>>>>>>
>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>> fio             D25272 14869  14195 0x00000000
>>>>>>> Call Trace:
>>>>>>>  __schedule+0x401/0xe50
>>>>>>>  schedule+0x4e/0xd0
>>>>>>>  io_schedule+0x21/0x50
>>>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>>>  generic_make_request+0x411/0x950
>>>>>>>  submit_bio+0x9b/0x250
>>>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>>>  aio_write+0x204/0x310
>>>>>>>  io_submit_one+0x9c6/0xe70
>>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>>  do_syscall_64+0x71/0x210
>>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>
>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>> any pending requests:
>>>>>>>
>>>>>>> # list-pending-block-requests
>>>>>>> dm-0
>>>>>>> loop0
>>>>>>> loop1
>>>>>>> loop2
>>>>>>> loop3
>>>>>>> loop4
>>>>>>> loop5
>>>>>>> loop6
>>>>>>> loop7
>>>>>>> nullb0
>>>>>>> nullb1
>>>>>>> sda
>>>>>>> sdb
>>>>>>> sdc
>>>>>>> sdd
>>>>>>> vda
>>>>>>> vdb
>>>>>>>
>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>
>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>
>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>
>>>>>> What device is this being run on?
>>>>>
>>>>> I saw this issue on usb storage too.
>>>>>
>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>> the sched_tag's sbitmap queue.
>>>>
>>>> You saw this running the same tests as Bart?
>>>
>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>> see the attached test script.
>>>
>>> Mostly it can be triggered in one run, sometimes it needs more.
>>
>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>> one word in general. But I'd like to try and understand why it hangs.
>>
>> Are you using a scheduler?
> 
> OK, I think I see what it is, the shallow is missing the deferred clear.
> On top of this, probably worth to check at what depths deferred starts
> to make sense. For QD == 1, definitely not. But that should be on top of
> the fix.
> 
> Can you try this one?

This one should be complete. The former should fix the case; this one
just adds the last two cases where we ignored ->cleared. For some of
them we'll never hit it since deferred clearing isn't used there, but I
think we're better off with full coverage.


diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..5b3e56d68dab 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,47 @@
 #include <linux/sbitmap.h>
 #include <linux/seq_file.h>
 
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+	unsigned long mask, val;
+	unsigned long __maybe_unused flags;
+	bool ret = false;
+
+	/* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+	local_irq_save(flags);
+#endif
+	spin_lock(&sb->map[index].swap_lock);
+
+	if (!sb->map[index].cleared)
+		goto out_unlock;
+
+	/*
+	 * First get a stable cleared mask, setting the old mask to 0.
+	 */
+	do {
+		mask = sb->map[index].cleared;
+	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+	/*
+	 * Now clear the masked bits in our free word
+	 */
+	do {
+		val = sb->map[index].word;
+	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+	ret = true;
+out_unlock:
+	spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+	local_irq_restore(flags);
+#endif
+	return ret;
+}
+
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
 		      gfp_t flags, int node)
 {
@@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
 	unsigned int bits_per_word = 1U << sb->shift;
 	unsigned int i;
 
+	for (i = 0; i < sb->map_nr; i++)
+		sbitmap_deferred_clear(sb, i);
+
 	sb->depth = depth;
 	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
 
@@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
 	return nr;
 }
 
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
-	unsigned long mask, val;
-	unsigned long __maybe_unused flags;
-	bool ret = false;
-
-	/* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
-	local_irq_save(flags);
-#endif
-	spin_lock(&sb->map[index].swap_lock);
-
-	if (!sb->map[index].cleared)
-		goto out_unlock;
-
-	/*
-	 * First get a stable cleared mask, setting the old mask to 0.
-	 */
-	do {
-		mask = sb->map[index].cleared;
-	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
-	/*
-	 * Now clear the masked bits in our free word
-	 */
-	do {
-		val = sb->map[index].word;
-	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
-	ret = true;
-out_unlock:
-	spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
-	local_irq_restore(flags);
-#endif
-	return ret;
-}
-
 static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
 				     unsigned int alloc_hint, bool round_robin)
 {
@@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
 	index = SB_NR_TO_INDEX(sb, alloc_hint);
 
 	for (i = 0; i < sb->map_nr; i++) {
+again:
 		nr = __sbitmap_get_word(&sb->map[index].word,
 					min(sb->map[index].depth, shallow_depth),
 					SB_NR_TO_BIT(sb, alloc_hint), true);
@@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
 			break;
 		}
 
+		if (sbitmap_deferred_clear(sb, index))
+			goto again;
+
 		/* Jump to next index. */
 		index++;
 		alloc_hint = index << sb->shift;
@@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
 	unsigned int i;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		if (sb->map[i].word)
+		if (sb->map[i].word & ~sb->map[i].cleared)
 			return true;
 	}
 	return false;
@@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
 
 	for (i = 0; i < sb->map_nr; i++) {
 		const struct sbitmap_word *word = &sb->map[i];
+		unsigned long mask = word->word & ~word->cleared;
 		unsigned long ret;
 
-		ret = find_first_zero_bit(&word->word, word->depth);
+		ret = find_first_zero_bit(&mask, word->depth);
 		if (ret < word->depth)
 			return true;
 	}

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:23           ` Jens Axboe
  2018-12-12  1:36             ` Jens Axboe
@ 2018-12-12  1:37             ` Ming Lei
  2018-12-12  1:39               ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12  1:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On Tue, Dec 11, 2018 at 06:23:31PM -0700, Jens Axboe wrote:
> On 12/11/18 6:05 PM, Jens Axboe wrote:
> > On 12/11/18 5:38 PM, Ming Lei wrote:
> >> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>
> >>> On Tue, Dec 11 2018 at  7:19pm -0500,
> >>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>
> >>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>> Hi Jens,
> >>>>>>
> >>>>>> If I run the following subset of blktests:
> >>>>>>
> >>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>
> >>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>> time the following hang is reported:
> >>>>>>
> >>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>> fio             D25272 14869  14195 0x00000000
> >>>>>> Call Trace:
> >>>>>>  __schedule+0x401/0xe50
> >>>>>>  schedule+0x4e/0xd0
> >>>>>>  io_schedule+0x21/0x50
> >>>>>>  blk_mq_get_tag+0x46d/0x640
> >>>>>>  blk_mq_get_request+0x7c0/0xa00
> >>>>>>  blk_mq_make_request+0x241/0xa70
> >>>>>>  generic_make_request+0x411/0x950
> >>>>>>  submit_bio+0x9b/0x250
> >>>>>>  blkdev_direct_IO+0x7fb/0x870
> >>>>>>  generic_file_direct_write+0x119/0x210
> >>>>>>  __generic_file_write_iter+0x11c/0x280
> >>>>>>  blkdev_write_iter+0x13c/0x220
> >>>>>>  aio_write+0x204/0x310
> >>>>>>  io_submit_one+0x9c6/0xe70
> >>>>>>  __x64_sys_io_submit+0x115/0x340
> >>>>>>  do_syscall_64+0x71/0x210
> >>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>
> >>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>> any pending requests:
> >>>>>>
> >>>>>> # list-pending-block-requests
> >>>>>> dm-0
> >>>>>> loop0
> >>>>>> loop1
> >>>>>> loop2
> >>>>>> loop3
> >>>>>> loop4
> >>>>>> loop5
> >>>>>> loop6
> >>>>>> loop7
> >>>>>> nullb0
> >>>>>> nullb1
> >>>>>> sda
> >>>>>> sdb
> >>>>>> sdc
> >>>>>> sdd
> >>>>>> vda
> >>>>>> vdb
> >>>>>>
> >>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>
> >>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>
> >>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>
> >>>>> What device is this being run on?
> >>>>
> >>>> I saw this issue on usb storage too.
> >>>>
> >>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>> the sched_tag's sbitmap queue.
> >>>
> >>> You saw this running the same tests as Bart?
> >>
> >> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >> see the attached test script.
> >>
> >> Mostly it can be triggered in one run, sometimes it needs more.
> > 
> > I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > one word in general. But I'd like to try and understand why it hangs.
> > 
> > Are you using a scheduler?
> 
> OK, I think I see what it is, the shallow is missing the deferred clear.
> On top of this, probably worth to check at what depths deferred starts
> to make sense. For QD == 1, definitely not. But that should be on top of
> the fix.
> 
> Can you try this one?
> 
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 2261136ae067..d98ba7af6bce 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -20,6 +20,47 @@
>  #include <linux/sbitmap.h>
>  #include <linux/seq_file.h>
>  
> +/*
> + * See if we have deferred clears that we can batch move
> + */
> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> +{
> +	unsigned long mask, val;
> +	unsigned long __maybe_unused flags;
> +	bool ret = false;
> +
> +	/* Silence bogus lockdep warning */
> +#if defined(CONFIG_LOCKDEP)
> +	local_irq_save(flags);
> +#endif
> +	spin_lock(&sb->map[index].swap_lock);
> +
> +	if (!sb->map[index].cleared)
> +		goto out_unlock;
> +
> +	/*
> +	 * First get a stable cleared mask, setting the old mask to 0.
> +	 */
> +	do {
> +		mask = sb->map[index].cleared;
> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> +
> +	/*
> +	 * Now clear the masked bits in our free word
> +	 */
> +	do {
> +		val = sb->map[index].word;
> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> +
> +	ret = true;
> +out_unlock:
> +	spin_unlock(&sb->map[index].swap_lock);
> +#if defined(CONFIG_LOCKDEP)
> +	local_irq_restore(flags);
> +#endif
> +	return ret;
> +}
> +
>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>  		      gfp_t flags, int node)
>  {
> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>  	unsigned int bits_per_word = 1U << sb->shift;
>  	unsigned int i;
>  
> +	for (i = 0; i < sb->map_nr; i++)
> +		sbitmap_deferred_clear(sb, i);
> +
>  	sb->depth = depth;
>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>  
> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>  	return nr;
>  }
>  
> -/*
> - * See if we have deferred clears that we can batch move
> - */
> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> -{
> -	unsigned long mask, val;
> -	unsigned long __maybe_unused flags;
> -	bool ret = false;
> -
> -	/* Silence bogus lockdep warning */
> -#if defined(CONFIG_LOCKDEP)
> -	local_irq_save(flags);
> -#endif
> -	spin_lock(&sb->map[index].swap_lock);
> -
> -	if (!sb->map[index].cleared)
> -		goto out_unlock;
> -
> -	/*
> -	 * First get a stable cleared mask, setting the old mask to 0.
> -	 */
> -	do {
> -		mask = sb->map[index].cleared;
> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> -
> -	/*
> -	 * Now clear the masked bits in our free word
> -	 */
> -	do {
> -		val = sb->map[index].word;
> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> -
> -	ret = true;
> -out_unlock:
> -	spin_unlock(&sb->map[index].swap_lock);
> -#if defined(CONFIG_LOCKDEP)
> -	local_irq_restore(flags);
> -#endif
> -	return ret;
> -}
> -
>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>  				     unsigned int alloc_hint, bool round_robin)
>  {
> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
>  
>  	for (i = 0; i < sb->map_nr; i++) {
> +again:
>  		nr = __sbitmap_get_word(&sb->map[index].word,
>  					min(sb->map[index].depth, shallow_depth),
>  					SB_NR_TO_BIT(sb, alloc_hint), true);
> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>  			break;
>  		}
>  
> +		if (sbitmap_deferred_clear(sb, index))
> +			goto again;
> +
>  		/* Jump to next index. */
>  		index++;
>  		alloc_hint = index << sb->shift;

Yeah, this one does work.

thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:37             ` Ming Lei
@ 2018-12-12  1:39               ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  1:39 UTC (permalink / raw)
  To: Ming Lei; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On 12/11/18 6:37 PM, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:23:31PM -0700, Jens Axboe wrote:
>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>
>>>>> On Tue, Dec 11 2018 at  7:19pm -0500,
>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>
>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>> Hi Jens,
>>>>>>>>
>>>>>>>> If I run the following subset of blktests:
>>>>>>>>
>>>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>>
>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>>> time the following hang is reported:
>>>>>>>>
>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>> fio             D25272 14869  14195 0x00000000
>>>>>>>> Call Trace:
>>>>>>>>  __schedule+0x401/0xe50
>>>>>>>>  schedule+0x4e/0xd0
>>>>>>>>  io_schedule+0x21/0x50
>>>>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>>>>  generic_make_request+0x411/0x950
>>>>>>>>  submit_bio+0x9b/0x250
>>>>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>>>>  aio_write+0x204/0x310
>>>>>>>>  io_submit_one+0x9c6/0xe70
>>>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>>>  do_syscall_64+0x71/0x210
>>>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>>
>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>>> any pending requests:
>>>>>>>>
>>>>>>>> # list-pending-block-requests
>>>>>>>> dm-0
>>>>>>>> loop0
>>>>>>>> loop1
>>>>>>>> loop2
>>>>>>>> loop3
>>>>>>>> loop4
>>>>>>>> loop5
>>>>>>>> loop6
>>>>>>>> loop7
>>>>>>>> nullb0
>>>>>>>> nullb1
>>>>>>>> sda
>>>>>>>> sdb
>>>>>>>> sdc
>>>>>>>> sdd
>>>>>>>> vda
>>>>>>>> vdb
>>>>>>>>
>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>>
>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>>
>>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>>
>>>>>>> What device is this being run on?
>>>>>>
>>>>>> I saw this issue on usb storage too.
>>>>>>
>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>> the sched_tag's sbitmap queue.
>>>>>
>>>>> You saw this running the same tests as Bart?
>>>>
>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>> see the attached test script.
>>>>
>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>
>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>> one word in general. But I'd like to try and understand why it hangs.
>>>
>>> Are you using a scheduler?
>>
>> OK, I think I see what it is, the shallow is missing the deferred clear.
>> On top of this, probably worth to check at what depths deferred starts
>> to make sense. For QD == 1, definitely not. But that should be on top of
>> the fix.
>>
>> Can you try this one?
>>
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 2261136ae067..d98ba7af6bce 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -20,6 +20,47 @@
>>  #include <linux/sbitmap.h>
>>  #include <linux/seq_file.h>
>>  
>> +/*
>> + * See if we have deferred clears that we can batch move
>> + */
>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> +{
>> +	unsigned long mask, val;
>> +	unsigned long __maybe_unused flags;
>> +	bool ret = false;
>> +
>> +	/* Silence bogus lockdep warning */
>> +#if defined(CONFIG_LOCKDEP)
>> +	local_irq_save(flags);
>> +#endif
>> +	spin_lock(&sb->map[index].swap_lock);
>> +
>> +	if (!sb->map[index].cleared)
>> +		goto out_unlock;
>> +
>> +	/*
>> +	 * First get a stable cleared mask, setting the old mask to 0.
>> +	 */
>> +	do {
>> +		mask = sb->map[index].cleared;
>> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> +
>> +	/*
>> +	 * Now clear the masked bits in our free word
>> +	 */
>> +	do {
>> +		val = sb->map[index].word;
>> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> +
>> +	ret = true;
>> +out_unlock:
>> +	spin_unlock(&sb->map[index].swap_lock);
>> +#if defined(CONFIG_LOCKDEP)
>> +	local_irq_restore(flags);
>> +#endif
>> +	return ret;
>> +}
>> +
>>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>  		      gfp_t flags, int node)
>>  {
>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>>  	unsigned int bits_per_word = 1U << sb->shift;
>>  	unsigned int i;
>>  
>> +	for (i = 0; i < sb->map_nr; i++)
>> +		sbitmap_deferred_clear(sb, i);
>> +
>>  	sb->depth = depth;
>>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>  
>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>>  	return nr;
>>  }
>>  
>> -/*
>> - * See if we have deferred clears that we can batch move
>> - */
>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> -{
>> -	unsigned long mask, val;
>> -	unsigned long __maybe_unused flags;
>> -	bool ret = false;
>> -
>> -	/* Silence bogus lockdep warning */
>> -#if defined(CONFIG_LOCKDEP)
>> -	local_irq_save(flags);
>> -#endif
>> -	spin_lock(&sb->map[index].swap_lock);
>> -
>> -	if (!sb->map[index].cleared)
>> -		goto out_unlock;
>> -
>> -	/*
>> -	 * First get a stable cleared mask, setting the old mask to 0.
>> -	 */
>> -	do {
>> -		mask = sb->map[index].cleared;
>> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> -
>> -	/*
>> -	 * Now clear the masked bits in our free word
>> -	 */
>> -	do {
>> -		val = sb->map[index].word;
>> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> -
>> -	ret = true;
>> -out_unlock:
>> -	spin_unlock(&sb->map[index].swap_lock);
>> -#if defined(CONFIG_LOCKDEP)
>> -	local_irq_restore(flags);
>> -#endif
>> -	return ret;
>> -}
>> -
>>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>>  				     unsigned int alloc_hint, bool round_robin)
>>  {
>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
>>  
>>  	for (i = 0; i < sb->map_nr; i++) {
>> +again:
>>  		nr = __sbitmap_get_word(&sb->map[index].word,
>>  					min(sb->map[index].depth, shallow_depth),
>>  					SB_NR_TO_BIT(sb, alloc_hint), true);
>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>  			break;
>>  		}
>>  
>> +		if (sbitmap_deferred_clear(sb, index))
>> +			goto again;
>> +
>>  		/* Jump to next index. */
>>  		index++;
>>  		alloc_hint = index << sb->shift;
> 
> Yeah, this one does work.

Good, thanks for testing. Can you try the next one too, just for good
measure? Then I can add your Tested-by.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:36             ` Jens Axboe
@ 2018-12-12  1:43               ` Ming Lei
  2018-12-12  1:44                 ` Jens Axboe
  2018-12-12  1:49                 ` Ming Lei
  0 siblings, 2 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-12  1:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> On 12/11/18 6:23 PM, Jens Axboe wrote:
> > On 12/11/18 6:05 PM, Jens Axboe wrote:
> >> On 12/11/18 5:38 PM, Ming Lei wrote:
> >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>>
> >>>> On Tue, Dec 11 2018 at  7:19pm -0500,
> >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>>
> >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>>> Hi Jens,
> >>>>>>>
> >>>>>>> If I run the following subset of blktests:
> >>>>>>>
> >>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>>
> >>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>>> time the following hang is reported:
> >>>>>>>
> >>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>>> fio             D25272 14869  14195 0x00000000
> >>>>>>> Call Trace:
> >>>>>>>  __schedule+0x401/0xe50
> >>>>>>>  schedule+0x4e/0xd0
> >>>>>>>  io_schedule+0x21/0x50
> >>>>>>>  blk_mq_get_tag+0x46d/0x640
> >>>>>>>  blk_mq_get_request+0x7c0/0xa00
> >>>>>>>  blk_mq_make_request+0x241/0xa70
> >>>>>>>  generic_make_request+0x411/0x950
> >>>>>>>  submit_bio+0x9b/0x250
> >>>>>>>  blkdev_direct_IO+0x7fb/0x870
> >>>>>>>  generic_file_direct_write+0x119/0x210
> >>>>>>>  __generic_file_write_iter+0x11c/0x280
> >>>>>>>  blkdev_write_iter+0x13c/0x220
> >>>>>>>  aio_write+0x204/0x310
> >>>>>>>  io_submit_one+0x9c6/0xe70
> >>>>>>>  __x64_sys_io_submit+0x115/0x340
> >>>>>>>  do_syscall_64+0x71/0x210
> >>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>>
> >>>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>>> any pending requests:
> >>>>>>>
> >>>>>>> # list-pending-block-requests
> >>>>>>> dm-0
> >>>>>>> loop0
> >>>>>>> loop1
> >>>>>>> loop2
> >>>>>>> loop3
> >>>>>>> loop4
> >>>>>>> loop5
> >>>>>>> loop6
> >>>>>>> loop7
> >>>>>>> nullb0
> >>>>>>> nullb1
> >>>>>>> sda
> >>>>>>> sdb
> >>>>>>> sdc
> >>>>>>> sdd
> >>>>>>> vda
> >>>>>>> vdb
> >>>>>>>
> >>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>>
> >>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>>
> >>>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>>
> >>>>>> What device is this being run on?
> >>>>>
> >>>>> I saw this issue on usb storage too.
> >>>>>
> >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>>> the sched_tag's sbitmap queue.
> >>>>
> >>>> You saw this running the same tests as Bart?
> >>>
> >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >>> see the attached test script.
> >>>
> >>> Mostly it can be triggered in one run, sometimes it needs more.
> >>
> >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> >> one word in general. But I'd like to try and understand why it hangs.
> >>
> >> Are you using a scheduler?
> > 
> > OK, I think I see what it is, the shallow is missing the deferred clear.
> > On top of this, probably worth to check at what depths deferred starts
> > to make sense. For QD == 1, definitely not. But that should be on top of
> > the fix.
> > 
> > Can you try this one?
> 
> This one should be complete. The former should fix the case, this one
> just adds the last two cases where we ignored ->cleared. For some of
> them we'll never hit it since deferred isn't used for that, but I
> think we're better off with full coverage.
> 
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 2261136ae067..5b3e56d68dab 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -20,6 +20,47 @@
>  #include <linux/sbitmap.h>
>  #include <linux/seq_file.h>
>  
> +/*
> + * See if we have deferred clears that we can batch move
> + */
> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> +{
> +	unsigned long mask, val;
> +	unsigned long __maybe_unused flags;
> +	bool ret = false;
> +
> +	/* Silence bogus lockdep warning */
> +#if defined(CONFIG_LOCKDEP)
> +	local_irq_save(flags);
> +#endif
> +	spin_lock(&sb->map[index].swap_lock);
> +
> +	if (!sb->map[index].cleared)
> +		goto out_unlock;
> +
> +	/*
> +	 * First get a stable cleared mask, setting the old mask to 0.
> +	 */
> +	do {
> +		mask = sb->map[index].cleared;
> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> +
> +	/*
> +	 * Now clear the masked bits in our free word
> +	 */
> +	do {
> +		val = sb->map[index].word;
> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> +
> +	ret = true;
> +out_unlock:
> +	spin_unlock(&sb->map[index].swap_lock);
> +#if defined(CONFIG_LOCKDEP)
> +	local_irq_restore(flags);
> +#endif
> +	return ret;
> +}
> +
>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>  		      gfp_t flags, int node)
>  {
> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>  	unsigned int bits_per_word = 1U << sb->shift;
>  	unsigned int i;
>  
> +	for (i = 0; i < sb->map_nr; i++)
> +		sbitmap_deferred_clear(sb, i);
> +
>  	sb->depth = depth;
>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>  
> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>  	return nr;
>  }
>  
> -/*
> - * See if we have deferred clears that we can batch move
> - */
> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> -{
> -	unsigned long mask, val;
> -	unsigned long __maybe_unused flags;
> -	bool ret = false;
> -
> -	/* Silence bogus lockdep warning */
> -#if defined(CONFIG_LOCKDEP)
> -	local_irq_save(flags);
> -#endif
> -	spin_lock(&sb->map[index].swap_lock);
> -
> -	if (!sb->map[index].cleared)
> -		goto out_unlock;
> -
> -	/*
> -	 * First get a stable cleared mask, setting the old mask to 0.
> -	 */
> -	do {
> -		mask = sb->map[index].cleared;
> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> -
> -	/*
> -	 * Now clear the masked bits in our free word
> -	 */
> -	do {
> -		val = sb->map[index].word;
> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> -
> -	ret = true;
> -out_unlock:
> -	spin_unlock(&sb->map[index].swap_lock);
> -#if defined(CONFIG_LOCKDEP)
> -	local_irq_restore(flags);
> -#endif
> -	return ret;
> -}
> -
>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>  				     unsigned int alloc_hint, bool round_robin)
>  {
> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
>  
>  	for (i = 0; i < sb->map_nr; i++) {
> +again:
>  		nr = __sbitmap_get_word(&sb->map[index].word,
>  					min(sb->map[index].depth, shallow_depth),
>  					SB_NR_TO_BIT(sb, alloc_hint), true);
> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>  			break;
>  		}
>  
> +		if (sbitmap_deferred_clear(sb, index))
> +			goto again;
> +
>  		/* Jump to next index. */
>  		index++;
>  		alloc_hint = index << sb->shift;
> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
>  	unsigned int i;
>  
>  	for (i = 0; i < sb->map_nr; i++) {
> -		if (sb->map[i].word)
> +		if (sb->map[i].word & ~sb->map[i].cleared)
>  			return true;
>  	}
>  	return false;
> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>  
>  	for (i = 0; i < sb->map_nr; i++) {
>  		const struct sbitmap_word *word = &sb->map[i];
> +		unsigned long mask = word->word & ~word->cleared;
>  		unsigned long ret;
>  
> -		ret = find_first_zero_bit(&word->word, word->depth);
> +		ret = find_first_zero_bit(&mask, word->depth);
>  		if (ret < word->depth)
>  			return true;
>  	}

This one works too:

Tested-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:43               ` Ming Lei
@ 2018-12-12  1:44                 ` Jens Axboe
  2018-12-12  1:49                 ` Ming Lei
  1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  1:44 UTC (permalink / raw)
  To: Ming Lei; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On 12/11/18 6:43 PM, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
>> On 12/11/18 6:23 PM, Jens Axboe wrote:
>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>>
>>>>>> On Tue, Dec 11 2018 at  7:19pm -0500,
>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>> If I run the following subset of blktests:
>>>>>>>>>
>>>>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>>>
>>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>>>> time the following hang is reported:
>>>>>>>>>
>>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>> fio             D25272 14869  14195 0x00000000
>>>>>>>>> Call Trace:
>>>>>>>>>  __schedule+0x401/0xe50
>>>>>>>>>  schedule+0x4e/0xd0
>>>>>>>>>  io_schedule+0x21/0x50
>>>>>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>>>>>  generic_make_request+0x411/0x950
>>>>>>>>>  submit_bio+0x9b/0x250
>>>>>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>>>>>  aio_write+0x204/0x310
>>>>>>>>>  io_submit_one+0x9c6/0xe70
>>>>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>>>>  do_syscall_64+0x71/0x210
>>>>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>>>
>>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>>>> any pending requests:
>>>>>>>>>
>>>>>>>>> # list-pending-block-requests
>>>>>>>>> dm-0
>>>>>>>>> loop0
>>>>>>>>> loop1
>>>>>>>>> loop2
>>>>>>>>> loop3
>>>>>>>>> loop4
>>>>>>>>> loop5
>>>>>>>>> loop6
>>>>>>>>> loop7
>>>>>>>>> nullb0
>>>>>>>>> nullb1
>>>>>>>>> sda
>>>>>>>>> sdb
>>>>>>>>> sdc
>>>>>>>>> sdd
>>>>>>>>> vda
>>>>>>>>> vdb
>>>>>>>>>
>>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>>>
>>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>>>
>>>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>>>
>>>>>>>> What device is this being run on?
>>>>>>>
>>>>>>> I saw this issue on usb storage too.
>>>>>>>
>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>>> the sched_tag's sbitmap queue.
>>>>>>
>>>>>> You saw this running the same tests as Bart?
>>>>>
>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>>> see the attached test script.
>>>>>
>>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>>
>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>>> one word in general. But I'd like to try and understand why it hangs.
>>>>
>>>> Are you using a scheduler?
>>>
>>> OK, I think I see what it is, the shallow is missing the deferred clear.
>>> On top of this, probably worth to check at what depths deferred starts
>>> to make sense. For QD == 1, definitely not. But that should be on top of
>>> the fix.
>>>
>>> Can you try this one?
>>
>> This one should be complete. The former should fix the case, this one
>> just adds the last two cases where we ignored ->cleared. For some of
>> them we'll never hit it since deferred isn't used for that, but I
>> think we're better off with full coverage.
>>
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 2261136ae067..5b3e56d68dab 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -20,6 +20,47 @@
>>  #include <linux/sbitmap.h>
>>  #include <linux/seq_file.h>
>>  
>> +/*
>> + * See if we have deferred clears that we can batch move
>> + */
>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> +{
>> +	unsigned long mask, val;
>> +	unsigned long __maybe_unused flags;
>> +	bool ret = false;
>> +
>> +	/* Silence bogus lockdep warning */
>> +#if defined(CONFIG_LOCKDEP)
>> +	local_irq_save(flags);
>> +#endif
>> +	spin_lock(&sb->map[index].swap_lock);
>> +
>> +	if (!sb->map[index].cleared)
>> +		goto out_unlock;
>> +
>> +	/*
>> +	 * First get a stable cleared mask, setting the old mask to 0.
>> +	 */
>> +	do {
>> +		mask = sb->map[index].cleared;
>> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> +
>> +	/*
>> +	 * Now clear the masked bits in our free word
>> +	 */
>> +	do {
>> +		val = sb->map[index].word;
>> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> +
>> +	ret = true;
>> +out_unlock:
>> +	spin_unlock(&sb->map[index].swap_lock);
>> +#if defined(CONFIG_LOCKDEP)
>> +	local_irq_restore(flags);
>> +#endif
>> +	return ret;
>> +}
>> +
>>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>  		      gfp_t flags, int node)
>>  {
>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>>  	unsigned int bits_per_word = 1U << sb->shift;
>>  	unsigned int i;
>>  
>> +	for (i = 0; i < sb->map_nr; i++)
>> +		sbitmap_deferred_clear(sb, i);
>> +
>>  	sb->depth = depth;
>>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>  
>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>>  	return nr;
>>  }
>>  
>> -/*
>> - * See if we have deferred clears that we can batch move
>> - */
>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> -{
>> -	unsigned long mask, val;
>> -	unsigned long __maybe_unused flags;
>> -	bool ret = false;
>> -
>> -	/* Silence bogus lockdep warning */
>> -#if defined(CONFIG_LOCKDEP)
>> -	local_irq_save(flags);
>> -#endif
>> -	spin_lock(&sb->map[index].swap_lock);
>> -
>> -	if (!sb->map[index].cleared)
>> -		goto out_unlock;
>> -
>> -	/*
>> -	 * First get a stable cleared mask, setting the old mask to 0.
>> -	 */
>> -	do {
>> -		mask = sb->map[index].cleared;
>> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> -
>> -	/*
>> -	 * Now clear the masked bits in our free word
>> -	 */
>> -	do {
>> -		val = sb->map[index].word;
>> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> -
>> -	ret = true;
>> -out_unlock:
>> -	spin_unlock(&sb->map[index].swap_lock);
>> -#if defined(CONFIG_LOCKDEP)
>> -	local_irq_restore(flags);
>> -#endif
>> -	return ret;
>> -}
>> -
>>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>>  				     unsigned int alloc_hint, bool round_robin)
>>  {
>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
>>  
>>  	for (i = 0; i < sb->map_nr; i++) {
>> +again:
>>  		nr = __sbitmap_get_word(&sb->map[index].word,
>>  					min(sb->map[index].depth, shallow_depth),
>>  					SB_NR_TO_BIT(sb, alloc_hint), true);
>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>  			break;
>>  		}
>>  
>> +		if (sbitmap_deferred_clear(sb, index))
>> +			goto again;
>> +
>>  		/* Jump to next index. */
>>  		index++;
>>  		alloc_hint = index << sb->shift;
>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
>>  	unsigned int i;
>>  
>>  	for (i = 0; i < sb->map_nr; i++) {
>> -		if (sb->map[i].word)
>> +		if (sb->map[i].word & ~sb->map[i].cleared)
>>  			return true;
>>  	}
>>  	return false;
>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>>  
>>  	for (i = 0; i < sb->map_nr; i++) {
>>  		const struct sbitmap_word *word = &sb->map[i];
>> +		unsigned long mask = word->word & ~word->cleared;
>>  		unsigned long ret;
>>  
>> -		ret = find_first_zero_bit(&word->word, word->depth);
>> +		ret = find_first_zero_bit(&mask, word->depth);
>>  		if (ret < word->depth)
>>  			return true;
>>  	}
> 
> This one works too:
> 
> Tested-by: Ming Lei <ming.lei@redhat.com>

Awesome, thanks Ming!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:43               ` Ming Lei
  2018-12-12  1:44                 ` Jens Axboe
@ 2018-12-12  1:49                 ` Ming Lei
  2018-12-12  2:03                   ` Ming Lei
  1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12  1:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> > On 12/11/18 6:23 PM, Jens Axboe wrote:
> > > On 12/11/18 6:05 PM, Jens Axboe wrote:
> > >> On 12/11/18 5:38 PM, Ming Lei wrote:
> > >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> > >>>>
> > >>>> On Tue, Dec 11 2018 at  7:19pm -0500,
> > >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> > >>>>
> > >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>
> > >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > >>>>>>> Hi Jens,
> > >>>>>>>
> > >>>>>>> If I run the following subset of blktests:
> > >>>>>>>
> > >>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > >>>>>>>
> > >>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> > >>>>>>> time the following hang is reported:
> > >>>>>>>
> > >>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> > >>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
> > >>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >>>>>>> fio             D25272 14869  14195 0x00000000
> > >>>>>>> Call Trace:
> > >>>>>>>  __schedule+0x401/0xe50
> > >>>>>>>  schedule+0x4e/0xd0
> > >>>>>>>  io_schedule+0x21/0x50
> > >>>>>>>  blk_mq_get_tag+0x46d/0x640
> > >>>>>>>  blk_mq_get_request+0x7c0/0xa00
> > >>>>>>>  blk_mq_make_request+0x241/0xa70
> > >>>>>>>  generic_make_request+0x411/0x950
> > >>>>>>>  submit_bio+0x9b/0x250
> > >>>>>>>  blkdev_direct_IO+0x7fb/0x870
> > >>>>>>>  generic_file_direct_write+0x119/0x210
> > >>>>>>>  __generic_file_write_iter+0x11c/0x280
> > >>>>>>>  blkdev_write_iter+0x13c/0x220
> > >>>>>>>  aio_write+0x204/0x310
> > >>>>>>>  io_submit_one+0x9c6/0xe70
> > >>>>>>>  __x64_sys_io_submit+0x115/0x340
> > >>>>>>>  do_syscall_64+0x71/0x210
> > >>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >>>>>>>
> > >>>>>>> When that hang occurs my list-pending-block-requests script does not show
> > >>>>>>> any pending requests:
> > >>>>>>>
> > >>>>>>> # list-pending-block-requests
> > >>>>>>> dm-0
> > >>>>>>> loop0
> > >>>>>>> loop1
> > >>>>>>> loop2
> > >>>>>>> loop3
> > >>>>>>> loop4
> > >>>>>>> loop5
> > >>>>>>> loop6
> > >>>>>>> loop7
> > >>>>>>> nullb0
> > >>>>>>> nullb1
> > >>>>>>> sda
> > >>>>>>> sdb
> > >>>>>>> sdc
> > >>>>>>> sdd
> > >>>>>>> vda
> > >>>>>>> vdb
> > >>>>>>>
> > >>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > >>>>>>> that the root cause is in any of the dm drivers used in this test:
> > >>>>>>>
> > >>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > >>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > >>>>>>>
> > >>>>>>> The same test passes against kernel v4.20-rc6.
> > >>>>>>
> > >>>>>> What device is this being run on?
> > >>>>>
> > >>>>> I saw this issue on usb storage too.
> > >>>>>
> > >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > >>>>> the sched_tag's sbitmap queue.
> > >>>>
> > >>>> You saw this running the same tests as Bart?
> > >>>
> > >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> > >>> see the attached test script.
> > >>>
> > >>> Mostly it can be triggered in one run, sometimes it needs more.
> > >>
> > >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > >> one word in general. But I'd like to try and understand why it hangs.
> > >>
> > >> Are you using a scheduler?
> > > 
> > > OK, I think I see what it is, the shallow is missing the deferred clear.
> > > On top of this, probably worth to check at what depths deferred starts
> > > to make sense. For QD == 1, definitely not. But that should be on top of
> > > the fix.
> > > 
> > > Can you try this one?
> > 
> > This one should be complete. The former should fix the case, this one
> > just adds the last two cases where we ignored ->cleared. For some of
> > them we'll never hit it since deferred isn't used for that, but I
> > think we're better off with full coverage.
> > 
> > 
> > diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> > index 2261136ae067..5b3e56d68dab 100644
> > --- a/lib/sbitmap.c
> > +++ b/lib/sbitmap.c
> > @@ -20,6 +20,47 @@
> >  #include <linux/sbitmap.h>
> >  #include <linux/seq_file.h>
> >  
> > +/*
> > + * See if we have deferred clears that we can batch move
> > + */
> > +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> > +{
> > +	unsigned long mask, val;
> > +	unsigned long __maybe_unused flags;
> > +	bool ret = false;
> > +
> > +	/* Silence bogus lockdep warning */
> > +#if defined(CONFIG_LOCKDEP)
> > +	local_irq_save(flags);
> > +#endif
> > +	spin_lock(&sb->map[index].swap_lock);
> > +
> > +	if (!sb->map[index].cleared)
> > +		goto out_unlock;
> > +
> > +	/*
> > +	 * First get a stable cleared mask, setting the old mask to 0.
> > +	 */
> > +	do {
> > +		mask = sb->map[index].cleared;
> > +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> > +
> > +	/*
> > +	 * Now clear the masked bits in our free word
> > +	 */
> > +	do {
> > +		val = sb->map[index].word;
> > +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> > +
> > +	ret = true;
> > +out_unlock:
> > +	spin_unlock(&sb->map[index].swap_lock);
> > +#if defined(CONFIG_LOCKDEP)
> > +	local_irq_restore(flags);
> > +#endif
> > +	return ret;
> > +}
> > +
> >  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> >  		      gfp_t flags, int node)
> >  {
> > @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> >  	unsigned int bits_per_word = 1U << sb->shift;
> >  	unsigned int i;
> >  
> > +	for (i = 0; i < sb->map_nr; i++)
> > +		sbitmap_deferred_clear(sb, i);
> > +
> >  	sb->depth = depth;
> >  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
> >  
> > @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> >  	return nr;
> >  }
> >  
> > -/*
> > - * See if we have deferred clears that we can batch move
> > - */
> > -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> > -{
> > -	unsigned long mask, val;
> > -	unsigned long __maybe_unused flags;
> > -	bool ret = false;
> > -
> > -	/* Silence bogus lockdep warning */
> > -#if defined(CONFIG_LOCKDEP)
> > -	local_irq_save(flags);
> > -#endif
> > -	spin_lock(&sb->map[index].swap_lock);
> > -
> > -	if (!sb->map[index].cleared)
> > -		goto out_unlock;
> > -
> > -	/*
> > -	 * First get a stable cleared mask, setting the old mask to 0.
> > -	 */
> > -	do {
> > -		mask = sb->map[index].cleared;
> > -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> > -
> > -	/*
> > -	 * Now clear the masked bits in our free word
> > -	 */
> > -	do {
> > -		val = sb->map[index].word;
> > -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> > -
> > -	ret = true;
> > -out_unlock:
> > -	spin_unlock(&sb->map[index].swap_lock);
> > -#if defined(CONFIG_LOCKDEP)
> > -	local_irq_restore(flags);
> > -#endif
> > -	return ret;
> > -}
> > -
> >  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> >  				     unsigned int alloc_hint, bool round_robin)
> >  {
> > @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >  	index = SB_NR_TO_INDEX(sb, alloc_hint);
> >  
> >  	for (i = 0; i < sb->map_nr; i++) {
> > +again:
> >  		nr = __sbitmap_get_word(&sb->map[index].word,
> >  					min(sb->map[index].depth, shallow_depth),
> >  					SB_NR_TO_BIT(sb, alloc_hint), true);
> > @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >  			break;
> >  		}
> >  
> > +		if (sbitmap_deferred_clear(sb, index))
> > +			goto again;
> > +
> >  		/* Jump to next index. */
> >  		index++;
> >  		alloc_hint = index << sb->shift;
> > @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
> >  	unsigned int i;
> >  
> >  	for (i = 0; i < sb->map_nr; i++) {
> > -		if (sb->map[i].word)
> > +		if (sb->map[i].word & ~sb->map[i].cleared)
> >  			return true;
> >  	}
> >  	return false;
> > @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
> >  
> >  	for (i = 0; i < sb->map_nr; i++) {
> >  		const struct sbitmap_word *word = &sb->map[i];
> > +		unsigned long mask = word->word & ~word->cleared;
> >  		unsigned long ret;
> >  
> > -		ret = find_first_zero_bit(&word->word, word->depth);
> > +		ret = find_first_zero_bit(&mask, word->depth);
> >  		if (ret < word->depth)
> >  			return true;
> >  	}
> 
> This one works too:
> 
> Tested-by: Ming Lei <ming.lei@redhat.com>

Please hold on, just found this issue is triggered again after the 4th
run,

[  135.588599] ================start test sanity/004================
[  136.864175] scsi host14: scsi_debug: version 0188 [20180128]
                 dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
[  136.873018] scsi 14:0:0:0: Direct-Access     Linux    scsi_debug       0188 PQ: 0 ANSI: 7
[  136.874282] sd 14:0:0:0: Power-on or device reset occurred
[  136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[  136.879284] sd 14:0:0:0: [sde] Write Protect is off
[  136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
[  136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
[  139.465844]  sdd:
[  139.754360]  sdd:
[  143.650074] hpet1: lost 765 rtc interrupts
[  145.078137]  sdd:
[  145.699220]  sdd:
[  146.693532]  sdd:
[  147.557814]  sdd:
[  148.742290]  sdd:
[  149.429425]  sdd:
[  149.944095]  sdd: sdd1
[  152.631125]  sdd:
[  184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
[  184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[  184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
[  184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
[  184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
[  185.872632]  sdd: sdd1 sdd2
[  242.446969] hpet1: lost 673 rtc interrupts
[  252.701145] hpet1: lost 655 rtc interrupts
[  257.814148] hpet1: lost 300 rtc interrupts
[  278.315651] hpet1: lost 1311 rtc interrupts
[  278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
[  336.228781] hpet1: lost 948 rtc interrupts
[  342.623253] hpet1: lost 409 rtc interrupts
[  344.205027] hpet1: lost 14 rtc interrupts
[  344.457622] hpet1: lost 15 rtc interrupts
[  344.584608] hpet1: lost 7 rtc interrupts
[  344.706382] hpet1: lost 7 rtc interrupts
[  345.012803] hpet1: lost 19 rtc interrupts
[  345.109177] hpet1: lost 5 rtc interrupts
[  345.353934] hpet1: lost 14 rtc interrupts
[  345.477123] hpet1: lost 7 rtc interrupts
[  345.530757] hpet1: lost 3 rtc interrupts
[  348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
[  348.328567] hpet1: lost 162 rtc interrupts
[  348.506049] hpet1: lost 10 rtc interrupts
[  350.461310] hpet1: lost 124 rtc interrupts
[  351.307411] hpet1: lost 50 rtc interrupts
[  351.401932] hpet1: lost 6 rtc interrupts
[  351.564204] hpet1: lost 9 rtc interrupts
[  351.648124] hpet1: lost 4 rtc interrupts
[  351.828972] hpet1: lost 11 rtc interrupts
[  353.043051] hpet1: lost 76 rtc interrupts
[  353.705926] hpet1: lost 38 rtc interrupts
[  353.867703] hpet1: lost 9 rtc interrupts
[  354.030183] hpet1: lost 10 rtc interrupts
[  354.071030] hpet1: lost 1 rtc interrupts
[  354.194600] hpet1: lost 7 rtc interrupts
[  354.316562] hpet1: lost 7 rtc interrupts
[  355.167894] hpet1: lost 1376 rtc interrupts
[  364.089267] hpet1: lost 153 rtc interrupts
[  364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
[  366.869943] hpet1: lost 177 rtc interrupts
[  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.940768] parted          D    0  3645   3239 0x00000000
[  369.941500] Call Trace:
[  369.941874]  ? __schedule+0x6d9/0x74c
[  369.942392]  ? wbt_done+0x5e/0x5e
[  369.942864]  ? wbt_cleanup_cb+0x16/0x16
[  369.943404]  ? wbt_done+0x5e/0x5e
[  369.943874]  schedule+0x67/0x78
[  369.944298]  io_schedule+0x12/0x33
[  369.944771]  rq_qos_wait+0xb5/0x119
[  369.945193]  ? karma_partition+0x1c2/0x1c2
[  369.945691]  ? wbt_cleanup_cb+0x16/0x16
[  369.946151]  wbt_wait+0x85/0xb6
[  369.946540]  __rq_qos_throttle+0x23/0x2f
[  369.947014]  blk_mq_make_request+0xe6/0x40a
[  369.947518]  generic_make_request+0x192/0x2fe
[  369.948042]  ? submit_bio+0x103/0x11f
[  369.948486]  ? __radix_tree_lookup+0x35/0xb5
[  369.949011]  submit_bio+0x103/0x11f
[  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
[  369.949962]  submit_bio_wait+0x53/0x7f
[  369.950469]  blkdev_issue_flush+0x8a/0xae
[  369.951032]  blkdev_fsync+0x2f/0x3a
[  369.951502]  do_fsync+0x2e/0x47
[  369.951887]  __x64_sys_fsync+0x10/0x13
[  369.952374]  do_syscall_64+0x89/0x149
[  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  369.953492] RIP: 0033:0x7f95a1e729d4
[  369.953996] Code: Bad RIP value.
[  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
[  375.962995] ================end test sanity/004: (HANG, -1)================
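
For what it's worth, the sched_tags sbitmap state referenced earlier in the
thread and the wbt setting can be captured from userspace while the hang is
in place. A minimal sketch, not part of blktests: it assumes debugfs is
mounted at /sys/kernel/debug, CONFIG_BLK_DEBUG_FS is enabled, the disk is
sdd as in the log above, and that the debugfs output on this branch exposes
the fields of interest; the $dev variable is only for illustration.

  dev=sdd
  # current wbt latency target (in usec) for this queue
  cat /sys/block/$dev/queue/wbt_lat_usec
  # per-hctx tag state from the standard blk-mq debugfs entries
  for h in /sys/kernel/debug/block/$dev/hctx*; do
          echo "--- $h"
          for f in tags tags_bitmap sched_tags sched_tags_bitmap; do
                  [ -r "$h/$f" ] && { echo "$f:"; cat "$h/$f"; }
          done
  done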

Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  1:49                 ` Ming Lei
@ 2018-12-12  2:03                   ` Ming Lei
  2018-12-12  2:25                     ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12  2:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block

On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> > On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> > > On 12/11/18 6:23 PM, Jens Axboe wrote:
> > > > On 12/11/18 6:05 PM, Jens Axboe wrote:
> > > >> On 12/11/18 5:38 PM, Ming Lei wrote:
> > > >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> > > >>>>
> > > >>>> On Tue, Dec 11 2018 at  7:19pm -0500,
> > > >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> > > >>>>
> > > >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > >>>>>>
> > > >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > >>>>>>> Hi Jens,
> > > >>>>>>>
> > > >>>>>>> If I run the following subset of blktests:
> > > >>>>>>>
> > > >>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > > >>>>>>>
> > > >>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> > > >>>>>>> time the following hang is reported:
> > > >>>>>>>
> > > >>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> > > >>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
> > > >>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > >>>>>>> fio             D25272 14869  14195 0x00000000
> > > >>>>>>> Call Trace:
> > > >>>>>>>  __schedule+0x401/0xe50
> > > >>>>>>>  schedule+0x4e/0xd0
> > > >>>>>>>  io_schedule+0x21/0x50
> > > >>>>>>>  blk_mq_get_tag+0x46d/0x640
> > > >>>>>>>  blk_mq_get_request+0x7c0/0xa00
> > > >>>>>>>  blk_mq_make_request+0x241/0xa70
> > > >>>>>>>  generic_make_request+0x411/0x950
> > > >>>>>>>  submit_bio+0x9b/0x250
> > > >>>>>>>  blkdev_direct_IO+0x7fb/0x870
> > > >>>>>>>  generic_file_direct_write+0x119/0x210
> > > >>>>>>>  __generic_file_write_iter+0x11c/0x280
> > > >>>>>>>  blkdev_write_iter+0x13c/0x220
> > > >>>>>>>  aio_write+0x204/0x310
> > > >>>>>>>  io_submit_one+0x9c6/0xe70
> > > >>>>>>>  __x64_sys_io_submit+0x115/0x340
> > > >>>>>>>  do_syscall_64+0x71/0x210
> > > >>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > >>>>>>>
> > > >>>>>>> When that hang occurs my list-pending-block-requests script does not show
> > > >>>>>>> any pending requests:
> > > >>>>>>>
> > > >>>>>>> # list-pending-block-requests
> > > >>>>>>> dm-0
> > > >>>>>>> loop0
> > > >>>>>>> loop1
> > > >>>>>>> loop2
> > > >>>>>>> loop3
> > > >>>>>>> loop4
> > > >>>>>>> loop5
> > > >>>>>>> loop6
> > > >>>>>>> loop7
> > > >>>>>>> nullb0
> > > >>>>>>> nullb1
> > > >>>>>>> sda
> > > >>>>>>> sdb
> > > >>>>>>> sdc
> > > >>>>>>> sdd
> > > >>>>>>> vda
> > > >>>>>>> vdb
> > > >>>>>>>
> > > >>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > > >>>>>>> that the root cause is in any of the dm drivers used in this test:
> > > >>>>>>>
> > > >>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > > >>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > > >>>>>>>
> > > >>>>>>> The same test passes against kernel v4.20-rc6.
> > > >>>>>>
> > > >>>>>> What device is this being run on?
> > > >>>>>
> > > >>>>> I saw this issue on usb storage too.
> > > >>>>>
> > > >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > > >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > > >>>>> the sched_tag's sbitmap queue.
> > > >>>>
> > > >>>> You saw this running the same tests as Bart?
> > > >>>
> > > >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> > > >>> see the attached test script.
> > > >>>
> > > >>> Mostly it can be triggered in one run, sometimes it needs more.
> > > >>
> > > >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > > >> one word in general. But I'd like to try and understand why it hangs.
> > > >>
> > > >> Are you using a scheduler?
> > > > 
> > > > OK, I think I see what it is, the shallow is missing the deferred clear.
> > > > On top of this, probably worth to check at what depths deferred starts
> > > > to make sense. For QD == 1, definitely not. But that should be on top of
> > > > the fix.
> > > > 
> > > > Can you try this one?
> > > 
> > > This one should be complete. The former should fix the case, this one
> > > just adds the last two cases where we ignored ->cleared. For some of
> > > them we'll never hit it since deferred isn't used for that, but I
> > > think we're better off with full coverage.
> > > 
> > > 
> > > diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> > > index 2261136ae067..5b3e56d68dab 100644
> > > --- a/lib/sbitmap.c
> > > +++ b/lib/sbitmap.c
> > > @@ -20,6 +20,47 @@
> > >  #include <linux/sbitmap.h>
> > >  #include <linux/seq_file.h>
> > >  
> > > +/*
> > > + * See if we have deferred clears that we can batch move
> > > + */
> > > +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> > > +{
> > > +	unsigned long mask, val;
> > > +	unsigned long __maybe_unused flags;
> > > +	bool ret = false;
> > > +
> > > +	/* Silence bogus lockdep warning */
> > > +#if defined(CONFIG_LOCKDEP)
> > > +	local_irq_save(flags);
> > > +#endif
> > > +	spin_lock(&sb->map[index].swap_lock);
> > > +
> > > +	if (!sb->map[index].cleared)
> > > +		goto out_unlock;
> > > +
> > > +	/*
> > > +	 * First get a stable cleared mask, setting the old mask to 0.
> > > +	 */
> > > +	do {
> > > +		mask = sb->map[index].cleared;
> > > +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> > > +
> > > +	/*
> > > +	 * Now clear the masked bits in our free word
> > > +	 */
> > > +	do {
> > > +		val = sb->map[index].word;
> > > +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> > > +
> > > +	ret = true;
> > > +out_unlock:
> > > +	spin_unlock(&sb->map[index].swap_lock);
> > > +#if defined(CONFIG_LOCKDEP)
> > > +	local_irq_restore(flags);
> > > +#endif
> > > +	return ret;
> > > +}
> > > +
> > >  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> > >  		      gfp_t flags, int node)
> > >  {
> > > @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> > >  	unsigned int bits_per_word = 1U << sb->shift;
> > >  	unsigned int i;
> > >  
> > > +	for (i = 0; i < sb->map_nr; i++)
> > > +		sbitmap_deferred_clear(sb, i);
> > > +
> > >  	sb->depth = depth;
> > >  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
> > >  
> > > @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> > >  	return nr;
> > >  }
> > >  
> > > -/*
> > > - * See if we have deferred clears that we can batch move
> > > - */
> > > -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> > > -{
> > > -	unsigned long mask, val;
> > > -	unsigned long __maybe_unused flags;
> > > -	bool ret = false;
> > > -
> > > -	/* Silence bogus lockdep warning */
> > > -#if defined(CONFIG_LOCKDEP)
> > > -	local_irq_save(flags);
> > > -#endif
> > > -	spin_lock(&sb->map[index].swap_lock);
> > > -
> > > -	if (!sb->map[index].cleared)
> > > -		goto out_unlock;
> > > -
> > > -	/*
> > > -	 * First get a stable cleared mask, setting the old mask to 0.
> > > -	 */
> > > -	do {
> > > -		mask = sb->map[index].cleared;
> > > -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> > > -
> > > -	/*
> > > -	 * Now clear the masked bits in our free word
> > > -	 */
> > > -	do {
> > > -		val = sb->map[index].word;
> > > -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> > > -
> > > -	ret = true;
> > > -out_unlock:
> > > -	spin_unlock(&sb->map[index].swap_lock);
> > > -#if defined(CONFIG_LOCKDEP)
> > > -	local_irq_restore(flags);
> > > -#endif
> > > -	return ret;
> > > -}
> > > -
> > >  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> > >  				     unsigned int alloc_hint, bool round_robin)
> > >  {
> > > @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> > >  	index = SB_NR_TO_INDEX(sb, alloc_hint);
> > >  
> > >  	for (i = 0; i < sb->map_nr; i++) {
> > > +again:
> > >  		nr = __sbitmap_get_word(&sb->map[index].word,
> > >  					min(sb->map[index].depth, shallow_depth),
> > >  					SB_NR_TO_BIT(sb, alloc_hint), true);
> > > @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> > >  			break;
> > >  		}
> > >  
> > > +		if (sbitmap_deferred_clear(sb, index))
> > > +			goto again;
> > > +
> > >  		/* Jump to next index. */
> > >  		index++;
> > >  		alloc_hint = index << sb->shift;
> > > @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
> > >  	unsigned int i;
> > >  
> > >  	for (i = 0; i < sb->map_nr; i++) {
> > > -		if (sb->map[i].word)
> > > +		if (sb->map[i].word & ~sb->map[i].cleared)
> > >  			return true;
> > >  	}
> > >  	return false;
> > > @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
> > >  
> > >  	for (i = 0; i < sb->map_nr; i++) {
> > >  		const struct sbitmap_word *word = &sb->map[i];
> > > +		unsigned long mask = word->word & ~word->cleared;
> > >  		unsigned long ret;
> > >  
> > > -		ret = find_first_zero_bit(&word->word, word->depth);
> > > +		ret = find_first_zero_bit(&mask, word->depth);
> > >  		if (ret < word->depth)
> > >  			return true;
> > >  	}
> > 
> > This one works too:
> > 
> > Tested-by: Ming Lei <ming.lei@redhat.com>
> 
> Please hold on, just found this issue is triggered again after the 4th
> run,
> 
> [  135.588599] ================start test sanity/004================
> [  136.864175] scsi host14: scsi_debug: version 0188 [20180128]
>                  dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
> [  136.873018] scsi 14:0:0:0: Direct-Access     Linux    scsi_debug       0188 PQ: 0 ANSI: 7
> [  136.874282] sd 14:0:0:0: Power-on or device reset occurred
> [  136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
> [  136.879284] sd 14:0:0:0: [sde] Write Protect is off
> [  136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
> [  136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> [  136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
> [  139.465844]  sdd:
> [  139.754360]  sdd:
> [  143.650074] hpet1: lost 765 rtc interrupts
> [  145.078137]  sdd:
> [  145.699220]  sdd:
> [  146.693532]  sdd:
> [  147.557814]  sdd:
> [  148.742290]  sdd:
> [  149.429425]  sdd:
> [  149.944095]  sdd: sdd1
> [  152.631125]  sdd:
> [  184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
> [  184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
> [  184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
> [  184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
> [  184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [  185.872632]  sdd: sdd1 sdd2
> [  242.446969] hpet1: lost 673 rtc interrupts
> [  252.701145] hpet1: lost 655 rtc interrupts
> [  257.814148] hpet1: lost 300 rtc interrupts
> [  278.315651] hpet1: lost 1311 rtc interrupts
> [  278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [  336.228781] hpet1: lost 948 rtc interrupts
> [  342.623253] hpet1: lost 409 rtc interrupts
> [  344.205027] hpet1: lost 14 rtc interrupts
> [  344.457622] hpet1: lost 15 rtc interrupts
> [  344.584608] hpet1: lost 7 rtc interrupts
> [  344.706382] hpet1: lost 7 rtc interrupts
> [  345.012803] hpet1: lost 19 rtc interrupts
> [  345.109177] hpet1: lost 5 rtc interrupts
> [  345.353934] hpet1: lost 14 rtc interrupts
> [  345.477123] hpet1: lost 7 rtc interrupts
> [  345.530757] hpet1: lost 3 rtc interrupts
> [  348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
> [  348.328567] hpet1: lost 162 rtc interrupts
> [  348.506049] hpet1: lost 10 rtc interrupts
> [  350.461310] hpet1: lost 124 rtc interrupts
> [  351.307411] hpet1: lost 50 rtc interrupts
> [  351.401932] hpet1: lost 6 rtc interrupts
> [  351.564204] hpet1: lost 9 rtc interrupts
> [  351.648124] hpet1: lost 4 rtc interrupts
> [  351.828972] hpet1: lost 11 rtc interrupts
> [  353.043051] hpet1: lost 76 rtc interrupts
> [  353.705926] hpet1: lost 38 rtc interrupts
> [  353.867703] hpet1: lost 9 rtc interrupts
> [  354.030183] hpet1: lost 10 rtc interrupts
> [  354.071030] hpet1: lost 1 rtc interrupts
> [  354.194600] hpet1: lost 7 rtc interrupts
> [  354.316562] hpet1: lost 7 rtc interrupts
> [  355.167894] hpet1: lost 1376 rtc interrupts
> [  364.089267] hpet1: lost 153 rtc interrupts
> [  364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [  366.869943] hpet1: lost 177 rtc interrupts
> [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
> [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
> [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  369.940768] parted          D    0  3645   3239 0x00000000
> [  369.941500] Call Trace:
> [  369.941874]  ? __schedule+0x6d9/0x74c
> [  369.942392]  ? wbt_done+0x5e/0x5e
> [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
> [  369.943404]  ? wbt_done+0x5e/0x5e
> [  369.943874]  schedule+0x67/0x78
> [  369.944298]  io_schedule+0x12/0x33
> [  369.944771]  rq_qos_wait+0xb5/0x119
> [  369.945193]  ? karma_partition+0x1c2/0x1c2
> [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
> [  369.946151]  wbt_wait+0x85/0xb6
> [  369.946540]  __rq_qos_throttle+0x23/0x2f
> [  369.947014]  blk_mq_make_request+0xe6/0x40a
> [  369.947518]  generic_make_request+0x192/0x2fe
> [  369.948042]  ? submit_bio+0x103/0x11f
> [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
> [  369.949011]  submit_bio+0x103/0x11f
> [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
> [  369.949962]  submit_bio_wait+0x53/0x7f
> [  369.950469]  blkdev_issue_flush+0x8a/0xae
> [  369.951032]  blkdev_fsync+0x2f/0x3a
> [  369.951502]  do_fsync+0x2e/0x47
> [  369.951887]  __x64_sys_fsync+0x10/0x13
> [  369.952374]  do_syscall_64+0x89/0x149
> [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  369.953492] RIP: 0033:0x7f95a1e729d4
> [  369.953996] Code: Bad RIP value.
> [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
> [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
> [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
> [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
> [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
> [  375.962995] ================end test sanity/004: (HANG, -1)================

This may be a different issue: I just observed that the wbt background
queue's (RWQ_BG) inflight counter becomes -1 after the hang is triggered.
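
A quick way to isolate that from userspace, sketch only (not something run
in this thread): it assumes the disk is sdd as in the log above and the
documented wbt_lat_usec semantics, where 0 disables writeback throttling
and -1 should restore the default target.

  cat /sys/block/sdd/queue/wbt_lat_usec       # current latency target in usec
  echo 0 > /sys/block/sdd/queue/wbt_lat_usec  # 0 disables wbt on this queue
  # ... re-run the parted loop; if it now survives, the rq_qos/wbt
  # accounting rather than the sched_tags bitmap is the likely blocker ...
  echo -1 > /sys/block/sdd/queue/wbt_lat_usec # restore the default target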


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  2:03                   ` Ming Lei
@ 2018-12-12  2:25                     ` Jens Axboe
  2018-12-12  4:28                       ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12  2:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block, Josef Bacik

On 12/11/18 7:03 PM, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
>> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
>>> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
>>>> On 12/11/18 6:23 PM, Jens Axboe wrote:
>>>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 11 2018 at  7:19pm -0500,
>>>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>
>>>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>
>>>>>>>>>>> If I run the following subset of blktests:
>>>>>>>>>>>
>>>>>>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>>>>>
>>>>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>>>>>> time the following hang is reported:
>>>>>>>>>>>
>>>>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>>> fio             D25272 14869  14195 0x00000000
>>>>>>>>>>> Call Trace:
>>>>>>>>>>>  __schedule+0x401/0xe50
>>>>>>>>>>>  schedule+0x4e/0xd0
>>>>>>>>>>>  io_schedule+0x21/0x50
>>>>>>>>>>>  blk_mq_get_tag+0x46d/0x640
>>>>>>>>>>>  blk_mq_get_request+0x7c0/0xa00
>>>>>>>>>>>  blk_mq_make_request+0x241/0xa70
>>>>>>>>>>>  generic_make_request+0x411/0x950
>>>>>>>>>>>  submit_bio+0x9b/0x250
>>>>>>>>>>>  blkdev_direct_IO+0x7fb/0x870
>>>>>>>>>>>  generic_file_direct_write+0x119/0x210
>>>>>>>>>>>  __generic_file_write_iter+0x11c/0x280
>>>>>>>>>>>  blkdev_write_iter+0x13c/0x220
>>>>>>>>>>>  aio_write+0x204/0x310
>>>>>>>>>>>  io_submit_one+0x9c6/0xe70
>>>>>>>>>>>  __x64_sys_io_submit+0x115/0x340
>>>>>>>>>>>  do_syscall_64+0x71/0x210
>>>>>>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>>>>>
>>>>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>>>>>> any pending requests:
>>>>>>>>>>>
>>>>>>>>>>> # list-pending-block-requests
>>>>>>>>>>> dm-0
>>>>>>>>>>> loop0
>>>>>>>>>>> loop1
>>>>>>>>>>> loop2
>>>>>>>>>>> loop3
>>>>>>>>>>> loop4
>>>>>>>>>>> loop5
>>>>>>>>>>> loop6
>>>>>>>>>>> loop7
>>>>>>>>>>> nullb0
>>>>>>>>>>> nullb1
>>>>>>>>>>> sda
>>>>>>>>>>> sdb
>>>>>>>>>>> sdc
>>>>>>>>>>> sdd
>>>>>>>>>>> vda
>>>>>>>>>>> vdb
>>>>>>>>>>>
>>>>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>>>>>
>>>>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>>>>>
>>>>>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>>>>>
>>>>>>>>>> What device is this being run on?
>>>>>>>>>
>>>>>>>>> I saw this issue on usb storage too.
>>>>>>>>>
>>>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>>>>> the sched_tag's sbitmap queue.
>>>>>>>>
>>>>>>>> You saw this running the same tests as Bart?
>>>>>>>
>>>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>>>>> see the attached test script.
>>>>>>>
>>>>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>>>>
>>>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>>>>> one word in general. But I'd like to try and understand why it hangs.
>>>>>>
>>>>>> Are you using a scheduler?
>>>>>
>>>>> OK, I think I see what it is, the shallow is missing the deferred clear.
>>>>> On top of this, probably worth to check at what depths deferred starts
>>>>> to make sense. For QD == 1, definitely not. But that should be on top of
>>>>> the fix.
>>>>>
>>>>> Can you try this one?
>>>>
>>>> This one should be complete. The former should fix the case, this one
>>>> just adds the last two cases where we ignored ->cleared. For some of
>>>> them we'll never hit it since deferred isn't used for that, but I
>>>> think we're better off with full coverage.
>>>>
>>>>
>>>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>>>> index 2261136ae067..5b3e56d68dab 100644
>>>> --- a/lib/sbitmap.c
>>>> +++ b/lib/sbitmap.c
>>>> @@ -20,6 +20,47 @@
>>>>  #include <linux/sbitmap.h>
>>>>  #include <linux/seq_file.h>
>>>>  
>>>> +/*
>>>> + * See if we have deferred clears that we can batch move
>>>> + */
>>>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>>>> +{
>>>> +	unsigned long mask, val;
>>>> +	unsigned long __maybe_unused flags;
>>>> +	bool ret = false;
>>>> +
>>>> +	/* Silence bogus lockdep warning */
>>>> +#if defined(CONFIG_LOCKDEP)
>>>> +	local_irq_save(flags);
>>>> +#endif
>>>> +	spin_lock(&sb->map[index].swap_lock);
>>>> +
>>>> +	if (!sb->map[index].cleared)
>>>> +		goto out_unlock;
>>>> +
>>>> +	/*
>>>> +	 * First get a stable cleared mask, setting the old mask to 0.
>>>> +	 */
>>>> +	do {
>>>> +		mask = sb->map[index].cleared;
>>>> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>>>> +
>>>> +	/*
>>>> +	 * Now clear the masked bits in our free word
>>>> +	 */
>>>> +	do {
>>>> +		val = sb->map[index].word;
>>>> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>>>> +
>>>> +	ret = true;
>>>> +out_unlock:
>>>> +	spin_unlock(&sb->map[index].swap_lock);
>>>> +#if defined(CONFIG_LOCKDEP)
>>>> +	local_irq_restore(flags);
>>>> +#endif
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>>>  		      gfp_t flags, int node)
>>>>  {
>>>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>>>>  	unsigned int bits_per_word = 1U << sb->shift;
>>>>  	unsigned int i;
>>>>  
>>>> +	for (i = 0; i < sb->map_nr; i++)
>>>> +		sbitmap_deferred_clear(sb, i);
>>>> +
>>>>  	sb->depth = depth;
>>>>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>>>  
>>>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>>>>  	return nr;
>>>>  }
>>>>  
>>>> -/*
>>>> - * See if we have deferred clears that we can batch move
>>>> - */
>>>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>>>> -{
>>>> -	unsigned long mask, val;
>>>> -	unsigned long __maybe_unused flags;
>>>> -	bool ret = false;
>>>> -
>>>> -	/* Silence bogus lockdep warning */
>>>> -#if defined(CONFIG_LOCKDEP)
>>>> -	local_irq_save(flags);
>>>> -#endif
>>>> -	spin_lock(&sb->map[index].swap_lock);
>>>> -
>>>> -	if (!sb->map[index].cleared)
>>>> -		goto out_unlock;
>>>> -
>>>> -	/*
>>>> -	 * First get a stable cleared mask, setting the old mask to 0.
>>>> -	 */
>>>> -	do {
>>>> -		mask = sb->map[index].cleared;
>>>> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>>>> -
>>>> -	/*
>>>> -	 * Now clear the masked bits in our free word
>>>> -	 */
>>>> -	do {
>>>> -		val = sb->map[index].word;
>>>> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>>>> -
>>>> -	ret = true;
>>>> -out_unlock:
>>>> -	spin_unlock(&sb->map[index].swap_lock);
>>>> -#if defined(CONFIG_LOCKDEP)
>>>> -	local_irq_restore(flags);
>>>> -#endif
>>>> -	return ret;
>>>> -}
>>>> -
>>>>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>>>>  				     unsigned int alloc_hint, bool round_robin)
>>>>  {
>>>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>>>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
>>>>  
>>>>  	for (i = 0; i < sb->map_nr; i++) {
>>>> +again:
>>>>  		nr = __sbitmap_get_word(&sb->map[index].word,
>>>>  					min(sb->map[index].depth, shallow_depth),
>>>>  					SB_NR_TO_BIT(sb, alloc_hint), true);
>>>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>>>  			break;
>>>>  		}
>>>>  
>>>> +		if (sbitmap_deferred_clear(sb, index))
>>>> +			goto again;
>>>> +
>>>>  		/* Jump to next index. */
>>>>  		index++;
>>>>  		alloc_hint = index << sb->shift;
>>>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
>>>>  	unsigned int i;
>>>>  
>>>>  	for (i = 0; i < sb->map_nr; i++) {
>>>> -		if (sb->map[i].word)
>>>> +		if (sb->map[i].word & ~sb->map[i].cleared)
>>>>  			return true;
>>>>  	}
>>>>  	return false;
>>>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>>>>  
>>>>  	for (i = 0; i < sb->map_nr; i++) {
>>>>  		const struct sbitmap_word *word = &sb->map[i];
>>>> +		unsigned long mask = word->word & ~word->cleared;
>>>>  		unsigned long ret;
>>>>  
>>>> -		ret = find_first_zero_bit(&word->word, word->depth);
>>>> +		ret = find_first_zero_bit(&mask, word->depth);
>>>>  		if (ret < word->depth)
>>>>  			return true;
>>>>  	}
>>>
>>> This one works too:
>>>
>>> Tested-by: Ming Lei <ming.lei@redhat.com>
>>
>> Please hold on, I just found that this issue is triggered again after the
>> 4th run:
>>
>> [  135.588599] ================start test sanity/004================
>> [  136.864175] scsi host14: scsi_debug: version 0188 [20180128]
>>                  dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
>> [  136.873018] scsi 14:0:0:0: Direct-Access     Linux    scsi_debug       0188 PQ: 0 ANSI: 7
>> [  136.874282] sd 14:0:0:0: Power-on or device reset occurred
>> [  136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
>> [  136.879284] sd 14:0:0:0: [sde] Write Protect is off
>> [  136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
>> [  136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
>> [  136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
>> [  139.465844]  sdd:
>> [  139.754360]  sdd:
>> [  143.650074] hpet1: lost 765 rtc interrupts
>> [  145.078137]  sdd:
>> [  145.699220]  sdd:
>> [  146.693532]  sdd:
>> [  147.557814]  sdd:
>> [  148.742290]  sdd:
>> [  149.429425]  sdd:
>> [  149.944095]  sdd: sdd1
>> [  152.631125]  sdd:
>> [  184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
>> [  184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
>> [  184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
>> [  184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
>> [  184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [  185.872632]  sdd: sdd1 sdd2
>> [  242.446969] hpet1: lost 673 rtc interrupts
>> [  252.701145] hpet1: lost 655 rtc interrupts
>> [  257.814148] hpet1: lost 300 rtc interrupts
>> [  278.315651] hpet1: lost 1311 rtc interrupts
>> [  278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [  336.228781] hpet1: lost 948 rtc interrupts
>> [  342.623253] hpet1: lost 409 rtc interrupts
>> [  344.205027] hpet1: lost 14 rtc interrupts
>> [  344.457622] hpet1: lost 15 rtc interrupts
>> [  344.584608] hpet1: lost 7 rtc interrupts
>> [  344.706382] hpet1: lost 7 rtc interrupts
>> [  345.012803] hpet1: lost 19 rtc interrupts
>> [  345.109177] hpet1: lost 5 rtc interrupts
>> [  345.353934] hpet1: lost 14 rtc interrupts
>> [  345.477123] hpet1: lost 7 rtc interrupts
>> [  345.530757] hpet1: lost 3 rtc interrupts
>> [  348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
>> [  348.328567] hpet1: lost 162 rtc interrupts
>> [  348.506049] hpet1: lost 10 rtc interrupts
>> [  350.461310] hpet1: lost 124 rtc interrupts
>> [  351.307411] hpet1: lost 50 rtc interrupts
>> [  351.401932] hpet1: lost 6 rtc interrupts
>> [  351.564204] hpet1: lost 9 rtc interrupts
>> [  351.648124] hpet1: lost 4 rtc interrupts
>> [  351.828972] hpet1: lost 11 rtc interrupts
>> [  353.043051] hpet1: lost 76 rtc interrupts
>> [  353.705926] hpet1: lost 38 rtc interrupts
>> [  353.867703] hpet1: lost 9 rtc interrupts
>> [  354.030183] hpet1: lost 10 rtc interrupts
>> [  354.071030] hpet1: lost 1 rtc interrupts
>> [  354.194600] hpet1: lost 7 rtc interrupts
>> [  354.316562] hpet1: lost 7 rtc interrupts
>> [  355.167894] hpet1: lost 1376 rtc interrupts
>> [  364.089267] hpet1: lost 153 rtc interrupts
>> [  364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [  366.869943] hpet1: lost 177 rtc interrupts
>> [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
>> [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
>> [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  369.940768] parted          D    0  3645   3239 0x00000000
>> [  369.941500] Call Trace:
>> [  369.941874]  ? __schedule+0x6d9/0x74c
>> [  369.942392]  ? wbt_done+0x5e/0x5e
>> [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
>> [  369.943404]  ? wbt_done+0x5e/0x5e
>> [  369.943874]  schedule+0x67/0x78
>> [  369.944298]  io_schedule+0x12/0x33
>> [  369.944771]  rq_qos_wait+0xb5/0x119
>> [  369.945193]  ? karma_partition+0x1c2/0x1c2
>> [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
>> [  369.946151]  wbt_wait+0x85/0xb6
>> [  369.946540]  __rq_qos_throttle+0x23/0x2f
>> [  369.947014]  blk_mq_make_request+0xe6/0x40a
>> [  369.947518]  generic_make_request+0x192/0x2fe
>> [  369.948042]  ? submit_bio+0x103/0x11f
>> [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
>> [  369.949011]  submit_bio+0x103/0x11f
>> [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
>> [  369.949962]  submit_bio_wait+0x53/0x7f
>> [  369.950469]  blkdev_issue_flush+0x8a/0xae
>> [  369.951032]  blkdev_fsync+0x2f/0x3a
>> [  369.951502]  do_fsync+0x2e/0x47
>> [  369.951887]  __x64_sys_fsync+0x10/0x13
>> [  369.952374]  do_syscall_64+0x89/0x149
>> [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  369.953492] RIP: 0033:0x7f95a1e729d4
>> [  369.953996] Code: Bad RIP value.
>> [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
>> [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
>> [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
>> [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
>> [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
>> [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
>> [  375.962995] ================end test sanity/004: (HANG, -1)================
> 
> This one may be a different issue: I just observed that the inflight count of
> the wbt queue (RWQ_BG) becomes -1 after the hang is triggered.

I think it is - can you try and revert:

commit b6c7b58f5fcc2386bddf9852011c42c1d2b83979
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Tue Dec 4 12:59:03 2018 -0500

    block: convert wbt_wait() to use rq_qos_wait()

and see if that helps?
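
For anyone running the same experiment, the revert test is roughly the
following (a sketch only; the commit hash is the one quoted above, and the
rebuild/boot steps are whatever your setup normally uses):

  # Revert the suspected commit on top of the branch under test, then
  # rebuild, reinstall and reboot before re-running the reproducer.
  git revert b6c7b58f5fcc2386bddf9852011c42c1d2b83979
  make -j"$(nproc)" && sudo make modules_install && sudo make install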

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: for-next hangs on test srp/012
  2018-12-12  2:25                     ` Jens Axboe
@ 2018-12-12  4:28                       ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-12  4:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block, Josef Bacik

On Tue, Dec 11, 2018 at 07:25:21PM -0700, Jens Axboe wrote:
> On 12/11/18 7:03 PM, Ming Lei wrote:
> > On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
> >> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> >>> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> >>>> On 12/11/18 6:23 PM, Jens Axboe wrote:
> >>>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
> >>>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
> >>>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Dec 11 2018 at  7:19pm -0500,
> >>>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>
> >>>>>>>>>>> If I run the following subset of blktests:
> >>>>>>>>>>>
> >>>>>>>>>>>   while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>>>>>>
> >>>>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>>>>>>> time the following hang is reported:
> >>>>>>>>>>>
> >>>>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>>>>>>>       Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>>>>>>> fio             D25272 14869  14195 0x00000000
> >>>>>>>>>>> Call Trace:
> >>>>>>>>>>>  __schedule+0x401/0xe50
> >>>>>>>>>>>  schedule+0x4e/0xd0
> >>>>>>>>>>>  io_schedule+0x21/0x50
> >>>>>>>>>>>  blk_mq_get_tag+0x46d/0x640
> >>>>>>>>>>>  blk_mq_get_request+0x7c0/0xa00
> >>>>>>>>>>>  blk_mq_make_request+0x241/0xa70
> >>>>>>>>>>>  generic_make_request+0x411/0x950
> >>>>>>>>>>>  submit_bio+0x9b/0x250
> >>>>>>>>>>>  blkdev_direct_IO+0x7fb/0x870
> >>>>>>>>>>>  generic_file_direct_write+0x119/0x210
> >>>>>>>>>>>  __generic_file_write_iter+0x11c/0x280
> >>>>>>>>>>>  blkdev_write_iter+0x13c/0x220
> >>>>>>>>>>>  aio_write+0x204/0x310
> >>>>>>>>>>>  io_submit_one+0x9c6/0xe70
> >>>>>>>>>>>  __x64_sys_io_submit+0x115/0x340
> >>>>>>>>>>>  do_syscall_64+0x71/0x210
> >>>>>>>>>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>>>>>>
> >>>>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>>>>>>> any pending requests:
> >>>>>>>>>>>
> >>>>>>>>>>> # list-pending-block-requests
> >>>>>>>>>>> dm-0
> >>>>>>>>>>> loop0
> >>>>>>>>>>> loop1
> >>>>>>>>>>> loop2
> >>>>>>>>>>> loop3
> >>>>>>>>>>> loop4
> >>>>>>>>>>> loop5
> >>>>>>>>>>> loop6
> >>>>>>>>>>> loop7
> >>>>>>>>>>> nullb0
> >>>>>>>>>>> nullb1
> >>>>>>>>>>> sda
> >>>>>>>>>>> sdb
> >>>>>>>>>>> sdc
> >>>>>>>>>>> sdd
> >>>>>>>>>>> vda
> >>>>>>>>>>> vdb
> >>>>>>>>>>>
> >>>>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>>>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>>>>>>
> >>>>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>>>>>>
> >>>>>>>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>>>>>>
> >>>>>>>>>> What device is this being run on?
> >>>>>>>>>
> >>>>>>>>> I saw this issue on usb storage too.
> >>>>>>>>>
> >>>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>>>>>>> the sched_tag's sbitmap queue.
> >>>>>>>>
> >>>>>>>> You saw this running the same tests as Bart?
> >>>>>>>
> >>>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >>>>>>> see the attached test script.
> >>>>>>>
> >>>>>>> Mostly it can be triggered in one run, sometimes it needs more.
> >>>>>>
> >>>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> >>>>>> one word in general. But I'd like to try and understand why it hangs.
> >>>>>>
> >>>>>> Are you using a scheduler?
> >>>>>
> >>>>> OK, I think I see what it is: the shallow path is missing the deferred clear.
> >>>>> On top of this, it's probably worth checking at what depths deferred clearing
> >>>>> starts to make sense. For QD == 1, definitely not. But that should be on top
> >>>>> of the fix.
> >>>>>
> >>>>> Can you try this one?
> >>>>
> >>>> This one should be complete. The former should fix the case, this one
> >>>> just adds the last two cases where we ignored ->cleared. For some of
> >>>> them we'll never hit it since deferred isn't used for that, but I
> >>>> think we're better off with full coverage.
> >>>>
> >>>>
> >>>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> >>>> index 2261136ae067..5b3e56d68dab 100644
> >>>> --- a/lib/sbitmap.c
> >>>> +++ b/lib/sbitmap.c
> >>>> @@ -20,6 +20,47 @@
> >>>>  #include <linux/sbitmap.h>
> >>>>  #include <linux/seq_file.h>
> >>>>  
> >>>> +/*
> >>>> + * See if we have deferred clears that we can batch move
> >>>> + */
> >>>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> >>>> +{
> >>>> +	unsigned long mask, val;
> >>>> +	unsigned long __maybe_unused flags;
> >>>> +	bool ret = false;
> >>>> +
> >>>> +	/* Silence bogus lockdep warning */
> >>>> +#if defined(CONFIG_LOCKDEP)
> >>>> +	local_irq_save(flags);
> >>>> +#endif
> >>>> +	spin_lock(&sb->map[index].swap_lock);
> >>>> +
> >>>> +	if (!sb->map[index].cleared)
> >>>> +		goto out_unlock;
> >>>> +
> >>>> +	/*
> >>>> +	 * First get a stable cleared mask, setting the old mask to 0.
> >>>> +	 */
> >>>> +	do {
> >>>> +		mask = sb->map[index].cleared;
> >>>> +	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> >>>> +
> >>>> +	/*
> >>>> +	 * Now clear the masked bits in our free word
> >>>> +	 */
> >>>> +	do {
> >>>> +		val = sb->map[index].word;
> >>>> +	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> >>>> +
> >>>> +	ret = true;
> >>>> +out_unlock:
> >>>> +	spin_unlock(&sb->map[index].swap_lock);
> >>>> +#if defined(CONFIG_LOCKDEP)
> >>>> +	local_irq_restore(flags);
> >>>> +#endif
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>>  int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> >>>>  		      gfp_t flags, int node)
> >>>>  {
> >>>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> >>>>  	unsigned int bits_per_word = 1U << sb->shift;
> >>>>  	unsigned int i;
> >>>>  
> >>>> +	for (i = 0; i < sb->map_nr; i++)
> >>>> +		sbitmap_deferred_clear(sb, i);
> >>>> +
> >>>>  	sb->depth = depth;
> >>>>  	sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
> >>>>  
> >>>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> >>>>  	return nr;
> >>>>  }
> >>>>  
> >>>> -/*
> >>>> - * See if we have deferred clears that we can batch move
> >>>> - */
> >>>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> >>>> -{
> >>>> -	unsigned long mask, val;
> >>>> -	unsigned long __maybe_unused flags;
> >>>> -	bool ret = false;
> >>>> -
> >>>> -	/* Silence bogus lockdep warning */
> >>>> -#if defined(CONFIG_LOCKDEP)
> >>>> -	local_irq_save(flags);
> >>>> -#endif
> >>>> -	spin_lock(&sb->map[index].swap_lock);
> >>>> -
> >>>> -	if (!sb->map[index].cleared)
> >>>> -		goto out_unlock;
> >>>> -
> >>>> -	/*
> >>>> -	 * First get a stable cleared mask, setting the old mask to 0.
> >>>> -	 */
> >>>> -	do {
> >>>> -		mask = sb->map[index].cleared;
> >>>> -	} while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> >>>> -
> >>>> -	/*
> >>>> -	 * Now clear the masked bits in our free word
> >>>> -	 */
> >>>> -	do {
> >>>> -		val = sb->map[index].word;
> >>>> -	} while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> >>>> -
> >>>> -	ret = true;
> >>>> -out_unlock:
> >>>> -	spin_unlock(&sb->map[index].swap_lock);
> >>>> -#if defined(CONFIG_LOCKDEP)
> >>>> -	local_irq_restore(flags);
> >>>> -#endif
> >>>> -	return ret;
> >>>> -}
> >>>> -
> >>>>  static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> >>>>  				     unsigned int alloc_hint, bool round_robin)
> >>>>  {
> >>>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >>>>  	index = SB_NR_TO_INDEX(sb, alloc_hint);
> >>>>  
> >>>>  	for (i = 0; i < sb->map_nr; i++) {
> >>>> +again:
> >>>>  		nr = __sbitmap_get_word(&sb->map[index].word,
> >>>>  					min(sb->map[index].depth, shallow_depth),
> >>>>  					SB_NR_TO_BIT(sb, alloc_hint), true);
> >>>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >>>>  			break;
> >>>>  		}
> >>>>  
> >>>> +		if (sbitmap_deferred_clear(sb, index))
> >>>> +			goto again;
> >>>> +
> >>>>  		/* Jump to next index. */
> >>>>  		index++;
> >>>>  		alloc_hint = index << sb->shift;
> >>>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
> >>>>  	unsigned int i;
> >>>>  
> >>>>  	for (i = 0; i < sb->map_nr; i++) {
> >>>> -		if (sb->map[i].word)
> >>>> +		if (sb->map[i].word & ~sb->map[i].cleared)
> >>>>  			return true;
> >>>>  	}
> >>>>  	return false;
> >>>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
> >>>>  
> >>>>  	for (i = 0; i < sb->map_nr; i++) {
> >>>>  		const struct sbitmap_word *word = &sb->map[i];
> >>>> +		unsigned long mask = word->word & ~word->cleared;
> >>>>  		unsigned long ret;
> >>>>  
> >>>> -		ret = find_first_zero_bit(&word->word, word->depth);
> >>>> +		ret = find_first_zero_bit(&mask, word->depth);
> >>>>  		if (ret < word->depth)
> >>>>  			return true;
> >>>>  	}
> >>>
> >>> This one works too:
> >>>
> >>> Tested-by: Ming Lei <ming.lei@redhat.com>
> >>
> >> Please hold on, I just found that this issue is triggered again after the
> >> 4th run:
> >>
> >> [  135.588599] ================start test sanity/004================
> >> [  136.864175] scsi host14: scsi_debug: version 0188 [20180128]
> >>                  dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
> >> [  136.873018] scsi 14:0:0:0: Direct-Access     Linux    scsi_debug       0188 PQ: 0 ANSI: 7
> >> [  136.874282] sd 14:0:0:0: Power-on or device reset occurred
> >> [  136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
> >> [  136.879284] sd 14:0:0:0: [sde] Write Protect is off
> >> [  136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
> >> [  136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> >> [  136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
> >> [  139.465844]  sdd:
> >> [  139.754360]  sdd:
> >> [  143.650074] hpet1: lost 765 rtc interrupts
> >> [  145.078137]  sdd:
> >> [  145.699220]  sdd:
> >> [  146.693532]  sdd:
> >> [  147.557814]  sdd:
> >> [  148.742290]  sdd:
> >> [  149.429425]  sdd:
> >> [  149.944095]  sdd: sdd1
> >> [  152.631125]  sdd:
> >> [  184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
> >> [  184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
> >> [  184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
> >> [  184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
> >> [  184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [  185.872632]  sdd: sdd1 sdd2
> >> [  242.446969] hpet1: lost 673 rtc interrupts
> >> [  252.701145] hpet1: lost 655 rtc interrupts
> >> [  257.814148] hpet1: lost 300 rtc interrupts
> >> [  278.315651] hpet1: lost 1311 rtc interrupts
> >> [  278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [  336.228781] hpet1: lost 948 rtc interrupts
> >> [  342.623253] hpet1: lost 409 rtc interrupts
> >> [  344.205027] hpet1: lost 14 rtc interrupts
> >> [  344.457622] hpet1: lost 15 rtc interrupts
> >> [  344.584608] hpet1: lost 7 rtc interrupts
> >> [  344.706382] hpet1: lost 7 rtc interrupts
> >> [  345.012803] hpet1: lost 19 rtc interrupts
> >> [  345.109177] hpet1: lost 5 rtc interrupts
> >> [  345.353934] hpet1: lost 14 rtc interrupts
> >> [  345.477123] hpet1: lost 7 rtc interrupts
> >> [  345.530757] hpet1: lost 3 rtc interrupts
> >> [  348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
> >> [  348.328567] hpet1: lost 162 rtc interrupts
> >> [  348.506049] hpet1: lost 10 rtc interrupts
> >> [  350.461310] hpet1: lost 124 rtc interrupts
> >> [  351.307411] hpet1: lost 50 rtc interrupts
> >> [  351.401932] hpet1: lost 6 rtc interrupts
> >> [  351.564204] hpet1: lost 9 rtc interrupts
> >> [  351.648124] hpet1: lost 4 rtc interrupts
> >> [  351.828972] hpet1: lost 11 rtc interrupts
> >> [  353.043051] hpet1: lost 76 rtc interrupts
> >> [  353.705926] hpet1: lost 38 rtc interrupts
> >> [  353.867703] hpet1: lost 9 rtc interrupts
> >> [  354.030183] hpet1: lost 10 rtc interrupts
> >> [  354.071030] hpet1: lost 1 rtc interrupts
> >> [  354.194600] hpet1: lost 7 rtc interrupts
> >> [  354.316562] hpet1: lost 7 rtc interrupts
> >> [  355.167894] hpet1: lost 1376 rtc interrupts
> >> [  364.089267] hpet1: lost 153 rtc interrupts
> >> [  364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [  366.869943] hpet1: lost 177 rtc interrupts
> >> [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
> >> [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
> >> [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [  369.940768] parted          D    0  3645   3239 0x00000000
> >> [  369.941500] Call Trace:
> >> [  369.941874]  ? __schedule+0x6d9/0x74c
> >> [  369.942392]  ? wbt_done+0x5e/0x5e
> >> [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
> >> [  369.943404]  ? wbt_done+0x5e/0x5e
> >> [  369.943874]  schedule+0x67/0x78
> >> [  369.944298]  io_schedule+0x12/0x33
> >> [  369.944771]  rq_qos_wait+0xb5/0x119
> >> [  369.945193]  ? karma_partition+0x1c2/0x1c2
> >> [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
> >> [  369.946151]  wbt_wait+0x85/0xb6
> >> [  369.946540]  __rq_qos_throttle+0x23/0x2f
> >> [  369.947014]  blk_mq_make_request+0xe6/0x40a
> >> [  369.947518]  generic_make_request+0x192/0x2fe
> >> [  369.948042]  ? submit_bio+0x103/0x11f
> >> [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
> >> [  369.949011]  submit_bio+0x103/0x11f
> >> [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
> >> [  369.949962]  submit_bio_wait+0x53/0x7f
> >> [  369.950469]  blkdev_issue_flush+0x8a/0xae
> >> [  369.951032]  blkdev_fsync+0x2f/0x3a
> >> [  369.951502]  do_fsync+0x2e/0x47
> >> [  369.951887]  __x64_sys_fsync+0x10/0x13
> >> [  369.952374]  do_syscall_64+0x89/0x149
> >> [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> [  369.953492] RIP: 0033:0x7f95a1e729d4
> >> [  369.953996] Code: Bad RIP value.
> >> [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> >> [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
> >> [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
> >> [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
> >> [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
> >> [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
> >> [  375.962995] ================end test sanity/004: (HANG, -1)================
> > 
> > This one may be a different issue: I just observed that the inflight count of
> > the wbt queue (RWQ_BG) becomes -1 after the hang is triggered.
> 
> I think it is - can you try and revert:
> 
> commit b6c7b58f5fcc2386bddf9852011c42c1d2b83979
> Author: Josef Bacik <josef@toxicpanda.com>
> Date:   Tue Dec 4 12:59:03 2018 -0500
> 
>     block: convert wbt_wait() to use rq_qos_wait()
> 
> and see if that helps?

This one is a Linus tree issue, just a bit harder to reproduce; it wasn't
caught by my test robot, but it has been triggered several times when I ran
the parted test manually.

There is a similar report from RH internal testing too.
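
For reference, the manual reproduction is roughly of the following shape (a
sketch, not the attached test script; /dev/sdX is a placeholder for the USB
scratch disk, and the iteration count is arbitrary):

  # Sketch only: repeatedly rewrite the partition table on the scratch disk.
  # parted commits each change with an fsync on the block device, which is
  # where the hung task above is stuck (blkdev_fsync -> blkdev_issue_flush).
  for i in $(seq 1 20); do
      parted -s /dev/sdX mklabel msdos
      parted -s /dev/sdX mkpart primary ext4 1MiB 50%
      parted -s /dev/sdX mkpart primary ext4 50% 100%
  done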

thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-12-12  4:28 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
2018-12-11 23:05 ` Mike Snitzer
2018-12-11 23:09   ` Bart Van Assche
2018-12-12  0:22     ` Mike Snitzer
2018-12-12  0:02 ` Jens Axboe
2018-12-12  0:18   ` Bart Van Assche
2018-12-12  0:19   ` Ming Lei
2018-12-12  0:27     ` Mike Snitzer
2018-12-12  0:38       ` Ming Lei
2018-12-12  1:05         ` Jens Axboe
2018-12-12  1:23           ` Jens Axboe
2018-12-12  1:36             ` Jens Axboe
2018-12-12  1:43               ` Ming Lei
2018-12-12  1:44                 ` Jens Axboe
2018-12-12  1:49                 ` Ming Lei
2018-12-12  2:03                   ` Ming Lei
2018-12-12  2:25                     ` Jens Axboe
2018-12-12  4:28                       ` Ming Lei
2018-12-12  1:37             ` Ming Lei
2018-12-12  1:39               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).