* for-next hangs on test srp/012
@ 2018-12-11 22:58 Bart Van Assche
2018-12-11 23:05 ` Mike Snitzer
2018-12-12 0:02 ` Jens Axboe
0 siblings, 2 replies; 20+ messages in thread
From: Bart Van Assche @ 2018-12-11 22:58 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Mike Snitzer
Hi Jens,
If I run the following subset of blktests:
while :; do ./check -q srp && ./check -q nvmeof-mp; done
against today's for-next branch (commit dd2bf2df85a7) then after some
time the following hang is reported:
INFO: task fio:14869 blocked for more than 120 seconds.
Not tainted 4.20.0-rc6-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fio D25272 14869 14195 0x00000000
Call Trace:
__schedule+0x401/0xe50
schedule+0x4e/0xd0
io_schedule+0x21/0x50
blk_mq_get_tag+0x46d/0x640
blk_mq_get_request+0x7c0/0xa00
blk_mq_make_request+0x241/0xa70
generic_make_request+0x411/0x950
submit_bio+0x9b/0x250
blkdev_direct_IO+0x7fb/0x870
generic_file_direct_write+0x119/0x210
__generic_file_write_iter+0x11c/0x280
blkdev_write_iter+0x13c/0x220
aio_write+0x204/0x310
io_submit_one+0x9c6/0xe70
__x64_sys_io_submit+0x115/0x340
do_syscall_64+0x71/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
When that hang occurs my list-pending-block-requests script does not show
any pending requests:
# list-pending-block-requests
dm-0
loop0
loop1
loop2
loop3
loop4
loop5
loop6
loop7
nullb0
nullb1
sda
sdb
sdc
sdd
vda
vdb
Enabling fail_if_no_path mode did not resolve the hang so I don't think
that the root cause is in any of the dm drivers used in this test:
# dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
The same test passes against kernel v4.20-rc6.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
@ 2018-12-11 23:05 ` Mike Snitzer
2018-12-11 23:09 ` Bart Van Assche
2018-12-12 0:02 ` Jens Axboe
1 sibling, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2018-12-11 23:05 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Jens Axboe, linux-block, dm-devel
On Tue, Dec 11 2018 at 5:58pm -0500,
Bart Van Assche <bvanassche@acm.org> wrote:
> Hi Jens,
>
> If I run the following subset of blktests:
>
> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>
> against today's for-next branch (commit dd2bf2df85a7) then after some
> time the following hang is reported:
Jens has applied a fix from me today that I think is very relevant;
it'll be in tomorrow's for-next:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01
Please try with that and see how things go.
Thanks,
Mike
* Re: for-next hangs on test srp/012
2018-12-11 23:05 ` Mike Snitzer
@ 2018-12-11 23:09 ` Bart Van Assche
2018-12-12 0:22 ` Mike Snitzer
0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2018-12-11 23:09 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel
On Tue, 2018-12-11 at 18:05 -0500, Mike Snitzer wrote:
> On Tue, Dec 11 2018 at 5:58pm -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>
> > Hi Jens,
> >
> > If I run the following subset of blktests:
> >
> > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
>
> Jens has applied a fix from me today that I think is very relevant;
> it'll be in tomorrow's for-next:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01
>
> Please try with that and see how things go.
Hi Mike,
Thank you for your reply. I should have mentioned that I used Jens' for-next
branch as starting point and that that commit is present in the branch I used
as a starting point for my tests:
$ git branch --contains c4576aed8d85d808cd6443bda58393d525207d01
* axboe-block-for-next
Bart.
* Re: for-next hangs on test srp/012
2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
2018-12-11 23:05 ` Mike Snitzer
@ 2018-12-12 0:02 ` Jens Axboe
2018-12-12 0:18 ` Bart Van Assche
2018-12-12 0:19 ` Ming Lei
1 sibling, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 0:02 UTC (permalink / raw)
To: Bart Van Assche; +Cc: linux-block, Mike Snitzer
On 12/11/18 3:58 PM, Bart Van Assche wrote:
> Hi Jens,
>
> If I run the following subset of blktests:
>
> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>
> against today's for-next branch (commit dd2bf2df85a7) then after some
> time the following hang is reported:
>
> INFO: task fio:14869 blocked for more than 120 seconds.
> Not tainted 4.20.0-rc6-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> fio D25272 14869 14195 0x00000000
> Call Trace:
> __schedule+0x401/0xe50
> schedule+0x4e/0xd0
> io_schedule+0x21/0x50
> blk_mq_get_tag+0x46d/0x640
> blk_mq_get_request+0x7c0/0xa00
> blk_mq_make_request+0x241/0xa70
> generic_make_request+0x411/0x950
> submit_bio+0x9b/0x250
> blkdev_direct_IO+0x7fb/0x870
> generic_file_direct_write+0x119/0x210
> __generic_file_write_iter+0x11c/0x280
> blkdev_write_iter+0x13c/0x220
> aio_write+0x204/0x310
> io_submit_one+0x9c6/0xe70
> __x64_sys_io_submit+0x115/0x340
> do_syscall_64+0x71/0x210
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> When that hang occurs my list-pending-block-requests script does not show
> any pending requests:
>
> # list-pending-block-requests
> dm-0
> loop0
> loop1
> loop2
> loop3
> loop4
> loop5
> loop6
> loop7
> nullb0
> nullb1
> sda
> sdb
> sdc
> sdd
> vda
> vdb
>
> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> that the root cause is in any of the dm drivers used in this test:
>
> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>
> The same test passes against kernel v4.20-rc6.
What device is this being run on?
--
Jens Axboe
* Re: for-next hangs on test srp/012
2018-12-12 0:02 ` Jens Axboe
@ 2018-12-12 0:18 ` Bart Van Assche
2018-12-12 0:19 ` Ming Lei
1 sibling, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2018-12-12 0:18 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Mike Snitzer
On Tue, 2018-12-11 at 17:02 -0700, Jens Axboe wrote:
> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > Hi Jens,
> >
> > If I run the following subset of blktests:
> >
> > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
> >
> > INFO: task fio:14869 blocked for more than 120 seconds.
> > Not tainted 4.20.0-rc6-dbg+ #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > fio D25272 14869 14195 0x00000000
> > Call Trace:
> > __schedule+0x401/0xe50
> > schedule+0x4e/0xd0
> > io_schedule+0x21/0x50
> > blk_mq_get_tag+0x46d/0x640
> > blk_mq_get_request+0x7c0/0xa00
> > blk_mq_make_request+0x241/0xa70
> > generic_make_request+0x411/0x950
> > submit_bio+0x9b/0x250
> > blkdev_direct_IO+0x7fb/0x870
> > generic_file_direct_write+0x119/0x210
> > __generic_file_write_iter+0x11c/0x280
> > blkdev_write_iter+0x13c/0x220
> > aio_write+0x204/0x310
> > io_submit_one+0x9c6/0xe70
> > __x64_sys_io_submit+0x115/0x340
> > do_syscall_64+0x71/0x210
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > When that hang occurs my list-pending-block-requests script does not show
> > any pending requests:
> >
> > # list-pending-block-requests
> > dm-0
> > loop0
> > loop1
> > loop2
> > loop3
> > loop4
> > loop5
> > loop6
> > loop7
> > nullb0
> > nullb1
> > sda
> > sdb
> > sdc
> > sdd
> > vda
> > vdb
> >
> > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > that the root cause is in any of the dm drivers used in this test:
> >
> > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >
> > The same test passes against kernel v4.20-rc6.
>
> What device is this being run on?
Older versions of the srp and nvmeof-mp tests used the brd block device.
Today these tests use null_blk with memory_backed set to 1. See also
configure_null_blk() in common/multipath-over-rdma. null_blk is accessed
by ib_srpt. The dm-mpath driver is stacked on top of the ib_srp instance
that communicates with the ib_srpt driver. The ib_srp and ib_srpt drivers
communicate with each other over the loopback functionality of the
rdma_rxe driver.
Bart.
* Re: for-next hangs on test srp/012
2018-12-12 0:02 ` Jens Axboe
2018-12-12 0:18 ` Bart Van Assche
@ 2018-12-12 0:19 ` Ming Lei
2018-12-12 0:27 ` Mike Snitzer
1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12 0:19 UTC (permalink / raw)
To: Jens Axboe; +Cc: Bart Van Assche, linux-block, Mike Snitzer
On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > Hi Jens,
> >
> > If I run the following subset of blktests:
> >
> > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >
> > against today's for-next branch (commit dd2bf2df85a7) then after some
> > time the following hang is reported:
> >
> > INFO: task fio:14869 blocked for more than 120 seconds.
> > Not tainted 4.20.0-rc6-dbg+ #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > fio D25272 14869 14195 0x00000000
> > Call Trace:
> > __schedule+0x401/0xe50
> > schedule+0x4e/0xd0
> > io_schedule+0x21/0x50
> > blk_mq_get_tag+0x46d/0x640
> > blk_mq_get_request+0x7c0/0xa00
> > blk_mq_make_request+0x241/0xa70
> > generic_make_request+0x411/0x950
> > submit_bio+0x9b/0x250
> > blkdev_direct_IO+0x7fb/0x870
> > generic_file_direct_write+0x119/0x210
> > __generic_file_write_iter+0x11c/0x280
> > blkdev_write_iter+0x13c/0x220
> > aio_write+0x204/0x310
> > io_submit_one+0x9c6/0xe70
> > __x64_sys_io_submit+0x115/0x340
> > do_syscall_64+0x71/0x210
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > When that hang occurs my list-pending-block-requests script does not show
> > any pending requests:
> >
> > # list-pending-block-requests
> > dm-0
> > loop0
> > loop1
> > loop2
> > loop3
> > loop4
> > loop5
> > loop6
> > loop7
> > nullb0
> > nullb1
> > sda
> > sdb
> > sdc
> > sdd
> > vda
> > vdb
> >
> > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > that the root cause is in any of the dm drivers used in this test:
> >
> > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >
> > The same test passes against kernel v4.20-rc6.
>
> What device is this being run on?
I saw this issue on usb storage too.
Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
the sched_tag's sbitmap queue.
thanks,
Ming Lei
* Re: for-next hangs on test srp/012
2018-12-11 23:09 ` Bart Van Assche
@ 2018-12-12 0:22 ` Mike Snitzer
0 siblings, 0 replies; 20+ messages in thread
From: Mike Snitzer @ 2018-12-12 0:22 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Jens Axboe, linux-block, dm-devel
On Tue, Dec 11 2018 at 6:09pm -0500,
Bart Van Assche <bvanassche@acm.org> wrote:
> On Tue, 2018-12-11 at 18:05 -0500, Mike Snitzer wrote:
> > On Tue, Dec 11 2018 at 5:58pm -0500,
> > Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > > Hi Jens,
> > >
> > > If I run the following subset of blktests:
> > >
> > > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > >
> > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > time the following hang is reported:
> >
> > Jens has applied a fix from me today that I think is very relevant;
> > it'll be in tomorrow's for-next:
> >
> > http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.21/block&id=c4576aed8d85d808cd6443bda58393d525207d01
> >
> > Please try with that and see how things go.
>
> Hi Mike,
>
> Thank you for your reply. I should have mentioned that I used Jens' for-next
> branch as starting point and that that commit is present in the branch I used
> as a starting point for my tests:
>
> $ git branch --contains c4576aed8d85d808cd6443bda58393d525207d01
> * axboe-block-for-next
next-20181211 doesn't contain c4576aed8d85d808cd6443bda58393d525207d01.
I'll try to reproduce, but in the meantime can you just try Jens'
for-4.21/block? It is currently 2c4d5356e64.
* Re: for-next hangs on test srp/012
2018-12-12 0:19 ` Ming Lei
@ 2018-12-12 0:27 ` Mike Snitzer
2018-12-12 0:38 ` Ming Lei
0 siblings, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2018-12-12 0:27 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, Bart Van Assche, linux-block
On Tue, Dec 11 2018 at 7:19pm -0500,
Ming Lei <tom.leiming@gmail.com> wrote:
> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > Hi Jens,
> > >
> > > If I run the following subset of blktests:
> > >
> > > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > >
> > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > time the following hang is reported:
> > >
> > > INFO: task fio:14869 blocked for more than 120 seconds.
> > > Not tainted 4.20.0-rc6-dbg+ #1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > fio D25272 14869 14195 0x00000000
> > > Call Trace:
> > > __schedule+0x401/0xe50
> > > schedule+0x4e/0xd0
> > > io_schedule+0x21/0x50
> > > blk_mq_get_tag+0x46d/0x640
> > > blk_mq_get_request+0x7c0/0xa00
> > > blk_mq_make_request+0x241/0xa70
> > > generic_make_request+0x411/0x950
> > > submit_bio+0x9b/0x250
> > > blkdev_direct_IO+0x7fb/0x870
> > > generic_file_direct_write+0x119/0x210
> > > __generic_file_write_iter+0x11c/0x280
> > > blkdev_write_iter+0x13c/0x220
> > > aio_write+0x204/0x310
> > > io_submit_one+0x9c6/0xe70
> > > __x64_sys_io_submit+0x115/0x340
> > > do_syscall_64+0x71/0x210
> > > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >
> > > When that hang occurs my list-pending-block-requests script does not show
> > > any pending requests:
> > >
> > > # list-pending-block-requests
> > > dm-0
> > > loop0
> > > loop1
> > > loop2
> > > loop3
> > > loop4
> > > loop5
> > > loop6
> > > loop7
> > > nullb0
> > > nullb1
> > > sda
> > > sdb
> > > sdc
> > > sdd
> > > vda
> > > vdb
> > >
> > > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > > that the root cause is in any of the dm drivers used in this test:
> > >
> > > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > >
> > > The same test passes against kernel v4.20-rc6.
> >
> > What device is this being run on?
>
> I saw this issue on usb storage too.
>
> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> the sched_tag's sbitmap queue.
You saw this running the same tests as Bart?
* Re: for-next hangs on test srp/012
2018-12-12 0:27 ` Mike Snitzer
@ 2018-12-12 0:38 ` Ming Lei
2018-12-12 1:05 ` Jens Axboe
0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12 0:38 UTC (permalink / raw)
To: Mike Snitzer; +Cc: Jens Axboe, Bart Van Assche, linux-block
[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]
On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Tue, Dec 11 2018 at 7:19pm -0500,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
> > On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > > On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > > Hi Jens,
> > > >
> > > > If I run the following subset of blktests:
> > > >
> > > > while :; do ./check -q srp && ./check -q nvmeof-mp; done
> > > >
> > > > against today's for-next branch (commit dd2bf2df85a7) then after some
> > > > time the following hang is reported:
> > > >
> > > > INFO: task fio:14869 blocked for more than 120 seconds.
> > > > Not tainted 4.20.0-rc6-dbg+ #1
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > fio D25272 14869 14195 0x00000000
> > > > Call Trace:
> > > > __schedule+0x401/0xe50
> > > > schedule+0x4e/0xd0
> > > > io_schedule+0x21/0x50
> > > > blk_mq_get_tag+0x46d/0x640
> > > > blk_mq_get_request+0x7c0/0xa00
> > > > blk_mq_make_request+0x241/0xa70
> > > > generic_make_request+0x411/0x950
> > > > submit_bio+0x9b/0x250
> > > > blkdev_direct_IO+0x7fb/0x870
> > > > generic_file_direct_write+0x119/0x210
> > > > __generic_file_write_iter+0x11c/0x280
> > > > blkdev_write_iter+0x13c/0x220
> > > > aio_write+0x204/0x310
> > > > io_submit_one+0x9c6/0xe70
> > > > __x64_sys_io_submit+0x115/0x340
> > > > do_syscall_64+0x71/0x210
> > > > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > >
> > > > When that hang occurs my list-pending-block-requests script does not show
> > > > any pending requests:
> > > >
> > > > # list-pending-block-requests
> > > > dm-0
> > > > loop0
> > > > loop1
> > > > loop2
> > > > loop3
> > > > loop4
> > > > loop5
> > > > loop6
> > > > loop7
> > > > nullb0
> > > > nullb1
> > > > sda
> > > > sdb
> > > > sdc
> > > > sdd
> > > > vda
> > > > vdb
> > > >
> > > > Enabling fail_if_no_path mode did not resolve the hang so I don't think
> > > > that the root cause is in any of the dm drivers used in this test:
> > > >
> > > > # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> > > > 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> > > >
> > > > The same test passes against kernel v4.20-rc6.
> > >
> > > What device is this being run on?
> >
> > I saw this issue on usb storage too.
> >
> > Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > the sched_tag's sbitmap queue.
>
> You saw this running the same tests as Bart?
Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
see the attached test script.
Mostly it can be triggered in one run, sometimes it needs more.
Thanks
Ming Lei
[-- Attachment #2: parted --]
[-- Type: application/octet-stream, Size: 953 bytes --]
#!/bin/bash

parted_disk()
{
	DISK=$1
	CNT=$2
	DEV_SIZE=`blockdev --getsz $DISK`
	DEV_SIZE=`expr ${DEV_SIZE} / 2048`MB
	DEV_NAME=`basename $DISK`
	Q_PATH=/sys/block/$DEV_NAME/queue
	if [ -f $Q_PATH/../device/queue_depth ]; then
		MY_SCSI_QD=`cat $Q_PATH/../device/queue_depth`
	else
		MY_SCSI_QD="N/A"
	fi
	MQ_IOSCHEDS=`sed 's/[][]//g' $Q_PATH/scheduler`
	for SCHED in $MQ_IOSCHEDS; do
		echo $SCHED > $Q_PATH/scheduler
		MY_SCHED=`cat $Q_PATH/scheduler | sed -n 's/.*\[\(.*\)\].*/\1/p'`
		echo "Dev. $DEV_NAME, SCHED: $MY_SCHED, SCSI QD: $MY_SCSI_QD"
		count=0
		wipefs -a $DISK && blockdev --rereadpt $DISK
		while [ $count -lt $CNT ]; do
			parted -m -s $DISK mklabel msdos &&
			parted -m -s -a none $DISK mkpart primary 0MB 32MB &&
			parted -m -s -a none $DISK mkpart primary 32MB $DEV_SIZE &&
			parted -s $DISK p
			let count++
			if ! (( count % 10 )); then
				echo $count
			fi
		done
	done
}

DISK=$1
CNT=$2
parted_disk $DISK $CNT
* Re: for-next hangs on test srp/012
2018-12-12 0:38 ` Ming Lei
@ 2018-12-12 1:05 ` Jens Axboe
2018-12-12 1:23 ` Jens Axboe
0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 1:05 UTC (permalink / raw)
To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block
On 12/11/18 5:38 PM, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>
>> On Tue, Dec 11 2018 at 7:19pm -0500,
>> Ming Lei <tom.leiming@gmail.com> wrote:
>>
>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>> Hi Jens,
>>>>>
>>>>> If I run the following subset of blktests:
>>>>>
>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>
>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>> time the following hang is reported:
>>>>>
>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>> Not tainted 4.20.0-rc6-dbg+ #1
>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> fio D25272 14869 14195 0x00000000
>>>>> Call Trace:
>>>>> __schedule+0x401/0xe50
>>>>> schedule+0x4e/0xd0
>>>>> io_schedule+0x21/0x50
>>>>> blk_mq_get_tag+0x46d/0x640
>>>>> blk_mq_get_request+0x7c0/0xa00
>>>>> blk_mq_make_request+0x241/0xa70
>>>>> generic_make_request+0x411/0x950
>>>>> submit_bio+0x9b/0x250
>>>>> blkdev_direct_IO+0x7fb/0x870
>>>>> generic_file_direct_write+0x119/0x210
>>>>> __generic_file_write_iter+0x11c/0x280
>>>>> blkdev_write_iter+0x13c/0x220
>>>>> aio_write+0x204/0x310
>>>>> io_submit_one+0x9c6/0xe70
>>>>> __x64_sys_io_submit+0x115/0x340
>>>>> do_syscall_64+0x71/0x210
>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>
>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>> any pending requests:
>>>>>
>>>>> # list-pending-block-requests
>>>>> dm-0
>>>>> loop0
>>>>> loop1
>>>>> loop2
>>>>> loop3
>>>>> loop4
>>>>> loop5
>>>>> loop6
>>>>> loop7
>>>>> nullb0
>>>>> nullb1
>>>>> sda
>>>>> sdb
>>>>> sdc
>>>>> sdd
>>>>> vda
>>>>> vdb
>>>>>
>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>
>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>
>>>>> The same test passes against kernel v4.20-rc6.
>>>>
>>>> What device is this being run on?
>>>
>>> I saw this issue on usb storage too.
>>>
>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>> the sched_tag's sbitmap queue.
>>
>> You saw this running the same tests as Bart?
>
> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> see the attached test script.
>
> Mostly it can be triggered in one run, sometimes it needs more.
I'll take a look. The ->cleared doesn't make sense for QD=1, or on
one word in general. But I'd like to try and understand why it hangs.
Are you using a scheduler?
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..e6f376287ead 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,50 @@
#include <linux/sbitmap.h>
#include <linux/seq_file.h>
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+ unsigned long mask, val;
+ unsigned long __maybe_unused flags;
+ bool ret = false;
+
+ if (sb->map_nr <= 1)
+ return false;
+
+ /* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+ local_irq_save(flags);
+#endif
+ spin_lock(&sb->map[index].swap_lock);
+
+ if (!sb->map[index].cleared)
+ goto out_unlock;
+
+ /*
+ * First get a stable cleared mask, setting the old mask to 0.
+ */
+ do {
+ mask = sb->map[index].cleared;
+ } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+ /*
+ * Now clear the masked bits in our free word
+ */
+ do {
+ val = sb->map[index].word;
+ } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+ ret = true;
+out_unlock:
+ spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+ local_irq_restore(flags);
+#endif
+ return ret;
+}
+
int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
gfp_t flags, int node)
{
@@ -70,6 +114,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
unsigned int bits_per_word = 1U << sb->shift;
unsigned int i;
+ for (i = 0; i < sb->map_nr; i++)
+ sbitmap_deferred_clear(sb, i);
+
sb->depth = depth;
sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -112,47 +159,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
return nr;
}
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
- unsigned long mask, val;
- unsigned long __maybe_unused flags;
- bool ret = false;
-
- /* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
- local_irq_save(flags);
-#endif
- spin_lock(&sb->map[index].swap_lock);
-
- if (!sb->map[index].cleared)
- goto out_unlock;
-
- /*
- * First get a stable cleared mask, setting the old mask to 0.
- */
- do {
- mask = sb->map[index].cleared;
- } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
- /*
- * Now clear the masked bits in our free word
- */
- do {
- val = sb->map[index].word;
- } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
- ret = true;
-out_unlock:
- spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
- local_irq_restore(flags);
-#endif
- return ret;
-}
-
static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
unsigned int alloc_hint, bool round_robin)
{
@@ -590,7 +596,12 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
unsigned int cpu)
{
- sbitmap_deferred_clear_bit(&sbq->sb, nr);
+ struct sbitmap *sb = &sbq->sb;
+
+ if (sbq->sb.map_nr > 1)
+ sbitmap_deferred_clear_bit(sb, nr);
+ else
+ sbitmap_clear_bit_unlock(sb, nr);
/*
* Pairs with the memory barrier in set_current_state() to ensure the
--
Jens Axboe
* Re: for-next hangs on test srp/012
2018-12-12 1:05 ` Jens Axboe
@ 2018-12-12 1:23 ` Jens Axboe
2018-12-12 1:36 ` Jens Axboe
2018-12-12 1:37 ` Ming Lei
0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 1:23 UTC (permalink / raw)
To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block
On 12/11/18 6:05 PM, Jens Axboe wrote:
> On 12/11/18 5:38 PM, Ming Lei wrote:
>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>> On Tue, Dec 11 2018 at 7:19pm -0500,
>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>
>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> If I run the following subset of blktests:
>>>>>>
>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>
>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>> time the following hang is reported:
>>>>>>
>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> fio D25272 14869 14195 0x00000000
>>>>>> Call Trace:
>>>>>> __schedule+0x401/0xe50
>>>>>> schedule+0x4e/0xd0
>>>>>> io_schedule+0x21/0x50
>>>>>> blk_mq_get_tag+0x46d/0x640
>>>>>> blk_mq_get_request+0x7c0/0xa00
>>>>>> blk_mq_make_request+0x241/0xa70
>>>>>> generic_make_request+0x411/0x950
>>>>>> submit_bio+0x9b/0x250
>>>>>> blkdev_direct_IO+0x7fb/0x870
>>>>>> generic_file_direct_write+0x119/0x210
>>>>>> __generic_file_write_iter+0x11c/0x280
>>>>>> blkdev_write_iter+0x13c/0x220
>>>>>> aio_write+0x204/0x310
>>>>>> io_submit_one+0x9c6/0xe70
>>>>>> __x64_sys_io_submit+0x115/0x340
>>>>>> do_syscall_64+0x71/0x210
>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>
>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>> any pending requests:
>>>>>>
>>>>>> # list-pending-block-requests
>>>>>> dm-0
>>>>>> loop0
>>>>>> loop1
>>>>>> loop2
>>>>>> loop3
>>>>>> loop4
>>>>>> loop5
>>>>>> loop6
>>>>>> loop7
>>>>>> nullb0
>>>>>> nullb1
>>>>>> sda
>>>>>> sdb
>>>>>> sdc
>>>>>> sdd
>>>>>> vda
>>>>>> vdb
>>>>>>
>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>
>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>
>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>
>>>>> What device is this being run on?
>>>>
>>>> I saw this issue on usb storage too.
>>>>
>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>> the sched_tag's sbitmap queue.
>>>
>>> You saw this running the same tests as Bart?
>>
>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>> see the attached test script.
>>
>> Mostly it can be triggered in one run, sometimes it needs more.
>
> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> one word in general. But I'd like to try and understand why it hangs.
>
> Are you using a scheduler?
OK, I think I see what it is: the shallow get path is missing the deferred
clear. On top of this, it's probably worth checking at what depths deferred
clearing starts to make sense. For QD == 1, definitely not. But that should
be on top of the fix.
Can you try this one?
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..d98ba7af6bce 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,47 @@
#include <linux/sbitmap.h>
#include <linux/seq_file.h>
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+ unsigned long mask, val;
+ unsigned long __maybe_unused flags;
+ bool ret = false;
+
+ /* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+ local_irq_save(flags);
+#endif
+ spin_lock(&sb->map[index].swap_lock);
+
+ if (!sb->map[index].cleared)
+ goto out_unlock;
+
+ /*
+ * First get a stable cleared mask, setting the old mask to 0.
+ */
+ do {
+ mask = sb->map[index].cleared;
+ } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+ /*
+ * Now clear the masked bits in our free word
+ */
+ do {
+ val = sb->map[index].word;
+ } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+ ret = true;
+out_unlock:
+ spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+ local_irq_restore(flags);
+#endif
+ return ret;
+}
+
int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
gfp_t flags, int node)
{
@@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
unsigned int bits_per_word = 1U << sb->shift;
unsigned int i;
+ for (i = 0; i < sb->map_nr; i++)
+ sbitmap_deferred_clear(sb, i);
+
sb->depth = depth;
sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
return nr;
}
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
- unsigned long mask, val;
- unsigned long __maybe_unused flags;
- bool ret = false;
-
- /* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
- local_irq_save(flags);
-#endif
- spin_lock(&sb->map[index].swap_lock);
-
- if (!sb->map[index].cleared)
- goto out_unlock;
-
- /*
- * First get a stable cleared mask, setting the old mask to 0.
- */
- do {
- mask = sb->map[index].cleared;
- } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
- /*
- * Now clear the masked bits in our free word
- */
- do {
- val = sb->map[index].word;
- } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
- ret = true;
-out_unlock:
- spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
- local_irq_restore(flags);
-#endif
- return ret;
-}
-
static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
unsigned int alloc_hint, bool round_robin)
{
@@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
index = SB_NR_TO_INDEX(sb, alloc_hint);
for (i = 0; i < sb->map_nr; i++) {
+again:
nr = __sbitmap_get_word(&sb->map[index].word,
min(sb->map[index].depth, shallow_depth),
SB_NR_TO_BIT(sb, alloc_hint), true);
@@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
break;
}
+ if (sbitmap_deferred_clear(sb, index))
+ goto again;
+
/* Jump to next index. */
index++;
alloc_hint = index << sb->shift;
--
Jens Axboe
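(For readers following along outside the kernel tree, the two cmpxchg
loops in sbitmap_deferred_clear() above can be mirrored as a small
userspace sketch using C11 atomics in place of the kernel's cmpxchg();
the struct and function names here are illustrative, not the kernel's,
and the kernel's swap_lock/lockdep handling is omitted:)

```c
/* Userspace sketch of the two-phase batch move: atomically swap out
 * the ->cleared mask, then fold it into ->word. C11 atomics stand in
 * for the kernel's cmpxchg(); locking is omitted for brevity. */
#include <stdatomic.h>
#include <stdbool.h>

struct sbw_atomic {
	_Atomic unsigned long word;     /* allocated bits */
	_Atomic unsigned long cleared;  /* lazily freed bits */
};

static bool sbw_deferred_clear(struct sbw_atomic *w)
{
	unsigned long mask, val;

	if (atomic_load(&w->cleared) == 0)
		return false;

	/* First get a stable cleared mask, setting the old mask to 0. */
	do {
		mask = atomic_load(&w->cleared);
	} while (!atomic_compare_exchange_weak(&w->cleared, &mask, 0));

	/* Now drop the masked bits from the free word. */
	do {
		val = atomic_load(&w->word);
	} while (!atomic_compare_exchange_weak(&w->word, &val, val & ~mask));

	return true;
}
```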
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:23 ` Jens Axboe
@ 2018-12-12 1:36 ` Jens Axboe
2018-12-12 1:43 ` Ming Lei
2018-12-12 1:37 ` Ming Lei
1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 1:36 UTC (permalink / raw)
To: Ming Lei, Mike Snitzer; +Cc: Bart Van Assche, linux-block
On 12/11/18 6:23 PM, Jens Axboe wrote:
> On 12/11/18 6:05 PM, Jens Axboe wrote:
>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>
>>>> On Tue, Dec 11 2018 at 7:19pm -0500,
>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>
>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> If I run the following subset of blktests:
>>>>>>>
>>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>
>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>> time the following hang is reported:
>>>>>>>
>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>> fio D25272 14869 14195 0x00000000
>>>>>>> Call Trace:
>>>>>>> __schedule+0x401/0xe50
>>>>>>> schedule+0x4e/0xd0
>>>>>>> io_schedule+0x21/0x50
>>>>>>> blk_mq_get_tag+0x46d/0x640
>>>>>>> blk_mq_get_request+0x7c0/0xa00
>>>>>>> blk_mq_make_request+0x241/0xa70
>>>>>>> generic_make_request+0x411/0x950
>>>>>>> submit_bio+0x9b/0x250
>>>>>>> blkdev_direct_IO+0x7fb/0x870
>>>>>>> generic_file_direct_write+0x119/0x210
>>>>>>> __generic_file_write_iter+0x11c/0x280
>>>>>>> blkdev_write_iter+0x13c/0x220
>>>>>>> aio_write+0x204/0x310
>>>>>>> io_submit_one+0x9c6/0xe70
>>>>>>> __x64_sys_io_submit+0x115/0x340
>>>>>>> do_syscall_64+0x71/0x210
>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>
>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>> any pending requests:
>>>>>>>
>>>>>>> # list-pending-block-requests
>>>>>>> dm-0
>>>>>>> loop0
>>>>>>> loop1
>>>>>>> loop2
>>>>>>> loop3
>>>>>>> loop4
>>>>>>> loop5
>>>>>>> loop6
>>>>>>> loop7
>>>>>>> nullb0
>>>>>>> nullb1
>>>>>>> sda
>>>>>>> sdb
>>>>>>> sdc
>>>>>>> sdd
>>>>>>> vda
>>>>>>> vdb
>>>>>>>
>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>
>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>
>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>
>>>>>> What device is this being run on?
>>>>>
>>>>> I saw this issue on usb storage too.
>>>>>
>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>> the sched_tag's sbitmap queue.
>>>>
>>>> You saw this running the same tests as Bart?
>>>
>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>> see the attached test script.
>>>
>>> Mostly it can be triggered in one run, sometimes it needs more.
>>
>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>> one word in general. But I'd like to try and understand why it hangs.
>>
>> Are you using a scheduler?
>
> OK, I think I see what it is, the shallow is missing the deferred clear.
> On top of this, probably worth to check at what depths deferred starts
> to make sense. For QD == 1, definitely not. But that should be on top of
> the fix.
>
> Can you try this one?
This one should be complete. The former should fix the case, this one
just adds the last two cases where we ignored ->cleared. For some of
them we'll never hit it since deferred isn't used for that, but I
think we're better off with full coverage.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 2261136ae067..5b3e56d68dab 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -20,6 +20,47 @@
#include <linux/sbitmap.h>
#include <linux/seq_file.h>
+/*
+ * See if we have deferred clears that we can batch move
+ */
+static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
+{
+ unsigned long mask, val;
+ unsigned long __maybe_unused flags;
+ bool ret = false;
+
+ /* Silence bogus lockdep warning */
+#if defined(CONFIG_LOCKDEP)
+ local_irq_save(flags);
+#endif
+ spin_lock(&sb->map[index].swap_lock);
+
+ if (!sb->map[index].cleared)
+ goto out_unlock;
+
+ /*
+ * First get a stable cleared mask, setting the old mask to 0.
+ */
+ do {
+ mask = sb->map[index].cleared;
+ } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
+
+ /*
+ * Now clear the masked bits in our free word
+ */
+ do {
+ val = sb->map[index].word;
+ } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
+
+ ret = true;
+out_unlock:
+ spin_unlock(&sb->map[index].swap_lock);
+#if defined(CONFIG_LOCKDEP)
+ local_irq_restore(flags);
+#endif
+ return ret;
+}
+
int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
gfp_t flags, int node)
{
@@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
unsigned int bits_per_word = 1U << sb->shift;
unsigned int i;
+ for (i = 0; i < sb->map_nr; i++)
+ sbitmap_deferred_clear(sb, i);
+
sb->depth = depth;
sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
@@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
return nr;
}
-/*
- * See if we have deferred clears that we can batch move
- */
-static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
-{
- unsigned long mask, val;
- unsigned long __maybe_unused flags;
- bool ret = false;
-
- /* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
- local_irq_save(flags);
-#endif
- spin_lock(&sb->map[index].swap_lock);
-
- if (!sb->map[index].cleared)
- goto out_unlock;
-
- /*
- * First get a stable cleared mask, setting the old mask to 0.
- */
- do {
- mask = sb->map[index].cleared;
- } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
-
- /*
- * Now clear the masked bits in our free word
- */
- do {
- val = sb->map[index].word;
- } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
-
- ret = true;
-out_unlock:
- spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
- local_irq_restore(flags);
-#endif
- return ret;
-}
-
static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
unsigned int alloc_hint, bool round_robin)
{
@@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
index = SB_NR_TO_INDEX(sb, alloc_hint);
for (i = 0; i < sb->map_nr; i++) {
+again:
nr = __sbitmap_get_word(&sb->map[index].word,
min(sb->map[index].depth, shallow_depth),
SB_NR_TO_BIT(sb, alloc_hint), true);
@@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
break;
}
+ if (sbitmap_deferred_clear(sb, index))
+ goto again;
+
/* Jump to next index. */
index++;
alloc_hint = index << sb->shift;
@@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
unsigned int i;
for (i = 0; i < sb->map_nr; i++) {
- if (sb->map[i].word)
+ if (sb->map[i].word & ~sb->map[i].cleared)
return true;
}
return false;
@@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
for (i = 0; i < sb->map_nr; i++) {
const struct sbitmap_word *word = &sb->map[i];
+ unsigned long mask = word->word & ~word->cleared;
unsigned long ret;
- ret = find_first_zero_bit(&word->word, word->depth);
+ ret = find_first_zero_bit(&mask, word->depth);
if (ret < word->depth)
return true;
}
--
Jens Axboe
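(The extra hunks in this second patch all apply the same masking rule:
a bit set in ->word but also set in ->cleared is already free, not
busy. As a standalone illustration of that rule, with hypothetical
names rather than the kernel's:)

```c
/* Sketch of the busy-bit rule from the second patch: the effective
 * busy mask is word & ~cleared, since bits in ->cleared were freed
 * but not yet folded back into ->word. Names are illustrative. */
#include <stdbool.h>

struct sbw_state {
	unsigned long word;     /* allocated bits */
	unsigned long cleared;  /* lazily freed bits */
};

static bool sbw_any_bit_set(const struct sbw_state *w)
{
	return (w->word & ~w->cleared) != 0;
}
```

Without the masking, a word whose only set bits sit in ->cleared would
still look busy, which is the kind of stale state behind the hang.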
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:23 ` Jens Axboe
2018-12-12 1:36 ` Jens Axboe
@ 2018-12-12 1:37 ` Ming Lei
2018-12-12 1:39 ` Jens Axboe
1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12 1:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On Tue, Dec 11, 2018 at 06:23:31PM -0700, Jens Axboe wrote:
> On 12/11/18 6:05 PM, Jens Axboe wrote:
> > On 12/11/18 5:38 PM, Ming Lei wrote:
> >> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>
> >>> On Tue, Dec 11 2018 at 7:19pm -0500,
> >>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>
> >>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>> Hi Jens,
> >>>>>>
> >>>>>> If I run the following subset of blktests:
> >>>>>>
> >>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>
> >>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>> time the following hang is reported:
> >>>>>>
> >>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>> Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>> fio D25272 14869 14195 0x00000000
> >>>>>> Call Trace:
> >>>>>> __schedule+0x401/0xe50
> >>>>>> schedule+0x4e/0xd0
> >>>>>> io_schedule+0x21/0x50
> >>>>>> blk_mq_get_tag+0x46d/0x640
> >>>>>> blk_mq_get_request+0x7c0/0xa00
> >>>>>> blk_mq_make_request+0x241/0xa70
> >>>>>> generic_make_request+0x411/0x950
> >>>>>> submit_bio+0x9b/0x250
> >>>>>> blkdev_direct_IO+0x7fb/0x870
> >>>>>> generic_file_direct_write+0x119/0x210
> >>>>>> __generic_file_write_iter+0x11c/0x280
> >>>>>> blkdev_write_iter+0x13c/0x220
> >>>>>> aio_write+0x204/0x310
> >>>>>> io_submit_one+0x9c6/0xe70
> >>>>>> __x64_sys_io_submit+0x115/0x340
> >>>>>> do_syscall_64+0x71/0x210
> >>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>
> >>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>> any pending requests:
> >>>>>>
> >>>>>> # list-pending-block-requests
> >>>>>> dm-0
> >>>>>> loop0
> >>>>>> loop1
> >>>>>> loop2
> >>>>>> loop3
> >>>>>> loop4
> >>>>>> loop5
> >>>>>> loop6
> >>>>>> loop7
> >>>>>> nullb0
> >>>>>> nullb1
> >>>>>> sda
> >>>>>> sdb
> >>>>>> sdc
> >>>>>> sdd
> >>>>>> vda
> >>>>>> vdb
> >>>>>>
> >>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>
> >>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>
> >>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>
> >>>>> What device is this being run on?
> >>>>
> >>>> I saw this issue on usb storage too.
> >>>>
> >>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>> the sched_tag's sbitmap queue.
> >>>
> >>> You saw this running the same tests as Bart?
> >>
> >> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >> see the attached test script.
> >>
> >> Mostly it can be triggered in one run, sometimes it needs more.
> >
> > I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > one word in general. But I'd like to try and understand why it hangs.
> >
> > Are you using a scheduler?
>
> OK, I think I see what it is, the shallow is missing the deferred clear.
> On top of this, probably worth to check at what depths deferred starts
> to make sense. For QD == 1, definitely not. But that should be on top of
> the fix.
>
> Can you try this one?
>
>
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 2261136ae067..d98ba7af6bce 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -20,6 +20,47 @@
> #include <linux/sbitmap.h>
> #include <linux/seq_file.h>
>
> +/*
> + * See if we have deferred clears that we can batch move
> + */
> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> +{
> + unsigned long mask, val;
> + unsigned long __maybe_unused flags;
> + bool ret = false;
> +
> + /* Silence bogus lockdep warning */
> +#if defined(CONFIG_LOCKDEP)
> + local_irq_save(flags);
> +#endif
> + spin_lock(&sb->map[index].swap_lock);
> +
> + if (!sb->map[index].cleared)
> + goto out_unlock;
> +
> + /*
> + * First get a stable cleared mask, setting the old mask to 0.
> + */
> + do {
> + mask = sb->map[index].cleared;
> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> +
> + /*
> + * Now clear the masked bits in our free word
> + */
> + do {
> + val = sb->map[index].word;
> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> +
> + ret = true;
> +out_unlock:
> + spin_unlock(&sb->map[index].swap_lock);
> +#if defined(CONFIG_LOCKDEP)
> + local_irq_restore(flags);
> +#endif
> + return ret;
> +}
> +
> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> gfp_t flags, int node)
> {
> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> unsigned int bits_per_word = 1U << sb->shift;
> unsigned int i;
>
> + for (i = 0; i < sb->map_nr; i++)
> + sbitmap_deferred_clear(sb, i);
> +
> sb->depth = depth;
> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>
> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> return nr;
> }
>
> -/*
> - * See if we have deferred clears that we can batch move
> - */
> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> -{
> - unsigned long mask, val;
> - unsigned long __maybe_unused flags;
> - bool ret = false;
> -
> - /* Silence bogus lockdep warning */
> -#if defined(CONFIG_LOCKDEP)
> - local_irq_save(flags);
> -#endif
> - spin_lock(&sb->map[index].swap_lock);
> -
> - if (!sb->map[index].cleared)
> - goto out_unlock;
> -
> - /*
> - * First get a stable cleared mask, setting the old mask to 0.
> - */
> - do {
> - mask = sb->map[index].cleared;
> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> -
> - /*
> - * Now clear the masked bits in our free word
> - */
> - do {
> - val = sb->map[index].word;
> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> -
> - ret = true;
> -out_unlock:
> - spin_unlock(&sb->map[index].swap_lock);
> -#if defined(CONFIG_LOCKDEP)
> - local_irq_restore(flags);
> -#endif
> - return ret;
> -}
> -
> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> unsigned int alloc_hint, bool round_robin)
> {
> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> index = SB_NR_TO_INDEX(sb, alloc_hint);
>
> for (i = 0; i < sb->map_nr; i++) {
> +again:
> nr = __sbitmap_get_word(&sb->map[index].word,
> min(sb->map[index].depth, shallow_depth),
> SB_NR_TO_BIT(sb, alloc_hint), true);
> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> break;
> }
>
> + if (sbitmap_deferred_clear(sb, index))
> + goto again;
> +
> /* Jump to next index. */
> index++;
> alloc_hint = index << sb->shift;
Yeah, this one does work.
thanks,
Ming
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:37 ` Ming Lei
@ 2018-12-12 1:39 ` Jens Axboe
0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 1:39 UTC (permalink / raw)
To: Ming Lei; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On 12/11/18 6:37 PM, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:23:31PM -0700, Jens Axboe wrote:
>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>
>>>>> On Tue, Dec 11 2018 at 7:19pm -0500,
>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>
>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>> Hi Jens,
>>>>>>>>
>>>>>>>> If I run the following subset of blktests:
>>>>>>>>
>>>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>>
>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>>> time the following hang is reported:
>>>>>>>>
>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>> fio D25272 14869 14195 0x00000000
>>>>>>>> Call Trace:
>>>>>>>> __schedule+0x401/0xe50
>>>>>>>> schedule+0x4e/0xd0
>>>>>>>> io_schedule+0x21/0x50
>>>>>>>> blk_mq_get_tag+0x46d/0x640
>>>>>>>> blk_mq_get_request+0x7c0/0xa00
>>>>>>>> blk_mq_make_request+0x241/0xa70
>>>>>>>> generic_make_request+0x411/0x950
>>>>>>>> submit_bio+0x9b/0x250
>>>>>>>> blkdev_direct_IO+0x7fb/0x870
>>>>>>>> generic_file_direct_write+0x119/0x210
>>>>>>>> __generic_file_write_iter+0x11c/0x280
>>>>>>>> blkdev_write_iter+0x13c/0x220
>>>>>>>> aio_write+0x204/0x310
>>>>>>>> io_submit_one+0x9c6/0xe70
>>>>>>>> __x64_sys_io_submit+0x115/0x340
>>>>>>>> do_syscall_64+0x71/0x210
>>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>>
>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>>> any pending requests:
>>>>>>>>
>>>>>>>> # list-pending-block-requests
>>>>>>>> dm-0
>>>>>>>> loop0
>>>>>>>> loop1
>>>>>>>> loop2
>>>>>>>> loop3
>>>>>>>> loop4
>>>>>>>> loop5
>>>>>>>> loop6
>>>>>>>> loop7
>>>>>>>> nullb0
>>>>>>>> nullb1
>>>>>>>> sda
>>>>>>>> sdb
>>>>>>>> sdc
>>>>>>>> sdd
>>>>>>>> vda
>>>>>>>> vdb
>>>>>>>>
>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>>
>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>>
>>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>>
>>>>>>> What device is this being run on?
>>>>>>
>>>>>> I saw this issue on usb storage too.
>>>>>>
>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>> the sched_tag's sbitmap queue.
>>>>>
>>>>> You saw this running the same tests as Bart?
>>>>
>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>> see the attached test script.
>>>>
>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>
>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>> one word in general. But I'd like to try and understand why it hangs.
>>>
>>> Are you using a scheduler?
>>
>> OK, I think I see what it is, the shallow is missing the deferred clear.
>> On top of this, probably worth to check at what depths deferred starts
>> to make sense. For QD == 1, definitely not. But that should be on top of
>> the fix.
>>
>> Can you try this one?
>>
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 2261136ae067..d98ba7af6bce 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -20,6 +20,47 @@
>> #include <linux/sbitmap.h>
>> #include <linux/seq_file.h>
>>
>> +/*
>> + * See if we have deferred clears that we can batch move
>> + */
>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> +{
>> + unsigned long mask, val;
>> + unsigned long __maybe_unused flags;
>> + bool ret = false;
>> +
>> + /* Silence bogus lockdep warning */
>> +#if defined(CONFIG_LOCKDEP)
>> + local_irq_save(flags);
>> +#endif
>> + spin_lock(&sb->map[index].swap_lock);
>> +
>> + if (!sb->map[index].cleared)
>> + goto out_unlock;
>> +
>> + /*
>> + * First get a stable cleared mask, setting the old mask to 0.
>> + */
>> + do {
>> + mask = sb->map[index].cleared;
>> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> +
>> + /*
>> + * Now clear the masked bits in our free word
>> + */
>> + do {
>> + val = sb->map[index].word;
>> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> +
>> + ret = true;
>> +out_unlock:
>> + spin_unlock(&sb->map[index].swap_lock);
>> +#if defined(CONFIG_LOCKDEP)
>> + local_irq_restore(flags);
>> +#endif
>> + return ret;
>> +}
>> +
>> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>> gfp_t flags, int node)
>> {
>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>> unsigned int bits_per_word = 1U << sb->shift;
>> unsigned int i;
>>
>> + for (i = 0; i < sb->map_nr; i++)
>> + sbitmap_deferred_clear(sb, i);
>> +
>> sb->depth = depth;
>> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>
>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>> return nr;
>> }
>>
>> -/*
>> - * See if we have deferred clears that we can batch move
>> - */
>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> -{
>> - unsigned long mask, val;
>> - unsigned long __maybe_unused flags;
>> - bool ret = false;
>> -
>> - /* Silence bogus lockdep warning */
>> -#if defined(CONFIG_LOCKDEP)
>> - local_irq_save(flags);
>> -#endif
>> - spin_lock(&sb->map[index].swap_lock);
>> -
>> - if (!sb->map[index].cleared)
>> - goto out_unlock;
>> -
>> - /*
>> - * First get a stable cleared mask, setting the old mask to 0.
>> - */
>> - do {
>> - mask = sb->map[index].cleared;
>> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> -
>> - /*
>> - * Now clear the masked bits in our free word
>> - */
>> - do {
>> - val = sb->map[index].word;
>> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> -
>> - ret = true;
>> -out_unlock:
>> - spin_unlock(&sb->map[index].swap_lock);
>> -#if defined(CONFIG_LOCKDEP)
>> - local_irq_restore(flags);
>> -#endif
>> - return ret;
>> -}
>> -
>> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>> unsigned int alloc_hint, bool round_robin)
>> {
>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>> index = SB_NR_TO_INDEX(sb, alloc_hint);
>>
>> for (i = 0; i < sb->map_nr; i++) {
>> +again:
>> nr = __sbitmap_get_word(&sb->map[index].word,
>> min(sb->map[index].depth, shallow_depth),
>> SB_NR_TO_BIT(sb, alloc_hint), true);
>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>> break;
>> }
>>
>> + if (sbitmap_deferred_clear(sb, index))
>> + goto again;
>> +
>> /* Jump to next index. */
>> index++;
>> alloc_hint = index << sb->shift;
>
> Yeah, this one does work.
Good, thanks for testing. Can you try the next one too, just for good
measure? Then I can add your Tested-by.
--
Jens Axboe
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:36 ` Jens Axboe
@ 2018-12-12 1:43 ` Ming Lei
2018-12-12 1:44 ` Jens Axboe
2018-12-12 1:49 ` Ming Lei
0 siblings, 2 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-12 1:43 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> On 12/11/18 6:23 PM, Jens Axboe wrote:
> > On 12/11/18 6:05 PM, Jens Axboe wrote:
> >> On 12/11/18 5:38 PM, Ming Lei wrote:
> >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>>
> >>>> On Tue, Dec 11 2018 at 7:19pm -0500,
> >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>>
> >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>>> Hi Jens,
> >>>>>>>
> >>>>>>> If I run the following subset of blktests:
> >>>>>>>
> >>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>>
> >>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>>> time the following hang is reported:
> >>>>>>>
> >>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>>> fio D25272 14869 14195 0x00000000
> >>>>>>> Call Trace:
> >>>>>>> __schedule+0x401/0xe50
> >>>>>>> schedule+0x4e/0xd0
> >>>>>>> io_schedule+0x21/0x50
> >>>>>>> blk_mq_get_tag+0x46d/0x640
> >>>>>>> blk_mq_get_request+0x7c0/0xa00
> >>>>>>> blk_mq_make_request+0x241/0xa70
> >>>>>>> generic_make_request+0x411/0x950
> >>>>>>> submit_bio+0x9b/0x250
> >>>>>>> blkdev_direct_IO+0x7fb/0x870
> >>>>>>> generic_file_direct_write+0x119/0x210
> >>>>>>> __generic_file_write_iter+0x11c/0x280
> >>>>>>> blkdev_write_iter+0x13c/0x220
> >>>>>>> aio_write+0x204/0x310
> >>>>>>> io_submit_one+0x9c6/0xe70
> >>>>>>> __x64_sys_io_submit+0x115/0x340
> >>>>>>> do_syscall_64+0x71/0x210
> >>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>>
> >>>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>>> any pending requests:
> >>>>>>>
> >>>>>>> # list-pending-block-requests
> >>>>>>> dm-0
> >>>>>>> loop0
> >>>>>>> loop1
> >>>>>>> loop2
> >>>>>>> loop3
> >>>>>>> loop4
> >>>>>>> loop5
> >>>>>>> loop6
> >>>>>>> loop7
> >>>>>>> nullb0
> >>>>>>> nullb1
> >>>>>>> sda
> >>>>>>> sdb
> >>>>>>> sdc
> >>>>>>> sdd
> >>>>>>> vda
> >>>>>>> vdb
> >>>>>>>
> >>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>>
> >>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>>
> >>>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>>
> >>>>>> What device is this being run on?
> >>>>>
> >>>>> I saw this issue on usb storage too.
> >>>>>
> >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>>> the sched_tag's sbitmap queue.
> >>>>
> >>>> You saw this running the same tests as Bart?
> >>>
> >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >>> see the attached test script.
> >>>
> >>> Mostly it can be triggered in one run, sometimes it needs more.
> >>
> >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> >> one word in general. But I'd like to try and understand why it hangs.
> >>
> >> Are you using a scheduler?
> >
> > OK, I think I see what it is, the shallow is missing the deferred clear.
> > On top of this, probably worth to check at what depths deferred starts
> > to make sense. For QD == 1, definitely not. But that should be on top of
> > the fix.
> >
> > Can you try this one?
>
> This one should be complete. The former should fix the case, this one
> just adds the last two cases where we ignored ->cleared. For some of
> them we'll never hit it since deferred isn't used for that, but I
> think we're better off with full coverage.
>
>
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 2261136ae067..5b3e56d68dab 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -20,6 +20,47 @@
> #include <linux/sbitmap.h>
> #include <linux/seq_file.h>
>
> +/*
> + * See if we have deferred clears that we can batch move
> + */
> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> +{
> + unsigned long mask, val;
> + unsigned long __maybe_unused flags;
> + bool ret = false;
> +
> + /* Silence bogus lockdep warning */
> +#if defined(CONFIG_LOCKDEP)
> + local_irq_save(flags);
> +#endif
> + spin_lock(&sb->map[index].swap_lock);
> +
> + if (!sb->map[index].cleared)
> + goto out_unlock;
> +
> + /*
> + * First get a stable cleared mask, setting the old mask to 0.
> + */
> + do {
> + mask = sb->map[index].cleared;
> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> +
> + /*
> + * Now clear the masked bits in our free word
> + */
> + do {
> + val = sb->map[index].word;
> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> +
> + ret = true;
> +out_unlock:
> + spin_unlock(&sb->map[index].swap_lock);
> +#if defined(CONFIG_LOCKDEP)
> + local_irq_restore(flags);
> +#endif
> + return ret;
> +}
> +
> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> gfp_t flags, int node)
> {
> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> unsigned int bits_per_word = 1U << sb->shift;
> unsigned int i;
>
> + for (i = 0; i < sb->map_nr; i++)
> + sbitmap_deferred_clear(sb, i);
> +
> sb->depth = depth;
> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>
> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> return nr;
> }
>
> -/*
> - * See if we have deferred clears that we can batch move
> - */
> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> -{
> - unsigned long mask, val;
> - unsigned long __maybe_unused flags;
> - bool ret = false;
> -
> - /* Silence bogus lockdep warning */
> -#if defined(CONFIG_LOCKDEP)
> - local_irq_save(flags);
> -#endif
> - spin_lock(&sb->map[index].swap_lock);
> -
> - if (!sb->map[index].cleared)
> - goto out_unlock;
> -
> - /*
> - * First get a stable cleared mask, setting the old mask to 0.
> - */
> - do {
> - mask = sb->map[index].cleared;
> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> -
> - /*
> - * Now clear the masked bits in our free word
> - */
> - do {
> - val = sb->map[index].word;
> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> -
> - ret = true;
> -out_unlock:
> - spin_unlock(&sb->map[index].swap_lock);
> -#if defined(CONFIG_LOCKDEP)
> - local_irq_restore(flags);
> -#endif
> - return ret;
> -}
> -
> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> unsigned int alloc_hint, bool round_robin)
> {
> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> index = SB_NR_TO_INDEX(sb, alloc_hint);
>
> for (i = 0; i < sb->map_nr; i++) {
> +again:
> nr = __sbitmap_get_word(&sb->map[index].word,
> min(sb->map[index].depth, shallow_depth),
> SB_NR_TO_BIT(sb, alloc_hint), true);
> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> break;
> }
>
> + if (sbitmap_deferred_clear(sb, index))
> + goto again;
> +
> /* Jump to next index. */
> index++;
> alloc_hint = index << sb->shift;
> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
> unsigned int i;
>
> for (i = 0; i < sb->map_nr; i++) {
> - if (sb->map[i].word)
> + if (sb->map[i].word & ~sb->map[i].cleared)
> return true;
> }
> return false;
> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>
> for (i = 0; i < sb->map_nr; i++) {
> const struct sbitmap_word *word = &sb->map[i];
> + unsigned long mask = word->word & ~word->cleared;
> unsigned long ret;
>
> - ret = find_first_zero_bit(&word->word, word->depth);
> + ret = find_first_zero_bit(&mask, word->depth);
> if (ret < word->depth)
> return true;
> }
This one works too:
Tested-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:43 ` Ming Lei
@ 2018-12-12 1:44 ` Jens Axboe
2018-12-12 1:49 ` Ming Lei
1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 1:44 UTC (permalink / raw)
To: Ming Lei; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On 12/11/18 6:43 PM, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
>> On 12/11/18 6:23 PM, Jens Axboe wrote:
>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>>
>>>>>> On Tue, Dec 11 2018 at 7:19pm -0500,
>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>> [Bart's report snipped; quoted in full at the top of the thread]
>>>>>>>>
>>>>>>>> What device is this being run on?
>>>>>>>
>>>>>>> I saw this issue on usb storage too.
>>>>>>>
>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>>> the sched_tag's sbitmap queue.
>>>>>>
>>>>>> You saw this running the same tests as Bart?
>>>>>
>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>>> see the attached test script.
>>>>>
>>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>>
>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>>> one word in general. But I'd like to try and understand why it hangs.
>>>>
>>>> Are you using a scheduler?
>>>
>>> OK, I think I see what it is, the shallow is missing the deferred clear.
>>> On top of this, probably worth to check at what depths deferred starts
>>> to make sense. For QD == 1, definitely not. But that should be on top of
>>> the fix.
>>>
>>> Can you try this one?
>>
>> This one should be complete. The former should fix the case, this one
>> just adds the last two cases where we ignored ->cleared. For some of
>> them we'll never hit it since deferred isn't used for that, but I
>> think we're better off with full coverage.
>>
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index 2261136ae067..5b3e56d68dab 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -20,6 +20,47 @@
>> #include <linux/sbitmap.h>
>> #include <linux/seq_file.h>
>>
>> +/*
>> + * See if we have deferred clears that we can batch move
>> + */
>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> +{
>> + unsigned long mask, val;
>> + unsigned long __maybe_unused flags;
>> + bool ret = false;
>> +
>> + /* Silence bogus lockdep warning */
>> +#if defined(CONFIG_LOCKDEP)
>> + local_irq_save(flags);
>> +#endif
>> + spin_lock(&sb->map[index].swap_lock);
>> +
>> + if (!sb->map[index].cleared)
>> + goto out_unlock;
>> +
>> + /*
>> + * First get a stable cleared mask, setting the old mask to 0.
>> + */
>> + do {
>> + mask = sb->map[index].cleared;
>> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> +
>> + /*
>> + * Now clear the masked bits in our free word
>> + */
>> + do {
>> + val = sb->map[index].word;
>> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> +
>> + ret = true;
>> +out_unlock:
>> + spin_unlock(&sb->map[index].swap_lock);
>> +#if defined(CONFIG_LOCKDEP)
>> + local_irq_restore(flags);
>> +#endif
>> + return ret;
>> +}
>> +
>> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>> gfp_t flags, int node)
>> {
>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>> unsigned int bits_per_word = 1U << sb->shift;
>> unsigned int i;
>>
>> + for (i = 0; i < sb->map_nr; i++)
>> + sbitmap_deferred_clear(sb, i);
>> +
>> sb->depth = depth;
>> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>
>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>> return nr;
>> }
>>
>> -/*
>> - * See if we have deferred clears that we can batch move
>> - */
>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> -{
>> - unsigned long mask, val;
>> - unsigned long __maybe_unused flags;
>> - bool ret = false;
>> -
>> - /* Silence bogus lockdep warning */
>> -#if defined(CONFIG_LOCKDEP)
>> - local_irq_save(flags);
>> -#endif
>> - spin_lock(&sb->map[index].swap_lock);
>> -
>> - if (!sb->map[index].cleared)
>> - goto out_unlock;
>> -
>> - /*
>> - * First get a stable cleared mask, setting the old mask to 0.
>> - */
>> - do {
>> - mask = sb->map[index].cleared;
>> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>> -
>> - /*
>> - * Now clear the masked bits in our free word
>> - */
>> - do {
>> - val = sb->map[index].word;
>> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>> -
>> - ret = true;
>> -out_unlock:
>> - spin_unlock(&sb->map[index].swap_lock);
>> -#if defined(CONFIG_LOCKDEP)
>> - local_irq_restore(flags);
>> -#endif
>> - return ret;
>> -}
>> -
>> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>> unsigned int alloc_hint, bool round_robin)
>> {
>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>> index = SB_NR_TO_INDEX(sb, alloc_hint);
>>
>> for (i = 0; i < sb->map_nr; i++) {
>> +again:
>> nr = __sbitmap_get_word(&sb->map[index].word,
>> min(sb->map[index].depth, shallow_depth),
>> SB_NR_TO_BIT(sb, alloc_hint), true);
>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>> break;
>> }
>>
>> + if (sbitmap_deferred_clear(sb, index))
>> + goto again;
>> +
>> /* Jump to next index. */
>> index++;
>> alloc_hint = index << sb->shift;
>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
>> unsigned int i;
>>
>> for (i = 0; i < sb->map_nr; i++) {
>> - if (sb->map[i].word)
>> + if (sb->map[i].word & ~sb->map[i].cleared)
>> return true;
>> }
>> return false;
>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>>
>> for (i = 0; i < sb->map_nr; i++) {
>> const struct sbitmap_word *word = &sb->map[i];
>> + unsigned long mask = word->word & ~word->cleared;
>> unsigned long ret;
>>
>> - ret = find_first_zero_bit(&word->word, word->depth);
>> + ret = find_first_zero_bit(&mask, word->depth);
>> if (ret < word->depth)
>> return true;
>> }
>
> This one works too:
>
> Tested-by: Ming Lei <ming.lei@redhat.com>
Awesome, thanks Ming!
--
Jens Axboe
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:43 ` Ming Lei
2018-12-12 1:44 ` Jens Axboe
@ 2018-12-12 1:49 ` Ming Lei
2018-12-12 2:03 ` Ming Lei
1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12 1:49 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> > On 12/11/18 6:23 PM, Jens Axboe wrote:
> > > On 12/11/18 6:05 PM, Jens Axboe wrote:
> > >> On 12/11/18 5:38 PM, Ming Lei wrote:
> > >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> > >>>>
> > >>>> On Tue, Dec 11 2018 at 7:19pm -0500,
> > >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> > >>>>
> > >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>
> > >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > >>>>>>> Hi Jens,
> > >>>>>>>
> > >>>>>>>>> [Bart's report snipped; quoted in full at the top of the thread]
> > >>>>>>
> > >>>>>> What device is this being run on?
> > >>>>>
> > >>>>> I saw this issue on usb storage too.
> > >>>>>
> > >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > >>>>> the sched_tag's sbitmap queue.
> > >>>>
> > >>>> You saw this running the same tests as Bart?
> > >>>
> > >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> > >>> see the attached test script.
> > >>>
> > >>> Mostly it can be triggered in one run, sometimes it needs more.
> > >>
> > >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > >> one word in general. But I'd like to try and understand why it hangs.
> > >>
> > >> Are you using a scheduler?
> > >
> > > OK, I think I see what it is, the shallow is missing the deferred clear.
> > > On top of this, probably worth to check at what depths deferred starts
> > > to make sense. For QD == 1, definitely not. But that should be on top of
> > > the fix.
> > >
> > > Can you try this one?
> >
> > This one should be complete. The former should fix the case, this one
> > just adds the last two cases where we ignored ->cleared. For some of
> > them we'll never hit it since deferred isn't used for that, but I
> > think we're better off with full coverage.
> >
> >
> > [Jens's sbitmap patch snipped; quoted in full in his message above]
>
> This one works too:
>
> Tested-by: Ming Lei <ming.lei@redhat.com>
Please hold on, just found this issue is triggered again after the 4th
run,
[ 135.588599] ================start test sanity/004================
[ 136.864175] scsi host14: scsi_debug: version 0188 [20180128]
dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
[ 136.873018] scsi 14:0:0:0: Direct-Access Linux scsi_debug 0188 PQ: 0 ANSI: 7
[ 136.874282] sd 14:0:0:0: Power-on or device reset occurred
[ 136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[ 136.879284] sd 14:0:0:0: [sde] Write Protect is off
[ 136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
[ 136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
[ 139.465844] sdd:
[ 139.754360] sdd:
[ 143.650074] hpet1: lost 765 rtc interrupts
[ 145.078137] sdd:
[ 145.699220] sdd:
[ 146.693532] sdd:
[ 147.557814] sdd:
[ 148.742290] sdd:
[ 149.429425] sdd:
[ 149.944095] sdd: sdd1
[ 152.631125] sdd:
[ 184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
[ 184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
[ 184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
[ 184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
[ 185.872632] sdd: sdd1 sdd2
[ 242.446969] hpet1: lost 673 rtc interrupts
[ 252.701145] hpet1: lost 655 rtc interrupts
[ 257.814148] hpet1: lost 300 rtc interrupts
[ 278.315651] hpet1: lost 1311 rtc interrupts
[ 278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
[ 336.228781] hpet1: lost 948 rtc interrupts
[ 342.623253] hpet1: lost 409 rtc interrupts
[ 344.205027] hpet1: lost 14 rtc interrupts
[ 344.457622] hpet1: lost 15 rtc interrupts
[ 344.584608] hpet1: lost 7 rtc interrupts
[ 344.706382] hpet1: lost 7 rtc interrupts
[ 345.012803] hpet1: lost 19 rtc interrupts
[ 345.109177] hpet1: lost 5 rtc interrupts
[ 345.353934] hpet1: lost 14 rtc interrupts
[ 345.477123] hpet1: lost 7 rtc interrupts
[ 345.530757] hpet1: lost 3 rtc interrupts
[ 348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
[ 348.328567] hpet1: lost 162 rtc interrupts
[ 348.506049] hpet1: lost 10 rtc interrupts
[ 350.461310] hpet1: lost 124 rtc interrupts
[ 351.307411] hpet1: lost 50 rtc interrupts
[ 351.401932] hpet1: lost 6 rtc interrupts
[ 351.564204] hpet1: lost 9 rtc interrupts
[ 351.648124] hpet1: lost 4 rtc interrupts
[ 351.828972] hpet1: lost 11 rtc interrupts
[ 353.043051] hpet1: lost 76 rtc interrupts
[ 353.705926] hpet1: lost 38 rtc interrupts
[ 353.867703] hpet1: lost 9 rtc interrupts
[ 354.030183] hpet1: lost 10 rtc interrupts
[ 354.071030] hpet1: lost 1 rtc interrupts
[ 354.194600] hpet1: lost 7 rtc interrupts
[ 354.316562] hpet1: lost 7 rtc interrupts
[ 355.167894] hpet1: lost 1376 rtc interrupts
[ 364.089267] hpet1: lost 153 rtc interrupts
[ 364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
[ 366.869943] hpet1: lost 177 rtc interrupts
[ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 369.940768] parted D 0 3645 3239 0x00000000
[ 369.941500] Call Trace:
[ 369.941874] ? __schedule+0x6d9/0x74c
[ 369.942392] ? wbt_done+0x5e/0x5e
[ 369.942864] ? wbt_cleanup_cb+0x16/0x16
[ 369.943404] ? wbt_done+0x5e/0x5e
[ 369.943874] schedule+0x67/0x78
[ 369.944298] io_schedule+0x12/0x33
[ 369.944771] rq_qos_wait+0xb5/0x119
[ 369.945193] ? karma_partition+0x1c2/0x1c2
[ 369.945691] ? wbt_cleanup_cb+0x16/0x16
[ 369.946151] wbt_wait+0x85/0xb6
[ 369.946540] __rq_qos_throttle+0x23/0x2f
[ 369.947014] blk_mq_make_request+0xe6/0x40a
[ 369.947518] generic_make_request+0x192/0x2fe
[ 369.948042] ? submit_bio+0x103/0x11f
[ 369.948486] ? __radix_tree_lookup+0x35/0xb5
[ 369.949011] submit_bio+0x103/0x11f
[ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
[ 369.949962] submit_bio_wait+0x53/0x7f
[ 369.950469] blkdev_issue_flush+0x8a/0xae
[ 369.951032] blkdev_fsync+0x2f/0x3a
[ 369.951502] do_fsync+0x2e/0x47
[ 369.951887] __x64_sys_fsync+0x10/0x13
[ 369.952374] do_syscall_64+0x89/0x149
[ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 369.953492] RIP: 0033:0x7f95a1e729d4
[ 369.953996] Code: Bad RIP value.
[ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
[ 375.962995] ================end test sanity/004: (HANG, -1)================
Thanks,
Ming
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: for-next hangs on test srp/012
2018-12-12 1:49 ` Ming Lei
@ 2018-12-12 2:03 ` Ming Lei
2018-12-12 2:25 ` Jens Axboe
0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-12 2:03 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block
On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> > On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> > > On 12/11/18 6:23 PM, Jens Axboe wrote:
> > > > On 12/11/18 6:05 PM, Jens Axboe wrote:
> > > >> On 12/11/18 5:38 PM, Ming Lei wrote:
> > > >>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> > > >>>>
> > > >>>> On Tue, Dec 11 2018 at 7:19pm -0500,
> > > >>>> Ming Lei <tom.leiming@gmail.com> wrote:
> > > >>>>
> > > >>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> > > >>>>>>
> > > >>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> > > >>>>>>> Hi Jens,
> > > >>>>>>>
> > > >>>>>>> [Bart's report snipped; quoted in full at the top of the thread]
> > > >>>>>>
> > > >>>>>> What device is this being run on?
> > > >>>>>
> > > >>>>> I saw this issue on usb storage too.
> > > >>>>>
> > > >>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> > > >>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> > > >>>>> the sched_tag's sbitmap queue.
> > > >>>>
> > > >>>> You saw this running the same tests as Bart?
> > > >>>
> > > >>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> > > >>> see the attached test script.
> > > >>>
> > > >>> Mostly it can be triggered in one run, sometimes it needs more.
> > > >>
> > > >> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> > > >> one word in general. But I'd like to try and understand why it hangs.
> > > >>
> > > >> Are you using a scheduler?
> > > >
> > > > OK, I think I see what it is, the shallow is missing the deferred clear.
> > > > On top of this, probably worth to check at what depths deferred starts
> > > > to make sense. For QD == 1, definitely not. But that should be on top of
> > > > the fix.
> > > >
> > > > Can you try this one?
> > >
> > > This one should be complete. The former should fix the case, this one
> > > just adds the last two cases where we ignored ->cleared. For some of
> > > them we'll never hit it since deferred isn't used for that, but I
> > > think we're better off with full coverage.
> > >
> > >
> > > [Jens's sbitmap patch snipped; quoted in full in his message above]
> >
> > This one works too:
> >
> > Tested-by: Ming Lei <ming.lei@redhat.com>
>
> Please hold on, just found this issue is triggered again after the 4th
> run,
>
> [ 135.588599] ================start test sanity/004================
> [ 136.864175] scsi host14: scsi_debug: version 0188 [20180128]
> dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
> [ 136.873018] scsi 14:0:0:0: Direct-Access Linux scsi_debug 0188 PQ: 0 ANSI: 7
> [ 136.874282] sd 14:0:0:0: Power-on or device reset occurred
> [ 136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
> [ 136.879284] sd 14:0:0:0: [sde] Write Protect is off
> [ 136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
> [ 136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> [ 136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
> [ 139.465844] sdd:
> [ 139.754360] sdd:
> [ 143.650074] hpet1: lost 765 rtc interrupts
> [ 145.078137] sdd:
> [ 145.699220] sdd:
> [ 146.693532] sdd:
> [ 147.557814] sdd:
> [ 148.742290] sdd:
> [ 149.429425] sdd:
> [ 149.944095] sdd: sdd1
> [ 152.631125] sdd:
> [ 184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
> [ 184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
> [ 184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
> [ 184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
> [ 184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [ 185.872632] sdd: sdd1 sdd2
> [ 242.446969] hpet1: lost 673 rtc interrupts
> [ 252.701145] hpet1: lost 655 rtc interrupts
> [ 257.814148] hpet1: lost 300 rtc interrupts
> [ 278.315651] hpet1: lost 1311 rtc interrupts
> [ 278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [ 336.228781] hpet1: lost 948 rtc interrupts
> [ 342.623253] hpet1: lost 409 rtc interrupts
> [ 344.205027] hpet1: lost 14 rtc interrupts
> [ 344.457622] hpet1: lost 15 rtc interrupts
> [ 344.584608] hpet1: lost 7 rtc interrupts
> [ 344.706382] hpet1: lost 7 rtc interrupts
> [ 345.012803] hpet1: lost 19 rtc interrupts
> [ 345.109177] hpet1: lost 5 rtc interrupts
> [ 345.353934] hpet1: lost 14 rtc interrupts
> [ 345.477123] hpet1: lost 7 rtc interrupts
> [ 345.530757] hpet1: lost 3 rtc interrupts
> [ 348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
> [ 348.328567] hpet1: lost 162 rtc interrupts
> [ 348.506049] hpet1: lost 10 rtc interrupts
> [ 350.461310] hpet1: lost 124 rtc interrupts
> [ 351.307411] hpet1: lost 50 rtc interrupts
> [ 351.401932] hpet1: lost 6 rtc interrupts
> [ 351.564204] hpet1: lost 9 rtc interrupts
> [ 351.648124] hpet1: lost 4 rtc interrupts
> [ 351.828972] hpet1: lost 11 rtc interrupts
> [ 353.043051] hpet1: lost 76 rtc interrupts
> [ 353.705926] hpet1: lost 38 rtc interrupts
> [ 353.867703] hpet1: lost 9 rtc interrupts
> [ 354.030183] hpet1: lost 10 rtc interrupts
> [ 354.071030] hpet1: lost 1 rtc interrupts
> [ 354.194600] hpet1: lost 7 rtc interrupts
> [ 354.316562] hpet1: lost 7 rtc interrupts
> [ 355.167894] hpet1: lost 1376 rtc interrupts
> [ 364.089267] hpet1: lost 153 rtc interrupts
> [ 364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
> [ 366.869943] hpet1: lost 177 rtc interrupts
> [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
> [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
> [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 369.940768] parted D 0 3645 3239 0x00000000
> [ 369.941500] Call Trace:
> [ 369.941874] ? __schedule+0x6d9/0x74c
> [ 369.942392] ? wbt_done+0x5e/0x5e
> [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
> [ 369.943404] ? wbt_done+0x5e/0x5e
> [ 369.943874] schedule+0x67/0x78
> [ 369.944298] io_schedule+0x12/0x33
> [ 369.944771] rq_qos_wait+0xb5/0x119
> [ 369.945193] ? karma_partition+0x1c2/0x1c2
> [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
> [ 369.946151] wbt_wait+0x85/0xb6
> [ 369.946540] __rq_qos_throttle+0x23/0x2f
> [ 369.947014] blk_mq_make_request+0xe6/0x40a
> [ 369.947518] generic_make_request+0x192/0x2fe
> [ 369.948042] ? submit_bio+0x103/0x11f
> [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
> [ 369.949011] submit_bio+0x103/0x11f
> [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
> [ 369.949962] submit_bio_wait+0x53/0x7f
> [ 369.950469] blkdev_issue_flush+0x8a/0xae
> [ 369.951032] blkdev_fsync+0x2f/0x3a
> [ 369.951502] do_fsync+0x2e/0x47
> [ 369.951887] __x64_sys_fsync+0x10/0x13
> [ 369.952374] do_syscall_64+0x89/0x149
> [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 369.953492] RIP: 0033:0x7f95a1e729d4
> [ 369.953996] Code: Bad RIP value.
> [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
> [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
> [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
> [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
> [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
> [ 375.962995] ================end test sanity/004: (HANG, -1)================
This hang may be a different issue: I just observed that the wbt queue's (RWQ_BG)
inflight counter becomes -1 after the hang is triggered.
Thanks,
Ming
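For reference, the deferred-clear batching that the patch above moves to the top of lib/sbitmap.c can be sketched in userspace C11. This is a minimal sketch, not kernel code: `sb_word` is a simplified stand-in for `struct sbitmap_word`, the kernel's cmpxchg() loops are expressed with C11 atomic exchange/fetch-and (equivalent for this purpose), and the swap_lock/lockdep handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct sbitmap_word:
 * set bits in ->word are allocated; freed bits are batched in
 * ->cleared instead of being cleared in ->word immediately. */
struct sb_word {
	_Atomic unsigned long word;
	_Atomic unsigned long cleared;
};

/* Cheap free: record the bit in ->cleared only. */
static void sb_clear_bit(struct sb_word *w, unsigned int bit)
{
	atomic_fetch_or(&w->cleared, 1UL << bit);
}

/* Batch-move deferred clears into ->word; returns true if any
 * bits were reclaimed, so a failed allocation can retry. */
static bool sb_deferred_clear(struct sb_word *w)
{
	/* First get a stable cleared mask, resetting it to 0. */
	unsigned long mask = atomic_exchange(&w->cleared, 0);

	if (!mask)
		return false;
	/* Now drop the masked bits from the allocated word. */
	atomic_fetch_and(&w->word, ~mask);
	return true;
}
```

The point of the scheme is that frees touch only ->cleared, so the cacheline holding ->word is not dirtied on every completion; the cost is that every reader of ->word must either reclaim first or mask with ~cleared, which is exactly what the call sites added by the patch do.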
* Re: for-next hangs on test srp/012
2018-12-12 2:03 ` Ming Lei
@ 2018-12-12 2:25 ` Jens Axboe
2018-12-12 4:28 ` Ming Lei
0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-12 2:25 UTC (permalink / raw)
To: Ming Lei
Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block, Josef Bacik
On 12/11/18 7:03 PM, Ming Lei wrote:
> On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
>> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
>>> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
>>>> On 12/11/18 6:23 PM, Jens Axboe wrote:
>>>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
>>>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
>>>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 11 2018 at 7:19pm -0500,
>>>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>
>>>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>
>>>>>>>>>>> If I run the following subset of blktests:
>>>>>>>>>>>
>>>>>>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
>>>>>>>>>>>
>>>>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
>>>>>>>>>>> time the following hang is reported:
>>>>>>>>>>>
>>>>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
>>>>>>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
>>>>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>>>>>> fio D25272 14869 14195 0x00000000
>>>>>>>>>>> Call Trace:
>>>>>>>>>>> __schedule+0x401/0xe50
>>>>>>>>>>> schedule+0x4e/0xd0
>>>>>>>>>>> io_schedule+0x21/0x50
>>>>>>>>>>> blk_mq_get_tag+0x46d/0x640
>>>>>>>>>>> blk_mq_get_request+0x7c0/0xa00
>>>>>>>>>>> blk_mq_make_request+0x241/0xa70
>>>>>>>>>>> generic_make_request+0x411/0x950
>>>>>>>>>>> submit_bio+0x9b/0x250
>>>>>>>>>>> blkdev_direct_IO+0x7fb/0x870
>>>>>>>>>>> generic_file_direct_write+0x119/0x210
>>>>>>>>>>> __generic_file_write_iter+0x11c/0x280
>>>>>>>>>>> blkdev_write_iter+0x13c/0x220
>>>>>>>>>>> aio_write+0x204/0x310
>>>>>>>>>>> io_submit_one+0x9c6/0xe70
>>>>>>>>>>> __x64_sys_io_submit+0x115/0x340
>>>>>>>>>>> do_syscall_64+0x71/0x210
>>>>>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>>>>>>>
>>>>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
>>>>>>>>>>> any pending requests:
>>>>>>>>>>>
>>>>>>>>>>> # list-pending-block-requests
>>>>>>>>>>> dm-0
>>>>>>>>>>> loop0
>>>>>>>>>>> loop1
>>>>>>>>>>> loop2
>>>>>>>>>>> loop3
>>>>>>>>>>> loop4
>>>>>>>>>>> loop5
>>>>>>>>>>> loop6
>>>>>>>>>>> loop7
>>>>>>>>>>> nullb0
>>>>>>>>>>> nullb1
>>>>>>>>>>> sda
>>>>>>>>>>> sdb
>>>>>>>>>>> sdc
>>>>>>>>>>> sdd
>>>>>>>>>>> vda
>>>>>>>>>>> vdb
>>>>>>>>>>>
>>>>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
>>>>>>>>>>> that the root cause is in any of the dm drivers used in this test:
>>>>>>>>>>>
>>>>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
>>>>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
>>>>>>>>>>>
>>>>>>>>>>> The same test passes against kernel v4.20-rc6.
>>>>>>>>>>
>>>>>>>>>> What device is this being run on?
>>>>>>>>>
>>>>>>>>> I saw this issue on usb storage too.
>>>>>>>>>
>>>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
>>>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
>>>>>>>>> the sched_tag's sbitmap queue.
>>>>>>>>
>>>>>>>> You saw this running the same tests as Bart?
>>>>>>>
>>>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
>>>>>>> see the attached test script.
>>>>>>>
>>>>>>> Mostly it can be triggered in one run, sometimes it needs more.
>>>>>>
>>>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
>>>>>> one word in general. But I'd like to try and understand why it hangs.
>>>>>>
>>>>>> Are you using a scheduler?
>>>>>
>>>>> OK, I think I see what it is: the shallow get path is missing the deferred
>>>>> clear. On top of this, it is probably worth checking at what depths deferred
>>>>> clearing starts to make sense. For QD == 1, it definitely does not. But that
>>>>> should be on top of the fix.
>>>>>
>>>>> Can you try this one?
>>>>
>>>> This one should be complete. The former should fix the case, this one
>>>> just adds the last two cases where we ignored ->cleared. For some of
>>>> them we'll never hit it since deferred isn't used for that, but I
>>>> think we're better off with full coverage.
>>>>
>>>>
>>>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>>>> index 2261136ae067..5b3e56d68dab 100644
>>>> --- a/lib/sbitmap.c
>>>> +++ b/lib/sbitmap.c
>>>> @@ -20,6 +20,47 @@
>>>> #include <linux/sbitmap.h>
>>>> #include <linux/seq_file.h>
>>>>
>>>> +/*
>>>> + * See if we have deferred clears that we can batch move
>>>> + */
>>>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>>>> +{
>>>> + unsigned long mask, val;
>>>> + unsigned long __maybe_unused flags;
>>>> + bool ret = false;
>>>> +
>>>> + /* Silence bogus lockdep warning */
>>>> +#if defined(CONFIG_LOCKDEP)
>>>> + local_irq_save(flags);
>>>> +#endif
>>>> + spin_lock(&sb->map[index].swap_lock);
>>>> +
>>>> + if (!sb->map[index].cleared)
>>>> + goto out_unlock;
>>>> +
>>>> + /*
>>>> + * First get a stable cleared mask, setting the old mask to 0.
>>>> + */
>>>> + do {
>>>> + mask = sb->map[index].cleared;
>>>> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>>>> +
>>>> + /*
>>>> + * Now clear the masked bits in our free word
>>>> + */
>>>> + do {
>>>> + val = sb->map[index].word;
>>>> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>>>> +
>>>> + ret = true;
>>>> +out_unlock:
>>>> + spin_unlock(&sb->map[index].swap_lock);
>>>> +#if defined(CONFIG_LOCKDEP)
>>>> + local_irq_restore(flags);
>>>> +#endif
>>>> + return ret;
>>>> +}
>>>> +
>>>> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>>>> gfp_t flags, int node)
>>>> {
>>>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
>>>> unsigned int bits_per_word = 1U << sb->shift;
>>>> unsigned int i;
>>>>
>>>> + for (i = 0; i < sb->map_nr; i++)
>>>> + sbitmap_deferred_clear(sb, i);
>>>> +
>>>> sb->depth = depth;
>>>> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
>>>>
>>>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>>>> return nr;
>>>> }
>>>>
>>>> -/*
>>>> - * See if we have deferred clears that we can batch move
>>>> - */
>>>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>>>> -{
>>>> - unsigned long mask, val;
>>>> - unsigned long __maybe_unused flags;
>>>> - bool ret = false;
>>>> -
>>>> - /* Silence bogus lockdep warning */
>>>> -#if defined(CONFIG_LOCKDEP)
>>>> - local_irq_save(flags);
>>>> -#endif
>>>> - spin_lock(&sb->map[index].swap_lock);
>>>> -
>>>> - if (!sb->map[index].cleared)
>>>> - goto out_unlock;
>>>> -
>>>> - /*
>>>> - * First get a stable cleared mask, setting the old mask to 0.
>>>> - */
>>>> - do {
>>>> - mask = sb->map[index].cleared;
>>>> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
>>>> -
>>>> - /*
>>>> - * Now clear the masked bits in our free word
>>>> - */
>>>> - do {
>>>> - val = sb->map[index].word;
>>>> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
>>>> -
>>>> - ret = true;
>>>> -out_unlock:
>>>> - spin_unlock(&sb->map[index].swap_lock);
>>>> -#if defined(CONFIG_LOCKDEP)
>>>> - local_irq_restore(flags);
>>>> -#endif
>>>> - return ret;
>>>> -}
>>>> -
>>>> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
>>>> unsigned int alloc_hint, bool round_robin)
>>>> {
>>>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>>> index = SB_NR_TO_INDEX(sb, alloc_hint);
>>>>
>>>> for (i = 0; i < sb->map_nr; i++) {
>>>> +again:
>>>> nr = __sbitmap_get_word(&sb->map[index].word,
>>>> min(sb->map[index].depth, shallow_depth),
>>>> SB_NR_TO_BIT(sb, alloc_hint), true);
>>>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
>>>> break;
>>>> }
>>>>
>>>> + if (sbitmap_deferred_clear(sb, index))
>>>> + goto again;
>>>> +
>>>> /* Jump to next index. */
>>>> index++;
>>>> alloc_hint = index << sb->shift;
>>>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
>>>> unsigned int i;
>>>>
>>>> for (i = 0; i < sb->map_nr; i++) {
>>>> - if (sb->map[i].word)
>>>> + if (sb->map[i].word & ~sb->map[i].cleared)
>>>> return true;
>>>> }
>>>> return false;
>>>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
>>>>
>>>> for (i = 0; i < sb->map_nr; i++) {
>>>> const struct sbitmap_word *word = &sb->map[i];
>>>> + unsigned long mask = word->word & ~word->cleared;
>>>> unsigned long ret;
>>>>
>>>> - ret = find_first_zero_bit(&word->word, word->depth);
>>>> + ret = find_first_zero_bit(&mask, word->depth);
>>>> if (ret < word->depth)
>>>> return true;
>>>> }
>>>
>>> This one works too:
>>>
>>> Tested-by: Ming Lei <ming.lei@redhat.com>
>>
>> Please hold on, just found this issue is triggered again after the 4th
>> run,
>>
>> [ 135.588599] ================start test sanity/004================
>> [ 136.864175] scsi host14: scsi_debug: version 0188 [20180128]
>> dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
>> [ 136.873018] scsi 14:0:0:0: Direct-Access Linux scsi_debug 0188 PQ: 0 ANSI: 7
>> [ 136.874282] sd 14:0:0:0: Power-on or device reset occurred
>> [ 136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
>> [ 136.879284] sd 14:0:0:0: [sde] Write Protect is off
>> [ 136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
>> [ 136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
>> [ 136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
>> [ 139.465844] sdd:
>> [ 139.754360] sdd:
>> [ 143.650074] hpet1: lost 765 rtc interrupts
>> [ 145.078137] sdd:
>> [ 145.699220] sdd:
>> [ 146.693532] sdd:
>> [ 147.557814] sdd:
>> [ 148.742290] sdd:
>> [ 149.429425] sdd:
>> [ 149.944095] sdd: sdd1
>> [ 152.631125] sdd:
>> [ 184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
>> [ 184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
>> [ 184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
>> [ 184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
>> [ 184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [ 185.872632] sdd: sdd1 sdd2
>> [ 242.446969] hpet1: lost 673 rtc interrupts
>> [ 252.701145] hpet1: lost 655 rtc interrupts
>> [ 257.814148] hpet1: lost 300 rtc interrupts
>> [ 278.315651] hpet1: lost 1311 rtc interrupts
>> [ 278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [ 336.228781] hpet1: lost 948 rtc interrupts
>> [ 342.623253] hpet1: lost 409 rtc interrupts
>> [ 344.205027] hpet1: lost 14 rtc interrupts
>> [ 344.457622] hpet1: lost 15 rtc interrupts
>> [ 344.584608] hpet1: lost 7 rtc interrupts
>> [ 344.706382] hpet1: lost 7 rtc interrupts
>> [ 345.012803] hpet1: lost 19 rtc interrupts
>> [ 345.109177] hpet1: lost 5 rtc interrupts
>> [ 345.353934] hpet1: lost 14 rtc interrupts
>> [ 345.477123] hpet1: lost 7 rtc interrupts
>> [ 345.530757] hpet1: lost 3 rtc interrupts
>> [ 348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
>> [ 348.328567] hpet1: lost 162 rtc interrupts
>> [ 348.506049] hpet1: lost 10 rtc interrupts
>> [ 350.461310] hpet1: lost 124 rtc interrupts
>> [ 351.307411] hpet1: lost 50 rtc interrupts
>> [ 351.401932] hpet1: lost 6 rtc interrupts
>> [ 351.564204] hpet1: lost 9 rtc interrupts
>> [ 351.648124] hpet1: lost 4 rtc interrupts
>> [ 351.828972] hpet1: lost 11 rtc interrupts
>> [ 353.043051] hpet1: lost 76 rtc interrupts
>> [ 353.705926] hpet1: lost 38 rtc interrupts
>> [ 353.867703] hpet1: lost 9 rtc interrupts
>> [ 354.030183] hpet1: lost 10 rtc interrupts
>> [ 354.071030] hpet1: lost 1 rtc interrupts
>> [ 354.194600] hpet1: lost 7 rtc interrupts
>> [ 354.316562] hpet1: lost 7 rtc interrupts
>> [ 355.167894] hpet1: lost 1376 rtc interrupts
>> [ 364.089267] hpet1: lost 153 rtc interrupts
>> [ 364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
>> [ 366.869943] hpet1: lost 177 rtc interrupts
>> [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
>> [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
>> [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 369.940768] parted D 0 3645 3239 0x00000000
>> [ 369.941500] Call Trace:
>> [ 369.941874] ? __schedule+0x6d9/0x74c
>> [ 369.942392] ? wbt_done+0x5e/0x5e
>> [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
>> [ 369.943404] ? wbt_done+0x5e/0x5e
>> [ 369.943874] schedule+0x67/0x78
>> [ 369.944298] io_schedule+0x12/0x33
>> [ 369.944771] rq_qos_wait+0xb5/0x119
>> [ 369.945193] ? karma_partition+0x1c2/0x1c2
>> [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
>> [ 369.946151] wbt_wait+0x85/0xb6
>> [ 369.946540] __rq_qos_throttle+0x23/0x2f
>> [ 369.947014] blk_mq_make_request+0xe6/0x40a
>> [ 369.947518] generic_make_request+0x192/0x2fe
>> [ 369.948042] ? submit_bio+0x103/0x11f
>> [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
>> [ 369.949011] submit_bio+0x103/0x11f
>> [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
>> [ 369.949962] submit_bio_wait+0x53/0x7f
>> [ 369.950469] blkdev_issue_flush+0x8a/0xae
>> [ 369.951032] blkdev_fsync+0x2f/0x3a
>> [ 369.951502] do_fsync+0x2e/0x47
>> [ 369.951887] __x64_sys_fsync+0x10/0x13
>> [ 369.952374] do_syscall_64+0x89/0x149
>> [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 369.953492] RIP: 0033:0x7f95a1e729d4
>> [ 369.953996] Code: Bad RIP value.
>> [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
>> [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
>> [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
>> [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
>> [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
>> [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
>> [ 375.962995] ================end test sanity/004: (HANG, -1)================
>
> This hang may be a different issue: I just observed that the wbt queue's (RWQ_BG)
> inflight counter becomes -1 after the hang is triggered.
I think it is - can you try and revert:
commit b6c7b58f5fcc2386bddf9852011c42c1d2b83979
Author: Josef Bacik <josef@toxicpanda.com>
Date: Tue Dec 4 12:59:03 2018 -0500
block: convert wbt_wait() to use rq_qos_wait()
and see if that helps?
--
Jens Axboe
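The `again:` retry added to sbitmap_get_shallow() in the patch quoted above is the crux of the fix Ming was testing: a word can look full even though some of its bits are only pending in ->cleared, so a failed scan must reclaim deferred clears and rescan the same word before moving on. A minimal single-word sketch of that control flow (plain non-atomic C for illustration only; `get_bit` and its arguments are illustrative stand-ins, not kernel API):

```c
#include <assert.h>

/* Retry-after-reclaim, as in the sbitmap_get_shallow() hunk: only
 * give up on a word once it has no free bit AND no deferred clears
 * left to reclaim. Returns the allocated bit, or -1 if truly full. */
static int get_bit(unsigned long *word, unsigned long *cleared,
		   unsigned int depth)
{
	for (;;) {
		for (unsigned int bit = 0; bit < depth; bit++) {
			if (!(*word & (1UL << bit))) {
				*word |= 1UL << bit;	/* allocate */
				return (int)bit;
			}
		}
		if (!*cleared)
			return -1;	/* no free bits, nothing to reclaim */
		*word &= ~*cleared;	/* batch-reclaim freed bits... */
		*cleared = 0;		/* ...then goto again */
	}
}
```

Without the reclaim step, a QD=1 word whose only bit sits in ->cleared is permanently "full", which matches the observed .cleared == 2, .busy == 0 hang state on the sched_tags.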
* Re: for-next hangs on test srp/012
2018-12-12 2:25 ` Jens Axboe
@ 2018-12-12 4:28 ` Ming Lei
0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-12 4:28 UTC (permalink / raw)
To: Jens Axboe
Cc: Ming Lei, Mike Snitzer, Bart Van Assche, linux-block, Josef Bacik
On Tue, Dec 11, 2018 at 07:25:21PM -0700, Jens Axboe wrote:
> On 12/11/18 7:03 PM, Ming Lei wrote:
> > On Wed, Dec 12, 2018 at 09:49:09AM +0800, Ming Lei wrote:
> >> On Wed, Dec 12, 2018 at 09:43:32AM +0800, Ming Lei wrote:
> >>> On Tue, Dec 11, 2018 at 06:36:21PM -0700, Jens Axboe wrote:
> >>>> On 12/11/18 6:23 PM, Jens Axboe wrote:
> >>>>> On 12/11/18 6:05 PM, Jens Axboe wrote:
> >>>>>> On 12/11/18 5:38 PM, Ming Lei wrote:
> >>>>>>> On Wed, Dec 12, 2018 at 8:28 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Dec 11 2018 at 7:19pm -0500,
> >>>>>>>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Dec 12, 2018 at 8:04 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 12/11/18 3:58 PM, Bart Van Assche wrote:
> >>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>
> >>>>>>>>>>> If I run the following subset of blktests:
> >>>>>>>>>>>
> >>>>>>>>>>> while :; do ./check -q srp && ./check -q nvmeof-mp; done
> >>>>>>>>>>>
> >>>>>>>>>>> against today's for-next branch (commit dd2bf2df85a7) then after some
> >>>>>>>>>>> time the following hang is reported:
> >>>>>>>>>>>
> >>>>>>>>>>> INFO: task fio:14869 blocked for more than 120 seconds.
> >>>>>>>>>>> Not tainted 4.20.0-rc6-dbg+ #1
> >>>>>>>>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>>>>>>> fio D25272 14869 14195 0x00000000
> >>>>>>>>>>> Call Trace:
> >>>>>>>>>>> __schedule+0x401/0xe50
> >>>>>>>>>>> schedule+0x4e/0xd0
> >>>>>>>>>>> io_schedule+0x21/0x50
> >>>>>>>>>>> blk_mq_get_tag+0x46d/0x640
> >>>>>>>>>>> blk_mq_get_request+0x7c0/0xa00
> >>>>>>>>>>> blk_mq_make_request+0x241/0xa70
> >>>>>>>>>>> generic_make_request+0x411/0x950
> >>>>>>>>>>> submit_bio+0x9b/0x250
> >>>>>>>>>>> blkdev_direct_IO+0x7fb/0x870
> >>>>>>>>>>> generic_file_direct_write+0x119/0x210
> >>>>>>>>>>> __generic_file_write_iter+0x11c/0x280
> >>>>>>>>>>> blkdev_write_iter+0x13c/0x220
> >>>>>>>>>>> aio_write+0x204/0x310
> >>>>>>>>>>> io_submit_one+0x9c6/0xe70
> >>>>>>>>>>> __x64_sys_io_submit+0x115/0x340
> >>>>>>>>>>> do_syscall_64+0x71/0x210
> >>>>>>>>>>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>>>>>>>>
> >>>>>>>>>>> When that hang occurs my list-pending-block-requests script does not show
> >>>>>>>>>>> any pending requests:
> >>>>>>>>>>>
> >>>>>>>>>>> # list-pending-block-requests
> >>>>>>>>>>> dm-0
> >>>>>>>>>>> loop0
> >>>>>>>>>>> loop1
> >>>>>>>>>>> loop2
> >>>>>>>>>>> loop3
> >>>>>>>>>>> loop4
> >>>>>>>>>>> loop5
> >>>>>>>>>>> loop6
> >>>>>>>>>>> loop7
> >>>>>>>>>>> nullb0
> >>>>>>>>>>> nullb1
> >>>>>>>>>>> sda
> >>>>>>>>>>> sdb
> >>>>>>>>>>> sdc
> >>>>>>>>>>> sdd
> >>>>>>>>>>> vda
> >>>>>>>>>>> vdb
> >>>>>>>>>>>
> >>>>>>>>>>> Enabling fail_if_no_path mode did not resolve the hang so I don't think
> >>>>>>>>>>> that the root cause is in any of the dm drivers used in this test:
> >>>>>>>>>>>
> >>>>>>>>>>> # dmsetup ls | while read dm rest; do dmsetup message $dm 0 fail_if_no_path; done; dmsetup remove_all; dmsetup table
> >>>>>>>>>>> 360014056e756c6c62300000000000000: 0 65536 multipath 0 1 alua 1 1 service-time 0 1 2 8:16 1 1
> >>>>>>>>>>>
> >>>>>>>>>>> The same test passes against kernel v4.20-rc6.
> >>>>>>>>>>
> >>>>>>>>>> What device is this being run on?
> >>>>>>>>>
> >>>>>>>>> I saw this issue on usb storage too.
> >>>>>>>>>
> >>>>>>>>> Seems it is introduced by commit ea86ea2cdced ("sbitmap: ammortize cost of
> >>>>>>>>> clearing bits"). When the IO hang happens, .cleared is 2, and .busy is 0 on
> >>>>>>>>> the sched_tag's sbitmap queue.
> >>>>>>>>
> >>>>>>>> You saw this running the same tests as Bart?
> >>>>>>>
> >>>>>>> Not the srp test as done by Bart, I just run 'parted' test on usb storage disk,
> >>>>>>> see the attached test script.
> >>>>>>>
> >>>>>>> Mostly it can be triggered in one run, sometimes it needs more.
> >>>>>>
> >>>>>> I'll take a look. The ->cleared doesn't make sense for QD=1, or on
> >>>>>> one word in general. But I'd like to try and understand why it hangs.
> >>>>>>
> >>>>>> Are you using a scheduler?
> >>>>>
> >>>>> OK, I think I see what it is: the shallow get path is missing the deferred
> >>>>> clear. On top of this, it is probably worth checking at what depths deferred
> >>>>> clearing starts to make sense. For QD == 1, it definitely does not. But that
> >>>>> should be on top of the fix.
> >>>>>
> >>>>> Can you try this one?
> >>>>
> >>>> This one should be complete. The former should fix the case, this one
> >>>> just adds the last two cases where we ignored ->cleared. For some of
> >>>> them we'll never hit it since deferred isn't used for that, but I
> >>>> think we're better off with full coverage.
> >>>>
> >>>>
> >>>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> >>>> index 2261136ae067..5b3e56d68dab 100644
> >>>> --- a/lib/sbitmap.c
> >>>> +++ b/lib/sbitmap.c
> >>>> @@ -20,6 +20,47 @@
> >>>> #include <linux/sbitmap.h>
> >>>> #include <linux/seq_file.h>
> >>>>
> >>>> +/*
> >>>> + * See if we have deferred clears that we can batch move
> >>>> + */
> >>>> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> >>>> +{
> >>>> + unsigned long mask, val;
> >>>> + unsigned long __maybe_unused flags;
> >>>> + bool ret = false;
> >>>> +
> >>>> + /* Silence bogus lockdep warning */
> >>>> +#if defined(CONFIG_LOCKDEP)
> >>>> + local_irq_save(flags);
> >>>> +#endif
> >>>> + spin_lock(&sb->map[index].swap_lock);
> >>>> +
> >>>> + if (!sb->map[index].cleared)
> >>>> + goto out_unlock;
> >>>> +
> >>>> + /*
> >>>> + * First get a stable cleared mask, setting the old mask to 0.
> >>>> + */
> >>>> + do {
> >>>> + mask = sb->map[index].cleared;
> >>>> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> >>>> +
> >>>> + /*
> >>>> + * Now clear the masked bits in our free word
> >>>> + */
> >>>> + do {
> >>>> + val = sb->map[index].word;
> >>>> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> >>>> +
> >>>> + ret = true;
> >>>> +out_unlock:
> >>>> + spin_unlock(&sb->map[index].swap_lock);
> >>>> +#if defined(CONFIG_LOCKDEP)
> >>>> + local_irq_restore(flags);
> >>>> +#endif
> >>>> + return ret;
> >>>> +}
> >>>> +
> >>>> int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
> >>>> gfp_t flags, int node)
> >>>> {
> >>>> @@ -70,6 +111,9 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int depth)
> >>>> unsigned int bits_per_word = 1U << sb->shift;
> >>>> unsigned int i;
> >>>>
> >>>> + for (i = 0; i < sb->map_nr; i++)
> >>>> + sbitmap_deferred_clear(sb, i);
> >>>> +
> >>>> sb->depth = depth;
> >>>> sb->map_nr = DIV_ROUND_UP(sb->depth, bits_per_word);
> >>>>
> >>>> @@ -112,47 +156,6 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
> >>>> return nr;
> >>>> }
> >>>>
> >>>> -/*
> >>>> - * See if we have deferred clears that we can batch move
> >>>> - */
> >>>> -static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> >>>> -{
> >>>> - unsigned long mask, val;
> >>>> - unsigned long __maybe_unused flags;
> >>>> - bool ret = false;
> >>>> -
> >>>> - /* Silence bogus lockdep warning */
> >>>> -#if defined(CONFIG_LOCKDEP)
> >>>> - local_irq_save(flags);
> >>>> -#endif
> >>>> - spin_lock(&sb->map[index].swap_lock);
> >>>> -
> >>>> - if (!sb->map[index].cleared)
> >>>> - goto out_unlock;
> >>>> -
> >>>> - /*
> >>>> - * First get a stable cleared mask, setting the old mask to 0.
> >>>> - */
> >>>> - do {
> >>>> - mask = sb->map[index].cleared;
> >>>> - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> >>>> -
> >>>> - /*
> >>>> - * Now clear the masked bits in our free word
> >>>> - */
> >>>> - do {
> >>>> - val = sb->map[index].word;
> >>>> - } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> >>>> -
> >>>> - ret = true;
> >>>> -out_unlock:
> >>>> - spin_unlock(&sb->map[index].swap_lock);
> >>>> -#if defined(CONFIG_LOCKDEP)
> >>>> - local_irq_restore(flags);
> >>>> -#endif
> >>>> - return ret;
> >>>> -}
> >>>> -
> >>>> static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> >>>> unsigned int alloc_hint, bool round_robin)
> >>>> {
> >>>> @@ -215,6 +218,7 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >>>> index = SB_NR_TO_INDEX(sb, alloc_hint);
> >>>>
> >>>> for (i = 0; i < sb->map_nr; i++) {
> >>>> +again:
> >>>> nr = __sbitmap_get_word(&sb->map[index].word,
> >>>> min(sb->map[index].depth, shallow_depth),
> >>>> SB_NR_TO_BIT(sb, alloc_hint), true);
> >>>> @@ -223,6 +227,9 @@ int sbitmap_get_shallow(struct sbitmap *sb, unsigned int alloc_hint,
> >>>> break;
> >>>> }
> >>>>
> >>>> + if (sbitmap_deferred_clear(sb, index))
> >>>> + goto again;
> >>>> +
> >>>> /* Jump to next index. */
> >>>> index++;
> >>>> alloc_hint = index << sb->shift;
> >>>> @@ -242,7 +249,7 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb)
> >>>> unsigned int i;
> >>>>
> >>>> for (i = 0; i < sb->map_nr; i++) {
> >>>> - if (sb->map[i].word)
> >>>> + if (sb->map[i].word & ~sb->map[i].cleared)
> >>>> return true;
> >>>> }
> >>>> return false;
> >>>> @@ -255,9 +262,10 @@ bool sbitmap_any_bit_clear(const struct sbitmap *sb)
> >>>>
> >>>> for (i = 0; i < sb->map_nr; i++) {
> >>>> const struct sbitmap_word *word = &sb->map[i];
> >>>> + unsigned long mask = word->word & ~word->cleared;
> >>>> unsigned long ret;
> >>>>
> >>>> - ret = find_first_zero_bit(&word->word, word->depth);
> >>>> + ret = find_first_zero_bit(&mask, word->depth);
> >>>> if (ret < word->depth)
> >>>> return true;
> >>>> }
> >>>
> >>> This one works too:
> >>>
> >>> Tested-by: Ming Lei <ming.lei@redhat.com>
> >>
> >> Please hold on, just found this issue is triggered again after the 4th
> >> run,
> >>
> >> [ 135.588599] ================start test sanity/004================
> >> [ 136.864175] scsi host14: scsi_debug: version 0188 [20180128]
> >> dev_size_mb=1024, opts=0x0, submit_queues=1, statistics=0
> >> [ 136.873018] scsi 14:0:0:0: Direct-Access Linux scsi_debug 0188 PQ: 0 ANSI: 7
> >> [ 136.874282] sd 14:0:0:0: Power-on or device reset occurred
> >> [ 136.877196] sd 14:0:0:0: [sde] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
> >> [ 136.879284] sd 14:0:0:0: [sde] Write Protect is off
> >> [ 136.879892] sd 14:0:0:0: [sde] Mode Sense: 73 00 10 08
> >> [ 136.882643] sd 14:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> >> [ 136.908305] sd 14:0:0:0: [sde] Attached SCSI disk
> >> [ 139.465844] sdd:
> >> [ 139.754360] sdd:
> >> [ 143.650074] hpet1: lost 765 rtc interrupts
> >> [ 145.078137] sdd:
> >> [ 145.699220] sdd:
> >> [ 146.693532] sdd:
> >> [ 147.557814] sdd:
> >> [ 148.742290] sdd:
> >> [ 149.429425] sdd:
> >> [ 149.944095] sdd: sdd1
> >> [ 152.631125] sdd:
> >> [ 184.209792] usb 1-1: reset high-speed USB device number 2 using ehci-pci
> >> [ 184.342465] sd 13:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
> >> [ 184.344021] sd 13:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 05 ff fc 20 00 00 e0 00
> >> [ 184.345109] print_req_error: I/O error, dev sdd, sector 100662304 flags 84000
> >> [ 184.347241] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [ 185.872632] sdd: sdd1 sdd2
> >> [ 242.446969] hpet1: lost 673 rtc interrupts
> >> [ 252.701145] hpet1: lost 655 rtc interrupts
> >> [ 257.814148] hpet1: lost 300 rtc interrupts
> >> [ 278.315651] hpet1: lost 1311 rtc interrupts
> >> [ 278.317589] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [ 336.228781] hpet1: lost 948 rtc interrupts
> >> [ 342.623253] hpet1: lost 409 rtc interrupts
> >> [ 344.205027] hpet1: lost 14 rtc interrupts
> >> [ 344.457622] hpet1: lost 15 rtc interrupts
> >> [ 344.584608] hpet1: lost 7 rtc interrupts
> >> [ 344.706382] hpet1: lost 7 rtc interrupts
> >> [ 345.012803] hpet1: lost 19 rtc interrupts
> >> [ 345.109177] hpet1: lost 5 rtc interrupts
> >> [ 345.353934] hpet1: lost 14 rtc interrupts
> >> [ 345.477123] hpet1: lost 7 rtc interrupts
> >> [ 345.530757] hpet1: lost 3 rtc interrupts
> >> [ 348.328566] hpet_rtc_timer_reinit: 2 callbacks suppressed
> >> [ 348.328567] hpet1: lost 162 rtc interrupts
> >> [ 348.506049] hpet1: lost 10 rtc interrupts
> >> [ 350.461310] hpet1: lost 124 rtc interrupts
> >> [ 351.307411] hpet1: lost 50 rtc interrupts
> >> [ 351.401932] hpet1: lost 6 rtc interrupts
> >> [ 351.564204] hpet1: lost 9 rtc interrupts
> >> [ 351.648124] hpet1: lost 4 rtc interrupts
> >> [ 351.828972] hpet1: lost 11 rtc interrupts
> >> [ 353.043051] hpet1: lost 76 rtc interrupts
> >> [ 353.705926] hpet1: lost 38 rtc interrupts
> >> [ 353.867703] hpet1: lost 9 rtc interrupts
> >> [ 354.030183] hpet1: lost 10 rtc interrupts
> >> [ 354.071030] hpet1: lost 1 rtc interrupts
> >> [ 354.194600] hpet1: lost 7 rtc interrupts
> >> [ 354.316562] hpet1: lost 7 rtc interrupts
> >> [ 355.167894] hpet1: lost 1376 rtc interrupts
> >> [ 364.089267] hpet1: lost 153 rtc interrupts
> >> [ 364.342905] systemd-journald[366]: Sent WATCHDOG=1 notification.
> >> [ 366.869943] hpet1: lost 177 rtc interrupts
> >> [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
> >> [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
> >> [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [ 369.940768] parted D 0 3645 3239 0x00000000
> >> [ 369.941500] Call Trace:
> >> [ 369.941874] ? __schedule+0x6d9/0x74c
> >> [ 369.942392] ? wbt_done+0x5e/0x5e
> >> [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
> >> [ 369.943404] ? wbt_done+0x5e/0x5e
> >> [ 369.943874] schedule+0x67/0x78
> >> [ 369.944298] io_schedule+0x12/0x33
> >> [ 369.944771] rq_qos_wait+0xb5/0x119
> >> [ 369.945193] ? karma_partition+0x1c2/0x1c2
> >> [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
> >> [ 369.946151] wbt_wait+0x85/0xb6
> >> [ 369.946540] __rq_qos_throttle+0x23/0x2f
> >> [ 369.947014] blk_mq_make_request+0xe6/0x40a
> >> [ 369.947518] generic_make_request+0x192/0x2fe
> >> [ 369.948042] ? submit_bio+0x103/0x11f
> >> [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
> >> [ 369.949011] submit_bio+0x103/0x11f
> >> [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
> >> [ 369.949962] submit_bio_wait+0x53/0x7f
> >> [ 369.950469] blkdev_issue_flush+0x8a/0xae
> >> [ 369.951032] blkdev_fsync+0x2f/0x3a
> >> [ 369.951502] do_fsync+0x2e/0x47
> >> [ 369.951887] __x64_sys_fsync+0x10/0x13
> >> [ 369.952374] do_syscall_64+0x89/0x149
> >> [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> [ 369.953492] RIP: 0033:0x7f95a1e729d4
> >> [ 369.953996] Code: Bad RIP value.
> >> [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
> >> [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
> >> [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
> >> [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
> >> [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
> >> [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
> >> [ 375.962995] ================end test sanity/004: (HANG, -1)================
> >
> > This one may be a different issue, as I just observed that the wbt
> > queue (RWQ_BG) inflight count becomes -1 after the hang is triggered.
>
> I think it is - can you try and revert:
>
> commit b6c7b58f5fcc2386bddf9852011c42c1d2b83979
> Author: Josef Bacik <josef@toxicpanda.com>
> Date: Tue Dec 4 12:59:03 2018 -0500
>
> block: convert wbt_wait() to use rq_qos_wait()
>
> and see if that helps?
This one is a Linus tree issue, just a bit harder to reproduce; it was
not captured by my test robot, but it has been triggered several times
when I run the parted test manually.
There is a similar report from RH internal testing too.
Thanks,
Ming
Thread overview: 20+ messages
2018-12-11 22:58 for-next hangs on test srp/012 Bart Van Assche
2018-12-11 23:05 ` Mike Snitzer
2018-12-11 23:09 ` Bart Van Assche
2018-12-12 0:22 ` Mike Snitzer
2018-12-12 0:02 ` Jens Axboe
2018-12-12 0:18 ` Bart Van Assche
2018-12-12 0:19 ` Ming Lei
2018-12-12 0:27 ` Mike Snitzer
2018-12-12 0:38 ` Ming Lei
2018-12-12 1:05 ` Jens Axboe
2018-12-12 1:23 ` Jens Axboe
2018-12-12 1:36 ` Jens Axboe
2018-12-12 1:43 ` Ming Lei
2018-12-12 1:44 ` Jens Axboe
2018-12-12 1:49 ` Ming Lei
2018-12-12 2:03 ` Ming Lei
2018-12-12 2:25 ` Jens Axboe
2018-12-12 4:28 ` Ming Lei
2018-12-12 1:37 ` Ming Lei
2018-12-12 1:39 ` Jens Axboe