From: Sagi Grimberg <sagi@grimberg.me>
To: Christoph Hellwig <hch@lst.de>, Keith Busch <kbusch@kernel.org>,
linux-nvme <linux-nvme@lists.infradead.org>,
Anton Eidelman <anton@lightbitslabs.com>,
Hannes Reinecke <hare@suse.de>
Subject: nvme deadlock with ANA
Date: Wed, 25 Mar 2020 23:23:50 -0700
Message-ID: <5e38c566-3f26-288d-1004-161d1084b468@grimberg.me>

Hey,

I want to consult with you guys on a deadlock condition I'm able to
hit with a test that incorporates controller reconnect, ANA updates
and live I/O with timeouts.

I hit this with NVMe/TCP, but it can also happen in the rdma or pci
drivers.
The deadlock combines 4 flows in parallel:
- ns scanning (triggered from reconnect)
- request timeout
- ANA update (triggered from reconnect)
- FS I/O coming into the mpath device
(1) ns scanning triggers disk revalidation -> update disk info ->
freeze queue -> but it is blocked, why?
(2) the timed-out request holds a reference on the q_usage_counter
(so the freeze in (1) cannot complete), but the timeout handler
itself blocks, why?
(3) the timeout handler (indirectly) calls nvme_stop_queues() -> which
takes the namespaces_rwsem -> but blocks, why?
(4) the ANA update takes the namespaces_rwsem -> calls
nvme_mpath_set_live() -> which synchronizes the ns_head srcu (see
commit 504db087aacc) -> but it blocks, why?
(5) FS I/O comes into nvme_ns_head_make_request -> takes srcu_read_lock
-> direct_make_request -> blk_queue_enter -> but it is blocked, why?
==> because of (1), the request queue is frozen -> deadlock.
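To make the cycle easier to see, here is a rough sketch of where each
flow blocks (a simplified paraphrase of the call chains, not verbatim
kernel code -- the real traces are below):

--
/* (1) ns scanning: nvme_scan_work -> nvme_update_disk_info */
blk_mq_freeze_queue(ns->queue);      /* waits for q_usage_counter to
                                      * drain, but the timed-out request
                                      * in (2) still holds a reference */

/* (2)+(3) request timeout -> error recovery -> nvme_stop_queues */
down_read(&ctrl->namespaces_rwsem);  /* blocked on namespaces_rwsem,
                                      * held by the ANA update in (4),
                                      * so the timed-out request never
                                      * completes and (1) never unblocks */

/* (4) ANA update: nvme_ana_work -> nvme_mpath_set_live */
synchronize_srcu(&head->srcu);       /* waits for srcu readers, but the
                                      * FS I/O in (5) sits inside the
                                      * read-side critical section */

/* (5) FS I/O: nvme_ns_head_make_request -> direct_make_request */
idx = srcu_read_lock(&head->srcu);
blk_queue_enter(ns->queue, 0);       /* queue is frozen by (1), so the
                                      * srcu reader never exits ->
                                      * cycle closed, deadlock */
--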
Now as I said, this reproduces with nvme-tcp; rdma does pretty much the
same thing, and if we look at pci, it also calls nvme_dev_disable, which
also calls nvme_stop_queues, and will also block on nvme_mpath_set_live
(not specific to ANA).
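The pci side looks roughly like this (again just a sketch from memory,
not verbatim):

--
nvme_timeout()                /* pci request timeout handler */
  nvme_dev_disable()
    nvme_stop_queues()        /* down_read(&ctrl->namespaces_rwsem) */
      /* -> same block behind nvme_mpath_set_live as (3)/(4) above */
--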
I'm trying to think about what the proper solution for this is, so I
thought I'd send it out for some brainstorming...
Any thoughts on this?
See some traces for visualization:
=================================
- ns_scanning
--
Call Trace:
__schedule+0x2b9/0x6c0
schedule+0x42/0xb0
blk_mq_freeze_queue_wait+0x4b/0xb0
? wait_woken+0x80/0x80
blk_mq_freeze_queue+0x1b/0x20
nvme_update_disk_info+0x62/0x3a0 [nvme_core]
__nvme_revalidate_disk+0xb8/0x110 [nvme_core]
nvme_revalidate_disk+0xa2/0x110 [nvme_core]
revalidate_disk+0x2b/0xa0
nvme_validate_ns+0x49/0x900 [nvme_core]
? blk_mq_free_request+0xd2/0x100
? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
? __switch_to_asm+0x40/0x70
? _cond_resched+0x19/0x30
? down_read+0x13/0xa0
nvme_scan_work+0x24f/0x380 [nvme_core]
process_one_work+0x1db/0x380
worker_thread+0x4d/0x400
--
- request timeout
--
Call Trace:
__schedule+0x2b9/0x6c0
schedule+0x42/0xb0
schedule_timeout+0x203/0x2f0
? ttwu_do_activate+0x5b/0x70
wait_for_completion+0xb1/0x120
? wake_up_q+0x70/0x70
__flush_work+0x123/0x1d0
? worker_detach_from_pool+0xb0/0xb0
flush_work+0x10/0x20
nvme_tcp_timeout+0x69/0xb0 [nvme_tcp]
blk_mq_check_expired+0x13d/0x160
bt_iter+0x52/0x60
blk_mq_queue_tag_busy_iter+0x1a4/0x2f0
? blk_poll+0x350/0x350
? blk_poll+0x350/0x350
? syscall_return_via_sysret+0xf/0x7f
blk_mq_timeout_work+0x59/0x110
process_one_work+0x1db/0x380
worker_thread+0x4d/0x400
kthread+0x104/0x140
? process_one_work+0x380/0x380
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
...
Call Trace:
__schedule+0x2b9/0x6c0
schedule+0x42/0xb0
rwsem_down_read_slowpath+0x16c/0x4a0
down_read+0x85/0xa0
nvme_stop_queues+0x21/0x50 [nvme_core]
nvme_tcp_teardown_io_queues.part.21+0x19/0x80 [nvme_tcp]
nvme_tcp_error_recovery_work+0x33/0x80 [nvme_tcp]
process_one_work+0x1db/0x380
worker_thread+0x4d/0x400
kthread+0x104/0x140
? process_one_work+0x380/0x380
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
--
- ANA update
--
Call Trace:
__schedule+0x2b9/0x6c0
schedule+0x42/0xb0
schedule_timeout+0x203/0x2f0
? __queue_work+0x117/0x3f0
wait_for_completion+0xb1/0x120
? wake_up_q+0x70/0x70
__synchronize_srcu.part.0+0x81/0xb0
? __bpf_trace_rcu_utilization+0x10/0x10
? ktime_get_mono_fast_ns+0x4e/0xa0
synchronize_srcu_expedited+0x28/0x30
synchronize_srcu+0x57/0xe0
nvme_mpath_set_live+0x4f/0x140 [nvme_core]
nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
nvme_update_ana_state+0xca/0xe0 [nvme_core]
nvme_parse_ana_log+0xa1/0x180 [nvme_core]
? nvme_ana_work+0x20/0x20 [nvme_core]
nvme_read_ana_log+0x76/0x100 [nvme_core]
nvme_ana_work+0x15/0x20 [nvme_core]
process_one_work+0x1db/0x380
--
- FS I/O
--
Call Trace:
__schedule+0x2b9/0x6c0
schedule+0x42/0xb0
blk_queue_enter+0x169/0x210
? wait_woken+0x80/0x80
direct_make_request+0x49/0xd0
nvme_ns_head_make_request+0xbc/0x3e0 [nvme_core]
? get_user_pages_fast+0xa5/0x190
generic_make_request+0xcf/0x320
submit_bio+0x42/0x170
? set_page_dirty_lock+0x40/0x60
iomap_dio_submit_bio.isra.0+0x55/0x60
iomap_dio_bio_actor+0x1c9/0x3d0
iomap_dio_actor+0x58/0x1c0
iomap_apply+0xd5/0x140
? iomap_dio_bio_actor+0x3d0/0x3d0
iomap_dio_rw+0x2c2/0x440
? iomap_dio_bio_actor+0x3d0/0x3d0
? __x64_sys_io_cancel+0x150/0x150
xfs_file_dio_aio_read+0x66/0xf0 [xfs]
? xfs_file_dio_aio_read+0x66/0xf0 [xfs]
xfs_file_read_iter+0xbf/0xe0 [xfs]
aio_read+0xc8/0x160
--