nvme deadlock with ANA

* nvme deadlock with ANA
@ 2020-03-26  6:23 Sagi Grimberg
  2020-03-26  6:29 ` Sagi Grimberg
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Sagi Grimberg @ 2020-03-26  6:23 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, linux-nvme, Anton Eidelman,
	Hannes Reinecke

Hey,

I want to consult with you guys on a deadlock condition I'm able to
hit with a test that incorporate controller reconnect, ana updates
and live I/O with timeouts.

This is true for NVMe/TCP, but can also happen in rdma or pci drivers as
well.

The deadlock combines 4 flows in parallel:
- ns scanning (triggered from reconnect)
- request timeout
- ANA update (triggered from reconnect)
- FS I/O coming into the mpath device

(1) ns scanning triggers disk revalidation -> update disk info ->
     freeze queue -> but blocked, why?

(2) timeout handler reference the g_usage_counter - > but blocks in
     the timeout handler, why?

(3) the timeout handler (indirectly) calls nvme_stop_queue() -> which
     takes the namespaces_rwsem - > but blocks, why?

(4) ANA update takes the namespaces_rwsem -> calls nvme_mpath_set_live()
     -> which synchronize the ns_head srcu (see commit 504db087aacc) ->
     but it blocks, why?

(5) FS I/O came into nvme_mpath_make_request -> took srcu_read_lock ->
     direct_make_request > blk_queue_enter -> but blocked, why?

==> because of (1), the request queue is under freeze -> deadlock.

Now as I said, this reproduced with nvme-tcp, rdma does pretty much the
same thing, and if we look at pci, it also calls nvme_dev_disable which
also calls nvme_stop_queues, and will also block in nvme_set_live (not
specific to ANA).

I'm trying to think about what is the proper solution for that, so I
thought I'd send it for some brainstorming...

Any thoughts on this?

See some traces for visualization:
=================================
- ns_scanning
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  blk_mq_freeze_queue_wait+0x4b/0xb0
  ? wait_woken+0x80/0x80
  blk_mq_freeze_queue+0x1b/0x20
  nvme_update_disk_info+0x62/0x3a0 [nvme_core]
  __nvme_revalidate_disk+0xb8/0x110 [nvme_core]
  nvme_revalidate_disk+0xa2/0x110 [nvme_core]
  revalidate_disk+0x2b/0xa0
  nvme_validate_ns+0x49/0x900 [nvme_core]
  ? blk_mq_free_request+0xd2/0x100
  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
  ? __switch_to_asm+0x40/0x70
  ? _cond_resched+0x19/0x30
  ? down_read+0x13/0xa0
  nvme_scan_work+0x24f/0x380 [nvme_core]
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
--

- request timeout
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  schedule_timeout+0x203/0x2f0
  ? ttwu_do_activate+0x5b/0x70
  wait_for_completion+0xb1/0x120
  ? wake_up_q+0x70/0x70
  __flush_work+0x123/0x1d0
  ? worker_detach_from_pool+0xb0/0xb0
  flush_work+0x10/0x20
  nvme_tcp_timeout+0x69/0xb0 [nvme_tcp]
  blk_mq_check_expired+0x13d/0x160
  bt_iter+0x52/0x60
  blk_mq_queue_tag_busy_iter+0x1a4/0x2f0
  ? blk_poll+0x350/0x350
  ? blk_poll+0x350/0x350
  ? syscall_return_via_sysret+0xf/0x7f
  blk_mq_timeout_work+0x59/0x110
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
  kthread+0x104/0x140
  ? process_one_work+0x380/0x380
  ? kthread_park+0x80/0x80
  ret_from_fork+0x35/0x40

...

Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  rwsem_down_read_slowpath+0x16c/0x4a0
  down_read+0x85/0xa0
  nvme_stop_queues+0x21/0x50 [nvme_core]
  nvme_tcp_teardown_io_queues.part.21+0x19/0x80 [nvme_tcp]
  nvme_tcp_error_recovery_work+0x33/0x80 [nvme_tcp]
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
  kthread+0x104/0x140
  ? process_one_work+0x380/0x380
  ? kthread_park+0x80/0x80
  ret_from_fork+0x35/0x40
--

- ANA update
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  schedule_timeout+0x203/0x2f0
  ? __queue_work+0x117/0x3f0
  wait_for_completion+0xb1/0x120
  ? wake_up_q+0x70/0x70
  __synchronize_srcu.part.0+0x81/0xb0
  ? __bpf_trace_rcu_utilization+0x10/0x10
  ? ktime_get_mono_fast_ns+0x4e/0xa0
  synchronize_srcu_expedited+0x28/0x30
  synchronize_srcu+0x57/0xe0
  nvme_mpath_set_live+0x4f/0x140 [nvme_core]
  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
  nvme_update_ana_state+0xca/0xe0 [nvme_core]
  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
  ? nvme_ana_work+0x20/0x20 [nvme_core]
  nvme_read_ana_log+0x76/0x100 [nvme_core]
  nvme_ana_work+0x15/0x20 [nvme_core]
  process_one_work+0x1db/0x380
--

FS I/O
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  blk_queue_enter+0x169/0x210
  ? wait_woken+0x80/0x80
  direct_make_request+0x49/0xd0
  nvme_ns_head_make_request+0xbc/0x3e0 [nvme_core]
  ? get_user_pages_fast+0xa5/0x190
  generic_make_request+0xcf/0x320
  submit_bio+0x42/0x170
  ? set_page_dirty_lock+0x40/0x60
  iomap_dio_submit_bio.isra.0+0x55/0x60
  iomap_dio_bio_actor+0x1c9/0x3d0
  iomap_dio_actor+0x58/0x1c0
  iomap_apply+0xd5/0x140
  ? iomap_dio_bio_actor+0x3d0/0x3d0
  iomap_dio_rw+0x2c2/0x440
  ? iomap_dio_bio_actor+0x3d0/0x3d0
  ? __x64_sys_io_cancel+0x150/0x150
  xfs_file_dio_aio_read+0x66/0xf0 [xfs]
  ? xfs_file_dio_aio_read+0x66/0xf0 [xfs]
  xfs_file_read_iter+0xbf/0xe0 [xfs]
  aio_read+0xc8/0x160
--

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 12+ messages in thread