linux-nvme.lists.infradead.org archive mirror
* nvme deadlock with ANA
From: Sagi Grimberg @ 2020-03-26  6:23 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, linux-nvme, Anton Eidelman,
	Hannes Reinecke

Hey,

I want to consult with you guys on a deadlock condition I'm able to
hit with a test that incorporates controller reconnect, ANA updates
and live I/O with timeouts.

This shows up with NVMe/TCP, but can also happen in the rdma or pci
drivers.

The deadlock combines 4 flows in parallel:
- ns scanning (triggered from reconnect)
- request timeout
- ANA update (triggered from reconnect)
- FS I/O coming into the mpath device

(1) ns scanning triggers disk revalidation -> update disk info ->
     freeze queue -> but blocked, why?

(2) the timeout work references the q_usage_counter -> but blocks in
     the driver timeout handler, why?

(3) the timeout handler (indirectly) calls nvme_stop_queues() -> which
     takes the namespaces_rwsem -> but blocks, why?

(4) ANA update takes the namespaces_rwsem -> calls nvme_mpath_set_live()
     -> which synchronizes the ns_head srcu (see commit 504db087aacc) ->
     but it blocks, why?

(5) FS I/O comes into nvme_ns_head_make_request -> takes srcu_read_lock ->
     direct_make_request -> blk_queue_enter -> but blocked, why?

==> because of (1), the request queue is under freeze -> deadlock.
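
To make the dependency between (1), (2) and (5) concrete, here is a
simplified sketch of the freeze/enter interaction (my paraphrase of
the mechanism, not actual kernel source):

--
/* (1) scan work: */
blk_mq_freeze_queue(q);    /* kills q->q_usage_counter and waits for
                            * all outstanding references to drain */

/* (2) blk_mq_timeout_work() holds a q_usage_counter reference while
 * the driver ->timeout() runs, so the wait above cannot finish while
 * nvme_tcp_timeout() is blocked in (3). */

/* (5) nvme_ns_head_make_request(), simplified: */
srcu_idx = srcu_read_lock(&head->srcu);
...
blk_queue_enter(q, 0);     /* sleeps until the queue is unfrozen, with
                            * the srcu read side still held -- which is
                            * exactly what synchronize_srcu() in (4)
                            * waits for */
--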

Now as I said, this reproduces with nvme-tcp, but rdma does pretty much
the same thing, and if we look at pci, it also calls nvme_dev_disable,
which also calls nvme_stop_queues, and will also block on
nvme_mpath_set_live (so this is not specific to ANA).

I'm trying to think about what the proper solution for this is, so I
thought I'd send it out for some brainstorming...

Any thoughts on this?


See some traces for visualization:
=================================
- ns_scanning
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  blk_mq_freeze_queue_wait+0x4b/0xb0
  ? wait_woken+0x80/0x80
  blk_mq_freeze_queue+0x1b/0x20
  nvme_update_disk_info+0x62/0x3a0 [nvme_core]
  __nvme_revalidate_disk+0xb8/0x110 [nvme_core]
  nvme_revalidate_disk+0xa2/0x110 [nvme_core]
  revalidate_disk+0x2b/0xa0
  nvme_validate_ns+0x49/0x900 [nvme_core]
  ? blk_mq_free_request+0xd2/0x100
  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
  ? __switch_to_asm+0x40/0x70
  ? _cond_resched+0x19/0x30
  ? down_read+0x13/0xa0
  nvme_scan_work+0x24f/0x380 [nvme_core]
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
--

- request timeout
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  schedule_timeout+0x203/0x2f0
  ? ttwu_do_activate+0x5b/0x70
  wait_for_completion+0xb1/0x120
  ? wake_up_q+0x70/0x70
  __flush_work+0x123/0x1d0
  ? worker_detach_from_pool+0xb0/0xb0
  flush_work+0x10/0x20
  nvme_tcp_timeout+0x69/0xb0 [nvme_tcp]
  blk_mq_check_expired+0x13d/0x160
  bt_iter+0x52/0x60
  blk_mq_queue_tag_busy_iter+0x1a4/0x2f0
  ? blk_poll+0x350/0x350
  ? blk_poll+0x350/0x350
  ? syscall_return_via_sysret+0xf/0x7f
  blk_mq_timeout_work+0x59/0x110
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
  kthread+0x104/0x140
  ? process_one_work+0x380/0x380
  ? kthread_park+0x80/0x80
  ret_from_fork+0x35/0x40

...

Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  rwsem_down_read_slowpath+0x16c/0x4a0
  down_read+0x85/0xa0
  nvme_stop_queues+0x21/0x50 [nvme_core]
  nvme_tcp_teardown_io_queues.part.21+0x19/0x80 [nvme_tcp]
  nvme_tcp_error_recovery_work+0x33/0x80 [nvme_tcp]
  process_one_work+0x1db/0x380
  worker_thread+0x4d/0x400
  kthread+0x104/0x140
  ? process_one_work+0x380/0x380
  ? kthread_park+0x80/0x80
  ret_from_fork+0x35/0x40
--

- ANA update
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  schedule_timeout+0x203/0x2f0
  ? __queue_work+0x117/0x3f0
  wait_for_completion+0xb1/0x120
  ? wake_up_q+0x70/0x70
  __synchronize_srcu.part.0+0x81/0xb0
  ? __bpf_trace_rcu_utilization+0x10/0x10
  ? ktime_get_mono_fast_ns+0x4e/0xa0
  synchronize_srcu_expedited+0x28/0x30
  synchronize_srcu+0x57/0xe0
  nvme_mpath_set_live+0x4f/0x140 [nvme_core]
  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
  nvme_update_ana_state+0xca/0xe0 [nvme_core]
  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
  ? nvme_ana_work+0x20/0x20 [nvme_core]
  nvme_read_ana_log+0x76/0x100 [nvme_core]
  nvme_ana_work+0x15/0x20 [nvme_core]
  process_one_work+0x1db/0x380
--

- FS I/O
--
Call Trace:
  __schedule+0x2b9/0x6c0
  schedule+0x42/0xb0
  blk_queue_enter+0x169/0x210
  ? wait_woken+0x80/0x80
  direct_make_request+0x49/0xd0
  nvme_ns_head_make_request+0xbc/0x3e0 [nvme_core]
  ? get_user_pages_fast+0xa5/0x190
  generic_make_request+0xcf/0x320
  submit_bio+0x42/0x170
  ? set_page_dirty_lock+0x40/0x60
  iomap_dio_submit_bio.isra.0+0x55/0x60
  iomap_dio_bio_actor+0x1c9/0x3d0
  iomap_dio_actor+0x58/0x1c0
  iomap_apply+0xd5/0x140
  ? iomap_dio_bio_actor+0x3d0/0x3d0
  iomap_dio_rw+0x2c2/0x440
  ? iomap_dio_bio_actor+0x3d0/0x3d0
  ? __x64_sys_io_cancel+0x150/0x150
  xfs_file_dio_aio_read+0x66/0xf0 [xfs]
  ? xfs_file_dio_aio_read+0x66/0xf0 [xfs]
  xfs_file_read_iter+0xbf/0xe0 [xfs]
  aio_read+0xc8/0x160
--


* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-03-26  6:29 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, linux-nvme, Anton Eidelman,
	Hannes Reinecke, James Smart

Adding James, as this can also be relevant for FC (or maybe it
isn't)...

On 3/25/20 11:23 PM, Sagi Grimberg wrote:
> Hey,
> 
> I want to consult with you guys on a deadlock condition I'm able to
> hit with a test that incorporates controller reconnect, ANA updates
> and live I/O with timeouts.
> 
> [...]


* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-04-02  7:09 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, linux-nvme, Anton Eidelman,
	Hannes Reinecke, James Smart

Does anyone have a good idea on this?

On 3/25/20 11:29 PM, Sagi Grimberg wrote:
> Adding James, as this can also be relevant for FC (or maybe it
> isn't)...
> 
> [...]


* Re: nvme deadlock with ANA
From: Christoph Hellwig @ 2020-04-02 15:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme,
	Hannes Reinecke

On Wed, Mar 25, 2020 at 11:23:50PM -0700, Sagi Grimberg wrote:
> [...]
>
> (1) ns scanning triggers disk revalidation -> update disk info ->
>     freeze queue -> but blocked, why?

What does -> but blocked mean?
> 
> (2) the timeout work references the q_usage_counter -> but blocks in
>     the driver timeout handler, why?

The timeout handler obviously needs to keep the queue alive while
running.  We could think of doing a try_get, though?



* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-04-02 15:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Anton Eidelman, Hannes Reinecke, linux-nvme


>> (1) ns scanning triggers disk revalidation -> update disk info ->
>>      freeze queue -> but blocked, why?
> 
> What does -> but blocked mean?

It is blocked and cannot complete, because of (2)

>> (2) the timeout work references the q_usage_counter -> but blocks in
>>      the driver timeout handler, why?
> 
> The timeout handler obviously needs to keep the queue alive while
> running.  We could think of doing a try_get, though?

It is keeping the queue alive, that is not the issue. It is blocked in
the driver .timeout() handler (i.e. nvme_tcp_timeout).

The reason it is blocked and cannot make forward progress is that the
driver timeout handler will call nvme_stop_queues(), which is blocked
because it takes namespaces_rwsem...

There is a circular chain of dependencies that deadlocks.
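
Spelling the cycle out from the traces:

(1) blk_mq_freeze_queue() waits for q_usage_counter to drain, but a
    reference is held by the timeout work (2);
(2) nvme_tcp_timeout() flushes the error recovery work, which runs (3);
(3) nvme_stop_queues() waits for namespaces_rwsem, held for write by
    the ANA update (4);
(4) nvme_mpath_set_live() waits in synchronize_srcu() for the srcu
    reader (5);
(5) nvme_ns_head_make_request() waits in blk_queue_enter() on the
    queue freeze started in (1).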


* Re: nvme deadlock with ANA
From: Hannes Reinecke @ 2020-04-02 15:30 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme

On 4/2/20 5:24 PM, Sagi Grimberg wrote:
> [...]
> 
> The reason it is blocked and cannot make forward progress is that the
> driver timeout handler will call nvme_stop_queues(), which is blocked
> because it takes namespaces_rwsem...
> 
> There is a circular chain of dependencies that deadlocks.

Can't you simply call 'nvme_reset_ctrl()'?
Seems to work reasonably well on the fc side, so I wonder what's
different for tcp ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-04-02 15:38 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, Anton Eidelman, linux-nvme


>> There is a circular chain of dependencies that deadlocks.
> 
> Can't you simply call 'nvme_reset_ctrl()'?
> Seems to work reasonably well on the fc side, so I wonder what's
> different for tcp ...

As I mentioned, this is not specific to tcp; the pci timeout handler
can block just as well, because it calls nvme_dev_disable, which calls
nvme_stop_queues. The rest of the flows are not related to the
transport.

There is no expectation that a driver always defers its timeout
handling to a different context, but should we make that the rule for
the nvme transports at least?


* Re: nvme deadlock with ANA
From: Keith Busch @ 2020-04-02 16:00 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Anton Eidelman, Christoph Hellwig, linux-nvme, Hannes Reinecke

On Wed, Mar 25, 2020 at 11:23:50PM -0700, Sagi Grimberg wrote:
> [...]
> 
> (4) ANA update takes the namespaces_rwsem -> calls nvme_mpath_set_live()
>     -> which synchronizes the ns_head srcu (see commit 504db087aacc) ->
>     but it blocks, why?


Why is ana updating occurring under a write lock? AFAIK, that should be
necessary only if it's manipulating the ctrl->namespaces list, but this
is just iterating it. What's the harm in using the read lock here?
 


* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-04-02 16:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Anton Eidelman, Christoph Hellwig, linux-nvme, Hannes Reinecke


>> [...]
>>
>> (4) ANA update takes the namespaces_rwsem -> calls nvme_mpath_set_live()
>>      -> which synchronizes the ns_head srcu (see commit 504db087aacc) ->
>>      but it blocks, why?
> 
> Why is ana updating occurring under a write lock? AFAIK, that should be
> necessary only if it's manipulating the ctrl->namespaces list, but this
> is just iterating it. What's the harm in using the read lock here?

Yes, you are absolutely correct!

I think this should resolve the deadlock: nvme_stop_queues can make
forward progress, then the transport timeout handler can complete and
drop its q_usage_counter reference, and the queue freeze in
update_disk_info will complete.

I'll send a patch.
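
Roughly, what I have in mind (a sketch only, untested -- assuming the
down_write() sits in nvme_update_ana_state() as the trace suggests;
the actual patch may end up looking different):

--
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
-	down_write(&ctrl->namespaces_rwsem);
+	/*
+	 * We only iterate ctrl->namespaces here and never modify the
+	 * list itself, so the read lock is sufficient -- and it lets
+	 * nvme_stop_queues() make progress while we wait in
+	 * nvme_mpath_set_live().
+	 */
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		...
 	}
-	up_write(&ctrl->namespaces_rwsem);
+	up_read(&ctrl->namespaces_rwsem);
--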

Thanks!


* Re: nvme deadlock with ANA
From: Hannes Reinecke @ 2020-04-02 16:12 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg; +Cc: Anton Eidelman, Christoph Hellwig, linux-nvme

On 4/2/20 6:00 PM, Keith Busch wrote:
> On Wed, Mar 25, 2020 at 11:23:50PM -0700, Sagi Grimberg wrote:
>> [...]
>>
>> (4) ANA update takes the namespaces_rwsem -> calls nvme_mpath_set_live()
>>      -> which synchronizes the ns_head srcu (see commit 504db087aacc) ->
>>      but it blocks, why?
> 
> Why is ana updating occurring under a write lock? AFAIK, that should be
> necessary only if it's manipulating the ctrl->namespaces list, but this
> is just iterating it. What's the harm in using the read lock here?

While we do lock ns->head when updating the ANA state, we do not
protect the ns->head pointer itself.
So I guess the intention was to avoid anyone messing with the ns->head
pointer while traversing; but looking closer, I think we should be good
with just taking the read lock.
Can you check whether it makes a difference?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: nvme deadlock with ANA
From: Sagi Grimberg @ 2020-04-02 16:18 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: Anton Eidelman, Christoph Hellwig, linux-nvme


>>> [...]
>>
>> Why is ana updating occurring under a write lock? AFAIK, that should be
>> necessary only if it's manipulating the ctrl->namespaces list, but this
>> is just iterating it. What's the harm in using the read lock here?
> 
> While we do lock ns->head when updating the ANA state, we do not
> protect the ns->head pointer itself.
> So I guess the intention was to avoid anyone messing with the ns->head
> pointer while traversing; but looking closer, I think we should be good
> with just taking the read lock.
> Can you check whether it makes a difference?

I will, but this change is needed anyway AFAICT.


* Re: nvme deadlock with ANA
From: James Smart @ 2020-04-02 17:22 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, Anton Eidelman, linux-nvme

On 4/2/2020 8:30 AM, Hannes Reinecke wrote:
> On 4/2/20 5:24 PM, Sagi Grimberg wrote:
>> [...]
>
> Can't you simply call 'nvme_reset_ctrl()'?
> Seems to work reasonably well on the fc side, so I wonder what's
> different for tcp ...

It's no different with FC. We likely do stay in the resetting state
longer, since we have to wait for acknowledgement of the io abort, but
the flows in general aren't different.

-- james

