iommu.lists.linux-foundation.org archive mirror
* VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)
@ 2019-06-20 10:44 Peter Xu
  2019-06-21  1:58 ` Lu Baolu
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Xu @ 2019-06-20 10:44 UTC (permalink / raw)
  To: Linux IOMMU Mailing List; +Cc: dave.jiang

Hi,

With 5.2.0-rc5 I can easily trigger this with lockdep:

--------------------------------------------
Jun 20 14:37:37 xz-x1 kernel: ======================================================
Jun 20 14:37:37 xz-x1 kernel: WARNING: possible circular locking dependency detected
Jun 20 14:37:37 xz-x1 kernel: 5.2.0-rc5 #78 Not tainted
Jun 20 14:37:37 xz-x1 kernel: ------------------------------------------------------
Jun 20 14:37:37 xz-x1 kernel: swapper/0/1 is trying to acquire lock:
Jun 20 14:37:37 xz-x1 kernel: 00000000ea2b3beb (&(&iommu->lock)->rlock){+.+.}, at: domain_context_mapping_one+0xa5/0x4e0
Jun 20 14:37:37 xz-x1 kernel:
                              but task is already holding lock:
Jun 20 14:37:37 xz-x1 kernel: 00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0
Jun 20 14:37:37 xz-x1 kernel:
                              which lock already depends on the new lock.
Jun 20 14:37:37 xz-x1 kernel:
                              the existing dependency chain (in reverse order) is:
Jun 20 14:37:37 xz-x1 kernel:
                              -> #1 (device_domain_lock){....}:
Jun 20 14:37:37 xz-x1 kernel:        _raw_spin_lock_irqsave+0x3c/0x50
Jun 20 14:37:37 xz-x1 kernel:        dmar_insert_one_dev_info+0xbb/0x510
Jun 20 14:37:37 xz-x1 kernel:        domain_add_dev_info+0x50/0x90
Jun 20 14:37:37 xz-x1 kernel:        dev_prepare_static_identity_mapping+0x30/0x68
Jun 20 14:37:37 xz-x1 kernel:        intel_iommu_init+0xddd/0x1422
Jun 20 14:37:37 xz-x1 kernel:        pci_iommu_init+0x16/0x3f
Jun 20 14:37:37 xz-x1 kernel:        do_one_initcall+0x5d/0x2b4
Jun 20 14:37:37 xz-x1 kernel:        kernel_init_freeable+0x218/0x2c1
Jun 20 14:37:37 xz-x1 kernel:        kernel_init+0xa/0x100
Jun 20 14:37:37 xz-x1 kernel:        ret_from_fork+0x3a/0x50
Jun 20 14:37:37 xz-x1 kernel:
                              -> #0 (&(&iommu->lock)->rlock){+.+.}:
Jun 20 14:37:37 xz-x1 kernel:        lock_acquire+0x9e/0x170
Jun 20 14:37:37 xz-x1 kernel:        _raw_spin_lock+0x25/0x30
Jun 20 14:37:37 xz-x1 kernel:        domain_context_mapping_one+0xa5/0x4e0
Jun 20 14:37:37 xz-x1 kernel:        pci_for_each_dma_alias+0x30/0x140
Jun 20 14:37:37 xz-x1 kernel:        dmar_insert_one_dev_info+0x3b2/0x510
Jun 20 14:37:37 xz-x1 kernel:        domain_add_dev_info+0x50/0x90
Jun 20 14:37:37 xz-x1 kernel:        dev_prepare_static_identity_mapping+0x30/0x68
Jun 20 14:37:37 xz-x1 kernel:        intel_iommu_init+0xddd/0x1422
Jun 20 14:37:37 xz-x1 kernel:        pci_iommu_init+0x16/0x3f
Jun 20 14:37:37 xz-x1 kernel:        do_one_initcall+0x5d/0x2b4
Jun 20 14:37:37 xz-x1 kernel:        kernel_init_freeable+0x218/0x2c1
Jun 20 14:37:37 xz-x1 kernel:        kernel_init+0xa/0x100
Jun 20 14:37:37 xz-x1 kernel:        ret_from_fork+0x3a/0x50
Jun 20 14:37:37 xz-x1 kernel:
                              other info that might help us debug this:
Jun 20 14:37:37 xz-x1 kernel:  Possible unsafe locking scenario:
Jun 20 14:37:37 xz-x1 kernel:        CPU0                    CPU1
Jun 20 14:37:37 xz-x1 kernel:        ----                    ----
Jun 20 14:37:37 xz-x1 kernel:   lock(device_domain_lock);
Jun 20 14:37:37 xz-x1 kernel:                                lock(&(&iommu->lock)->rlock);
Jun 20 14:37:37 xz-x1 kernel:                                lock(device_domain_lock);
Jun 20 14:37:37 xz-x1 kernel:   lock(&(&iommu->lock)->rlock);
Jun 20 14:37:37 xz-x1 kernel:
                               *** DEADLOCK ***
Jun 20 14:37:37 xz-x1 kernel: 2 locks held by swapper/0/1:
Jun 20 14:37:37 xz-x1 kernel:  #0: 00000000033eb13d (dmar_global_lock){++++}, at: intel_iommu_init+0x1e0/0x1422
Jun 20 14:37:37 xz-x1 kernel:  #1: 00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0
Jun 20 14:37:37 xz-x1 kernel:
                              stack backtrace:
Jun 20 14:37:37 xz-x1 kernel: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5 #78
Jun 20 14:37:37 xz-x1 kernel: Hardware name: LENOVO 20KGS35G01/20KGS35G01, BIOS N23ET50W (1.25 ) 06/25/2018
Jun 20 14:37:37 xz-x1 kernel: Call Trace:
Jun 20 14:37:37 xz-x1 kernel:  dump_stack+0x85/0xc0
Jun 20 14:37:37 xz-x1 kernel:  print_circular_bug.cold.57+0x15c/0x195
Jun 20 14:37:37 xz-x1 kernel:  __lock_acquire+0x152a/0x1710
Jun 20 14:37:37 xz-x1 kernel:  lock_acquire+0x9e/0x170
Jun 20 14:37:37 xz-x1 kernel:  ? domain_context_mapping_one+0xa5/0x4e0
Jun 20 14:37:37 xz-x1 kernel:  _raw_spin_lock+0x25/0x30
Jun 20 14:37:37 xz-x1 kernel:  ? domain_context_mapping_one+0xa5/0x4e0
Jun 20 14:37:37 xz-x1 kernel:  domain_context_mapping_one+0xa5/0x4e0
Jun 20 14:37:37 xz-x1 kernel:  ? domain_context_mapping_one+0x4e0/0x4e0
Jun 20 14:37:37 xz-x1 kernel:  pci_for_each_dma_alias+0x30/0x140
Jun 20 14:37:37 xz-x1 kernel:  dmar_insert_one_dev_info+0x3b2/0x510
Jun 20 14:37:37 xz-x1 kernel:  domain_add_dev_info+0x50/0x90
Jun 20 14:37:37 xz-x1 kernel:  dev_prepare_static_identity_mapping+0x30/0x68
Jun 20 14:37:37 xz-x1 kernel:  intel_iommu_init+0xddd/0x1422
Jun 20 14:37:37 xz-x1 kernel:  ? printk+0x58/0x6f
Jun 20 14:37:37 xz-x1 kernel:  ? lockdep_hardirqs_on+0xf0/0x180
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? e820__memblock_setup+0x63/0x63
Jun 20 14:37:37 xz-x1 kernel:  pci_iommu_init+0x16/0x3f
Jun 20 14:37:37 xz-x1 kernel:  do_one_initcall+0x5d/0x2b4
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  ? rcu_read_lock_sched_held+0x55/0x60
Jun 20 14:37:37 xz-x1 kernel:  ? do_early_param+0x8e/0x8e
Jun 20 14:37:37 xz-x1 kernel:  kernel_init_freeable+0x218/0x2c1
Jun 20 14:37:37 xz-x1 kernel:  ? rest_init+0x230/0x230
Jun 20 14:37:37 xz-x1 kernel:  kernel_init+0xa/0x100
Jun 20 14:37:37 xz-x1 kernel:  ret_from_fork+0x3a/0x50
--------------------------------------------

domain_context_mapping_one() takes device_domain_lock first and then
the per-iommu lock, while dmar_insert_one_dev_info() acquires them in
the reverse order.
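
To illustrate, here is roughly how the two paths look in 5.2-rc5 (a
simplified sketch from my reading of drivers/iommu/intel-iommu.c;
error handling and everything unrelated elided):

  static int domain_context_mapping_one(...)
  {
          spin_lock_irqsave(&device_domain_lock, flags); /* global lock first */
          spin_lock(&iommu->lock);                       /* per-iommu lock second */
          ...
          spin_unlock(&iommu->lock);
          spin_unlock_irqrestore(&device_domain_lock, flags);
  }

  static struct dmar_domain *dmar_insert_one_dev_info(...)
  {
          spin_lock(&iommu->lock);                       /* per-iommu lock first */
          spin_lock_irqsave(&device_domain_lock, flags); /* global lock second */
          ...
          spin_unlock_irqrestore(&device_domain_lock, flags);
          spin_unlock(&iommu->lock);
  }

If the two paths run concurrently, each can grab its first lock and
then block forever on the other's, which is exactly the cycle in the
lockdep report above.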

I dug around a bit and found this commit, which looks suspicious:

7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and
              device_domain_lock", 2019-05-27)

More interestingly, that commit was itself trying to fix a
lock-inversion deadlock...

The thing is, even if the above commit gets the ordering right (it
still feels strange to me that we would need to take a per-iommu lock
before a global lock), it looks like an incomplete fix, because other
places still acquire the two locks the other way round.

Reading the commit message more closely, it seems to say that the
iommu lock is already held somewhere before we reach
iommu_flush_dev_iotlb(), but I cannot see how that happens: I cannot
find any path on which the iommu lock is held when
iommu_flush_dev_iotlb() is called.  So I do not understand how the
lockdep warning in that commit could have triggered.
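
For reference, iommu_flush_dev_iotlb() itself takes
device_domain_lock internally; roughly (again a simplified sketch of
the 5.2-rc5 code, from memory):

  static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
                                    u64 addr, unsigned mask)
  {
          unsigned long flags;
          struct device_domain_info *info;

          spin_lock_irqsave(&device_domain_lock, flags);
          list_for_each_entry(info, &domain->devices, link)
                  ... /* flush the device IOTLB for each attached device */
          spin_unlock_irqrestore(&device_domain_lock, flags);
  }

So the iommu->lock -> device_domain_lock dependency that the commit
message describes could only exist if some caller held iommu->lock
when calling this, and I cannot find such a caller.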

I reverted that commit and everything looks good here now (no long
runs yet, but at least the previous deadlock warning is gone).  With
the revert, IMHO we actually get a consistent ordering across the
whole tree: device_domain_lock is always taken before the per-iommu
lock.
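
In other words, if I'm reading the rest of the code right, the
tree-wide nesting rule after the revert is simply:

  /* for any path that needs both locks: */
  spin_lock_irqsave(&device_domain_lock, flags); /* 1st: global lock */
  spin_lock(&iommu->lock);                       /* 2nd: per-iommu lock */
  ...
  spin_unlock(&iommu->lock);
  spin_unlock_irqrestore(&device_domain_lock, flags);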

Is there anything I've missed about why 7560cc3ca7d9 was needed?

Thanks,

-- 
Peter Xu


* Re: VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)
  2019-06-20 10:44 VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5) Peter Xu
@ 2019-06-21  1:58 ` Lu Baolu
  2019-06-21  2:30   ` Peter Xu
  0 siblings, 1 reply; 3+ messages in thread
From: Lu Baolu @ 2019-06-21  1:58 UTC (permalink / raw)
  To: Peter Xu, Linux IOMMU Mailing List; +Cc: dave.jiang

Hi Peter,

I agree with you that 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion 
between iommu->lock and device_domain_lock") isn't a good fix. There
is also another thread, https://lkml.org/lkml/2019/6/18/996, which
reported this.

I think we can revert this patch now. I will try to reproduce the
original issue and try to find a new fix.

Can you please submit the revert patch?

Best regards,
Baolu

On 6/20/19 6:44 PM, Peter Xu wrote:
> Hi,
> 
> With 5.2.0-rc5 I can easily trigger this with lockdep:
> 
> [lockdep splat snipped]
> 
> domain_context_mapping_one() takes device_domain_lock first and then
> the per-iommu lock, while dmar_insert_one_dev_info() acquires them in
> the reverse order.
> 
> I dug around a bit and found this commit, which looks suspicious:
> 
> 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and
>                device_domain_lock", 2019-05-27)
> 
> More interestingly, that commit was itself trying to fix a
> lock-inversion deadlock...
> 
> The thing is, even if the above commit gets the ordering right (it
> still feels strange to me that we would need to take a per-iommu lock
> before a global lock), it looks like an incomplete fix, because other
> places still acquire the two locks the other way round.
> 
> Reading the commit message more closely, it seems to say that the
> iommu lock is already held somewhere before we reach
> iommu_flush_dev_iotlb(), but I cannot see how that happens: I cannot
> find any path on which the iommu lock is held when
> iommu_flush_dev_iotlb() is called.  So I do not understand how the
> lockdep warning in that commit could have triggered.
> 
> I reverted that commit and everything looks good here now (no long
> runs yet, but at least the previous deadlock warning is gone).  With
> the revert, IMHO we actually get a consistent ordering across the
> whole tree: device_domain_lock is always taken before the per-iommu
> lock.
> 
> Is there anything I've missed about why 7560cc3ca7d9 was needed?
> 
> Thanks,
> 


* Re: VT-d deadlock issue on device_domain_lock and iommu lock (5.2-rc5)
  2019-06-21  1:58 ` Lu Baolu
@ 2019-06-21  2:30   ` Peter Xu
  0 siblings, 0 replies; 3+ messages in thread
From: Peter Xu @ 2019-06-21  2:30 UTC (permalink / raw)
  To: Lu Baolu; +Cc: Linux IOMMU Mailing List, dave.jiang

On Fri, Jun 21, 2019 at 09:58:28AM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> I agree with you that 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between
> iommu->lock and device_domain_lock") isn't a good fix. There
> is also another thread, https://lkml.org/lkml/2019/6/18/996, which
> reported this.
> 
> I think we can revert this patch now. I will try to reproduce the
> original issue and try to find a new fix.
> 
> Can you please submit the revert patch?

Sure, I'll post a formal patch soon.  Thanks,
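
(I expect it to be a plain revert, i.e. roughly:

  git revert 7560cc3ca7d9

with a changelog pointing back at the lockdep splat above.)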

-- 
Peter Xu

