linux-media.vger.kernel.org archive mirror
From: Hans Verkuil <hverkuil@xs4all.nl>
To: Keith Pyle <kpyle@austin.rr.com>,
	Linux Media Mailing List <linux-media@vger.kernel.org>
Subject: Re: hdpvr mutex deadlock on kernel 5.1.x
Date: Mon, 17 Jun 2019 12:22:59 +0200
Message-ID: <8e18508d-7c36-ead7-4c92-7335813895d0@xs4all.nl>
In-Reply-To: <14d31c83-f48f-319d-6b3a-0753ea9d2c02@austin.rr.com>

On 6/15/19 10:06 PM, Keith Pyle wrote:
> We have observed a hard mutex deadlock in the hdpvr driver on both 5.1.8
> and 5.1.10.  The problem occurs while reading from the HD-PVR device
> and results in an unkillable user process.  It is readily reproducible.
> 
> The problem has been reproduced on two different systems and with two
> different HD-PVR devices using the 0x1e firmware.
> 
> We're not particularly familiar with the hdpvr code and could use
> advice/assistance on this problem.  We're certainly willing to test patches.

Great! :-)

Can you try this first:

Edit drivers/media/usb/hdpvr/hdpvr-video.c, go to function hdpvr_device_release()
and remove the mutex_lock/unlock around the flush_work(&dev->worker); call. That's
definitely wrong.
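
To illustrate why that ordering matters, here is a minimal userspace sketch,
not the driver code. It assumes pthreads as a stand-in for the kernel
workqueue API: pthread_join() plays the role of flush_work(), and the names
fake_dev, fake_transmit_buffers, fake_device_release and run_release_demo are
hypothetical. The worker takes io_mutex, as hdpvr_transmit_buffers does in
trace 2 below, so the release path waits for it before taking the mutex.

```c
/* Userspace analogy only: pthreads stand in for the kernel workqueue API. */
#include <assert.h>
#include <pthread.h>

struct fake_dev {
	pthread_mutex_t io_mutex;	/* plays the role of dev->io_mutex */
	pthread_t worker;		/* plays the role of dev->worker */
	int buffers_sent;
};

/* Like the work item in the trace: it needs io_mutex to make progress. */
static void *fake_transmit_buffers(void *arg)
{
	struct fake_dev *dev = arg;

	pthread_mutex_lock(&dev->io_mutex);
	dev->buffers_sent++;
	pthread_mutex_unlock(&dev->io_mutex);
	return NULL;
}

/*
 * Release path with the suggested ordering: wait for the worker (the
 * flush_work() analogue) WITHOUT holding io_mutex. Waiting while holding
 * the mutex could block forever if the worker had not yet acquired it.
 */
static int fake_device_release(struct fake_dev *dev)
{
	int sent;

	pthread_join(dev->worker, NULL);	/* no io_mutex held here */

	pthread_mutex_lock(&dev->io_mutex);
	sent = dev->buffers_sent;
	pthread_mutex_unlock(&dev->io_mutex);
	return sent;
}

/* Starts the worker, runs the release path, returns the buffer count. */
static int run_release_demo(void)
{
	struct fake_dev dev = { .buffers_sent = 0 };
	int rc;

	pthread_mutex_init(&dev.io_mutex, NULL);
	rc = pthread_create(&dev.worker, NULL, fake_transmit_buffers, &dev);
	assert(rc == 0);

	rc = fake_device_release(&dev);
	pthread_mutex_destroy(&dev.io_mutex);
	return rc;
}
```

Calling run_release_demo() from a small main() (build with -pthread) should
return once the worker has finished. Moving the pthread_join() inside the
locked region would recreate the lock-then-wait ordering that lockdep
complains about.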

Compile and try again. I expect this will at least fix reports 1 and 2 below,
but I'm not sure about 3 (the read deadlock).

Regards,

	Hans

> 
> As of this writing, origin/master is
> 0011572c883082a95e02d47f45fc4a42dc0e8634 (a commit in Linus' tree).
> 
> `git log v5.1.10..origin/master --grep=hdpvr --` returns no output, so
> there appear to be no post-5.1 commits mentioning the hdpvr directly.
> 
> `git log v5.1.10..origin/master -- drivers/media/usb/hdpvr/` shows
> only some comment updates and a strncpy change.  There's nothing that
> appears locking related.
> 
> -----
> 
> Steps taken to characterize and demonstrate the problem:
> 
> We enabled kernel lock debugging using these options:
> 
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_LOCKDEP=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
> 
> The kernel was built with:
> 
> gcc version 7.4.0 (Gentoo Hardened 7.4.0-r1 p1.2)
> GNU ld (Gentoo 2.31.1 p5) 2.31.1
> 
> Three snippets of lockdep output are included below.
> 
> 1. On opening the device, lockdep reported the following:
> 
> [  113.195852] ------------[ cut here ]------------
> [  113.195869] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000e54b9652>] prepare_to_wait_event+0xa7/0xe5
> [  113.195885] WARNING: CPU: 2 PID: 2616 at __might_sleep+0x52/0x65
> [  113.195889] Modules linked in: hdpvr x86_pkg_temp_thermal efivarfs
> [  113.195902] CPU: 2 PID: 2616 Comm: cat Not tainted 5.1.10-ld+ #4
> [  113.195906] Hardware name: MSI MS-7A72/Z270 PC MATE (MS-7A72), BIOS 1.60 07/11/2017
> [  113.195914] RIP: 0010:__might_sleep+0x52/0x65
> [  113.195920] Code: 80 3d e7 98 43 01 00 75 23 48 8b 90 38 21 00 00 48 c7 c7 ce 46 11 b8 c6 05 d0 98 43 01 01 48 8b 70 10 48 89 d1 e8 56 b2 fd ff <0f> 0b 44 89 e2 89 ee 48 89 df 5b 5d 41 5c e9 de fd ff ff 83 fe 07
> [  113.195925] RSP: 0018:ffffb73b0057fd30 EFLAGS: 00010282
> [  113.195930] RAX: 0000000000000070 RBX: ffffffffb81164a0 RCX: 0000000000000007
> [  113.195935] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff9ed35eb15610
> [  113.195939] RBP: 000000000000038c R08: 0000000000000001 R09: 000000000000000c
> [  113.195943] R10: ffffb73b0057fe18 R11: 0000008391c86b93 R12: 0000000000000000
> [  113.195947] R13: 0000000000020000 R14: 0000000000000002 R15: 0000000000000000
> [  113.195953] FS:  00007fa6eee2c580(0000) GS:ffff9ed35eb00000(0000) knlGS:0000000000000000
> [  113.195958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  113.195963] CR2: 000056533197de84 CR3: 00000004425de002 CR4: 00000000001606e0
> [  113.195967] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  113.195972] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  113.195975] Call Trace:
> [  113.195985]  __mutex_lock+0x47/0x7ac
> [  113.195997]  ? hdpvr_get_next_buffer+0x16/0x4a [hdpvr]
> [  113.196005]  ? _raw_spin_unlock_irqrestore+0x39/0x4a
> [  113.196013]  ? hdpvr_get_next_buffer+0x16/0x4a [hdpvr]
> [  113.196020]  hdpvr_get_next_buffer+0x16/0x4a [hdpvr]
> [  113.196029]  hdpvr_read+0x174/0x450 [hdpvr]
> [  113.196037]  ? wait_woken+0x86/0x86
> [  113.196044]  v4l2_read+0x38/0x7a
> [  113.196052]  vfs_read+0xad/0x136
> [  113.196059]  ksys_read+0x53/0x95
> [  113.196066]  do_syscall_64+0x52/0x155
> [  113.196074]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  113.196079] RIP: 0033:0x7fa6eed52005
> [  113.196085] Code: 00 00 0f 1f 00 48 83 ec 38 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 48 8d 05 35 85 0d 00 8b 00 85 c0 75 27 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 63 48 8b 4c 24 28 64 48 33 0c 25 28 00 00 00
> [  113.196089] RSP: 002b:00007fffd97d5630 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [  113.196096] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fa6eed52005
> [  113.196100] RDX: 0000000000020000 RSI: 00007fa6eee31000 RDI: 0000000000000003
> [  113.196104] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
> [  113.196109] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fa6eee31000
> [  113.196113] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000000
> [  113.196119] irq event stamp: 1514
> [  113.196128] hardirqs last  enabled at (1513): [<ffffffffb714832b>] console_unlock+0x493/0x4da
> [  113.196134] hardirqs last disabled at (1514): [<ffffffffb7001460>] trace_hardirqs_off_thunk+0x1a/0x1c
> [  113.196140] softirqs last  enabled at (1500): [<ffffffffb7c00376>] __do_softirq+0x376/0x3b2
> [  113.196148] softirqs last disabled at (1491): [<ffffffffb70f1ea2>] irq_exit+0x4e/0x9d
> [  113.196152] ---[ end trace 0881f93401450644 ]---
> 
> 2. A short capture from the device proceeded successfully and lockdep
> reported the following on turning the device off:
> 
> [  227.915296] usb 1-10: USB disconnect, device number 9
> [  227.917973] hdpvr 1-10:1.0: device video0 disconnected
> 
> [  228.024287] ======================================================
> [  228.024291] WARNING: possible circular locking dependency detected
> [  228.024296] 5.1.10-ld+ #4 Tainted: G        W
> [  228.024300] ------------------------------------------------------
> [  228.024304] kworker/3:1/41 is trying to acquire lock:
> [  228.024308] 000000002a90d673 ((work_completion)(&dev->worker)){+.+.}, at: __flush_work+0x3d/0x245
> [  228.024322]
>                but task is already holding lock:
> [  228.024326] 000000009cff49c4 (&dev->io_mutex){+.+.}, at: hdpvr_device_release+0x22/0x89 [hdpvr]
> [  228.024337]
>                which lock already depends on the new lock.
> 
> [  228.024341]
>                the existing dependency chain (in reverse order) is:
> [  228.024344]
>                -> #1 (&dev->io_mutex){+.+.}:
> [  228.024354]        __mutex_lock+0x81/0x7ac
> [  228.024360]        hdpvr_transmit_buffers+0x3c/0x26c [hdpvr]
> [  228.024366]        process_one_work+0x2b9/0x4db
> [  228.024372]        worker_thread+0x1d1/0x2a5
> [  228.024377]        kthread+0x11c/0x124
> [  228.024383]        ret_from_fork+0x3a/0x50
> [  228.024386]
>                -> #0 ((work_completion)(&dev->worker)){+.+.}:
> [  228.024395]        lock_acquire+0x14f/0x17a
> [  228.024400]        __flush_work+0x60/0x245
> [  228.024406]        hdpvr_device_release+0x2e/0x89 [hdpvr]
> [  228.024413]        v4l2_device_release+0xaa/0xc0
> [  228.024420]        device_release+0x52/0x7a
> [  228.024427]        kobject_put+0x78/0x8f
> [  228.024433]        hdpvr_disconnect+0xd1/0xdd [hdpvr]
> [  228.024441]        usb_unbind_interface+0x8a/0x1c5
> [  228.024447]        device_release_driver_internal+0xe7/0x198
> [  228.024452]        bus_remove_device+0xf8/0x10d
> [  228.024459]        device_del+0x19e/0x2c4
> [  228.024465]        usb_disable_device+0x7b/0x18d
> [  228.024470]        usb_disconnect+0x90/0x190
> [  228.024475]        hub_event+0x6c4/0xfc4
> [  228.024481]        process_one_work+0x2b9/0x4db
> [  228.024487]        worker_thread+0x1d1/0x2a5
> [  228.024491]        kthread+0x11c/0x124
> [  228.024497]        ret_from_fork+0x3a/0x50
> [  228.024500]
>                other info that might help us debug this:
> 
> [  228.024503]  Possible unsafe locking scenario:
> 
> [  228.024506]        CPU0                    CPU1
> [  228.024509]        ----                    ----
> [  228.024512]   lock(&dev->io_mutex);
> [  228.024516]                                lock((work_completion)(&dev->worker));
> [  228.024520]                                lock(&dev->io_mutex);
> [  228.024524]   lock((work_completion)(&dev->worker));
> [  228.024528]
>                 *** DEADLOCK ***
> 
> [  228.024532] 6 locks held by kworker/3:1/41:
> [  228.024535]  #0: 00000000d157af1b ((wq_completion)usb_hub_wq){+.+.}, at: process_one_work+0x187/0x4db
> [  228.024545]  #1: 000000009ea387d9 ((work_completion)(&hub->events)){+.+.}, at: process_one_work+0x187/0x4db
> [  228.024553]  #2: 00000000cf24eaef (&dev->mutex){....}, at: hub_event+0x5b/0xfc4
> [  228.024561]  #3: 0000000096a7bf78 (&dev->mutex){....}, at: usb_disconnect+0x55/0x190
> [  228.024568]  #4: 00000000681cf121 (&dev->mutex){....}, at: device_release_driver_internal+0x15/0x198
> [  228.024576]  #5: 000000009cff49c4 (&dev->io_mutex){+.+.}, at: hdpvr_device_release+0x22/0x89 [hdpvr]
> [  228.024585]
>                stack backtrace:
> [  228.024592] CPU: 3 PID: 41 Comm: kworker/3:1 Tainted: G        W         5.1.10-ld+ #4
> [  228.024595] Hardware name: MSI MS-7A72/Z270 PC MATE (MS-7A72), BIOS 1.60 07/11/2017
> [  228.024602] Workqueue: usb_hub_wq hub_event
> [  228.024606] Call Trace:
> [  228.024614]  dump_stack+0x67/0x8e
> [  228.024622]  print_circular_bug.isra.21+0x265/0x272
> [  228.024629]  check_prev_add.constprop.29+0x479/0xa1c
> [  228.024636]  ? add_lock_to_list.isra.13+0x28/0xc1
> [  228.024642]  ? __lock_acquire+0xe83/0xf65
> [  228.024647]  ? alloc_lock_chain+0x11/0x34
> [  228.024652]  __lock_acquire+0xe83/0xf65
> [  228.024658]  lock_acquire+0x14f/0x17a
> [  228.024664]  ? __flush_work+0x3d/0x245
> [  228.024670]  __flush_work+0x60/0x245
> [  228.024675]  ? __flush_work+0x3d/0x245
> [  228.024682]  ? hdpvr_device_release+0x22/0x89 [hdpvr]
> [  228.024690]  ? kfree+0x196/0x217
> [  228.024696]  ? hdpvr_free_queue+0x68/0x73 [hdpvr]
> [  228.024703]  hdpvr_device_release+0x2e/0x89 [hdpvr]
> [  228.024708]  v4l2_device_release+0xaa/0xc0
> [  228.024716]  device_release+0x52/0x7a
> [  228.024723]  kobject_put+0x78/0x8f
> [  228.024729]  hdpvr_disconnect+0xd1/0xdd [hdpvr]
> [  228.024738]  usb_unbind_interface+0x8a/0x1c5
> [  228.024745]  device_release_driver_internal+0xe7/0x198
> [  228.024750]  bus_remove_device+0xf8/0x10d
> [  228.024758]  device_del+0x19e/0x2c4
> [  228.024764]  ? usb_remove_ep_devs+0x16/0x21
> [  228.024771]  usb_disable_device+0x7b/0x18d
> [  228.024777]  usb_disconnect+0x90/0x190
> [  228.024783]  hub_event+0x6c4/0xfc4
> [  228.024790]  ? process_one_work+0x187/0x4db
> [  228.024796]  process_one_work+0x2b9/0x4db
> [  228.024803]  ? worker_thread+0x22d/0x2a5
> [  228.024809]  worker_thread+0x1d1/0x2a5
> [  228.024815]  ? rescuer_thread+0x278/0x278
> [  228.024820]  kthread+0x11c/0x124
> [  228.024824]  ? kthread_park+0x71/0x71
> [  228.024831]  ret_from_fork+0x3a/0x50
> 
> 3. A longer capture test was then run.  After ~20 minutes, the reading process
> deadlocked and lockdep reported the following.  A reboot was required.
> 
> [ 3196.020009] INFO: task kworker/2:4:412 blocked for more than 122 seconds.
> [ 3196.020020]       Tainted: G        W         5.1.10-ld+ #4
> [ 3196.020025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3196.020032] kworker/2:4     D    0   412      2 0x80000000
> [ 3196.020049] Workqueue: events hdpvr_transmit_buffers [hdpvr]
> [ 3196.020055] Call Trace:
> [ 3196.020069]  ? __schedule+0x8f8/0x959
> [ 3196.020077]  ? __mutex_lock+0x2f7/0x7ac
> [ 3196.020084]  schedule+0x69/0x81
> [ 3196.020092]  schedule_preempt_disabled+0xc/0x14
> [ 3196.020099]  __mutex_lock+0x4d0/0x7ac
> [ 3196.020107]  ? __switch_to_asm+0x31/0x60
> [ 3196.020115]  ? __switch_to_asm+0x25/0x60
> [ 3196.020123]  ? hdpvr_transmit_buffers+0x3c/0x26c [hdpvr]
> [ 3196.020131]  ? __switch_to_asm+0x25/0x60
> [ 3196.020138]  ? __switch_to_asm+0x25/0x60
> [ 3196.020146]  ? hdpvr_transmit_buffers+0x3c/0x26c [hdpvr]
> [ 3196.020154]  hdpvr_transmit_buffers+0x3c/0x26c [hdpvr]
> [ 3196.020164]  ? process_one_work+0x187/0x4db
> [ 3196.020172]  process_one_work+0x2b9/0x4db
> [ 3196.020181]  ? worker_thread+0x22d/0x2a5
> [ 3196.020188]  worker_thread+0x1d1/0x2a5
> [ 3196.020196]  ? rescuer_thread+0x278/0x278
> [ 3196.020202]  kthread+0x11c/0x124
> [ 3196.020209]  ? kthread_park+0x71/0x71
> [ 3196.020217]  ret_from_fork+0x3a/0x50
> [ 3196.020250] INFO: task cat_hdpvr:3736 blocked for more than 122 seconds.
> [ 3196.020256]       Tainted: G        W         5.1.10-ld+ #4
> [ 3196.020261] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3196.020266] cat_hdpvr       D    0  3736   3735 0x80000002
> [ 3196.020273] Call Trace:
> [ 3196.020281]  ? __schedule+0x8f8/0x959
> [ 3196.020287]  ? __mutex_lock+0x2f7/0x7ac
> [ 3196.020294]  schedule+0x69/0x81
> [ 3196.020301]  schedule_preempt_disabled+0xc/0x14
> [ 3196.020307]  __mutex_lock+0x4d0/0x7ac
> [ 3196.020316]  ? hdpvr_release+0x24/0x5b [hdpvr]
> [ 3196.020327]  ? kmem_cache_free+0x13a/0x20c
> [ 3196.020334]  ? hdpvr_release+0x24/0x5b [hdpvr]
> [ 3196.020341]  hdpvr_release+0x24/0x5b [hdpvr]
> [ 3196.020350]  v4l2_release+0x83/0xbe
> [ 3196.020360]  __fput+0xfa/0x19b
> [ 3196.020369]  task_work_run+0x7a/0x9c
> [ 3196.020377]  do_exit+0x4f3/0xa41
> [ 3196.020385]  do_group_exit+0xad/0xad
> [ 3196.020393]  __x64_sys_exit_group+0xf/0xf
> [ 3196.020400]  do_syscall_64+0x52/0x155
> [ 3196.020409]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 3196.020416] RIP: 0033:0x7f41505bdcd6
> [ 3196.020427] Code: Bad RIP value.
> [ 3196.020433] RSP: 002b:00007ffe864cdad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [ 3196.020442] RAX: ffffffffffffffda RBX: 00007f41506b25c0 RCX: 00007f41505bdcd6
> [ 3196.020447] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [ 3196.020453] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
> [ 3196.020458] R10: 0000000000000002 R11: 0000000000000246 R12: 00007f41506b25c0
> [ 3196.020463] R13: 0000000000000001 R14: 00007f41506bb268 R15: 0000000000000000
> [ 3196.020495] INFO: lockdep is turned off.
> 

