From: "Fangjian (Turing)" <f.fangjian@huawei.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Sinan Kaya <Okaya@kernel.org>
Cc: <linux-pci@vger.kernel.org>, Bjorn Helgaas <helgaas@kernel.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: Bug report: AER driver deadlock
Date: Thu, 20 Jun 2019 11:14:11 +0800
Message-ID: <7b98d81c-55bd-1782-f214-7bbf48e54f16@huawei.com>
In-Reply-To: <a1c90cfb9ce4062b4823c6647d7709baf1c5534f.camel@kernel.crashing.org>

Hi,
Is there any further advice?

On 2019/6/5 7:47, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-04 at 10:34 -0400, Sinan Kaya wrote:
>> On 6/3/19, Fangjian (Turing) <f.fangjian@huawei.com> wrote:
>>> Hi, we hit a deadlock triggered by a NONFATAL AER event during a sysfs
>>> "sriov_numvfs" operation. Any suggestions for fixing this deadlock?
>>>
>>>   enable one VF
>>>   # echo 1 > /sys/devices/pci0000:74/0000:74:00.0/0000:75:00.0/sriov_numvfs
>>>
>>>   The sysfs "sriov_numvfs" side is:
>>>
>>>     sriov_numvfs_store
>>>       device_lock                               # hold the device_lock
>>>         ...
>>>         pci_enable_sriov
>>>           sriov_enable
>>>             ...
>>>             pci_device_add
>>>               down_write(&pci_bus_sem)          # wait for up_read(&pci_bus_sem)
>>>
>>>   The AER side is:
>>>
>>>     pcie_do_recovery
>>>       pci_walk_bus
>>>         down_read(&pci_bus_sem)                 # hold the rw_semaphore
>>>         report_resume
>>
>> Should we replace these device locks with a trylock loop with some sleep
>> statements? This could solve the immediate deadlock issue until
>> someone implements granular locking in pci.
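A minimal userspace sketch of the trylock idea, not kernel code. It assumes two Python locks standing in for the device lock and pci_bus_sem (the rwsem is simplified to a plain mutex), and the function names are hypothetical. The loop only makes progress because the AER side drops pci_bus_sem before sleeping; a trylock loop that kept holding it would spin forever while the sriov side waits in down_write().

```python
import threading
import time

# Stand-ins for the kernel locks (assumption: rwsem simplified to a mutex).
device_lock = threading.Lock()
pci_bus_sem = threading.Lock()


def sriov_side(holding, done):
    """Models sriov_numvfs_store: device lock first, then the bus semaphore."""
    with device_lock:
        holding.set()          # tell the AER side that device_lock is held
        time.sleep(0.05)       # let the AER side take pci_bus_sem first
        with pci_bus_sem:
            done.append("sriov")


def aer_side(holding, done, max_tries=200):
    """Models pci_walk_bus + report_resume using trylock with back-off."""
    holding.wait()
    for _ in range(max_tries):
        with pci_bus_sem:
            if device_lock.acquire(blocking=False):
                try:
                    done.append("aer")
                    return True
                finally:
                    device_lock.release()
        # trylock failed: pci_bus_sem has been released, so the other
        # side can make progress while this side sleeps
        time.sleep(0.01)
    return False


def run_trylock_model():
    """Run both sides once and return the order in which they finished."""
    holding, done = threading.Event(), []
    threads = [
        threading.Thread(target=sriov_side, args=(holding, done)),
        threading.Thread(target=aer_side, args=(holding, done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return done
```

Calling `run_trylock_model()` returns `["sriov", "aer"]`: the sriov side finishes first, then the AER side gets the device lock on a later retry. Whether the real recovery path can safely drop pci_bus_sem in the middle of a bus walk is a separate question that this model does not answer.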
> 
> That won't necessarily solve this AB->BA problem. I think the issue
> here is that sriov shouldn't device_lock before doing something that
> can take the pci_bus_sem.
> 
> Ben.
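To make the AB->BA ordering concrete, here is a small sketch of a lock-order checker. It is only loosely in the spirit of lockdep and is not how lockdep is implemented; the class name and the list-of-names input format are assumptions made for illustration. It records which lock was held while another was taken and flags a path that takes the same pair in the opposite order.

```python
class LockOrderChecker:
    """Records (held, acquired) pairs and reports direct order inversions.

    Assumption: each code path is given as a list of lock names in
    acquisition order, outermost first. Only pairwise inversions are
    detected, not longer cycles.
    """

    def __init__(self):
        self.edges = set()

    def record(self, path):
        inversions = []
        for i, held in enumerate(path):
            for acquired in path[i + 1:]:
                # The reverse pair was seen earlier: AB->BA ordering problem.
                if (acquired, held) in self.edges:
                    inversions.append((held, acquired))
                self.edges.add((held, acquired))
        return inversions


checker = LockOrderChecker()
# sriov_numvfs_store path from the report: device_lock, then pci_bus_sem
sriov_report = checker.record(["device_lock", "pci_bus_sem"])
# AER path from the report: pci_bus_sem (read), then device_lock
aer_report = checker.record(["pci_bus_sem", "device_lock"])
print(sriov_report, aer_report)
```

The first path reports nothing and the second reports the pair `("pci_bus_sem", "device_lock")`. If both paths took the two locks in the same order, neither call would report an inversion, which is the property the suggestion above is aiming for.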
> 
> 
>>>           device_lock                           # wait for device_unlock()
>>>
>>> The calltrace is as below:
>>>
>>> [  258.411464] INFO: task kworker/0:1:13 blocked for more than 120 seconds.
>>> [  258.418139]       Tainted: G         C O      5.1.0-rc1-ge2e3ca0 #1
>>> [  258.424379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  258.432172] kworker/0:1     D    0    13      2 0x00000028
>>> [  258.437640] Workqueue: events aer_recover_work_func
>>> [  258.442496] Call trace:
>>> [  258.444933]  __switch_to+0xb4/0x1b8
>>> [  258.448409]  __schedule+0x1ec/0x720
>>> [  258.451884]  schedule+0x38/0x90
>>> [  258.455012]  schedule_preempt_disabled+0x20/0x38
>>> [  258.459610]  __mutex_lock.isra.1+0x150/0x518
>>> [  258.463861]  __mutex_lock_slowpath+0x10/0x18
>>> [  258.468112]  mutex_lock+0x34/0x40
>>> [  258.471413]  report_resume+0x1c/0x78
>>> [  258.474973]  pci_walk_bus+0x58/0xb0
>>> [  258.478451]  pcie_do_recovery+0x18c/0x248
>>> [  258.482445]  aer_recover_work_func+0xe0/0x118
>>> [  258.486783]  process_one_work+0x1e4/0x468
>>> [  258.490776]  worker_thread+0x40/0x450
>>> [  258.494424]  kthread+0x128/0x130
>>> [  258.497639]  ret_from_fork+0x10/0x1c
>>> [  258.501329] INFO: task flr.sh:4534 blocked for more than 120 seconds.
>>> [  258.507742]       Tainted: G         C O      5.1.0-rc1-ge2e3ca0 #1
>>> [  258.513980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  258.521774] flr.sh          D    0  4534   4504 0x00000000
>>> [  258.527235] Call trace:
>>> [  258.529671]  __switch_to+0xb4/0x1b8
>>> [  258.533146]  __schedule+0x1ec/0x720
>>> [  258.536619]  schedule+0x38/0x90
>>> [  258.539749]  rwsem_down_write_failed+0x14c/0x210
>>> [  258.544347]  down_write+0x48/0x60
>>> [  258.547648]  pci_device_add+0x1a0/0x290
>>> [  258.551469]  pci_iov_add_virtfn+0x190/0x358
>>> [  258.555633]  sriov_enable+0x24c/0x480
>>> [  258.559279]  pci_enable_sriov+0x14/0x28
>>> [  258.563101]  hisi_zip_sriov_configure+0x64/0x100 [hisi_zip]
>>> [  258.568649]  sriov_numvfs_store+0xc4/0x190
>>> [  258.572728]  dev_attr_store+0x18/0x28
>>> [  258.576375]  sysfs_kf_write+0x3c/0x50
>>> [  258.580024]  kernfs_fop_write+0x114/0x1d8
>>> [  258.584018]  __vfs_write+0x18/0x38
>>> [  258.587404]  vfs_write+0xa4/0x1b0
>>> [  258.590705]  ksys_write+0x60/0xd8
>>> [  258.594007]  __arm64_sys_write+0x18/0x20
>>> [  258.597914]  el0_svc_common+0x5c/0x100
>>> [  258.601646]  el0_svc_handler+0x2c/0x80
>>> [  258.605381]  el0_svc+0x8/0xc
>>> [  379.243461] INFO: task kworker/0:1:13 blocked for more than 241 seconds.
>>> [  379.250134]       Tainted: G         C O      5.1.0-rc1-ge2e3ca0 #1
>>> [  379.256373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>
>>>
>>> Thank you,
>>> Jay
>>>
>>>
> 
> 
> .
> 


Thread overview: 8+ messages
2019-06-04  3:25 Bug report: AER driver deadlock Fangjian (Turing)
2019-06-04 14:34 ` Sinan Kaya
2019-06-04 23:47   ` Benjamin Herrenschmidt
2019-06-05  0:59     ` Sinan Kaya
2019-06-20  3:14     ` Fangjian (Turing) [this message]
2019-06-25 17:16 ` Bjorn Helgaas
2019-08-05 12:43   ` Fangjian (Turing)
2019-08-16  7:11     ` Jay Fang
