All of lore.kernel.org
 help / color / mirror / Atom feed
* Kernel panic while doing vfio-pci hot-plug/unplug test
@ 2019-10-16 13:36 Xiang Zheng
  2019-10-18 13:58 ` Bjorn Helgaas
  2020-02-22 16:55 ` Bjorn Helgaas
  0 siblings, 2 replies; 12+ messages in thread
From: Xiang Zheng @ 2019-10-16 13:36 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel, mingo, peterz, alex.williamson
  Cc: Wang Haibin, Guoheyi, yebiaoxiang

Hi all,

Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.
See the call stack below:

[66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
[66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
[66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
[66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
[66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
[66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
[66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
[66630.246266] Mem abort info:
[66630.249047]   ESR = 0x8600000d
[66630.252088]   Exception class = IABT (current EL), IL = 32 bits
[66630.257981]   SET = 0, FnV = 0
[66630.261022]   EA = 0, S1PTW = 0
[66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
[66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
[66630.277751] Internal error: Oops: 8600000d [#1] SMP
[66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
[66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
[66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
[66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
[66630.314042] pc : 0xffff802dac469000
[66630.317519] lr : __wake_up_common+0x90/0x1a8
[66630.321768] sp : ffff00027746bb00
[66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
[66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
[66630.335643] x25: 0000000000000000 x24: 0000000000000000
[66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
[66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
[66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
[66630.356793] x17: 0000000000000000 x16: 0000000000000000
[66630.362081] x15: 0000000000000000 x14: 0000000000000000
[66630.367368] x13: 0000000000000000 x12: 0000000000000000
[66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
[66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
[66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
[66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
[66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
[66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
[66630.404381] Call trace:
[66630.406818]  0xffff802dac469000
[66630.409945]  __wake_up_common_lock+0xa8/0x1a0
[66630.414283]  __wake_up+0x40/0x50
[66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
[66630.421752]  pci_try_reset_function+0x58/0x78
[66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
[66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
[66630.436158]  do_vfs_ioctl+0xc4/0x8c0
[66630.439718]  ksys_ioctl+0x8c/0xa0
[66630.443018]  __arm64_sys_ioctl+0x28/0x38
[66630.446925]  el0_svc_common+0x78/0x130
[66630.450657]  el0_svc_handler+0x38/0x78
[66630.454389]  el0_svc+0x8/0xc
[66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
[66630.463325] kernel fault(0x1) notification starting on CPU 41
[66630.469044] kernel fault(0x1) notification finished on CPU 41

The chance to reproduce this problem is very small. We had an initial analysis of this problem,
and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.

I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?


-- 

Thanks,
Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-16 13:36 Kernel panic while doing vfio-pci hot-plug/unplug test Xiang Zheng
@ 2019-10-18 13:58 ` Bjorn Helgaas
  2019-10-23  9:40   ` Xiang Zheng
  2020-02-22 16:55 ` Bjorn Helgaas
  1 sibling, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2019-10-18 13:58 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Matthew Wilcox

[+cc Matthew]

On Wed, Oct 16, 2019 at 09:36:23PM +0800, Xiang Zheng wrote:
> Hi all,
> 
> Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.
> See the call stack below:
> 
> [66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
> [66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
> [66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
> [66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
> [66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
> [66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
> [66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
> [66630.246266] Mem abort info:
> [66630.249047]   ESR = 0x8600000d
> [66630.252088]   Exception class = IABT (current EL), IL = 32 bits
> [66630.257981]   SET = 0, FnV = 0
> [66630.261022]   EA = 0, S1PTW = 0
> [66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
> [66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
> [66630.277751] Internal error: Oops: 8600000d [#1] SMP
> [66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
> [66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
> [66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
> [66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
> [66630.314042] pc : 0xffff802dac469000
> [66630.317519] lr : __wake_up_common+0x90/0x1a8
> [66630.321768] sp : ffff00027746bb00
> [66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
> [66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
> [66630.335643] x25: 0000000000000000 x24: 0000000000000000
> [66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
> [66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
> [66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
> [66630.356793] x17: 0000000000000000 x16: 0000000000000000
> [66630.362081] x15: 0000000000000000 x14: 0000000000000000
> [66630.367368] x13: 0000000000000000 x12: 0000000000000000
> [66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
> [66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
> [66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
> [66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
> [66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
> [66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
> [66630.404381] Call trace:
> [66630.406818]  0xffff802dac469000
> [66630.409945]  __wake_up_common_lock+0xa8/0x1a0
> [66630.414283]  __wake_up+0x40/0x50
> [66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
> [66630.421752]  pci_try_reset_function+0x58/0x78
> [66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
> [66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [66630.436158]  do_vfs_ioctl+0xc4/0x8c0
> [66630.439718]  ksys_ioctl+0x8c/0xa0
> [66630.443018]  __arm64_sys_ioctl+0x28/0x38
> [66630.446925]  el0_svc_common+0x78/0x130
> [66630.450657]  el0_svc_handler+0x38/0x78
> [66630.454389]  el0_svc+0x8/0xc
> [66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
> [66630.463325] kernel fault(0x1) notification starting on CPU 41
> [66630.469044] kernel fault(0x1) notification finished on CPU 41
> 
> The chance to reproduce this problem is very small. We had an initial analysis of this problem,
> and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.
> 
> I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
> concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?

I haven't heard of a problem there, but that doesn't mean there isn't
one.

The fact that pci_wait_cfg() uses __add_wait_queue() (not
add_wait_queue(), which does more locking) makes me a little
suspicious.  Most of the other callers of __add_wait_queue() acquire
the wait_queue lock themselves, but pci_wait_cfg() doesn't.

This was added by 7ea7e98fd8d0 ("PCI: Block on access to temporarily
unavailable pci device"), and the commit log suggests that the
pci_lock is sufficient.  All callers of pci_wait_cfg() do hold
pci_lock, and the "pci_cfg_wait" queue is private, but ...
pci_cfg_access_unlock() calls wake_up_all(&pci_cfg_wait) *without*
holding pci_lock.  That path leads to __wake_up_common_lock(), which
depends on wq_head->lock, which pci_wait_cfg() doesn't use.

pci_cfg_access_unlock() originally *did* hold pci_lock while calling
wake_up_all(), but I changed that with cdcb33f98244 ("PCI: Avoid
possible deadlock on pci_lock and p->pi_lock") without understanding
both sides of the wait_queue locking issue.

But I still don't understand enough to know whether this is actually
the problem or to propose a fix.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-18 13:58 ` Bjorn Helgaas
@ 2019-10-23  9:40   ` Xiang Zheng
  2019-10-23 15:15     ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Xiang Zheng @ 2019-10-23  9:40 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Matthew Wilcox

Hi Bjorn,

Thanks for your reply!

On 2019/10/18 21:58, Bjorn Helgaas wrote:
> [+cc Matthew]
> 
> On Wed, Oct 16, 2019 at 09:36:23PM +0800, Xiang Zheng wrote:
>> Hi all,
>>
>> Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.
>> See the call stack below:
>>
>> [66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
>> [66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
>> [66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
>> [66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
>> [66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
>> [66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
>> [66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
>> [66630.246266] Mem abort info:
>> [66630.249047]   ESR = 0x8600000d
>> [66630.252088]   Exception class = IABT (current EL), IL = 32 bits
>> [66630.257981]   SET = 0, FnV = 0
>> [66630.261022]   EA = 0, S1PTW = 0
>> [66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
>> [66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
>> [66630.277751] Internal error: Oops: 8600000d [#1] SMP
>> [66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
>> [66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
>> [66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
>> [66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
>> [66630.314042] pc : 0xffff802dac469000
>> [66630.317519] lr : __wake_up_common+0x90/0x1a8
>> [66630.321768] sp : ffff00027746bb00
>> [66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
>> [66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
>> [66630.335643] x25: 0000000000000000 x24: 0000000000000000
>> [66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
>> [66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
>> [66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
>> [66630.356793] x17: 0000000000000000 x16: 0000000000000000
>> [66630.362081] x15: 0000000000000000 x14: 0000000000000000
>> [66630.367368] x13: 0000000000000000 x12: 0000000000000000
>> [66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
>> [66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
>> [66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
>> [66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
>> [66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
>> [66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
>> [66630.404381] Call trace:
>> [66630.406818]  0xffff802dac469000
>> [66630.409945]  __wake_up_common_lock+0xa8/0x1a0
>> [66630.414283]  __wake_up+0x40/0x50
>> [66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
>> [66630.421752]  pci_try_reset_function+0x58/0x78
>> [66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
>> [66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
>> [66630.436158]  do_vfs_ioctl+0xc4/0x8c0
>> [66630.439718]  ksys_ioctl+0x8c/0xa0
>> [66630.443018]  __arm64_sys_ioctl+0x28/0x38
>> [66630.446925]  el0_svc_common+0x78/0x130
>> [66630.450657]  el0_svc_handler+0x38/0x78
>> [66630.454389]  el0_svc+0x8/0xc
>> [66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
>> [66630.463325] kernel fault(0x1) notification starting on CPU 41
>> [66630.469044] kernel fault(0x1) notification finished on CPU 41
>>
>> The chance to reproduce this problem is very small. We had an initial analysis of this problem,
>> and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.
>>
>> I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
>> concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?
> 
> I haven't heard of a problem there, but that doesn't mean there isn't
> one.
> 
> The fact that pci_wait_cfg() uses __add_wait_queue() (not
> add_wait_queue(), which does more locking) makes me a little
> suspicious.  Most of the other callers of __add_wait_queue() acquire
> the wait_queue lock themselves, but pci_wait_cfg() doesn't.
> 
> This was added by 7ea7e98fd8d0 ("PCI: Block on access to temporarily
> unavailable pci device"), and the commit log suggests that the
> pci_lock is sufficient.  All callers of pci_wait_cfg() do hold
> pci_lock, and the "pci_cfg_wait" queue is private, but ...
> pci_cfg_access_unlock() calls wake_up_all(&pci_cfg_wait) *without*
> holding pci_lock.  That path leads to __wake_up_common_lock(), which
> depends on wq_head->lock, which pci_wait_cfg() doesn't use.

Yes, we was also suspicious about this point and had a further analysis of this problem.
We found that the "pci_cfg_wait" queue was empty when the "curr->func" callback function
was called:

crash> p pci_cfg_wait.head
$2 = {
  next = 0xffff0000092755b8 <pci_cfg_wait+8>,
  prev = 0xffff0000092755b8 <pci_cfg_wait+8>
}
crash> p &(pci_cfg_wait.head)
$3 = (struct list_head *) 0xffff0000092755b8 <pci_cfg_wait+8>
crash>

The "ps" command also shows that there was no processes on "UN" state at that time.

According to the above two clues, we finally reached a conclusion: there must be two processes,
'A' was calling pci_wait_cfg() and 'B' was calling __wake_up_common(). And there is a very
small chance that 'A' called __remove_wait_queue() before 'B' called the "curr->func", after
'B' got the queue entry "curr". Since the queue entry was a local variable, it would be
invalid after pci_wait_cfg() returned and eventually we got an invalid value of "curr->func".

In order to verify this conclusion, we add a delay(e.g. 300ms) before calling "curr->func"
in the __wake_up_common() function. Then this problem can be easily reproduced.

> 
> pci_cfg_access_unlock() originally *did* hold pci_lock while calling
> wake_up_all(), but I changed that with cdcb33f98244 ("PCI: Avoid
> possible deadlock on pci_lock and p->pi_lock") without understanding
> both sides of the wait_queue locking issue.

Before your change was merged, any operations to the "pci_cfg_wait" was safe because they all
did hold pci_lock. So the pci_lock was sufficient.

> 
> But I still don't understand enough to know whether this is actually
> the problem or to propose a fix.

I think we need to fix it. A simple solution is to use add_wait_queue()/remove_wait_queue()
instead of __add_wait_queue()/__remove_wait_queue() and this works for me. What do you think?

> 
> Bjorn
> 
> .
> 

-- 

Thanks,
Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23  9:40   ` Xiang Zheng
@ 2019-10-23 15:15     ` Bjorn Helgaas
  2019-10-23 16:38       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2019-10-23 15:15 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Matthew Wilcox

On Wed, Oct 23, 2019 at 05:40:20PM +0800, Xiang Zheng wrote:
> Hi Bjorn,
> 
> Thanks for your reply!
> 
> On 2019/10/18 21:58, Bjorn Helgaas wrote:
> > [+cc Matthew]
> > 
> > On Wed, Oct 16, 2019 at 09:36:23PM +0800, Xiang Zheng wrote:
> >> Hi all,
> >>
> >> Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.
> >> See the call stack below:
> >>
> >> [66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
> >> [66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
> >> [66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
> >> [66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
> >> [66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
> >> [66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
> >> [66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
> >> [66630.246266] Mem abort info:
> >> [66630.249047]   ESR = 0x8600000d
> >> [66630.252088]   Exception class = IABT (current EL), IL = 32 bits
> >> [66630.257981]   SET = 0, FnV = 0
> >> [66630.261022]   EA = 0, S1PTW = 0
> >> [66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
> >> [66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
> >> [66630.277751] Internal error: Oops: 8600000d [#1] SMP
> >> [66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
> >> [66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
> >> [66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
> >> [66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
> >> [66630.314042] pc : 0xffff802dac469000
> >> [66630.317519] lr : __wake_up_common+0x90/0x1a8
> >> [66630.321768] sp : ffff00027746bb00
> >> [66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
> >> [66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
> >> [66630.335643] x25: 0000000000000000 x24: 0000000000000000
> >> [66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
> >> [66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
> >> [66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
> >> [66630.356793] x17: 0000000000000000 x16: 0000000000000000
> >> [66630.362081] x15: 0000000000000000 x14: 0000000000000000
> >> [66630.367368] x13: 0000000000000000 x12: 0000000000000000
> >> [66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
> >> [66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
> >> [66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
> >> [66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
> >> [66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
> >> [66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
> >> [66630.404381] Call trace:
> >> [66630.406818]  0xffff802dac469000
> >> [66630.409945]  __wake_up_common_lock+0xa8/0x1a0
> >> [66630.414283]  __wake_up+0x40/0x50
> >> [66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
> >> [66630.421752]  pci_try_reset_function+0x58/0x78
> >> [66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
> >> [66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> >> [66630.436158]  do_vfs_ioctl+0xc4/0x8c0
> >> [66630.439718]  ksys_ioctl+0x8c/0xa0
> >> [66630.443018]  __arm64_sys_ioctl+0x28/0x38
> >> [66630.446925]  el0_svc_common+0x78/0x130
> >> [66630.450657]  el0_svc_handler+0x38/0x78
> >> [66630.454389]  el0_svc+0x8/0xc
> >> [66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
> >> [66630.463325] kernel fault(0x1) notification starting on CPU 41
> >> [66630.469044] kernel fault(0x1) notification finished on CPU 41
> >>
> >> The chance to reproduce this problem is very small. We had an initial analysis of this problem,
> >> and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.
> >>
> >> I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
> >> concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?
> > 
> > I haven't heard of a problem there, but that doesn't mean there isn't
> > one.
> > 
> > The fact that pci_wait_cfg() uses __add_wait_queue() (not
> > add_wait_queue(), which does more locking) makes me a little
> > suspicious.  Most of the other callers of __add_wait_queue() acquire
> > the wait_queue lock themselves, but pci_wait_cfg() doesn't.
> > 
> > This was added by 7ea7e98fd8d0 ("PCI: Block on access to temporarily
> > unavailable pci device"), and the commit log suggests that the
> > pci_lock is sufficient.  All callers of pci_wait_cfg() do hold
> > pci_lock, and the "pci_cfg_wait" queue is private, but ...
> > pci_cfg_access_unlock() calls wake_up_all(&pci_cfg_wait) *without*
> > holding pci_lock.  That path leads to __wake_up_common_lock(), which
> > depends on wq_head->lock, which pci_wait_cfg() doesn't use.
> 
> Yes, we was also suspicious about this point and had a further
> analysis of this problem.  We found that the "pci_cfg_wait" queue
> was empty when the "curr->func" callback function was called:
> 
> crash> p pci_cfg_wait.head
> $2 = {
>   next = 0xffff0000092755b8 <pci_cfg_wait+8>,
>   prev = 0xffff0000092755b8 <pci_cfg_wait+8>
> }
> crash> p &(pci_cfg_wait.head)
> $3 = (struct list_head *) 0xffff0000092755b8 <pci_cfg_wait+8>
> crash>
> 
> The "ps" command also shows that there was no processes on "UN"
> state at that time.
> 
> According to the above two clues, we finally reached a conclusion:
> there must be two processes, 'A' was calling pci_wait_cfg() and 'B'
> was calling __wake_up_common(). And there is a very small chance
> that 'A' called __remove_wait_queue() before 'B' called the
> "curr->func", after 'B' got the queue entry "curr". Since the queue
> entry was a local variable, it would be invalid after pci_wait_cfg()
> returned and eventually we got an invalid value of "curr->func".
> 
> In order to verify this conclusion, we add a delay(e.g. 300ms)
> before calling "curr->func" in the __wake_up_common() function. Then
> this problem can be easily reproduced.
> 
> > pci_cfg_access_unlock() originally *did* hold pci_lock while
> > calling wake_up_all(), but I changed that with cdcb33f98244 ("PCI:
> > Avoid possible deadlock on pci_lock and p->pi_lock") without
> > understanding both sides of the wait_queue locking issue.
> 
> Before your change was merged, any operations to the "pci_cfg_wait"
> was safe because they all did hold pci_lock. So the pci_lock was
> sufficient.
> 
> > But I still don't understand enough to know whether this is
> > actually the problem or to propose a fix.
> 
> I think we need to fix it. A simple solution is to use
> add_wait_queue()/remove_wait_queue() instead of
> __add_wait_queue()/__remove_wait_queue() and this works for me. What
> do you think?

I don't like being one of a handful of callers of __add_wait_queue(),
so I like that solution from that point of view.

The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
device") commit log suggests that using __add_wait_queue() is a
significant optimization, but I don't know how important that is in
practical terms.  Config accesses are never a performance path anyway,
so I'd be inclined to use add_wait_queue() unless somebody complains.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23 15:15     ` Bjorn Helgaas
@ 2019-10-23 16:38       ` Matthew Wilcox
  2019-10-23 20:46         ` Bjorn Helgaas
  2019-10-24  8:07         ` Thomas Gleixner
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2019-10-23 16:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Xiang Zheng, linux-pci, linux-kernel, mingo, peterz,
	alex.williamson, Wang Haibin, Guoheyi, yebiaoxiang

On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> I don't like being one of a handful of callers of __add_wait_queue(),
> so I like that solution from that point of view.
> 
> The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> device") commit log suggests that using __add_wait_queue() is a
> significant optimization, but I don't know how important that is in
> practical terms.  Config accesses are never a performance path anyway,
> so I'd be inclined to use add_wait_queue() unless somebody complains.

Wow, this has got pretty messy in the umpteen years since I last looked
at it.

Some problems I see:

1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:

    x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
    
    All x86 PCI configuration space accessors have either their own
    serialization or can operate completely lockless (ECAM).
    
    Disable the global lock in the generic PCI configuration space accessors.

The concept behind this patch is broken.  We still need to lock out
config space accesses when devices are undergoing D-state transitions.
I would suggest that for the contention case that tglx is concerned about,
we should have a pci_bus_read_config_unlocked_##size set of functions
which can be used for devices we know never go into D states.


2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
   exports pci_lock.  I think this is a mistake; at best there should be
   accessors for the pci_lock.  But I don't understand why it needs to
   exclude PCI config space changes throughout pci_check_and_set_intx_mask().
   Why can it not do:

-	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
+	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);

3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
   changed pci_lock to be a raw spinlock.  The patch description
   essentially says "We need it for RT" which isn't terribly helpful.

4. Finally, getting back to the original problem report here, I wouldn't
   write this code this way today.  There's no reason not to use the
   regular add_wait_queue etc.  BUT!  Why are we using this custom locking
   mechanism?  It pretty much screams to me of an rwsem (reads/writes
   of config space take it for read; changes to config space accesses
   (disabling and changing of accessor methods) take it for write.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23 16:38       ` Matthew Wilcox
@ 2019-10-23 20:46         ` Bjorn Helgaas
  2019-10-24  3:26           ` Xiang Zheng
  2019-10-24 11:22           ` Matthew Wilcox
  2019-10-24  8:07         ` Thomas Gleixner
  1 sibling, 2 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2019-10-23 20:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Xiang Zheng, linux-pci, linux-kernel, mingo, peterz,
	alex.williamson, Wang Haibin, Guoheyi, yebiaoxiang,
	Thomas Gleixner, Rafael J. Wysocki

[+cc Thomas, Rafael, beginning of thread at
https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@huawei.com]

On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
> On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> > I don't like being one of a handful of callers of __add_wait_queue(),
> > so I like that solution from that point of view.
> > 
> > The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> > device") commit log suggests that using __add_wait_queue() is a
> > significant optimization, but I don't know how important that is in
> > practical terms.  Config accesses are never a performance path anyway,
> > so I'd be inclined to use add_wait_queue() unless somebody complains.
> 
> Wow, this has got pretty messy in the umpteen years since I last looked
> at it.
> 
> Some problems I see:
> 
> 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> 
>     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
>     
>     All x86 PCI configuration space accessors have either their own
>     serialization or can operate completely lockless (ECAM).
>     
>     Disable the global lock in the generic PCI configuration space accessors.
> 
> The concept behind this patch is broken.  We still need to lock out
> config space accesses when devices are undergoing D-state transitions.
> I would suggest that for the contention case that tglx is concerned about,
> we should have a pci_bus_read_config_unlocked_##size set of functions
> which can be used for devices we know never go into D states.

Host bridges that can't do config accesses atomically, e.g., they have
something like the 0xcf8/0xcfc addr/data ports, need serialization.
CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
think that part makes sense regardless of whether devices can enter D
states.

We *should* prevent config accesses during D-state transitions (per
PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.
pci_raw_set_power_state() contains delays, but that only prevents
accesses from the caller, not from other threads or from userspace.
I suppose we should also prevent accesses by other threads during
transitions done by ACPI, e.g., _PS0, _PS1, _PS2, _PS3.  AFAICT we
don't do any of that.

It looks like pci_lock currently:

  - Serializes all kernel config accesses system-wide in
    pci_bus_read_config_##size() (unless CONFIG_PCI_LOCKLESS_CONFIG=y).

  - Serializes all userspace config accesses system-wide in
    pci_user_read_config_##size() (this seems unnecessary when
    CONFIG_PCI_LOCKLESS_CONFIG=y).

  - Serializes userspace config accesses with resets of the device via
    the dev->block_cfg_access bit and waitqueue mechanism.

  - Serializes kernel and userspace config accesses with bus->ops
    changes in pci_bus_set_ops() (except that we don't serialize
    kernel config accesses if CONFIG_PCI_LOCKLESS_CONFIG=y, which is
    probably a problem).  But pci_bus_set_ops() is hardly used and I'm
    not sure it's worth keeping it.

> 2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
>    exports pci_lock.  I think this is a mistake; at best there should be
>    accessors for the pci_lock.  But I don't understand why it needs to
>    exclude PCI config space changes throughout pci_check_and_set_intx_mask().
>    Why can it not do:
> 
> -	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
> +	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> 
> 3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
>    changed pci_lock to be a raw spinlock.  The patch description
>    essentially says "We need it for RT" which isn't terribly helpful.
> 
> 4. Finally, getting back to the original problem report here, I wouldn't
>    write this code this way today.  There's no reason not to use the
>    regular add_wait_queue etc.  BUT!  Why are we using this custom locking
>    mechanism?  It pretty much screams to me of an rwsem (reads/writes
>    of config space take it for read; changes to config space accesses
>    (disabling and changing of accessor methods) take it for write.

So maybe the immediate thing is to just convert to add_wait_queue()?

There's a lot we could clean up here, but I think it would take a fair
bit of untangling before we actually solve this panic.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23 20:46         ` Bjorn Helgaas
@ 2019-10-24  3:26           ` Xiang Zheng
  2019-10-24 11:22           ` Matthew Wilcox
  1 sibling, 0 replies; 12+ messages in thread
From: Xiang Zheng @ 2019-10-24  3:26 UTC (permalink / raw)
  To: Bjorn Helgaas, Matthew Wilcox
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Thomas Gleixner,
	Rafael J. Wysocki


On 2019/10/24 4:46, Bjorn Helgaas wrote:
> [+cc Thomas, Rafael, beginning of thread at
> https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@huawei.com]
> 
> On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
>> On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
>>> I don't like being one of a handful of callers of __add_wait_queue(),
>>> so I like that solution from that point of view.
>>>
>>> The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
>>> device") commit log suggests that using __add_wait_queue() is a
>>> significant optimization, but I don't know how important that is in
>>> practical terms.  Config accesses are never a performance path anyway,
>>> so I'd be inclined to use add_wait_queue() unless somebody complains.
>>
>> Wow, this has got pretty messy in the umpteen years since I last looked
>> at it.
>>
>> Some problems I see:
>>
>> 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
>>
>>     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
>>     
>>     All x86 PCI configuration space accessors have either their own
>>     serialization or can operate completely lockless (ECAM).
>>     
>>     Disable the global lock in the generic PCI configuration space accessors.
>>
>> The concept behind this patch is broken.  We still need to lock out
>> config space accesses when devices are undergoing D-state transitions.
>> I would suggest that for the contention case that tglx is concerned about,
>> we should have a pci_bus_read_config_unlocked_##size set of functions
>> which can be used for devices we know never go into D states.
> 
> Host bridges that can't do config accesses atomically, e.g., they have
> something like the 0xcf8/0xcfc addr/data ports, need serialization.
> CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
> think that part makes sense regardless of whether devices can enter D
> states.
> 
> We *should* prevent config accesses during D-state transitions (per
> PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.
> pci_raw_set_power_state() contains delays, but that only prevents
> accesses from the caller, not from other threads or from userspace.
> I suppose we should also prevent accesses by other threads during
> transitions done by ACPI, e.g., _PS0, _PS1, _PS2, _PS3.  AFAICT we
> don't do any of that.
> 
> It looks like pci_lock currently:
> 
>   - Serializes all kernel config accesses system-wide in
>     pci_bus_read_config_##size() (unless CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes all userspace config accesses system-wide in
>     pci_user_read_config_##size() (this seems unnecessary when
>     CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes userspace config accesses with resets of the device via
>     the dev->block_cfg_access bit and waitqueue mechanism.
> 
>   - Serializes kernel and userspace config accesses with bus->ops
>     changes in pci_bus_set_ops() (except that we don't serialize
>     kernel config accesses if CONFIG_PCI_LOCKLESS_CONFIG=y, which is
>     probably a problem).  But pci_bus_set_ops() is hardly used and I'm
>     not sure it's worth keeping it.
> 
>> 2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
>>    exports pci_lock.  I think this is a mistake; at best there should be
>>    accessors for the pci_lock.  But I don't understand why it needs to
>>    exclude PCI config space changes throughout pci_check_and_set_intx_mask().
>>    Why can it not do:
>>
>> -	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
>> +	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
>>
>> 3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
>>    changed pci_lock to be a raw spinlock.  The patch description
>>    essentially says "We need it for RT" which isn't terribly helpful.
>>
>> 4. Finally, getting back to the original problem report here, I wouldn't
>>    write this code this way today.  There's no reason not to use the
>>    regular add_wait_queue etc.  BUT!  Why are we using this custom locking
>>    mechanism?  It pretty much screams to me of an rwsem (reads/writes
>>    of config space take it for read; changes to config space accesses
>>    (disabling and changing of accessor methods) take it for write.
> 
> So maybe the immediate thing is to just convert to add_wait_queue()?

Hmmm... May I push a patch? :)

> 
> There's a lot we could clean up here, but I think it would take a fair
> bit of untangling before we actually solve this panic.
> 
> Bjorn
> 
> .
> 

-- 

Thanks,
Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23 16:38       ` Matthew Wilcox
  2019-10-23 20:46         ` Bjorn Helgaas
@ 2019-10-24  8:07         ` Thomas Gleixner
  1 sibling, 0 replies; 12+ messages in thread
From: Thomas Gleixner @ 2019-10-24  8:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bjorn Helgaas, Xiang Zheng, linux-pci, linux-kernel, mingo,
	peterz, alex.williamson, Wang Haibin, Guoheyi, yebiaoxiang

On Wed, 23 Oct 2019, Matthew Wilcox wrote:
> Some problems I see:
> 
> 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> 
>     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
>     
>     All x86 PCI configuration space accessors have either their own
>     serialization or can operate completely lockless (ECAM).
>     
>     Disable the global lock in the generic PCI configuration space accessors.
> 
> The concept behind this patch is broken.  We still need to lock out
> config space accesses when devices are undergoing D-state transitions.
> I would suggest that for the contention case that tglx is concerned about,
> we should have a pci_bus_read_config_unlocked_##size set of functions
> which can be used for devices we know never go into D states.

I don't think that it's broken. A D-state transition has to make sure that
the rest of stuff which might be touching the config space is
quiescent. pci_lock cannot provide that protection
 
> 
> 2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
>    exports pci_lock.  I think this is a mistake; at best there should be
>    accessors for the pci_lock.  But I don't understand why it needs to
>    exclude PCI config space changes throughout pci_check_and_set_intx_mask().
>    Why can it not do:
> 
> -	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
> +	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);

Hmm. Need to look closer on that.
 
> 3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
>    changed pci_lock to be a raw spinlock.  The patch description
>    essentially says "We need it for RT" which isn't terribly helpful.

Yes, I could slap myself for writing such a useless changelog. The reason
why it is a raw spinlock is that config space access happens from very low
level contexts, which require to have interrupts disabled even on RT,
e.g. from the guts of the interrupt code.

> 4. Finally, getting back to the original problem report here, I wouldn't
>    write this code this way today.  There's no reason not to use the
>    regular add_wait_queue etc.  BUT!  Why are we using this custom locking
>    mechanism?  It pretty much screams to me of an rwsem (reads/writes
>    of config space take it for read; changes to config space accesses
>    (disabling and changing of accessor methods) take it for write.

You cannot use a RWSEM as low level interrupt code needs to access the
config space with interrupts disabled and raw spinlocks held, e.g. to
fiddle with the interrupt and MSI stuff.

Thanks,

	tglx


    

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-23 20:46         ` Bjorn Helgaas
  2019-10-24  3:26           ` Xiang Zheng
@ 2019-10-24 11:22           ` Matthew Wilcox
  2019-10-24 22:32             ` Bjorn Helgaas
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2019-10-24 11:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Xiang Zheng, linux-pci, linux-kernel, mingo, peterz,
	alex.williamson, Wang Haibin, Guoheyi, yebiaoxiang,
	Thomas Gleixner, Rafael J. Wysocki

On Wed, Oct 23, 2019 at 03:46:38PM -0500, Bjorn Helgaas wrote:
> [+cc Thomas, Rafael, beginning of thread at
> https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@huawei.com]
> 
> On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
> > On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> > > I don't like being one of a handful of callers of __add_wait_queue(),
> > > so I like that solution from that point of view.
> > > 
> > > The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> > > device") commit log suggests that using __add_wait_queue() is a
> > > significant optimization, but I don't know how important that is in
> > > practical terms.  Config accesses are never a performance path anyway,
> > > so I'd be inclined to use add_wait_queue() unless somebody complains.
> > 
> > Wow, this has got pretty messy in the umpteen years since I last looked
> > at it.
> > 
> > Some problems I see:
> > 
> > 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> > 
> >     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
> >     
> >     All x86 PCI configuration space accessors have either their own
> >     serialization or can operate completely lockless (ECAM).
> >     
> >     Disable the global lock in the generic PCI configuration space accessors.
> > 
> > The concept behind this patch is broken.  We still need to lock out
> > config space accesses when devices are undergoing D-state transitions.
> > I would suggest that for the contention case that tglx is concerned about,
> > we should have a pci_bus_read_config_unlocked_##size set of functions
> > which can be used for devices we know never go into D states.
> 
> Host bridges that can't do config accesses atomically, e.g., they have
> something like the 0xcf8/0xcfc addr/data ports, need serialization.
> CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
> think that part makes sense regardless of whether devices can enter D
> states.

I disagree.  If a device is in D state, we need to block the access.
Maybe there needs to be a different mechanism for doing it that's not
a machine-wide lock, but it needs to happen.

> We *should* prevent config accesses during D-state transitions (per
> PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.

It used to set block_ucfg_access.  Maybe that's been lost; I see
there are still calls to pci_dev_lock() in pci_reset_function(),
for example.

> pci_raw_set_power_state() contains delays, but that only prevents
> accesses from the caller, not from other threads or from userspace.
> I suppose we should also prevent accesses by other threads during
> transitions done by ACPI, e.g., _PS0, _PS1, _PS2, _PS3.  AFAICT we
> don't do any of that.
> 
> It looks like pci_lock currently:
> 
>   - Serializes all kernel config accesses system-wide in
>     pci_bus_read_config_##size() (unless CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes all userspace config accesses system-wide in
>     pci_user_read_config_##size() (this seems unnecessary when
>     CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes userspace config accesses with resets of the device via
>     the dev->block_cfg_access bit and waitqueue mechanism.
> 
>   - Serializes kernel and userspace config accesses with bus->ops
>     changes in pci_bus_set_ops() (except that we don't serialize
>     kernel config accesses if CONFIG_PCI_LOCKLESS_CONFIG=y, which is
>     probably a problem).  But pci_bus_set_ops() is hardly used and I'm
>     not sure it's worth keeping it.
> 
> > 2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
> >    exports pci_lock.  I think this is a mistake; at best there should be
> >    accessors for the pci_lock.  But I don't understand why it needs to
> >    exclude PCI config space changes throughout pci_check_and_set_intx_mask().
> >    Why can it not do:
> > 
> > -	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
> > +	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> > 
> > 3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
> >    changed pci_lock to be a raw spinlock.  The patch description
> >    essentially says "We need it for RT" which isn't terribly helpful.
> > 
> > 4. Finally, getting back to the original problem report here, I wouldn't
> >    write this code this way today.  There's no reason not to use the
> >    regular add_wait_queue etc.  BUT!  Why are we using this custom locking
> >    mechanism?  It pretty much screams to me of an rwsem (reads/writes
> >    of config space take it for read; changes to config space accesses
> >    (disabling and changing of accessor methods) take it for write.
> 
> So maybe the immediate thing is to just convert to add_wait_queue()?

Isn't that going to run foul of the lock inversion you fixed in
cdcb33f9824429a926b971bf041a6cec238f91ff ?

> There's a lot we could clean up here, but I think it would take a fair
> bit of untangling before we actually solve this panic.

Yes, the mess has spread over many years ;-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-24 11:22           ` Matthew Wilcox
@ 2019-10-24 22:32             ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2019-10-24 22:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Xiang Zheng, linux-pci, linux-kernel, mingo, peterz,
	alex.williamson, Wang Haibin, Guoheyi, yebiaoxiang,
	Thomas Gleixner, Rafael J. Wysocki

On Thu, Oct 24, 2019 at 04:22:32AM -0700, Matthew Wilcox wrote:
> On Wed, Oct 23, 2019 at 03:46:38PM -0500, Bjorn Helgaas wrote:
> > [+cc Thomas, Rafael, beginning of thread at
> > https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@huawei.com]
> > 
> > On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
> > > On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> > > > I don't like being one of a handful of callers of __add_wait_queue(),
> > > > so I like that solution from that point of view.
> > > > 
> > > > The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> > > > device") commit log suggests that using __add_wait_queue() is a
> > > > significant optimization, but I don't know how important that is in
> > > > practical terms.  Config accesses are never a performance path anyway,
> > > > so I'd be inclined to use add_wait_queue() unless somebody complains.
> > > 
> > > Wow, this has got pretty messy in the umpteen years since I last looked
> > > at it.
> > > 
> > > Some problems I see:
> > > 
> > > 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> > > 
> > >     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
> > >     
> > >     All x86 PCI configuration space accessors have either their own
> > >     serialization or can operate completely lockless (ECAM).
> > >     
> > >     Disable the global lock in the generic PCI configuration space accessors.
> > > 
> > > The concept behind this patch is broken.  We still need to lock out
> > > config space accesses when devices are undergoing D-state transitions.
> > > I would suggest that for the contention case that tglx is concerned about,
> > > we should have a pci_bus_read_config_unlocked_##size set of functions
> > > which can be used for devices we know never go into D states.
> > 
> > Host bridges that can't do config accesses atomically, e.g., they have
> > something like the 0xcf8/0xcfc addr/data ports, need serialization.
> > CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
> > think that part makes sense regardless of whether devices can enter D
> > states.
> 
> I disagree.  If a device is in D state, we need to block the access.
> Maybe there needs to be a different mechanism for doing it that's not
> a machine-wide lock, but it needs to happen.

I'm missing something here.  If we're using 0xcf8/0xcfc, we need to
serialize so the address in 0xcf8 is preserved until we do the load or
store to 0xcfc, and x86 does that with pci_config_lock.  If we're
using ECAM, we don't need that serialization.

We might need serialization for some other reason, but not because of
that host bridge config access mechanism.

Config accesses are supposed to work fine in all D states except
D3cold, as long as any upstream bridges are in D0.  At least, I think
that's what the spec says.

> > We *should* prevent config accesses during D-state transitions (per
> > PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.
> 
> It used to set block_ucfg_access.  Maybe that's been lost; I see
> there are still calls to pci_dev_lock() in pci_reset_function(),
> for example.

You have an incredible memory!  pci_ucfg_wait was renamed to
pci_cfg_wait in 2011 with fb51ccbf217c ("PCI: Rework config space
blocking services").

But even then, I don't see a connection with D-state transitions,
e.g., there's nothing in pci_raw_set_power_state() that calls
pci_dev_lock() or otherwise sets block_cfg_wait.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2019-10-16 13:36 Kernel panic while doing vfio-pci hot-plug/unplug test Xiang Zheng
  2019-10-18 13:58 ` Bjorn Helgaas
@ 2020-02-22 16:55 ` Bjorn Helgaas
  2020-02-24  1:29   ` Xiang Zheng
  1 sibling, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2020-02-22 16:55 UTC (permalink / raw)
  To: Xiang Zheng
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Matthew Wilcox,
	Thomas Gleixner

[+cc Matthew, Thomas]

On Wed, Oct 16, 2019 at 09:36:23PM +0800, Xiang Zheng wrote:
> Hi all,
> 
> Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.

Did we ever do anything about this panic?  I don't see any resolution
in the email thread.

> See the call stack below:
> 
> [66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
> [66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
> [66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
> [66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
> [66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
> [66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
> [66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
> [66630.246266] Mem abort info:
> [66630.249047]   ESR = 0x8600000d
> [66630.252088]   Exception class = IABT (current EL), IL = 32 bits
> [66630.257981]   SET = 0, FnV = 0
> [66630.261022]   EA = 0, S1PTW = 0
> [66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
> [66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
> [66630.277751] Internal error: Oops: 8600000d [#1] SMP
> [66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
> [66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
> [66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
> [66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
> [66630.314042] pc : 0xffff802dac469000
> [66630.317519] lr : __wake_up_common+0x90/0x1a8
> [66630.321768] sp : ffff00027746bb00
> [66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
> [66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
> [66630.335643] x25: 0000000000000000 x24: 0000000000000000
> [66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
> [66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
> [66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
> [66630.356793] x17: 0000000000000000 x16: 0000000000000000
> [66630.362081] x15: 0000000000000000 x14: 0000000000000000
> [66630.367368] x13: 0000000000000000 x12: 0000000000000000
> [66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
> [66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
> [66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
> [66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
> [66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
> [66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
> [66630.404381] Call trace:
> [66630.406818]  0xffff802dac469000
> [66630.409945]  __wake_up_common_lock+0xa8/0x1a0
> [66630.414283]  __wake_up+0x40/0x50
> [66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
> [66630.421752]  pci_try_reset_function+0x58/0x78
> [66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
> [66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [66630.436158]  do_vfs_ioctl+0xc4/0x8c0
> [66630.439718]  ksys_ioctl+0x8c/0xa0
> [66630.443018]  __arm64_sys_ioctl+0x28/0x38
> [66630.446925]  el0_svc_common+0x78/0x130
> [66630.450657]  el0_svc_handler+0x38/0x78
> [66630.454389]  el0_svc+0x8/0xc
> [66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
> [66630.463325] kernel fault(0x1) notification starting on CPU 41
> [66630.469044] kernel fault(0x1) notification finished on CPU 41
> 
> The chance to reproduce this problem is very small. We had an initial analysis of this problem,
> and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.
> 
> I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
> concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?
> 
> 
> -- 
> 
> Thanks,
> Xiang
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel panic while doing vfio-pci hot-plug/unplug test
  2020-02-22 16:55 ` Bjorn Helgaas
@ 2020-02-24  1:29   ` Xiang Zheng
  0 siblings, 0 replies; 12+ messages in thread
From: Xiang Zheng @ 2020-02-24  1:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, mingo, peterz, alex.williamson,
	Wang Haibin, Guoheyi, yebiaoxiang, Matthew Wilcox,
	Thomas Gleixner

Hi Bjorn,

On 2020/2/23 0:55, Bjorn Helgaas wrote:
> [+cc Matthew, Thomas]
> 
> On Wed, Oct 16, 2019 at 09:36:23PM +0800, Xiang Zheng wrote:
>> Hi all,
>>
>> Recently I encountered a kernel panic while doing vfio-pci hot-plug/unplug test repeatly on my Arm-KVM virtual machines.
> 
> Did we ever do anything about this panic?  I don't see any resolution
> in the email thread.
> 

I had sent a v3 patch two months ago and wait for more substantive comments to address this problem.
Maybe you missed this email.:)

v3: https://lkml.org/lkml/2019/12/10/36
v2: https://lkml.org/lkml/2019/11/18/977

>> See the call stack below:
>>
>> [66628.697280] vfio-pci 0000:06:03.5: enabling device (0000 -> 0002)
>> [66628.809290] vfio-pci 0000:06:03.1: enabling device (0000 -> 0002)
>> [66628.921283] vfio-pci 0000:06:02.7: enabling device (0000 -> 0002)
>> [66629.029280] vfio-pci 0000:06:03.6: enabling device (0000 -> 0002)
>> [66629.137338] vfio-pci 0000:06:03.2: enabling device (0000 -> 0002)
>> [66629.249285] vfio-pci 0000:06:03.7: enabling device (0000 -> 0002)
>> [66630.237261] Unable to handle kernel read from unreadable memory at virtual address ffff802dac469000
>> [66630.246266] Mem abort info:
>> [66630.249047]   ESR = 0x8600000d
>> [66630.252088]   Exception class = IABT (current EL), IL = 32 bits
>> [66630.257981]   SET = 0, FnV = 0
>> [66630.261022]   EA = 0, S1PTW = 0
>> [66630.264150] swapper pgtable: 4k pages, 48-bit VAs, pgdp = 00000000fb16886e
>> [66630.270992] [ffff802dac469000] pgd=0000203fffff6803, pud=00e8002d80000f11
>> [66630.277751] Internal error: Oops: 8600000d [#1] SMP
>> [66630.282606] Process qemu-kvm (pid: 37201, stack limit = 0x00000000d8f19858)
>> [66630.289537] CPU: 41 PID: 37201 Comm: qemu-kvm Kdump: loaded Tainted: G           OE     4.19.36-vhulk1907.1.0.h453.eulerosv2r8.aarch64 #1
>> [66630.301822] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 0.88 07/24/2019
>> [66630.309270] pstate: 80400089 (Nzcv daIf +PAN -UAO)
>> [66630.314042] pc : 0xffff802dac469000
>> [66630.317519] lr : __wake_up_common+0x90/0x1a8
>> [66630.321768] sp : ffff00027746bb00
>> [66630.325067] x29: ffff00027746bb00 x28: 0000000000000000
>> [66630.330355] x27: 0000000000000000 x26: ffff0000092755b8
>> [66630.335643] x25: 0000000000000000 x24: 0000000000000000
>> [66630.340930] x23: 0000000000000003 x22: ffff00027746bbc0
>> [66630.346219] x21: 000000000954c000 x20: ffff0001f542bc6c
>> [66630.351506] x19: ffff0001f542bb90 x18: 0000000000000000
>> [66630.356793] x17: 0000000000000000 x16: 0000000000000000
>> [66630.362081] x15: 0000000000000000 x14: 0000000000000000
>> [66630.367368] x13: 0000000000000000 x12: 0000000000000000
>> [66630.372655] x11: 0000000000000000 x10: 0000000000000bb0
>> [66630.377942] x9 : ffff00027746ba50 x8 : ffff80367ff6ca10
>> [66630.383229] x7 : ffff802e20d59200 x6 : 000000000000003f
>> [66630.388517] x5 : ffff00027746bbc0 x4 : ffff802dac469000
>> [66630.393806] x3 : 0000000000000000 x2 : 0000000000000000
>> [66630.399093] x1 : 0000000000000003 x0 : ffff0001f542bb90
>> [66630.404381] Call trace:
>> [66630.406818]  0xffff802dac469000
>> [66630.409945]  __wake_up_common_lock+0xa8/0x1a0
>> [66630.414283]  __wake_up+0x40/0x50
>> [66630.417499]  pci_cfg_access_unlock+0x9c/0xd0
>> [66630.421752]  pci_try_reset_function+0x58/0x78
>> [66630.426095]  vfio_pci_ioctl+0x478/0xdb8 [vfio_pci]
>> [66630.430870]  vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
>> [66630.436158]  do_vfs_ioctl+0xc4/0x8c0
>> [66630.439718]  ksys_ioctl+0x8c/0xa0
>> [66630.443018]  __arm64_sys_ioctl+0x28/0x38
>> [66630.446925]  el0_svc_common+0x78/0x130
>> [66630.450657]  el0_svc_handler+0x38/0x78
>> [66630.454389]  el0_svc+0x8/0xc
>> [66630.457260] Code: 00000000 00000000 00000000 00000000 (ac46d000)
>> [66630.463325] kernel fault(0x1) notification starting on CPU 41
>> [66630.469044] kernel fault(0x1) notification finished on CPU 41
>>
>> The chance to reproduce this problem is very small. We had an initial analysis of this problem,
>> and found it was caused by the illegal value of the 'curr->func' in the __wake_up_common() function.
>>
>> I cannot image how 'curr->func' can be wrote to 0xffff802dac469000. Is there any problem about
>> concurrent competition between the pci_wait_cfg() function and the wake_up_all() function?
>>
>>
>> -- 
>>
>> Thanks,
>> Xiang
>>
> 
> .
> 

-- 

Thanks,
Xiang


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-02-24  1:29 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-16 13:36 Kernel panic while doing vfio-pci hot-plug/unplug test Xiang Zheng
2019-10-18 13:58 ` Bjorn Helgaas
2019-10-23  9:40   ` Xiang Zheng
2019-10-23 15:15     ` Bjorn Helgaas
2019-10-23 16:38       ` Matthew Wilcox
2019-10-23 20:46         ` Bjorn Helgaas
2019-10-24  3:26           ` Xiang Zheng
2019-10-24 11:22           ` Matthew Wilcox
2019-10-24 22:32             ` Bjorn Helgaas
2019-10-24  8:07         ` Thomas Gleixner
2020-02-22 16:55 ` Bjorn Helgaas
2020-02-24  1:29   ` Xiang Zheng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.