All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices
@ 2019-09-27 12:45 Frederic Barrat
  2019-10-15  5:42 ` [EXTERNAL] " Sam Bobroff
  0 siblings, 1 reply; 4+ messages in thread
From: Frederic Barrat @ 2019-09-27 12:45 UTC (permalink / raw)
  To: sbobroff, linuxppc-dev; +Cc: clombard, andrew.donnellan

Recent cleanup in the way EEH support is added to a device causes a
kernel oops when the cxl driver probes a device and creates virtual
devices discovered on the FPGA:

    BUG: Kernel NULL pointer dereference at 0x000000a0
    Faulting instruction address: 0xc000000000048070
    Oops: Kernel access of bad area, sig: 7 [#1]
    ...
    NIP [c000000000048070] eeh_add_device_late.part.9+0x50/0x1e0
    LR [c00000000004805c] eeh_add_device_late.part.9+0x3c/0x1e0
    Call Trace:
    [c000200e43983900] [c00000000079e250] _dev_info+0x5c/0x6c (unreliable)
    [c000200e43983980] [c0000000000d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
    [c000200e439839f0] [c0000000000606d0] pcibios_bus_add_device+0x40/0x60
    [c000200e43983a10] [c0000000006aa3a0] pci_bus_add_device+0x30/0x100
    [c000200e43983a80] [c0000000006aa4d4] pci_bus_add_devices+0x64/0xd0
    [c000200e43983ac0] [c00800001c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
    [c000200e43983b00] [c00800001c4242ac] cxl_probe+0x504/0x5b0 [cxl]
    [c000200e43983bb0] [c0000000006bba1c] local_pci_probe+0x6c/0x110
    [c000200e43983c30] [c000000000159278] work_for_cpu_fn+0x38/0x60

The root cause is that those cxl virtual devices don't have a
representation in the device tree and therefore no associated pci_dn
structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
we oops.

We never had explicit support for EEH for those virtual
devices. Instead, EEH events are reported to the (real) pci device and
handled by the cxl driver. Which can then forward to the virtual
devices and handle dependencies. The fact that we try adding EEH
support for the virtual devices is new and a side-effect of the recent
cleanup.

This patch fixes it by skipping adding EEH support on powernv for
devices which don't have a pci_dn structure.

Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---

Sending as an RFC, as I'm afraid of hiding potential issues and would
be interested in comments. The powernv eeh code expects a struct
pci_dn, so the fix seems safe. I'm wondering if there could be cases
(other than capi virtual devices) where we'd want to blow up and fix
instead of going undetected with this patch.


 arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6bc24a47e9ef..6f300ab7f0e9 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);
 
-	if (eeh_has_flag(EEH_FORCE_DISABLED))
+	if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
 		return;
 
 	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [EXTERNAL] [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices
  2019-09-27 12:45 [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices Frederic Barrat
@ 2019-10-15  5:42 ` Sam Bobroff
  2019-10-15 19:41   ` Frederic Barrat
  0 siblings, 1 reply; 4+ messages in thread
From: Sam Bobroff @ 2019-10-15  5:42 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: clombard, linuxppc-dev, andrew.donnellan

[-- Attachment #1: Type: text/plain, Size: 3639 bytes --]

On Fri, Sep 27, 2019 at 02:45:10PM +0200, Frederic Barrat wrote:
> Recent cleanup in the way EEH support is added to a device causes a
> kernel oops when the cxl driver probes a device and creates virtual
> devices discovered on the FPGA:
> 
>     BUG: Kernel NULL pointer dereference at 0x000000a0
>     Faulting instruction address: 0xc000000000048070
>     Oops: Kernel access of bad area, sig: 7 [#1]
>     ...
>     NIP [c000000000048070] eeh_add_device_late.part.9+0x50/0x1e0
>     LR [c00000000004805c] eeh_add_device_late.part.9+0x3c/0x1e0
>     Call Trace:
>     [c000200e43983900] [c00000000079e250] _dev_info+0x5c/0x6c (unreliable)
>     [c000200e43983980] [c0000000000d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
>     [c000200e439839f0] [c0000000000606d0] pcibios_bus_add_device+0x40/0x60
>     [c000200e43983a10] [c0000000006aa3a0] pci_bus_add_device+0x30/0x100
>     [c000200e43983a80] [c0000000006aa4d4] pci_bus_add_devices+0x64/0xd0
>     [c000200e43983ac0] [c00800001c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
>     [c000200e43983b00] [c00800001c4242ac] cxl_probe+0x504/0x5b0 [cxl]
>     [c000200e43983bb0] [c0000000006bba1c] local_pci_probe+0x6c/0x110
>     [c000200e43983c30] [c000000000159278] work_for_cpu_fn+0x38/0x60
> 
> The root cause is that those cxl virtual devices don't have a
> representation in the device tree and therefore no associated pci_dn
> structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
> we oops.
> 
> We never had explicit support for EEH for those virtual
> devices. Instead, EEH events are reported to the (real) pci device and
> handled by the cxl driver. Which can then forward to the virtual
> devices and handle dependencies. The fact that we try adding EEH
> support for the virtual devices is new and a side-effect of the recent
> cleanup.
> 
> This patch fixes it by skipping adding EEH support on powernv for
> devices which don't have a pci_dn structure.
> 
> Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> 
> Sending as an RFC, as I'm afraid of hiding potential issues and would
> be interested in comments. The powernv eeh code expects a struct
> pci_dn, so the fix seems safe. I'm wondering if there could be cases
> (other than capi virtual devices) where we'd want to blow up and fix
> instead of going undetected with this patch.

Looks good to me.

I do think it would be good to detect a missing pci_dn (WARN_ONCE()
might be appropriate).  However to implement it,
pnv_pcibios_bus_add_device() would need a way to detect that a struct
pci_dev is a cxl virtual device. I don't see an easy way to do that; do
you know if it's possible?

One last thing: pseries_pcibios_bus_add_device() also requires a pci_dn.
Do you know if it's possible to trigger a similar issue there, or is it
not possible for some reason?

Cheers,
Sam.

>  arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 6bc24a47e9ef..6f300ab7f0e9 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
>  {
>  	struct pci_dn *pdn = pci_get_pdn(pdev);
>  
> -	if (eeh_has_flag(EEH_FORCE_DISABLED))
> +	if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
>  		return;
>  
>  	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
> -- 
> 2.21.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [EXTERNAL] [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices
  2019-10-15  5:42 ` [EXTERNAL] " Sam Bobroff
@ 2019-10-15 19:41   ` Frederic Barrat
  2019-10-16 14:18     ` Frederic Barrat
  0 siblings, 1 reply; 4+ messages in thread
From: Frederic Barrat @ 2019-10-15 19:41 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: clombard, linuxppc-dev, andrew.donnellan



Le 15/10/2019 à 07:42, Sam Bobroff a écrit :
> On Fri, Sep 27, 2019 at 02:45:10PM +0200, Frederic Barrat wrote:
>> Recent cleanup in the way EEH support is added to a device causes a
>> kernel oops when the cxl driver probes a device and creates virtual
>> devices discovered on the FPGA:
>>
>>      BUG: Kernel NULL pointer dereference at 0x000000a0
>>      Faulting instruction address: 0xc000000000048070
>>      Oops: Kernel access of bad area, sig: 7 [#1]
>>      ...
>>      NIP [c000000000048070] eeh_add_device_late.part.9+0x50/0x1e0
>>      LR [c00000000004805c] eeh_add_device_late.part.9+0x3c/0x1e0
>>      Call Trace:
>>      [c000200e43983900] [c00000000079e250] _dev_info+0x5c/0x6c (unreliable)
>>      [c000200e43983980] [c0000000000d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
>>      [c000200e439839f0] [c0000000000606d0] pcibios_bus_add_device+0x40/0x60
>>      [c000200e43983a10] [c0000000006aa3a0] pci_bus_add_device+0x30/0x100
>>      [c000200e43983a80] [c0000000006aa4d4] pci_bus_add_devices+0x64/0xd0
>>      [c000200e43983ac0] [c00800001c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
>>      [c000200e43983b00] [c00800001c4242ac] cxl_probe+0x504/0x5b0 [cxl]
>>      [c000200e43983bb0] [c0000000006bba1c] local_pci_probe+0x6c/0x110
>>      [c000200e43983c30] [c000000000159278] work_for_cpu_fn+0x38/0x60
>>
>> The root cause is that those cxl virtual devices don't have a
>> representation in the device tree and therefore no associated pci_dn
>> structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
>> we oops.
>>
>> We never had explicit support for EEH for those virtual
>> devices. Instead, EEH events are reported to the (real) pci device and
>> handled by the cxl driver. Which can then forward to the virtual
>> devices and handle dependencies. The fact that we try adding EEH
>> support for the virtual devices is new and a side-effect of the recent
>> cleanup.
>>
>> This patch fixes it by skipping adding EEH support on powernv for
>> devices which don't have a pci_dn structure.
>>
>> Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>> ---
>>
>> Sending as an RFC, as I'm afraid of hiding potential issues and would
>> be interested in comments. The powernv eeh code expects a struct
>> pci_dn, so the fix seems safe. I'm wondering if there could be cases
>> (other than capi virtual devices) where we'd want to blow up and fix
>> instead of going undetected with this patch.
> 
> Looks good to me.
> 
> I do think it would be good to detect a missing pci_dn (WARN_ONCE()
> might be appropriate).  However to implement it,
> pnv_pcibios_bus_add_device() would need a way to detect that a struct
> pci_dev is a cxl virtual device. I don't see an easy way to do that; do
> you know if it's possible?


I think I found a solution. There's a cxl_pci_is_vphb_device() which is 
fairly cheap and would do the job. Sorry, I didn't think about it at first.


> One last thing: pseries_pcibios_bus_add_device() also requires a pci_dn.
> Do you know if it's possible to trigger a similar issue there, or is it
> not possible for some reason?


I don't think anybody is using capi in a lpar, but it should theorically 
be possible to hit it. I'll dig some more tomorrow and adjust the patch 
when resubmitting.

Thanks!

   Fred


> 
> Cheers,
> Sam.
> 
>>   arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> index 6bc24a47e9ef..6f300ab7f0e9 100644
>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> @@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
>>   {
>>   	struct pci_dn *pdn = pci_get_pdn(pdev);
>>   
>> -	if (eeh_has_flag(EEH_FORCE_DISABLED))
>> +	if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
>>   		return;
>>   
>>   	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
>> -- 
>> 2.21.0
>>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [EXTERNAL] [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices
  2019-10-15 19:41   ` Frederic Barrat
@ 2019-10-16 14:18     ` Frederic Barrat
  0 siblings, 0 replies; 4+ messages in thread
From: Frederic Barrat @ 2019-10-16 14:18 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: clombard, linuxppc-dev, andrew.donnellan



Le 15/10/2019 à 21:41, Frederic Barrat a écrit :
> 
> 
> Le 15/10/2019 à 07:42, Sam Bobroff a écrit :
>> On Fri, Sep 27, 2019 at 02:45:10PM +0200, Frederic Barrat wrote:
>>> Recent cleanup in the way EEH support is added to a device causes a
>>> kernel oops when the cxl driver probes a device and creates virtual
>>> devices discovered on the FPGA:
>>>
>>>      BUG: Kernel NULL pointer dereference at 0x000000a0
>>>      Faulting instruction address: 0xc000000000048070
>>>      Oops: Kernel access of bad area, sig: 7 [#1]
>>>      ...
>>>      NIP [c000000000048070] eeh_add_device_late.part.9+0x50/0x1e0
>>>      LR [c00000000004805c] eeh_add_device_late.part.9+0x3c/0x1e0
>>>      Call Trace:
>>>      [c000200e43983900] [c00000000079e250] _dev_info+0x5c/0x6c 
>>> (unreliable)
>>>      [c000200e43983980] [c0000000000d1ad0] 
>>> pnv_pcibios_bus_add_device+0x60/0xb0
>>>      [c000200e439839f0] [c0000000000606d0] 
>>> pcibios_bus_add_device+0x40/0x60
>>>      [c000200e43983a10] [c0000000006aa3a0] pci_bus_add_device+0x30/0x100
>>>      [c000200e43983a80] [c0000000006aa4d4] pci_bus_add_devices+0x64/0xd0
>>>      [c000200e43983ac0] [c00800001c429118] 
>>> cxl_pci_vphb_add+0xe0/0x130 [cxl]
>>>      [c000200e43983b00] [c00800001c4242ac] cxl_probe+0x504/0x5b0 [cxl]
>>>      [c000200e43983bb0] [c0000000006bba1c] local_pci_probe+0x6c/0x110
>>>      [c000200e43983c30] [c000000000159278] work_for_cpu_fn+0x38/0x60
>>>
>>> The root cause is that those cxl virtual devices don't have a
>>> representation in the device tree and therefore no associated pci_dn
>>> structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
>>> we oops.
>>>
>>> We never had explicit support for EEH for those virtual
>>> devices. Instead, EEH events are reported to the (real) pci device and
>>> handled by the cxl driver. Which can then forward to the virtual
>>> devices and handle dependencies. The fact that we try adding EEH
>>> support for the virtual devices is new and a side-effect of the recent
>>> cleanup.
>>>
>>> This patch fixes it by skipping adding EEH support on powernv for
>>> devices which don't have a pci_dn structure.
>>>
>>> Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
>>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>>> ---
>>>
>>> Sending as an RFC, as I'm afraid of hiding potential issues and would
>>> be interested in comments. The powernv eeh code expects a struct
>>> pci_dn, so the fix seems safe. I'm wondering if there could be cases
>>> (other than capi virtual devices) where we'd want to blow up and fix
>>> instead of going undetected with this patch.
>>
>> Looks good to me.
>>
>> I do think it would be good to detect a missing pci_dn (WARN_ONCE()
>> might be appropriate).  However to implement it,
>> pnv_pcibios_bus_add_device() would need a way to detect that a struct
>> pci_dev is a cxl virtual device. I don't see an easy way to do that; do
>> you know if it's possible?
> 
> 
> I think I found a solution. There's a cxl_pci_is_vphb_device() which is 
> fairly cheap and would do the job. Sorry, I didn't think about it at first.


Not sure what I was smoking... this is all internal to the cxl module 
and would require some changes to make it available to the kernel. Not 
impossible, but not trivial either. And I'm not convinced it's worth it 
to just be able to call WARN.

   Fred


> 
>> One last thing: pseries_pcibios_bus_add_device() also requires a pci_dn.
>> Do you know if it's possible to trigger a similar issue there, or is it
>> not possible for some reason?
> 
> 
> I don't think anybody is using capi in a lpar, but it should theorically 
> be possible to hit it. I'll dig some more tomorrow and adjust the patch 
> when resubmitting.
> 
> Thanks!
> 
>    Fred
> 
> 
>>
>> Cheers,
>> Sam.
>>
>>>   arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
>>> b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> index 6bc24a47e9ef..6f300ab7f0e9 100644
>>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> @@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
>>>   {
>>>       struct pci_dn *pdn = pci_get_pdn(pdev);
>>> -    if (eeh_has_flag(EEH_FORCE_DISABLED))
>>> +    if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
>>>           return;
>>>       dev_dbg(&pdev->dev, "EEH: Setting up device\n");
>>> -- 
>>> 2.21.0
>>>
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-10-16 14:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-27 12:45 [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices Frederic Barrat
2019-10-15  5:42 ` [EXTERNAL] " Sam Bobroff
2019-10-15 19:41   ` Frederic Barrat
2019-10-16 14:18     ` Frederic Barrat

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.