* Possible race condition in the kernel between PCI driver and AER handling
From: gokul cg @ 2018-07-31 12:42 UTC
  To: linux-pci

Hi All,


I suspect a race condition in the kernel between PCI device removal and
AER handling. It causes a kernel panic in the worker thread that handles
the bottom half of the AER IRQ.

I see this issue when I suddenly power off a PCI card that supports PCIe
AER and has error reporting enabled.

While the device is powering off, the AER driver receives an AER IRQ for
it. The IRQ handler caches the AER error code and schedules a worker
thread to handle the error.

The PCIe device is removed from the PCI tree before the worker thread
completes its task, and the kernel panics when the worker thread tries
to access the device's config space.



Issue:

crash> bt
PID: 2727   TASK: ffff880272adc530  CPU: 0   COMMAND: "kworker/0:2"
 #0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18
 #1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05
 #2 [ffff88027469fbf0] oops_end at ffffffff8176d960
 #3 [ffff88027469fc18] die at ffffffff810060db
 #4 [ffff88027469fc48] do_general_protection at ffffffff8176d452
 #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
    [exception RIP: pci_bus_read_config_dword+100]
    RIP: ffffffff813405f4  RSP: ffff88027469fd20  RFLAGS: 00010046
    RAX: 435f494350006963  RBX: ffff880274892000  RCX: 0000000000000004
    RDX: 0000000000000100  RSI: 0000000000000060  RDI: ffff880274892000
    RBP: ffff88027469fd48   R8: ffff88027469fd2c   R9: 00000000000012c0
    R10: 0000000000000006  R11: 00000000000012bf  R12: ffff88027469fd5c
    R13: 0000000000000246  R14: 0000000000000000  R15: ffff8802741a4000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
 #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
 #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
 #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
#10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
#11 [ffff88027469fe70] worker_thread at ffffffff8105e251
#12 [ffff88027469fed0] kthread at ffffffff81064260
#13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
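
One way to read the register dump above, offered as a rough sketch and
not as a confirmed diagnosis: the argument registers match the x86-64
calling convention for a call of the form read(bus, devfn, where, size,
&val), with RDI = bus, RSI = devfn 0x60, RDX = offset 0x100, RCX = size 4
and R8 = a stack address. That would make RAX the pointer being followed
at the faulting instruction, but that role for RAX is an assumption. What
can be checked mechanically is that the RAX value is not a canonical
x86-64 address (assuming 48-bit virtual addresses), which fits a general
protection fault, and that its bytes are mostly printable ASCII, which
would suggest the memory had been reused for string data. The helper
names below are hypothetical.

```python
import struct

# Register values copied from the crash dump above.
RAX = 0x435F494350006963
RDI = 0xFFFF880274892000
RSI = 0x60  # assumed to be the devfn argument


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all equal (assumes 48-bit VAs)."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def as_ascii(value):
    """Little-endian bytes of a 64-bit value, non-printables shown as '.'."""
    raw = struct.pack("<Q", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


def split_devfn(devfn):
    """Split a PCI devfn byte into (device, function)."""
    return (devfn >> 3) & 0x1F, devfn & 0x07


print("RAX canonical:", is_canonical(RAX))
print("RDI canonical:", is_canonical(RDI))
print("RAX as text:  ", as_ascii(RAX))
print("devfn 0x60:   ", split_devfn(RSI))
```

The decoded RAX bytes read "ci.PCI_C", where "." stands for a NUL byte.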


I tested this on kernel 3.10, but from reading the source it looks like
this case is still relevant for the latest Linux source.


Can anybody tell me whether this is an issue with the AER driver in Linux?




Regards

Gokul CG


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: Thomas Tai @ 2018-07-31 13:15 UTC
  To: gokul cg, linux-pci



On 07/31/2018 08:42 AM, gokul cg wrote:
> Hi All,
> 
> 
> I am suspecting a possible race condition in the kernel between PCI 
> driver and AER handling.
> 
> Because of the same kernel panic happens from worker thread which 
> handles bottom half of aer irq.
> 
> 
> I am seeing this issue when I suddenly power off PCI card which 
> supports/enabled PCIE AER error reporting.
> 
> While powering off PCI device, AER driver will get AER IRQ for the 
> device, from AER IRQ handler, it will cache AER error code and schedule 
> worker thread to handle error.

Hi Gokul,

It may be an issue in the AER driver. How do you power off your device?
I have never seen this issue with a normal shutdown or with
"echo 0 > /sys/bus/pci/slots/xx/power".

Cheers,
Thomas


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: gokul cg @ 2018-08-01  5:42 UTC
  To: Thomas Tai; +Cc: linux-pci

Hi Thomas,

On my hardware there is an I2C power-control chip for the PCI card. I
simply powered the card down with an I2C command.

Regards,
Gokul


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: gokul cg @ 2018-08-01  5:53 UTC
  To: Thomas Tai; +Cc: linux-pci

Hi,

I see a basic design flaw. The AER and PCI drivers are independent
modules, so storing a pointer to any data structure from the PCI device
list locally in the AER driver causes problems, because nothing
synchronizes the two.


https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
Here 'struct aer_err_info *e_info' holds a pointer to the pci_dev, which
can be removed from the PCI tree at any time.
I think this is the basic issue.
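
A minimal sketch of the lifetime problem described above, written as a
toy model in Python and not as kernel code. The names ToyDevice,
worker_without_ref and worker_with_ref are hypothetical. In the kernel,
the analogous reference-counting calls are pci_dev_get() and
pci_dev_put(); whether taking a reference in the AER path is the right
fix is not established in this thread.

```python
class ToyDevice:
    """Toy stand-in for a refcounted device structure."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1      # the reference held by the "PCI tree"
        self.freed = False

    def get(self):
        """Take an extra reference (loosely like pci_dev_get())."""
        assert not self.freed
        self.refcount += 1
        return self

    def put(self):
        """Drop a reference; the last put frees the object."""
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

    def read_config(self):
        if self.freed:
            raise RuntimeError("use after free: " + self.name)
        return 0xFFFFFFFF      # a powered-off device reads back all ones


def worker_without_ref():
    dev = ToyDevice("toy-dev")
    cached = dev               # top half caches a bare pointer
    dev.put()                  # removal drops the tree's reference
    return cached.read_config()  # bottom half touches freed memory


def worker_with_ref():
    dev = ToyDevice("toy-dev")
    cached = dev.get()         # top half pins the device
    dev.put()                  # removal drops the tree's reference
    try:
        return cached.read_config()  # structure is still valid
    finally:
        cached.put()           # bottom half releases its pin
```

Holding a reference only keeps the structure's memory valid. It does not
make the hardware present again, so the worker still has to cope with
reads that return all ones.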


Regards
Gokul


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: Thomas Tai @ 2018-08-01 14:17 UTC
  To: gokul cg; +Cc: linux-pci



On 08/01/2018 01:42 AM, gokul cg wrote:
> Hi Thomas,
> 
> In my hardware, there is i2c power control chip for PCI card, I just 
> powered down using i2c command .

Hi Gokul,
I see. That is why we normally don't see this issue. Let me dig around
to see whether we have a machine on which we can do something similar.

Thomas


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: Thomas Tai @ 2018-08-01 14:24 UTC
  To: gokul cg; +Cc: linux-pci



On 08/01/2018 01:53 AM, gokul cg wrote:
> Hi,
> 
> I see there is a basic design flow. As AER and PCI drivers are 
> independent modules ,
> locally storing pointer to any data structure from pci linked list in 
> AER driver will create problem as there is no synchronization between 
> the same .
> 
> 
> https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
> Here 'struct aer_err_info *e_info' has pointer to
> pci dev , which can be removed from pci tree at any time .
> I think this is the basic issue.

Hi Gokul,
Agreed. Last week we had an issue with this e_info storing a pci_dev
that is removed in pcie_do_fatal_recovery(), which causes a
use-after-free problem.

In your case, I am hoping to reproduce the issue so that we can work
together to isolate and fix it. Do you have any suggestions for how to
fix it at this point?

Thanks,
Thomas


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: gokul cg @ 2018-08-01 15:22 UTC
  To: Thomas Tai; +Cc: linux-pci

Hi Thomas,

> In your case, I am hoping to recreate your issue so that we can work
> together to isolate and fix the issue. Do you have any suggestion how to
> fix it at this moment?

Yes, I can reproduce the issue.

I don't have a patch right now.
I was thinking about two options:

1)  Add a generic callback to pci_dev that notifies subscribers when a
device is removed from the tree, so that the AER driver can also
subscribe to it.
2)  Call set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags) on the PCI
device when it is removed from the list and let the AER driver manage
the free, but I fear this could create a memory leak because of the race.


Regards
Gokul

* Re: Possible race condition in the kernel between PCI driver and AER handling
From: Thomas Tai @ 2018-08-01 17:47 UTC
  To: gokul cg; +Cc: linux-pci



On 08/01/2018 01:42 AM, gokul cg wrote:
> Hi Thomas,
> 
> In my hardware, there is i2c power control chip for PCI card, I just 
> powered down using i2c command .

Hi Gokul,
When you power off the card via I2C, does it forcefully power off the
card without notifying the kernel? That is, during the power-off
sequence, does the card manage to send one last AER interrupt to report
the error before it dies? I would expect the PCIe surprise-removal or
hotplug driver to handle this correctly.

Thanks,
Thomas


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: gokul cg @ 2018-08-01 18:52 UTC
  To: Thomas Tai; +Cc: linux-pci

Hi Thomas,


Yes, it is a surprise removal.

But as far as I know, the Linux kernel handles surprise removal of a
PCIe device without panicking.

The driver will suddenly start reading all 0xff and will then need to
abort whatever it was doing. Usually all drivers handle this just fine.

Each driver individually needs to handle the fact that it might, at any
time, start getting invalid data. If it doesn't, it needs to be fixed.
Is the AER driver one that does not handle this properly?
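
To illustrate the "reads all 0xff" case, here is a small sketch in
Python, modelled loosely on a bounded walk of the PCIe extended
capability list (first header at offset 0x100, capability ID in bits
15:0, next pointer in bits 31:20). It is not the kernel's
implementation: the explicit early return on an all-ones read is an
addition for illustration, and the function names and the sample config
space contents are hypothetical.

```python
CFG_SPACE_SIZE = 0x100        # start of PCIe extended config space
CFG_SPACE_EXP_SIZE = 0x1000   # total PCIe config space size
EXT_CAP_ID_AER = 0x0001       # Advanced Error Reporting capability ID
ALL_ONES = 0xFFFFFFFF         # what a missing device reads back


def find_ext_capability(read_dword, cap_id):
    """Return the offset of cap_id, or 0 if absent or the device is gone."""
    # Bound the walk so a corrupt list cannot loop forever.
    ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) // 8
    pos = CFG_SPACE_SIZE
    while ttl > 0:
        ttl -= 1
        header = read_dword(pos)
        if header == 0 or header == ALL_ONES:
            return 0          # empty list, or device no longer responds
        if (header & 0xFFFF) == cap_id:
            return pos
        pos = (header >> 20) & 0xFFC
        if pos < CFG_SPACE_SIZE:
            return 0          # end of list
    return 0


def present_device(pos):
    """Hypothetical device: one capability at 0x100, AER at 0x150."""
    space = {
        0x100: (0x150 << 20) | (1 << 16) | 0x0003,
        0x150: (0x000 << 20) | (1 << 16) | EXT_CAP_ID_AER,
    }
    return space.get(pos, 0)


def removed_device(pos):
    """Surprise-removed device: every config read returns all ones."""
    return ALL_ONES


print(hex(find_ext_capability(present_device, EXT_CAP_ID_AER)))
print(find_ext_capability(removed_device, EXT_CAP_ID_AER))
```

This only covers a device that has stopped responding. It does not help
if the structure used to reach the device has itself already been freed,
which is what the backtrace earlier in the thread may indicate.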


Regards,
Gokul


* Re: Possible race condition in the kernel between PCI driver and AER handling
From: Thomas Tai @ 2018-08-02 14:17 UTC
  To: gokul cg; +Cc: linux-pci


On 08/01/2018 10:24 AM, Thomas Tai wrote:
> 
> 
> On 08/01/2018 01:53 AM, gokul cg wrote:
>> Hi,
>>
>> I see there is a basic design flow. As AER and PCI drivers are 
>> independent modules ,
>> locally storing pointer to any data structure from pci linked list in 
>> AER driver will create problem as there is no synchronization between 
>> the same .
>>
>>
>> https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
>>
>> Here 'struct aer_err_info *e_info' has pointer
>> to pci dev , which can be removed from pci tree at any time .
>> I think this is the basic issue.

Hi Gokul,

I am afraid I am having a hard time reproducing your issue. The
following is the normal sequence. Did you see any hotplug messages
before the AER message?

pcieport 0000:00:02.2: AER: Corrected error received: id=1130
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down event ignored; already powering off
pcieport 0000:11:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=1130(Receiver ID)
pcieport 0000:11:06.0:   device [111d:80b5] error status/mask=00000001/0000e000
pcieport 0000:11:06.0:    [ 0] Receiver Error

As for the pci_dev being corrupted, maybe you can add
"slub_debug=FZP" to your kernel boot arguments, rerun your test, and
see whether it finds anything. I am curious about what corrupted the
pci_dev in the first place. I am not totally convinced that the problem
is in the AER code.

Cheers,
Thomas

Thread overview: 10 messages
2018-07-31 12:42 Possible race condition in the kernel between PCI driver and AER handling gokul cg
2018-07-31 13:15 ` Thomas Tai
2018-08-01  5:42   ` gokul cg
2018-08-01 14:17     ` Thomas Tai
2018-08-01 17:47     ` Thomas Tai
2018-08-01 18:52       ` gokul cg
2018-08-01  5:53   ` gokul cg
2018-08-01 14:24     ` Thomas Tai
2018-08-01 15:22       ` gokul cg
2018-08-02 14:17       ` Thomas Tai
