* Possible race condition in the kernel between PCI driver and AER handling
@ 2018-07-31 12:42 gokul cg
2018-07-31 13:15 ` Thomas Tai
0 siblings, 1 reply; 10+ messages in thread
From: gokul cg @ 2018-07-31 12:42 UTC (permalink / raw)
To: linux-pci
Hi All,
I suspect a race condition in the kernel between the PCI driver and AER
handling. Because of it, a kernel panic occurs in the worker thread that
handles the bottom half of the AER IRQ.
I see this issue when I suddenly power off a PCI card that supports PCIe
AER and has error reporting enabled.
While the device is powering off, the AER driver gets an AER IRQ for it.
The AER IRQ handler caches the AER error code and schedules a worker
thread to handle the error.
The PCIe device is removed from the PCI tree before the worker thread
completes its task, and the kernel panics when the worker thread tries to
access the device's config space.
Issue:
crash>
crash> bt
PID: 2727 TASK: ffff880272adc530 CPU: 0 COMMAND: "kworker/0:2"
#0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18
#1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05
#2 [ffff88027469fbf0] oops_end at ffffffff8176d960
#3 [ffff88027469fc18] die at ffffffff810060db
#4 [ffff88027469fc48] do_general_protection at ffffffff8176d452
#5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
[exception RIP: pci_bus_read_config_dword+100]
RIP: ffffffff813405f4 RSP: ffff88027469fd20 RFLAGS: 00010046
RAX: 435f494350006963 RBX: ffff880274892000 RCX: 0000000000000004
RDX: 0000000000000100 RSI: 0000000000000060 RDI: ffff880274892000
RBP: ffff88027469fd48 R8: ffff88027469fd2c R9: 00000000000012c0
R10: 0000000000000006 R11: 00000000000012bf R12: ffff88027469fd5c
R13: 0000000000000246 R14: 0000000000000000 R15: ffff8802741a4000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
#7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
#8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
#9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
#10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
#11 [ffff88027469fe70] worker_thread at ffffffff8105e251
#12 [ffff88027469fed0] kthread at ffffffff81064260
#13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
crash>
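As a side note on the register dump above, RAX (435f494350006963) does not look like a kernel pointer. The following is a minimal userspace C sketch, not kernel code, that checks whether a value is a canonical 48-bit x86-64 address and renders its bytes in memory order. The helper names are hypothetical. One possible reading, which is an assumption because the disassembly is not shown, is that the faulting instruction dereferenced a pointer loaded from memory that had already been reused for string data.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if addr is canonical for 48-bit x86-64 virtual addressing,
 * i.e. bits 63..47 are all zero or all one. Dereferencing a
 * non-canonical address raises a general protection fault.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47; /* the 17 bits 63..47 */

    return top == 0 || top == 0x1ffff;
}

/*
 * Render the 8 bytes of a register value in memory (little-endian)
 * order. Non-printable bytes are shown as '.'. The out buffer must
 * hold at least 9 characters.
 */
static void reg_as_text(uint64_t value, char out[9])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        unsigned char byte = (unsigned char)(value >> (8 * i));

        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[8] = '\0';
}
```

With the RAX value from the dump, is_canonical_48() returns false and reg_as_text() yields "ci.PCI_C", where '.' stands for a NUL byte. RBX (ffff880274892000) is canonical.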
I have tested it on kernel 3.10, but from the source I can see that this
case is still relevant for the latest Linux source.
Can anybody tell me if this is an issue with the AER driver in Linux?
Regards
Gokul CG
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-07-31 12:42 Possible race condition in the kernel between PCI driver and AER handling gokul cg
@ 2018-07-31 13:15 ` Thomas Tai
2018-08-01 5:42 ` gokul cg
2018-08-01 5:53 ` gokul cg
0 siblings, 2 replies; 10+ messages in thread
From: Thomas Tai @ 2018-07-31 13:15 UTC (permalink / raw)
To: gokul cg, linux-pci
On 07/31/2018 08:42 AM, gokul cg wrote:
> Hi All,
>
> I suspect a race condition in the kernel between the PCI driver and AER
> handling. Because of it, a kernel panic occurs in the worker thread that
> handles the bottom half of the AER IRQ.
>
> I see this issue when I suddenly power off a PCI card that supports PCIe
> AER and has error reporting enabled.
>
> While the device is powering off, the AER driver gets an AER IRQ for it.
> The AER IRQ handler caches the AER error code and schedules a worker
> thread to handle the error.
Hi Gokul,
It may be an issue in the AER driver. How do you power off your device?
I've never seen this issue with a normal shutdown or with
"echo 0 > /sys/bus/pci/slots/xx/power".
Cheers,
Thomas
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-07-31 13:15 ` Thomas Tai
@ 2018-08-01 5:42 ` gokul cg
2018-08-01 14:17 ` Thomas Tai
2018-08-01 17:47 ` Thomas Tai
2018-08-01 5:53 ` gokul cg
1 sibling, 2 replies; 10+ messages in thread
From: gokul cg @ 2018-08-01 5:42 UTC (permalink / raw)
To: Thomas Tai; +Cc: linux-pci
Hi Thomas,
My hardware has an I2C power control chip for the PCI card. I just
powered the card down using an I2C command.
Regards,
Gokul
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-07-31 13:15 ` Thomas Tai
2018-08-01 5:42 ` gokul cg
@ 2018-08-01 5:53 ` gokul cg
2018-08-01 14:24 ` Thomas Tai
1 sibling, 1 reply; 10+ messages in thread
From: gokul cg @ 2018-08-01 5:53 UTC (permalink / raw)
To: Thomas Tai; +Cc: linux-pci
Hi,
I see there is a basic design flaw. As the AER and PCI drivers are
independent modules, locally storing a pointer to any data structure from
the PCI linked list in the AER driver will create a problem, as there is no
synchronization between the two.
https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
Here 'struct aer_err_info
<https://elixir.bootlin.com/linux/v3.10.99/ident/aer_err_info> *e_info
<https://elixir.bootlin.com/linux/v3.10.99/ident/e_info>' has a pointer to
the pci_dev, which can be removed from the PCI tree at any time.
I think this is the basic issue.
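One common way to keep a cached pointer valid across an asynchronous handoff is reference counting. The following is a minimal single-threaded userspace model of that idea, not the AER driver's code: every name in it (model_dev, model_err_info and the helpers) is hypothetical, and the plain int counter stands in for an atomic reference count. In the kernel, pci_dev_get() and pci_dev_put() play the roles of the get and put helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device object owned by a device tree. */
struct model_dev {
    int refcount;    /* plain int: this model is single-threaded */
    bool in_tree;    /* false once hot-removal has unlinked the device */
    bool *released;  /* set to true when the last reference is dropped */
};

/* Hypothetical stand-in for the error record passed to the worker. */
struct model_err_info {
    struct model_dev *dev;
};

static struct model_dev *model_dev_create(bool *released)
{
    struct model_dev *dev = malloc(sizeof(*dev));

    if (!dev)
        return NULL;
    dev->refcount = 1;  /* the reference held by the device tree */
    dev->in_tree = true;
    dev->released = released;
    *released = false;
    return dev;
}

static struct model_dev *model_dev_get(struct model_dev *dev)
{
    if (dev)
        dev->refcount++;
    return dev;
}

static void model_dev_put(struct model_dev *dev)
{
    if (!dev)
        return;
    assert(dev->refcount > 0);
    if (--dev->refcount == 0) {
        *dev->released = true;
        free(dev);
    }
}

/* Removal path: unlink the device and drop the tree's reference. */
static void model_dev_remove(struct model_dev *dev)
{
    dev->in_tree = false;
    model_dev_put(dev);
}

/* Top half: take a reference before caching the pointer. */
static void model_record_error(struct model_err_info *info,
                               struct model_dev *dev)
{
    info->dev = model_dev_get(dev);
}

/*
 * Bottom half: the cached pointer is still valid memory even if the
 * device was removed in between. Returns whether it was still in the
 * tree, then drops the reference taken by the top half.
 */
static bool model_handle_error(struct model_err_info *info)
{
    bool present = info->dev->in_tree;

    model_dev_put(info->dev);
    info->dev = NULL;
    return present;
}
```

In this model, a removal that happens between model_record_error() and model_handle_error() only marks the device as unlinked. The memory is freed when the worker drops its reference, so the worker never touches freed memory.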
Regards
Gokul
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 5:42 ` gokul cg
@ 2018-08-01 14:17 ` Thomas Tai
2018-08-01 17:47 ` Thomas Tai
1 sibling, 0 replies; 10+ messages in thread
From: Thomas Tai @ 2018-08-01 14:17 UTC (permalink / raw)
To: gokul cg; +Cc: linux-pci
On 08/01/2018 01:42 AM, gokul cg wrote:
> Hi Thomas,
>
> My hardware has an I2C power control chip for the PCI card. I just
> powered the card down using an I2C command.
Hi Gokul,
I see. That is why we normally don't see this issue. Let me dig around
to see if we have any machine on which we can do a similar thing.
Thomas
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 5:53 ` gokul cg
@ 2018-08-01 14:24 ` Thomas Tai
2018-08-01 15:22 ` gokul cg
2018-08-02 14:17 ` Thomas Tai
0 siblings, 2 replies; 10+ messages in thread
From: Thomas Tai @ 2018-08-01 14:24 UTC (permalink / raw)
To: gokul cg; +Cc: linux-pci
On 08/01/2018 01:53 AM, gokul cg wrote:
> Hi,
>
> I see there is a basic design flaw. As the AER and PCI drivers are
> independent modules, locally storing a pointer to any data structure from
> the PCI linked list in the AER driver will create a problem, as there is no
> synchronization between the two.
>
> https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
> Here 'struct aer_err_info
> <https://elixir.bootlin.com/linux/v3.10.99/ident/aer_err_info> *e_info
> <https://elixir.bootlin.com/linux/v3.10.99/ident/e_info>' has a pointer to
> the pci_dev, which can be removed from the PCI tree at any time.
> I think this is the basic issue.
Hi Gokul,
Agreed. We had an issue last week with this e_info storing a pci_dev
that is removed in pcie_do_fatal_recovery(), which causes a
use-after-free problem.
In your case, I am hoping to recreate your issue so that we can work
together to isolate and fix it. Do you have any suggestions on how to
fix it at this moment?
Thanks,
Thomas
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 14:24 ` Thomas Tai
@ 2018-08-01 15:22 ` gokul cg
2018-08-02 14:17 ` Thomas Tai
1 sibling, 0 replies; 10+ messages in thread
From: gokul cg @ 2018-08-01 15:22 UTC (permalink / raw)
To: Thomas Tai; +Cc: linux-pci
Hi Thomas,
> In your case, I am hoping to recreate your issue so that we can work
> together to isolate and fix the issue. Do you have any suggestion how to
> fix it at this moment?
Yes, I can reproduce the issue.
I don't have a patch right now.
I was thinking about two options:
1) Add a generic callback in pci_dev that notifies subscribers when a device
is removed from the tree, so that the AER driver can also subscribe to it.
2) Do set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags) on the PCI device when
it is removed from the list and let the AER driver manage the free, but I
fear this could create a memory leak because of the race.
Regards
Gokul
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 5:42 ` gokul cg
2018-08-01 14:17 ` Thomas Tai
@ 2018-08-01 17:47 ` Thomas Tai
2018-08-01 18:52 ` gokul cg
1 sibling, 1 reply; 10+ messages in thread
From: Thomas Tai @ 2018-08-01 17:47 UTC (permalink / raw)
To: gokul cg; +Cc: linux-pci
On 08/01/2018 01:42 AM, gokul cg wrote:
> Hi Thomas,
>
> My hardware has an I2C power control chip for the PCI card. I just
> powered the card down using an I2C command.
Hi Gokul,
When you power off the card via I2C, does it forcefully power off the
card without notifying the kernel? That is, during the power-off
sequence the card manages to send a last AER interrupt to report the
error and then dies?
I would expect the PCIe surprise removal or hotplug driver to handle it
correctly.
Thanks,
Thomas
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 17:47 ` Thomas Tai
@ 2018-08-01 18:52 ` gokul cg
0 siblings, 0 replies; 10+ messages in thread
From: gokul cg @ 2018-08-01 18:52 UTC (permalink / raw)
To: Thomas Tai; +Cc: linux-pci
Hi Thomas,
Yes, it is a surprise removal.
But as far as I know, the Linux kernel handles surprise removal of a PCIe
device without panicking.
The driver will suddenly start reading all 0xff and will then need to
abort whatever it was doing. Usually all drivers handle this just fine.
Each driver individually needs to handle the fact that it might, at any
time, start getting invalid data. If it doesn't, it needs to be fixed.
Is the AER driver one that does not handle this properly?
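The "reading all 0xff" behaviour can be illustrated with a walk over the PCIe extended capability list that gives up as soon as a header reads back as all ones. This is a minimal userspace sketch with hypothetical names (model_find_ext_cap and the two read callbacks), modeled loosely on how such a walk is usually written. It is not the kernel's pci_find_ext_capability().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE      256u   /* end of conventional config space */
#define CFG_SPACE_EXP_SIZE  4096u  /* end of PCIe extended config space */
#define EXT_CAP_ID(h)       ((h) & 0xffffu)           /* bits 15:0 */
#define EXT_CAP_NEXT(h)     (((h) >> 20) & 0xffcu)    /* bits 31:20 */
#define EXT_CAP_ID_AER      0x0001u

/* Callback that reads one dword of config space at a byte offset. */
typedef uint32_t (*model_read_fn)(void *ctx, unsigned int offset);

/*
 * Return the offset of the extended capability cap_id, or 0 if it is
 * not found or the device appears to be gone.
 */
static unsigned int model_find_ext_cap(model_read_fn read, void *ctx,
                                       unsigned int cap_id)
{
    unsigned int pos = CFG_SPACE_SIZE;
    /* Each capability is at least 8 bytes, which bounds the walk. */
    int ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8;

    assert(read != NULL);
    while (ttl-- > 0) {
        uint32_t header = read(ctx, pos);

        /* 0 means no capabilities; all ones means the read failed. */
        if (header == 0 || header == UINT32_MAX)
            return 0;
        if (EXT_CAP_ID(header) == cap_id)
            return pos;
        pos = EXT_CAP_NEXT(header);
        if (pos < CFG_SPACE_SIZE)
            break;
    }
    return 0;
}

/* Device present: ctx points to 1024 dwords of fake config space. */
static uint32_t model_read_present(void *ctx, unsigned int offset)
{
    const uint32_t *space = ctx;

    return space[offset / 4];
}

/* Device removed: every read completes with all bits set. */
static uint32_t model_read_removed(void *ctx, unsigned int offset)
{
    (void)ctx;
    (void)offset;
    return UINT32_MAX;
}
```

A check like this only covers bad data coming back from the bus. It does not help if the device structure that the walk starts from has itself been freed or overwritten.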
Regards,
Gokul
* Re: Possible race condition in the kernel between PCI driver and AER handling
2018-08-01 14:24 ` Thomas Tai
2018-08-01 15:22 ` gokul cg
@ 2018-08-02 14:17 ` Thomas Tai
1 sibling, 0 replies; 10+ messages in thread
From: Thomas Tai @ 2018-08-02 14:17 UTC (permalink / raw)
To: gokul cg; +Cc: linux-pci
On 08/01/2018 10:24 AM, Thomas Tai wrote:
> On 08/01/2018 01:53 AM, gokul cg wrote:
>> Hi,
>>
>> I see there is a basic design flaw. As the AER and PCI drivers are
>> independent modules, locally storing a pointer to any data structure from
>> the PCI linked list in the AER driver will create a problem, as there is no
>> synchronization between the two.
>>
>> https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
>>
>> Here 'struct aer_err_info
>> <https://elixir.bootlin.com/linux/v3.10.99/ident/aer_err_info> *e_info
>> <https://elixir.bootlin.com/linux/v3.10.99/ident/e_info>' has a pointer
>> to the pci_dev, which can be removed from the PCI tree at any time.
>> I think this is the basic issue.
Hi Gokul,
I am afraid that I am having a hard time recreating your issue. The
following is the normal situation. Did you see any hotplug message
before the AER message?
pcieport 0000:00:02.2: AER: Corrected error received: id=1130
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down
pciehp 0000:11:06.0:pcie204: Slot(102): Link Down event ignored; already powering off
pcieport 0000:11:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=1130(Receiver ID)
pcieport 0000:11:06.0: device [111d:80b5] error status/mask=00000001/0000e000
pcieport 0000:11:06.0: [ 0] Receiver Error
As for the pci_dev being corrupted, maybe you can add
"slub_debug=FZP" to your kernel boot arguments, rerun your test, and
see if it finds anything. I am curious who corrupted the pci_dev in
the first place. I am not totally convinced that the problem is in the
AER code.
Cheers,
Thomas
Thread overview: 10+ messages
2018-07-31 12:42 Possible race condition in the kernel between PCI driver and AER handling gokul cg
2018-07-31 13:15 ` Thomas Tai
2018-08-01 5:42 ` gokul cg
2018-08-01 14:17 ` Thomas Tai
2018-08-01 17:47 ` Thomas Tai
2018-08-01 18:52 ` gokul cg
2018-08-01 5:53 ` gokul cg
2018-08-01 14:24 ` Thomas Tai
2018-08-01 15:22 ` gokul cg
2018-08-02 14:17 ` Thomas Tai