Hi Thomas,

>In your case, I am hoping to recreate your issue so that we can work together to isolate and fix the issue. Do you have any suggestion how to fix it at this moment?
Yes, I can reproduce the issue.

I don't have a patch right now.
I was thinking about two options:

1) Add a generic callback in pci_dev to notify subscribers when a device gets removed from the tree, so that the AER driver can also subscribe to it.
2) Call set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags) on the PCI device when it is removed from the list, and let the AER driver manage the free. But I fear this could create a memory leak because of the race.
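
To make option 2 more concrete, here is a minimal userspace C sketch. It is not kernel code and not a patch. It assumes the disconnected flag is paired with a reference count, roughly in the spirit of pci_dev_get()/pci_dev_put(), so that whoever drops the last reference frees the device. Every model_* name, the struct layout and the free counter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct model_dev {
    atomic_int  refcount;      /* models references pinned on the device */
    atomic_bool disconnected;  /* models the PCI_DEV_DISCONNECTED bit */
    uint32_t    config[64];    /* stand-in for config space */
};

static int model_free_count;   /* how many devices have been freed */

static struct model_dev *model_dev_alloc(void)
{
    struct model_dev *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    atomic_init(&dev->refcount, 1);          /* reference owned by the tree */
    atomic_init(&dev->disconnected, false);
    dev->config[0] = 0x12345678u;
    return dev;
}

static struct model_dev *model_dev_get(struct model_dev *dev)
{
    atomic_fetch_add(&dev->refcount, 1);
    return dev;
}

static void model_dev_put(struct model_dev *dev)
{
    /* atomic_fetch_sub returns the old value: 1 means this was the last ref. */
    if (atomic_fetch_sub(&dev->refcount, 1) == 1) {
        free(dev);
        model_free_count++;
    }
}

/* Removal: mark the device gone, then drop the tree's reference. */
static void model_dev_remove(struct model_dev *dev)
{
    atomic_store(&dev->disconnected, true);
    model_dev_put(dev);
}

/* Config read that refuses to touch a disconnected device. */
static int model_read_config(struct model_dev *dev, unsigned int idx,
                             uint32_t *val)
{
    if (idx >= 64 || atomic_load(&dev->disconnected)) {
        *val = 0xffffffffu;
        return -1;
    }
    *val = dev->config[idx];
    return 0;
}

/* Top half: cache the device for the worker and pin it with a reference. */
static struct model_dev *model_aer_irq(struct model_dev *dev)
{
    return model_dev_get(dev);
}

/* Bottom half: read through the cached pointer, then release the reference. */
static int model_aer_worker(struct model_dev *cached, uint32_t *val)
{
    int ret = model_read_config(cached, 0, val);

    model_dev_put(cached);
    return ret;
}
```

In this model, calling model_aer_irq(), then model_dev_remove(), then model_aer_worker() leaves the memory valid until the worker drops its reference. The worker sees the disconnected flag instead of touching freed memory, and the device is freed exactly once. The sketch does not cover the window between the flag check and a real hardware access, which would still need care in the kernel.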


Regards
Gokul 
On Wed, Aug 1, 2018 at 7:54 PM, Thomas Tai <thomas.tai@oracle.com> wrote:


On 08/01/2018 01:53 AM, gokul cg wrote:
Hi,

I see there is a basic design flaw. As the AER and PCI drivers are independent modules,
locally storing a pointer to any data structure from the PCI linked list in the AER driver will create problems, as there is no synchronization between the two.


https://elixir.bootlin.com/linux/v3.10.99/source/drivers/pci/pcie/aer/aerdrv_core.c#L701
Here 'struct aer_err_info *e_info' holds a pointer to the pci_dev, which can be removed from the PCI tree at any time.
I think this is the basic issue.

Hi Gokul,
Agreed. We had an issue last week with this e_info storing a pci_dev that is removed in pcie_do_fatal_recovery(), which causes a use-after-free problem.

In your case, I am hoping to recreate your issue so that we can work together to isolate and fix the issue. Do you have any suggestion how to fix it at this moment?

Thanks,
Thomas



Regards
Gokul


On Tue, Jul 31, 2018 at 6:45 PM, Thomas Tai <thomas.tai@oracle.com> wrote:



    On 07/31/2018 08:42 AM, gokul cg wrote:

        Hi All,


        I suspect a possible race condition in the kernel between
        the PCI driver and AER handling.

        Because of this, a kernel panic happens in the worker thread
        that handles the bottom half of the AER IRQ.


        I am seeing this issue when I suddenly power off a PCI card that
        supports PCIe AER error reporting and has it enabled.

        While the PCI device is powering off, the AER driver gets an AER
        IRQ for the device. The AER IRQ handler caches the AER error code
        and schedules a worker thread to handle the error.


    Hi Gokul,

    It may be an issue in the AER driver. How do you power off your
    device? I've never seen this issue with a normal shutdown or with
    "echo 0 > /sys/bus/pci/slots/xx/power".

    Cheers,
    Thomas



        The PCIe device gets removed from the PCI tree before the worker
        thread completes its task, and the kernel panics when the
        worker thread tries to access the PCI device's config space.



        Issue:


        crash>

        crash> bt
        PID: 2727   TASK: ffff880272adc530  CPU: 0   COMMAND: "kworker/0:2"
         #0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18
         #1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05
         #2 [ffff88027469fbf0] oops_end at ffffffff8176d960
         #3 [ffff88027469fc18] die at ffffffff810060db
         #4 [ffff88027469fc48] do_general_protection at ffffffff8176d452
         #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
            [exception RIP: pci_bus_read_config_dword+100]
            RIP: ffffffff813405f4  RSP: ffff88027469fd20  RFLAGS: 00010046
            RAX: 435f494350006963  RBX: ffff880274892000  RCX: 0000000000000004
            RDX: 0000000000000100  RSI: 0000000000000060  RDI: ffff880274892000
            RBP: ffff88027469fd48   R8: ffff88027469fd2c   R9: 00000000000012c0
            R10: 0000000000000006  R11: 00000000000012bf  R12: ffff88027469fd5c
            R13: 0000000000000246  R14: 0000000000000000  R15: ffff8802741a4000
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
         #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
         #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
         #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
         #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
        #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
        #11 [ffff88027469fe70] worker_thread at ffffffff8105e251
        #12 [ffff88027469fed0] kthread at ffffffff81064260
        #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38


        crash>


        I have tested it on kernel 3.10, but from the source I can see
        that this case is still relevant for the latest Linux source.


        Can anybody tell me if this is an issue with the AER driver in Linux?




        Regards

        Gokul CG