From: Thomas Tai <thomas.tai@oracle.com>
To: gokul cg <gokuljnpr@gmail.com>
Cc: linux-pci@vger.kernel.org
Subject: Re: Possible race condition in the kernel between PCI driver and AER handling
Date: Wed, 1 Aug 2018 10:17:23 -0400
Message-ID: <9c3bd9bf-d170-2661-2f53-e8ede9d19927@oracle.com>
In-Reply-To: <CAFP4jM84BuVZ+yKvcVvbu_6LFoO+2iVivWWZTZgbEEy3=oMftg@mail.gmail.com>



On 08/01/2018 01:42 AM, gokul cg wrote:
> Hi Thomas,
> 
> On my hardware there is an I2C power-control chip for the PCI card. I just 
> powered the card down using an I2C command.

Hi Gokul,
I see. That is why we normally don't see this issue. Let me dig around 
to see if we have a machine on which we can do something similar.

Thomas

> 
> Regards,
> Gokul
> 
> On Tue, Jul 31, 2018 at 6:45 PM, Thomas Tai <thomas.tai@oracle.com> wrote:
> 
> 
> 
>     On 07/31/2018 08:42 AM, gokul cg wrote:
> 
>         Hi All,
> 
> 
>         I suspect a possible race condition in the kernel between the
>         PCI driver and AER handling.
> 
>         Because of it, a kernel panic occurs in the worker thread
>         that handles the bottom half of the AER IRQ.
> 
> 
>         I see this issue when I suddenly power off a PCI card that
>         supports PCIe AER error reporting and has it enabled.
> 
>         While the PCI device is powering off, the AER driver gets an
>         AER IRQ for the device. The AER IRQ handler caches the AER
>         error code and schedules a worker thread to handle the error.
> 
> 
>     Hi Gokul,
> 
>     It may be an issue in the AER driver. How do you power off your
>     device? I've never seen this issue with a normal shutdown or with
>     "echo 0 > /sys/bus/pci/slots/xx/power".
> 
>     Cheers,
>     Thomas
> 
> 
> 
>         The PCIe device gets removed from the PCI tree before the worker
>         thread completes its task, and the kernel panics when the
>         worker thread tries to access the PCI device's config space.
> 
> 
> 
>         Issue:
> 
> 
>         crash>
>         crash> bt
>         PID: 2727   TASK: ffff880272adc530  CPU: 0   COMMAND: "kworker/0:2"
>          #0 [ffff88027469fac8] machine_kexec at ffffffff8102cf18
>          #1 [ffff88027469fb28] crash_kexec at ffffffff810a6b05
>          #2 [ffff88027469fbf0] oops_end at ffffffff8176d960
>          #3 [ffff88027469fc18] die at ffffffff810060db
>          #4 [ffff88027469fc48] do_general_protection at ffffffff8176d452
>          #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
>               [exception RIP: pci_bus_read_config_dword+100]
>               RIP: ffffffff813405f4  RSP: ffff88027469fd20  RFLAGS: 00010046
>               RAX: 435f494350006963  RBX: ffff880274892000  RCX: 0000000000000004
>               RDX: 0000000000000100  RSI: 0000000000000060  RDI: ffff880274892000
>               RBP: ffff88027469fd48   R8: ffff88027469fd2c   R9: 00000000000012c0
>               R10: 0000000000000006  R11: 00000000000012bf  R12: ffff88027469fd5c
>               R13: 0000000000000246  R14: 0000000000000000  R15: ffff8802741a4000
>               ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>          #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
>          #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
>          #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
>          #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
>         #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
>         #11 [ffff88027469fe70] worker_thread at ffffffff8105e251
>         #12 [ffff88027469fed0] kthread at ffffffff81064260
>         #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
>         crash>
> 
> 
>         I have tested this on kernel 3.10, but from reading the source
>         it looks like this case is still relevant for the latest Linux
>         source.
> 
> 
>         Can anybody tell me whether this is an issue with the AER driver
>         in Linux?
> 
> 
> 
> 
>         Regards
> 
>         Gokul CG
> 
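The ordering described in the quoted report (the IRQ handler queues work, the device is removed, and only then does the worker run) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct fake_dev, fake_dev_get(), fake_dev_put() and the other names below are hypothetical stand-ins, not the kernel's data structures. The kernel does provide the reference-count helpers pci_dev_get() and pci_dev_put(), but nothing here claims to show how the AER driver uses them or how the issue was eventually addressed. The model only shows why a worker that pins the device and checks for presence cannot touch freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct fake_dev {
    int refcount;        /* held by the "bus" and by any pending work item */
    bool present;        /* cleared when the device is removed */
    uint32_t aer_status; /* pretend config-space register */
};

/* Hypothetical deferred-work item queued by the interrupt handler. */
struct fake_work {
    struct fake_dev *dev;
};

static int freed_count; /* number of devices actually freed, for the demo */

static struct fake_dev *fake_dev_create(uint32_t status)
{
    struct fake_dev *d = calloc(1, sizeof(*d));
    assert(d != NULL);
    d->refcount = 1; /* the reference owned by the bus */
    d->present = true;
    d->aer_status = status;
    return d;
}

static struct fake_dev *fake_dev_get(struct fake_dev *d)
{
    d->refcount++;
    return d;
}

static void fake_dev_put(struct fake_dev *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        free(d);
        freed_count++;
    }
}

/* Top half: remember the error source and pin it until the worker runs. */
static void irq_top_half(struct fake_work *w, struct fake_dev *d)
{
    w->dev = fake_dev_get(d);
}

/* Surprise removal: mark the device gone and drop the bus reference. */
static void remove_device(struct fake_dev *d)
{
    d->present = false;
    fake_dev_put(d);
}

/* Bottom half: returns 0 and fills *status, or -1 if the device is gone. */
static int worker_bottom_half(struct fake_work *w, uint32_t *status)
{
    struct fake_dev *d = w->dev;
    int ret = -1;

    if (d->present) {
        *status = d->aer_status; /* safe: the memory is still pinned */
        ret = 0;
    }
    fake_dev_put(d); /* may free the device if it was already removed */
    w->dev = NULL;
    return ret;
}
```

Calling irq_top_half(), then remove_device(), then worker_bottom_half() reproduces the reported ordering: the removal does not free the device because the work item still holds a reference, and the worker returns -1 instead of reading the register. Without the extra reference taken in irq_top_half(), remove_device() would free the structure and the worker would dereference a dangling pointer.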
> 
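One detail in the quoted backtrace can be checked mechanically: the RAX value 435f494350006963 does not look like a kernel pointer. The helper below is a minimal sketch (reg_to_ascii(), is_canonical() and check_rax_from_backtrace() are names made up for this note) that prints a register's bytes in memory order and tests whether the value is a canonical x86-64 address, assuming 48-bit virtual addresses. Whether RAX was the pointer actually being dereferenced at the faulting instruction is an assumption; the thread does not include a disassembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Render the 8 bytes of a 64-bit value in little-endian (memory) order.
 * Printable ASCII is copied as-is, everything else becomes '.'.
 * out must have room for 9 bytes.
 */
static void reg_to_ascii(uint64_t reg, char out[9])
{
    for (int i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)(reg >> (8 * i));
        out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
    }
    out[8] = '\0';
}

/*
 * With 48-bit virtual addresses, an x86-64 address is canonical when
 * bits 63..47 are all zeros or all ones.
 */
static int is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1ffff;
}

/* Self-check using the register values from the quoted backtrace. */
static int check_rax_from_backtrace(void)
{
    char buf[9];

    reg_to_ascii(0x435f494350006963ULL, buf);
    assert(strcmp(buf, "ci.PCI_C") == 0);          /* RAX reads as text */
    assert(!is_canonical(0x435f494350006963ULL));  /* RAX is not an address */
    assert(is_canonical(0xffff880274892000ULL));   /* RDI looks like one */
    return 0;
}
```

RAX decodes to the bytes "ci", a NUL, and "PCI_C", which is string data rather than an address, and it is non-canonical. An access through a non-canonical address raises a general protection fault on x86-64, which matches the do_general_protection frame. One possible reading, consistent with the report but not confirmed in this thread, is that memory belonging to the removed device had already been freed and reused by the time the worker ran.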
