From: "Baicar, Tyler" <tbaicar@codeaurora.org>
To: Sinan Kaya <okaya@codeaurora.org>, Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>,
rjw@rjwysocki.net, lenb@kernel.org, will.deacon@arm.com,
james.morse@arm.com, prarit@redhat.com, punit.agrawal@arm.com,
shiju.jose@huawei.com, andriy.shevchenko@linux.intel.com,
linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
Linux PCI <linux-pci@vger.kernel.org>,
Huang Ying <ying.huang@intel.com>
Subject: Re: [PATCH] acpi: apei: call into AER handling regardless of severity
Date: Wed, 30 Aug 2017 09:42:08 -0600
Message-ID: <a074792e-ad43-d4f5-6f99-8fcf8349c7ad@codeaurora.org>
In-Reply-To: <af5dc902-faca-a7a1-6781-95ff82d5d8fd@codeaurora.org>

On 8/30/2017 9:31 AM, Sinan Kaya wrote:
> On 8/30/2017 11:16 AM, Borislav Petkov wrote:
>> On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:
>>> Link reset is not the only recovery mechanism. In the case of nonfatal
>>> errors, it is assumed that the endpoint CSR is still reachable.
>>> The error is propagated to the PCIe endpoint driver. The endpoint driver
>>> does a re-initialization and we are back in business.
>> I'm assuming that's broadcast_error_message()'s job.
>>
> That's right. Each driver provides an err_handler hook, and the broadcast
> function calls these.
>
> static struct pci_driver e1000_driver = {
>         ...
>         .err_handler = &e1000_err_handler,
> };
>
> struct pci_error_handlers {
>         ...
>         pci_ers_result_t (*error_detected)(struct pci_dev *dev,
>                                            enum pci_channel_state error);
> };
>
>
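To illustrate the hook-and-broadcast pattern described above, here is a small user-space model. It is only a sketch: every name in it (model_dev, model_err_handlers, model_broadcast, the enum values) is made up for this example and is not a kernel definition, and keeping the highest vote is a simplification of how the AER code merges driver results.

```c
#include <stddef.h>

/* Simplified stand-ins for this sketch, not the kernel's types. */
enum chan_state { CHAN_IO_NORMAL, CHAN_IO_FROZEN };
enum ers_result { ERS_NONE, ERS_CAN_RECOVER, ERS_NEED_RESET };

struct model_dev;

/* Per-driver table of error callbacks (hypothetical analogue of err_handler). */
struct model_err_handlers {
	enum ers_result (*error_detected)(struct model_dev *dev,
					  enum chan_state state);
};

struct model_dev {
	const char *name;
	const struct model_err_handlers *err_handler; /* NULL: no hook */
};

/* Example driver callback: ask for a reset only if the channel is frozen. */
static enum ers_result model_nic_error_detected(struct model_dev *dev,
						enum chan_state state)
{
	(void)dev;
	return state == CHAN_IO_FROZEN ? ERS_NEED_RESET : ERS_CAN_RECOVER;
}

static const struct model_err_handlers model_nic_err_handler = {
	.error_detected = model_nic_error_detected,
};

/*
 * Call error_detected() on every device that registered a hook and keep
 * the most demanding answer. Devices without a hook are skipped here,
 * which is an assumption of this sketch.
 */
static enum ers_result model_broadcast(struct model_dev *devs, size_t n,
				       enum chan_state state)
{
	enum ers_result result = ERS_NONE;

	for (size_t i = 0; i < n; i++) {
		const struct model_err_handlers *h = devs[i].err_handler;
		enum ers_result vote;

		if (!h || !h->error_detected)
			continue;
		vote = h->error_detected(&devs[i], state);
		if (vote > result)
			result = vote;
	}
	return result;
}
```

With one device that has the hook and one that does not, the model returns the hooked driver's answer for either channel state.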
>>> That's not true. The GHES code is changing the severity here before posting
>>> to the AER driver in ghes_do_proc().
>>>
>>> if (gdata->flags & CPER_SEC_RESET)
>>>         aer_severity = AER_FATAL;
>> You're missing the point that we would walk into that if branch *only* for
>>
>> if (sev == GHES_SEV_RECOVERABLE &&
>> sec_sev == GHES_SEV_RECOVERABLE
>>
>> severities. So if you have an AER_FATAL error but ghes severities are
>> not GHES_SEV_RECOVERABLE, nothing happens.
> I see. We should probably try to do something only if GHES_SEV_CORRECTED or
> GHES_SEV_RECOVERABLE.
>
> If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
> in doing additional work.
See below.
>>> No, AER ISR is not set up if firmware first is enabled.
>> So then this is a major suckage. We do AER recovery on FF systems only
>> for GHES_SEV_RECOVERABLE severity.
>>
>>> The behavior should match non firmware-first case ideally.
>>>
>>> 1. Print all correctable errors.
>>> 2. Go to do_recovery for all uncorrectable errors including fatal and
>>> non-fatal.
>>>
>>> This is also what AER driver does in the absence of firmware first via
>>> handle_error_source().
>> Yes, that makes sense.
>>
>> Which would mean that we'd call aer_recover_queue() regardless of GHES
>> severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
>> or CPER_SEC_RESET. I.e., we can communicate all that by setting the
>> correct AER severity before calling aer_recover_queue(). And then call
>> do_recovery() based on AER severity.
>>
>> Hmmm?
>>
> Sounds good. Do you still want to do PCIe recovery in the case of
> GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?
>
We do not need to worry about the GHES_SEV_PANIC case. Those errors get sent
to __ghes_panic() in ghes_proc() without ever making it to ghes_do_proc().
They are just printed and then the kernel panics.

I think with my two patches we will have the desired functionality:

GHES_SEV_CORRECTED -> AER_CORRECTABLE -> Print AER info, but do not call do_recovery
GHES_SEV_RECOVERABLE -> AER_NONFATAL -> Print AER info and do_recovery
GHES_SEV_RECOVERABLE and CPER_SEC_RESET -> AER_FATAL -> Print AER info and do_recovery
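
As a rough illustration of the table above, the mapping could be modelled like this. This is a stand-alone sketch, not the patch itself: the enums are local stand-ins whose values do not match the kernel's, and map_ghes_to_aer() and struct aer_decision are hypothetical names used only here.

```c
#include <stdbool.h>

/* Local stand-ins for this sketch; values do not match the kernel's. */
enum ghes_sev {
	GHES_SEV_NO,
	GHES_SEV_CORRECTED,
	GHES_SEV_RECOVERABLE,
	GHES_SEV_PANIC,
};

enum aer_sev { AER_NONFATAL, AER_FATAL, AER_CORRECTABLE };

/* What the GHES side would decide for one PCIe error section. */
struct aer_decision {
	bool forward;     /* hand the error to the AER code (print info) */
	enum aer_sev sev; /* AER severity to report; valid if forward */
	bool recover;     /* whether do_recovery should run */
};

/*
 * sec_reset models the CPER_SEC_RESET flag of the section.
 * GHES_SEV_PANIC is handled before this point (the kernel panics), and
 * GHES_SEV_NO is not covered by the table, so neither is forwarded in
 * this sketch.
 */
static struct aer_decision map_ghes_to_aer(enum ghes_sev sev, bool sec_reset)
{
	struct aer_decision d = { false, AER_CORRECTABLE, false };

	switch (sev) {
	case GHES_SEV_CORRECTED:
		d.forward = true;
		d.sev = AER_CORRECTABLE;
		break;
	case GHES_SEV_RECOVERABLE:
		d.forward = true;
		d.sev = sec_reset ? AER_FATAL : AER_NONFATAL;
		d.recover = true;
		break;
	default:
		break;
	}
	return d;
}
```

For example, map_ghes_to_aer(GHES_SEV_RECOVERABLE, true) yields AER_FATAL with recovery requested, while GHES_SEV_CORRECTED is forwarded for printing only.
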
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.