From: John Garry <john.garry@huawei.com>
To: Tyler Baicar <tbaicar@codeaurora.org>, <marc.zyngier@arm.com>,
	<pbonzini@redhat.com>, <rkrcmar@redhat.com>,
	<linux@armlinux.org.uk>, <catalin.marinas@arm.com>,
	<will.deacon@arm.com>, <rjw@rjwysocki.net>, <lenb@kernel.org>,
	<matt@codeblueprint.co.uk>, <robert.moore@intel.com>,
	<lv.zheng@intel.com>, <nkaje@codeaurora.org>,
	<zjzhang@codeaurora.org>, <mark.rutland@arm.com>,
	<james.morse@arm.com>, <akpm@linux-foundation.org>,
	<eun.taik.lee@samsung.com>, <sandeepa.s.prabhu@gmail.com>,
	<shijie.huang@arm.com>, <rruigrok@codeaurora.org>,
	<paul.gortmaker@windriver.com>, <tomasz.nowicki@linaro.org>,
	<fu.wei@linaro.org>, <rostedt@goodmis.org>, <bristot@redhat.com>,
	<linux-arm-kernel@lists.infradead.org>,
	<kvmarm@lists.cs.columbia.edu>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>,
	<linux-efi@vger.kernel.org>, <Suzuki.Poulose@arm.com>,
	<punit.agrawal@arm.com>, <astone@redhat.com>,
	<harba@codeaurora.org>, <hanjun.guo@linaro.org>,
	Shiju Jose <shiju.jose@huawei.com>,
	Linuxarm <linuxarm@huawei.com>, Anurup M <anurup.m@huawei.com>
Subject: Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
Date: Tue, 22 Nov 2016 11:11:55 +0000	[thread overview]
Message-ID: <d5e199f2-c23e-9599-7ed7-e54475311a39@huawei.com> (raw)
In-Reply-To: <1479767763-27532-1-git-send-email-tbaicar@codeaurora.org>

We'll try and test this on our platform.

Cheers,
John

On 21/11/2016 22:35, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware-first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is an uncorrectable error, then the firmware will notify the OS
> immediately since the error needs to be handled ASAP. The OS will then be able
> to take the appropriate action needed such as offlining a page. If the error
> is a corrected error, then the firmware will not interrupt the OS immediately.
> Instead, the OS will see and report the error the next time its GHES timer
> expires. The kernel will first parse the GHES structures and report the errors
> through the kernel logs and then notify the user space through RAS trace
> events. This allows user space applications such as RAS Daemon to see the
> errors and report them however the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
>
> An example flow from firmware to user space could be:
>
>                  +---------------+
>        +-------->|               |
>        |         |  GHES polling |--+
> +-------------+  |    source     |  |   +---------------+   +------------+
> |             |  +---------------+  |   |  Kernel GHES  |   |            |
> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
> +-------------+  |               |  |   +---------------+   +------------+
>        |         |  GHES SCI     |--+
>        +-------->|   source      |
>                  +---------------+
>
> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
> capability for the OS to acknowledge the consumption of the error record
> generated by the Reliability, Availability and Serviceability (RAS) controller.
> This eliminates potential race conditions between the OS and the RAS controller.
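The GHESv2 acknowledgement described above amounts to a read-modify-write of the read-ack register. The following is a minimal user-space sketch of that step, not the code from the patch: the struct and function names are hypothetical, the register is modelled as a plain 64-bit variable rather than a mapped ACPI generic address, and the shift by the bit offset follows the V4 changelog note about read_ack_write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the GHESv2 read-ack fields. */
struct ghes_v2_ack {
	uint64_t *reg;       /* read-ack register (assumed 64-bit, already mapped) */
	uint8_t bit_offset;  /* bit offset of the ack field within the register */
	uint64_t preserve;   /* mask of bits that must be kept as they are */
	uint64_t write;      /* bits to set to acknowledge the record */
};

/* Acknowledge consumption of the current error record. */
void ghes_v2_ack_error(struct ghes_v2_ack *ack)
{
	uint64_t val;

	assert(ack->bit_offset < 64);

	val = *ack->reg;
	val &= ack->preserve << ack->bit_offset; /* keep only the preserved bits */
	val |= ack->write << ack->bit_offset;    /* set the acknowledge bits */
	*ack->reg = val;
}
```

With a register value of 0xF0F0, a bit offset of 4, a preserve mask of 0xF00 and a write value of 0x1, the register ends up as 0xF010. A real driver would go through the kernel's register accessors instead of dereferencing a pointer.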
>
> Add support for the timestamp field added to the Generic Error Data Entry v3,
> allowing the OS to log the time that the error is generated by the firmware,
> rather than the time the error is consumed. This improves the correctness of
> event sequences when analyzing error logs. The timestamp is added in
> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
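As a rough illustration of consuming that timestamp, here is a sketch of decoding an 8-byte BCD field into calendar values. The byte layout (seconds, minutes, hours, a flags byte, day, month, year, century) is an assumption based on the V2 changelog note about splitting year and century in BCD; check it against the ACPI 6.1 table before relying on it. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of the firmware timestamp. */
struct err_time {
	int year, mon, day, hour, min, sec;
};

/* Convert one packed BCD byte (two decimal digits) to binary. */
static unsigned int bcd_to_bin(uint8_t v)
{
	assert((v >> 4) < 10 && (v & 0x0f) < 10);
	return (v >> 4) * 10 + (v & 0x0f);
}

/*
 * Assumed layout: ts[0] seconds, ts[1] minutes, ts[2] hours,
 * ts[3] flags (ignored here), ts[4] day, ts[5] month,
 * ts[6] year within century, ts[7] century.
 */
struct err_time decode_gdata_timestamp(const uint8_t ts[8])
{
	struct err_time t;

	t.sec = bcd_to_bin(ts[0]);
	t.min = bcd_to_bin(ts[1]);
	t.hour = bcd_to_bin(ts[2]);
	t.day = bcd_to_bin(ts[4]);
	t.mon = bcd_to_bin(ts[5]);
	t.year = bcd_to_bin(ts[7]) * 100 + bcd_to_bin(ts[6]);
	return t;
}
```

Under that assumed layout, the bytes 55 11 11 00 22 11 16 20 (hex) decode to 2016-11-22 11:11:55.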
>
> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
> specification. ARMv8 specific processor error information is reported as part of
> the CPER records. This provides more detail for processor error logs and
> can help describe ARMv8 cache, TLB, and bus errors.
>
> Synchronous External Abort (SEA) represents a specific processor error condition
> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
> added to parse and report the errors before the process is killed. Refer to
> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
> specification.
>
> Currently the kernel ignores CPER records that are unrecognized.
> However, the UEFI spec allows non-standard (e.g. vendor
> proprietary) error section types in CPER (Common Platform Error Record),
> as defined in section N2.3 of UEFI version 2.5. As a result, the user
> is not able to see hardware error data from non-standard sections.
>
> If the Section Type field of a Generic Error Data Entry is unrecognized,
> print out the raw data in the dmesg buffer, and also add a tracepoint
> for reporting such hardware errors.
>
> Currently, even if an error status block's severity is fatal, the kernel
> does not honor the severity level and does not panic. With the
> firmware-first model, the platform could inform the OS about a fatal
> hardware error through the non-NMI GHES notification type. The OS should
> panic when a hardware error record is received with this severity.
>
> Add support to handle SEAs that occur while a KVM guest kernel is
> running. Currently these are unsupported by the guest abort handling.
>
> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>             https://lkml.org/lkml/2016/8/10/231
>
> V5: Fix GHES goto logic for error conditions
>     Change ghes_do_read_ack to ghes_ack_error
>     Make sure data version check is >= 3
>     Use CPER helper functions in print functions
>     Make handle_guest_sea() dummy function static for arm
>     Add arm to subject line for KVM patch
>
> V4: Add bit offset left shift to read_ack_write value
>     Make HEST generic and generic_v2 structures a union in the ghes structure
>     Move gdata v3 helper functions into ghes.h to avoid duplication
>     Reorder the timestamp print and avoid memcpy
>     Add helper functions for gdata size checking
>     Rename the SEA functions
>     Add helper function for GHES panics
>     Set fru_id to NULL UUID at variable declaration
>     Limit ARM trace event parameters to the needed structures
>     Reorder the ARM trace event variables to save space
>     Add comment for why we don't pass SEAs to the guest when it aborts
>     Move ARM trace event call into GHES driver instead of CPER
>
> V3: Fix unmapped address to the read_ack_register in ghes.c
>     Add helper function to get the proper payload based on generic data entry
>      version
>     Move timestamp print to avoid changing function calls in cper.c
>     Remove patch "arm64: exception: handle instruction abort at current EL"
>      since the el1_ia handler is already added in 4.8
>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>     Add a new trace event for ARM type errors
>     Add support to handle KVM guest SEAs
>
> V2: Add PSCI state print for the ARMv8 error type.
>     Separate timestamp year into year and century using BCD format.
>     Rebase on top of ACPICA 20160318 release and remove header file changes
>      in include/acpi/actbl1.h.
>     Add panic OS with fatal error status block patch.
>     Add processing of unrecognized CPER error section patches with updates
>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>
> V1: https://lkml.org/lkml/2016/2/5/544
>
> Jonathan (Zhixiong) Zhang (1):
>   acpi: apei: panic OS with fatal error status block
>
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARMv8 processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support
>
>  arch/arm/include/asm/kvm_arm.h       |   1 +
>  arch/arm/include/asm/system_misc.h   |   5 +
>  arch/arm/kvm/mmu.c                   |  18 ++-
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>  arch/arm64/include/asm/system_misc.h |  15 +++
>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>  drivers/acpi/apei/Kconfig            |  14 +++
>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c             |   7 +-
>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>  drivers/ras/ras.c                    |   2 +
>  include/acpi/ghes.h                  |  15 ++-
>  include/linux/cper.h                 |  84 ++++++++++++++
>  include/ras/ras_event.h              | 100 +++++++++++++++++
>  15 files changed, 688 insertions(+), 44 deletions(-)
>
