* [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal
  Cc: Tyler Baicar

When a memory error, CPU error, PCIe error, or other type of hardware error
that's covered by RAS occurs, firmware should populate the shared GHES memory
location with the proper GHES structures to notify the OS of the error.
For example, platforms that implement firmware-first handling may implement
separate GHES sources for corrected errors and uncorrected errors. If the
error is uncorrectable, the firmware will notify the OS immediately since the
error needs to be handled ASAP. The OS will then be able to take the
appropriate action, such as offlining a page. If the error is a corrected
error, the firmware will not interrupt the OS immediately. Instead, the OS
will see and report the error the next time its GHES timer expires. The
kernel will first parse the GHES structures and report the errors through the
kernel logs, and then notify user space through RAS trace events. This allows
user space applications such as RAS Daemon to see the errors and report them
however the user desires. This patchset extends the kernel functionality for
RAS errors based on updates in the UEFI 2.6 and ACPI 6.1 specifications.

An example flow from firmware to user space could be:

                 +---------------+
       +-------->|               |
       |         |  GHES polling |--+
+-------------+  |    source     |  |   +---------------+   +------------+
|             |  +---------------+  |   |  Kernel GHES  |   |            |
|  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
|             |  +---------------+  |   |  EDAC drivers |   |   event    |
+-------------+  |               |  |   +---------------+   +------------+
       |         |  GHES sci     |--+
       +-------->|   source      |
                 +---------------+

Add support for Generic Hardware Error Source (GHES) v2, which introduces the
capability for the OS to acknowledge the consumption of the error record
generated by the Reliability, Availability and Serviceability (RAS) controller.
This eliminates potential race conditions between the OS and the RAS controller.
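
As a rough illustration of the acknowledgment write, here is a minimal
userspace sketch of how the value written back to the read-ack register can be
derived from its current contents. The struct and function names are
hypothetical stand-ins, not the kernel's types; only the mask-and-shift logic
mirrors ghes_ack_error() in patch 01.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the GHESv2 read-ack fields. */
struct read_ack_desc {
	uint8_t  bit_offset;	/* position of the ack field in the register */
	uint64_t preserve;	/* mask of bits the OS must leave unchanged */
	uint64_t write;		/* bits the OS sets to signal consumption */
};

/* Compute the value to write back, given the current register value. */
static uint64_t read_ack_value(uint64_t current, const struct read_ack_desc *d)
{
	assert(d->bit_offset < 64);

	/* Keep only the bits covered by the shifted preserve mask ... */
	current &= d->preserve << d->bit_offset;
	/* ... then set the acknowledgment bits at the same offset. */
	current |= d->write << d->bit_offset;
	return current;
}
```

Note that bits below bit_offset are cleared by the shifted preserve mask; this
matches the arithmetic in the patch, but whether that is the desired behavior
for a given platform depends on how its firmware fills in the HEST entry.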

Add support for the timestamp field added to the Generic Error Data Entry v3,
allowing the OS to log the time at which the error was generated by the
firmware, rather than the time the error was consumed. This improves the
correctness of event sequences when analyzing error logs. The timestamp was
added in ACPI 6.1; see Table 18-343, Generic Error Data Entry.
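
To make the timestamp handling concrete, here is a small sketch of decoding an
8-byte BCD timestamp into calendar fields. The byte layout used here (seconds,
minutes, hours, flags, day, month, year, century) is my reading of the spec and
should be treated as an assumption to be checked against Table 18-343; the type
and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of the firmware timestamp. */
struct err_time {
	int year, month, day, hour, minute, second;
};

/* Convert one packed BCD byte (two decimal digits) to an integer. */
static int bcd_to_int(uint8_t bcd)
{
	assert((bcd & 0x0f) <= 9 && (bcd >> 4) <= 9);
	return (bcd >> 4) * 10 + (bcd & 0x0f);
}

/*
 * Assumed layout: ts[0]=seconds, ts[1]=minutes, ts[2]=hours, ts[3]=flags,
 * ts[4]=day, ts[5]=month, ts[6]=year, ts[7]=century. ts[3] is not decoded.
 */
static struct err_time decode_timestamp(const uint8_t ts[8])
{
	struct err_time t;

	t.second = bcd_to_int(ts[0]);
	t.minute = bcd_to_int(ts[1]);
	t.hour   = bcd_to_int(ts[2]);
	t.day    = bcd_to_int(ts[4]);
	t.month  = bcd_to_int(ts[5]);
	/* Year and century are separate BCD bytes, combined here. */
	t.year   = bcd_to_int(ts[7]) * 100 + bcd_to_int(ts[6]);
	return t;
}
```

Keeping century and year as separate BCD bytes is what the V2 changelog entry
"Separate timestamp year into year and century using BCD format" refers to.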

Add support for the ARMv8 Common Platform Error Record (CPER) per the UEFI 2.6
specification. ARMv8-specific processor error information is reported as part
of the CPER records. This provides more detail in processor error logs and
can help describe ARMv8 cache, TLB, and bus errors.

Synchronous External Abort (SEA) represents a specific processor error condition
in ARM systems. A handler is added to recognize SEA errors, and a notifier is
added to parse and report the errors before the process is killed. Refer to
section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
specification.

Currently the kernel ignores CPER records that are unrecognized.
However, the UEFI spec allows for non-standard (e.g. vendor
proprietary) error section types in CPER (Common Platform Error Record),
as defined in section N.2.3 of UEFI version 2.5. As a result, the user
is not able to see hardware error data from non-standard sections.

If the Section Type field of a Generic Error Data Entry is unrecognized,
print the raw data to the dmesg buffer, and also add a tracepoint
for reporting such hardware errors.
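
A simplified sketch of that fallback path: look up the 16-byte section type in
a table of known types and, if nothing matches, format the payload as hex.
The table entries below are placeholders, NOT real CPER GUIDs, and the helper
names are invented for this example; the kernel uses its own GUID types and
print helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEC_TYPE_LEN 16

/* Placeholder section types for illustration only (not real GUIDs). */
static const uint8_t known_types[][SEC_TYPE_LEN] = {
	{ 0x01 },
	{ 0x02 },
};

/* Return 1 if the section type matches an entry in known_types. */
static int section_is_known(const uint8_t type[SEC_TYPE_LEN])
{
	size_t i;

	for (i = 0; i < sizeof(known_types) / sizeof(known_types[0]); i++)
		if (memcmp(type, known_types[i], SEC_TYPE_LEN) == 0)
			return 1;
	return 0;
}

/*
 * Format data as space-separated hex bytes into out. Stops early if the
 * buffer is too small. Returns the number of input bytes formatted.
 */
static size_t hex_dump(const uint8_t *data, size_t len,
		       char *out, size_t out_len)
{
	size_t i, pos = 0;

	assert(out_len > 0);
	out[0] = '\0';
	for (i = 0; i < len; i++) {
		/* Worst case per byte: separator + 2 digits + NUL. */
		if (pos + 4 > out_len)
			break;
		pos += (size_t)snprintf(out + pos, out_len - pos,
					i ? " %02x" : "%02x", data[i]);
	}
	return i;
}
```

A caller would use section_is_known() to choose between a structured decoder
and hex_dump(), so vendor-specific sections still leave a trace in the log.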

Currently, even if an error status block's severity is fatal, the kernel
does not honor the severity level by panicking. With the firmware-first
model, the platform could inform the OS about a fatal hardware error
through a non-NMI GHES notification type. The OS should panic when a
hardware error record is received with this severity.
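
The intended policy can be summarized as a mapping from severity to action,
independent of which notification type delivered the record. The enum names
below are illustrative and do not correspond to the kernel's GHES_SEV_*
constants or to any function in this series.

```c
#include <assert.h>

/* Illustrative severity levels, ordered from least to most severe. */
enum err_severity { SEV_NONE, SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL };

/* Illustrative actions the OS can take for a record. */
enum err_action { ACT_LOG_ONLY, ACT_RECOVER, ACT_PANIC };

/* Choose an action from the severity alone, not from the notification type. */
static enum err_action action_for(enum err_severity sev)
{
	assert(sev <= SEV_FATAL);

	switch (sev) {
	case SEV_FATAL:
		return ACT_PANIC;	/* fatal records always panic */
	case SEV_RECOVERABLE:
		return ACT_RECOVER;	/* e.g. offline the affected page */
	default:
		return ACT_LOG_ONLY;	/* corrected or informational */
	}
}
```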

Add support to handle SEAs that occur while a KVM guest kernel is
running. Currently these are unsupported by the guest abort handling.

Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
            https://lkml.org/lkml/2016/8/10/231

V5: Fix GHES goto logic for error conditions
    Change ghes_do_read_ack to ghes_ack_error
    Make sure data version check is >= 3
    Use CPER helper functions in print functions
    Make handle_guest_sea() dummy function static for arm
    Add arm to subject line for KVM patch

V4: Add bit offset left shift to read_ack_write value
    Make HEST generic and generic_v2 structures a union in the ghes structure
    Move gdata v3 helper functions into ghes.h to avoid duplication
    Reorder the timestamp print and avoid memcpy
    Add helper functions for gdata size checking
    Rename the SEA functions
    Add helper function for GHES panics
    Set fru_id to NULL UUID at variable declaration
    Limit ARM trace event parameters to the needed structures
    Reorder the ARM trace event variables to save space
    Add comment for why we don't pass SEAs to the guest when it aborts
    Move ARM trace event call into GHES driver instead of CPER

V3: Fix unmapped address to the read_ack_register in ghes.c
    Add helper function to get the proper payload based on generic data entry
     version
    Move timestamp print to avoid changing function calls in cper.c
    Remove patch "arm64: exception: handle instruction abort at current EL"
     since the el1_ia handler is already added in 4.8
    Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
    Add a new trace event for ARM type errors
    Add support to handle KVM guest SEAs

V2: Add PSCI state print for the ARMv8 error type.
    Separate timestamp year into year and century using BCD format.
    Rebase on top of ACPICA 20160318 release and remove header file changes
     in include/acpi/actbl1.h.
    Add panic OS with fatal error status block patch.
    Add processing of unrecognized CPER error section patches with updates
     from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646

V1: https://lkml.org/lkml/2016/2/5/544

Jonathan (Zhixiong) Zhang (1):
  acpi: apei: panic OS with fatal error status block

Tyler Baicar (9):
  acpi: apei: read ack upon ghes record consumption
  ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  efi: parse ARMv8 processor error
  arm64: exception: handle Synchronous External Abort
  acpi: apei: handle SEA notification type for ARMv8
  efi: print unrecognized CPER section
  ras: acpi / apei: generate trace event for unrecognized CPER section
  trace, ras: add ARM processor error trace event
  arm/arm64: KVM: add guest SEA support

 arch/arm/include/asm/kvm_arm.h       |   1 +
 arch/arm/include/asm/system_misc.h   |   5 +
 arch/arm/kvm/mmu.c                   |  18 ++-
 arch/arm64/Kconfig                   |   1 +
 arch/arm64/include/asm/kvm_arm.h     |   1 +
 arch/arm64/include/asm/system_misc.h |  15 +++
 arch/arm64/mm/fault.c                |  71 ++++++++++--
 drivers/acpi/apei/Kconfig            |  14 +++
 drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
 drivers/acpi/apei/hest.c             |   7 +-
 drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
 drivers/ras/ras.c                    |   2 +
 include/acpi/ghes.h                  |  15 ++-
 include/linux/cper.h                 |  84 ++++++++++++++
 include/ras/ras_event.h              | 100 +++++++++++++++++
 15 files changed, 688 insertions(+), 44 deletions(-)

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.



* [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with OS
execution, and may generate error records for consumption by
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.

The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.
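
The handshake can be modeled in a few lines. This is a single-threaded toy
model under the assumption of one outstanding record per error source; the
names are hypothetical, and real firmware and OS code would also need proper
memory ordering on the shared region.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one error source shared by firmware and the OS. */
struct err_channel {
	int  record;	/* stand-in for the error status block */
	bool pending;	/* record written but not yet acknowledged */
};

/* Firmware side: refuse to overwrite a record the OS has not acked. */
static bool fw_post(struct err_channel *ch, int record)
{
	if (ch->pending)
		return false;	/* must wait for the acknowledgment */
	ch->record = record;
	ch->pending = true;
	return true;
}

/* OS side: consume the record, then acknowledge it. */
static int os_consume(struct err_channel *ch)
{
	int record;

	assert(ch->pending);
	record = ch->record;
	ch->pending = false;	/* models the write to the read-ack register */
	return record;
}
```

Without the pending check in fw_post(), a second error arriving before
os_consume() runs would silently replace the first, which is the overwrite
this patch is meant to prevent.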

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
 drivers/acpi/apei/hest.c |  7 +++++--
 include/acpi/ghes.h      |  5 ++++-
 3 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 60746ef..b79abc5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,6 +45,7 @@
 #include <linux/aer.h>
 #include <linux/nmi.h>
 
+#include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
@@ -79,6 +80,10 @@
 	((struct acpi_hest_generic_status *)				\
 	 ((struct ghes_estatus_node *)(estatus_node) + 1))
 
+#define HEST_TYPE_GENERIC_V2(ghes)				\
+	((struct acpi_hest_header *)ghes->generic)->type ==	\
+	 ACPI_HEST_TYPE_GENERIC_ERROR_V2
+
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
 	if (!ghes)
 		return ERR_PTR(-ENOMEM);
+
 	ghes->generic = generic;
+	if (HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = apei_map_generic_address(
+			&ghes->generic_v2->read_ack_register);
+		if (rc)
+			goto err_free;
+	}
+
 	rc = apei_map_generic_address(&generic->error_status_address);
 	if (rc)
-		goto err_free;
+		goto err_unmap_read_ack_addr;
 	error_block_length = generic->error_block_length;
 	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
 		pr_warning(FW_WARN GHES_PFX
@@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
 	if (!ghes->estatus) {
 		rc = -ENOMEM;
-		goto err_unmap;
+		goto err_unmap_status_addr;
 	}
 
 	return ghes;
 
-err_unmap:
+err_unmap_status_addr:
 	apei_unmap_generic_address(&generic->error_status_address);
+err_unmap_read_ack_addr:
+	if (HEST_TYPE_GENERIC_V2(ghes))
+		apei_unmap_generic_address(
+			&ghes->generic_v2->read_ack_register);
 err_free:
 	kfree(ghes);
 	return ERR_PTR(rc);
@@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
 {
 	kfree(ghes->estatus);
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
+	if (HEST_TYPE_GENERIC_V2(ghes))
+		apei_unmap_generic_address(
+			&ghes->generic_v2->read_ack_register);
 }
 
 static inline int ghes_severity(int severity)
@@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
 	rcu_read_unlock();
 }
 
+static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
+{
+	int rc;
+	u64 val = 0;
+
+	rc = apei_read(&val, &generic_v2->read_ack_register);
+	if (rc)
+		return rc;
+	val &= generic_v2->read_ack_preserve <<
+		generic_v2->read_ack_register.bit_offset;
+	val |= generic_v2->read_ack_write <<
+		generic_v2->read_ack_register.bit_offset;
+	rc = apei_write(val, &generic_v2->read_ack_register);
+
+	return rc;
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
 	ghes_do_proc(ghes, ghes->estatus);
+
+	if (HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = ghes_ack_error(ghes->generic_v2);
+		if (rc)
+			return rc;
+	}
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
index 792a0d9..ef725a9 100644
--- a/drivers/acpi/apei/hest.c
+++ b/drivers/acpi/apei/hest.c
@@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
 	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
 	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
 	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
+	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
 };
 
 static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
@@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
 {
 	int *count = data;
 
-	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
+	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		(*count)++;
 	return 0;
 }
@@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
 	struct ghes_arr *ghes_arr = data;
 	int rc, i;
 
-	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
+	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		return 0;
 
 	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 720446c..68f088a 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -13,7 +13,10 @@
 #define GHES_EXITING		0x0002
 
 struct ghes {
-	struct acpi_hest_generic *generic;
+	union {
+		struct acpi_hest_generic *generic;
+		struct acpi_hest_generic_v2 *generic_v2;
+	};
 	struct acpi_hest_generic_status *estatus;
 	u64 buffer_paddr;
 	unsigned long flags;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption
@ 2016-11-21 22:35   ` Tyler Baicar
  0 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: linux-arm-kernel

A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with OS
execution, and may generate error records for consumption by
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.

The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.
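
As a rough illustration of the acknowledgment step described above, the
read-modify-write on the Read Ack Register can be sketched in plain C. The
struct and function names here are hypothetical stand-ins, not kernel types;
only the mask-then-set order follows what this patch does in ghes_ack_error().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GHESv2 read-ack fields (not the kernel struct). */
struct read_ack_desc {
	uint8_t  bit_offset;	/* position of the ack field inside the register */
	uint64_t preserve;	/* mask of bits to keep from the current value */
	uint64_t write;		/* bits to set in order to acknowledge the record */
};

/*
 * Compute the value to write back, given the register's current contents.
 * Bits below bit_offset end up cleared, because the shifted preserve mask
 * has zeros in those positions.
 */
static uint64_t read_ack_value(uint64_t current, const struct read_ack_desc *d)
{
	assert(d->bit_offset < 64);	/* shifting by 64 or more is undefined in C */

	current &= d->preserve << d->bit_offset;
	current |= d->write << d->bit_offset;
	return current;
}
```

A caller would read the register, pass the result through read_ack_value(),
and write the returned value back to the same register.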

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
 drivers/acpi/apei/hest.c |  7 +++++--
 include/acpi/ghes.h      |  5 ++++-
 3 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 60746ef..b79abc5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,6 +45,7 @@
 #include <linux/aer.h>
 #include <linux/nmi.h>
 
+#include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
@@ -79,6 +80,10 @@
 	((struct acpi_hest_generic_status *)				\
 	 ((struct ghes_estatus_node *)(estatus_node) + 1))
 
+#define HEST_TYPE_GENERIC_V2(ghes)				\
+	(((struct acpi_hest_header *)(ghes)->generic)->type ==	\
+	 ACPI_HEST_TYPE_GENERIC_ERROR_V2)
+
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
 	if (!ghes)
 		return ERR_PTR(-ENOMEM);
+
 	ghes->generic = generic;
+	if (HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = apei_map_generic_address(
+			&ghes->generic_v2->read_ack_register);
+		if (rc)
+			goto err_free;
+	}
+
 	rc = apei_map_generic_address(&generic->error_status_address);
 	if (rc)
-		goto err_free;
+		goto err_unmap_read_ack_addr;
 	error_block_length = generic->error_block_length;
 	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
 		pr_warning(FW_WARN GHES_PFX
@@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
 	if (!ghes->estatus) {
 		rc = -ENOMEM;
-		goto err_unmap;
+		goto err_unmap_status_addr;
 	}
 
 	return ghes;
 
-err_unmap:
+err_unmap_status_addr:
 	apei_unmap_generic_address(&generic->error_status_address);
+err_unmap_read_ack_addr:
+	if (HEST_TYPE_GENERIC_V2(ghes))
+		apei_unmap_generic_address(
+			&ghes->generic_v2->read_ack_register);
 err_free:
 	kfree(ghes);
 	return ERR_PTR(rc);
@@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
 {
 	kfree(ghes->estatus);
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
+	if (HEST_TYPE_GENERIC_V2(ghes))
+		apei_unmap_generic_address(
+			&ghes->generic_v2->read_ack_register);
 }
 
 static inline int ghes_severity(int severity)
@@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
 	rcu_read_unlock();
 }
 
+static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
+{
+	int rc;
+	u64 val = 0;
+
+	rc = apei_read(&val, &generic_v2->read_ack_register);
+	if (rc)
+		return rc;
+	val &= generic_v2->read_ack_preserve <<
+		generic_v2->read_ack_register.bit_offset;
+	val |= generic_v2->read_ack_write <<
+		generic_v2->read_ack_register.bit_offset;
+	rc = apei_write(val, &generic_v2->read_ack_register);
+
+	return rc;
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
 	ghes_do_proc(ghes, ghes->estatus);
+
+	if (HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = ghes_ack_error(ghes->generic_v2);
+		if (rc)
+			return rc;
+	}
 out:
 	ghes_clear_estatus(ghes);
 	return 0;
diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
index 792a0d9..ef725a9 100644
--- a/drivers/acpi/apei/hest.c
+++ b/drivers/acpi/apei/hest.c
@@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
 	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
 	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
 	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
+	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
 };
 
 static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
@@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
 {
 	int *count = data;
 
-	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
+	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		(*count)++;
 	return 0;
 }
@@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
 	struct ghes_arr *ghes_arr = data;
 	int rc, i;
 
-	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
+	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		return 0;
 
 	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 720446c..68f088a 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -13,7 +13,10 @@
 #define GHES_EXITING		0x0002
 
 struct ghes {
-	struct acpi_hest_generic *generic;
+	union {
+		struct acpi_hest_generic *generic;
+		struct acpi_hest_generic_v2 *generic_v2;
+	};
 	struct acpi_hest_generic_status *estatus;
 	u64 buffer_paddr;
 	unsigned long flags;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:35   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

Currently, a reported RAS error carries no timestamp. The
ACPI 6.1 spec adds a timestamp field to the generic error
data entry v3 structure. Report the time at which the
firmware generated the error.
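
The timestamp decoding can be sketched in userspace C as follows. This is an
illustration only: the struct and function names are hypothetical, bcd_to_bin()
stands in for the kernel's bcd2bin(), and the byte layout is simply the one
this patch reads in cper_estatus_print_section_v300().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields. */
struct decoded_time {
	unsigned int year;	/* full year: century * 100 + year */
	unsigned int mon, day, hour, min, sec;
	int precise;		/* bit 0 of byte 3 */
};

/* Convert one packed-BCD byte (e.g. 0x59) to its binary value (59). */
static unsigned int bcd_to_bin(uint8_t v)
{
	return (v & 0x0f) + (v >> 4) * 10;
}

/*
 * Decode an 8-byte timestamp laid out as the patch reads it:
 * [0]=sec [1]=min [2]=hour [3]=flags [4]=day [5]=month [6]=year [7]=century
 */
static struct decoded_time decode_timestamp(const uint8_t *ts)
{
	struct decoded_time t;

	assert(ts);
	t.sec = bcd_to_bin(ts[0]);
	t.min = bcd_to_bin(ts[1]);
	t.hour = bcd_to_bin(ts[2]);
	t.precise = ts[3] & 0x01;
	t.day = bcd_to_bin(ts[4]);
	t.mon = bcd_to_bin(ts[5]);
	t.year = bcd_to_bin(ts[7]) * 100 + bcd_to_bin(ts[6]);
	return t;
}
```

Keeping the decode step separate from the printing makes it easy to check the
field order against the spec before it reaches the kernel log.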

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 drivers/acpi/apei/ghes.c    | 14 +++++++---
 drivers/firmware/efi/cper.c | 62 +++++++++++++++++++++++++++++++++++----------
 include/acpi/ghes.h         | 10 ++++++++
 include/linux/cper.h        | 12 +++++++++
 4 files changed, 80 insertions(+), 18 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b79abc5..9063d68 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
 	int flags = -1;
 	int sec_sev = ghes_severity(gdata->error_severity);
 	struct cper_sec_mem_err *mem_err;
-	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
+
+	mem_err = acpi_hest_generic_data_payload(gdata);
 
 	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
 		return;
@@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
+	uuid_le sec_type;
 
 	sev = ghes_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
 		sec_sev = ghes_severity(gdata->error_severity);
-		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+		sec_type = *(uuid_le *)gdata->section_type;
+
+		if (!uuid_le_cmp(sec_type,
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
-			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+
+			mem_err = acpi_hest_generic_data_payload(gdata);
 			ghes_edac_report_mem_error(ghes, sev, mem_err);
 
 			arch_apei_report_mem_error(sev, mem_err);
@@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
 		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
 				      CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
-			pcie_err = (struct cper_sec_pcie *)(gdata+1);
+
+			pcie_err = acpi_hest_generic_data_payload(gdata);
 			if (sev == GHES_SEV_RECOVERABLE &&
 			    sec_sev == GHES_SEV_RECOVERABLE &&
 			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index d425374..7e2439e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -32,6 +32,9 @@
 #include <linux/acpi.h>
 #include <linux/pci.h>
 #include <linux/aer.h>
+#include <linux/printk.h>
+#include <linux/bcd.h>
+#include <acpi/ghes.h>
 
 #define INDENT_SP	" "
 
@@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 }
 
+static void cper_estatus_print_section_v300(const char *pfx,
+	const struct acpi_hest_generic_data_v300 *gdata)
+{
+	__u8 hour, min, sec, day, mon, year, century, *timestamp;
+
+	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
+		timestamp = (__u8 *)&(gdata->time_stamp);
+		sec = bcd2bin(timestamp[0]);
+		min = bcd2bin(timestamp[1]);
+		hour = bcd2bin(timestamp[2]);
+		day = bcd2bin(timestamp[4]);
+		mon = bcd2bin(timestamp[5]);
+		year = bcd2bin(timestamp[6]);
+		century = bcd2bin(timestamp[7]);
+		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
+			0x01 & *(timestamp + 3) ? "precise" : "", century,
+			year, mon, day, hour, min, sec);
+	}
+}
+
 static void cper_estatus_print_section(
-	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
+	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
 {
 	uuid_le *sec_type = (uuid_le *)gdata->section_type;
 	__u16 severity;
 	char newpfx[64];
 
+	if (acpi_hest_generic_data_version(gdata) >= 3)
+		cper_estatus_print_section_v300(pfx,
+			(const struct acpi_hest_generic_data_v300 *)gdata);
+
 	severity = gdata->error_severity;
 	printk("%s""Error %d, type: %s\n", pfx, sec_no,
 	       cper_severity_str(severity));
@@ -403,14 +430,18 @@ static void cper_estatus_print_section(
 
 	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
 	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
-		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
+		struct cper_sec_proc_generic *proc_err;
+
+		proc_err = acpi_hest_generic_data_payload(gdata);
 		printk("%s""section_type: general processor error\n", newpfx);
 		if (gdata->error_data_length >= sizeof(*proc_err))
 			cper_print_proc_generic(newpfx, proc_err);
 		else
 			goto err_section_too_small;
 	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
-		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
+		struct cper_sec_mem_err *mem_err;
+
+		mem_err = acpi_hest_generic_data_payload(gdata);
 		printk("%s""section_type: memory error\n", newpfx);
 		if (gdata->error_data_length >=
 		    sizeof(struct cper_sec_mem_err_old))
@@ -419,7 +450,9 @@ static void cper_estatus_print_section(
 		else
 			goto err_section_too_small;
 	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
-		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
+		struct cper_sec_pcie *pcie;
+
+		pcie = acpi_hest_generic_data_payload(gdata);
 		printk("%s""section_type: PCIe error\n", newpfx);
 		if (gdata->error_data_length >= sizeof(*pcie))
 			cper_print_pcie(newpfx, pcie, gdata);
@@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
 			const struct acpi_hest_generic_status *estatus)
 {
 	struct acpi_hest_generic_data *gdata;
-	unsigned int data_len, gedata_len;
+	unsigned int data_len;
 	int sec_no = 0;
 	char newpfx[64];
 	__u16 severity;
@@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
 	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
 	data_len = estatus->data_length;
 	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
+
 	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
-	while (data_len >= sizeof(*gdata)) {
-		gedata_len = gdata->error_data_length;
+
+	while (data_len >= acpi_hest_generic_data_size(gdata)) {
 		cper_estatus_print_section(newpfx, gdata, sec_no);
-		data_len -= gedata_len + sizeof(*gdata);
-		gdata = (void *)(gdata + 1) + gedata_len;
+		gdata = acpi_hest_generic_data_next(gdata);
 		sec_no++;
 	}
 }
@@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
 		return rc;
 	data_len = estatus->data_length;
 	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
-	while (data_len >= sizeof(*gdata)) {
-		gedata_len = gdata->error_data_length;
-		if (gedata_len > data_len - sizeof(*gdata))
+
+	while (data_len >= acpi_hest_generic_data_size(gdata)) {
+		gedata_len = acpi_hest_generic_data_error_length(gdata);
+		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
 			return -EINVAL;
-		data_len -= gedata_len + sizeof(*gdata);
-		gdata = (void *)(gdata + 1) + gedata_len;
+		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
+		gdata = acpi_hest_generic_data_next(gdata);
 	}
 	if (data_len)
 		return -EINVAL;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 68f088a..56b9679 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
 {
 }
 #endif
+
+#define acpi_hest_generic_data_version(gdata)			\
+	((gdata)->revision >> 8)
+
+static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
+{
+	return acpi_hest_generic_data_version(gdata) >= 3 ?
+		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
+		gdata + 1;
+}
diff --git a/include/linux/cper.h b/include/linux/cper.h
index dcacb1a..13ea41c 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -255,6 +255,18 @@ enum {
 
 #define CPER_PCIE_SLOT_SHIFT			3
 
+#define acpi_hest_generic_data_error_length(gdata)	\
+	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
+#define acpi_hest_generic_data_size(gdata)		\
+	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
+	sizeof(struct acpi_hest_generic_data_v300) :	\
+	sizeof(struct acpi_hest_generic_data))
+#define acpi_hest_generic_data_record_size(gdata)	\
+	(acpi_hest_generic_data_size(gdata) +		\
+	acpi_hest_generic_data_error_length(gdata))
+#define acpi_hest_generic_data_next(gdata)		\
+	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
+
 /*
  * All tables and structs must be byte-packed to match CPER
  * specification, since the tables are provided by the system BIOS
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* [PATCH V5 03/10] efi: parse ARMv8 processor error
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:35   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

Add support for ARMv8 Common Platform Error Record (CPER).
UEFI 2.6 specification adds support for ARMv8 specific
processor error information to be reported as part of the
CPER records. This provides more detail for processor error logs.
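
To illustrate how the first quadword of a processor context info structure is
interpreted, here is a small userspace sketch. The mask, shift and length
values are copied from the CPER_ARMV8_CTX_* and CPER_AARCH*_CTX_LEN
definitions in this patch; the shortened macro names, the struct and the
helper functions are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the definitions added to include/linux/cper.h. */
#define CTX_TYPE_MASK	0x000000000000000FULL
#define CTX_EL_MASK	0x0000000000000070ULL
#define CTX_NS_MASK	0x0000000000000080ULL
#define CTX_EL_SHIFT	4
#define CTX_NS_SHIFT	7
#define AARCH64_CTX_LEN	368
#define AARCH32_CTX_LEN	256

/* Hypothetical container for the decoded header fields. */
struct ctx_header {
	unsigned int type;	/* 0: AArch32, 1: AArch64, as the patch treats it */
	unsigned int el;	/* exception level */
	unsigned int ns;	/* the bit the patch prints as "Secure bit" */
};

static struct ctx_header decode_ctx_header(uint64_t qword)
{
	struct ctx_header h;

	/* The three fields must not overlap for the decode to be meaningful. */
	assert((CTX_TYPE_MASK & CTX_EL_MASK) == 0);
	assert((CTX_EL_MASK & CTX_NS_MASK) == 0);

	h.type = (unsigned int)(qword & CTX_TYPE_MASK);
	h.el = (unsigned int)((qword & CTX_EL_MASK) >> CTX_EL_SHIFT);
	h.ns = (unsigned int)((qword & CTX_NS_MASK) >> CTX_NS_SHIFT);
	return h;
}

/* Context length in bytes for a given type, or -1 if the type is unknown. */
static int ctx_len(unsigned int type)
{
	switch (type) {
	case 0:
		return AARCH32_CTX_LEN;
	case 1:
		return AARCH64_CTX_LEN;
	default:
		return -1;
	}
}
```

A parser would compare ctx_len() against the remaining section length before
dumping the register block, as cper_print_proc_armv8() does with its
"section length is too small" checks.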

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 drivers/firmware/efi/cper.c | 135 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cper.h        |  72 +++++++++++++++++++++++
 2 files changed, 207 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 7e2439e..004aa1b 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -110,12 +110,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
 static const char * const proc_type_strs[] = {
 	"IA32/X64",
 	"IA64",
+	"ARMv8",
 };
 
 static const char * const proc_isa_strs[] = {
 	"IA32",
 	"IA64",
 	"X64",
+	"ARM A32/T32",
+	"ARM A64",
 };
 
 static const char * const proc_error_type_strs[] = {
@@ -184,6 +187,129 @@ static void cper_print_proc_generic(const char *pfx,
 		printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
 }
 
+static void cper_print_proc_armv8(const char *pfx,
+				  const struct cper_sec_proc_armv8 *proc)
+{
+	int i, len;
+	struct cper_armv8_err_info *err_info;
+	__u64 *qword = NULL;
+	char newpfx[64];
+
+	printk("%ssection length: %d\n", pfx, proc->section_length);
+	printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
+
+	len = proc->section_length - (sizeof(*proc) +
+		proc->err_info_num * (sizeof(*err_info)));
+	if (len < 0) {
+		printk("%ssection length is too small.\n", pfx);
+		printk("%sERR_INFO_NUM is %d.\n", pfx, proc->err_info_num);
+		return;
+	}
+
+	if (proc->validation_bits & CPER_ARMV8_VALID_MPIDR)
+		printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);
+	if (proc->validation_bits & CPER_ARMV8_VALID_AFFINITY_LEVEL)
+		printk("%serror affinity level: %d\n", pfx,
+			proc->affinity_level);
+	if (proc->validation_bits & CPER_ARMV8_VALID_RUNNING_STATE) {
+		printk("%srunning state: %d\n", pfx, proc->running_state);
+		printk("%sPSCI state: %d\n", pfx, proc->psci_state);
+	}
+
+	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
+
+	err_info = (struct cper_armv8_err_info *)(proc + 1);
+	for (i = 0; i < proc->err_info_num; i++) {
+		printk("%sError info structure %d:\n", pfx, i);
+		printk("%sversion:%d\n", newpfx, err_info->version);
+		printk("%slength:%d\n", newpfx, err_info->length);
+		if (err_info->validation_bits &
+		    CPER_ARMV8_INFO_VALID_MULTI_ERR) {
+			if (err_info->multiple_error == 0)
+				printk("%ssingle error.\n", newpfx);
+			else if (err_info->multiple_error == 1)
+				printk("%smultiple errors.\n", newpfx);
+			else
+				printk("%smultiple errors count:%d.\n",
+				newpfx, err_info->multiple_error);
+		}
+		if (err_info->validation_bits & CPER_ARMV8_INFO_VALID_FLAGS) {
+			if (err_info->flags & CPER_ARMV8_INFO_FLAGS_FIRST)
+				printk("%sfirst error captured.\n", newpfx);
+			if (err_info->flags & CPER_ARMV8_INFO_FLAGS_LAST)
+				printk("%slast error captured.\n", newpfx);
+			if (err_info->flags & CPER_ARMV8_INFO_FLAGS_PROPAGATED)
+				printk("%spropagated error captured.\n",
+				       newpfx);
+		}
+		printk("%serror_type: %d, %s\n", newpfx, err_info->type,
+			err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
+			proc_error_type_strs[err_info->type] : "unknown");
+		printk("%serror_info: 0x%016llx\n", newpfx,
+		       err_info->error_info);
+		if (err_info->validation_bits & CPER_ARMV8_INFO_VALID_VIRT_ADDR)
+			printk("%svirtual fault address: 0x%016llx\n",
+				newpfx, err_info->virt_fault_addr);
+		if (err_info->validation_bits &
+		    CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR)
+			printk("%sphysical fault address: 0x%016llx\n",
+				newpfx, err_info->physical_fault_addr);
+		err_info += 1;
+	}
+
+	if (len < sizeof(*qword) && proc->context_info_num > 0) {
+		printk("%ssection length is too small.\n", pfx);
+		printk("%sCTX_INFO_NUM is %d.\n", pfx, proc->context_info_num);
+		return;
+	}
+	for (i = 0; i < proc->context_info_num; i++) {
+		qword = (__u64 *)err_info;
+		printk("%sProcessor context info structure %d:\n", pfx, i);
+		printk("%sException level %d.\n", newpfx,
+		       (int)((*qword & CPER_ARMV8_CTX_EL_MASK)
+				>> CPER_ARMV8_CTX_EL_SHIFT));
+		printk("%sSecure bit: %d.\n", newpfx,
+		       (int)((*qword & CPER_ARMV8_CTX_NS_MASK)
+				>> CPER_ARMV8_CTX_NS_SHIFT));
+		if ((*qword & CPER_ARMV8_CTX_TYPE_MASK) == 0) {
+			if (len < CPER_AARCH32_CTX_LEN) {
+				printk("%ssection length is too small.\n", pfx);
+				printk("%sremaining length is %d.\n", pfx, len);
+				return;
+			}
+			printk("%sAArch32 execution context.\n", newpfx);
+			qword++;
+			print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+				qword, CPER_AARCH32_CTX_LEN - sizeof(*qword),
+				0);
+			len -= CPER_AARCH32_CTX_LEN;
+		} else if ((*qword & CPER_ARMV8_CTX_TYPE_MASK) == 1) {
+			if (len < CPER_AARCH64_CTX_LEN) {
+				printk("%ssection length is too small.\n", pfx);
+				printk("%sremaining length is %d.\n", pfx, len);
+				return;
+			}
+			printk("%sAArch64 execution context.\n", newpfx);
+			qword++;
+			print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+				qword, CPER_AARCH64_CTX_LEN - sizeof(*qword),
+				0);
+			len -= CPER_AARCH64_CTX_LEN;
+		} else {
+			printk("%scontext type is incorrect 0x%016llx.\n",
+			pfx, *qword);
+			return;
+		}
+	}
+
+	if (len > 0) {
+		printk("%sVendor specific error info has %d bytes.\n", pfx,
+		       len);
+		print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, qword, len,
+			0);
+	}
+}
+
 static const char * const mem_err_type_strs[] = {
 	"unknown",
 	"no error",
@@ -458,6 +584,15 @@ static void cper_estatus_print_section(
 			cper_print_pcie(newpfx, pcie, gdata);
 		else
 			goto err_section_too_small;
+	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARMV8)) {
+		struct cper_sec_proc_armv8 *armv8_err;
+
+		armv8_err = acpi_hest_generic_data_payload(gdata);
+		printk("%ssection_type: ARMv8 processor error\n", newpfx);
+		if (gdata->error_data_length >= sizeof(*armv8_err))
+			cper_print_proc_armv8(newpfx, armv8_err);
+		else
+			goto err_section_too_small;
 	} else
 		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
 
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 13ea41c..2a9d553 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -162,6 +162,11 @@ enum {
  * corrective action before the data is consumed
  */
 #define CPER_SEC_LATENT_ERROR			0x0020
+/*
+ * If set, the section contains an error that is propagated. The error
+ * did not originate from the hardware associated with this section.
+ */
+#define CPER_SEC_PROPAGATED			0x0040
 
 /*
  * Section type definitions, used in section_type field in struct
@@ -180,6 +185,10 @@ enum {
 #define CPER_SEC_PROC_IPF						\
 	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
 		0x80, 0xC7, 0x3C, 0x88, 0x81)
+/* Processor Specific: ARMv8 */
+#define CPER_SEC_PROC_ARMV8						\
+	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
+		0x1D, 0x5D, 0x46, 0xB0)
 /* Platform Memory */
 #define CPER_SEC_PLATFORM_MEM						\
 	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
@@ -255,6 +264,34 @@ enum {
 
 #define CPER_PCIE_SLOT_SHIFT			3
 
+#define CPER_ARMV8_ERR_INFO_NUM_MASK		0x00000000000000FF
+#define CPER_ARMV8_CTX_INFO_NUM_MASK		0x0000000000FFFF00
+#define CPER_ARMV8_CTX_INFO_NUM_SHIFT		8
+
+#define CPER_ARMV8_VALID_MPIDR			0x00000001
+#define CPER_ARMV8_VALID_AFFINITY_LEVEL		0x00000002
+#define CPER_ARMV8_VALID_RUNNING_STATE		0x00000004
+#define CPER_ARMV8_VALID_VENDOR_INFO		0x00000008
+
+#define CPER_ARMV8_INFO_VALID_MULTI_ERR		0x0001
+#define CPER_ARMV8_INFO_VALID_FLAGS		0x0002
+#define CPER_ARMV8_INFO_VALID_ERR_INFO		0x0004
+#define CPER_ARMV8_INFO_VALID_VIRT_ADDR		0x0008
+#define CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR	0x0010
+
+#define CPER_ARMV8_INFO_FLAGS_FIRST		0x0001
+#define CPER_ARMV8_INFO_FLAGS_LAST		0x0002
+#define CPER_ARMV8_INFO_FLAGS_PROPAGATED	0x0004
+
+#define CPER_AARCH64_CTX_LEN			368
+#define CPER_AARCH32_CTX_LEN			256
+
+#define CPER_ARMV8_CTX_TYPE_MASK		0x000000000000000F
+#define CPER_ARMV8_CTX_EL_MASK			0x0000000000000070
+#define CPER_ARMV8_CTX_NS_MASK			0x0000000000000080
+#define CPER_ARMV8_CTX_EL_SHIFT			4
+#define CPER_ARMV8_CTX_NS_SHIFT			7
+
 #define acpi_hest_generic_data_error_length(gdata)	\
 	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
 #define acpi_hest_generic_data_size(gdata)		\
@@ -352,6 +389,41 @@ struct cper_ia_proc_ctx {
 	__u64	mm_reg_addr;
 };
 
+/* ARMv8 Processor Error Section */
+struct cper_sec_proc_armv8 {
+	__u32	validation_bits;
+	__u16	err_info_num; /* Number of Processor Error Info */
+	__u16	context_info_num; /* Number of Processor Context Info Records */
+	__u32	section_length;
+	__u8	affinity_level;
+	__u8	reserved[3];	/* must be zero */
+	__u64	mpidr;
+	__u64	midr;
+	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
+	__u32	psci_state;
+};
+
+/* ARMv8 Processor Error Information Structure */
+struct cper_armv8_err_info {
+	__u8	version;
+	__u8	length;
+	__u16	validation_bits;
+	__u8	type;
+	__u16	multiple_error;
+	__u8	flags;
+	__u64	error_info;
+	__u64	virt_fault_addr;
+	__u64	physical_fault_addr;
+};
+
+/* ARMv8 AARCH64 Processor Context Information Structure */
+struct cper_armv8_aarch64_ctx {
+	__u8	type_el_ns;
+	__u8	reserved[7];	/* must be zero */
+	__u8	gpr[288];
+	__u8	spr[68];
+};
+
 /* Old Memory Error Section UEFI 2.1, 2.2 */
 struct cper_sec_mem_err_old {
 	__u64	validation_bits;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread
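
The length accounting and the context header decoding in the patch above can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: the mask and shift values and the field layouts are copied from the cper.h hunk, while the names sec_proc_armv8, armv8_err_info, ctx_hdr, decode_ctx_hdr() and remaining_after_err_info() are hypothetical and exist only for this illustration. It also assumes the structures are byte-packed, which is what gives the 40-byte and 32-byte sizes checked below.

```c
#include <assert.h>
#include <stdint.h>

/* Mask and shift values copied from the cper.h hunk of the patch. */
#define CPER_ARMV8_CTX_TYPE_MASK	0x000000000000000FULL
#define CPER_ARMV8_CTX_EL_MASK		0x0000000000000070ULL
#define CPER_ARMV8_CTX_NS_MASK		0x0000000000000080ULL
#define CPER_ARMV8_CTX_EL_SHIFT		4
#define CPER_ARMV8_CTX_NS_SHIFT		7

/* Assumption: byte packing, so no padding is inserted between fields. */
#pragma pack(push, 1)
struct sec_proc_armv8 {			/* mirrors struct cper_sec_proc_armv8 */
	uint32_t validation_bits;
	uint16_t err_info_num;
	uint16_t context_info_num;
	uint32_t section_length;
	uint8_t  affinity_level;
	uint8_t  reserved[3];
	uint64_t mpidr;
	uint64_t midr;
	uint32_t running_state;
	uint32_t psci_state;
};

struct armv8_err_info {			/* mirrors struct cper_armv8_err_info */
	uint8_t  version;
	uint8_t  length;
	uint16_t validation_bits;
	uint8_t  type;
	uint16_t multiple_error;
	uint8_t  flags;
	uint64_t error_info;
	uint64_t virt_fault_addr;
	uint64_t physical_fault_addr;
};
#pragma pack(pop)

static_assert(sizeof(struct sec_proc_armv8) == 40, "fixed header size");
static_assert(sizeof(struct armv8_err_info) == 32, "error info size");

/* Hypothetical helper type: the three fields of a context header word. */
struct ctx_hdr {
	unsigned int type;	/* the patch treats 0 as AArch32, 1 as AArch64 */
	unsigned int el;	/* exception level */
	unsigned int ns;	/* bit 7, printed as "Secure bit" by the patch */
};

static inline struct ctx_hdr decode_ctx_hdr(uint64_t qword)
{
	struct ctx_hdr h;

	h.type = (unsigned int)(qword & CPER_ARMV8_CTX_TYPE_MASK);
	h.el = (unsigned int)((qword & CPER_ARMV8_CTX_EL_MASK)
				>> CPER_ARMV8_CTX_EL_SHIFT);
	h.ns = (unsigned int)((qword & CPER_ARMV8_CTX_NS_MASK)
				>> CPER_ARMV8_CTX_NS_SHIFT);
	return h;
}

/*
 * Bytes left for context records and vendor data once the fixed header
 * and all error info structures are accounted for, or -1 if the
 * reported section length is too small to hold them.
 */
static inline int64_t remaining_after_err_info(const struct sec_proc_armv8 *proc)
{
	int64_t fixed = (int64_t)sizeof(*proc) +
			(int64_t)proc->err_info_num *
			(int64_t)sizeof(struct armv8_err_info);
	int64_t len = (int64_t)proc->section_length - fixed;

	return len < 0 ? -1 : len;
}
```

For instance, a section that reports two error info structures and a section length of 40 + 2 * 32 + 368 bytes leaves 368 bytes, which matches CPER_AARCH64_CTX_LEN for a single AArch64 context record.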

* [PATCH V5 04/10] arm64: exception: handle Synchronous External Abort
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:35   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

Synchronous External Abort (SEA) exceptions are often caused by an
uncorrected hardware error. They are reported as data abort or
instruction abort exception classes with specific Fault Status
Code values.
When an SEA occurs, run the handlers registered on the
notifier chain before killing the process.
Update fault_info[] so that the SEA fault status codes use
the new SEA handler.
Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 arch/arm64/include/asm/system_misc.h | 13 ++++++++
 arch/arm64/mm/fault.c                | 58 +++++++++++++++++++++++++++++-------
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 57f110b..9040e1d 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -64,4 +64,17 @@ extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
 
 #endif	/* __ASSEMBLY__ */
 
+/*
+ * The functions below are used to register and unregister callbacks
+ * that are to be invoked when a Synchronous External Abort (SEA)
+ * occurs. An SEA is raised by certain fault status codes that have
+ * either data or instruction abort as the exception class, and
+ * callbacks may be registered to parse or handle such hardware errors.
+ *
+ * Registered callbacks are run in an interrupt/atomic context. They
+ * are not allowed to block or sleep.
+ */
+int register_synchronous_ext_abort_notifier(struct notifier_block *nb);
+void unregister_synchronous_ext_abort_notifier(struct notifier_block *nb);
+
 #endif	/* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 05d2bd7..fcc49f1 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -39,6 +39,22 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
+/*
+ * GHES SEA handler code may register a notifier call here to
+ * handle a HW error record passed from the platform.
+ */
+static ATOMIC_NOTIFIER_HEAD(sea_handler_chain);
+
+int register_synchronous_ext_abort_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&sea_handler_chain, nb);
+}
+
+void unregister_synchronous_ext_abort_notifier(struct notifier_block *nb)
+{
+	atomic_notifier_chain_unregister(&sea_handler_chain, nb);
+}
+
 static const char *fault_name(unsigned int esr);
 
 #ifdef CONFIG_KPROBES
@@ -480,6 +496,28 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * This abort handler deals with Synchronous External Abort.
+ * It calls notifiers, and then returns "fault".
+ */
+static int do_synch_ext_abort(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	struct siginfo info;
+
+	atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
+
+	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+		 fault_name(esr), esr, addr);
+
+	info.si_signo = SIGBUS;
+	info.si_errno = 0;
+	info.si_code  = 0;
+	info.si_addr  = (void __user *)addr;
+	arm64_notify_die("", regs, &info, esr);
+
+	return 0;
+}
+
 static const struct fault_info {
 	int	(*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 	int	sig;
@@ -502,22 +540,22 @@ static const struct fault_info {
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"synchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 17"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 0 SEA (trans tbl walk)"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 1 SEA (trans tbl walk)"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 2 SEA (trans tbl walk)"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 3 SEA (trans tbl walk)"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"synchronous parity or ECC err" },
 	{ do_bad,		SIGBUS,  0,		"unknown 25"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 26"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 27"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 0 synch parity error"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 1 synch parity error"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 2 synch parity error"	},
+	{ do_synch_ext_abort,	SIGBUS,  0,		"level 3 synch parity error"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 32"			},
 	{ do_alignment_fault,	SIGBUS,  BUS_ADRALN,	"alignment fault"		},
 	{ do_bad,		SIGBUS,  0,		"unknown 34"			},
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread
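
Which fault status codes end up in the new handler can be read off the fault_info[] hunk above. The sketch below restates that mapping as a predicate. It assumes, as arch/arm64/mm/fault.c does, that fault_info[] is indexed by the low six bits of the ESR; the names FSC_MASK and fsc_is_sea() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low six bits of the ESR select the fault_info[] entry. */
#define FSC_MASK	0x3fu

static_assert(FSC_MASK + 1 == 64, "fault_info[] has 64 entries");

/*
 * Returns true for the entries the patch routes to do_synch_ext_abort():
 *   16      synchronous external abort
 *   20..23  level 0..3 SEA on translation table walk
 *   24      synchronous parity or ECC error
 *   28..31  level 0..3 synchronous parity error
 */
static inline bool fsc_is_sea(uint32_t esr)
{
	uint32_t fsc = esr & FSC_MASK;

	return fsc == 16 ||
	       (fsc >= 20 && fsc <= 24) ||
	       (fsc >= 28 && fsc <= 31);
}
```

Entries 17 to 19, 25 to 27 and 32 stay on do_bad() in the patch, so the predicate returns false for them.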

* [PATCH V5 05/10] acpi: apei: handle SEA notification type for ARMv8
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:35   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

The ARM APEI extension proposal added the SEA (Synchronous External
Abort) notification type for ARMv8.
Add a new GHES error source handling function for SEA. If an error
source's notification type is SEA, then this function can be registered
into the SEA exception handler. That way GHES will parse and report
SEA exceptions when they occur.

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 arch/arm64/Kconfig        |  1 +
 drivers/acpi/apei/Kconfig | 14 ++++++++
 drivers/acpi/apei/ghes.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b380c87..ae34349 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -53,6 +53,7 @@ config ARM64
 	select HANDLE_DOMAIN_IRQ
 	select HARDIRQS_SW_RESEND
 	select HAVE_ACPI_APEI if (ACPI && EFI)
+	select HAVE_ACPI_APEI_SEA if (ACPI && EFI)
 	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_BITREVERSE
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index b0140c8..3786ff1 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -4,6 +4,20 @@ config HAVE_ACPI_APEI
 config HAVE_ACPI_APEI_NMI
 	bool
 
+config HAVE_ACPI_APEI_SEA
+	bool "APEI Synchronous External Abort logging/recovering support"
+	depends on ARM64
+	help
+	  This option should be enabled if the system supports
+	  firmware first handling of SEA (Synchronous External Abort).
+	  SEA happens with certain faults of data abort or instruction
+	  abort synchronous exceptions on ARMv8 systems. If a system
+	  supports firmware first handling of SEA, the platform analyzes
+	  and handles hardware error notifications with SEA, and it may then
+	  form a HW error record for the OS to parse and handle. This
+	  option allows the OS to look for such a HW error record and
+	  take appropriate action.
+
 config ACPI_APEI
 	bool "ACPI Platform Error Interface (APEI)"
 	select MISC_FILESYSTEMS
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9063d68..839a0e2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -50,6 +50,10 @@
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+#include <asm/system_misc.h>
+#endif
+
 #include "apei-internal.h"
 
 #define GHES_PFX	"GHES: "
@@ -770,6 +774,62 @@ static struct notifier_block ghes_notifier_sci = {
 	.notifier_call = ghes_notify_sci,
 };
 
+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+static LIST_HEAD(ghes_sea);
+
+static int ghes_notify_sea(struct notifier_block *this,
+				  unsigned long event, void *data)
+{
+	struct ghes *ghes;
+	int ret = NOTIFY_DONE;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
+		if (!ghes_proc(ghes))
+			ret = NOTIFY_OK;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static struct notifier_block ghes_notifier_sea = {
+	.notifier_call = ghes_notify_sea,
+};
+
+static int ghes_sea_add(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	if (list_empty(&ghes_sea))
+		register_synchronous_ext_abort_notifier(&ghes_notifier_sea);
+	list_add_rcu(&ghes->list, &ghes_sea);
+	mutex_unlock(&ghes_list_mutex);
+	return 0;
+}
+
+static void ghes_sea_remove(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_del_rcu(&ghes->list);
+	if (list_empty(&ghes_sea))
+		unregister_synchronous_ext_abort_notifier(&ghes_notifier_sea);
+	mutex_unlock(&ghes_list_mutex);
+}
+#else /* CONFIG_HAVE_ACPI_APEI_SEA */
+static inline int ghes_sea_add(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+	return -ENOTSUPP;
+}
+
+static inline void ghes_sea_remove(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+}
+#endif /* CONFIG_HAVE_ACPI_APEI_SEA */
+
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 /*
  * printk is not safe in NMI context.  So in NMI handler, we allocate
@@ -1014,6 +1074,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	case ACPI_HEST_NOTIFY_EXTERNAL:
 	case ACPI_HEST_NOTIFY_SCI:
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_SEA)) {
+			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
+				generic->header.source_id);
+			rc = -ENOTSUPP;
+			goto err;
+		}
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
 			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1025,6 +1093,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
 			   generic->header.source_id);
 		goto err;
+	case ACPI_HEST_NOTIFY_GPIO:
+	case ACPI_HEST_NOTIFY_SEI:
+	case ACPI_HEST_NOTIFY_GSIV:
+		pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
+			generic->header.source_id, generic->notify.type);
+		rc = -ENOTSUPP;
+		goto err;
 	default:
 		pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
 			   generic->notify.type, generic->header.source_id);
@@ -1079,6 +1154,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		list_add_rcu(&ghes->list, &ghes_sci);
 		mutex_unlock(&ghes_list_mutex);
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		rc = ghes_sea_add(ghes);
+		if (rc)
+			goto err_edac_unreg;
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_add(ghes);
 		break;
@@ -1121,6 +1201,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 			unregister_acpi_hed_notifier(&ghes_notifier_sci);
 		mutex_unlock(&ghes_list_mutex);
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		ghes_sea_remove(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_remove(ghes);
 		break;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 05/10] acpi: apei: handle SEA notification type for ARMv8
@ 2016-11-21 22:35   ` Tyler Baicar
  0 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: linux-arm-kernel

The ARM APEI extension proposal added the SEA (Synchronous External
Abort) notification type for ARMv8.
Add a new GHES error source handling function for SEA. If an error
source's notification type is SEA, then this function can be registered
with the SEA exception handler. That way GHES will parse and report
SEA exceptions when they occur.
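
The registration pattern described above (hook the handler when the
first SEA source is added, unhook it when the last one is removed) can
be sketched in user space as follows. This is only a simplified model:
struct source, sea_add(), sea_remove() and sea_notify() are
hypothetical names invented for illustration, and the locking and RCU
list handling used by the real driver are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one error source; not a kernel structure. */
struct source {
	int id;
	struct source *next;
};

struct source *sea_sources;	/* stands in for the ghes_sea list */
bool handler_registered;	/* stands in for the notifier registration */

/* Add a source; the first one added also "registers" the handler. */
void sea_add(struct source *s)
{
	assert(s != NULL);
	if (sea_sources == NULL)
		handler_registered = true;
	s->next = sea_sources;
	sea_sources = s;
}

/* Remove a source; removing the last one "unregisters" the handler. */
void sea_remove(struct source *s)
{
	struct source **pp;

	for (pp = &sea_sources; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == s) {
			*pp = s->next;
			break;
		}
	}
	if (sea_sources == NULL)
		handler_registered = false;
}

/* Called when an abort is taken: returns how many sources get polled. */
int sea_notify(void)
{
	int polled = 0;
	struct source *s;

	if (!handler_registered)
		return 0;
	for (s = sea_sources; s != NULL; s = s->next)
		polled++;
	return polled;
}
```

In this model every registered source is visited on each abort, which
matches the way the handler in the patch walks the whole list rather
than trying to identify a single source.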

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
---
 arch/arm64/Kconfig        |  1 +
 drivers/acpi/apei/Kconfig | 14 ++++++++
 drivers/acpi/apei/ghes.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b380c87..ae34349 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -53,6 +53,7 @@ config ARM64
 	select HANDLE_DOMAIN_IRQ
 	select HARDIRQS_SW_RESEND
 	select HAVE_ACPI_APEI if (ACPI && EFI)
+	select HAVE_ACPI_APEI_SEA if (ACPI && EFI)
 	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_BITREVERSE
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index b0140c8..3786ff1 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -4,6 +4,20 @@ config HAVE_ACPI_APEI
 config HAVE_ACPI_APEI_NMI
 	bool
 
+config HAVE_ACPI_APEI_SEA
+	bool "APEI Synchronous External Abort logging/recovering support"
+	depends on ARM64
+	help
+	  This option should be enabled if the system supports
+	  firmware first handling of SEA (Synchronous External Abort).
+	  SEA happens with certain faults of data abort or instruction
+	  abort synchronous exceptions on ARMv8 systems. If a system
+	  supports firmware first handling of SEA, the platform analyzes
+	  and handles hardware error notifications with SEA, and it may then
+	  form a HW error record for the OS to parse and handle. This
+	  option allows the OS to look for such HW error record, and
+	  take appropriate action.
+
 config ACPI_APEI
 	bool "ACPI Platform Error Interface (APEI)"
 	select MISC_FILESYSTEMS
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9063d68..839a0e2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -50,6 +50,10 @@
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+#include <asm/system_misc.h>
+#endif
+
 #include "apei-internal.h"
 
 #define GHES_PFX	"GHES: "
@@ -770,6 +774,62 @@ static struct notifier_block ghes_notifier_sci = {
 	.notifier_call = ghes_notify_sci,
 };
 
+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+static LIST_HEAD(ghes_sea);
+
+static int ghes_notify_sea(struct notifier_block *this,
+				  unsigned long event, void *data)
+{
+	struct ghes *ghes;
+	int ret = NOTIFY_DONE;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
+		if (!ghes_proc(ghes))
+			ret = NOTIFY_OK;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static struct notifier_block ghes_notifier_sea = {
+	.notifier_call = ghes_notify_sea,
+};
+
+static int ghes_sea_add(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	if (list_empty(&ghes_sea))
+		register_synchronous_ext_abort_notifier(&ghes_notifier_sea);
+	list_add_rcu(&ghes->list, &ghes_sea);
+	mutex_unlock(&ghes_list_mutex);
+	return 0;
+}
+
+static void ghes_sea_remove(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_del_rcu(&ghes->list);
+	if (list_empty(&ghes_sea))
+		unregister_synchronous_ext_abort_notifier(&ghes_notifier_sea);
+	mutex_unlock(&ghes_list_mutex);
+}
+#else /* CONFIG_HAVE_ACPI_APEI_SEA */
+static inline int ghes_sea_add(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+	return -ENOTSUPP;
+}
+
+static inline void ghes_sea_remove(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+}
+#endif /* CONFIG_HAVE_ACPI_APEI_SEA */
+
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 /*
  * printk is not safe in NMI context.  So in NMI handler, we allocate
@@ -1014,6 +1074,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	case ACPI_HEST_NOTIFY_EXTERNAL:
 	case ACPI_HEST_NOTIFY_SCI:
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_SEA)) {
+			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
+				generic->header.source_id);
+			rc = -ENOTSUPP;
+			goto err;
+		}
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
 			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1025,6 +1093,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
 			   generic->header.source_id);
 		goto err;
+	case ACPI_HEST_NOTIFY_GPIO:
+	case ACPI_HEST_NOTIFY_SEI:
+	case ACPI_HEST_NOTIFY_GSIV:
+		pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
+			generic->header.source_id, generic->notify.type);
+		rc = -ENOTSUPP;
+		goto err;
 	default:
 		pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
 			   generic->notify.type, generic->header.source_id);
@@ -1079,6 +1154,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		list_add_rcu(&ghes->list, &ghes_sci);
 		mutex_unlock(&ghes_list_mutex);
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		rc = ghes_sea_add(ghes);
+		if (rc)
+			goto err_edac_unreg;
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_add(ghes);
 		break;
@@ -1121,6 +1201,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 			unregister_acpi_hed_notifier(&ghes_notifier_sci);
 		mutex_unlock(&ghes_list_mutex);
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		ghes_sea_remove(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_remove(ghes);
 		break;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 06/10] acpi: apei: panic OS with fatal error status block
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:35   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:35 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>

Even if an error status block's severity is fatal, the kernel currently
does not honor the severity level and does not panic.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through a non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

If the severity is fatal, call panic() after the CPER data in the error
status block is printed and before any error section is handled.
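
The severity check can be modelled in user space as below. The enum
values and the mapping are assumptions, written from memory of the
kernel's CPER and GHES definitions and not taken from this patch, so
they should be verified against the headers. model_ghes_severity() and
model_should_panic() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed CPER severity encoding (to be checked against the headers). */
enum cper_sev {
	CPER_SEV_RECOVERABLE = 0,
	CPER_SEV_FATAL = 1,
	CPER_SEV_CORRECTED = 2,
	CPER_SEV_INFORMATIONAL = 3,
};

/* Assumed GHES severity ordering: larger means more severe. */
enum ghes_sev {
	GHES_SEV_NO = 0,
	GHES_SEV_CORRECTED = 1,
	GHES_SEV_RECOVERABLE = 2,
	GHES_SEV_PANIC = 3,
};

/* Map a CPER severity onto the ordered GHES scale. */
int model_ghes_severity(int cper_severity)
{
	switch (cper_severity) {
	case CPER_SEV_INFORMATIONAL:
		return GHES_SEV_NO;
	case CPER_SEV_CORRECTED:
		return GHES_SEV_CORRECTED;
	case CPER_SEV_RECOVERABLE:
		return GHES_SEV_RECOVERABLE;
	case CPER_SEV_FATAL:
	default:
		/* Unknown severities are treated as fatal in this model. */
		return GHES_SEV_PANIC;
	}
}

/* The check this patch adds before the sections are handled. */
bool model_should_panic(int cper_severity)
{
	return model_ghes_severity(cper_severity) >= GHES_SEV_PANIC;
}
```

The comparison only works because the GHES scale is ordered, which the
raw CPER encoding (as assumed here) is not.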

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 839a0e2..28f801c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -141,6 +141,8 @@ static unsigned long ghes_estatus_pool_size_request;
 static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+static int ghes_panic_timeout __read_mostly = 30;
+
 static int ghes_ioremap_init(void)
 {
 	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -695,6 +697,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
 	return rc;
 }
 
+static void __ghes_call_panic(void)
+{
+	if (panic_timeout == 0)
+		panic_timeout = ghes_panic_timeout;
+	panic("Fatal hardware error!");
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -706,6 +715,10 @@ static int ghes_proc(struct ghes *ghes)
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		__ghes_call_panic();
+	}
+
 	ghes_do_proc(ghes, ghes->estatus);
 
 	if (HEST_TYPE_GENERIC_V2(ghes)) {
@@ -850,8 +863,6 @@ static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
 
 static LIST_HEAD(ghes_nmi);
 
-static int ghes_panic_timeout	__read_mostly = 30;
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
@@ -944,9 +955,7 @@ static void __ghes_panic(struct ghes *ghes)
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
 
 	/* reboot to log the error! */
-	if (panic_timeout == 0)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+	__ghes_call_panic();
 }
 
 static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 07/10] efi: print unrecognized CPER section
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:36   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:36 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

The UEFI spec allows for non-standard sections in a Common Platform
Error Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently, if a CPER section's type (UUID) does not match any of the
section types that the kernel knows how to parse, the section is
skipped. The user therefore cannot see such CPER data, for instance
the error record of a non-standard section.

In that case, this change prints the raw section data in hex to the
dmesg buffer. The data length is taken from the Error Data Length
field of the Generic Error Data Entry.

The following is sample output from dmesg:
[  115.771702] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  115.779042] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[  115.787456] {1}[Hardware Error]: event severity: corrected
[  115.792927] {1}[Hardware Error]:  Error 0, type: corrected
[  115.798415] {1}[Hardware Error]:  fru_id: 00000000-0000-0000-0000-000000000000
[  115.805596] {1}[Hardware Error]:  fru_text:
[  115.816105] {1}[Hardware Error]:  section type: d2e2621c-f936-468d-0d84-15a4ed015c8b
[  115.823880] {1}[Hardware Error]:  section length: 88
[  115.828779] {1}[Hardware Error]:   00000000: 01000001 00000002 5f434345 525f4543
[  115.836153] {1}[Hardware Error]:   00000010: 0000574d 00000000 00000000 00000000
[  115.843531] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[  115.850908] {1}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000
[  115.858288] {1}[Hardware Error]:   00000040: fe800000 00000000 00000004 5f434345
[  115.865665] {1}[Hardware Error]:   00000050: 525f4543 0000574d
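
Because the dump groups the payload into 32-bit words, byte-oriented
content such as ASCII text is not readable directly. The sketch below
unpacks such words back into bytes, assuming the dump was produced on a
little-endian machine so that the lowest byte of each word comes first
in memory. words_to_bytes_le() is a hypothetical helper written for
this illustration and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Unpack 32-bit words, as shown in the hex dump, into the byte sequence
 * they occupy in little-endian memory. Returns the number of bytes
 * written, which never exceeds out_size.
 */
size_t words_to_bytes_le(const uint32_t *words, size_t nwords,
			 unsigned char *out, size_t out_size)
{
	size_t n = 0;
	size_t i;
	int shift;

	assert(words != NULL && out != NULL);
	for (i = 0; i < nwords; i++) {
		/* Least significant byte first. */
		for (shift = 0; shift < 32 && n < out_size; shift += 8)
			out[n++] = (unsigned char)(words[i] >> shift);
	}
	return n;
}
```

Applied to the three words at offset 0x08 of the sample above
(5f434345 525f4543 0000574d), this yields the bytes of the string
"ECC_CE_RMW" followed by two zero bytes. Reading them as an ASCII tag
is an interpretation; the layout of a non-standard section is defined
by the vendor.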

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/firmware/efi/cper.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 004aa1b..bbb576e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -593,8 +593,16 @@ static void cper_estatus_print_section(
 			cper_print_proc_armv8(newpfx, armv8_err);
 		else
 			goto err_section_too_small;
-	} else
-		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
+	} else {
+		const void *unknown_err;
+
+		unknown_err = acpi_hest_generic_data_payload(gdata);
+		printk("%ssection type: %pUl\n", newpfx, sec_type);
+		printk("%ssection length: %u\n", newpfx,
+		       gdata->error_data_length);
+		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+			       unknown_err, gdata->error_data_length, 0);
+	}
 
 	return;
 
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 08/10] ras: acpi / apei: generate trace event for unrecognized CPER section
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:36   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:36 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

The UEFI spec allows for non-standard sections in a Common Platform
Error Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently, if a CPER section's type (UUID) does not match any section
type that the kernel knows how to parse, no trace event is generated
for that section. The user therefore has no way of knowing that such
a hardware error occurred, including errors reported in a
non-standard section.

This commit generates a trace event which contains the raw error data
for an unrecognized CPER section.
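
To give an idea of what the raw data field of such an event might look
like to a consumer, here is a user-space sketch that renders a byte
buffer as space-separated two-digit hex values. That layout is an
assumption about how the trace output is rendered and should be
checked against a real trace. format_hex_bytes() is a hypothetical
helper, not a kernel or tracing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Render len bytes as "xx xx xx ..." into out. Returns the length of
 * the resulting string; output stops at the last whole byte that fits.
 */
size_t format_hex_bytes(const unsigned char *buf, size_t len,
			char *out, size_t out_size)
{
	size_t pos = 0;
	size_t i;

	assert(out != NULL && out_size > 0);
	out[0] = '\0';
	for (i = 0; i < len; i++) {
		int n = snprintf(out + pos, out_size - pos, "%s%02x",
				 i ? " " : "", buf[i]);

		if (n < 0 || (size_t)n >= out_size - pos) {
			/* Out of room: drop the partially written byte. */
			out[pos] = '\0';
			break;
		}
		pos += (size_t)n;
	}
	return pos;
}
```

A consumer such as a RAS daemon could parse a string in this form back
into bytes and then decode it according to the section type UUID
carried in the same event.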

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/acpi/apei/ghes.c | 16 +++++++++++++++-
 drivers/ras/ras.c        |  1 +
 include/ras/ras_event.h  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 28f801c..c7fbbc1 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -49,6 +49,7 @@
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
+#include <ras/ras_event.h>
 
 #ifdef CONFIG_HAVE_ACPI_APEI_SEA
 #include <asm/system_misc.h>
@@ -458,12 +459,19 @@ static void ghes_do_proc(struct ghes *ghes,
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
 	uuid_le sec_type;
+	uuid_le *fru_id = &NULL_UUID_LE;
+	char *fru_text = "";
 
 	sev = ghes_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
 		sec_sev = ghes_severity(gdata->error_severity);
 		sec_type = *(uuid_le *)gdata->section_type;
 
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+			fru_id = (uuid_le *)gdata->fru_id;
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+			fru_text = gdata->fru_text;
+
 		if (!uuid_le_cmp(sec_type,
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
@@ -475,7 +483,7 @@ static void ghes_do_proc(struct ghes *ghes,
 			ghes_handle_memory_failure(gdata, sev);
 		}
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+		else if (!uuid_le_cmp(sec_type,
 				      CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
 
@@ -508,6 +516,12 @@ static void ghes_do_proc(struct ghes *ghes,
 
 		}
 #endif
+		else {
+			void *unknown_err = acpi_hest_generic_data_payload(gdata);
+			trace_unknown_sec_event(&sec_type,
+					fru_id, fru_text, sec_sev,
+					unknown_err, gdata->error_data_length);
+		}
 	}
 }
 
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index b67dd36..fb2500b 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -27,3 +27,4 @@ subsys_initcall(ras_init);
 EXPORT_TRACEPOINT_SYMBOL_GPL(extlog_mem_event);
 #endif
 EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 1791a12..5861b6f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,51 @@ TRACE_EVENT(mc_event,
 );
 
 /*
+ * Unknown Section Report
+ *
+ * This event is generated when hardware detected a hardware
+ * error event, which may be of non-standard section as defined
+ * in UEFI spec appendix "Common Platform Error Record", or may
+ * be of sections for which TRACE_EVENT is not defined.
+ *
+ */
+TRACE_EVENT(unknown_sec_event,
+
+	TP_PROTO(const uuid_le *sec_type,
+		 const uuid_le *fru_id,
+		 const char *fru_text,
+		 const u8 sev,
+		 const u8 *err,
+		 const u32 len),
+
+	TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
+
+	TP_STRUCT__entry(
+		__array(char, sec_type, 16)
+		__array(char, fru_id, 16)
+		__string(fru_text, fru_text)
+		__field(u8, sev)
+		__field(u32, len)
+		__dynamic_array(u8, buf, len)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->sec_type, sec_type, sizeof(uuid_le));
+		memcpy(__entry->fru_id, fru_id, sizeof(uuid_le));
+		__assign_str(fru_text, fru_text);
+		__entry->sev = sev;
+		__entry->len = len;
+		memcpy(__get_dynamic_array(buf), err, len);
+	),
+
+	TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",
+		  __entry->sev, __entry->sec_type,
+		  __entry->fru_id, __get_str(fru_text),
+		  __entry->len,
+		  __print_hex(__get_dynamic_array(buf), __entry->len))
+);
+
+/*
  * PCIe AER Trace event
  *
  * These events are generated when hardware detects a corrected or
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 09/10] trace, ras: add ARM processor error trace event
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:36   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:36 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

Currently there are trace events for the various RAS
errors with the exception of ARM processor type errors.
Add a new trace event for such errors so that the user
will know when they occur. These trace events are
consistent with the ARM processor error section type
defined in UEFI 2.6 spec section N.2.4.4.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
---
 drivers/acpi/apei/ghes.c    |  9 +++++++-
 drivers/firmware/efi/cper.c |  1 +
 drivers/ras/ras.c           |  1 +
 include/ras/ras_event.h     | 55 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c7fbbc1..1147e17 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -516,7 +516,14 @@ static void ghes_do_proc(struct ghes *ghes,
 
 		}
 #endif
-		else {
+		else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARMV8)) {
+			struct cper_sec_proc_armv8 *armv8_err;
+			struct cper_armv8_err_info *err_info;
+
+			armv8_err = acpi_hest_generic_data_payload(gdata);
+			err_info = (void *)(armv8_err + 1);
+			trace_arm_event(armv8_err, err_info);
+		} else {
 			void *unknown_err = acpi_hest_generic_data_payload(gdata);
 			trace_unknown_sec_event(&sec_type,
 					fru_id, fru_text, sec_sev,
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index bbb576e..0a0cd74 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -35,6 +35,7 @@
 #include <linux/printk.h>
 #include <linux/bcd.h>
 #include <acpi/ghes.h>
+#include <ras/ras_event.h>
 
 #define INDENT_SP	" "
 
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index fb2500b..8ba5a94 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -28,3 +28,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(extlog_mem_event);
 #endif
 EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
 EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(arm_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5861b6f..0060bba 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,61 @@ TRACE_EVENT(mc_event,
 );
 
 /*
+ * ARM Processor Events Report
+ *
+ * This event is generated when hardware detects an ARM processor error
+ * has occurred. UEFI 2.6 spec section N.2.4.4.
+ */
+TRACE_EVENT(arm_event,
+
+	TP_PROTO(const struct cper_sec_proc_armv8 *proc,
+		 struct cper_armv8_err_info *err_info),
+
+	TP_ARGS(proc, err_info),
+
+	TP_STRUCT__entry(
+		__field(u64, mpidr)
+		__field(u64, midr)
+		__field(u64, info)
+		__field(u64, virt_fault_addr)
+		__field(u64, phys_fault_addr)
+		__field(u32, running_state)
+		__field(u32, psci_state)
+		__field(u16, err_count)
+		__field(u8, affinity)
+		__field(u8, version)
+		__field(u8, type)
+		__field(u8, flags)
+	),
+
+	TP_fast_assign(
+		__entry->affinity = proc->affinity_level;
+		__entry->mpidr = proc->mpidr;
+		__entry->midr = proc->midr;
+		__entry->running_state = proc->running_state;
+		__entry->psci_state = proc->psci_state;
+		__entry->version = err_info->version;
+		__entry->type = err_info->type;
+		__entry->err_count = err_info->multiple_error;
+		__entry->flags = err_info->flags;
+		__entry->info = err_info->error_info;
+		__entry->virt_fault_addr = err_info->virt_fault_addr;
+		__entry->phys_fault_addr = err_info->physical_fault_addr;
+	),
+
+	TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
+		  "running state: %d; PSCI state: %d; version: %d; type: %d; "
+		  "error count: 0x%04x; flags: 0x%02x; info: %016llx; "
+		  "virtual fault address: %016llx; "
+		  "physical fault address: %016llx",
+		  __entry->affinity, __entry->mpidr, __entry->midr,
+		  __entry->running_state, __entry->psci_state, __entry->version,
+		  __entry->type, __entry->err_count, __entry->flags,
+		  __entry->info, __entry->virt_fault_addr,
+		  __entry->phys_fault_addr)
+);
+
+/*
  * Unknown Section Report
  *
  * This event is generated when hardware detected a hardware
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH V5 10/10] arm/arm64: KVM: add guest SEA support
  2016-11-21 22:35 ` Tyler Baicar
@ 2016-11-21 22:36   ` Tyler Baicar
  -1 siblings, 0 replies; 55+ messages in thread
From: Tyler Baicar @ 2016-11-21 22:36 UTC (permalink / raw)
  To: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, james.morse, akpm, eun.taik.lee,
	sandeepa.s.prabhu, shijie.huang, rruigrok, paul.gortmaker,
	tomasz.nowicki, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo
  Cc: Tyler Baicar

Currently external aborts are unsupported by the guest abort
handling. Add handling for SEAs so that the host kernel reports
SEAs which occur in the guest kernel.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
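The fault-status dispatch added to kvm_handle_guest_abort() can be
sketched in user space as below. The FSC_* values are the 32-bit ARM ones
from this patch; demo_handle_abort(), enum demo_action and the two
callbacks are hypothetical. Two assumptions are baked in: the unsupported
branch returns -EFAULT (the return statement lies outside the hunk), and a
successfully handled SEA is reported as its own outcome, since what the
real function does afterwards is also outside the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Fault status codes as defined for 32-bit ARM in this patch. */
#define FSC_FAULT	0x04
#define FSC_ACCESS	0x08
#define FSC_PERM	0x0c
#define FSC_EXTABT	0x10

enum demo_action { DEMO_STAGE2_FAULT, DEMO_SEA_HANDLED, DEMO_ERROR };

/* Hypothetical stand-in for handle_guest_sea(): 0 means handled. */
typedef int (*demo_sea_fn)(unsigned long addr, unsigned int esr);

int demo_handle_abort(unsigned int fault_status, unsigned long ipa,
		      unsigned int esr, demo_sea_fn handle_sea,
		      enum demo_action *action)
{
	assert(handle_sea != NULL && action != NULL);

	if (fault_status == FSC_EXTABT) {
		/* The host handles the SEA; nothing goes to the guest. */
		if (handle_sea(ipa, esr)) {
			*action = DEMO_ERROR;
			return -EFAULT;
		}
		*action = DEMO_SEA_HANDLED;
		return 0;
	}
	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
	    fault_status != FSC_ACCESS) {
		*action = DEMO_ERROR;	/* "Unsupported FSC" path */
		return -EFAULT;
	}
	*action = DEMO_STAGE2_FAULT;	/* normal stage-2 fault handling */
	return 0;
}

/* Mirrors the arm64 handler in this patch, which always returns 0. */
int demo_sea_ok(unsigned long addr, unsigned int esr)
{
	(void)addr; (void)esr;
	return 0;
}

/* Mirrors the 32-bit ARM dummy in this patch, which returns -1. */
int demo_sea_stub(unsigned long addr, unsigned int esr)
{
	(void)addr; (void)esr;
	return -1;
}
```

With the dummy handler, every guest SEA on 32-bit ARM ends in the
"Failed to handle guest SEA" path.
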
 arch/arm/include/asm/kvm_arm.h       |  1 +
 arch/arm/include/asm/system_misc.h   |  5 +++++
 arch/arm/kvm/mmu.c                   | 18 ++++++++++++++++--
 arch/arm64/include/asm/kvm_arm.h     |  1 +
 arch/arm64/include/asm/system_misc.h |  2 ++
 arch/arm64/mm/fault.c                | 13 +++++++++++++
 6 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index e22089f..33a77509 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -187,6 +187,7 @@
 #define FSC_FAULT	(0x04)
 #define FSC_ACCESS	(0x08)
 #define FSC_PERM	(0x0c)
+#define FSC_EXTABT	(0x10)
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~0xf)
diff --git a/arch/arm/include/asm/system_misc.h b/arch/arm/include/asm/system_misc.h
index a3d61ad..ea45d94 100644
--- a/arch/arm/include/asm/system_misc.h
+++ b/arch/arm/include/asm/system_misc.h
@@ -24,4 +24,9 @@ extern unsigned int user_debug;
 
 #endif /* !__ASSEMBLY__ */
 
+static inline int handle_guest_sea(unsigned long addr, unsigned int esr)
+{
+	return -1;
+}
+
 #endif /* __ASM_ARM_SYSTEM_MISC_H */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index e9a5c0e..1152966 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -29,6 +29,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/virt.h>
+#include <asm/system_misc.h>
 
 #include "trace.h"
 
@@ -1441,8 +1442,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 
 	/* Check the stage-2 fault is trans. fault or write fault */
 	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
-	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
-	    fault_status != FSC_ACCESS) {
+
+	/* The host kernel will handle the synchronous external abort. There
+	 * is no need to pass the error into the guest.
+	 */
+	if (fault_status == FSC_EXTABT) {
+		if (handle_guest_sea((unsigned long)fault_ipa,
+				    kvm_vcpu_get_hsr(vcpu))) {
+			kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
+				kvm_vcpu_trap_get_class(vcpu),
+				(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
+				(unsigned long)kvm_vcpu_get_hsr(vcpu));
+			return -EFAULT;
+		}
+	} else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
+		   fault_status != FSC_ACCESS) {
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
 			kvm_vcpu_trap_get_class(vcpu),
 			(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 4b5c977..be0efb6 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -201,6 +201,7 @@
 #define FSC_FAULT	ESR_ELx_FSC_FAULT
 #define FSC_ACCESS	ESR_ELx_FSC_ACCESS
 #define FSC_PERM	ESR_ELx_FSC_PERM
+#define FSC_EXTABT	ESR_ELx_FSC_EXTABT
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~UL(0xf))
diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 9040e1d..3a142a5 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -77,4 +77,6 @@ extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
 int register_synchronous_ext_abort_notifier(struct notifier_block *nb);
 void unregister_synchronous_ext_abort_notifier(struct notifier_block *nb);
 
+int handle_guest_sea(unsigned long addr, unsigned int esr);
+
 #endif	/* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index fcc49f1..691399e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -597,6 +597,19 @@ static const char *fault_name(unsigned int esr)
 }
 
 /*
+ * Handle Synchronous External Aborts that occur in a guest kernel.
+ */
+int handle_guest_sea(unsigned long addr, unsigned int esr)
+{
+	atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
+
+	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+		fault_name(esr), esr, addr);
+
+	return 0;
+}
+
+/*
  * Dispatch a data abort to the relevant handler.
  */
 asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
  2016-11-21 22:35 ` Tyler Baicar
  (?)
  (?)
@ 2016-11-22 11:11     ` John Garry
  -1 siblings, 0 replies; 55+ messages in thread
From: John Garry @ 2016-11-22 11:11 UTC (permalink / raw)
  To: Tyler Baicar, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, shijie.huang, rruigrok,
	paul.gortmaker, tomasz.nowicki, fu.wei, rostedt, bristot,
	linux-arm-kernel, kvmarm, kvm, linux-kernel

+

We'll try and test this on our platform.

Cheers,
John

On 21/11/2016 22:35, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is an uncorrectable error, then the firmware will notify the OS
> immediately since the error needs to be handled ASAP. The OS will then be able
> to take the appropriate action needed such as offlining a page. If the error
> is a corrected error, then the firmware will not interrupt the OS immediately.
> Instead, the OS will see and report the error the next time its GHES timer
> expires. The kernel will first parse the GHES structures and report the errors
> through the kernel logs and then notify the user space through RAS trace
> events. This allows user space applications such as RAS Daemon to see the
> errors and report them however the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
>
> An example flow from firmware to user space could be:
>
>                  +---------------+
>        +-------->|               |
>        |         |  GHES polling |--+
> +-------------+  |    source     |  |   +---------------+   +------------+
> |             |  +---------------+  |   |  Kernel GHES  |   |            |
> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
> +-------------+  |               |  |   +---------------+   +------------+
>        |         |  GHES sci     |--+
>        +-------->|   source      |
>                  +---------------+
>
> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
> capability for the OS to acknowledge the consumption of the error record
> generated by the Reliability, Availability and Serviceability (RAS) controller.
> This eliminates potential race conditions between the OS and the RAS controller.
>
> Add support for the timestamp field added to the Generic Error Data Entry v3,
> allowing the OS to log the time that the error is generated by the firmware,
> rather than the time the error is consumed. This improves the correctness of
> event sequences when analyzing error logs. The timestamp is added in
> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
>
> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
> specification. ARMv8 specific processor error information is reported as part of
> the CPER records. This provides more detail for processor error logs and can
> help describe ARMv8 cache, TLB, and bus errors.
>
> Synchronous External Abort (SEA) represents a specific processor error condition
> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
> added to parse and report the errors before the process is killed. Refer to
> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
> specification.
>
> Currently the kernel ignores CPER records that are unrecognized.
> However, the UEFI spec allows non-standard (e.g. vendor
> proprietary) error section types in CPER (Common Platform Error Record),
> as defined in section N.2.3 of UEFI version 2.5. As a result, the user
> cannot see hardware error data from non-standard sections.
>
> If the Section Type field of a Generic Error Data Entry is unrecognized,
> print the raw data to the dmesg buffer and add a tracepoint
> for reporting such hardware errors.
>
> Currently, even if an error status block's severity is fatal, the kernel
> does not honor the severity level and does not panic. With the firmware first
> model, the platform could inform the OS about a fatal hardware error
> through the non-NMI GHES notification type. The OS should panic when a
> hardware error record is received with this severity.
>
> Add support to handle SEAs that occur while a KVM guest kernel is
> running. Currently these are unsupported by the guest abort handling.
>
> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>             https://lkml.org/lkml/2016/8/10/231
>
> V5: Fix GHES goto logic for error conditions
>     Change ghes_do_read_ack to ghes_ack_error
>     Make sure data version check is >= 3
>     Use CPER helper functions in print functions
>     Make handle_guest_sea() dummy function static for arm
>     Add arm to subject line for KVM patch
>
> V4: Add bit offset left shift to read_ack_write value
>     Make HEST generic and generic_v2 structures a union in the ghes structure
>     Move gdata v3 helper functions into ghes.h to avoid duplication
>     Reorder the timestamp print and avoid memcpy
>     Add helper functions for gdata size checking
>     Rename the SEA functions
>     Add helper function for GHES panics
>     Set fru_id to NULL UUID at variable declaration
>     Limit ARM trace event parameters to the needed structures
>     Reorder the ARM trace event variables to save space
>     Add comment for why we don't pass SEAs to the guest when it aborts
>     Move ARM trace event call into GHES driver instead of CPER
>
> V3: Fix unmapped address to the read_ack_register in ghes.c
>     Add helper function to get the proper payload based on generic data entry
>      version
>     Move timestamp print to avoid changing function calls in cper.c
>     Remove patch "arm64: exception: handle instruction abort at current EL"
>      since the el1_ia handler is already added in 4.8
>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>     Add a new trace event for ARM type errors
>     Add support to handle KVM guest SEAs
>
> V2: Add PSCI state print for the ARMv8 error type.
>     Separate timestamp year into year and century using BCD format.
>     Rebase on top of ACPICA 20160318 release and remove header file changes
>      in include/acpi/actbl1.h.
>     Add panic OS with fatal error status block patch.
>     Add processing of unrecognized CPER error section patches with updates
>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>
> V1: https://lkml.org/lkml/2016/2/5/544
>
> Jonathan (Zhixiong) Zhang (1):
>   acpi: apei: panic OS with fatal error status block
>
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARMv8 processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support
>
>  arch/arm/include/asm/kvm_arm.h       |   1 +
>  arch/arm/include/asm/system_misc.h   |   5 +
>  arch/arm/kvm/mmu.c                   |  18 ++-
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>  arch/arm64/include/asm/system_misc.h |  15 +++
>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>  drivers/acpi/apei/Kconfig            |  14 +++
>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c             |   7 +-
>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>  drivers/ras/ras.c                    |   2 +
>  include/acpi/ghes.h                  |  15 ++-
>  include/linux/cper.h                 |  84 ++++++++++++++
>  include/ras/ras_event.h              | 100 +++++++++++++++++
>  15 files changed, 688 insertions(+), 44 deletions(-)
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
@ 2016-11-22 11:11     ` John Garry
  0 siblings, 0 replies; 55+ messages in thread
From: John Garry @ 2016-11-22 11:11 UTC (permalink / raw)
  To: Tyler Baicar, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, shijie.huang, rruigrok,
	paul.gortmaker, tomasz.nowicki, fu.wei, rostedt, bristot,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
	linux-efi, Suzuki.Poulose, punit.agrawal, astone, harba,
	hanjun.guo, Shiju Jose, Linuxarm, Anurup M

+

We'll try and test this on our platform.

Cheers,
John

On 21/11/2016 22:35, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is an uncorrectable error, then the firmware will notify the OS
> immediately since the error needs to be handled ASAP. The OS will then be able
> to take the appropriate action needed such as offlining a page. If the error
> is a corrected error, then the firmware will not interrupt the OS immediately.
> Instead, the OS will see and report the error the next time its GHES timer
> expires. The kernel will first parse the GHES structures and report the errors
> through the kernel logs and then notify the user space through RAS trace
> events. This allows user space applications such as RAS Daemon to see the
> errors and report them however the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
>
> An example flow from firmware to user space could be:
>
>                  +---------------+
>        +-------->|               |
>        |         |  GHES polling |--+
> +-------------+  |    source     |  |   +---------------+   +------------+
> |             |  +---------------+  |   |  Kernel GHES  |   |            |
> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
> +-------------+  |               |  |   +---------------+   +------------+
>        |         |  GHES sci     |--+
>        +-------->|   source      |
>                  +---------------+
>
> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
> capability for the OS to acknowledge the consumption of the error record
> generated by the Reliability, Availability and Serviceability (RAS) controller.
> This eliminates potential race conditions between the OS and the RAS controller.
>
> Add support for the timestamp field added to the Generic Error Data Entry v3,
> allowing the OS to log the time that the error is generated by the firmware,
> rather than the time the error is consumed. This improves the correctness of
> event sequences when analyzing error logs. The timestamp is added in
> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
>
> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
> specification. ARMv8 specific processor error information is reported as part of
> the CPER records. This provides more detail for processor error logs and can
> help describe ARMv8 cache, TLB, and bus errors.
>
> Synchronous External Abort (SEA) represents a specific processor error condition
> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
> added to parse and report the errors before the process is killed. Refer to
> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
> specification.
>
> Currently the kernel ignores CPER records that are unrecognized.
> However, the UEFI spec allows non-standard (e.g. vendor
> proprietary) error section types in CPER (Common Platform Error Record),
> as defined in section N.2.3 of UEFI version 2.5. As a result, the user
> cannot see hardware error data from non-standard sections.
>
> If the Section Type field of a Generic Error Data Entry is unrecognized,
> print the raw data to the dmesg buffer and add a tracepoint
> for reporting such hardware errors.
>
> Currently, even if an error status block's severity is fatal, the kernel
> does not honor the severity level and does not panic. With the firmware first
> model, the platform could inform the OS about a fatal hardware error
> through the non-NMI GHES notification type. The OS should panic when a
> hardware error record is received with this severity.
>
> Add support to handle SEAs that occur while a KVM guest kernel is
> running. Currently these are unsupported by the guest abort handling.
>
> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>             https://lkml.org/lkml/2016/8/10/231
>
> V5: Fix GHES goto logic for error conditions
>     Change ghes_do_read_ack to ghes_ack_error
>     Make sure data version check is >= 3
>     Use CPER helper functions in print functions
>     Make handle_guest_sea() dummy function static for arm
>     Add arm to subject line for KVM patch
>
> V4: Add bit offset left shift to read_ack_write value
>     Make HEST generic and generic_v2 structures a union in the ghes structure
>     Move gdata v3 helper functions into ghes.h to avoid duplication
>     Reorder the timestamp print and avoid memcpy
>     Add helper functions for gdata size checking
>     Rename the SEA functions
>     Add helper function for GHES panics
>     Set fru_id to NULL UUID at variable declaration
>     Limit ARM trace event parameters to the needed structures
>     Reorder the ARM trace event variables to save space
>     Add comment for why we don't pass SEAs to the guest when it aborts
>     Move ARM trace event call into GHES driver instead of CPER
>
> V3: Fix unmapped address to the read_ack_register in ghes.c
>     Add helper function to get the proper payload based on generic data entry
>      version
>     Move timestamp print to avoid changing function calls in cper.c
>     Remove patch "arm64: exception: handle instruction abort at current EL"
>      since the el1_ia handler is already added in 4.8
>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>     Add a new trace event for ARM type errors
>     Add support to handle KVM guest SEAs
>
> V2: Add PSCI state print for the ARMv8 error type.
>     Separate timestamp year into year and century using BCD format.
>     Rebase on top of ACPICA 20160318 release and remove header file changes
>      in include/acpi/actbl1.h.
>     Add panic OS with fatal error status block patch.
>     Add processing of unrecognized CPER error section patches with updates
>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>
> V1: https://lkml.org/lkml/2016/2/5/544
>
> Jonathan (Zhixiong) Zhang (1):
>   acpi: apei: panic OS with fatal error status block
>
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARMv8 processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support
>
>  arch/arm/include/asm/kvm_arm.h       |   1 +
>  arch/arm/include/asm/system_misc.h   |   5 +
>  arch/arm/kvm/mmu.c                   |  18 ++-
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>  arch/arm64/include/asm/system_misc.h |  15 +++
>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>  drivers/acpi/apei/Kconfig            |  14 +++
>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c             |   7 +-
>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>  drivers/ras/ras.c                    |   2 +
>  include/acpi/ghes.h                  |  15 ++-
>  include/linux/cper.h                 |  84 ++++++++++++++
>  include/ras/ras_event.h              | 100 +++++++++++++++++
>  15 files changed, 688 insertions(+), 44 deletions(-)
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
@ 2016-11-22 11:11     ` John Garry
  0 siblings, 0 replies; 55+ messages in thread
From: John Garry @ 2016-11-22 11:11 UTC (permalink / raw)
  To: Tyler Baicar, marc.zyngier-5wv7dgnIgG8,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, rkrcmar-H+wXaHxf7aLQT0dZR+AlfA,
	linux-I+IVW8TIWO2tmTQ+vhA3Yw, catalin.marinas-5wv7dgnIgG8,
	will.deacon-5wv7dgnIgG8, rjw-LthD3rsA81gm4RdzfppkhA,
	lenb-DgEjT+Ai2ygdnm+yROfE0A,
	matt-mF/unelCI9GS6iBeEJttW/XRex20P6io,
	robert.moore-ral2JQCrhuEAvxtiuMwx3w,
	lv.zheng-ral2JQCrhuEAvxtiuMwx3w, nkaje-sgV2jX0FEOL9JmXXK+q4OQ,
	zjzhang-sgV2jX0FEOL9JmXXK+q4OQ, mark.rutland-5wv7dgnIgG8,
	james.morse-5wv7dgnIgG8, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	eun.taik.lee-Sze3O3UU22JBDgjK7y7TUQ,
	sandeepa.s.prabhu-Re5JQEeQqe8AvxtiuMwx3w,
	shijie.huang-5wv7dgnIgG8, rruigrok-sgV2jX0FEOL9JmXXK+q4OQ,
	paul.gortmaker-CWA4WttNNZF54TAoqtyWWQ,
	tomasz.nowicki-QSEj5FYQhm4dnm+yROfE0A,
	fu.wei-QSEj5FYQhm4dnm+yROfE0A, rostedt-nx8X9YLhiw1AfugRpC6u6w,
	bristot-H+wXaHxf7aLQT0dZR+AlfA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg,
	kvm-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TbrhsbdSgBK9A

+

We'll try and test this on our platform.

Cheers,
John

On 21/11/2016 22:35, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is an uncorrectable error, then the firmware will notify the OS
> immediately since the error needs to be handled ASAP. The OS will then be able
> to take the appropriate action needed such as offlining a page. If the error
> is a corrected error, then the firmware will not interrupt the OS immediately.
> Instead, the OS will see and report the error the next time its GHES timer
> expires. The kernel will first parse the GHES structures and report the errors
> through the kernel logs and then notify the user space through RAS trace
> events. This allows user space applications such as RAS Daemon to see the
> errors and report them however the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
>
> An example flow from firmware to user space could be:
>
>                  +---------------+
>        +-------->|               |
>        |         |  GHES polling |--+
> +-------------+  |    source     |  |   +---------------+   +------------+
> |             |  +---------------+  |   |  Kernel GHES  |   |            |
> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
> +-------------+  |               |  |   +---------------+   +------------+
>        |         |  GHES sci     |--+
>        +-------->|   source      |
>                  +---------------+
>
> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
> capability for the OS to acknowledge the consumption of the error record
> generated by the Reliability, Availability and Serviceability (RAS) controller.
> This eliminates potential race conditions between the OS and the RAS controller.
>
> Add support for the timestamp field added to the Generic Error Data Entry v3,
> allowing the OS to log the time that the error is generated by the firmware,
> rather than the time the error is consumed. This improves the correctness of
> event sequences when analyzing error logs. The timestamp is added in
> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
>
> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
> specification. ARMv8 specific processor error information is reported as part of
> the CPER records.  This provides more detail for processor error logs. This
> can help describe ARMv8 cache, TLB, and bus errors.
>
> Synchronous External Abort (SEA) represents a specific processor error condition
> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
> added to parse and report the errors before the process is killed. Refer to
> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
> specification.
>
> Currently the kernel ignores CPER records that are unrecognized.
> However, the UEFI spec allows for non-standard (e.g. vendor
> proprietary) error section types in CPER (Common Platform Error Record),
> as defined in section N.2.3 of UEFI version 2.5. As a result, the user
> is not able to see hardware error data from non-standard sections.
>
> If the Section Type field of a Generic Error Data Entry is unrecognized,
> print out the raw data in the dmesg buffer, and also add a tracepoint
> for reporting such hardware errors.
>
> Currently, even if an error status block's severity is fatal, the kernel
> does not honor the severity level and does not panic. With the firmware first
> model, the platform could inform the OS about a fatal hardware error
> through the non-NMI GHES notification type. The OS should panic when a
> hardware error record is received with this severity.
>
> Add support to handle SEAs that occur while a KVM guest kernel is
> running. Currently these are unsupported by the guest abort handling.
>
> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>             https://lkml.org/lkml/2016/8/10/231
>
> V5: Fix GHES goto logic for error conditions
>     Change ghes_do_read_ack to ghes_ack_error
>     Make sure data version check is >= 3
>     Use CPER helper functions in print functions
>     Make handle_guest_sea() dummy function static for arm
>     Add arm to subject line for KVM patch
>
> V4: Add bit offset left shift to read_ack_write value
>     Make HEST generic and generic_v2 structures a union in the ghes structure
>     Move gdata v3 helper functions into ghes.h to avoid duplication
>     Reorder the timestamp print and avoid memcpy
>     Add helper functions for gdata size checking
>     Rename the SEA functions
>     Add helper function for GHES panics
>     Set fru_id to NULL UUID at variable declaration
>     Limit ARM trace event parameters to the needed structures
>     Reorder the ARM trace event variables to save space
>     Add comment for why we don't pass SEAs to the guest when it aborts
>     Move ARM trace event call into GHES driver instead of CPER
>
> V3: Fix unmapped address to the read_ack_register in ghes.c
>     Add helper function to get the proper payload based on generic data entry
>      version
>     Move timestamp print to avoid changing function calls in cper.c
>     Remove patch "arm64: exception: handle instruction abort at current EL"
>      since the el1_ia handler is already added in 4.8
>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>     Add a new trace event for ARM type errors
>     Add support to handle KVM guest SEAs
>
> V2: Add PSCI state print for the ARMv8 error type.
>     Separate timestamp year into year and century using BCD format.
>     Rebase on top of ACPICA 20160318 release and remove header file changes
>      in include/acpi/actbl1.h.
>     Add panic OS with fatal error status block patch.
>     Add processing of unrecognized CPER error section patches with updates
>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>
> V1: https://lkml.org/lkml/2016/2/5/544
>
> Jonathan (Zhixiong) Zhang (1):
>   acpi: apei: panic OS with fatal error status block
>
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARMv8 processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support
>
>  arch/arm/include/asm/kvm_arm.h       |   1 +
>  arch/arm/include/asm/system_misc.h   |   5 +
>  arch/arm/kvm/mmu.c                   |  18 ++-
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>  arch/arm64/include/asm/system_misc.h |  15 +++
>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>  drivers/acpi/apei/Kconfig            |  14 +++
>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c             |   7 +-
>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>  drivers/ras/ras.c                    |   2 +
>  include/acpi/ghes.h                  |  15 ++-
>  include/linux/cper.h                 |  84 ++++++++++++++
>  include/ras/ras_event.h              | 100 +++++++++++++++++
>  15 files changed, 688 insertions(+), 44 deletions(-)
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
  2016-11-22 11:11     ` John Garry
  (?)
@ 2016-11-22 17:13       ` Baicar, Tyler
  -1 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-22 17:13 UTC (permalink / raw)
  To: John Garry, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, shijie.huang, rruigrok,
	paul.gortmaker, tomasz.nowicki, fu.wei, rostedt, bristot,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
	linux-efi, Suzuki.Po

Thank you John! Let me know how it goes and if you have any questions :)

Tyler

On 11/22/2016 4:11 AM, John Garry wrote:
> +
>
> We'll try and test this on our platform.
>
> Cheers,
> John
>
> On 21/11/2016 22:35, Tyler Baicar wrote:
>> When a memory error, CPU error, PCIe error, or other type of hardware error
>> that's covered by RAS occurs, firmware should populate the shared GHES memory
>> location with the proper GHES structures to notify the OS of the error.
>> For example, platforms that implement firmware first handling may implement
>> separate GHES sources for corrected errors and uncorrected errors. If the
>> error is an uncorrectable error, then the firmware will notify the OS
>> immediately since the error needs to be handled ASAP. The OS will then be able
>> to take the appropriate action needed such as offlining a page. If the error
>> is a corrected error, then the firmware will not interrupt the OS immediately.
>> Instead, the OS will see and report the error the next time its GHES timer
>> expires. The kernel will first parse the GHES structures and report the errors
>> through the kernel logs and then notify the user space through RAS trace
>> events. This allows user space applications such as RAS Daemon to see the
>> errors and report them however the user desires. This patchset extends the
>> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
>> ACPI 6.1 specifications.
>>
>> An example flow from firmware to user space could be:
>>
>>                  +---------------+
>>        +-------->|               |
>>        |         |  GHES polling |--+
>> +-------------+  |    source     |  |   +---------------+   +------------+
>> |             |  +---------------+  |   |  Kernel GHES  |   |            |
>> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
>> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
>> +-------------+  |               |  |   +---------------+   +------------+
>>        |         |  GHES sci     |--+
>>        +-------->|   source      |
>>                  +---------------+
>>
>> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
>> capability for the OS to acknowledge the consumption of the error record
>> generated by the Reliability, Availability and Serviceability (RAS) controller.
>> This eliminates potential race conditions between the OS and the RAS controller.
>>
>> Add support for the timestamp field added to the Generic Error Data Entry v3,
>> allowing the OS to log the time that the error is generated by the firmware,
>> rather than the time the error is consumed. This improves the correctness of
>> event sequences when analyzing error logs. The timestamp is added in
>> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
>>
>> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
>> specification. ARMv8 specific processor error information is reported as part of
>> the CPER records.  This provides more detail for processor error logs. This
>> can help describe ARMv8 cache, TLB, and bus errors.
>>
>> Synchronous External Abort (SEA) represents a specific processor error condition
>> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
>> added to parse and report the errors before the process is killed. Refer to
>> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
>> specification.
>>
>> Currently the kernel ignores CPER records that are unrecognized.
>> However, the UEFI spec allows for non-standard (e.g. vendor
>> proprietary) error section types in CPER (Common Platform Error Record),
>> as defined in section N.2.3 of UEFI version 2.5. As a result, the user
>> is not able to see hardware error data from non-standard sections.
>>
>> If the Section Type field of a Generic Error Data Entry is unrecognized,
>> print out the raw data in the dmesg buffer, and also add a tracepoint
>> for reporting such hardware errors.
>>
>> Currently, even if an error status block's severity is fatal, the kernel
>> does not honor the severity level and does not panic. With the firmware first
>> model, the platform could inform the OS about a fatal hardware error
>> through the non-NMI GHES notification type. The OS should panic when a
>> hardware error record is received with this severity.
>>
>> Add support to handle SEAs that occur while a KVM guest kernel is
>> running. Currently these are unsupported by the guest abort handling.
>>
>> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>>             https://lkml.org/lkml/2016/8/10/231
>>
>> V5: Fix GHES goto logic for error conditions
>>     Change ghes_do_read_ack to ghes_ack_error
>>     Make sure data version check is >= 3
>>     Use CPER helper functions in print functions
>>     Make handle_guest_sea() dummy function static for arm
>>     Add arm to subject line for KVM patch
>>
>> V4: Add bit offset left shift to read_ack_write value
>>     Make HEST generic and generic_v2 structures a union in the ghes structure
>>     Move gdata v3 helper functions into ghes.h to avoid duplication
>>     Reorder the timestamp print and avoid memcpy
>>     Add helper functions for gdata size checking
>>     Rename the SEA functions
>>     Add helper function for GHES panics
>>     Set fru_id to NULL UUID at variable declaration
>>     Limit ARM trace event parameters to the needed structures
>>     Reorder the ARM trace event variables to save space
>>     Add comment for why we don't pass SEAs to the guest when it aborts
>>     Move ARM trace event call into GHES driver instead of CPER
>>
>> V3: Fix unmapped address to the read_ack_register in ghes.c
>>     Add helper function to get the proper payload based on generic data entry
>>      version
>>     Move timestamp print to avoid changing function calls in cper.c
>>     Remove patch "arm64: exception: handle instruction abort at current EL"
>>      since the el1_ia handler is already added in 4.8
>>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>>     Add a new trace event for ARM type errors
>>     Add support to handle KVM guest SEAs
>>
>> V2: Add PSCI state print for the ARMv8 error type.
>>     Separate timestamp year into year and century using BCD format.
>>     Rebase on top of ACPICA 20160318 release and remove header file changes
>>      in include/acpi/actbl1.h.
>>     Add panic OS with fatal error status block patch.
>>     Add processing of unrecognized CPER error section patches with updates
>>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>>
>> V1: https://lkml.org/lkml/2016/2/5/544
>>
>> Jonathan (Zhixiong) Zhang (1):
>>   acpi: apei: panic OS with fatal error status block
>>
>> Tyler Baicar (9):
>>   acpi: apei: read ack upon ghes record consumption
>>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>>   efi: parse ARMv8 processor error
>>   arm64: exception: handle Synchronous External Abort
>>   acpi: apei: handle SEA notification type for ARMv8
>>   efi: print unrecognized CPER section
>>   ras: acpi / apei: generate trace event for unrecognized CPER section
>>   trace, ras: add ARM processor error trace event
>>   arm/arm64: KVM: add guest SEA support
>>
>>  arch/arm/include/asm/kvm_arm.h       |   1 +
>>  arch/arm/include/asm/system_misc.h   |   5 +
>>  arch/arm/kvm/mmu.c                   |  18 ++-
>>  arch/arm64/Kconfig                   |   1 +
>>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>>  arch/arm64/include/asm/system_misc.h |  15 +++
>>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>>  drivers/acpi/apei/Kconfig            |  14 +++
>>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>>  drivers/acpi/apei/hest.c             |   7 +-
>>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>>  drivers/ras/ras.c                    |   2 +
>>  include/acpi/ghes.h                  |  15 ++-
>>  include/linux/cper.h                 |  84 ++++++++++++++
>>  include/ras/ras_event.h              | 100 +++++++++++++++++
>>  15 files changed, 688 insertions(+), 44 deletions(-)
>>
>
>

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
@ 2016-11-22 17:13       ` Baicar, Tyler
  0 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-22 17:13 UTC (permalink / raw)
  To: John Garry, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, shijie.huang, rruigrok,
	paul.gortmaker, tomasz.nowicki, fu.wei, rostedt, bristot,
	linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
	linux-efi, Suzuki.Poulose, punit.agrawal, astone, harba,
	hanjun.guo, Shiju Jose, Linuxarm, Anurup M

Thank you John! Let me know how it goes and if you have any questions :)

Tyler

On 11/22/2016 4:11 AM, John Garry wrote:
> +
>
> We'll try and test this on our platform.
>
> Cheers,
> John
>
> On 21/11/2016 22:35, Tyler Baicar wrote:
>> When a memory error, CPU error, PCIe error, or other type of hardware error
>> that's covered by RAS occurs, firmware should populate the shared GHES memory
>> location with the proper GHES structures to notify the OS of the error.
>> For example, platforms that implement firmware first handling may implement
>> separate GHES sources for corrected errors and uncorrected errors. If the
>> error is an uncorrectable error, then the firmware will notify the OS
>> immediately since the error needs to be handled ASAP. The OS will then be able
>> to take the appropriate action needed such as offlining a page. If the error
>> is a corrected error, then the firmware will not interrupt the OS immediately.
>> Instead, the OS will see and report the error the next time its GHES timer
>> expires. The kernel will first parse the GHES structures and report the errors
>> through the kernel logs and then notify the user space through RAS trace
>> events. This allows user space applications such as RAS Daemon to see the
>> errors and report them however the user desires. This patchset extends the
>> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
>> ACPI 6.1 specifications.
>>
>> An example flow from firmware to user space could be:
>>
>>                  +---------------+
>>        +-------->|               |
>>        |         |  GHES polling |--+
>> +-------------+  |    source     |  |   +---------------+   +------------+
>> |             |  +---------------+  |   |  Kernel GHES  |   |            |
>> |  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
>> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
>> +-------------+  |               |  |   +---------------+   +------------+
>>        |         |  GHES sci     |--+
>>        +-------->|   source      |
>>                  +---------------+
>>
>> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
>> capability for the OS to acknowledge the consumption of the error record
>> generated by the Reliability, Availability and Serviceability (RAS) controller.
>> This eliminates potential race conditions between the OS and the RAS controller.
>>
>> Add support for the timestamp field added to the Generic Error Data Entry v3,
>> allowing the OS to log the time that the error is generated by the firmware,
>> rather than the time the error is consumed. This improves the correctness of
>> event sequences when analyzing error logs. The timestamp is added in
>> ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
>>
>> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
>> specification. ARMv8 specific processor error information is reported as part of
>> the CPER records. This provides more detail for processor error logs. This
>> can help describe ARMv8 cache, TLB, and bus errors.
>>
>> Synchronous External Abort (SEA) represents a specific processor error condition
>> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
>> added to parse and report the errors before the process is killed. Refer to
>> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
>> specification.
>>
>> Currently the kernel ignores CPER records that are unrecognized.
>> However, the UEFI spec allows for non-standard (e.g. vendor
>> proprietary) error section types in CPER (Common Platform Error Record),
>> as defined in section N2.3 of UEFI version 2.5. As a result, the user
>> is not able to see hardware error data from non-standard sections.
>>
>> If the section Type field of a Generic Error Data Entry is unrecognized,
>> print out the raw data in the dmesg buffer, and also add a tracepoint
>> for reporting such hardware errors.
>>
>> Currently, even if an error status block's severity is fatal, the kernel
>> does not honor the severity level by panicking. With the firmware first
>> model, the platform could inform the OS about a fatal hardware error
>> through the non-NMI GHES notification type. The OS should panic when a
>> hardware error record is received with this severity.
>>
>> Add support to handle SEAs that occur while a KVM guest kernel is
>> running. Currently these are unsupported by the guest abort handling.
>>
>> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>>             https://lkml.org/lkml/2016/8/10/231
>>
>> V5: Fix GHES goto logic for error conditions
>>     Change ghes_do_read_ack to ghes_ack_error
>>     Make sure data version check is >= 3
>>     Use CPER helper functions in print functions
>>     Make handle_guest_sea() dummy function static for arm
>>     Add arm to subject line for KVM patch
>>
>> V4: Add bit offset left shift to read_ack_write value
>>     Make HEST generic and generic_v2 structures a union in the ghes structure
>>     Move gdata v3 helper functions into ghes.h to avoid duplication
>>     Reorder the timestamp print and avoid memcpy
>>     Add helper functions for gdata size checking
>>     Rename the SEA functions
>>     Add helper function for GHES panics
>>     Set fru_id to NULL UUID at variable declaration
>>     Limit ARM trace event parameters to the needed structures
>>     Reorder the ARM trace event variables to save space
>>     Add comment for why we don't pass SEAs to the guest when it aborts
>>     Move ARM trace event call into GHES driver instead of CPER
>>
>> V3: Fix unmapped address to the read_ack_register in ghes.c
>>     Add helper function to get the proper payload based on generic data entry
>>      version
>>     Move timestamp print to avoid changing function calls in cper.c
>>     Remove patch "arm64: exception: handle instruction abort at current EL"
>>      since the el1_ia handler is already added in 4.8
>>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>>     Add a new trace event for ARM type errors
>>     Add support to handle KVM guest SEAs
>>
>> V2: Add PSCI state print for the ARMv8 error type.
>>     Separate timestamp year into year and century using BCD format.
>>     Rebase on top of ACPICA 20160318 release and remove header file changes
>>      in include/acpi/actbl1.h.
>>     Add panic OS with fatal error status block patch.
>>     Add processing of unrecognized CPER error section patches with updates
>>      from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>>
>> V1: https://lkml.org/lkml/2016/2/5/544
>>
>> Jonathan (Zhixiong) Zhang (1):
>>   acpi: apei: panic OS with fatal error status block
>>
>> Tyler Baicar (9):
>>   acpi: apei: read ack upon ghes record consumption
>>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>>   efi: parse ARMv8 processor error
>>   arm64: exception: handle Synchronous External Abort
>>   acpi: apei: handle SEA notification type for ARMv8
>>   efi: print unrecognized CPER section
>>   ras: acpi / apei: generate trace event for unrecognized CPER section
>>   trace, ras: add ARM processor error trace event
>>   arm/arm64: KVM: add guest SEA support
>>
>>  arch/arm/include/asm/kvm_arm.h       |   1 +
>>  arch/arm/include/asm/system_misc.h   |   5 +
>>  arch/arm/kvm/mmu.c                   |  18 ++-
>>  arch/arm64/Kconfig                   |   1 +
>>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>>  arch/arm64/include/asm/system_misc.h |  15 +++
>>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>>  drivers/acpi/apei/Kconfig            |  14 +++
>>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>>  drivers/acpi/apei/hest.c             |   7 +-
>>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>>  drivers/ras/ras.c                    |   2 +
>>  include/acpi/ghes.h                  |  15 ++-
>>  include/linux/cper.h                 |  84 ++++++++++++++
>>  include/ras/ras_event.h              | 100 +++++++++++++++++
>>  15 files changed, 688 insertions(+), 44 deletions(-)
>>
>
>

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption
  2016-11-21 22:35   ` Tyler Baicar
  (?)
  (?)
@ 2016-11-25 18:19       ` James Morse
  -1 siblings, 0 replies; 55+ messages in thread
From: James Morse @ 2016-11-25 18:19 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone

Hi Tyler,

On 21/11/16 22:35, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple error records,
> then they may be overwritten before the OS has consumed them.
> 
> The Generic Hardware Error Source (GHES) v2 structure
> introduces the capability for the OS to acknowledge the
> consumption of the error record generated by the RAS
> controller. A RAS controller supporting GHESv2 shall wait for
> the acknowledgment before writing a new error record, thus
> eliminating the race condition.

This patch also adds support for parsing GHESv2 sub-tables.
Previously they would be rejected as an unknown hardware error source.


> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>

Nit: the patch author's Sign-off should come first. You either need a 'From:
Jonathan (Zhixiong) Zhang ...' on this patch, or to re-order these Signed-off-by's.


> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c |  7 +++++--
>  include/acpi/ghes.h      |  5 ++++-
>  3 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 60746ef..b79abc5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -45,6 +45,7 @@
>  #include <linux/aer.h>
>  #include <linux/nmi.h>
>  
> +#include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
>  #include <acpi/apei.h>
>  #include <asm/tlbflush.h>
> @@ -79,6 +80,10 @@
>  	((struct acpi_hest_generic_status *)				\
>  	 ((struct ghes_estatus_node *)(estatus_node) + 1))
>  
> +#define HEST_TYPE_GENERIC_V2(ghes)				\
> +	((struct acpi_hest_header *)ghes->generic)->type ==	\
> +	 ACPI_HEST_TYPE_GENERIC_ERROR_V2
> +

IS_HEST_TYPE_GENERIC_V2()? (for the sake of readability)
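
For illustration, a minimal sketch of what the renamed predicate could look
like. The types (fake_hest_header, fake_ghes) and the constant below are
hypothetical stand-ins, not the ACPICA definitions. Beyond the IS_ prefix, the
sketch also wraps the argument and the whole expression in parentheses, which
is an addition of this sketch rather than part of the suggestion above: without
the outer parentheses, a negated use such as !HEST_TYPE_GENERIC_V2(ghes) would
apply the '!' to the type field before the comparison.

```c
#include <stdint.h>

/* Stand-in for the HEST type value; illustrative only, not taken from ACPICA. */
#define FAKE_HEST_TYPE_GENERIC_ERROR_V2 10

/* Hypothetical, simplified header: only the field the macro looks at. */
struct fake_hest_header {
	uint16_t type;
	uint16_t source_id;
};

/* Hypothetical, simplified ghes: 'generic' points at a HEST sub-table. */
struct fake_ghes {
	void *generic;
};

/*
 * Fully parenthesized predicate: safe to negate and safe to pass an
 * expression (e.g. &some_ghes) as the argument.
 */
#define IS_HEST_TYPE_GENERIC_V2(ghes)					\
	(((struct fake_hest_header *)(ghes)->generic)->type ==		\
	 FAKE_HEST_TYPE_GENERIC_ERROR_V2)
```
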


>  /*
>   * This driver isn't really modular, however for the time being,
>   * continuing to use module_param is the easiest way to remain
> @@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
>  	if (!ghes)
>  		return ERR_PTR(-ENOMEM);
> +
>  	ghes->generic = generic;
> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = apei_map_generic_address(
> +			&ghes->generic_v2->read_ack_register);
> +		if (rc)
> +			goto err_free;
> +	}
> +
>  	rc = apei_map_generic_address(&generic->error_status_address);
>  	if (rc)
> -		goto err_free;
> +		goto err_unmap_read_ack_addr;
>  	error_block_length = generic->error_block_length;
>  	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
>  		pr_warning(FW_WARN GHES_PFX
> @@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
>  	if (!ghes->estatus) {
>  		rc = -ENOMEM;
> -		goto err_unmap;
> +		goto err_unmap_status_addr;
>  	}
>  
>  	return ghes;
>  
> -err_unmap:
> +err_unmap_status_addr:
>  	apei_unmap_generic_address(&generic->error_status_address);
> +err_unmap_read_ack_addr:
> +	if (HEST_TYPE_GENERIC_V2(ghes))
> +		apei_unmap_generic_address(
> +			&ghes->generic_v2->read_ack_register);
>  err_free:
>  	kfree(ghes);
>  	return ERR_PTR(rc);
> @@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
>  {
>  	kfree(ghes->estatus);
>  	apei_unmap_generic_address(&ghes->generic->error_status_address);
> +	if (HEST_TYPE_GENERIC_V2(ghes))
> +		apei_unmap_generic_address(
> +			&ghes->generic_v2->read_ack_register);
>  }
>  
>  static inline int ghes_severity(int severity)
> @@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
>  	rcu_read_unlock();
>  }
>  
> +static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
> +{
> +	int rc;
> +	u64 val = 0;
> +
> +	rc = apei_read(&val, &generic_v2->read_ack_register);
> +	if (rc)
> +		return rc;
> +	val &= generic_v2->read_ack_preserve <<
> +		generic_v2->read_ack_register.bit_offset;
> +	val |= generic_v2->read_ack_write <<
> +		generic_v2->read_ack_register.bit_offset;

Is this bit_offset shifting needed in case the read_ack_register is in the
'system io' (or embedded controller) address space and shares a register with
some other stuff?

The read_ack_{preserve,write} values are u64, so if bit_offset is non-zero the
high order bits get lost, but both ends of this are in the firmware's control.

(I assumed this thing would always be in memory and these fields would never be
used - but I guess that isn't true!)
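
To make the shifting concrete, here is a minimal sketch of the same
read-modify-write against a simulated register. The fake_read_ack struct and
fake_ack() helper are hypothetical stand-ins (the patch itself goes through
apei_read()/apei_write()), and the example values further down are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GHESv2 read-ack fields; not the ACPICA struct. */
struct fake_read_ack {
	uint64_t reg;        /* simulated register contents */
	uint8_t  bit_offset; /* offset of the ack field within the register */
	uint64_t preserve;   /* mask of bits to keep, before shifting */
	uint64_t write;      /* bits to set when acknowledging, before shifting */
};

/* Mirrors the mask-and-shift sequence in the quoted ghes_ack_error(). */
static uint64_t fake_ack(struct fake_read_ack *ack)
{
	uint64_t val = ack->reg;

	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	assert(ack->bit_offset < 64);

	/* Keep only the bits covered by the shifted preserve mask. */
	val &= ack->preserve << ack->bit_offset;
	/* Set the acknowledge bits at the same offset. */
	val |= ack->write << ack->bit_offset;

	ack->reg = val;
	return val;
}
```

With reg = 0xABF0, bit_offset = 4, preserve = 0xE and write = 0x1, the shifted
mask is 0xE0 and the result is 0xF0: every bit outside the shifted mask is
cleared, and any preserve or write bits shifted past bit 63 are dropped.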


> +	rc = apei_write(val, &generic_v2->read_ack_register);
> +
> +	return rc;
> +}
> +
>  static int ghes_proc(struct ghes *ghes)
>  {
>  	int rc;
> @@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
>  	ghes_do_proc(ghes, ghes->estatus);
> +
> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = ghes_ack_error(ghes->generic_v2);
> +		if (rc)
> +			return rc;
> +	}
>  out:
>  	ghes_clear_estatus(ghes);
>  	return 0;
> diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
> index 792a0d9..ef725a9 100644
> --- a/drivers/acpi/apei/hest.c
> +++ b/drivers/acpi/apei/hest.c
> @@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
>  	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
>  	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
>  	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
> +	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
>  };
>  
>  static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
> @@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
>  {
>  	int *count = data;
>  
> -	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
> +	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
> +	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>  		(*count)++;
>  	return 0;
>  }
> @@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
>  	struct ghes_arr *ghes_arr = data;
>  	int rc, i;
>  
> -	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
> +	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
> +	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>  		return 0;
>  
>  	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 720446c..68f088a 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -13,7 +13,10 @@
>  #define GHES_EXITING		0x0002
>  
>  struct ghes {
> -	struct acpi_hest_generic *generic;
> +	union {
> +		struct acpi_hest_generic *generic;
> +		struct acpi_hest_generic_v2 *generic_v2;
> +	};
>  	struct acpi_hest_generic_status *estatus;
>  	u64 buffer_paddr;
>  	unsigned long flags;
> 

Looks good to me, for what it's worth:
Reviewed-by: James Morse <james.morse@arm.com>


Thanks,

James

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption
@ 2016-11-25 18:19       ` James Morse
  0 siblings, 0 replies; 55+ messages in thread
From: James Morse @ 2016-11-25 18:19 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone,
	harba, hanjun.guo

Hi Tyler,

On 21/11/16 22:35, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple error records,
> then they may be overwritten before the OS has consumed them.
> 
> The Generic Hardware Error Source (GHES) v2 structure
> introduces the capability for the OS to acknowledge the
> consumption of the error record generated by the RAS
> controller. A RAS controller supporting GHESv2 shall wait for
> the acknowledgment before writing a new error record, thus
> eliminating the race condition.

This patch also adds support for parsing GHESv2 sub-tables.
Before they would be rejected as an unknown hardware error source.


> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>

Nit: the patch author's Sign-off should come first, you either need a 'From:
Jonathan (Zhixiong) Zhang ...' on this patch, or re-order these Signed-off-by's.


> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c |  7 +++++--
>  include/acpi/ghes.h      |  5 ++++-
>  3 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 60746ef..b79abc5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -45,6 +45,7 @@
>  #include <linux/aer.h>
>  #include <linux/nmi.h>
>  
> +#include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
>  #include <acpi/apei.h>
>  #include <asm/tlbflush.h>
> @@ -79,6 +80,10 @@
>  	((struct acpi_hest_generic_status *)				\
>  	 ((struct ghes_estatus_node *)(estatus_node) + 1))
>  
> +#define HEST_TYPE_GENERIC_V2(ghes)				\
> +	((struct acpi_hest_header *)ghes->generic)->type ==	\
> +	 ACPI_HEST_TYPE_GENERIC_ERROR_V2
> +

IS_ HEST_TYPE_GENERIC_V2() ? (for the sake of readability)


>  /*
>   * This driver isn't really modular, however for the time being,
>   * continuing to use module_param is the easiest way to remain
> @@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
>  	if (!ghes)
>  		return ERR_PTR(-ENOMEM);
> +
>  	ghes->generic = generic;
> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = apei_map_generic_address(
> +			&ghes->generic_v2->read_ack_register);
> +		if (rc)
> +			goto err_free;
> +	}
> +
>  	rc = apei_map_generic_address(&generic->error_status_address);
>  	if (rc)
> -		goto err_free;
> +		goto err_unmap_read_ack_addr;
>  	error_block_length = generic->error_block_length;
>  	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
>  		pr_warning(FW_WARN GHES_PFX
> @@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
>  	if (!ghes->estatus) {
>  		rc = -ENOMEM;
> -		goto err_unmap;
> +		goto err_unmap_status_addr;
>  	}
>  
>  	return ghes;
>  
> -err_unmap:
> +err_unmap_status_addr:
>  	apei_unmap_generic_address(&generic->error_status_address);
> +err_unmap_read_ack_addr:
> +	if (HEST_TYPE_GENERIC_V2(ghes))
> +		apei_unmap_generic_address(
> +			&ghes->generic_v2->read_ack_register);
>  err_free:
>  	kfree(ghes);
>  	return ERR_PTR(rc);
> @@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
>  {
>  	kfree(ghes->estatus);
>  	apei_unmap_generic_address(&ghes->generic->error_status_address);
> +	if (HEST_TYPE_GENERIC_V2(ghes))
> +		apei_unmap_generic_address(
> +			&ghes->generic_v2->read_ack_register);
>  }
>  
>  static inline int ghes_severity(int severity)
> @@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
>  	rcu_read_unlock();
>  }
>  
> +static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
> +{
> +	int rc;
> +	u64 val = 0;
> +
> +	rc = apei_read(&val, &generic_v2->read_ack_register);
> +	if (rc)
> +		return rc;
> +	val &= generic_v2->read_ack_preserve <<
> +		generic_v2->read_ack_register.bit_offset;
> +	val |= generic_v2->read_ack_write <<
> +		generic_v2->read_ack_register.bit_offset;

Is this bit_offset shifting needed in case the read_ack_register is in the
'system io' (or embedded controller) address space and shares a register with
some other stuff?

The read_ack_{preserve,write} values are u64, so if bit_offset is non-zero the
high order bits get lost, but both ends of this are in the firmware's control.

(I assumed this thing would always be in memory and these fields would never be
used - but I guess that isn't true!)


> +	rc = apei_write(val, &generic_v2->read_ack_register);
> +
> +	return rc;
> +}
> +
>  static int ghes_proc(struct ghes *ghes)
>  {
>  	int rc;
> @@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
>  	ghes_do_proc(ghes, ghes->estatus);
> +
> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = ghes_ack_error(ghes->generic_v2);
> +		if (rc)
> +			return rc;
> +	}
>  out:
>  	ghes_clear_estatus(ghes);
>  	return 0;
> diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
> index 792a0d9..ef725a9 100644
> --- a/drivers/acpi/apei/hest.c
> +++ b/drivers/acpi/apei/hest.c
> @@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
>  	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
>  	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
>  	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
> +	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
>  };
>  
>  static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
> @@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
>  {
>  	int *count = data;
>  
> -	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
> +	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
> +	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>  		(*count)++;
>  	return 0;
>  }
> @@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
>  	struct ghes_arr *ghes_arr = data;
>  	int rc, i;
>  
> -	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
> +	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
> +	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>  		return 0;
>  
>  	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 720446c..68f088a 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -13,7 +13,10 @@
>  #define GHES_EXITING		0x0002
>  
>  struct ghes {
> -	struct acpi_hest_generic *generic;
> +	union {
> +		struct acpi_hest_generic *generic;
> +		struct acpi_hest_generic_v2 *generic_v2;
> +	};
>  	struct acpi_hest_generic_status *estatus;
>  	u64 buffer_paddr;
>  	unsigned long flags;
> 

Looks good to me, for what it's worth:
Reviewed-by: James Morse <james.morse@arm.com>


Thanks,

James

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-21 22:35   ` Tyler Baicar
  (?)
@ 2016-11-25 18:20     ` James Morse
  -1 siblings, 0 replies; 55+ messages in thread
From: James Morse @ 2016-11-25 18:20 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: linux-efi, kvm, matt, catalin.marinas, will.deacon, robert.moore,
	paul.gortmaker, lv.zheng, kvmarm, fu.wei, zjzhang, linux,
	linux-acpi, eun.taik.lee, shijie.huang, lenb, harba,
	marc.zyngier, punit.agrawal, tomasz.nowicki, nkaje, rostedt,
	sandeepa.s.prabhu, linux-arm-kernel, rjw, rruigrok, linux-kernel,
	astone, hanjun.guo, pbonzini, akpm, bristot

Hi Tyler,

On 21/11/16 22:35, Tyler Baicar wrote:
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error
> data entry v3 structure. The timestamp of when the firmware
> generated the error is now being reported.
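
For reference, the timestamp is decoded further down in this patch as eight
packed-BCD bytes. A minimal user-space sketch of that decoding follows; the
byte layout is taken from the quoted cper_estatus_print_section_v300(), while
bcd_to_bin and format_timestamp are illustrative names, not kernel functions,
and the output format is only an approximation of the patch's printk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Convert one packed-BCD byte (e.g. 0x16) to its binary value (16). */
static unsigned int bcd_to_bin(uint8_t v)
{
	return (v & 0x0f) + (v >> 4) * 10;
}

/*
 * Format an 8-byte timestamp using the byte layout the patch reads:
 * [0]=sec [1]=min [2]=hour [3]=flags (bit 0: precise)
 * [4]=day [5]=month [6]=year [7]=century
 * Returns the number of characters snprintf() would have written.
 */
static int format_timestamp(const uint8_t ts[8], char *buf, size_t len)
{
	return snprintf(buf, len, "%02u%02u-%02u-%02u %02u:%02u:%02u%s",
			bcd_to_bin(ts[7]), bcd_to_bin(ts[6]),
			bcd_to_bin(ts[5]), bcd_to_bin(ts[4]),
			bcd_to_bin(ts[2]), bcd_to_bin(ts[1]),
			bcd_to_bin(ts[0]),
			(ts[3] & 0x01) ? " (precise)" : "");
}
```

For example, the bytes 35 22 22 01 21 11 16 20 (hex) would format as
"2016-11-21 22:22:35 (precise)".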

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
>  
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;

ghes.c doesn't include <linux/uuid.h>, but I see it already uses uuid_le_cmp().
Worth fixing as part of this patch?


>  
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +

You don't use sec_type again here, so why change this?
(Should it be in a later patch?)


> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
>  
>  			arch_apei_report_mem_error(sev, mem_err);
> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
>  
>  #define INDENT_SP	" "
>  
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  }
>  
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata)
> +{
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
>  
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
>  
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old))
> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata);
> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)
>  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>  {
>  }
>  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
> +{
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
>  
>  #define CPER_PCIE_SLOT_SHIFT			3
>  

> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +

How come these aren't in ghes.h?
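
For anyone following along, a rough user-space sketch of what these helpers do
together: pick the header size from the revision field, then step over header
plus payload. The struct hdr layout and the 8-byte v3 extension below are
simplified assumptions, not the real ACPI structures, and count_records is a
made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the common part of a generic data entry header. */
struct hdr {
	uint16_t revision;		/* major version in the high byte */
	uint16_t pad;
	uint32_t error_data_length;	/* payload bytes following the header */
};

#define HDR_V2_SIZE sizeof(struct hdr)
#define HDR_V3_SIZE (sizeof(struct hdr) + 8)	/* assumed: v3 adds 8 bytes */

static size_t hdr_size(const struct hdr *h)
{
	return (h->revision >> 8) >= 3 ? HDR_V3_SIZE : HDR_V2_SIZE;
}

/* Count the records in buf; return -1 if any record overruns the buffer. */
static int count_records(const uint8_t *buf, size_t len)
{
	int n = 0;

	while (len > 0) {
		struct hdr h;
		size_t hs;

		if (len < sizeof(h))
			return -1;
		memcpy(&h, buf, sizeof(h));	/* avoid unaligned access */
		hs = hdr_size(&h);
		if (len < hs || h.error_data_length > len - hs)
			return -1;
		/* Shrink the remaining length each step, or the loop never ends. */
		buf += hs + h.error_data_length;
		len -= hs + h.error_data_length;
		n++;
	}
	return n;
}
```

The point of the version check is that a walker hard-coding the old header
size would land in the middle of a v3 header when looking for the payload.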



Reviewed-by: James Morse <james.morse@arm.com>


Thanks,

James

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
@ 2016-11-25 18:20     ` James Morse
  0 siblings, 0 replies; 55+ messages in thread
From: James Morse @ 2016-11-25 18:20 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone,
	harba, hanjun.guo

Hi Tyler,

On 21/11/16 22:35, Tyler Baicar wrote:
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error
> data entry v3 structure. The timestamp of when the firmware
> generated the error is now being reported.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
>  
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;

ghes.c doesn't include <linux/uuid.h>, but I see it already uses uuid_le_cmp().
Worth fixing as part of this patch?


>  
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +

You don't use sec_type again here, so why change this?
(Should it be in a later patch?)


> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
>  
>  			arch_apei_report_mem_error(sev, mem_err);
> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
>  
>  #define INDENT_SP	" "
>  
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  }
>  
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata)
> +{
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
>  
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
>  
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old))
> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata);
> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)
>  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>  {
>  }
>  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
> +{
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
>  
>  #define CPER_PCIE_SLOT_SHIFT			3
>  

> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +

How come these aren't in ghes.h?
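
For reference, a minimal user-space sketch of how a walk over entries of mixed revisions could look with helpers of this kind. The struct layouts are simplified stand-ins for the ACPICA definitions, and the mock_ and gdata_ names and the walk() helper are assumptions made up for illustration, not kernel code. The sketch subtracts each record's size from the remaining length on every step, so the loop ends when the buffer is used up:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the ACPICA structures (assumed layout). */
#pragma pack(push, 1)
struct mock_gdata {
	uint8_t  section_type[16];
	uint32_t error_severity;
	uint16_t revision;		/* major version in the high byte */
	uint8_t  validation_bits;
	uint8_t  flags;
	uint32_t error_data_length;	/* payload bytes after the header */
	uint8_t  fru_id[16];
	uint8_t  fru_text[20];
};

struct mock_gdata_v300 {
	struct mock_gdata hdr;
	uint64_t time_stamp;		/* only present from revision 3 on */
};
#pragma pack(pop)

static unsigned int gdata_version(const struct mock_gdata *g)
{
	return g->revision >> 8;
}

/* Header size depends on the revision of the entry. */
static size_t gdata_header_size(const struct mock_gdata *g)
{
	return gdata_version(g) >= 3 ? sizeof(struct mock_gdata_v300)
				     : sizeof(struct mock_gdata);
}

static size_t gdata_record_size(const struct mock_gdata *g)
{
	return gdata_header_size(g) + g->error_data_length;
}

/* Count the entries in buf; return -1 if a record is truncated. */
static int walk(const uint8_t *buf, size_t len)
{
	int n = 0;

	while (len > 0) {
		struct mock_gdata g;
		size_t rec;

		if (len < sizeof(g))
			return -1;
		/* Copy the common header out to avoid unaligned access. */
		memcpy(&g, buf, sizeof(g));
		rec = gdata_record_size(&g);
		if (rec > len)
			return -1;
		buf += rec;
		len -= rec;
		n++;
	}
	return n;
}
```

The revision field sits in the part of the header that both layouts share, so it can be read before the full header size is known.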



Reviewed-by: James Morse <james.morse@arm.com>


Thanks,

James

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 03/10] efi: parse ARMv8 processor error
  2016-11-21 22:35   ` Tyler Baicar
  (?)
@ 2016-11-25 18:23     ` James Morse
  -1 siblings, 0 replies; 55+ messages in thread
From: James Morse @ 2016-11-25 18:23 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: linux-efi, kvm, matt, catalin.marinas, will.deacon, robert.moore,
	paul.gortmaker, lv.zheng, kvmarm, fu.wei, zjzhang, linux,
	linux-acpi, eun.taik.lee, shijie.huang, lenb, harba,
	marc.zyngier, punit.agrawal, tomasz.nowicki, nkaje, rostedt,
	sandeepa.s.prabhu, linux-arm-kernel, rjw, rruigrok, linux-kernel,
	astone, hanjun.guo, pbonzini, akpm, bristot

Hi Tyler,

On 21/11/16 22:35, Tyler Baicar wrote:
> Add support for ARMv8 Common Platform Error Record (CPER).
> UEFI 2.6 specification adds support for ARMv8 specific
> processor error information to be reported as part of the
> CPER records. This provides more detail for processor error logs.

I think I'm missing a big part of the puzzle here, I will come back to this next
week. I can't quite line up some of the masks and shifts with the table
descriptions in the UEFI spec[0].


> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index 13ea41c..2a9d553 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h

> @@ -180,6 +185,10 @@ enum {
>  #define CPER_SEC_PROC_IPF						\
>  	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
>  		0x80, 0xC7, 0x3C, 0x88, 0x81)
> +/* Processor Specific: ARMv8 */
> +#define CPER_SEC_PROC_ARMV8						\
> +	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
> +		0x1D, 0x5D, 0x46, 0xB0)

Nit: UEFI v2.6 N.2.2 (table 249) describes this as 'ARM' not 'ARMV8' (which is
an architectural version).


>  /* Platform Memory */
>  #define CPER_SEC_PLATFORM_MEM						\
>  	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
> @@ -255,6 +264,34 @@ enum {
>  
>  #define CPER_PCIE_SLOT_SHIFT			3
>  
> +#define CPER_ARMV8_ERR_INFO_NUM_MASK		0x00000000000000FF
> +#define CPER_ARMV8_CTX_INFO_NUM_MASK		0x0000000000FFFF00

Table 260 describes ERR_INFO_NUM and CONTEXT_INFO_NUM as both being
2 bytes long, as does your struct cper_sec_proc_armv8 below. Are these for
something else? Do these correspond with one of the four bitfield formats
described in Table 262->265?

I can't see where they are used, and they look like they are reaching across
multiple fields in a struct.
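
To illustrate the mismatch: with the two counts declared as separate 16-bit fields, as in struct cper_sec_proc_armv8 below, they can be read directly and no mask or shift is needed. A minimal sketch, using a cut-down mock of the section header (the mock_ and ctx_num_ names are assumptions), of what happens if the proposed context mask and shift were applied to the 32 bits that hold both counts:

```c
#include <stdint.h>

/* Values copied from the proposed defines. */
#define CTX_INFO_NUM_MASK	0x0000000000FFFF00ULL
#define CTX_INFO_NUM_SHIFT	8

/* Cut-down mock of the start of the section (assumed for illustration). */
struct mock_sec_proc_arm {
	uint32_t validation_bits;
	uint16_t err_info_num;
	uint16_t context_info_num;
};

/* Reading the field directly. */
static uint16_t ctx_num_by_field(const struct mock_sec_proc_arm *s)
{
	return s->context_info_num;
}

/*
 * Applying the mask and shift to the little-endian 32-bit word made of
 * err_info_num (low half) and context_info_num (high half).
 */
static uint32_t ctx_num_by_mask(const struct mock_sec_proc_arm *s)
{
	uint32_t both = s->err_info_num |
			((uint32_t)s->context_info_num << 16);

	return (uint32_t)((both & CTX_INFO_NUM_MASK) >> CTX_INFO_NUM_SHIFT);
}
```

With err_info_num = 1 and context_info_num = 2 the field read gives 2, while the mask-and-shift version gives 0x200, because the mask covers the top byte of one field and the bottom byte of the next.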


> +#define CPER_ARMV8_CTX_INFO_NUM_SHIFT		8
> +
> +#define CPER_ARMV8_VALID_MPIDR			0x00000001
> +#define CPER_ARMV8_VALID_AFFINITY_LEVEL		0x00000002
> +#define CPER_ARMV8_VALID_RUNNING_STATE		0x00000004
> +#define CPER_ARMV8_VALID_VENDOR_INFO		0x00000008
> +
> +#define CPER_ARMV8_INFO_VALID_MULTI_ERR		0x0001
> +#define CPER_ARMV8_INFO_VALID_FLAGS		0x0002
> +#define CPER_ARMV8_INFO_VALID_ERR_INFO		0x0004
> +#define CPER_ARMV8_INFO_VALID_VIRT_ADDR		0x0008
> +#define CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR	0x0010
> +
> +#define CPER_ARMV8_INFO_FLAGS_FIRST		0x0001
> +#define CPER_ARMV8_INFO_FLAGS_LAST		0x0002
> +#define CPER_ARMV8_INFO_FLAGS_PROPAGATED	0x0004
> +
> +#define CPER_AARCH64_CTX_LEN			368
> +#define CPER_AARCH32_CTX_LEN			256

Are these the worst case sizes for combinations of the structures in N2.4.4.2?
(Tables 266 to 273)

If so is there any chance they could be sizeof(<some union of structs>), even if
the structs are things like:
> /* ARMv8 AArch64 GPRs (Type 4) - defined in UEFI Spec N2.4.4.2 */
> struct cper_armv8_aarch64_gprs {
> 	u64 regs[32];
> };

This way it's easier to check the number is correct, and if a new type is added
this won't get forgotten.
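
A rough sketch of that idea, using only the GPR layout above plus one made-up placeholder type. The example_ names and the placeholder's size are assumptions for illustration and are not taken from the spec:

```c
#include <stdint.h>
#include <stddef.h>

/* AArch64 GPR context as in the example above: 32 64-bit registers. */
struct example_aarch64_gprs {
	uint64_t regs[32];
};

/* Placeholder for some other register-context type; its size is made up. */
struct example_other_ctx {
	uint64_t regs[4];
};

/* The union is as large as its largest member. */
union example_ctx_regs {
	struct example_aarch64_gprs gprs;
	struct example_other_ctx other;
};

/* Derived length: grows by itself when a larger type joins the union. */
#define EXAMPLE_CTX_LEN	sizeof(union example_ctx_regs)

/* Build-time check: the array size is negative if the layout is wrong. */
typedef char example_gprs_size_check[
	sizeof(struct example_aarch64_gprs) == 256 ? 1 : -1];
```

The length then follows from the struct definitions instead of being a number that has to be kept in sync by hand.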


> +#define CPER_ARMV8_CTX_TYPE_MASK		0x000000000000000F
> +#define CPER_ARMV8_CTX_EL_MASK			0x0000000000000070
> +#define CPER_ARMV8_CTX_NS_MASK			0x0000000000000080
> +#define CPER_ARMV8_CTX_EL_SHIFT			4
> +#define CPER_ARMV8_CTX_NS_SHIFT			7
> +

Again, I can't work out what these correspond to. I can't see a secure bit or EL
field in any of those UEFI tables.

Is this one of the 'ARM Vendor Specific Micro-Architecture Error Structure's? If
so we should have some infrastructure for picking the correct (or unknown)
decode function based on a range of MIDRs.
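
Such infrastructure could be as small as a table of MIDR ranges with a fallback. The following is only a sketch: every example_ name and the MIDR range are made up, and comparing MIDRs as plain numbers is a simplification of matching their individual fields:

```c
#include <stdint.h>
#include <stddef.h>

typedef const char *(*example_decode_fn)(const uint8_t *data, size_t len);

/* Fallback used when no table entry matches. */
static const char *example_decode_unknown(const uint8_t *data, size_t len)
{
	(void)data;
	(void)len;
	return "unknown";
}

/* Stand-in for a vendor specific decoder. */
static const char *example_decode_vendor_a(const uint8_t *data, size_t len)
{
	(void)data;
	(void)len;
	return "vendor-a";
}

struct example_midr_range {
	uint64_t midr_min;
	uint64_t midr_max;
	example_decode_fn decode;
};

static const struct example_midr_range example_decoders[] = {
	/* Made-up range, not a real part number. */
	{ 0x10000000u, 0x1000ffffu, example_decode_vendor_a },
};

/* Return the decoder for this MIDR, or the 'unknown' fallback. */
static example_decode_fn example_pick_decoder(uint64_t midr)
{
	size_t i;
	size_t n = sizeof(example_decoders) / sizeof(example_decoders[0]);

	for (i = 0; i < n; i++) {
		if (midr >= example_decoders[i].midr_min &&
		    midr <= example_decoders[i].midr_max)
			return example_decoders[i].decode;
	}
	return example_decode_unknown;
}
```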


>  #define acpi_hest_generic_data_error_length(gdata)	\
>  	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
>  #define acpi_hest_generic_data_size(gdata)		\
> @@ -352,6 +389,41 @@ struct cper_ia_proc_ctx {
>  	__u64	mm_reg_addr;
>  };
>  
> +/* ARMv8 Processor Error Section */
> +struct cper_sec_proc_armv8 {
> +	__u32	validation_bits;
> +	__u16	err_info_num; /* Number of Processor Error Info */
> +	__u16	context_info_num; /* Number of Processor Context Info Records*/
> +	__u32	section_length;
> +	__u8	affinity_level;
> +	__u8	reserved[3];	/* must be zero */
> +	__u64	mpidr;
> +	__u64	midr;
> +	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
> +	__u32	psci_state;
> +};
> +
> +/* ARMv8 Processor Error Information Structure */
> +struct cper_armv8_err_info {
> +	__u8	version;
> +	__u8	length;
> +	__u16	validation_bits;
> +	__u8	type;
> +	__u16	multiple_error;
> +	__u8	flags;
> +	__u64	error_info;
> +	__u64	virt_fault_addr;
> +	__u64	physical_fault_addr;
> +};


> +/* ARMv8 AARCH64 Processor Context Information Structure */
> +struct cper_armv8_aarch64_ctx {
> +	__u8	type_el_ns;
> +	__u8	reserved[7];	/* must be zero */
> +	__u8	gpr[288];
> +	__u8	spr[68];
> +};

Is this:
"Table 265. ARM Processor Error Context Information Header Structure"?


Sorry if I've missed something blindingly obvious!


Thanks,

James

[0] http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption
  2016-11-25 18:19       ` James Morse
                         ` (2 preceding siblings ...)
  (?)
@ 2016-11-28 18:34       ` Baicar, Tyler
  -1 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-28 18:34 UTC (permalink / raw)
  To: James Morse
  Cc: linux-efi, kvm, matt, catalin.marinas, will.deacon, robert.moore,
	paul.gortmaker, lv.zheng, kvmarm, fu.wei, zjzhang, linux,
	linux-acpi, eun.taik.lee, shijie.huang, lenb, harba,
	marc.zyngier, punit.agrawal, tomasz.nowicki, nkaje, rostedt,
	sandeepa.s.prabhu, linux-arm-kernel, rjw, rruigrok, linux-kernel,
	astone, hanjun.guo, pbonzini, akpm, bristot


Hello James,

Thank you for your feedback!

On 11/25/2016 11:19 AM, James Morse wrote:
> Hi Tyler,
>
> On 21/11/16 22:35, Tyler Baicar wrote:
>> A RAS (Reliability, Availability, Serviceability) controller
>> may be a separate processor running in parallel with OS
>> execution, and may generate error records for consumption by
>> the OS. If the RAS controller produces multiple error records,
>> then they may be overwritten before the OS has consumed them.
>>
>> The Generic Hardware Error Source (GHES) v2 structure
>> introduces the capability for the OS to acknowledge the
>> consumption of the error record generated by the RAS
>> controller. A RAS controller supporting GHESv2 shall wait for
>> the acknowledgment before writing a new error record, thus
>> eliminating the race condition.
> This patch also adds support for parsing GHESv2 sub-tables.
> Before they would be rejected as an unknown hardware error source.
Yes, I will add that to the text.
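
As a rough illustration of the acknowledgment step described above, the sketch below computes the value the OS would write back after consuming a record. It assumes the GHESv2 entry supplies a preserve mask and a write mask for the read-ack register; the example_ names are made up, and the register is modelled as a plain integer rather than a mapped address:

```c
#include <stdint.h>

/* Hypothetical stand-in for the read-ack description of a GHESv2 source. */
struct example_read_ack {
	uint64_t preserve;	/* register bits that must be kept as read */
	uint64_t write;		/* bits to set to signal the acknowledgment */
	uint8_t  bit_offset;	/* position of the field in the register */
};

/* Read-modify-write: keep the preserved bits, then set the ack bits. */
static uint64_t example_ack_value(uint64_t reg,
				  const struct example_read_ack *ack)
{
	reg &= ack->preserve << ack->bit_offset;
	reg |= ack->write << ack->bit_offset;
	return reg;
}
```

Firmware that waits for this write before producing the next record cannot overwrite one the OS is still reading.
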
>> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Nit: the patch author's Sign-off should come first, you either need a 'From:
> Jonathan (Zhixiong) Zhang ...' on this patch, or re-order these Signed-off-by's.
I'll reorder them in the next set.
>> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
>> ---
>>   drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
>>   drivers/acpi/apei/hest.c |  7 +++++--
>>   include/acpi/ghes.h      |  5 ++++-
>>   3 files changed, 55 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 60746ef..b79abc5 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -45,6 +45,7 @@
>>   #include <linux/aer.h>
>>   #include <linux/nmi.h>
>>   
>> +#include <acpi/actbl1.h>
>>   #include <acpi/ghes.h>
>>   #include <acpi/apei.h>
>>   #include <asm/tlbflush.h>
>> @@ -79,6 +80,10 @@
>>   	((struct acpi_hest_generic_status *)				\
>>   	 ((struct ghes_estatus_node *)(estatus_node) + 1))
>>   
>> +#define HEST_TYPE_GENERIC_V2(ghes)				\
>> +	((struct acpi_hest_header *)ghes->generic)->type ==	\
>> +	 ACPI_HEST_TYPE_GENERIC_ERROR_V2
>> +
> IS_HEST_TYPE_GENERIC_V2()? (for the sake of readability)
>
Will do.
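
A possible shape for the renamed macro, with the argument and the
whole expansion parenthesised so that it can be negated or combined
safely. This is only a sketch: the structures and the type value
below are simplified stand-ins, and needs_ack()/skips_ack() are
hypothetical helpers used for illustration.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the ACPI table definitions. */
#define HEST_TYPE_GENERIC_ERROR_V2 10

struct hest_header { unsigned short type; };
struct ghes_sketch { void *generic; };

/*
 * Parenthesise both the argument and the full expression so the macro
 * behaves like a single boolean value wherever it is expanded.
 */
#define IS_HEST_TYPE_GENERIC_V2(ghes)				\
	(((struct hest_header *)(ghes)->generic)->type ==	\
	 HEST_TYPE_GENERIC_ERROR_V2)

static bool needs_ack(struct ghes_sketch *ghes)
{
	return IS_HEST_TYPE_GENERIC_V2(ghes);
}

static bool skips_ack(struct ghes_sketch *ghes)
{
	/*
	 * Without the outer parentheses this would expand to
	 * (!...->type) == HEST_TYPE_GENERIC_ERROR_V2, which is never true.
	 */
	return !IS_HEST_TYPE_GENERIC_V2(ghes);
}
```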
>>   /*
>>    * This driver isn't really modular, however for the time being,
>>    * continuing to use module_param is the easiest way to remain
>> @@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>>   	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
>>   	if (!ghes)
>>   		return ERR_PTR(-ENOMEM);
>> +
>>   	ghes->generic = generic;
>> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
>> +		rc = apei_map_generic_address(
>> +			&ghes->generic_v2->read_ack_register);
>> +		if (rc)
>> +			goto err_free;
>> +	}
>> +
>>   	rc = apei_map_generic_address(&generic->error_status_address);
>>   	if (rc)
>> -		goto err_free;
>> +		goto err_unmap_read_ack_addr;
>>   	error_block_length = generic->error_block_length;
>>   	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
>>   		pr_warning(FW_WARN GHES_PFX
>> @@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>>   	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
>>   	if (!ghes->estatus) {
>>   		rc = -ENOMEM;
>> -		goto err_unmap;
>> +		goto err_unmap_status_addr;
>>   	}
>>   
>>   	return ghes;
>>   
>> -err_unmap:
>> +err_unmap_status_addr:
>>   	apei_unmap_generic_address(&generic->error_status_address);
>> +err_unmap_read_ack_addr:
>> +	if (HEST_TYPE_GENERIC_V2(ghes))
>> +		apei_unmap_generic_address(
>> +			&ghes->generic_v2->read_ack_register);
>>   err_free:
>>   	kfree(ghes);
>>   	return ERR_PTR(rc);
>> @@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
>>   {
>>   	kfree(ghes->estatus);
>>   	apei_unmap_generic_address(&ghes->generic->error_status_address);
>> +	if (HEST_TYPE_GENERIC_V2(ghes))
>> +		apei_unmap_generic_address(
>> +			&ghes->generic_v2->read_ack_register);
>>   }
>>   
>>   static inline int ghes_severity(int severity)
>> @@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
>>   	rcu_read_unlock();
>>   }
>>   
>> +static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>> +{
>> +	int rc;
>> +	u64 val = 0;
>> +
>> +	rc = apei_read(&val, &generic_v2->read_ack_register);
>> +	if (rc)
>> +		return rc;
>> +	val &= generic_v2->read_ack_preserve <<
>> +		generic_v2->read_ack_register.bit_offset;
>> +	val |= generic_v2->read_ack_write <<
>> +		generic_v2->read_ack_register.bit_offset;
> Is this bit_offset shifting needed in case the read_ack_register is in the
> 'system io' (or embedded controller) address space and shares a register with
> some other stuff?
>
> The read_ack_{preserve,write} values are u64, so if bit_offset is non-zero the
> high order bits get lost, but both ends of this are in the firmware's control.
>
> (I assumed this thing would always be in memory and these fields would never be
> used - but I guess that isn't true!)
>
Yeah, we are not using these values, but they are defined this way in
the ACPI 6.1 spec (Table 18-344). read_ack_register is defined as a
Generic Address Structure which has this offset defined in Table 5-26.

I assume it is defined this way for shared registers as you mentioned
though. With this flexibility the firmware is able to specify exactly
what to write.
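
To make the read-modify-write concrete, here is a small standalone
sketch of how the value written back is derived from the preserve and
write masks. The struct and function names are hypothetical and the
register access itself is left out; only the mask arithmetic from the
quoted hunk is reproduced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the GHESv2 acknowledgment fields. */
struct read_ack_sketch {
	uint8_t  bit_offset;	/* from the Generic Address Structure */
	uint64_t preserve;	/* bits to keep when acknowledging */
	uint64_t write;		/* bits to set when acknowledging */
};

/* Compute the value to write back, given the register's current contents. */
static uint64_t ghes_ack_value(uint64_t current,
			       const struct read_ack_sketch *ack)
{
	uint64_t val = current;

	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	assert(ack->bit_offset < 64);

	val &= ack->preserve << ack->bit_offset;
	val |= ack->write << ack->bit_offset;
	return val;
}
```

With bit_offset = 4, preserve = 0xE and write = 0x1, a register
holding 0xF5 is written back as 0xF0. In this form any register bits
below bit_offset are cleared by the shifted mask, and mask bits
shifted past bit 63 are dropped, as James notes above.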
>> +	rc = apei_write(val, &generic_v2->read_ack_register);
>> +
>> +	return rc;
>> +}
>> +
>>   static int ghes_proc(struct ghes *ghes)
>>   {
>>   	int rc;
>> @@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
>>   			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>   	}
>>   	ghes_do_proc(ghes, ghes->estatus);
>> +
>> +	if (HEST_TYPE_GENERIC_V2(ghes)) {
>> +		rc = ghes_ack_error(ghes->generic_v2);
>> +		if (rc)
>> +			return rc;
>> +	}
>>   out:
>>   	ghes_clear_estatus(ghes);
>>   	return 0;
>> diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
>> index 792a0d9..ef725a9 100644
>> --- a/drivers/acpi/apei/hest.c
>> +++ b/drivers/acpi/apei/hest.c
>> @@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
>>   	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
>>   	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
>>   	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
>> +	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
>>   };
>>   
>>   static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
>> @@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
>>   {
>>   	int *count = data;
>>   
>> -	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
>> +	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
>> +	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>>   		(*count)++;
>>   	return 0;
>>   }
>> @@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
>>   	struct ghes_arr *ghes_arr = data;
>>   	int rc, i;
>>   
>> -	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
>> +	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
>> +	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
>>   		return 0;
>>   
>>   	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 720446c..68f088a 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -13,7 +13,10 @@
>>   #define GHES_EXITING		0x0002
>>   
>>   struct ghes {
>> -	struct acpi_hest_generic *generic;
>> +	union {
>> +		struct acpi_hest_generic *generic;
>> +		struct acpi_hest_generic_v2 *generic_v2;
>> +	};
>>   	struct acpi_hest_generic_status *estatus;
>>   	u64 buffer_paddr;
>>   	unsigned long flags;
>>
> Looks good to me, for what it's worth:
> Reviewed-by: James Morse <james.morse@arm.com>
Thanks!
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-25 18:20     ` James Morse
  (?)
  (?)
@ 2016-11-28 18:55       ` Baicar, Tyler
  -1 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-28 18:55 UTC (permalink / raw)
  To: James Morse
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone

Hello James,

On 11/25/2016 11:20 AM, James Morse wrote:
> Hi Tyler,
>
> On 21/11/16 22:35, Tyler Baicar wrote:
>> Currently when a RAS error is reported it is not timestamped.
>> The ACPI 6.1 spec adds the timestamp field to the generic error
>> data entry v3 structure. The timestamp of when the firmware
>> generated the error is now being reported.
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b79abc5..9063d68 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>>   	int flags = -1;
>>   	int sec_sev = ghes_severity(gdata->error_severity);
>>   	struct cper_sec_mem_err *mem_err;
>> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>> +
>> +	mem_err = acpi_hest_generic_data_payload(gdata);
>>   
>>   	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>   		return;
>> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>>   {
>>   	int sev, sec_sev;
>>   	struct acpi_hest_generic_data *gdata;
>> +	uuid_le sec_type;
> ghes.c doesn't include <linux/uuid.h>, but I see it already uses uuid_le_cmp().
> Worth fixing as part of this patch?

I can add it here, but it shouldn't be needed. ghes.c includes
<linux/cper.h> and that header includes <linux/uuid.h>. Should it be
added just to make the dependency more clear?

>>   
>>   	sev = ghes_severity(estatus->error_severity);
>>   	apei_estatus_for_each_section(estatus, gdata) {
>>   		sec_sev = ghes_severity(gdata->error_severity);
>> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>> +		sec_type = *(uuid_le *)gdata->section_type;
>> +
> You don't use sec_type again here, why change this?
> (should it be in a later patch?)

Ah, yes, this change should be moved to patch 8 in this patchset.

>> +		if (!uuid_le_cmp(sec_type,
>>   				 CPER_SEC_PLATFORM_MEM)) {
>>   			struct cper_sec_mem_err *mem_err;
>> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
>> +
>> +			mem_err = acpi_hest_generic_data_payload(gdata);
>>   			ghes_edac_report_mem_error(ghes, sev, mem_err);
>>   
>>   			arch_apei_report_mem_error(sev, mem_err);
>> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>>   		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>>   				      CPER_SEC_PCIE)) {
>>   			struct cper_sec_pcie *pcie_err;
>> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
>> +
>> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>>   			if (sev == GHES_SEV_RECOVERABLE &&
>>   			    sec_sev == GHES_SEV_RECOVERABLE &&
>>   			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index d425374..7e2439e 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -32,6 +32,9 @@
>>   #include <linux/acpi.h>
>>   #include <linux/pci.h>
>>   #include <linux/aer.h>
>> +#include <linux/printk.h>
>> +#include <linux/bcd.h>
>> +#include <acpi/ghes.h>
>>   
>>   #define INDENT_SP	" "
>>   
>> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>>   	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>>   }
>>   
>> +static void cper_estatus_print_section_v300(const char *pfx,
>> +	const struct acpi_hest_generic_data_v300 *gdata)
>> +{
>> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
>> +
>> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
>> +		timestamp = (__u8 *)&(gdata->time_stamp);
>> +		sec = bcd2bin(timestamp[0]);
>> +		min = bcd2bin(timestamp[1]);
>> +		hour = bcd2bin(timestamp[2]);
>> +		day = bcd2bin(timestamp[4]);
>> +		mon = bcd2bin(timestamp[5]);
>> +		year = bcd2bin(timestamp[6]);
>> +		century = bcd2bin(timestamp[7]);
>> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
>> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
>> +			year, mon, day, hour, min, sec);
>> +	}
>> +}
>> +
>>   static void cper_estatus_print_section(
>> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
>> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>>   {
>>   	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>>   	__u16 severity;
>>   	char newpfx[64];
>>   
>> +	if (acpi_hest_generic_data_version(gdata) >= 3)
>> +		cper_estatus_print_section_v300(pfx,
>> +			(const struct acpi_hest_generic_data_v300 *)gdata);
>> +
>>   	severity = gdata->error_severity;
>>   	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>>   	       cper_severity_str(severity));
>> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
>>   
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>>   	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
>> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
>> +		struct cper_sec_proc_generic *proc_err;
>> +
>> +		proc_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: general processor error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*proc_err))
>>   			cper_print_proc_generic(newpfx, proc_err);
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
>> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
>> +		struct cper_sec_mem_err *mem_err;
>> +
>> +		mem_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: memory error\n", newpfx);
>>   		if (gdata->error_data_length >=
>>   		    sizeof(struct cper_sec_mem_err_old))
>> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
>> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
>> +		struct cper_sec_pcie *pcie;
>> +
>> +		pcie = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: PCIe error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*pcie))
>>   			cper_print_pcie(newpfx, pcie, gdata);
>> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>>   			const struct acpi_hest_generic_status *estatus)
>>   {
>>   	struct acpi_hest_generic_data *gdata;
>> -	unsigned int data_len, gedata_len;
>> +	unsigned int data_len;
>>   	int sec_no = 0;
>>   	char newpfx[64];
>>   	__u16 severity;
>> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>>   	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> +
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>>   		cper_estatus_print_section(newpfx, gdata, sec_no);
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   		sec_no++;
>>   	}
>>   }
>> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>>   		return rc;
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> -		if (gedata_len > data_len - sizeof(*gdata))
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
>> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>>   			return -EINVAL;
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   	}
>>   	if (data_len)
>>   		return -EINVAL;
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 68f088a..56b9679 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>>   {
>>   }
>>   #endif
>> +
>> +#define acpi_hest_generic_data_version(gdata)			\
>> +	(gdata->revision >> 8)
>> +
>> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
>> +{
>> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
>> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
>> +		gdata + 1;
>> +}
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index dcacb1a..13ea41c 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -255,6 +255,18 @@ enum {
>>   
>>   #define CPER_PCIE_SLOT_SHIFT			3
>>   
>> +#define acpi_hest_generic_data_error_length(gdata)	\
>> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
>> +#define acpi_hest_generic_data_size(gdata)		\
>> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
>> +	sizeof(struct acpi_hest_generic_data_v300) :	\
>> +	sizeof(struct acpi_hest_generic_data))
>> +#define acpi_hest_generic_data_record_size(gdata)	\
>> +	(acpi_hest_generic_data_size(gdata) +		\
>> +	acpi_hest_generic_data_error_length(gdata))
>> +#define acpi_hest_generic_data_next(gdata)		\
>> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
>> +
> How come these aren't in ghes.h?

It probably does make more sense to add these in ghes.h; I'll move them
there in the next set.
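
For anyone following along, the idea behind these helpers is that the
header size depends on the entry's revision, so stepping to the next
entry needs "header size + error_data_length" rather than a fixed
sizeof(). A standalone sketch of such a walk follows. The header
sizes, field offsets and all function names here are assumptions made
for illustration and do not match the real ACPI layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_SIZE_OLD	8u	/* hypothetical pre-v3 header size */
#define HDR_SIZE_V300	16u	/* hypothetical v3 header size (adds a timestamp) */

/* Assumed layout: u16 revision at offset 0, u32 payload length at offset 4. */
static unsigned int entry_version(const uint8_t *e)
{
	uint16_t rev;

	memcpy(&rev, e, sizeof(rev));
	return rev >> 8;		/* major version is the high byte */
}

static uint32_t entry_payload_len(const uint8_t *e)
{
	uint32_t len;

	memcpy(&len, e + 4, sizeof(len));
	return len;
}

static size_t entry_hdr_size(const uint8_t *e)
{
	return entry_version(e) >= 3 ? HDR_SIZE_V300 : HDR_SIZE_OLD;
}

/* Count well-formed entries in buf[0..len); return -1 if malformed. */
static int count_entries(const uint8_t *buf, size_t len)
{
	int n = 0;

	while (len > 0) {
		size_t hdr, rec;
		uint32_t payload;

		if (len < HDR_SIZE_OLD)
			return -1;	/* cannot even read the common fields */
		hdr = entry_hdr_size(buf);
		if (len < hdr)
			return -1;
		payload = entry_payload_len(buf);
		if (payload > len - hdr)
			return -1;	/* payload runs past the block */
		rec = hdr + payload;
		buf += rec;		/* step to the next entry ... */
		len -= rec;		/* ... and shrink what is left */
		n++;
	}
	return n;
}

/* Test helper: write an entry header at dst, return the full record size. */
static size_t put_entry(uint8_t *dst, uint16_t revision, uint32_t payload_len)
{
	memset(dst, 0, (revision >> 8) >= 3 ? HDR_SIZE_V300 : HDR_SIZE_OLD);
	memcpy(dst, &revision, sizeof(revision));
	memcpy(dst + 4, &payload_len, sizeof(payload_len));
	return entry_hdr_size(dst) + payload_len;
}
```

Note that the sketch shrinks the remaining length on every iteration.
A loop that only advances the pointer would never see the end of the
block.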

> Reviewed-by: James Morse <james.morse@arm.com>
>
Thanks!
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
@ 2016-11-28 18:55       ` Baicar, Tyler
  0 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-28 18:55 UTC (permalink / raw)
  To: James Morse
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone

Hello James,

On 11/25/2016 11:20 AM, James Morse wrote:
> Hi Tyler,
>
> On 21/11/16 22:35, Tyler Baicar wrote:
>> Currently when a RAS error is reported it is not timestamped.
>> The ACPI 6.1 spec adds the timestamp field to the generic error
>> data entry v3 structure. The timestamp of when the firmware
>> generated the error is now being reported.
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b79abc5..9063d68 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>>   	int flags = -1;
>>   	int sec_sev = ghes_severity(gdata->error_severity);
>>   	struct cper_sec_mem_err *mem_err;
>> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>> +
>> +	mem_err = acpi_hest_generic_data_payload(gdata);
>>   
>>   	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>   		return;
>> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>>   {
>>   	int sev, sec_sev;
>>   	struct acpi_hest_generic_data *gdata;
>> +	uuid_le sec_type;
> ghes.c doesn't include <linux/uuid.h>, but I see it already uses uuid_le_cmp().
> Worth fixing as part of this patch?

I can add it here, but it shouldn't be needed. ghes.c includes
<linux/cper.h> and that header includes <linux/uuid.h>. Should it be
added just to make the dependency more clear?

>>   
>>   	sev = ghes_severity(estatus->error_severity);
>>   	apei_estatus_for_each_section(estatus, gdata) {
>>   		sec_sev = ghes_severity(gdata->error_severity);
>> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>> +		sec_type = *(uuid_le *)gdata->section_type;
>> +
> You don't use sec_type again here, why change this?
> (should it be in a later patch?)

Ah, yes, this change should be moved to patch 8 in this patchset.

>> +		if (!uuid_le_cmp(sec_type,
>>   				 CPER_SEC_PLATFORM_MEM)) {
>>   			struct cper_sec_mem_err *mem_err;
>> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
>> +
>> +			mem_err = acpi_hest_generic_data_payload(gdata);
>>   			ghes_edac_report_mem_error(ghes, sev, mem_err);
>>   
>>   			arch_apei_report_mem_error(sev, mem_err);
>> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>>   		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>>   				      CPER_SEC_PCIE)) {
>>   			struct cper_sec_pcie *pcie_err;
>> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
>> +
>> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>>   			if (sev == GHES_SEV_RECOVERABLE &&
>>   			    sec_sev == GHES_SEV_RECOVERABLE &&
>>   			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index d425374..7e2439e 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -32,6 +32,9 @@
>>   #include <linux/acpi.h>
>>   #include <linux/pci.h>
>>   #include <linux/aer.h>
>> +#include <linux/printk.h>
>> +#include <linux/bcd.h>
>> +#include <acpi/ghes.h>
>>   
>>   #define INDENT_SP	" "
>>   
>> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>>   	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>>   }
>>   
>> +static void cper_estatus_print_section_v300(const char *pfx,
>> +	const struct acpi_hest_generic_data_v300 *gdata)
>> +{
>> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
>> +
>> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
>> +		timestamp = (__u8 *)&(gdata->time_stamp);
>> +		sec = bcd2bin(timestamp[0]);
>> +		min = bcd2bin(timestamp[1]);
>> +		hour = bcd2bin(timestamp[2]);
>> +		day = bcd2bin(timestamp[4]);
>> +		mon = bcd2bin(timestamp[5]);
>> +		year = bcd2bin(timestamp[6]);
>> +		century = bcd2bin(timestamp[7]);
>> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
>> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
>> +			year, mon, day, hour, min, sec);
>> +	}
>> +}
>> +
>>   static void cper_estatus_print_section(
>> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
>> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>>   {
>>   	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>>   	__u16 severity;
>>   	char newpfx[64];
>>   
>> +	if (acpi_hest_generic_data_version(gdata) >= 3)
>> +		cper_estatus_print_section_v300(pfx,
>> +			(const struct acpi_hest_generic_data_v300 *)gdata);
>> +
>>   	severity = gdata->error_severity;
>>   	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>>   	       cper_severity_str(severity));
>> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
>>   
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>>   	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
>> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
>> +		struct cper_sec_proc_generic *proc_err;
>> +
>> +		proc_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: general processor error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*proc_err))
>>   			cper_print_proc_generic(newpfx, proc_err);
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
>> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
>> +		struct cper_sec_mem_err *mem_err;
>> +
>> +		mem_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: memory error\n", newpfx);
>>   		if (gdata->error_data_length >=
>>   		    sizeof(struct cper_sec_mem_err_old))
>> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
>> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
>> +		struct cper_sec_pcie *pcie;
>> +
>> +		pcie = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: PCIe error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*pcie))
>>   			cper_print_pcie(newpfx, pcie, gdata);
>> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>>   			const struct acpi_hest_generic_status *estatus)
>>   {
>>   	struct acpi_hest_generic_data *gdata;
>> -	unsigned int data_len, gedata_len;
>> +	unsigned int data_len;
>>   	int sec_no = 0;
>>   	char newpfx[64];
>>   	__u16 severity;
>> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>>   	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> +
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>>   		cper_estatus_print_section(newpfx, gdata, sec_no);
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   		sec_no++;
>>   	}
>>   }
>> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>>   		return rc;
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> -		if (gedata_len > data_len - sizeof(*gdata))
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
>> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>>   			return -EINVAL;
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   	}
>>   	if (data_len)
>>   		return -EINVAL;
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 68f088a..56b9679 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>>   {
>>   }
>>   #endif
>> +
>> +#define acpi_hest_generic_data_version(gdata)			\
>> +	(gdata->revision >> 8)
>> +
>> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
>> +{
>> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
>> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
>> +		gdata + 1;
>> +}
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index dcacb1a..13ea41c 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -255,6 +255,18 @@ enum {
>>   
>>   #define CPER_PCIE_SLOT_SHIFT			3
>>   
>> +#define acpi_hest_generic_data_error_length(gdata)	\
>> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
>> +#define acpi_hest_generic_data_size(gdata)		\
>> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
>> +	sizeof(struct acpi_hest_generic_data_v300) :	\
>> +	sizeof(struct acpi_hest_generic_data))
>> +#define acpi_hest_generic_data_record_size(gdata)	\
>> +	(acpi_hest_generic_data_size(gdata) +		\
>> +	acpi_hest_generic_data_error_length(gdata))
>> +#define acpi_hest_generic_data_next(gdata)		\
>> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
>> +
> How come these aren't in ghes.h?

It probably does make more sense to add these in ghes.h; I'll move them
there in the next set.
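
For reference, a rough user-space sketch of what the version and payload
helpers compute. The struct layouts (gdata_v2, gdata_v3) and the function
names are simplified stand-ins invented for this illustration, not the real
acpi_hest_generic_data definitions; only the two rules (major version in the
high byte of revision, payload directly after the version-specific header)
are taken from the patch. Writing the helpers as static inline functions
instead of macros would also avoid the unparenthesized macro argument in
acpi_hest_generic_data_version().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified v2 header: NOT the real ACPI layout. */
struct gdata_v2 {
	uint16_t revision;           /* major version in the high byte */
	uint32_t error_data_length;  /* payload bytes after the header */
} __attribute__((packed));

/* Hypothetical v3 header: same leading fields plus a timestamp. */
struct gdata_v3 {
	uint16_t revision;
	uint32_t error_data_length;
	uint64_t time_stamp;
} __attribute__((packed));

/* Extract the major version from the revision field. */
static inline unsigned int gdata_version(const struct gdata_v2 *g)
{
	return g->revision >> 8;
}

/* The payload starts right after the header that matches the version. */
static inline const void *gdata_payload(const struct gdata_v2 *g)
{
	if (gdata_version(g) >= 3)
		return (const struct gdata_v3 *)g + 1;
	return g + 1;
}
```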

> Reviewed-by: James Morse <james.morse@arm.com>
>
Thanks!
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
@ 2016-11-28 18:55       ` Baicar, Tyler
  0 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-28 18:55 UTC (permalink / raw)
  To: linux-arm-kernel

Hello James,

On 11/25/2016 11:20 AM, James Morse wrote:
> Hi Tyler,
>
> On 21/11/16 22:35, Tyler Baicar wrote:
>> Currently when a RAS error is reported it is not timestamped.
>> The ACPI 6.1 spec adds the timestamp field to the generic error
>> data entry v3 structure. The timestamp of when the firmware
>> generated the error is now being reported.
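
The byte layout of the new timestamp field can be read off the patch below:
seconds, minutes and hours in bytes 0-2, a flags byte at offset 3 whose bit 0
marks the time as precise, then day, month, year and century in bytes 4-7,
all BCD-encoded. A rough user-space sketch of the same decoding follows;
bcd_to_bin() and format_timestamp() are hypothetical helpers written for this
illustration (the kernel code uses bcd2bin() and printk() instead).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Convert one packed-BCD byte (two decimal digits) to its binary value. */
static unsigned int bcd_to_bin(uint8_t v)
{
	return (v & 0x0f) + (v >> 4) * 10;
}

/*
 * Format an 8-byte BCD timestamp as "[precise ]CCYY-MM-DD hh:mm:ss".
 * Returns the value of snprintf().
 */
static int format_timestamp(const uint8_t ts[8], char *out, size_t len)
{
	return snprintf(out, len, "%s%02u%02u-%02u-%02u %02u:%02u:%02u",
			(ts[3] & 0x01) ? "precise " : "",
			bcd_to_bin(ts[7]), bcd_to_bin(ts[6]),  /* century, year */
			bcd_to_bin(ts[5]), bcd_to_bin(ts[4]),  /* month, day */
			bcd_to_bin(ts[2]), bcd_to_bin(ts[1]),  /* hour, minute */
			bcd_to_bin(ts[0]));                    /* second */
}
```
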
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b79abc5..9063d68 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>>   	int flags = -1;
>>   	int sec_sev = ghes_severity(gdata->error_severity);
>>   	struct cper_sec_mem_err *mem_err;
>> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>> +
>> +	mem_err = acpi_hest_generic_data_payload(gdata);
>>   
>>   	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>>   		return;
>> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>>   {
>>   	int sev, sec_sev;
>>   	struct acpi_hest_generic_data *gdata;
>> +	uuid_le sec_type;
> ghes.c doesn't include <linux/uuid.h>, but I see it already uses uuid_le_cmp().
> Worth fixing as part of this patch?

I can add it here, but it shouldn't be needed: ghes.c includes <linux/cper.h>,
and that header includes <linux/uuid.h>. Should it be added just to make the
dependency clearer?

>>   
>>   	sev = ghes_severity(estatus->error_severity);
>>   	apei_estatus_for_each_section(estatus, gdata) {
>>   		sec_sev = ghes_severity(gdata->error_severity);
>> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>> +		sec_type = *(uuid_le *)gdata->section_type;
>> +
> You don't use sec_type again here, why change this?
> (should it be in a later patch?)

Ah, yes, this change should be moved to patch 8 in this patchset.

>> +		if (!uuid_le_cmp(sec_type,
>>   				 CPER_SEC_PLATFORM_MEM)) {
>>   			struct cper_sec_mem_err *mem_err;
>> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
>> +
>> +			mem_err = acpi_hest_generic_data_payload(gdata);
>>   			ghes_edac_report_mem_error(ghes, sev, mem_err);
>>   
>>   			arch_apei_report_mem_error(sev, mem_err);
>> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>>   		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>>   				      CPER_SEC_PCIE)) {
>>   			struct cper_sec_pcie *pcie_err;
>> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
>> +
>> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>>   			if (sev == GHES_SEV_RECOVERABLE &&
>>   			    sec_sev == GHES_SEV_RECOVERABLE &&
>>   			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index d425374..7e2439e 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -32,6 +32,9 @@
>>   #include <linux/acpi.h>
>>   #include <linux/pci.h>
>>   #include <linux/aer.h>
>> +#include <linux/printk.h>
>> +#include <linux/bcd.h>
>> +#include <acpi/ghes.h>
>>   
>>   #define INDENT_SP	" "
>>   
>> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>>   	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>>   }
>>   
>> +static void cper_estatus_print_section_v300(const char *pfx,
>> +	const struct acpi_hest_generic_data_v300 *gdata)
>> +{
>> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
>> +
>> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
>> +		timestamp = (__u8 *)&(gdata->time_stamp);
>> +		sec = bcd2bin(timestamp[0]);
>> +		min = bcd2bin(timestamp[1]);
>> +		hour = bcd2bin(timestamp[2]);
>> +		day = bcd2bin(timestamp[4]);
>> +		mon = bcd2bin(timestamp[5]);
>> +		year = bcd2bin(timestamp[6]);
>> +		century = bcd2bin(timestamp[7]);
>> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
>> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
>> +			year, mon, day, hour, min, sec);
>> +	}
>> +}
>> +
>>   static void cper_estatus_print_section(
>> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
>> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>>   {
>>   	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>>   	__u16 severity;
>>   	char newpfx[64];
>>   
>> +	if (acpi_hest_generic_data_version(gdata) >= 3)
>> +		cper_estatus_print_section_v300(pfx,
>> +			(const struct acpi_hest_generic_data_v300 *)gdata);
>> +
>>   	severity = gdata->error_severity;
>>   	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>>   	       cper_severity_str(severity));
>> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
>>   
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>>   	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
>> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
>> +		struct cper_sec_proc_generic *proc_err;
>> +
>> +		proc_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: general processor error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*proc_err))
>>   			cper_print_proc_generic(newpfx, proc_err);
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
>> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
>> +		struct cper_sec_mem_err *mem_err;
>> +
>> +		mem_err = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: memory error\n", newpfx);
>>   		if (gdata->error_data_length >=
>>   		    sizeof(struct cper_sec_mem_err_old))
>> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>>   		else
>>   			goto err_section_too_small;
>>   	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
>> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
>> +		struct cper_sec_pcie *pcie;
>> +
>> +		pcie = acpi_hest_generic_data_payload(gdata);
>>   		printk("%s""section_type: PCIe error\n", newpfx);
>>   		if (gdata->error_data_length >= sizeof(*pcie))
>>   			cper_print_pcie(newpfx, pcie, gdata);
>> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>>   			const struct acpi_hest_generic_status *estatus)
>>   {
>>   	struct acpi_hest_generic_data *gdata;
>> -	unsigned int data_len, gedata_len;
>> +	unsigned int data_len;
>>   	int sec_no = 0;
>>   	char newpfx[64];
>>   	__u16 severity;
>> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>>   	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> +
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>>   		cper_estatus_print_section(newpfx, gdata, sec_no);
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   		sec_no++;
>>   	}
>>   }
>> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>>   		return rc;
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> -		if (gedata_len > data_len - sizeof(*gdata))
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
>> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>>   			return -EINVAL;
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   	}
>>   	if (data_len)
>>   		return -EINVAL;
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 68f088a..56b9679 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>>   {
>>   }
>>   #endif
>> +
>> +#define acpi_hest_generic_data_version(gdata)			\
>> +	(gdata->revision >> 8)
>> +
>> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
>> +{
>> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
>> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
>> +		gdata + 1;
>> +}
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index dcacb1a..13ea41c 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -255,6 +255,18 @@ enum {
>>   
>>   #define CPER_PCIE_SLOT_SHIFT			3
>>   
>> +#define acpi_hest_generic_data_error_length(gdata)	\
>> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
>> +#define acpi_hest_generic_data_size(gdata)		\
>> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
>> +	sizeof(struct acpi_hest_generic_data_v300) :	\
>> +	sizeof(struct acpi_hest_generic_data))
>> +#define acpi_hest_generic_data_record_size(gdata)	\
>> +	(acpi_hest_generic_data_size(gdata) +		\
>> +	acpi_hest_generic_data_error_length(gdata))
>> +#define acpi_hest_generic_data_next(gdata)		\
>> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
>> +
> How come these aren't in ghes.h?

It probably does make more sense to add these in ghes.h; I'll move them
there in the next set.

> Reviewed-by: James Morse <james.morse@arm.com>
>
Thanks!
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-21 22:35   ` Tyler Baicar
                     ` (2 preceding siblings ...)
  (?)
@ 2016-11-29 11:29   ` Shiju Jose
  -1 siblings, 0 replies; 55+ messages in thread
From: Shiju Jose @ 2016-11-29 11:29 UTC (permalink / raw)
  To: Tyler Baicar, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee

> -----Original Message-----
> From: linux-acpi-owner@vger.kernel.org [mailto:linux-acpi-
> owner@vger.kernel.org] On Behalf Of Tyler Baicar
> Sent: 21 November 2016 22:36
> To: marc.zyngier@arm.com; pbonzini@redhat.com; rkrcmar@redhat.com;
> linux@armlinux.org.uk; catalin.marinas@arm.com; will.deacon@arm.com;
> rjw@rjwysocki.net; lenb@kernel.org; matt@codeblueprint.co.uk;
> robert.moore@intel.com; lv.zheng@intel.com; nkaje@codeaurora.org;
> zjzhang@codeaurora.org; mark.rutland@arm.com; james.morse@arm.com;
> akpm@linux-foundation.org; eun.taik.lee@samsung.com;
> sandeepa.s.prabhu@gmail.com; shijie.huang@arm.com;
> rruigrok@codeaurora.org; paul.gortmaker@windriver.com;
> tomasz.nowicki@linaro.org; fu.wei@linaro.org; rostedt@goodmis.org;
> bristot@redhat.com; linux-arm-kernel@lists.infradead.org;
> kvmarm@lists.cs.columbia.edu; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-acpi@vger.kernel.org; linux-
> efi@vger.kernel.org; Suzuki.Poulose@arm.com; punit.agrawal@arm.com;
> astone@redhat.com; harba@codeaurora.org; hanjun.guo@linaro.org
> Cc: Tyler Baicar
> Subject: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data
> entry v3 per ACPI 6.1
> 
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error data
> entry v3 structure. The timestamp of when the firmware generated the
> error is now being reported.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c    | 14 +++++++---
>  drivers/firmware/efi/cper.c | 62 +++++++++++++++++++++++++++++++++++----------
>  include/acpi/ghes.h         | 10 ++++++++
>  include/linux/cper.h        | 12 +++++++++
>  4 files changed, 80 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
> 
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;
> 
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +
> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
> 
>  			arch_apei_report_mem_error(sev, mem_err);
> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
> 
>  #define INDENT_SP	" "
> 
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  }
> 
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata)
> +{
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
> 
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
> 
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old))
> @@ -419,7 +450,9 @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata);
> @@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)
>  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }

Hi Tyler,

Won't the while loop above fail to terminate, since data_len is no longer
updated as it was in the V4 patch? That is the behaviour we saw when testing
on our platform; it worked fine once we updated data_len.
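
A minimal sketch of the loop shape in question, assuming simplified stand-in
structures (entry_v2, entry_v3) and a hypothetical count_entries() helper
rather than the real kernel types: the remaining length has to shrink by one
full record (header plus payload) on every iteration, otherwise the loop
condition never becomes false.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified headers: NOT the real ACPI layouts. */
struct entry_v2 {
	uint16_t revision;           /* major version in the high byte */
	uint32_t error_data_length;  /* payload bytes after the header */
} __attribute__((packed));

struct entry_v3 {
	uint16_t revision;
	uint32_t error_data_length;
	uint64_t time_stamp;         /* extra field makes the v3 header larger */
} __attribute__((packed));

/* Header size depends on the version encoded in the entry itself. */
static size_t entry_header_size(const void *e)
{
	const struct entry_v2 *h = e;

	return (h->revision >> 8) >= 3 ? sizeof(struct entry_v3)
				       : sizeof(struct entry_v2);
}

/* Full record size: version-specific header plus payload. */
static size_t entry_record_size(const void *e)
{
	const struct entry_v2 *h = e;

	return entry_header_size(e) + h->error_data_length;
}

/*
 * Walk the entries in buf[0..data_len). Returns the number of entries,
 * or -1 if the buffer is truncated or has trailing bytes.
 */
static int count_entries(const uint8_t *buf, size_t data_len)
{
	int n = 0;

	while (data_len >= sizeof(struct entry_v2)) {
		size_t rec;

		if (data_len < entry_header_size(buf))
			return -1;
		rec = entry_record_size(buf);
		if (rec > data_len)
			return -1;

		data_len -= rec;  /* without this the loop never ends */
		buf += rec;
		n++;
	}
	return data_len ? -1 : n;
}
```
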
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
>  {
>  }
>  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
> +{
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
> 
>  #define CPER_PCIE_SLOT_SHIFT			3
> 
> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +
>  /*
>   * All tables and structs must be byte-packed to match CPER
>   * specification, since the tables are provided by the system BIOS
> --
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
> Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-21 22:35   ` Tyler Baicar
  (?)
  (?)
@ 2016-11-29 11:29   ` Shiju Jose
  2016-11-29 17:30       ` Baicar, Tyler
  -1 siblings, 1 reply; 55+ messages in thread
From: Shiju Jose @ 2016-11-29 11:29 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: linux-acpi-owner at vger.kernel.org [mailto:linux-acpi-
> owner at vger.kernel.org] On Behalf Of Tyler Baicar
> Sent: 21 November 2016 22:36
> To: marc.zyngier at arm.com; pbonzini at redhat.com; rkrcmar at redhat.com;
> linux at armlinux.org.uk; catalin.marinas at arm.com; will.deacon at arm.com;
> rjw at rjwysocki.net; lenb at kernel.org; matt at codeblueprint.co.uk;
> robert.moore at intel.com; lv.zheng at intel.com; nkaje at codeaurora.org;
> zjzhang at codeaurora.org; mark.rutland at arm.com; james.morse at arm.com;
> akpm at linux-foundation.org; eun.taik.lee at samsung.com;
> sandeepa.s.prabhu at gmail.com; shijie.huang at arm.com;
> rruigrok at codeaurora.org; paul.gortmaker at windriver.com;
> tomasz.nowicki at linaro.org; fu.wei at linaro.org; rostedt at goodmis.org;
> bristot at redhat.com; linux-arm-kernel at lists.infradead.org;
> kvmarm at lists.cs.columbia.edu; kvm at vger.kernel.org; linux-
> kernel at vger.kernel.org; linux-acpi at vger.kernel.org; linux-
> efi at vger.kernel.org; Suzuki.Poulose at arm.com; punit.agrawal at arm.com;
> astone at redhat.com; harba at codeaurora.org; hanjun.guo at linaro.org
> Cc: Tyler Baicar
> Subject: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data
> entry v3 per ACPI 6.1
> 
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error data
> entry v3 structure. The timestamp of when the firmware generated the
> error is now being reported.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c    | 14 +++++++---
>  drivers/firmware/efi/cper.c | 62 +++++++++++++++++++++++++++++++++++----------
>  include/acpi/ghes.h         | 10 ++++++++
>  include/linux/cper.h        | 12 +++++++++
>  4 files changed, 80 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
> 
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;
> 
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +
> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
> 
>  			arch_apei_report_mem_error(sev, mem_err);
> @@ -467,7 +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits &
> CPER_PCIE_VALID_DEVICE_ID && diff --git a/drivers/firmware/efi/cper.c
> b/drivers/firmware/efi/cper.c index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
> 
>  #define INDENT_SP	" "
> 
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx,
> const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);  }
> 
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata) {
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n",
> pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int
> sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
> 
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
> 
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata +
> 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n",
> newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old)) @@ -419,7 +450,9
> @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata); @@ -438,7
> +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx,
> cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }
Hi Tyler,
Won't the above while loop fail to terminate, since data_len is no longer updated as it was in the V4 patch?
This is the behaviour we saw when testing on our platform. It worked fine once we updated data_len in the loop.
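
To illustrate the point about data_len, here is a minimal user-space sketch of a section walk that shrinks data_len on every iteration. The struct layouts and the name count_sections() are assumptions made for this example; they are simplified stand-ins, not the kernel's acpi_hest_generic_data definitions or the fix that was eventually applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures (illustrative layout only). */
struct gdata_v2 {
	uint16_t revision;		/* major version in the high byte */
	uint32_t error_data_length;	/* payload bytes after the header */
};

struct gdata_v3 {
	uint16_t revision;
	uint32_t error_data_length;
	uint64_t time_stamp;		/* field added by the v3 layout */
};

/* Return the number of sections in buf, or -1 if the lengths are inconsistent. */
static int count_sections(const uint8_t *buf, size_t data_len)
{
	int sec_no = 0;

	assert(buf != NULL);

	while (data_len >= sizeof(struct gdata_v2)) {
		struct gdata_v2 hdr;
		size_t hdr_size, rec_size;

		memcpy(&hdr, buf, sizeof(hdr));	/* avoid unaligned access */
		hdr_size = (hdr.revision >> 8) >= 3 ?
			sizeof(struct gdata_v3) : sizeof(struct gdata_v2);
		if (hdr_size > data_len ||
		    hdr.error_data_length > data_len - hdr_size)
			return -1;

		rec_size = hdr_size + hdr.error_data_length;
		buf += rec_size;
		data_len -= rec_size;	/* without this the condition never becomes false */
		sec_no++;
	}

	/* Leftover bytes mean the record lengths did not add up. */
	return data_len ? -1 : sec_no;
}
```

The header is copied out with memcpy() because records are packed back to back and need not be aligned in this sketch. The key line is the data_len decrement: it is what lets the loop condition eventually fail.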
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct
> acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len -
> acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h index
> 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes
> *ghes)  {  }  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct
> +acpi_hest_generic_data *gdata) {
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) +
> 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h index
> dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
> 
>  #define CPER_PCIE_SLOT_SHIFT			3
> 
> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +
>  /*
>   * All tables and structs must be byte-packed to match CPER
>   * specification, since the tables are provided by the system BIOS
> --
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
> Linux Foundation Collaborative Project.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi"
> in the body of a message to majordomo at vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-21 22:35   ` Tyler Baicar
@ 2016-11-29 12:26     ` Shiju Jose
  -1 siblings, 0 replies; 55+ messages in thread
From: Shiju Jose @ 2016-11-29 12:26 UTC (permalink / raw)
  To: Tyler Baicar, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee
  Cc: Gabriele Paoloni, John Garry, Linuxarm, xuwei (O), Anurup m, Shiju Jose

Hi Tyler,

Please find the following comment.

Thanks,
Shiju 

> 
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error data
> entry v3 structure. The timestamp of when the firmware generated the
> error is now being reported.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c    | 14 +++++++---
>  drivers/firmware/efi/cper.c | 62 +++++++++++++++++++++++++++++++++++--
> --------
>  include/acpi/ghes.h         | 10 ++++++++
>  include/linux/cper.h        | 12 +++++++++
>  4 files changed, 80 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct
> acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
> 
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;
> 
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +
> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
> 
>  			arch_apei_report_mem_error(sev, mem_err); @@ -467,7
> +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits &
> CPER_PCIE_VALID_DEVICE_ID && diff --git a/drivers/firmware/efi/cper.c
> b/drivers/firmware/efi/cper.c index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
> 
>  #define INDENT_SP	" "
> 
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx,
> const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);  }
> 
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata) {
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n",
> pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int
> sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
> 
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
> 
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata +
> 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n",
> newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old)) @@ -419,7 +450,9
> @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata); @@ -438,7
> +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx,
> cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }
Won't the above while loop fail to terminate, since data_len is no longer updated as it was in the V4 patch?
This is the behaviour we saw when testing on our platform. It worked fine once we updated data_len in the loop.
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct
> acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len -
> acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h index
> 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes
> *ghes)  {  }  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct
> +acpi_hest_generic_data *gdata) {
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) +
> 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h index
> dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
> 
>  #define CPER_PCIE_SLOT_SHIFT			3
> 
> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +
>  /*
>   * All tables and structs must be byte-packed to match CPER
>   * specification, since the tables are provided by the system BIOS
> --
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
> Linux Foundation Collaborative Project.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
@ 2016-11-29 12:26     ` Shiju Jose
  0 siblings, 0 replies; 55+ messages in thread
From: Shiju Jose @ 2016-11-29 12:26 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Tyler,

Please find the following comment.

Thanks,
Shiju 

> 
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error data
> entry v3 structure. The timestamp of when the firmware generated the
> error is now being reported.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Richard Ruigrok <rruigrok@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
> ---
>  drivers/acpi/apei/ghes.c    | 14 +++++++---
>  drivers/firmware/efi/cper.c | 62 +++++++++++++++++++++++++++++++++++--
> --------
>  include/acpi/ghes.h         | 10 ++++++++
>  include/linux/cper.h        | 12 +++++++++
>  4 files changed, 80 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> b79abc5..9063d68 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct
> acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +
> +	mem_err = acpi_hest_generic_data_payload(gdata);
> 
>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -450,14 +451,18 @@ static void ghes_do_proc(struct ghes *ghes,  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;
> 
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> +		sec_type = *(uuid_le *)gdata->section_type;
> +
> +		if (!uuid_le_cmp(sec_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> +			mem_err = acpi_hest_generic_data_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
> 
>  			arch_apei_report_mem_error(sev, mem_err); @@ -467,7
> +472,8 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> +			pcie_err = acpi_hest_generic_data_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits &
> CPER_PCIE_VALID_DEVICE_ID && diff --git a/drivers/firmware/efi/cper.c
> b/drivers/firmware/efi/cper.c index d425374..7e2439e 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,9 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
> +#include <acpi/ghes.h>
> 
>  #define INDENT_SP	" "
> 
> @@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx,
> const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);  }
> 
> +static void cper_estatus_print_section_v300(const char *pfx,
> +	const struct acpi_hest_generic_data_v300 *gdata) {
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec = bcd2bin(timestamp[0]);
> +		min = bcd2bin(timestamp[1]);
> +		hour = bcd2bin(timestamp[2]);
> +		day = bcd2bin(timestamp[4]);
> +		mon = bcd2bin(timestamp[5]);
> +		year = bcd2bin(timestamp[6]);
> +		century = bcd2bin(timestamp[7]);
> +		printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n",
> pfx,
> +			0x01 & *(timestamp + 3) ? "precise" : "", century,
> +			year, mon, day, hour, min, sec);
> +	}
> +}
> +
>  static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int
> sec_no)
> +	const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
>  	char newpfx[64];
> 
> +	if (acpi_hest_generic_data_version(gdata) >= 3)
> +		cper_estatus_print_section_v300(pfx,
> +			(const struct acpi_hest_generic_data_v300 *)gdata);
> +
>  	severity = gdata->error_severity;
>  	printk("%s""Error %d, type: %s\n", pfx, sec_no,
>  	       cper_severity_str(severity));
> @@ -403,14 +430,18 @@ static void cper_estatus_print_section(
> 
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata +
> 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: general processor error\n",
> newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old)) @@ -419,7 +450,9
> @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_generic_data_payload(gdata);
>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata); @@ -438,7
> +471,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>  	printk("%s""event severity: %s\n", pfx,
> cper_severity_str(severity));
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> +
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		gdata = acpi_hest_generic_data_next(gdata);
>  		sec_no++;
>  	}
>  }
Won't the above while loop fail to terminate, since data_len is no longer updated as it was in the V4 patch?
This is the behaviour we saw when testing on our platform. It worked fine once we updated data_len in the loop.
> @@ -486,12 +519,13 @@ int cper_estatus_check(const struct
> acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))
> +
> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
> +		gedata_len = acpi_hest_generic_data_error_length(gdata);
> +		if (gedata_len > data_len -
> acpi_hest_generic_data_size(gdata))
>  			return -EINVAL;
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
> +		gdata = acpi_hest_generic_data_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h index
> 68f088a..56b9679 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -73,3 +73,13 @@ static inline void ghes_edac_unregister(struct ghes
> *ghes)  {  }  #endif
> +
> +#define acpi_hest_generic_data_version(gdata)			\
> +	(gdata->revision >> 8)
> +
> +static inline void *acpi_hest_generic_data_payload(struct
> +acpi_hest_generic_data *gdata) {
> +	return acpi_hest_generic_data_version(gdata) >= 3 ?
> +		(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) +
> 1) :
> +		gdata + 1;
> +}
> diff --git a/include/linux/cper.h b/include/linux/cper.h index
> dcacb1a..13ea41c 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -255,6 +255,18 @@ enum {
> 
>  #define CPER_PCIE_SLOT_SHIFT			3
> 
> +#define acpi_hest_generic_data_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_generic_data_size(gdata)		\
> +	((acpi_hest_generic_data_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_generic_data_record_size(gdata)	\
> +	(acpi_hest_generic_data_size(gdata) +		\
> +	acpi_hest_generic_data_error_length(gdata))
> +#define acpi_hest_generic_data_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
> +
>  /*
>   * All tables and structs must be byte-packed to match CPER
>   * specification, since the tables are provided by the system BIOS
> --
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
> Linux Foundation Collaborative Project.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi"
> in the body of a message to majordomo at vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 03/10] efi: parse ARMv8 processor error
  2016-11-25 18:23     ` James Morse
  (?)
  (?)
@ 2016-11-29 15:37       ` Baicar, Tyler
  -1 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-29 15:37 UTC (permalink / raw)
  To: James Morse
  Cc: marc.zyngier, pbonzini, rkrcmar, linux, catalin.marinas,
	will.deacon, rjw, lenb, matt, robert.moore, lv.zheng, nkaje,
	zjzhang, mark.rutland, akpm, eun.taik.lee, sandeepa.s.prabhu,
	shijie.huang, rruigrok, paul.gortmaker, tomasz.nowicki, fu.wei,
	rostedt, bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel,
	linux-acpi, linux-efi, Suzuki.Poulose, punit.agrawal, astone

Hello James,

On 11/25/2016 11:23 AM, James Morse wrote:
> Hi Tyler,
>
> On 21/11/16 22:35, Tyler Baicar wrote:
>> Add support for ARMv8 Common Platform Error Record (CPER).
>> UEFI 2.6 specification adds support for ARMv8 specific
>> processor error information to be reported as part of the
>> CPER records. This provides more detail on for processor error logs.
> I think I'm missing a big part of the puzzle here, I will come back to this next
> week. I can't quite line up some of the masks and shifts with the table
> descriptions in the UEFI spec[0].

It looks like there was some misunderstanding when the context info
parsing was added here (probably because the spec has some issues that
I describe below). I'll need to clean up quite a bit of the context
info parsing. I didn't catch this earlier because we aren't reporting
context info in firmware right now for the errors I have been testing.

>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index 13ea41c..2a9d553 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -180,6 +185,10 @@ enum {
>>   #define CPER_SEC_PROC_IPF						\
>>   	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
>>   		0x80, 0xC7, 0x3C, 0x88, 0x81)
>> +/* Processor Specific: ARMv8 */
>> +#define CPER_SEC_PROC_ARMV8						\
>> +	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
>> +		0x1D, 0x5D, 0x46, 0xB0)
> Nit: UEFI v2.6 N.2.2 (table 249) describes this as 'ARM' not 'ARMV8' (which is
> an architectural version).

I'll change it in the next set.

>>   /* Platform Memory */
>>   #define CPER_SEC_PLATFORM_MEM						\
>>   	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
>> @@ -255,6 +264,34 @@ enum {
>>   
>>   #define CPER_PCIE_SLOT_SHIFT			3
>>   
>> +#define CPER_ARMV8_ERR_INFO_NUM_MASK		0x00000000000000FF
>> +#define CPER_ARMV8_CTX_INFO_NUM_MASK		0x0000000000FFFF00
> Table 260 describes both ERR_INFO_NUM and CONTEXT_INFO_NUM for as both being
> 2bytes long, as does your struct cper_sec_proc_armv8 below. Are these for
> something else? Do these correspond with one of the four bitfield formats
> described in Table 262->265?
>
> I can't see where they are used, and they look like they are reaching across
> multiple fields in a struct.

I will remove these as they aren't needed.

>> +#define CPER_ARMV8_CTX_INFO_NUM_SHIFT		8
>> +
>> +#define CPER_ARMV8_VALID_MPIDR			0x00000001
>> +#define CPER_ARMV8_VALID_AFFINITY_LEVEL		0x00000002
>> +#define CPER_ARMV8_VALID_RUNNING_STATE		0x00000004
>> +#define CPER_ARMV8_VALID_VENDOR_INFO		0x00000008
>> +
>> +#define CPER_ARMV8_INFO_VALID_MULTI_ERR		0x0001
>> +#define CPER_ARMV8_INFO_VALID_FLAGS		0x0002
>> +#define CPER_ARMV8_INFO_VALID_ERR_INFO		0x0004
>> +#define CPER_ARMV8_INFO_VALID_VIRT_ADDR		0x0008
>> +#define CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR	0x0010
>> +
>> +#define CPER_ARMV8_INFO_FLAGS_FIRST		0x0001
>> +#define CPER_ARMV8_INFO_FLAGS_LAST		0x0002
>> +#define CPER_ARMV8_INFO_FLAGS_PROPAGATED	0x0004
>> +
>> +#define CPER_AARCH64_CTX_LEN			368
>> +#define CPER_AARCH32_CTX_LEN			256
> Are these the worst case sizes for combinations of the structures in N2.4.4.2?
> (Tables 266 to 273)
>
> If so is there any chance they could be sizeof(<some union of structs>), even if
> the structs are things like:
>> /* ARMv8 AArch64 GPRs (Type 4) - defined in UEFI Spec N2.4.4.2 */
>> struct cper_armv8_aarch64_gprs {
>> 	u64 regs[32];
>> };
> This way it's easier to check the number is correct, and if a new type is added
> it won't get forgotten.

These represented the sizes of Table 266 and Table 267, but looking at
this more closely, parts of the spec don't seem to make sense:

Table 260 has the Processor Context field, which only mentions Tables 266
and 267. I think that should really be Tables 266 - 274, representing all
9 context types.

Table 265 then has the Register Array field, which says the contents of
the array are described in Tables 267 - 271. I think this should also be
Tables 266 - 274, to cover all 9 context types.

And the text before Table 274 is clearly wrong in calling it Table 275.
There seem to be several mistakes in the table numbering in this section.

I'm going to need to update the context info parsing code and add the
other register array sizes based on all of the context tables. The code
will probably need some restructuring, because otherwise there will be
quite a bit of duplication.
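
As a rough illustration of the sizeof() approach suggested in the review, the
worst-case length could be derived from a union instead of being hard-coded.
This is only a sketch: it uses uint64_t so it builds outside the kernel, the
AArch64 GPR struct follows the example quoted above, and the second struct,
the union name and CPER_ARMV8_CTX_MAX_LEN are made-up placeholders, not
definitions from the spec or the patch.

```c
#include <assert.h>
#include <stdint.h>

/* AArch64 GPRs (Type 4), following the struct suggested in the review. */
struct cper_armv8_aarch64_gprs {
	uint64_t regs[32];
};

/*
 * Placeholder for a second context type. The register count here is
 * invented for illustration and is NOT taken from the UEFI tables.
 */
struct cper_example_other_ctx {
	uint64_t regs[4];
};

/* One member per context type; a new type only needs a new member. */
union cper_armv8_ctx_regs {
	struct cper_armv8_aarch64_gprs	aarch64_gprs;
	struct cper_example_other_ctx	other;
};

/* Worst-case register array length, derived rather than hard-coded. */
#define CPER_ARMV8_CTX_MAX_LEN	sizeof(union cper_armv8_ctx_regs)

/* Compile-time check that the largest member sets the size. */
static_assert(CPER_ARMV8_CTX_MAX_LEN == 32 * sizeof(uint64_t),
	      "largest member should be the AArch64 GPR array");
```

With something like this, adding a context type means adding one struct and
one union member, and the maximum length follows automatically.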

>> +#define CPER_ARMV8_CTX_TYPE_MASK		0x000000000000000F
>> +#define CPER_ARMV8_CTX_EL_MASK			0x0000000000000070
>> +#define CPER_ARMV8_CTX_NS_MASK			0x0000000000000080
>> +#define CPER_ARMV8_CTX_EL_SHIFT			4
>> +#define CPER_ARMV8_CTX_NS_SHIFT			7
>> +
> Again, I can't work out what these correspond to. I can't see a secure bit or EL
> field in any of those UEFI tables.
>
> Is this one of the 'ARM Vendor Specific Micro-Architecture Error Structure's? If
> so we should have some infrastructure for picking the correct (or unknown)
> decode function based on a range of MIDRs.

These will be removed. The exception level and secure context
information will be conveyed by which register context type is being
reported:

0 - AArch32 GPRs (General Purpose Registers)
1 - AArch32 EL1 context registers
2 - AArch32 EL2 context registers
3 - AArch32 secure context registers
4 - AArch64 GPRs
5 - AArch64 EL1 context registers
6 - AArch64 EL2 context registers
7 - AArch64 EL3 context registers
8 - Misc. System Register Structure
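
One possible way to carry the list above in code is an enum plus a name
table, so the printing code can look up a string by type and fall back to
"unknown" for values it does not recognise. The identifiers below
(cper_arm_ctx_type, cper_arm_ctx_names, cper_arm_ctx_name) are hypothetical
names chosen for this sketch, not the ones that will be in the next set.

```c
#include <assert.h>

/* Register context types, numbered 0-8 as in the list above. */
enum cper_arm_ctx_type {
	CPER_ARM_CTX_AARCH32_GPR = 0,
	CPER_ARM_CTX_AARCH32_EL1,
	CPER_ARM_CTX_AARCH32_EL2,
	CPER_ARM_CTX_AARCH32_SECURE,
	CPER_ARM_CTX_AARCH64_GPR,
	CPER_ARM_CTX_AARCH64_EL1,
	CPER_ARM_CTX_AARCH64_EL2,
	CPER_ARM_CTX_AARCH64_EL3,
	CPER_ARM_CTX_MISC,
	CPER_ARM_CTX_TYPE_COUNT		/* number of known types */
};

/* Human-readable names, indexed by context type. */
static const char * const cper_arm_ctx_names[] = {
	"AArch32 GPRs",
	"AArch32 EL1 context registers",
	"AArch32 EL2 context registers",
	"AArch32 secure context registers",
	"AArch64 GPRs",
	"AArch64 EL1 context registers",
	"AArch64 EL2 context registers",
	"AArch64 EL3 context registers",
	"Misc. System Register Structure",
};

/* Catch a type being added to the enum without a matching name. */
static_assert(sizeof(cper_arm_ctx_names) / sizeof(cper_arm_ctx_names[0]) ==
	      CPER_ARM_CTX_TYPE_COUNT, "one name per context type");

/* Return the name for a context type, or "unknown" if out of range. */
static const char *cper_arm_ctx_name(unsigned int type)
{
	if (type >= CPER_ARM_CTX_TYPE_COUNT)
		return "unknown";
	return cper_arm_ctx_names[type];
}
```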

>>   #define acpi_hest_generic_data_error_length(gdata)	\
>>   	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
>>   #define acpi_hest_generic_data_size(gdata)		\
>> @@ -352,6 +389,41 @@ struct cper_ia_proc_ctx {
>>   	__u64	mm_reg_addr;
>>   };
>>   
>> +/* ARMv8 Processor Error Section */
>> +struct cper_sec_proc_armv8 {
>> +	__u32	validation_bits;
>> +	__u16	err_info_num; /* Number of Processor Error Info */
>> +	__u16	context_info_num; /* Number of Processor Context Info Records*/
>> +	__u32	section_length;
>> +	__u8	affinity_level;
>> +	__u8	reserved[3];	/* must be zero */
>> +	__u64	mpidr;
>> +	__u64	midr;
>> +	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
>> +	__u32	psci_state;
>> +};
>> +
>> +/* ARMv8 Processor Error Information Structure */
>> +struct cper_armv8_err_info {
>> +	__u8	version;
>> +	__u8	length;
>> +	__u16	validation_bits;
>> +	__u8	type;
>> +	__u16	multiple_error;
>> +	__u8	flags;
>> +	__u64	error_info;
>> +	__u64	virt_fault_addr;
>> +	__u64	physical_fault_addr;
>> +};
>
>> +/* ARMv8 AARCH64 Processor Context Information Structure */
>> +struct cper_armv8_aarch64_ctx {
>> +	__u8	type_el_ns;
>> +	__u8	reserved[7];	/* must be zero */
>> +	__u8	gpr[288];
>> +	__u8	spr[68];
>> +};
> Is this:
> "Table 265. ARM Processor Error Context Information Header Structure"?

This structure should be removed; it isn't used in the code now.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
  2016-11-29 11:29   ` Shiju Jose
@ 2016-11-29 17:30       ` Baicar, Tyler
  0 siblings, 0 replies; 55+ messages in thread
From: Baicar, Tyler @ 2016-11-29 17:30 UTC (permalink / raw)
  To: Shiju Jose, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee

On 11/29/2016 4:29 AM, Shiju Jose wrote:
>> @@ -451,12 +484,12 @@ void cper_estatus_print(const char *pfx,
>>   	printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
>>   	data_len = estatus->data_length;
>>   	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>> +
>>   	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>> -	while (data_len >= sizeof(*gdata)) {
>> -		gedata_len = gdata->error_data_length;
>> +
>> +	while (data_len >= acpi_hest_generic_data_size(gdata)) {
>>   		cper_estatus_print_section(newpfx, gdata, sec_no);
>> -		data_len -= gedata_len + sizeof(*gdata);
>> -		gdata = (void *)(gdata + 1) + gedata_len;
>> +		gdata = acpi_hest_generic_data_next(gdata);
>>   		sec_no++;
>>   	}
>>   }
> Hi Tyler,
> Won't the above while loop fail to terminate, because data_len is no longer updated as it was in the V4 patch?
> This is the behaviour we saw when we tested on our platform. It worked fine once we updated data_len.

Hello Shiju,

Thank you for testing, and you're right. It looks like I got a little too excited about this code simplification. :)
I'll add the data_len update in the next patchset.
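
For reference, the shape of the loop with the update restored could look
roughly like the following. This is a user-space sketch under simplifying
assumptions, not the kernel code: struct gdata_hdr is a cut-down stand-in for
struct acpi_hest_generic_data, and gdata_size() and count_sections() are
hypothetical helpers. The point is that data_len has to shrink by the size of
each entry, otherwise the loop condition never changes and the walk can run
past the end of the status block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the generic data entry: header, then payload. */
struct gdata_hdr {
	uint32_t error_data_length;	/* payload bytes after this header */
};

/* Total size of one entry (header plus payload). */
static size_t gdata_size(const struct gdata_hdr *g)
{
	return sizeof(*g) + g->error_data_length;
}

/* Walk the entries in buf and return how many complete ones were seen. */
static int count_sections(const void *buf, size_t data_len)
{
	const unsigned char *p = buf;
	int sec_no = 0;

	assert(buf != NULL);

	while (data_len >= sizeof(struct gdata_hdr)) {
		struct gdata_hdr hdr;
		size_t size;

		memcpy(&hdr, p, sizeof(hdr));	/* avoid unaligned reads */
		size = gdata_size(&hdr);
		if (size > data_len)
			break;			/* truncated entry */

		data_len -= size;		/* the update missing in V5 */
		p += size;			/* step to the next entry */
		sec_no++;
	}
	return sec_no;
}
```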

Thanks,
Tyler

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2016-11-29 17:30 UTC | newest]

Thread overview: 55+ messages
2016-11-21 22:35 [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2016-11-21 22:35 ` Tyler Baicar
2016-11-21 22:35 ` Tyler Baicar
2016-11-21 22:35 ` [PATCH V5 01/10] acpi: apei: read ack upon ghes record consumption Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
     [not found]   ` <1479767763-27532-2-git-send-email-tbaicar-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
2016-11-25 18:19     ` James Morse
2016-11-25 18:19       ` James Morse
2016-11-25 18:19       ` James Morse
2016-11-25 18:19       ` James Morse
2016-11-28 18:34       ` Baicar, Tyler
2016-11-21 22:35 ` [PATCH V5 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1 Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
2016-11-25 18:20   ` James Morse
2016-11-25 18:20     ` James Morse
2016-11-25 18:20     ` James Morse
2016-11-28 18:55     ` Baicar, Tyler
2016-11-28 18:55       ` Baicar, Tyler
2016-11-28 18:55       ` Baicar, Tyler
2016-11-28 18:55       ` Baicar, Tyler
2016-11-29 11:29   ` Shiju Jose
2016-11-29 17:30     ` Baicar, Tyler
2016-11-29 17:30       ` Baicar, Tyler
2016-11-29 11:29   ` Shiju Jose
2016-11-29 12:26   ` Shiju Jose
2016-11-29 12:26     ` Shiju Jose
2016-11-21 22:35 ` [PATCH V5 03/10] efi: parse ARMv8 processor error Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
2016-11-25 18:23   ` James Morse
2016-11-25 18:23     ` James Morse
2016-11-25 18:23     ` James Morse
2016-11-29 15:37     ` Baicar, Tyler
2016-11-29 15:37       ` Baicar, Tyler
2016-11-29 15:37       ` Baicar, Tyler
2016-11-29 15:37       ` Baicar, Tyler
2016-11-21 22:35 ` [PATCH V5 04/10] arm64: exception: handle Synchronous External Abort Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
2016-11-21 22:35 ` [PATCH V5 05/10] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
2016-11-21 22:35 ` [PATCH V5 06/10] acpi: apei: panic OS with fatal error status block Tyler Baicar
2016-11-21 22:35   ` Tyler Baicar
2016-11-21 22:36 ` [PATCH V5 07/10] efi: print unrecognized CPER section Tyler Baicar
2016-11-21 22:36   ` Tyler Baicar
2016-11-21 22:36 ` [PATCH V5 08/10] ras: acpi / apei: generate trace event for " Tyler Baicar
2016-11-21 22:36   ` Tyler Baicar
2016-11-21 22:36 ` [PATCH V5 09/10] trace, ras: add ARM processor error trace event Tyler Baicar
2016-11-21 22:36   ` Tyler Baicar
2016-11-21 22:36 ` [PATCH V5 10/10] arm/arm64: KVM: add guest SEA support Tyler Baicar
2016-11-21 22:36   ` Tyler Baicar
     [not found] ` <1479767763-27532-1-git-send-email-tbaicar-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
2016-11-22 11:11   ` [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 John Garry
2016-11-22 11:11     ` John Garry
2016-11-22 11:11     ` John Garry
2016-11-22 11:11     ` John Garry
2016-11-22 17:13     ` Baicar, Tyler
2016-11-22 17:13       ` Baicar, Tyler
2016-11-22 17:13       ` Baicar, Tyler
