linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
@ 2017-04-18 23:05 Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption Tyler Baicar
                   ` (10 more replies)
  0 siblings, 11 replies; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

When a memory error, CPU error, PCIe error, or other type of hardware error
that's covered by RAS occurs, firmware should populate the shared GHES memory
location with the proper GHES structures to notify the OS of the error.
For example, platforms that implement firmware first handling may implement
separate GHES sources for corrected errors and uncorrected errors. If the
error is an uncorrectable error, then the firmware will notify the OS
immediately since the error needs to be handled ASAP. The OS will then be able
to take the appropriate action needed such as offlining a page. If the error
is a corrected error, then the firmware will not interrupt the OS immediately.
Instead, the OS will see and report the error the next time it's GHES timer
expires. The kernel will first parse the GHES structures and report the errors
through the kernel logs and then notify the user space through RAS trace
events. This allows user space applications such as RAS Daemon to see the
errors and report them however the user desires. This patchset extends the
kernel functionality for RAS errors based on updates in the UEFI 2.6 and
ACPI 6.1 specifications.

An example flow from firmware to user space could be:

                 +---------------+
       +-------->|               |
       |         |  GHES polling |--+
+-------------+  |    source     |  |   +---------------+   +------------+
|             |  +---------------+  |   |  Kernel GHES  |   |            |
|  Firmware   |                     +-->|  CPER AER and |-->|  RAS trace |
|             |  +---------------+  |   |  EDAC drivers |   |   event    |
+-------------+  |               |  |   +---------------+   +------------+
       |         |  GHES sci     |--+
       +-------->|   source      |
                 +---------------+

Add support for Generic Hardware Error Source (GHES) v2, which introduces the
capability for the OS to acknowledge the consumption of the error record
generated by the Reliability, Availability and Serviceability (RAS) controller.
This eliminates potential race conditions between the OS and the RAS controller.

Add support for the timestamp field added to the Generic Error Data Entry v3,
allowing the OS to log the time that the error is generated by the firmware,
rather than the time the error is consumed. This improves the correctness of
event sequences when analyzing error logs. The timestamp is added in
ACPI 6.1, reference Table 18-343 Generic Error Data Entry.

Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
specification. ARMv8 specific processor error information is reported as part of
the CPER records.  This provides more detail on for processor error logs. This
can help describe ARMv8 cache, tlb, and bus errors.

Synchronous External Abort (SEA) represents a specific processor error condition
in ARM systems. A handler is added to recognize SEA errors, and a notifier is
added to parse and report the errors before the process is killed. Refer to
section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
specification.

Currently the kernel ignores CPER records that are unrecognized.
On the other hand, UEFI spec allows for non-standard (eg. vendor
proprietary) error section type in CPER (Common Platform Error Record),
as defined in section N2.3 of UEFI version 2.5. Therefore, user
is not able to see hardware error data of non-standard section.

If section Type field of Generic Error Data Entry is unrecognized,
prints out the raw data in dmesg buffer, and also adds a tracepoint
for reporting such hardware errors.

Currently even if an error status block's severity is fatal, the kernel
does not honor the severity level and panic. With the firmware first
model, the platform could inform the OS about a fatal hardware error
through the non-NMI GHES notification type. The OS should panic when a
hardware error record is received with this severity.

Add support to handle SEAs that occur while a KVM guest kernel is
running. Currently these are unsupported by the guest abort handling.

V15:Rebase on 4.11-rc7
    Use wrapper functions for [un]mapping kernel acknowledgement register
    Spacing and name changes to make code cleaner
    Break up timestamp print to be more readable
    Break generic error data v3 structure handling code into seperate patch
     and have timestamp handling in it's own patch
    Put ARM CPER handling into ifdef for ARM systems
    Add braces and missing space to KVM patch

V14:Make sure function prototypes are in the __ASSEMBLY__ block
    Change is_abort_synchronous to is_abort_sea
    Use phys_addr_t for SEA address
    Return after successful SEA handling in handle_guest_abort()

V13:Rebase on 4.11rc2
    Print decimal and hex sizes for unknown CPER section errors
    Use proper CONFIG_* when using IS_ENABLED
    Move handle_guest_sea call prior to SEI check
    Add a return value to handle_guest_sea
    Move RCU locking into ghes_notify_sea
    Add valid bit checks to ARM trace event
    Remove GPIO, SEI, and GSIV cases in GHES
    Add ARCH_HAVE_NMI_SAFE_CMPXCHG since we added NMI usage

V12:Remove double quotes from CPER code
    Add helper function to check all SEA cases in KVM patch
    Replace nmi_enter/exit with rcu_read_lock/unlock for KVM SEA
    Change HAVE_ACPI_APEI_SEA to ACPI_APEI_SEA in KVM SEA case

V11:Change print_hex_dump calls to include ASCII output
    Change HAVE_ACPI_APEI_SEA to ACPI_APEI_SEA and make it 'default y'
    Add unknown print back when printing unknown CPER section
    Make sure to use "%s"" in CPER code
    Spacing fix when checking if SEA is enabled

V10:Fix spacing of trace event enabled if statement

V9: Move SEA_FnV_MASK to ESR_ELx_FnV
    Move HAVE_NMI into alphabetical order
    Remove duplicate hardirq.h include
    Only call ghes_notify_sea if HAVE_ACPI_APEI_SEA
    Make ACPI_APEI_SEA depend on ACPI_APEI_GHES
    Use phys_addr_t for physical address variable
    Make ghes_sea_add() return void
    Add include guard to ghes.h
    Verify HAVE_RAS before calling ras trace events
    Call __ghes_print_estatus() before __ghes_call_panic()
    Add trace_*_event_enabled() checks for both new trace events

V8: Remove SEA notifier
    Add FAR not valid bit check when populating the SEA error address
    Move nmi_enter/exit() to architecture specific code
    Add synchronize_rcu() usage to SEA handling
    Make GHES_IOREMAP_PAGES always 2
    Update ghes_ioremap_pfn_nmi() to work like ghes_ioremap_pfn_irq()
    Remove the SEA print from handle_guest_sea()

V7: Update a couple prints for ARM processor errors
    Add Print notifying if overflow occurred for ARM processor errors
    Check for ARM configuration to allow the compiler to ignore ARM code
     on non-ARM systems
    Use SEA acronym instead of spelling it out
    Update fault_info prints to be more clear
    Add NMI locking to SEA notification
    Remove error info structure from ARM trace event since there can be
     a variable amount of these structures

V6: Change HEST_TYPE_GENERIC_V2 to IS_HEST_TYPE_GENERIC_V2 for readability
    Move APEI helper defines from cper.h to ghes.h
    Add data_len decrement back into print loop
    Change references to ARMv8 to just ARM
    Rewrite ARM processor context info parsing
    Check valid bit of ARM error info field before printing it
    Add include of linux/uuid.h in ghes.c

V5: Fix GHES goto logic for error conditions
    Change ghes_do_read_ack to ghes_ack_error
    Make sure data version check is >= 3
    Use CPER helper functions in print functions
    Make handle_guest_sea() dummy function static for arm
    Add arm to subject line for KVM patch

V4: Add bit offset left shift to read_ack_write value
    Make HEST generic and generic_v2 structures a union in the ghes structure
    Move gdata v3 helper functions into ghes.h to avoid duplication
    Reorder the timestamp print and avoid memcpy
    Add helper functions for gdata size checking
    Rename the SEA functions
    Add helper function for GHES panics
    Set fru_id to NULL UUID at variable declaration
    Limit ARM trace event parameters to the needed structures
    Reorder the ARM trace event variables to save space
    Add comment for why we don't pass SEAs to the guest when it aborts
    Move ARM trace event call into GHES driver instead of CPER

V3: Fix unmapped address to the read_ack_register in ghes.c
    Add helper function to get the proper payload based on generic data entry
     version
    Move timestamp print to avoid changing function calls in cper.c
    Remove patch "arm64: exception: handle instruction abort at current EL"
     since the el1_ia handler is already added in 4.8
    Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
    Add a new trace event for ARM type errors
    Add support to handle KVM guest SEAs

V2: Add PSCI state print for the ARMv8 error type.
    Separate timestamp year into year and century using BCD format.
    Rebase on top of ACPICA 20160318 release and remove header file changes
     in include/acpi/actbl1.h.
    Add panic OS with fatal error status block patch.
    Add processing of unrecognized CPER error section patches with updates
     from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646

V1: https://lkml.org/lkml/2016/2/5/544

Jonathan (Zhixiong) Zhang (1):
  acpi: apei: panic OS with fatal error status block

Tyler Baicar (10):
  acpi: apei: read ack upon ghes record consumption
  ras: acpi/apei: cper: add support for generic data v3 structure
  cper: add timestamp print to CPER status printing
  efi: parse ARM processor error
  arm64: exception: handle Synchronous External Abort
  acpi: apei: handle SEA notification type for ARMv8
  efi: print unrecognized CPER section
  ras: acpi / apei: generate trace event for unrecognized CPER section
  trace, ras: add ARM processor error trace event
  arm/arm64: KVM: add guest SEA support

 arch/arm/include/asm/kvm_arm.h       |  10 ++
 arch/arm/include/asm/system_misc.h   |   5 +
 arch/arm/kvm/mmu.c                   |  35 +++++-
 arch/arm64/Kconfig                   |   2 +
 arch/arm64/include/asm/esr.h         |   1 +
 arch/arm64/include/asm/kvm_arm.h     |  10 ++
 arch/arm64/include/asm/system_misc.h |   2 +
 arch/arm64/mm/fault.c                |  77 +++++++++++--
 drivers/acpi/apei/Kconfig            |  15 +++
 drivers/acpi/apei/ghes.c             | 184 ++++++++++++++++++++++++++----
 drivers/acpi/apei/hest.c             |   7 +-
 drivers/firmware/efi/cper.c          | 213 ++++++++++++++++++++++++++++++++---
 drivers/ras/ras.c                    |   2 +
 include/acpi/ghes.h                  |  34 +++++-
 include/linux/cper.h                 |  54 +++++++++
 include/ras/ras_event.h              |  90 +++++++++++++++
 16 files changed, 687 insertions(+), 54 deletions(-)

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-19 18:31   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure Tyler Baicar
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with OS
execution, and may generate error records for consumption by
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.

The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.

Add support for parsing of GHESv2 sub-tables as well.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
 drivers/acpi/apei/ghes.c | 55 +++++++++++++++++++++++++++++++++++++++++++++---
 drivers/acpi/apei/hest.c |  7 ++++--
 include/acpi/ghes.h      |  5 ++++-
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 79b3c9c..6d87ab7 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -46,6 +46,7 @@
 #include <linux/nmi.h>
 #include <linux/sched/clock.h>
 
+#include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
@@ -80,6 +81,10 @@
 	((struct acpi_hest_generic_status *)				\
 	 ((struct ghes_estatus_node *)(estatus_node) + 1))
 
+#define IS_HEST_TYPE_GENERIC_V2(ghes)				\
+	((struct acpi_hest_header *)ghes->generic)->type ==	\
+	 ACPI_HEST_TYPE_GENERIC_ERROR_V2
+
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -240,6 +245,17 @@ static int ghes_estatus_pool_expand(unsigned long len)
 	return 0;
 }
 
+static int map_gen_v2(struct ghes *ghes)
+{
+	return apei_map_generic_address(&ghes->generic_v2->read_ack_register);
+}
+
+static void unmap_gen_v2(struct ghes *ghes)
+{
+	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
+	return;
+}
+
 static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 {
 	struct ghes *ghes;
@@ -249,10 +265,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
 	if (!ghes)
 		return ERR_PTR(-ENOMEM);
+
 	ghes->generic = generic;
+	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = map_gen_v2(ghes);
+		if (rc)
+			goto err_free;
+	}
+
 	rc = apei_map_generic_address(&generic->error_status_address);
 	if (rc)
-		goto err_free;
+		goto err_unmap_read_ack_addr;
 	error_block_length = generic->error_block_length;
 	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
 		pr_warning(FW_WARN GHES_PFX
@@ -264,13 +287,16 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
 	if (!ghes->estatus) {
 		rc = -ENOMEM;
-		goto err_unmap;
+		goto err_unmap_status_addr;
 	}
 
 	return ghes;
 
-err_unmap:
+err_unmap_status_addr:
 	apei_unmap_generic_address(&generic->error_status_address);
+err_unmap_read_ack_addr:
+	if (IS_HEST_TYPE_GENERIC_V2(ghes))
+		unmap_gen_v2(ghes);
 err_free:
 	kfree(ghes);
 	return ERR_PTR(rc);
@@ -280,6 +306,8 @@ static void ghes_fini(struct ghes *ghes)
 {
 	kfree(ghes->estatus);
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
+	if (IS_HEST_TYPE_GENERIC_V2(ghes))
+		unmap_gen_v2(ghes);
 }
 
 static inline int ghes_severity(int severity)
@@ -649,6 +677,21 @@ static void ghes_estatus_cache_add(
 	rcu_read_unlock();
 }
 
+static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
+{
+	int rc;
+	u64 val = 0;
+
+	rc = apei_read(&val, &generic_v2->read_ack_register);
+	if (rc)
+		return rc;
+
+	val &= generic_v2->read_ack_preserve << generic_v2->read_ack_register.bit_offset;
+	val |= generic_v2->read_ack_write << generic_v2->read_ack_register.bit_offset;
+
+	return apei_write(val, &generic_v2->read_ack_register);
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -661,6 +704,12 @@ static int ghes_proc(struct ghes *ghes)
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
 	}
 	ghes_do_proc(ghes, ghes->estatus);
+
+	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
+		rc = ghes_ack_error(ghes->generic_v2);
+		if (rc)
+			return rc;
+	}
 out:
 	ghes_clear_estatus(ghes);
 	return rc;
diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
index 8f2a98e..456b488 100644
--- a/drivers/acpi/apei/hest.c
+++ b/drivers/acpi/apei/hest.c
@@ -52,6 +52,7 @@
 	[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
 	[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
 	[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
+	[ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
 };
 
 static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
@@ -141,7 +142,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
 {
 	int *count = data;
 
-	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
+	    hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		(*count)++;
 	return 0;
 }
@@ -152,7 +154,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
 	struct ghes_arr *ghes_arr = data;
 	int rc, i;
 
-	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
+	if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
+	    hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
 		return 0;
 
 	if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 720446c..68f088a 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -13,7 +13,10 @@
 #define GHES_EXITING		0x0002
 
 struct ghes {
-	struct acpi_hest_generic *generic;
+	union {
+		struct acpi_hest_generic *generic;
+		struct acpi_hest_generic_v2 *generic_v2;
+	};
 	struct acpi_hest_generic_status *estatus;
 	u64 buffer_paddr;
 	unsigned long flags;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-20 11:14   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 03/11] cper: add timestamp print to CPER status printing Tyler Baicar
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

The ACPI 6.1 spec adds a new version of the generic data structure.
Add support to handle the new structure as well as properly verify
and iterate through the generic data entries.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/acpi/apei/ghes.c    |  6 +++---
 drivers/firmware/efi/cper.c | 37 ++++++++++++++++++++++---------------
 include/acpi/ghes.h         | 22 ++++++++++++++++++++++
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 6d87ab7..dfb7dd2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -429,7 +429,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
 	int flags = -1;
 	int sec_sev = ghes_severity(gdata->error_severity);
 	struct cper_sec_mem_err *mem_err;
-	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
+	mem_err = acpi_hest_get_payload(gdata);
 
 	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
 		return;
@@ -466,7 +466,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
 				 CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
-			mem_err = (struct cper_sec_mem_err *)(gdata+1);
+			mem_err = acpi_hest_get_payload(gdata);
 			ghes_edac_report_mem_error(ghes, sev, mem_err);
 
 			arch_apei_report_mem_error(sev, mem_err);
@@ -476,7 +476,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
 				      CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
-			pcie_err = (struct cper_sec_pcie *)(gdata+1);
+			pcie_err = acpi_hest_get_payload(gdata);
 			if (sev == GHES_SEV_RECOVERABLE &&
 			    sec_sev == GHES_SEV_RECOVERABLE &&
 			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index d425374..8328a6f 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -32,6 +32,7 @@
 #include <linux/acpi.h>
 #include <linux/pci.h>
 #include <linux/aer.h>
+#include <acpi/ghes.h>
 
 #define INDENT_SP	" "
 
@@ -386,8 +387,9 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 }
 
-static void cper_estatus_print_section(
-	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
+static void
+cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata,
+			   int sec_no)
 {
 	uuid_le *sec_type = (uuid_le *)gdata->section_type;
 	__u16 severity;
@@ -403,14 +405,18 @@ static void cper_estatus_print_section(
 
 	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
 	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
-		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
+		struct cper_sec_proc_generic *proc_err;
+
+		proc_err = acpi_hest_get_payload(gdata);
 		printk("%s""section_type: general processor error\n", newpfx);
 		if (gdata->error_data_length >= sizeof(*proc_err))
 			cper_print_proc_generic(newpfx, proc_err);
 		else
 			goto err_section_too_small;
 	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
-		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
+		struct cper_sec_mem_err *mem_err;
+
+		mem_err = acpi_hest_get_payload(gdata);
 		printk("%s""section_type: memory error\n", newpfx);
 		if (gdata->error_data_length >=
 		    sizeof(struct cper_sec_mem_err_old))
@@ -419,7 +425,9 @@ static void cper_estatus_print_section(
 		else
 			goto err_section_too_small;
 	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
-		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
+		struct cper_sec_pcie *pcie;
+
+		pcie = acpi_hest_get_payload(gdata);
 		printk("%s""section_type: PCIe error\n", newpfx);
 		if (gdata->error_data_length >= sizeof(*pcie))
 			cper_print_pcie(newpfx, pcie, gdata);
@@ -438,7 +446,7 @@ void cper_estatus_print(const char *pfx,
 			const struct acpi_hest_generic_status *estatus)
 {
 	struct acpi_hest_generic_data *gdata;
-	unsigned int data_len, gedata_len;
+	unsigned int data_len;
 	int sec_no = 0;
 	char newpfx[64];
 	__u16 severity;
@@ -452,11 +460,10 @@ void cper_estatus_print(const char *pfx,
 	data_len = estatus->data_length;
 	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
 	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
-	while (data_len >= sizeof(*gdata)) {
-		gedata_len = gdata->error_data_length;
+	while (data_len >= acpi_hest_get_size(gdata)) {
 		cper_estatus_print_section(newpfx, gdata, sec_no);
-		data_len -= gedata_len + sizeof(*gdata);
-		gdata = (void *)(gdata + 1) + gedata_len;
+		data_len -= acpi_hest_get_record_size(gdata);
+		gdata = acpi_hest_get_next(gdata);
 		sec_no++;
 	}
 }
@@ -486,12 +493,12 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
 		return rc;
 	data_len = estatus->data_length;
 	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
-	while (data_len >= sizeof(*gdata)) {
-		gedata_len = gdata->error_data_length;
-		if (gedata_len > data_len - sizeof(*gdata))
+	while (data_len >= acpi_hest_get_size(gdata)) {
+		gedata_len = acpi_hest_get_error_length(gdata);
+		if (gedata_len > data_len - acpi_hest_get_size(gdata))
 			return -EINVAL;
-		data_len -= gedata_len + sizeof(*gdata);
-		gdata = (void *)(gdata + 1) + gedata_len;
+		data_len -= acpi_hest_get_record_size(gdata);
+		gdata = acpi_hest_get_next(gdata);
 	}
 	if (data_len)
 		return -EINVAL;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 68f088a..b89361a 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -12,6 +12,20 @@
 #define GHES_TO_CLEAR		0x0001
 #define GHES_EXITING		0x0002
 
+#define acpi_hest_get_error_length(gdata)	\
+	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
+#define acpi_hest_get_size(gdata)		\
+	((acpi_hest_get_version(gdata) >= 3) ?	\
+	sizeof(struct acpi_hest_generic_data_v300) :	\
+	sizeof(struct acpi_hest_generic_data))
+#define acpi_hest_get_record_size(gdata)	\
+	(acpi_hest_get_size(gdata) +		\
+	acpi_hest_get_error_length(gdata))
+#define acpi_hest_get_next(gdata)		\
+	((void *)(gdata) + acpi_hest_get_record_size(gdata))
+#define acpi_hest_get_version(gdata)			\
+	(gdata->revision >> 8)
+
 struct ghes {
 	union {
 		struct acpi_hest_generic *generic;
@@ -73,3 +87,11 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
 {
 }
 #endif
+
+static inline void *acpi_hest_get_payload(struct acpi_hest_generic_data *gdata)
+{
+	if (acpi_hest_get_version(gdata) >= 3)
+		return (void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1);
+
+	return gdata + 1;
+}
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-21 12:21   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 04/11] efi: parse ARM processor error Tyler Baicar
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

The ACPI 6.1 spec added a timestamp to the HEST generic data
structure. Print the timestamp out when printing out the error
status information.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/firmware/efi/cper.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 8328a6f..46585f9 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -32,6 +32,8 @@
 #include <linux/acpi.h>
 #include <linux/pci.h>
 #include <linux/aer.h>
+#include <linux/printk.h>
+#include <linux/bcd.h>
 #include <acpi/ghes.h>
 
 #define INDENT_SP	" "
@@ -387,6 +389,29 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 }
 
+static void cper_estatus_timestamp(const char *pfx,
+				   struct acpi_hest_generic_data_v300 *gdata)
+{
+	__u8 hour, min, sec, day, mon, year, century, *timestamp;
+
+	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
+		timestamp = (__u8 *)&(gdata->time_stamp);
+		sec       = bcd2bin(timestamp[0]);
+		min       = bcd2bin(timestamp[1]);
+		hour      = bcd2bin(timestamp[2]);
+		day       = bcd2bin(timestamp[4]);
+		mon       = bcd2bin(timestamp[5]);
+		year      = bcd2bin(timestamp[6]);
+		century   = bcd2bin(timestamp[7]);
+
+		if (*(timestamp + 3) & 0x1)
+			printk("%stimestamp is precise\n", pfx);
+
+		printk("%stime: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
+			century, year, mon, day, hour, min, sec);
+	}
+}
+
 static void
 cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata,
 			   int sec_no)
@@ -395,6 +420,9 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	__u16 severity;
 	char newpfx[64];
 
+	if (acpi_hest_get_version(gdata) >= 3)
+		cper_estatus_timestamp(pfx, (struct acpi_hest_generic_data_v300 *)gdata);
+
 	severity = gdata->error_severity;
 	printk("%s""Error %d, type: %s\n", pfx, sec_no,
 	       cper_severity_str(severity));
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (2 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 03/11] cper: add timestamp print to CPER status printing Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-21 17:55   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 05/11] arm64: exception: handle Synchronous External Abort Tyler Baicar
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

Add support for ARM Common Platform Error Record (CPER).
UEFI 2.6 specification adds support for ARM specific
processor error information to be reported as part of the
CPER records. This provides more detail on for processor error logs.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/firmware/efi/cper.c | 135 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cper.h        |  54 ++++++++++++++++++
 2 files changed, 189 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 46585f9..f959185 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -110,12 +110,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
 static const char * const proc_type_strs[] = {
 	"IA32/X64",
 	"IA64",
+	"ARM",
 };
 
 static const char * const proc_isa_strs[] = {
 	"IA32",
 	"IA64",
 	"X64",
+	"ARM A32/T32",
+	"ARM A64",
 };
 
 static const char * const proc_error_type_strs[] = {
@@ -184,6 +187,128 @@ static void cper_print_proc_generic(const char *pfx,
 		printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
 }
 
+#if defined(CONFIG_ARM64) || defined(CONFIG_ARM)
+static const char * const arm_reg_ctx_strs[] = {
+	"AArch32 general purpose registers",
+	"AArch32 EL1 context registers",
+	"AArch32 EL2 context registers",
+	"AArch32 secure context registers",
+	"AArch64 general purpose registers",
+	"AArch64 EL1 context registers",
+	"AArch64 EL2 context registers",
+	"AArch64 EL3 context registers",
+	"Misc. system register structure",
+};
+
+static void cper_print_proc_arm(const char *pfx,
+				const struct cper_sec_proc_arm *proc)
+{
+	int i, len, max_ctx_type;
+	struct cper_arm_err_info *err_info;
+	struct cper_arm_ctx_info *ctx_info;
+	char newpfx[64];
+
+	printk("%ssection length: %d\n", pfx, proc->section_length);
+	printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
+
+	len = proc->section_length - (sizeof(*proc) +
+		proc->err_info_num * (sizeof(*err_info)));
+	if (len < 0) {
+		printk("%ssection length is too small\n", pfx);
+		printk("%sfirmware-generated error record is incorrect\n", pfx);
+		printk("%sERR_INFO_NUM is %d\n", pfx, proc->err_info_num);
+		return;
+	}
+
+	if (proc->validation_bits & CPER_ARM_VALID_MPIDR)
+		printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);
+	if (proc->validation_bits & CPER_ARM_VALID_AFFINITY_LEVEL)
+		printk("%serror affinity level: %d\n", pfx,
+			proc->affinity_level);
+	if (proc->validation_bits & CPER_ARM_VALID_RUNNING_STATE) {
+		printk("%srunning state: 0x%x\n", pfx, proc->running_state);
+		printk("%sPSCI state: %d\n", pfx, proc->psci_state);
+	}
+
+	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
+
+	err_info = (struct cper_arm_err_info *)(proc + 1);
+	for (i = 0; i < proc->err_info_num; i++) {
+		printk("%sError info structure %d:\n", pfx, i);
+		printk("%sversion:%d\n", newpfx, err_info->version);
+		printk("%slength:%d\n", newpfx, err_info->length);
+		if (err_info->validation_bits &
+		    CPER_ARM_INFO_VALID_MULTI_ERR) {
+			if (err_info->multiple_error == 0)
+				printk("%ssingle error\n", newpfx);
+			else if (err_info->multiple_error == 1)
+				printk("%smultiple errors\n", newpfx);
+			else
+				printk("%smultiple errors count:%u\n",
+				newpfx, err_info->multiple_error);
+		}
+		if (err_info->validation_bits & CPER_ARM_INFO_VALID_FLAGS) {
+			if (err_info->flags & CPER_ARM_INFO_FLAGS_FIRST)
+				printk("%sfirst error captured\n", newpfx);
+			if (err_info->flags & CPER_ARM_INFO_FLAGS_LAST)
+				printk("%slast error captured\n", newpfx);
+			if (err_info->flags & CPER_ARM_INFO_FLAGS_PROPAGATED)
+				printk("%spropagated error captured\n",
+				       newpfx);
+			if (err_info->flags & CPER_ARM_INFO_FLAGS_OVERFLOW)
+				printk("%soverflow occurred, error info is incomplete\n",
+				       newpfx);
+		}
+		printk("%serror_type: %d, %s\n", newpfx, err_info->type,
+			err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
+			proc_error_type_strs[err_info->type] : "unknown");
+		if (err_info->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+			printk("%serror_info: 0x%016llx\n", newpfx,
+			       err_info->error_info);
+		if (err_info->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+			printk("%svirtual fault address: 0x%016llx\n",
+				newpfx, err_info->virt_fault_addr);
+		if (err_info->validation_bits &
+		    CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+			printk("%sphysical fault address: 0x%016llx\n",
+				newpfx, err_info->physical_fault_addr);
+		err_info += 1;
+	}
+	ctx_info = (struct cper_arm_ctx_info *)err_info;
+	max_ctx_type = ARRAY_SIZE(arm_reg_ctx_strs) - 1;
+	for (i = 0; i < proc->context_info_num; i++) {
+		int size = sizeof(*ctx_info) + ctx_info->size;
+
+		printk("%sContext info structure %d:\n", pfx, i);
+		if (len < size) {
+			printk("%ssection length is too small\n", newpfx);
+			printk("%sfirmware-generated error record is incorrect\n", pfx);
+			return;
+		}
+		if (ctx_info->type > max_ctx_type) {
+			printk("%sInvalid context type: %d\n", newpfx,
+							ctx_info->type);
+			printk("%sMax context type: %d\n", newpfx,
+							max_ctx_type);
+			return;
+		}
+		printk("%sregister context type %d: %s\n", newpfx,
+			ctx_info->type, arm_reg_ctx_strs[ctx_info->type]);
+		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+				(ctx_info + 1), ctx_info->size, 0);
+		len -= size;
+		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + size);
+	}
+
+	if (len > 0) {
+		printk("%sVendor specific error info has %u bytes:\n", pfx,
+		       len);
+		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4, ctx_info,
+				len, true);
+	}
+}
+#endif
+
 static const char * const mem_err_type_strs[] = {
 	"unknown",
 	"no error",
@@ -461,6 +586,16 @@ static void cper_estatus_timestamp(const char *pfx,
 			cper_print_pcie(newpfx, pcie, gdata);
 		else
 			goto err_section_too_small;
+	} else if ((IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)) &&
+		   !uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARM)) {
+		struct cper_sec_proc_arm *arm_err;
+
+		arm_err = acpi_hest_get_payload(gdata);
+		printk("%ssection_type: ARM processor error\n", newpfx);
+		if (gdata->error_data_length >= sizeof(*arm_err))
+			cper_print_proc_arm(newpfx, arm_err);
+		else
+			goto err_section_too_small;
 	} else
 		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
 
diff --git a/include/linux/cper.h b/include/linux/cper.h
index dcacb1a..85450f3 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -180,6 +180,10 @@ enum {
 #define CPER_SEC_PROC_IPF						\
 	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
 		0x80, 0xC7, 0x3C, 0x88, 0x81)
+/* Processor Specific: ARM */
+#define CPER_SEC_PROC_ARM						\
+	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
+		0x1D, 0x5D, 0x46, 0xB0)
 /* Platform Memory */
 #define CPER_SEC_PLATFORM_MEM						\
 	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
@@ -255,6 +259,22 @@ enum {
 
 #define CPER_PCIE_SLOT_SHIFT			3
 
+#define CPER_ARM_VALID_MPIDR			0x00000001
+#define CPER_ARM_VALID_AFFINITY_LEVEL		0x00000002
+#define CPER_ARM_VALID_RUNNING_STATE		0x00000004
+#define CPER_ARM_VALID_VENDOR_INFO		0x00000008
+
+#define CPER_ARM_INFO_VALID_MULTI_ERR		0x0001
+#define CPER_ARM_INFO_VALID_FLAGS		0x0002
+#define CPER_ARM_INFO_VALID_ERR_INFO		0x0004
+#define CPER_ARM_INFO_VALID_VIRT_ADDR		0x0008
+#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR	0x0010
+
+#define CPER_ARM_INFO_FLAGS_FIRST		0x0001
+#define CPER_ARM_INFO_FLAGS_LAST		0x0002
+#define CPER_ARM_INFO_FLAGS_PROPAGATED		0x0004
+#define CPER_ARM_INFO_FLAGS_OVERFLOW		0x0008
+
 /*
  * All tables and structs must be byte-packed to match CPER
  * specification, since the tables are provided by the system BIOS
@@ -340,6 +360,40 @@ struct cper_ia_proc_ctx {
 	__u64	mm_reg_addr;
 };
 
+/* ARM Processor Error Section */
+struct cper_sec_proc_arm {
+	__u32	validation_bits;
+	__u16	err_info_num; /* Number of Processor Error Info */
+	__u16	context_info_num; /* Number of Processor Context Info Records*/
+	__u32	section_length;
+	__u8	affinity_level;
+	__u8	reserved[3];	/* must be zero */
+	__u64	mpidr;
+	__u64	midr;
+	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
+	__u32	psci_state;
+};
+
+/* ARM Processor Error Information Structure */
+struct cper_arm_err_info {
+	__u8	version;
+	__u8	length;
+	__u16	validation_bits;
+	__u8	type;
+	__u16	multiple_error;
+	__u8	flags;
+	__u64	error_info;
+	__u64	virt_fault_addr;
+	__u64	physical_fault_addr;
+};
+
+/* ARM Processor Context Information Structure */
+struct cper_arm_ctx_info {
+	__u16	version;
+	__u16	type;
+	__u32	size;
+};
+
 /* Old Memory Error Section UEFI 2.1, 2.2 */
 struct cper_sec_mem_err_old {
 	__u64	validation_bits;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 05/11] arm64: exception: handle Synchronous External Abort
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (3 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 04/11] efi: parse ARM processor error Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

SEA exceptions are often caused by an uncorrected hardware
error, and are handled when data abort and instruction abort
exception classes have specific values for their Fault Status
Code.
When SEA occurs, before killing the process, report the error
in the kernel logs.
Update fault_info[] with specific SEA faults so that the
new SEA handler is used.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/include/asm/esr.h |  1 +
 arch/arm64/mm/fault.c        | 45 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index d14c478..f20c64a 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -83,6 +83,7 @@
 #define ESR_ELx_WNR		(UL(1) << 6)
 
 /* Shared ISS field definitions for Data/Instruction aborts */
+#define ESR_ELx_FnV		(UL(1) << 10)
 #define ESR_ELx_EA		(UL(1) << 9)
 #define ESR_ELx_S1PTW		(UL(1) << 7)
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 1b35b8bd..b74d8b7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -505,6 +505,31 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * This abort handler deals with Synchronous External Abort.
+ * It calls notifiers, and then returns "fault".
+ */
+static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	struct siginfo info;
+	const struct fault_info *inf;
+
+	inf = esr_to_fault_info(esr);
+	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+		inf->name, esr, addr);
+
+	info.si_signo = SIGBUS;
+	info.si_errno = 0;
+	info.si_code  = 0;
+	if (esr & ESR_ELx_FnV)
+		info.si_addr = 0;
+	else
+		info.si_addr  = (void __user *)addr;
+	arm64_notify_die("", regs, &info, esr);
+
+	return 0;
+}
+
 static const struct fault_info fault_info[] = {
 	{ do_bad,		SIGBUS,  0,		"ttbr address size fault"	},
 	{ do_bad,		SIGBUS,  0,		"level 1 address size fault"	},
@@ -522,22 +547,22 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
+	{ do_sea,		SIGBUS,  0,		"synchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 17"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error"	},
+	{ do_sea,		SIGBUS,  0,		"level 0 (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 1 (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 2 (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 3 (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"synchronous parity or ECC error" },
 	{ do_bad,		SIGBUS,  0,		"unknown 25"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 26"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 27"			},
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
-	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
+	{ do_sea,		SIGBUS,  0,		"level 0 synchronous parity error (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 1 synchronous parity error (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 2 synchronous parity error (translation table walk)"	},
+	{ do_sea,		SIGBUS,  0,		"level 3 synchronous parity error (translation table walk)"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 32"			},
 	{ do_alignment_fault,	SIGBUS,  BUS_ADRALN,	"alignment fault"		},
 	{ do_bad,		SIGBUS,  0,		"unknown 34"			},
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (4 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 05/11] arm64: exception: handle Synchronous External Abort Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-25 17:21   ` Borislav Petkov
                     ` (2 more replies)
  2017-04-18 23:05 ` [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block Tyler Baicar
                   ` (4 subsequent siblings)
  10 siblings, 3 replies; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

ARM APEI extension proposal added SEA (Synchronous External Abort)
notification type for ARMv8.
Add a new GHES error source handling function for SEA. If an error
source's notification type is SEA, then this function can be registered
into the SEA exception handler. That way GHES will parse and report
SEA exceptions when they occur.
An SEA can interrupt code that had interrupts masked and is treated as
an NMI. To aid this the page of address space for mapping APEI buffers
while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
changed to use the helper methods to find the prot_t to map with in
the same way as ghes_ioremap_pfn_irq().

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/Kconfig        |  2 ++
 arch/arm64/mm/fault.c     | 13 +++++++++
 drivers/acpi/apei/Kconfig | 15 ++++++++++
 drivers/acpi/apei/ghes.c  | 70 +++++++++++++++++++++++++++++++++++++++++++----
 include/acpi/ghes.h       |  7 +++++
 5 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859..36226c2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -18,6 +18,7 @@ config ARM64
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_NUMA_BALANCING
@@ -92,6 +93,7 @@ config ARM64
 	select HAVE_IRQ_TIME_ACCOUNTING
 	select HAVE_MEMBLOCK
 	select HAVE_MEMBLOCK_NODE_MAP if NUMA
+	select HAVE_NMI if ACPI_APEI_SEA
 	select HAVE_PATA_PLATFORM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b74d8b7..10013ff 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -42,6 +42,8 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
+#include <acpi/ghes.h>
+
 struct fault_info {
 	int	(*fn)(unsigned long addr, unsigned int esr,
 		      struct pt_regs *regs);
@@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
 		inf->name, esr, addr);
 
+	/*
+	 * Synchronous aborts may interrupt code which had interrupts masked.
+	 * Before calling out into the wider kernel tell the interested
+	 * subsystems.
+	 */
+	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+		nmi_enter();
+		ghes_notify_sea();
+		nmi_exit();
+	}
+
 	info.si_signo = SIGBUS;
 	info.si_errno = 0;
 	info.si_code  = 0;
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index b0140c8..de14d49 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -39,6 +39,21 @@ config ACPI_APEI_PCIEAER
 	  PCIe AER errors may be reported via APEI firmware first mode.
 	  Turn on this option to enable the corresponding support.
 
+config ACPI_APEI_SEA
+	bool "APEI Synchronous External Abort logging/recovering support"
+	depends on ARM64 && ACPI_APEI_GHES
+	default y
+	help
+	  This option should be enabled if the system supports
+	  firmware first handling of SEA (Synchronous External Abort).
+	  SEA happens with certain faults of data abort or instruction
+	  abort synchronous exceptions on ARMv8 systems. If a system
+	  supports firmware first handling of SEA, the platform analyzes
+	  and handles hardware error notifications from SEA, and it may then
+	  form a HW error record for the OS to parse and handle. This
+	  option allows the OS to look for such hardware error record, and
+	  take appropriate action.
+
 config ACPI_APEI_MEMORY_FAILURE
 	bool "APEI memory error recovering support"
 	depends on ACPI_APEI && MEMORY_FAILURE
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index dfb7dd2..2d387f8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -115,11 +115,7 @@
  * Two virtual pages are used, one for IRQ/PROCESS context, the other for
  * NMI context (optionally).
  */
-#ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #define GHES_IOREMAP_PAGES           2
-#else
-#define GHES_IOREMAP_PAGES           1
-#endif
 #define GHES_IOREMAP_IRQ_PAGE(base)	(base)
 #define GHES_IOREMAP_NMI_PAGE(base)	((base) + PAGE_SIZE)
 
@@ -158,10 +154,14 @@ static void ghes_ioremap_exit(void)
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
 	unsigned long vaddr;
+	phys_addr_t paddr;
+	pgprot_t prot;
 
 	vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-	ioremap_page_range(vaddr, vaddr + PAGE_SIZE,
-			   pfn << PAGE_SHIFT, PAGE_KERNEL);
+
+	paddr = pfn << PAGE_SHIFT;
+	prot = arch_apei_get_mem_attribute(paddr);
+	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
 
 	return (void __iomem *)vaddr;
 }
@@ -771,6 +771,50 @@ static int ghes_notify_sci(struct notifier_block *this,
 	.notifier_call = ghes_notify_sci,
 };
 
+#ifdef CONFIG_ACPI_APEI_SEA
+static LIST_HEAD(ghes_sea);
+
+void ghes_notify_sea(void)
+{
+	struct ghes *ghes;
+
+	/*
+	 * synchronize_rcu() will wait for nmi_exit(), so no need to
+	 * rcu_read_lock().
+	 */
+	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
+		ghes_proc(ghes);
+	}
+}
+
+static void ghes_sea_add(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_add_rcu(&ghes->list, &ghes_sea);
+	mutex_unlock(&ghes_list_mutex);
+}
+
+static void ghes_sea_remove(struct ghes *ghes)
+{
+	mutex_lock(&ghes_list_mutex);
+	list_del_rcu(&ghes->list);
+	mutex_unlock(&ghes_list_mutex);
+	synchronize_rcu();
+}
+#else /* CONFIG_ACPI_APEI_SEA */
+static inline void ghes_sea_add(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+}
+
+static inline void ghes_sea_remove(struct ghes *ghes)
+{
+	pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
+	       ghes->generic->header.source_id);
+}
+#endif /* CONFIG_ACPI_APEI_SEA */
+
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 /*
  * printk is not safe in NMI context.  So in NMI handler, we allocate
@@ -1016,6 +1060,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
 	case ACPI_HEST_NOTIFY_EXTERNAL:
 	case ACPI_HEST_NOTIFY_SCI:
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
+				generic->header.source_id);
+			rc = -ENOTSUPP;
+			goto err;
+		}
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
 			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1081,6 +1133,9 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		list_add_rcu(&ghes->list, &ghes_sci);
 		mutex_unlock(&ghes_list_mutex);
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		ghes_sea_add(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_add(ghes);
 		break;
@@ -1124,6 +1179,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
 		mutex_unlock(&ghes_list_mutex);
 		synchronize_rcu();
 		break;
+	case ACPI_HEST_NOTIFY_SEA:
+		ghes_sea_remove(ghes);
+		break;
 	case ACPI_HEST_NOTIFY_NMI:
 		ghes_nmi_remove(ghes);
 		break;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index b89361a..ef0040893 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -1,3 +1,6 @@
+#ifndef GHES_H
+#define GHES_H
+
 #include <acpi/apei.h>
 #include <acpi/hed.h>
 
@@ -95,3 +98,7 @@ static inline void *acpi_hest_get_payload(struct acpi_hest_generic_data *gdata)
 
 	return gdata + 1;
 }
+
+void ghes_notify_sea(void);
+
+#endif /* GHES_H */
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (5 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-04-28 13:07   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 08/11] efi: print unrecognized CPER section Tyler Baicar
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>

Even if an error status block's severity is fatal, the kernel does not
honor the severity level and panic.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through the non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

Call panic() after CPER data in error status block is printed if
severity is fatal, before each error section is handled.

Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
 drivers/acpi/apei/ghes.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 2d387f8..b91123f 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -134,6 +134,8 @@
 static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+static int ghes_panic_timeout __read_mostly = 30;
+
 static int ghes_ioremap_init(void)
 {
 	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -692,6 +694,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
 	return apei_write(val, &generic_v2->read_ack_register);
 }
 
+static void __ghes_call_panic(void)
+{
+	if (panic_timeout == 0)
+		panic_timeout = ghes_panic_timeout;
+	panic("Fatal hardware error!");
+}
+
 static int ghes_proc(struct ghes *ghes)
 {
 	int rc;
@@ -699,6 +708,10 @@ static int ghes_proc(struct ghes *ghes)
 	rc = ghes_read_estatus(ghes, 0);
 	if (rc)
 		goto out;
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
+		__ghes_call_panic();
+	}
 	if (!ghes_estatus_cached(ghes->estatus)) {
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
 			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
@@ -835,8 +848,6 @@ static inline void ghes_sea_remove(struct ghes *ghes)
 
 static LIST_HEAD(ghes_nmi);
 
-static int ghes_panic_timeout	__read_mostly = 30;
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
@@ -929,9 +940,7 @@ static void __ghes_panic(struct ghes *ghes)
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
 
 	/* reboot to log the error! */
-	if (panic_timeout == 0)
-		panic_timeout = ghes_panic_timeout;
-	panic("Fatal hardware error!");
+	__ghes_call_panic();
 }
 
 static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 08/11] efi: print unrecognized CPER section
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (6 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-05-05 13:27   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 09/11] ras: acpi / apei: generate trace event for " Tyler Baicar
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

UEFI spec allows for non-standard section in Common Platform Error
Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently if the CPER section's type (UUID) does not match with
one of the section types that the kernel knows how to parse, the
section is skipped. Therefore, user is not able to see
such CPER data, for instance, error record of non-standard section.

For above mentioned case, this change prints out the raw data in
hex in dmesg buffer. Data length is taken from Error Data length
field of Generic Error Data Entry.

The following is a sample output from dmesg:
[  140.739180] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  140.739182] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[  140.739191] {1}[Hardware Error]: event severity: corrected
[  140.739196] {1}[Hardware Error]:  time: precise 2017-03-15 20:37:35
[  140.739197] {1}[Hardware Error]:  Error 0, type: corrected
[  140.739203] {1}[Hardware Error]:   section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[  140.739205] {1}[Hardware Error]:   section length: 568 (0x238)
[  140.739210] {1}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
[  140.739214] {1}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
[  140.739217] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
[  140.739220] {1}[Hardware Error]:   00000030: 00000000 00000000 01010000 01010000  ................
[  140.739223] {1}[Hardware Error]:   00000040: 00000000 00000000 00000005 00000000  ................
[  140.739226] {1}[Hardware Error]:   00000050: 01010000 00000000 00000001 00dddd00  ................
...

The raw data from the error can then be decoded using vendor
specific tools.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
 drivers/firmware/efi/cper.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index f959185..610d31a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -596,8 +596,16 @@ static void cper_estatus_timestamp(const char *pfx,
 			cper_print_proc_arm(newpfx, arm_err);
 		else
 			goto err_section_too_small;
-	} else
-		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
+	} else {
+		const void *unknown_err;
+
+		unknown_err = acpi_hest_get_payload(gdata);
+		printk("%ssection type: unknown, %pUl\n", newpfx, sec_type);
+		printk("%ssection length: %d (%#x)\n", newpfx,
+		       gdata->error_data_length, gdata->error_data_length);
+		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+			       unknown_err, gdata->error_data_length, true);
+	}
 
 	return;
 
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 09/11] ras: acpi / apei: generate trace event for unrecognized CPER section
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (7 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 08/11] efi: print unrecognized CPER section Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-05-05 17:53   ` Borislav Petkov
  2017-05-05 18:44   ` Steven Rostedt
  2017-04-18 23:05 ` [PATCH V15 10/11] trace, ras: add ARM processor error trace event Tyler Baicar
  2017-04-18 23:05 ` [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support Tyler Baicar
  10 siblings, 2 replies; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

UEFI spec allows for non-standard section in Common Platform Error
Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently if the CPER section's type (UUID) does not match with
any section type that the kernel knows how to parse, trace event
is not generated for such section. And thus user is not able to know
happening of such hardware error, including error record of
non-standard section.

This commit generates a trace event which contains raw error data
for unrecognized CPER section.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Tested-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 27 +++++++++++++++++++++++----
 drivers/ras/ras.c        |  1 +
 include/ras/ras_event.h  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b91123f..3d9f63b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,11 +45,13 @@
 #include <linux/aer.h>
 #include <linux/nmi.h>
 #include <linux/sched/clock.h>
+#include <linux/uuid.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
 #include <asm/tlbflush.h>
+#include <ras/ras_event.h>
 
 #include "apei-internal.h"
 
@@ -461,12 +463,21 @@ static void ghes_do_proc(struct ghes *ghes,
 {
 	int sev, sec_sev;
 	struct acpi_hest_generic_data *gdata;
+	uuid_le sec_type;
+	uuid_le *fru_id = &NULL_UUID_LE;
+	char *fru_text = "";
 
 	sev = ghes_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
 		sec_sev = ghes_severity(gdata->error_severity);
-		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
-				 CPER_SEC_PLATFORM_MEM)) {
+		sec_type = *(uuid_le *)gdata->section_type;
+
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+			fru_id = (uuid_le *)gdata->fru_id;
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+			fru_text = gdata->fru_text;
+
+		if (!uuid_le_cmp(sec_type, CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err;
 			mem_err = acpi_hest_get_payload(gdata);
 			ghes_edac_report_mem_error(ghes, sev, mem_err);
@@ -475,8 +486,7 @@ static void ghes_do_proc(struct ghes *ghes,
 			ghes_handle_memory_failure(gdata, sev);
 		}
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
-				      CPER_SEC_PCIE)) {
+		else if (!uuid_le_cmp(sec_type, CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
 			pcie_err = acpi_hest_get_payload(gdata);
 			if (sev == GHES_SEV_RECOVERABLE &&
@@ -507,6 +517,15 @@ static void ghes_do_proc(struct ghes *ghes,
 
 		}
 #endif
+#ifdef CONFIG_RAS
+		else if (trace_unknown_sec_event_enabled()) {
+			void *unknown_err = acpi_hest_get_payload(gdata);
+
+			trace_unknown_sec_event(&sec_type,
+					fru_id, fru_text, sec_sev,
+					unknown_err, gdata->error_data_length);
+		}
+#endif
 	}
 }
 
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index b67dd36..fb2500b 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -27,3 +27,4 @@ static int __init ras_init(void)
 EXPORT_TRACEPOINT_SYMBOL_GPL(extlog_mem_event);
 #endif
 EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 1791a12..5861b6f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,51 @@
 );
 
 /*
+ * Unknown Section Report
+ *
+ * This event is generated when hardware detected a hardware
+ * error event, which may be of non-standard section as defined
+ * in UEFI spec appendix "Common Platform Error Record", or may
+ * be of sections for which TRACE_EVENT is not defined.
+ *
+ */
+TRACE_EVENT(unknown_sec_event,
+
+	TP_PROTO(const uuid_le *sec_type,
+		 const uuid_le *fru_id,
+		 const char *fru_text,
+		 const u8 sev,
+		 const u8 *err,
+		 const u32 len),
+
+	TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
+
+	TP_STRUCT__entry(
+		__array(char, sec_type, 16)
+		__array(char, fru_id, 16)
+		__string(fru_text, fru_text)
+		__field(u8, sev)
+		__field(u32, len)
+		__dynamic_array(u8, buf, len)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->sec_type, sec_type, sizeof(uuid_le));
+		memcpy(__entry->fru_id, fru_id, sizeof(uuid_le));
+		__assign_str(fru_text, fru_text);
+		__entry->sev = sev;
+		__entry->len = len;
+		memcpy(__get_dynamic_array(buf), err, len);
+	),
+
+	TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",
+		  __entry->sev, __entry->sec_type,
+		  __entry->fru_id, __get_str(fru_text),
+		  __entry->len,
+		  __print_hex(__get_dynamic_array(buf), __entry->len))
+);
+
+/*
  * PCIe AER Trace event
  *
  * These events are generated when hardware detects a corrected or
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 10/11] trace, ras: add ARM processor error trace event
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (8 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 09/11] ras: acpi / apei: generate trace event for " Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-05-08 17:34   ` Borislav Petkov
  2017-04-18 23:05 ` [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support Tyler Baicar
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

Currently there are trace events for the various RAS
errors with the exception of ARM processor type errors.
Add a new trace event for such errors so that the user
will know when they occur. These trace events are
consistent with the ARM processor error section type
defined in UEFI 2.6 spec section N.2.4.4.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
---
 drivers/acpi/apei/ghes.c    |  8 +++++++-
 drivers/firmware/efi/cper.c |  1 +
 drivers/ras/ras.c           |  1 +
 include/ras/ras_event.h     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3d9f63b..612deb3 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -518,7 +518,13 @@ static void ghes_do_proc(struct ghes *ghes,
 		}
 #endif
 #ifdef CONFIG_RAS
-		else if (trace_unknown_sec_event_enabled()) {
+		else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
+			 trace_arm_event_enabled()) {
+			struct cper_sec_proc_arm *arm_err;
+
+			arm_err = acpi_hest_get_payload(gdata);
+			trace_arm_event(arm_err);
+		} else if (trace_unknown_sec_event_enabled()) {
 			void *unknown_err = acpi_hest_get_payload(gdata);
 
 			trace_unknown_sec_event(&sec_type,
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 610d31a..650e0f6 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -35,6 +35,7 @@
 #include <linux/printk.h>
 #include <linux/bcd.h>
 #include <acpi/ghes.h>
+#include <ras/ras_event.h>
 
 #define INDENT_SP	" "
 
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index fb2500b..8ba5a94 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -28,3 +28,4 @@ static int __init ras_init(void)
 #endif
 EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
 EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(arm_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5861b6f..13befad 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,51 @@
 );
 
 /*
+ * ARM Processor Events Report
+ *
+ * This event is generated when hardware detects an ARM processor error
+ * has occurred. UEFI 2.6 spec section N.2.4.4.
+ */
+TRACE_EVENT(arm_event,
+
+	TP_PROTO(const struct cper_sec_proc_arm *proc),
+
+	TP_ARGS(proc),
+
+	TP_STRUCT__entry(
+		__field(u64, mpidr)
+		__field(u64, midr)
+		__field(u32, running_state)
+		__field(u32, psci_state)
+		__field(u8, affinity)
+	),
+
+	TP_fast_assign(
+		if (proc->validation_bits & CPER_ARM_VALID_AFFINITY_LEVEL)
+			__entry->affinity = proc->affinity_level;
+		else
+			__entry->affinity = ~0;
+		if (proc->validation_bits & CPER_ARM_VALID_MPIDR)
+			__entry->mpidr = proc->mpidr;
+		else
+			__entry->mpidr = 0ULL;
+		__entry->midr = proc->midr;
+		if (proc->validation_bits & CPER_ARM_VALID_RUNNING_STATE) {
+			__entry->running_state = proc->running_state;
+			__entry->psci_state = proc->psci_state;
+		} else {
+			__entry->running_state = ~0;
+			__entry->psci_state = ~0;
+		}
+	),
+
+	TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
+		  "running state: %d; PSCI state: %d",
+		  __entry->affinity, __entry->mpidr, __entry->midr,
+		  __entry->running_state, __entry->psci_state)
+);
+
+/*
  * Unknown Section Report
  *
  * This event is generated when hardware detected a hardware
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support
  2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
                   ` (9 preceding siblings ...)
  2017-04-18 23:05 ` [PATCH V15 10/11] trace, ras: add ARM processor error trace event Tyler Baicar
@ 2017-04-18 23:05 ` Tyler Baicar
  2017-05-08 17:40   ` Borislav Petkov
  10 siblings, 1 reply; 41+ messages in thread
From: Tyler Baicar @ 2017-04-18 23:05 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi
  Cc: Tyler Baicar

Currently external aborts are unsupported by the guest abort
handling. Add handling for SEAs so that the host kernel reports
SEAs which occur in the guest kernel.

When an SEA occurs in the guest kernel, the guest exits and is
routed to kvm_handle_guest_abort(). Prior to this patch, a print
message of an unsupported FSC would be printed and nothing else
would happen. With this patch, the code gets routed to the APEI
handling of SEAs in the host kernel to report the SEA information.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
---
 arch/arm/include/asm/kvm_arm.h       | 10 ++++++++++
 arch/arm/include/asm/system_misc.h   |  5 +++++
 arch/arm/kvm/mmu.c                   | 35 ++++++++++++++++++++++++++++++++---
 arch/arm64/include/asm/kvm_arm.h     | 10 ++++++++++
 arch/arm64/include/asm/system_misc.h |  2 ++
 arch/arm64/mm/fault.c                | 23 +++++++++++++++++++++--
 drivers/acpi/apei/ghes.c             | 13 +++++++------
 include/acpi/ghes.h                  |  2 +-
 8 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index a3f0b3d..ebf020b 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -187,6 +187,16 @@
 #define FSC_FAULT	(0x04)
 #define FSC_ACCESS	(0x08)
 #define FSC_PERM	(0x0c)
+#define FSC_SEA		(0x10)
+#define FSC_SEA_TTW0	(0x14)
+#define FSC_SEA_TTW1	(0x15)
+#define FSC_SEA_TTW2	(0x16)
+#define FSC_SEA_TTW3	(0x17)
+#define FSC_SECC	(0x18)
+#define FSC_SECC_TTW0	(0x1c)
+#define FSC_SECC_TTW1	(0x1d)
+#define FSC_SECC_TTW2	(0x1e)
+#define FSC_SECC_TTW3	(0x1f)
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~0xf)
diff --git a/arch/arm/include/asm/system_misc.h b/arch/arm/include/asm/system_misc.h
index a3d61ad..8c4a89f 100644
--- a/arch/arm/include/asm/system_misc.h
+++ b/arch/arm/include/asm/system_misc.h
@@ -22,6 +22,11 @@
 
 extern unsigned int user_debug;
 
+static inline int handle_guest_sea(phys_addr_t addr, unsigned int esr)
+{
+	return -1;
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_ARM_SYSTEM_MISC_H */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 582a972..fd11855 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -29,6 +29,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/virt.h>
+#include <asm/system_misc.h>
 
 #include "trace.h"
 
@@ -1418,6 +1419,24 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 		kvm_set_pfn_accessed(pfn);
 }
 
+static bool is_abort_sea(unsigned long fault_status) {
+	switch (fault_status) {
+	case FSC_SEA:
+	case FSC_SEA_TTW0:
+	case FSC_SEA_TTW1:
+	case FSC_SEA_TTW2:
+	case FSC_SEA_TTW3:
+	case FSC_SECC:
+	case FSC_SECC_TTW0:
+	case FSC_SECC_TTW1:
+	case FSC_SECC_TTW2:
+	case FSC_SECC_TTW3:
+		return true;
+	default:
+		return false;
+	}
+}
+
 /**
  * kvm_handle_guest_abort - handles all 2nd stage aborts
  * @vcpu:	the VCPU pointer
@@ -1440,19 +1459,29 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	gfn_t gfn;
 	int ret, idx;
 
+	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
+
+	fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
+
+	/*
+	 * The host kernel will handle the synchronous external abort. There
+	 * is no need to pass the error into the guest.
+	 */
+	if (is_abort_sea(fault_status)) {
+		if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
+			return 1;
+	}
+
 	is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
 	if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
 		kvm_inject_vabt(vcpu);
 		return 1;
 	}
 
-	fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
-
 	trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu),
 			      kvm_vcpu_get_hfar(vcpu), fault_ipa);
 
 	/* Check the stage-2 fault is trans. fault or write fault */
-	fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
 	if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
 	    fault_status != FSC_ACCESS) {
 		kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 6e99978..61d694c 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -204,6 +204,16 @@
 #define FSC_FAULT	ESR_ELx_FSC_FAULT
 #define FSC_ACCESS	ESR_ELx_FSC_ACCESS
 #define FSC_PERM	ESR_ELx_FSC_PERM
+#define FSC_SEA		ESR_ELx_FSC_EXTABT
+#define FSC_SEA_TTW0	(0x14)
+#define FSC_SEA_TTW1	(0x15)
+#define FSC_SEA_TTW2	(0x16)
+#define FSC_SEA_TTW3	(0x17)
+#define FSC_SECC	(0x18)
+#define FSC_SECC_TTW0	(0x1c)
+#define FSC_SECC_TTW1	(0x1d)
+#define FSC_SECC_TTW2	(0x1e)
+#define FSC_SECC_TTW3	(0x1f)
 
 /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
 #define HPFAR_MASK	(~UL(0xf))
diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index bc81243..95aa442 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -56,6 +56,8 @@ void hook_debug_fault_code(int nr, int (*fn)(unsigned long, unsigned int,
 	__show_ratelimited;						\
 })
 
+int handle_guest_sea(phys_addr_t addr, unsigned int esr);
+
 #endif	/* __ASSEMBLY__ */
 
 #endif	/* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 10013ff..a8c5d11 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -515,6 +515,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 {
 	struct siginfo info;
 	const struct fault_info *inf;
+	int ret = 0;
 
 	inf = esr_to_fault_info(esr);
 	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
@@ -527,7 +528,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	 */
 	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
 		nmi_enter();
-		ghes_notify_sea();
+		ret = ghes_notify_sea();
 		nmi_exit();
 	}
 
@@ -540,7 +541,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 		info.si_addr  = (void __user *)addr;
 	arm64_notify_die("", regs, &info, esr);
 
-	return 0;
+	return ret;
 }
 
 static const struct fault_info fault_info[] = {
@@ -611,6 +612,24 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 };
 
 /*
+ * Handle Synchronous External Aborts that occur in a guest kernel.
+ *
+ * The return value will be zero if the SEA was successfully handled
+ * and non-zero if there was an error processing the error or there was
+ * no error to process.
+ */
+int handle_guest_sea(phys_addr_t addr, unsigned int esr)
+{
+	int ret = -ENOENT;
+
+	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+		ret = ghes_notify_sea();
+	}
+
+	return ret;
+}
+
+/*
  * Dispatch a data abort to the relevant handler.
  */
 asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 612deb3..d286248 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -812,17 +812,18 @@ static int ghes_notify_sci(struct notifier_block *this,
 #ifdef CONFIG_ACPI_APEI_SEA
 static LIST_HEAD(ghes_sea);
 
-void ghes_notify_sea(void)
+int ghes_notify_sea(void)
 {
 	struct ghes *ghes;
+	int ret = -ENOENT;
 
-	/*
-	 * synchronize_rcu() will wait for nmi_exit(), so no need to
-	 * rcu_read_lock().
-	 */
+	rcu_read_lock();
 	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
-		ghes_proc(ghes);
+		if(!ghes_proc(ghes))
+			ret = 0;
 	}
+	rcu_read_unlock();
+	return ret;
 }
 
 static void ghes_sea_add(struct ghes *ghes)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index ef0040893..09e0b65 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -99,6 +99,6 @@ static inline void *acpi_hest_get_payload(struct acpi_hest_generic_data *gdata)
 	return gdata + 1;
 }
 
-void ghes_notify_sea(void);
+int ghes_notify_sea(void);
 
 #endif /* GHES_H */
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption
  2017-04-18 23:05 ` [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption Tyler Baicar
@ 2017-04-19 18:31   ` Borislav Petkov
  2017-04-19 20:31     ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-19 18:31 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:13PM -0600, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple error records,
> then they may be overwritten before the OS has consumed them.
> 
> The Generic Hardware Error Source (GHES) v2 structure
> introduces the capability for the OS to acknowledge the
> consumption of the error record generated by the RAS
> controller. A RAS controller supporting GHESv2 shall wait for
> the acknowledgment before writing a new error record, thus
> eliminating the race condition.
> 
> Add support for parsing of GHESv2 sub-tables as well.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  drivers/acpi/apei/ghes.c | 55 +++++++++++++++++++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c |  7 ++++--
>  include/acpi/ghes.h      |  5 ++++-
>  3 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 79b3c9c..6d87ab7 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -46,6 +46,7 @@
>  #include <linux/nmi.h>
>  #include <linux/sched/clock.h>
>  
> +#include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
>  #include <acpi/apei.h>
>  #include <asm/tlbflush.h>
> @@ -80,6 +81,10 @@
>  	((struct acpi_hest_generic_status *)				\
>  	 ((struct ghes_estatus_node *)(estatus_node) + 1))
>  
> +#define IS_HEST_TYPE_GENERIC_V2(ghes)				\
> +	((struct acpi_hest_header *)ghes->generic)->type ==	\

This is a nasty hack: casting the ghes->generic pointer to a pointer of its
first member which is a acpi_hest_header.

Why isn't this a nice inline function with proper dereferencing:

static inline bool is_hest_type_generic_v2(struct ghes *ghes)
{
        return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

?

Also, please integrate scripts/checkpatch.pl in your patch creation
workflow. Some of the warnings/errors *actually* make sense.

>  /*
>   * This driver isn't really modular, however for the time being,
>   * continuing to use module_param is the easiest way to remain
> @@ -240,6 +245,17 @@ static int ghes_estatus_pool_expand(unsigned long len)
>  	return 0;
>  }
>  
> +static int map_gen_v2(struct ghes *ghes)
> +{
> +	return apei_map_generic_address(&ghes->generic_v2->read_ack_register);
> +}
> +
> +static void unmap_gen_v2(struct ghes *ghes)
> +{
> +	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
> +	return;
> +}

Like this one, for example:

WARNING: void function return statements are not generally useful
#89: FILE: drivers/acpi/apei/ghes.c:257:
+       return;
+}

> +
>  static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  {
>  	struct ghes *ghes;
> @@ -249,10 +265,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
>  	if (!ghes)
>  		return ERR_PTR(-ENOMEM);
> +
>  	ghes->generic = generic;
> +	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = map_gen_v2(ghes);
> +		if (rc)
> +			goto err_free;
> +	}
> +
>  	rc = apei_map_generic_address(&generic->error_status_address);
>  	if (rc)
> -		goto err_free;
> +		goto err_unmap_read_ack_addr;
>  	error_block_length = generic->error_block_length;
>  	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
>  		pr_warning(FW_WARN GHES_PFX
> @@ -264,13 +287,16 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>  	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
>  	if (!ghes->estatus) {
>  		rc = -ENOMEM;
> -		goto err_unmap;
> +		goto err_unmap_status_addr;
>  	}
>  
>  	return ghes;
>  
> -err_unmap:
> +err_unmap_status_addr:
>  	apei_unmap_generic_address(&generic->error_status_address);
> +err_unmap_read_ack_addr:
> +	if (IS_HEST_TYPE_GENERIC_V2(ghes))
> +		unmap_gen_v2(ghes);
>  err_free:
>  	kfree(ghes);
>  	return ERR_PTR(rc);
> @@ -280,6 +306,8 @@ static void ghes_fini(struct ghes *ghes)
>  {
>  	kfree(ghes->estatus);
>  	apei_unmap_generic_address(&ghes->generic->error_status_address);
> +	if (IS_HEST_TYPE_GENERIC_V2(ghes))
> +		unmap_gen_v2(ghes);
>  }
>  
>  static inline int ghes_severity(int severity)
> @@ -649,6 +677,21 @@ static void ghes_estatus_cache_add(
>  	rcu_read_unlock();
>  }
>  
> +static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)

If you name this function parameter to something shorter, say gv2, for
example...

> +{
> +	int rc;
> +	u64 val = 0;
> +
> +	rc = apei_read(&val, &generic_v2->read_ack_register);
> +	if (rc)
> +		return rc;
> +
> +	val &= generic_v2->read_ack_preserve << generic_v2->read_ack_register.bit_offset;
> +	val |= generic_v2->read_ack_write << generic_v2->read_ack_register.bit_offset;

... you can align those two nicely while remaining within the 80 cols width:

        val &= gv2->read_ack_preserve << gv2->read_ack_register.bit_offset;
        val |= gv2->read_ack_write    << gv2->read_ack_register.bit_offset;

and make them readable at a quick glance.

> +
> +	return apei_write(val, &generic_v2->read_ack_register);
> +}
> +
>  static int ghes_proc(struct ghes *ghes)
>  {
>  	int rc;
> @@ -661,6 +704,12 @@ static int ghes_proc(struct ghes *ghes)
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>  	}
>  	ghes_do_proc(ghes, ghes->estatus);

This needs a comment why v2 needs to ACK the error. The commit message
is not necessarily something we'll find quickly in the future.

> +
> +	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
> +		rc = ghes_ack_error(ghes->generic_v2);
> +		if (rc)
> +			return rc;
> +	}
>  out:
>  	ghes_clear_estatus(ghes);
>  	return rc;

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption
  2017-04-19 18:31   ` Borislav Petkov
@ 2017-04-19 20:31     ` Baicar, Tyler
  2017-04-19 20:41       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-19 20:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On 4/19/2017 12:31 PM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 05:05:13PM -0600, Tyler Baicar wrote:
>> A RAS (Reliability, Availability, Serviceability) controller
>> may be a separate processor running in parallel with OS
>> execution, and may generate error records for consumption by
>> the OS. If the RAS controller produces multiple error records,
>> then they may be overwritten before the OS has consumed them.
>>
>> The Generic Hardware Error Source (GHES) v2 structure
>> introduces the capability for the OS to acknowledge the
>> consumption of the error record generated by the RAS
>> controller. A RAS controller supporting GHESv2 shall wait for
>> the acknowledgment before writing a new error record, thus
>> eliminating the race condition.
>>
>> Add support for parsing of GHESv2 sub-tables as well.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
>> Reviewed-by: James Morse <james.morse@arm.com>
>> ---
>>   drivers/acpi/apei/ghes.c | 55 +++++++++++++++++++++++++++++++++++++++++++++---
>>   drivers/acpi/apei/hest.c |  7 ++++--
>>   include/acpi/ghes.h      |  5 ++++-
>>   3 files changed, 61 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 79b3c9c..6d87ab7 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -46,6 +46,7 @@
>>   #include <linux/nmi.h>
>>   #include <linux/sched/clock.h>
>>   
>> +#include <acpi/actbl1.h>
>>   #include <acpi/ghes.h>
>>   #include <acpi/apei.h>
>>   #include <asm/tlbflush.h>
>> @@ -80,6 +81,10 @@
>>   	((struct acpi_hest_generic_status *)				\
>>   	 ((struct ghes_estatus_node *)(estatus_node) + 1))
>>   
>> +#define IS_HEST_TYPE_GENERIC_V2(ghes)				\
>> +	((struct acpi_hest_header *)ghes->generic)->type ==	\
> This is a nasty hack: casting the ghes->generic pointer to a pointer of its
> first member which is a acpi_hest_header.
>
> Why isn't this a nice inline function with proper dereferencing:
>
> static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> {
>          return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> ?
I'll change it to this.
> Also, please integrate scripts/checkpatch.pl in your patch creation
> workflow. Some of the warnings/errors *actually* make sense.
>
>>   /*
>>    * This driver isn't really modular, however for the time being,
>>    * continuing to use module_param is the easiest way to remain
>> @@ -240,6 +245,17 @@ static int ghes_estatus_pool_expand(unsigned long len)
>>   	return 0;
>>   }
>>   
>> +static int map_gen_v2(struct ghes *ghes)
>> +{
>> +	return apei_map_generic_address(&ghes->generic_v2->read_ack_register);
>> +}
>> +
>> +static void unmap_gen_v2(struct ghes *ghes)
>> +{
>> +	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
>> +	return;
>> +}
> Like this one, for example:
>
> WARNING: void function return statements are not generally useful
> #89: FILE: drivers/acpi/apei/ghes.c:257:
> +       return;
> +}
Will remove the return.
>
>> +
>>   static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>>   {
>>   	struct ghes *ghes;
>> @@ -249,10 +265,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>>   	ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
>>   	if (!ghes)
>>   		return ERR_PTR(-ENOMEM);
>> +
>>   	ghes->generic = generic;
>> +	if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
>> +		rc = map_gen_v2(ghes);
>> +		if (rc)
>> +			goto err_free;
>> +	}
>> +
>>   	rc = apei_map_generic_address(&generic->error_status_address);
>>   	if (rc)
>> -		goto err_free;
>> +		goto err_unmap_read_ack_addr;
>>   	error_block_length = generic->error_block_length;
>>   	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
>>   		pr_warning(FW_WARN GHES_PFX
>> @@ -264,13 +287,16 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
>>   	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
>>   	if (!ghes->estatus) {
>>   		rc = -ENOMEM;
>> -		goto err_unmap;
>> +		goto err_unmap_status_addr;
>>   	}
>>   
>>   	return ghes;
>>   
>> -err_unmap:
>> +err_unmap_status_addr:
>>   	apei_unmap_generic_address(&generic->error_status_address);
>> +err_unmap_read_ack_addr:
>> +	if (IS_HEST_TYPE_GENERIC_V2(ghes))
>> +		unmap_gen_v2(ghes);
>>   err_free:
>>   	kfree(ghes);
>>   	return ERR_PTR(rc);
>> @@ -280,6 +306,8 @@ static void ghes_fini(struct ghes *ghes)
>>   {
>>   	kfree(ghes->estatus);
>>   	apei_unmap_generic_address(&ghes->generic->error_status_address);
>> +	if (IS_HEST_TYPE_GENERIC_V2(ghes))
>> +		unmap_gen_v2(ghes);
>>   }
>>   
>>   static inline int ghes_severity(int severity)
>> @@ -649,6 +677,21 @@ static void ghes_estatus_cache_add(
>>   	rcu_read_unlock();
>>   }
>>   
>> +static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
> If you name this function parameter to something shorter, say gv2, for
> example...
Will do.
>
>> +{
>> +	int rc;
>> +	u64 val = 0;
>> +
>> +	rc = apei_read(&val, &generic_v2->read_ack_register);
>> +	if (rc)
>> +		return rc;
>> +
>> +	val &= generic_v2->read_ack_preserve << generic_v2->read_ack_register.bit_offset;
>> +	val |= generic_v2->read_ack_write << generic_v2->read_ack_register.bit_offset;
> ... you can align those two nicely while remaining within the 80 cols width:
>
>          val &= gv2->read_ack_preserve << gv2->read_ack_register.bit_offset;
>          val |= gv2->read_ack_write    << gv2->read_ack_register.bit_offset;
>
> and make them readable at a quick glance.
Will do.
>> +
>> +	return apei_write(val, &generic_v2->read_ack_register);
>> +}
>> +
>>   static int ghes_proc(struct ghes *ghes)
>>   {
>>   	int rc;
>> @@ -661,6 +704,12 @@ static int ghes_proc(struct ghes *ghes)
>>   			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
>>   	}
>>   	ghes_do_proc(ghes, ghes->estatus);
> This needs a comment why v2 needs to ACK the error. The commit message
> is not necessarily something we'll find quickly in the future.
Will do.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption
  2017-04-19 20:31     ` Baicar, Tyler
@ 2017-04-19 20:41       ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-19 20:41 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Wed, Apr 19, 2017 at 02:31:13PM -0600, Baicar, Tyler wrote:
> Will do.

You don't necessarily have to reply with "will do" if you agree with the
review.

Also, please wait until I've gone through the whole pile before sending
it again.

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure
  2017-04-18 23:05 ` [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure Tyler Baicar
@ 2017-04-20 11:14   ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-20 11:14 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:14PM -0600, Tyler Baicar wrote:
> The ACPI 6.1 spec adds a new version of the generic data structure.

which data structure? HEST?

> Add support to handle the new structure as well as properly verify
> and iterate through the generic data entries.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

This is clearly a new version of the patch so Reviewed-by tags don't
apply anymore. Ditto for the next one. Remember to remove such tags in
the future when the patches are changed non-trivially in a following
iteration.

> ---
>  drivers/acpi/apei/ghes.c    |  6 +++---
>  drivers/firmware/efi/cper.c | 37 ++++++++++++++++++++++---------------
>  include/acpi/ghes.h         | 22 ++++++++++++++++++++++
>  3 files changed, 47 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 6d87ab7..dfb7dd2 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -429,7 +429,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
>  	struct cper_sec_mem_err *mem_err;
> -	mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> +	mem_err = acpi_hest_get_payload(gdata);

        struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

while you're at it.

>  	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>  		return;
> @@ -466,7 +466,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				 CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
> -			mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +			mem_err = acpi_hest_get_payload(gdata);

Ditto

+ add a \n here.

>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
>  
>  			arch_apei_report_mem_error(sev, mem_err);
> @@ -476,7 +476,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>  				      CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
> -			pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +			pcie_err = acpi_hest_get_payload(gdata);

Ditto.

>  			if (sev == GHES_SEV_RECOVERABLE &&
>  			    sec_sev == GHES_SEV_RECOVERABLE &&
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index d425374..8328a6f 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,7 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <acpi/ghes.h>
>  
>  #define INDENT_SP	" "
>  
> @@ -386,8 +387,9 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  }
>  
> -static void cper_estatus_print_section(
> -	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
> +static void
> +cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata,
> +			   int sec_no)
>  {
>  	uuid_le *sec_type = (uuid_le *)gdata->section_type;
>  	__u16 severity;
> @@ -403,14 +405,18 @@ static void cper_estatus_print_section(
>  
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>  	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> -		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
> +		struct cper_sec_proc_generic *proc_err;
> +
> +		proc_err = acpi_hest_get_payload(gdata);

Ditto.

>  		printk("%s""section_type: general processor error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*proc_err))
>  			cper_print_proc_generic(newpfx, proc_err);
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> -		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> +		struct cper_sec_mem_err *mem_err;
> +
> +		mem_err = acpi_hest_get_payload(gdata);

Ditto.

>  		printk("%s""section_type: memory error\n", newpfx);
>  		if (gdata->error_data_length >=
>  		    sizeof(struct cper_sec_mem_err_old))
> @@ -419,7 +425,9 @@ static void cper_estatus_print_section(
>  		else
>  			goto err_section_too_small;
>  	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> -		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> +		struct cper_sec_pcie *pcie;
> +
> +		pcie = acpi_hest_get_payload(gdata);

Ditto.

>  		printk("%s""section_type: PCIe error\n", newpfx);
>  		if (gdata->error_data_length >= sizeof(*pcie))
>  			cper_print_pcie(newpfx, pcie, gdata);
> @@ -438,7 +446,7 @@ void cper_estatus_print(const char *pfx,
>  			const struct acpi_hest_generic_status *estatus)
>  {
>  	struct acpi_hest_generic_data *gdata;
> -	unsigned int data_len, gedata_len;
> +	unsigned int data_len;
>  	int sec_no = 0;
>  	char newpfx[64];
>  	__u16 severity;
> @@ -452,11 +460,10 @@ void cper_estatus_print(const char *pfx,
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
>  	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;

<-- \n here while you're at it.

We have to start making this code more readable. It looks write-only
right now.

> +	while (data_len >= acpi_hest_get_size(gdata)) {
>  		cper_estatus_print_section(newpfx, gdata, sec_no);
> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= acpi_hest_get_record_size(gdata);
> +		gdata = acpi_hest_get_next(gdata);
>  		sec_no++;
>  	}
>  }
> @@ -486,12 +493,12 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
>  		return rc;
>  	data_len = estatus->data_length;
>  	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
> -	while (data_len >= sizeof(*gdata)) {
> -		gedata_len = gdata->error_data_length;
> -		if (gedata_len > data_len - sizeof(*gdata))

Ditto.

> +	while (data_len >= acpi_hest_get_size(gdata)) {
> +		gedata_len = acpi_hest_get_error_length(gdata);
> +		if (gedata_len > data_len - acpi_hest_get_size(gdata))
>  			return -EINVAL;

<-- \n here too

> -		data_len -= gedata_len + sizeof(*gdata);
> -		gdata = (void *)(gdata + 1) + gedata_len;
> +		data_len -= acpi_hest_get_record_size(gdata);
> +		gdata = acpi_hest_get_next(gdata);
>  	}
>  	if (data_len)
>  		return -EINVAL;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 68f088a..b89361a 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -12,6 +12,20 @@
>  #define GHES_TO_CLEAR		0x0001
>  #define GHES_EXITING		0x0002
>  
> +#define acpi_hest_get_error_length(gdata)	\
> +	(((struct acpi_hest_generic_data *)(gdata))->error_data_length)
> +#define acpi_hest_get_size(gdata)		\
> +	((acpi_hest_get_version(gdata) >= 3) ?	\
> +	sizeof(struct acpi_hest_generic_data_v300) :	\
> +	sizeof(struct acpi_hest_generic_data))
> +#define acpi_hest_get_record_size(gdata)	\
> +	(acpi_hest_get_size(gdata) +		\
> +	acpi_hest_get_error_length(gdata))
> +#define acpi_hest_get_next(gdata)		\
> +	((void *)(gdata) + acpi_hest_get_record_size(gdata))
> +#define acpi_hest_get_version(gdata)			\
> +	(gdata->revision >> 8)

Make all those inline functions.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-18 23:05 ` [PATCH V15 03/11] cper: add timestamp print to CPER status printing Tyler Baicar
@ 2017-04-21 12:21   ` Borislav Petkov
  2017-04-21 16:04     ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-21 12:21 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:15PM -0600, Tyler Baicar wrote:
> The ACPI 6.1 spec added a timestamp to the HEST generic data

HEST?

I see the timestamp in

Table 18-343 Generic Error Data Entry

where those things are "One or more Generic Error Data Entry structures
may be recorded in the Generic Error Data Entries field of the Generic
Error Status Block structure."

And those are part of the "18.3.2.7 Generic Hardware Error Source",
i.e., GHES. So why do you say "HEST" above?

> structure. Print the timestamp out when printing out the error
> status information.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Remove those Reviewed-by:s

> ---
>  drivers/firmware/efi/cper.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 8328a6f..46585f9 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,8 @@
>  #include <linux/acpi.h>
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
>  #include <acpi/ghes.h>
>  
>  #define INDENT_SP	" "
> @@ -387,6 +389,29 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>  	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>  }
>  
> +static void cper_estatus_timestamp(const char *pfx,

cper_print_tstamp()

> +				   struct acpi_hest_generic_data_v300 *gdata)
> +{
> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> +		timestamp = (__u8 *)&(gdata->time_stamp);
> +		sec       = bcd2bin(timestamp[0]);
> +		min       = bcd2bin(timestamp[1]);
> +		hour      = bcd2bin(timestamp[2]);
> +		day       = bcd2bin(timestamp[4]);
> +		mon       = bcd2bin(timestamp[5]);
> +		year      = bcd2bin(timestamp[6]);
> +		century   = bcd2bin(timestamp[7]);
> +
> +		if (*(timestamp + 3) & 0x1)
> +			printk("%stimestamp is precise\n", pfx);
> +
> +		printk("%stime: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +			century, year, mon, day, hour, min, sec);

Yeah, about the precise tstamp, you can do something like this:

---
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 46585f92b741..a649884e2264 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -404,10 +404,8 @@ static void cper_estatus_timestamp(const char *pfx,
 		year      = bcd2bin(timestamp[6]);
 		century   = bcd2bin(timestamp[7]);
 
-		if (*(timestamp + 3) & 0x1)
-			printk("%stimestamp is precise\n", pfx);
-
-		printk("%stime: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
+		printk("%s%ststamp: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
+			(timestamp[3] & 0x1 ? "precise " : ""), 
 			century, year, mon, day, hour, min, sec);
 	}
 }

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-21 12:21   ` Borislav Petkov
@ 2017-04-21 16:04     ` Baicar, Tyler
  2017-04-21 17:26       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-21 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On 4/21/2017 6:21 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 05:05:15PM -0600, Tyler Baicar wrote:
>> The ACPI 6.1 spec added a timestamp to the HEST generic data
> HEST?
>
> I see the timestamp in
>
> Table 18-343 Generic Error Data Entry
>
> where those things are "One or more Generic Error Data Entry structures
> may be recorded in the Generic Error Data Entries field of the Generic
> Error Status Block structure."
>
> And those are part of the "18.3.2.7 Generic Hardware Error Source",
> i.e., GHES. So why do you say "HEST" above?
Ah yes, this should be GHES no HEST. There are too many acronyms 
involved here :)
>
>> structure. Print the timestamp out when printing out the error
>> status information.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
>> Reviewed-by: James Morse <james.morse@arm.com>
>> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Remove those Reviewed-by:s
>
>> ---
>>   drivers/firmware/efi/cper.c | 28 ++++++++++++++++++++++++++++
>>   1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 8328a6f..46585f9 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -32,6 +32,8 @@
>>   #include <linux/acpi.h>
>>   #include <linux/pci.h>
>>   #include <linux/aer.h>
>> +#include <linux/printk.h>
>> +#include <linux/bcd.h>
>>   #include <acpi/ghes.h>
>>   
>>   #define INDENT_SP	" "
>> @@ -387,6 +389,29 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
>>   	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
>>   }
>>   
>> +static void cper_estatus_timestamp(const char *pfx,
> cper_print_tstamp()
>
>> +				   struct acpi_hest_generic_data_v300 *gdata)
>> +{
>> +	__u8 hour, min, sec, day, mon, year, century, *timestamp;
>> +
>> +	if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
>> +		timestamp = (__u8 *)&(gdata->time_stamp);
>> +		sec       = bcd2bin(timestamp[0]);
>> +		min       = bcd2bin(timestamp[1]);
>> +		hour      = bcd2bin(timestamp[2]);
>> +		day       = bcd2bin(timestamp[4]);
>> +		mon       = bcd2bin(timestamp[5]);
>> +		year      = bcd2bin(timestamp[6]);
>> +		century   = bcd2bin(timestamp[7]);
>> +
>> +		if (*(timestamp + 3) & 0x1)
>> +			printk("%stimestamp is precise\n", pfx);
>> +
>> +		printk("%stime: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
>> +			century, year, mon, day, hour, min, sec);
> Yeah, about the precise tstamp, you can do something like this:
>
> ---
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 46585f92b741..a649884e2264 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -404,10 +404,8 @@ static void cper_estatus_timestamp(const char *pfx,
>   		year      = bcd2bin(timestamp[6]);
>   		century   = bcd2bin(timestamp[7]);
>   
> -		if (*(timestamp + 3) & 0x1)
> -			printk("%stimestamp is precise\n", pfx);
> -
> -		printk("%stime: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +		printk("%s%ststamp: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
> +			(timestamp[3] & 0x1 ? "precise " : ""),
>   			century, year, mon, day, hour, min, sec);
>   	}
>   }
This is basically what I already had in v14...you asked to move it into 
a different if-statement? https://lkml.org/lkml/2017/4/12/397

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-21 16:04     ` Baicar, Tyler
@ 2017-04-21 17:26       ` Borislav Petkov
  2017-04-21 18:08         ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-21 17:26 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Fri, Apr 21, 2017 at 10:04:35AM -0600, Baicar, Tyler wrote:
> This is basically what I already had in v14...you asked to move it into a
> different if-statement? https://lkml.org/lkml/2017/4/12/397

Well, clearly I've been smoking some nasty potent sh*t. :-\

/me goes and looks at the spec:

"Bit 0 – Timestamp is precise if this bit is set and correlates to the
time of the error event."

So why are we even printing the timestamp when !precise?

IOW, I think we should do:

	if (!(timestamp[3] & 0x1))
		printk("%stimestamp imprecise\n", pfx);
	else {
		sec = ..
		min = ...

		...
	}

and print the actual values only when the timestamp is precise.
Otherwise it has *some* values which could just as well be completely
random. And it's not like we're reporting the error tomorrow - it is
mostly a couple of seconds from logging to the fw pushing it out...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-18 23:05 ` [PATCH V15 04/11] efi: parse ARM processor error Tyler Baicar
@ 2017-04-21 17:55   ` Borislav Petkov
  2017-04-21 18:22     ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-21 17:55 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:16PM -0600, Tyler Baicar wrote:
> Add support for ARM Common Platform Error Record (CPER).
> UEFI 2.6 specification adds support for ARM specific
> processor error information to be reported as part of the
> CPER records. This provides more detail on for processor error logs.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  drivers/firmware/efi/cper.c | 135 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/cper.h        |  54 ++++++++++++++++++
>  2 files changed, 189 insertions(+)
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 46585f9..f959185 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -110,12 +110,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
>  static const char * const proc_type_strs[] = {
>  	"IA32/X64",
>  	"IA64",
> +	"ARM",
>  };
>  
>  static const char * const proc_isa_strs[] = {
>  	"IA32",
>  	"IA64",
>  	"X64",
> +	"ARM A32/T32",
> +	"ARM A64",
>  };
>  
>  static const char * const proc_error_type_strs[] = {
> @@ -184,6 +187,128 @@ static void cper_print_proc_generic(const char *pfx,
>  		printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
>  }
>  
> +#if defined(CONFIG_ARM64) || defined(CONFIG_ARM)
> +static const char * const arm_reg_ctx_strs[] = {
> +	"AArch32 general purpose registers",
> +	"AArch32 EL1 context registers",
> +	"AArch32 EL2 context registers",
> +	"AArch32 secure context registers",
> +	"AArch64 general purpose registers",
> +	"AArch64 EL1 context registers",
> +	"AArch64 EL2 context registers",
> +	"AArch64 EL3 context registers",
> +	"Misc. system register structure",
> +};
> +
> +static void cper_print_proc_arm(const char *pfx,
> +				const struct cper_sec_proc_arm *proc)
> +{
> +	int i, len, max_ctx_type;
> +	struct cper_arm_err_info *err_info;
> +	struct cper_arm_ctx_info *ctx_info;
> +	char newpfx[64];
> +
> +	printk("%ssection length: %d\n", pfx, proc->section_length);

We need to dump section length because?

> +	printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
> +
> +	len = proc->section_length - (sizeof(*proc) +
> +		proc->err_info_num * (sizeof(*err_info)));
> +	if (len < 0) {
> +		printk("%ssection length is too small\n", pfx);

Now here we *can* dump it.

> +		printk("%sfirmware-generated error record is incorrect\n", pfx);
> +		printk("%sERR_INFO_NUM is %d\n", pfx, proc->err_info_num);
> +		return;
> +	}
> +
> +	if (proc->validation_bits & CPER_ARM_VALID_MPIDR)
> +		printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);


<---- newline here.

Also, what is MPIDR and can it be written in a more user-friendly manner
and not be an abbreviation?

> +	if (proc->validation_bits & CPER_ARM_VALID_AFFINITY_LEVEL)
> +		printk("%serror affinity level: %d\n", pfx,
> +			proc->affinity_level);
> +	if (proc->validation_bits & CPER_ARM_VALID_RUNNING_STATE) {
> +		printk("%srunning state: 0x%x\n", pfx, proc->running_state);
> +		printk("%sPSCI state: %d\n", pfx, proc->psci_state);

One more abbreviation. Please consider whether having the abbreviations
or actually writing them out is more user-friendly.

> +	}
> +
> +	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);

That INDENT_SP thing is just silly, someone should kill it.

> +
> +	err_info = (struct cper_arm_err_info *)(proc + 1);
> +	for (i = 0; i < proc->err_info_num; i++) {
> +		printk("%sError info structure %d:\n", pfx, i);
> +		printk("%sversion:%d\n", newpfx, err_info->version);
> +		printk("%slength:%d\n", newpfx, err_info->length);

<---- newline here.

Why do we even dump version and info for *every* err_info structure?

> +		if (err_info->validation_bits &
> +		    CPER_ARM_INFO_VALID_MULTI_ERR) {
> +			if (err_info->multiple_error == 0)
> +				printk("%ssingle error\n", newpfx);
> +			else if (err_info->multiple_error == 1)
> +				printk("%smultiple errors\n", newpfx);
> +			else
> +				printk("%smultiple errors count:%u\n",
> +				newpfx, err_info->multiple_error);

So this can be simply: "num errors: %d", err_info->multiple_error+1...

Without checking CPER_ARM_INFO_VALID_MULTI_ERR.

> +		}

<---- newline here.

> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_FLAGS) {
> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_FIRST)
> +				printk("%sfirst error captured\n", newpfx);
> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_LAST)
> +				printk("%slast error captured\n", newpfx);
> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_PROPAGATED)
> +				printk("%spropagated error captured\n",
> +				       newpfx);
> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_OVERFLOW)
> +				printk("%soverflow occurred, error info is incomplete\n",
> +				       newpfx);
> +		}

<---- newline here.

> +		printk("%serror_type: %d, %s\n", newpfx, err_info->type,
> +			err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
> +			proc_error_type_strs[err_info->type] : "unknown");
> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
> +			printk("%serror_info: 0x%016llx\n", newpfx,
> +			       err_info->error_info);

err_info->error_info ?

What is that supposed to mean? A u64 value of some sorts.

> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
> +			printk("%svirtual fault address: 0x%016llx\n",
> +				newpfx, err_info->virt_fault_addr);
> +		if (err_info->validation_bits &
> +		    CPER_ARM_INFO_VALID_PHYSICAL_ADDR)

Just let that line stick out.

> +			printk("%sphysical fault address: 0x%016llx\n",
> +				newpfx, err_info->physical_fault_addr);
> +		err_info += 1;
> +	}

<---- newline here.

That function is kinda missing newlines.

> +	ctx_info = (struct cper_arm_ctx_info *)err_info;
> +	max_ctx_type = ARRAY_SIZE(arm_reg_ctx_strs) - 1;
> +	for (i = 0; i < proc->context_info_num; i++) {
> +		int size = sizeof(*ctx_info) + ctx_info->size;
> +
> +		printk("%sContext info structure %d:\n", pfx, i);
> +		if (len < size) {
> +			printk("%ssection length is too small\n", newpfx);
> +			printk("%sfirmware-generated error record is incorrect\n", pfx);
> +			return;
> +		}
> +		if (ctx_info->type > max_ctx_type) {
> +			printk("%sInvalid context type: %d\n", newpfx,
> +							ctx_info->type);
> +			printk("%sMax context type: %d\n", newpfx,
> +							max_ctx_type);
> +			return;

You can combine those into:

                        printk("%sInvalid context type: %d (max: %d)\n",
				newpfx, ctx_info->type, max_ctx_type);


> +		}
> +		printk("%sregister context type %d: %s\n", newpfx,
> +			ctx_info->type, arm_reg_ctx_strs[ctx_info->type]);

Why dump the type as %d and as a string too? String should be enough, no?

> +		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
> +				(ctx_info + 1), ctx_info->size, 0);
> +		len -= size;
> +		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + size);
> +	}
> +
> +	if (len > 0) {
> +		printk("%sVendor specific error info has %u bytes:\n", pfx,
> +		       len);
> +		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4, ctx_info,
> +				len, true);

That looks like it should be a debug printk...

> +	}
> +}
> +#endif
> +
>  static const char * const mem_err_type_strs[] = {
>  	"unknown",
>  	"no error",
> @@ -461,6 +586,16 @@ static void cper_estatus_timestamp(const char *pfx,
>  			cper_print_pcie(newpfx, pcie, gdata);
>  		else
>  			goto err_section_too_small;
> +	} else if ((IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)) &&
> +		   !uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARM)) {
> +		struct cper_sec_proc_arm *arm_err;
> +
> +		arm_err = acpi_hest_get_payload(gdata);

		struct cper_sec_proc_arm *arm_err = acpi_hest_get_payload(gdata);

> +		printk("%ssection_type: ARM processor error\n", newpfx);
> +		if (gdata->error_data_length >= sizeof(*arm_err))
> +			cper_print_proc_arm(newpfx, arm_err);
> +		else
> +			goto err_section_too_small;

You need to build-test your patches before submitting:

drivers/firmware/efi/cper.c: In function ‘cper_estatus_print_section’:
drivers/firmware/efi/cper.c:596:4: error: implicit declaration of function ‘cper_print_proc_arm’ [-Werror=implicit-function-declaration]
    cper_print_proc_arm(newpfx, arm_err);
    ^~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[3]: *** [drivers/firmware/efi/cper.o] Error 1
make[2]: *** [drivers/firmware/efi] Error 2
make[1]: *** [drivers/firmware] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....

this is a x86 build.

>  	} else
>  		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
>  
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index dcacb1a..85450f3 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -180,6 +180,10 @@ enum {
>  #define CPER_SEC_PROC_IPF						\
>  	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
>  		0x80, 0xC7, 0x3C, 0x88, 0x81)
> +/* Processor Specific: ARM */
> +#define CPER_SEC_PROC_ARM						\
> +	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
> +		0x1D, 0x5D, 0x46, 0xB0)
>  /* Platform Memory */
>  #define CPER_SEC_PLATFORM_MEM						\
>  	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
> @@ -255,6 +259,22 @@ enum {
>  
>  #define CPER_PCIE_SLOT_SHIFT			3
>  
> +#define CPER_ARM_VALID_MPIDR			0x00000001
> +#define CPER_ARM_VALID_AFFINITY_LEVEL		0x00000002
> +#define CPER_ARM_VALID_RUNNING_STATE		0x00000004
> +#define CPER_ARM_VALID_VENDOR_INFO		0x00000008
> +
> +#define CPER_ARM_INFO_VALID_MULTI_ERR		0x0001
> +#define CPER_ARM_INFO_VALID_FLAGS		0x0002
> +#define CPER_ARM_INFO_VALID_ERR_INFO		0x0004
> +#define CPER_ARM_INFO_VALID_VIRT_ADDR		0x0008
> +#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR	0x0010
> +
> +#define CPER_ARM_INFO_FLAGS_FIRST		0x0001
> +#define CPER_ARM_INFO_FLAGS_LAST		0x0002
> +#define CPER_ARM_INFO_FLAGS_PROPAGATED		0x0004
> +#define CPER_ARM_INFO_FLAGS_OVERFLOW		0x0008

For all of the above use BIT().

> +
>  /*
>   * All tables and structs must be byte-packed to match CPER
>   * specification, since the tables are provided by the system BIOS
> @@ -340,6 +360,40 @@ struct cper_ia_proc_ctx {
>  	__u64	mm_reg_addr;
>  };
>  
> +/* ARM Processor Error Section */
> +struct cper_sec_proc_arm {
> +	__u32	validation_bits;
> +	__u16	err_info_num; /* Number of Processor Error Info */
> +	__u16	context_info_num; /* Number of Processor Context Info Records*/
> +	__u32	section_length;
> +	__u8	affinity_level;
> +	__u8	reserved[3];	/* must be zero */
> +	__u64	mpidr;
> +	__u64	midr;
> +	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
> +	__u32	psci_state;

Align comments vertically pls.

> +};
> +
> +/* ARM Processor Error Information Structure */
> +struct cper_arm_err_info {
> +	__u8	version;
> +	__u8	length;
> +	__u16	validation_bits;
> +	__u8	type;
> +	__u16	multiple_error;
> +	__u8	flags;
> +	__u64	error_info;
> +	__u64	virt_fault_addr;
> +	__u64	physical_fault_addr;
> +};
> +
> +/* ARM Processor Context Information Structure */
> +struct cper_arm_ctx_info {
> +	__u16	version;
> +	__u16	type;
> +	__u32	size;
> +};

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-21 17:26       ` Borislav Petkov
@ 2017-04-21 18:08         ` Baicar, Tyler
  2017-04-21 18:12           ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-21 18:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On 4/21/2017 11:26 AM, Borislav Petkov wrote:
> On Fri, Apr 21, 2017 at 10:04:35AM -0600, Baicar, Tyler wrote:
>> This is basically what I already had in v14...you asked to move it into a
>> different if-statement? https://lkml.org/lkml/2017/4/12/397
> Well, clearly I've been smoking some nasty potent sh*t. :-\
>
> /me goes and looks at the spec:
>
> "Bit 0 – Timestamp is precise if this bit is set and correlates to the
> time of the error event."
>
> So why are we even printing the timestamp when !precise?
>
> IOW, I think we should do:
>
> 	if (!(timestamp[3] & 0x1))
> 		printk("%stimestamp imprecise\n", pfx);
> 	else {
> 		sec = ..
> 		min = ...
>
> 		...
> 	}
>
> and print the actual values only when the timestamp is precise.
> Otherwise it has *some* values which could just as well be completely
> random. And it's not like we're reporting the error tomorrow - it is
> mostly a couple of seconds from logging to the fw pushing it out...
The timestamp may still be useful when it is imprecise. In the polling 
case, you may only poll every minute or so, so the time may be useful. 
Also, I imagine there could be interrupt based errors happening much 
faster than the FW/OS handshake can happen. Maybe we can just use what I 
had before but also specify imprecise so that it is clear:

         printk("%s%ststamp: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
             (timestamp[3] & 0x1 ? "precise " : "imprecise "),
              century, year, mon, day, hour, min, sec);

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 03/11] cper: add timestamp print to CPER status printing
  2017-04-21 18:08         ` Baicar, Tyler
@ 2017-04-21 18:12           ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-21 18:12 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On Fri, Apr 21, 2017 at 12:08:43PM -0600, Baicar, Tyler wrote:
> The timestamp may still be useful when it is imprecise. In the polling case,
> you may only poll every minute or so, so the time may be useful.

Well, what is in the timestamp when !precise? Some random time or some
timestamp from a couple of seconds ago? How do you differentiate what
timestamp is bollocks and what is from a while ago?

Is the imprecise tstamp really close to the time the error happened or
pointing at 1970 - the beginning of unix time? :-)

I'm sure you've picked up by now that we don't trust the firmware one
bit.

> Also, I imagine there could be interrupt based errors happening much faster than the
> FW/OS handshake can happen. Maybe we can just use what I had before but also
> specify imprecise so that it is clear:
> 
>         printk("%s%ststamp: %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
>             (timestamp[3] & 0x1 ? "precise " : "imprecise "),
>              century, year, mon, day, hour, min, sec);

I guess.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-21 17:55   ` Borislav Petkov
@ 2017-04-21 18:22     ` Baicar, Tyler
  2017-04-24 17:52       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-21 18:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On 4/21/2017 11:55 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 05:05:16PM -0600, Tyler Baicar wrote:
>> Add support for ARM Common Platform Error Record (CPER).
>> UEFI 2.6 specification adds support for ARM specific
>> processor error information to be reported as part of the
>> CPER records. This provides more detail on for processor error logs.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
>> Reviewed-by: James Morse <james.morse@arm.com>
>> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>   drivers/firmware/efi/cper.c | 135 ++++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/cper.h        |  54 ++++++++++++++++++
>>   2 files changed, 189 insertions(+)
>>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 46585f9..f959185 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -110,12 +110,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
>>   static const char * const proc_type_strs[] = {
>>   	"IA32/X64",
>>   	"IA64",
>> +	"ARM",
>>   };
>>   
>>   static const char * const proc_isa_strs[] = {
>>   	"IA32",
>>   	"IA64",
>>   	"X64",
>> +	"ARM A32/T32",
>> +	"ARM A64",
>>   };
>>   
>>   static const char * const proc_error_type_strs[] = {
>> @@ -184,6 +187,128 @@ static void cper_print_proc_generic(const char *pfx,
>>   		printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
>>   }
>>   
>> +#if defined(CONFIG_ARM64) || defined(CONFIG_ARM)
>> +static const char * const arm_reg_ctx_strs[] = {
>> +	"AArch32 general purpose registers",
>> +	"AArch32 EL1 context registers",
>> +	"AArch32 EL2 context registers",
>> +	"AArch32 secure context registers",
>> +	"AArch64 general purpose registers",
>> +	"AArch64 EL1 context registers",
>> +	"AArch64 EL2 context registers",
>> +	"AArch64 EL3 context registers",
>> +	"Misc. system register structure",
>> +};
>> +
>> +static void cper_print_proc_arm(const char *pfx,
>> +				const struct cper_sec_proc_arm *proc)
>> +{
>> +	int i, len, max_ctx_type;
>> +	struct cper_arm_err_info *err_info;
>> +	struct cper_arm_ctx_info *ctx_info;
>> +	char newpfx[64];
>> +
>> +	printk("%ssection length: %d\n", pfx, proc->section_length);
> We need to dump section length because?
I guess it's not really needed. It just may be useful considering there 
can be numerous error info structures, numerous context info structures, 
and a variable length vendor information section. I can move this print 
to only in the length check failure cases.
>
>> +	printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
>> +
>> +	len = proc->section_length - (sizeof(*proc) +
>> +		proc->err_info_num * (sizeof(*err_info)));
>> +	if (len < 0) {
>> +		printk("%ssection length is too small\n", pfx);
> Now here we *can* dump it.
>
>> +		printk("%sfirmware-generated error record is incorrect\n", pfx);
>> +		printk("%sERR_INFO_NUM is %d\n", pfx, proc->err_info_num);
>> +		return;
>> +	}
>> +
>> +	if (proc->validation_bits & CPER_ARM_VALID_MPIDR)
>> +		printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);
>
> <---- newline here.
>
> Also, what is MPIDR and can it be written in a more user-friendly manner
> and not be an abbreviation?
>
>> +	if (proc->validation_bits & CPER_ARM_VALID_AFFINITY_LEVEL)
>> +		printk("%serror affinity level: %d\n", pfx,
>> +			proc->affinity_level);
>> +	if (proc->validation_bits & CPER_ARM_VALID_RUNNING_STATE) {
>> +		printk("%srunning state: 0x%x\n", pfx, proc->running_state);
>> +		printk("%sPSCI state: %d\n", pfx, proc->psci_state);
> One more abbreviation. Please consider whether having the abbreviations
> or actually writing them out is more user-friendly.
>
>> +	}
>> +
>> +	snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> That INDENT_SP thing is just silly, someone should kill it.
>
>> +
>> +	err_info = (struct cper_arm_err_info *)(proc + 1);
>> +	for (i = 0; i < proc->err_info_num; i++) {
>> +		printk("%sError info structure %d:\n", pfx, i);
>> +		printk("%sversion:%d\n", newpfx, err_info->version);
>> +		printk("%slength:%d\n", newpfx, err_info->length);
> <---- newline here.
>
> Why do we even dump version and info for *every* err_info structure?
Because these are part of the error information structure. I wouldn't 
think FW would populate error information structures that are different 
versions in the same processor error, but it could be possible from the 
spec (at least once there are different versions of the table).
>
>> +		if (err_info->validation_bits &
>> +		    CPER_ARM_INFO_VALID_MULTI_ERR) {
>> +			if (err_info->multiple_error == 0)
>> +				printk("%ssingle error\n", newpfx);
>> +			else if (err_info->multiple_error == 1)
>> +				printk("%smultiple errors\n", newpfx);
>> +			else
>> +				printk("%smultiple errors count:%u\n",
>> +				newpfx, err_info->multiple_error);
> So this can be simply: "num errors: %d", err_info->multiple_error+1...
>
> Without checking CPER_ARM_INFO_VALID_MULTI_ERR.
>
>> +		}
> <---- newline here.
>
>> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_FLAGS) {
>> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_FIRST)
>> +				printk("%sfirst error captured\n", newpfx);
>> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_LAST)
>> +				printk("%slast error captured\n", newpfx);
>> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_PROPAGATED)
>> +				printk("%spropagated error captured\n",
>> +				       newpfx);
>> +			if (err_info->flags & CPER_ARM_INFO_FLAGS_OVERFLOW)
>> +				printk("%soverflow occurred, error info is incomplete\n",
>> +				       newpfx);
>> +		}
> <---- newline here.
>
>> +		printk("%serror_type: %d, %s\n", newpfx, err_info->type,
>> +			err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
>> +			proc_error_type_strs[err_info->type] : "unknown");
>> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
>> +			printk("%serror_info: 0x%016llx\n", newpfx,
>> +			       err_info->error_info);
> err_info->error_info ?
>
> What is that supposed to mean? A u64 value of some sorts.
There is an error information 64 bit value in the ARM processor error 
information structure. (UEFI spec 2.6 table 261)
>
>> +		if (err_info->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
>> +			printk("%svirtual fault address: 0x%016llx\n",
>> +				newpfx, err_info->virt_fault_addr);
>> +		if (err_info->validation_bits &
>> +		    CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
> Just let that line stick out.
>
>> +			printk("%sphysical fault address: 0x%016llx\n",
>> +				newpfx, err_info->physical_fault_addr);
>> +		err_info += 1;
>> +	}
> <---- newline here.
>
> That function is kinda missing newlines.
>
>> +	ctx_info = (struct cper_arm_ctx_info *)err_info;
>> +	max_ctx_type = ARRAY_SIZE(arm_reg_ctx_strs) - 1;
>> +	for (i = 0; i < proc->context_info_num; i++) {
>> +		int size = sizeof(*ctx_info) + ctx_info->size;
>> +
>> +		printk("%sContext info structure %d:\n", pfx, i);
>> +		if (len < size) {
>> +			printk("%ssection length is too small\n", newpfx);
>> +			printk("%sfirmware-generated error record is incorrect\n", pfx);
>> +			return;
>> +		}
>> +		if (ctx_info->type > max_ctx_type) {
>> +			printk("%sInvalid context type: %d\n", newpfx,
>> +							ctx_info->type);
>> +			printk("%sMax context type: %d\n", newpfx,
>> +							max_ctx_type);
>> +			return;
> You can combine those into:
>
>                          printk("%sInvalid context type: %d (max: %d)\n",
> 				newpfx, ctx_info->type, max_ctx_type);
>
>
>> +		}
>> +		printk("%sregister context type %d: %s\n", newpfx,
>> +			ctx_info->type, arm_reg_ctx_strs[ctx_info->type]);
> Why dump the type as %d and as a string too? String should be enough, no?
Yes, the string should be sufficient.
>
>> +		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
>> +				(ctx_info + 1), ctx_info->size, 0);
>> +		len -= size;
>> +		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + size);
>> +	}
>> +
>> +	if (len > 0) {
>> +		printk("%sVendor specific error info has %u bytes:\n", pfx,
>> +		       len);
>> +		print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4, ctx_info,
>> +				len, true);
> That looks like it should be a debug printk...
Why's that? Dumping this vendor specific error information is similar to 
the unrecognized CPER section reporting which is also meant for vendor 
specific information https://lkml.org/lkml/2017/4/18/751

Thanks,
Tyler
>
>> +	}
>> +}
>> +#endif
>> +
>>   static const char * const mem_err_type_strs[] = {
>>   	"unknown",
>>   	"no error",
>> @@ -461,6 +586,16 @@ static void cper_estatus_timestamp(const char *pfx,
>>   			cper_print_pcie(newpfx, pcie, gdata);
>>   		else
>>   			goto err_section_too_small;
>> +	} else if ((IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)) &&
>> +		   !uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARM)) {
>> +		struct cper_sec_proc_arm *arm_err;
>> +
>> +		arm_err = acpi_hest_get_payload(gdata);
> 		struct cper_sec_proc_arm *arm_err = acpi_hest_get_payload(gdata);
>
>> +		printk("%ssection_type: ARM processor error\n", newpfx);
>> +		if (gdata->error_data_length >= sizeof(*arm_err))
>> +			cper_print_proc_arm(newpfx, arm_err);
>> +		else
>> +			goto err_section_too_small;
> You need to build-test your patches before submitting:
>
> drivers/firmware/efi/cper.c: In function ‘cper_estatus_print_section’:
> drivers/firmware/efi/cper.c:596:4: error: implicit declaration of function ‘cper_print_proc_arm’ [-Werror=implicit-function-declaration]
>      cper_print_proc_arm(newpfx, arm_err);
>      ^~~~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
> make[3]: *** [drivers/firmware/efi/cper.o] Error 1
> make[2]: *** [drivers/firmware/efi] Error 2
> make[1]: *** [drivers/firmware] Error 2
> make[1]: *** Waiting for unfinished jobs....
> make: *** [drivers] Error 2
> make: *** Waiting for unfinished jobs....
>
> this is a x86 build.
>
>>   	} else
>>   		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
>>   
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index dcacb1a..85450f3 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -180,6 +180,10 @@ enum {
>>   #define CPER_SEC_PROC_IPF						\
>>   	UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00,	\
>>   		0x80, 0xC7, 0x3C, 0x88, 0x81)
>> +/* Processor Specific: ARM */
>> +#define CPER_SEC_PROC_ARM						\
>> +	UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05,	\
>> +		0x1D, 0x5D, 0x46, 0xB0)
>>   /* Platform Memory */
>>   #define CPER_SEC_PLATFORM_MEM						\
>>   	UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83,	\
>> @@ -255,6 +259,22 @@ enum {
>>   
>>   #define CPER_PCIE_SLOT_SHIFT			3
>>   
>> +#define CPER_ARM_VALID_MPIDR			0x00000001
>> +#define CPER_ARM_VALID_AFFINITY_LEVEL		0x00000002
>> +#define CPER_ARM_VALID_RUNNING_STATE		0x00000004
>> +#define CPER_ARM_VALID_VENDOR_INFO		0x00000008
>> +
>> +#define CPER_ARM_INFO_VALID_MULTI_ERR		0x0001
>> +#define CPER_ARM_INFO_VALID_FLAGS		0x0002
>> +#define CPER_ARM_INFO_VALID_ERR_INFO		0x0004
>> +#define CPER_ARM_INFO_VALID_VIRT_ADDR		0x0008
>> +#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR	0x0010
>> +
>> +#define CPER_ARM_INFO_FLAGS_FIRST		0x0001
>> +#define CPER_ARM_INFO_FLAGS_LAST		0x0002
>> +#define CPER_ARM_INFO_FLAGS_PROPAGATED		0x0004
>> +#define CPER_ARM_INFO_FLAGS_OVERFLOW		0x0008
> For all of the above use BIT().
>
>> +
>>   /*
>>    * All tables and structs must be byte-packed to match CPER
>>    * specification, since the tables are provided by the system BIOS
>> @@ -340,6 +360,40 @@ struct cper_ia_proc_ctx {
>>   	__u64	mm_reg_addr;
>>   };
>>   
>> +/* ARM Processor Error Section */
>> +struct cper_sec_proc_arm {
>> +	__u32	validation_bits;
>> +	__u16	err_info_num; /* Number of Processor Error Info */
>> +	__u16	context_info_num; /* Number of Processor Context Info Records*/
>> +	__u32	section_length;
>> +	__u8	affinity_level;
>> +	__u8	reserved[3];	/* must be zero */
>> +	__u64	mpidr;
>> +	__u64	midr;
>> +	__u32	running_state; /* Bit 0 set - Processor running. PSCI = 0 */
>> +	__u32	psci_state;
> Align comments vertically pls.
>
>> +};
>> +
>> +/* ARM Processor Error Information Structure */
>> +struct cper_arm_err_info {
>> +	__u8	version;
>> +	__u8	length;
>> +	__u16	validation_bits;
>> +	__u8	type;
>> +	__u16	multiple_error;
>> +	__u8	flags;
>> +	__u64	error_info;
>> +	__u64	virt_fault_addr;
>> +	__u64	physical_fault_addr;
>> +};
>> +
>> +/* ARM Processor Context Information Structure */
>> +struct cper_arm_ctx_info {
>> +	__u16	version;
>> +	__u16	type;
>> +	__u32	size;
>> +};

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-21 18:22     ` Baicar, Tyler
@ 2017-04-24 17:52       ` Borislav Petkov
  2017-04-25 16:05         ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-24 17:52 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On Fri, Apr 21, 2017 at 12:22:09PM -0600, Baicar, Tyler wrote:
> I guess it's not really needed. It just may be useful considering there can
> be numerous error info structures, numerous context info structures, and a
> variable length vendor information section. I can move this print to only in
> the length check failure cases.

And? Why does the user care?

I mean, it is good for debugging when you wanna see you're parsing the
error info data properly but otherwise it doesn't improve the error
reporting one bit.

> Because these are part of the error information structure. I wouldn't think
> FW would populate error information structures that are different versions
> in the same processor error, but it could be possible from the spec (at
> least once there are different versions of the table).

Same argument as above.

> There is an error information 64 bit value in the ARM processor error
> information structure. (UEFI spec 2.6 table 261)

So that's IP-dependent and explained in the following tables. Any plans
on decoding that too?

> Why's that? Dumping this vendor specific error information is similar to the
> unrecognized CPER section reporting which is also meant for vendor specific
> information https://lkml.org/lkml/2017/4/18/751

And how do those naked bytes help the user understand the error happening?

Even in your example you have:

[  140.739210] {1}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
[  140.739214] {1}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........

Which looks like some correctable ECC DRAM error and is actually begging
to be decoded in a human-readable form. So let's do that completely and
not dump partially decoded information.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-24 17:52       ` Borislav Petkov
@ 2017-04-25 16:05         ` Baicar, Tyler
  2017-04-25 16:31           ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-25 16:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On 4/24/2017 11:52 AM, Borislav Petkov wrote:
> On Fri, Apr 21, 2017 at 12:22:09PM -0600, Baicar, Tyler wrote:
>> I guess it's not really needed. It just may be useful considering there can
>> be numerous error info structures, numerous context info structures, and a
>> variable length vendor information section. I can move this print to only in
>> the length check failure cases.
> And? Why does the user care?
>
> I mean, it is good for debugging when you wanna see you're parsing the
> error info data properly but otherwise it doesn't improve the error
> reporting one bit.
I'll move this to just happen when the length check fails.
>> Because these are part of the error information structure. I wouldn't think
>> FW would populate error information structures that are different versions
>> in the same processor error, but it could be possible from the spec (at
>> least once there are different versions of the table).
> Same argument as above.
I can remove it then.
>
>> There is an error information 64 bit value in the ARM processor error
>> information structure. (UEFI spec 2.6 table 261)
> So that's IP-dependent and explained in the following tables. Any plans
> on decoding that too?
Yes, I do plan on adding further decoding for these values in the future.
>
>> Why's that? Dumping this vendor specific error information is similar to the
>> unrecognized CPER section reporting which is also meant for vendor specific
>> information https://lkml.org/lkml/2017/4/18/751
> And how do those naked bytes help the user understand the error happening?
>
> Even in your example you have:
>
> [  140.739210] {1}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
> [  140.739214] {1}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
>
> Which looks like some correctable ECC DRAM error and is actually begging
> to be decoded in a human-readable form. So let's do that completely and
> not dump partially decoded information.
That seems like something that should be done outside of these patches 
(if added to the kernel at all). The decoding for this information would 
all be vendor specific, so I'm not sure if we want to pollute the EFI 
code with vendor specific error decoding. Currently we are using the RAS 
Daemon user space tool for the decoding of this information since 
vendors can easily pick up this tool and add an extension for their 
vendor specific parsing. These prints will only happen when the firmware 
supports the vendor specific error information anyway.

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 04/11] efi: parse ARM processor error
  2017-04-25 16:05         ` Baicar, Tyler
@ 2017-04-25 16:31           ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-25 16:31 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, rafael, tony.luck, gengdongjiu, xiexiuqi

On Tue, Apr 25, 2017 at 10:05:31AM -0600, Baicar, Tyler wrote:
> That seems like something that should be done outside of these patches (if
> added to the kernel at all). The decoding for this information would all be
> vendor specific, so I'm not sure if we want to pollute the EFI code with
> vendor specific error decoding. Currently we are using the RAS Daemon user
> space tool for the decoding of this information since vendors can easily
> pick up this tool and add an extension for their vendor specific parsing.
> These prints will only happen when the firmware supports the vendor specific
> error information anyway.

The same questions apply here: what do you do if the machine panics
because the error is fatal and you can't get it to run any userspace?
Wouldn't it be good to decode the whole error?

Right now we photograph screens on Intel x86 and feed the typed MCA info
by hand to mcelog. Hardly an optimal situation.

On AMD, there's a decoder which actually can dump the decoded critical
error. (Or could - that's in flux again :-\).

So yes, if stuff is too vendor-specific then you probably don't
want to decode it (or add a vendor-specific decoding module like
edac_mce_amd.ko, for example). But the tables from the UEFI spec,
documenting IP-specific error types which should be valid for most if
not all ARM64 implementations adhering to the spec, would be useful to
decode.

In general, the more we can decode the error in the kernel and the less
we need an external help to do so, the better.

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
@ 2017-04-25 17:21   ` Borislav Petkov
  2017-04-25 17:41     ` Baicar, Tyler
  2017-05-08 17:28   ` James Morse
  2017-08-14  7:55   ` Xiongfeng Wang
  2 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-04-25 17:21 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:18PM -0600, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().

...

> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>  		inf->name, esr, addr);
>  
> +	/*
> +	 * Synchronous aborts may interrupt code which had interrupts masked.
> +	 * Before calling out into the wider kernel tell the interested
> +	 * subsystems.
> +	 */
> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> +		nmi_enter();
> +		ghes_notify_sea();
> +		nmi_exit();
> +	}

Well, the other GHES notification methods use a notifier:
ghes_notify_sci, ghes_notify_nmi. You probably should do that too
instead of calling straight into a driver from arch code.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-25 17:21   ` Borislav Petkov
@ 2017-04-25 17:41     ` Baicar, Tyler
  2017-04-25 17:46       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-04-25 17:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On 4/25/2017 11:21 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 05:05:18PM -0600, Tyler Baicar wrote:
>> ARM APEI extension proposal added SEA (Synchronous External Abort)
>> notification type for ARMv8.
>> Add a new GHES error source handling function for SEA. If an error
>> source's notification type is SEA, then this function can be registered
>> into the SEA exception handler. That way GHES will parse and report
>> SEA exceptions when they occur.
>> An SEA can interrupt code that had interrupts masked and is treated as
>> an NMI. To aid this the page of address space for mapping APEI buffers
>> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
>> changed to use the helper methods to find the prot_t to map with in
>> the same way as ghes_ioremap_pfn_irq().
> ...
>
>> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>   	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>>   		inf->name, esr, addr);
>>   
>> +	/*
>> +	 * Synchronous aborts may interrupt code which had interrupts masked.
>> +	 * Before calling out into the wider kernel tell the interested
>> +	 * subsystems.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
>> +		nmi_enter();
>> +		ghes_notify_sea();
>> +		nmi_exit();
>> +	}
> Well, the other GHES notification methods use a notifier:
> ghes_notify_sci, ghes_notify_nmi. You probably should do that too
> instead of calling straight into a driver from arch code.
I originally had this as a notifier, but Will requested to remove the 
notifier. That conversation is here: https://lkml.org/lkml/2017/1/18/1018

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-25 17:41     ` Baicar, Tyler
@ 2017-04-25 17:46       ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-25 17:46 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 25, 2017 at 11:41:39AM -0600, Baicar, Tyler wrote:
> I originally had this as a notifier, but Will requested to remove the
> notifier. That conversation is here: https://lkml.org/lkml/2017/1/18/1018

Yeah, he mentioned on IRC. I just think notifiers would be the cleaner
thing but whatever you guys say.

Just we had a nasty hack on x86 which I got rid of recently:

https://lkml.kernel.org/r/20170406090634.30950-1-bp@alien8.de

and I wouldn't want you guys to have the same "fun". :)

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block
  2017-04-18 23:05 ` [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block Tyler Baicar
@ 2017-04-28 13:07   ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-04-28 13:07 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:19PM -0600, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
> 
> Even if an error status block's severity is fatal, the kernel does not
> honor the severity level and panic.
> 
> With the firmware first model, the platform could inform the OS about a
> fatal hardware error through the non-NMI GHES notification type. The OS
> should panic when a hardware error record is received with this
> severity.
> 
> Call panic() after CPER data in error status block is printed if
> severity is fatal, before each error section is handled.
> 
> Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  drivers/acpi/apei/ghes.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2d387f8..b91123f 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -134,6 +134,8 @@
>  static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
>  static atomic_t ghes_estatus_cache_alloced;
>  
> +static int ghes_panic_timeout __read_mostly = 30;
> +
>  static int ghes_ioremap_init(void)
>  {
>  	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
> @@ -692,6 +694,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
>  	return apei_write(val, &generic_v2->read_ack_register);
>  }
>  
> +static void __ghes_call_panic(void)

__ghes_panic()

> +{
> +	if (panic_timeout == 0)

	if (!panic_timeout)

> +		panic_timeout = ghes_panic_timeout;
> +	panic("Fatal hardware error!");
> +}
> +
>  static int ghes_proc(struct ghes *ghes)
>  {
>  	int rc;
> @@ -699,6 +708,10 @@ static int ghes_proc(struct ghes *ghes)
>  	rc = ghes_read_estatus(ghes, 0);
>  	if (rc)
>  		goto out;

<---- newline here.

> +	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
> +		__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> +		__ghes_call_panic();
> +	}

ditto.

>  	if (!ghes_estatus_cached(ghes->estatus)) {
>  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
>  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 08/11] efi: print unrecognized CPER section
  2017-04-18 23:05 ` [PATCH V15 08/11] efi: print unrecognized CPER section Tyler Baicar
@ 2017-05-05 13:27   ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-05-05 13:27 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:20PM -0600, Tyler Baicar wrote:
> UEFI spec allows for non-standard section in Common Platform Error
> Record. This is defined in section N.2.3 of UEFI version 2.5.
> 
> Currently if the CPER section's type (UUID) does not match with
> one of the section types that the kernel knows how to parse, the
> section is skipped. Therefore, user is not able to see
> such CPER data, for instance, error record of non-standard section.
> 
> For above mentioned case, this change prints out the raw data in
> hex in dmesg buffer.

... because? We'd need the reason why we're not ignoring those errors
anymore.

>  Data length is taken from Error Data length
> field of Generic Error Data Entry.
> 
> The following is a sample output from dmesg:
> [  140.739180] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [  140.739182] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
> [  140.739191] {1}[Hardware Error]: event severity: corrected
> [  140.739196] {1}[Hardware Error]:  time: precise 2017-03-15 20:37:35
> [  140.739197] {1}[Hardware Error]:  Error 0, type: corrected
> [  140.739203] {1}[Hardware Error]:   section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
> [  140.739205] {1}[Hardware Error]:   section length: 568 (0x238)
> [  140.739210] {1}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
> [  140.739214] {1}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
> [  140.739217] {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> [  140.739220] {1}[Hardware Error]:   00000030: 00000000 00000000 01010000 01010000  ................
> [  140.739223] {1}[Hardware Error]:   00000040: 00000000 00000000 00000005 00000000  ................
> [  140.739226] {1}[Hardware Error]:   00000050: 01010000 00000000 00000001 00dddd00  ................

Kill all those prefixes:

" Hardware error from APEI Generic Hardware Error Source: 2
  It has been corrected by h/w and requires no further action
  event severity: corrected
   time: precise 2017-03-15 20:37:35
   Error 0, type: corrected
    section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
    section length: 568 (0x238)
    00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
    00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
    00000020: 00000000 00000000 00000000 00000000  ................
    00000030: 00000000 00000000 01010000 01010000  ................
    00000040: 00000000 00000000 00000005 00000000  ................
    00000050: 01010000 00000000 00000001 00dddd00  ................
"

to the important info only.


> ...
> 
> The raw data from the error can then be decoded using vendor
> specific tools.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> ---
>  drivers/firmware/efi/cper.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index f959185..610d31a 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -596,8 +596,16 @@ static void cper_estatus_timestamp(const char *pfx,
>  			cper_print_proc_arm(newpfx, arm_err);
>  		else
>  			goto err_section_too_small;
> -	} else
> -		printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
> +	} else {
> +		const void *unknown_err;
> +
> +		unknown_err = acpi_hest_get_payload(gdata);

Simply:

	const void *err = acpi_hest_get_payload(gdata);

Short and sweet.

> +		printk("%ssection type: unknown, %pUl\n", newpfx, sec_type);
> +		printk("%ssection length: %d (%#x)\n", newpfx,

One number format is fine.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 09/11] ras: acpi / apei: generate trace event for unrecognized CPER section
  2017-04-18 23:05 ` [PATCH V15 09/11] ras: acpi / apei: generate trace event for " Tyler Baicar
@ 2017-05-05 17:53   ` Borislav Petkov
  2017-05-05 18:44   ` Steven Rostedt
  1 sibling, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-05-05 17:53 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:21PM -0600, Tyler Baicar wrote:
> UEFI spec allows for non-standard section in Common Platform Error
> Record. This is defined in section N.2.3 of UEFI version 2.5.

If the spec calls it non-standard why are we calling it "unknown
section"?

> Currently if the CPER section's type (UUID) does not match with
> any section type that the kernel knows how to parse, trace event
> is not generated for such section. And thus user is not able to know
> happening of such hardware error, including error record of
> non-standard section.

That's sentence sounds funny.

> This commit generates a trace event which contains raw error data
> for unrecognized CPER section.

Never write "This commit" or "This patch" in your commit message -
that's a given.

> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Tested-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  drivers/acpi/apei/ghes.c | 27 +++++++++++++++++++++++----
>  drivers/ras/ras.c        |  1 +
>  include/ras/ras_event.h  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b91123f..3d9f63b 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -45,11 +45,13 @@
>  #include <linux/aer.h>
>  #include <linux/nmi.h>
>  #include <linux/sched/clock.h>
> +#include <linux/uuid.h>
>  
>  #include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
>  #include <acpi/apei.h>
>  #include <asm/tlbflush.h>
> +#include <ras/ras_event.h>
>  
>  #include "apei-internal.h"
>  
> @@ -461,12 +463,21 @@ static void ghes_do_proc(struct ghes *ghes,
>  {
>  	int sev, sec_sev;
>  	struct acpi_hest_generic_data *gdata;
> +	uuid_le sec_type;
> +	uuid_le *fru_id = &NULL_UUID_LE;
> +	char *fru_text = "";
>  
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
>  		sec_sev = ghes_severity(gdata->error_severity);
> -		if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> -				 CPER_SEC_PLATFORM_MEM)) {
> +		sec_type = *(uuid_le *)gdata->section_type;
> +
> +		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
> +			fru_id = (uuid_le *)gdata->fru_id;
> +		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
> +			fru_text = gdata->fru_text;
> +
> +		if (!uuid_le_cmp(sec_type, CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err;
>  			mem_err = acpi_hest_get_payload(gdata);
>  			ghes_edac_report_mem_error(ghes, sev, mem_err);
> @@ -475,8 +486,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  			ghes_handle_memory_failure(gdata, sev);
>  		}
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -		else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> -				      CPER_SEC_PCIE)) {
> +		else if (!uuid_le_cmp(sec_type, CPER_SEC_PCIE)) {
>  			struct cper_sec_pcie *pcie_err;
>  			pcie_err = acpi_hest_get_payload(gdata);
>  			if (sev == GHES_SEV_RECOVERABLE &&
> @@ -507,6 +517,15 @@ static void ghes_do_proc(struct ghes *ghes,
>  
>  		}
>  #endif
> +#ifdef CONFIG_RAS
> +		else if (trace_unknown_sec_event_enabled()) {
> +			void *unknown_err = acpi_hest_get_payload(gdata);
> +
> +			trace_unknown_sec_event(&sec_type,
> +					fru_id, fru_text, sec_sev,
> +					unknown_err, gdata->error_data_length);
> +		}
> +#endif

Put that in a function in ras.c along with a prototype for
include/linux/ras.h for the !CONFIG_RAS case so that you can save
yourself the ifdeffery in an already not really easy to read function.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 09/11] ras: acpi / apei: generate trace event for unrecognized CPER section
  2017-04-18 23:05 ` [PATCH V15 09/11] ras: acpi / apei: generate trace event for " Tyler Baicar
  2017-05-05 17:53   ` Borislav Petkov
@ 2017-05-05 18:44   ` Steven Rostedt
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2017-05-05 18:44 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, bristot, linux-arm-kernel, kvmarm,
	kvm, linux-kernel, linux-acpi, linux-efi, devel, Suzuki.Poulose,
	punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
	joe, bp, rafael, tony.luck, gengdongjiu, xiexiuqi


Sorry for the late reply. Borislav pinged me to look at this.


On Tue, 18 Apr 2017 17:05:21 -0600
Tyler Baicar <tbaicar@codeaurora.org> wrote:



> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 1791a12..5861b6f 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -162,6 +162,51 @@
>  );
>  
>  /*
> + * Unknown Section Report
> + *
> + * This event is generated when hardware detected a hardware
> + * error event, which may be of non-standard section as defined
> + * in UEFI spec appendix "Common Platform Error Record", or may
> + * be of sections for which TRACE_EVENT is not defined.
> + *
> + */
> +TRACE_EVENT(unknown_sec_event,
> +
> +	TP_PROTO(const uuid_le *sec_type,
> +		 const uuid_le *fru_id,
> +		 const char *fru_text,
> +		 const u8 sev,
> +		 const u8 *err,
> +		 const u32 len),
> +
> +	TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
> +
> +	TP_STRUCT__entry(
> +		__array(char, sec_type, 16)
> +		__array(char, fru_id, 16)
> +		__string(fru_text, fru_text)
> +		__field(u8, sev)
> +		__field(u32, len)
> +		__dynamic_array(u8, buf, len)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->sec_type, sec_type, sizeof(uuid_le));
> +		memcpy(__entry->fru_id, fru_id, sizeof(uuid_le));

My only concern here is that you are using sizeof(uuid_le) into an
array that is hardcoded as 16 bytes. I don't expect the size of uuid_le
to ever change, but if it does, you just created an exploit.

I would suggest having a macro about the size of uuid_le and use both
here and include/uapi/linux/uuid.h.

#define UUID_SIZE

typedef struct {
	__u8 b[UUID_SIZE];
} uuid_le;

And then we can just use UUID_SIZE safely here:

	__array(char, sec_type, UUID_SIZE)

[...]

	memcpy(__entry->sec_type, sec_type, UUID_SIZE));

Alternatively we could add in the C file that defines the tracepoints:

BUILD_BUG(sizeof(uuid_le) > 16);

But that's hacky.


> +		__assign_str(fru_text, fru_text);
> +		__entry->sev = sev;
> +		__entry->len = len;
> +		memcpy(__get_dynamic_array(buf), err, len);
> +	),
> +
> +	TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",
> +		  __entry->sev, __entry->sec_type,

Hmm, I wonder if %pU is defined in the libtraceevent library?

-- Steve

> +		  __entry->fru_id, __get_str(fru_text),
> +		  __entry->len,
> +		  __print_hex(__get_dynamic_array(buf), __entry->len))
> +);
> +
> +/*
>   * PCIe AER Trace event
>   *
>   * These events are generated when hardware detects a corrected or

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
  2017-04-25 17:21   ` Borislav Petkov
@ 2017-05-08 17:28   ` James Morse
  2017-05-08 19:59     ` Baicar, Tyler
  2017-08-14  7:55   ` Xiongfeng Wang
  2 siblings, 1 reply; 41+ messages in thread
From: James Morse @ 2017-05-08 17:28 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
	sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi

Hi Tyler,

On 19/04/17 00:05, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index b74d8b7..10013ff 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>  		inf->name, esr, addr);
>  
> +	/*
> +	 * Synchronous aborts may interrupt code which had interrupts masked.
> +	 * Before calling out into the wider kernel tell the interested
> +	 * subsystems.
> +	 */
> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> +		nmi_enter();
> +		ghes_notify_sea();
> +		nmi_exit();
> +	}
> +
>  	info.si_signo = SIGBUS;
>  	info.si_errno = 0;
>  	info.si_code  = 0;


I was tidying up the masking/unmasking in entry.S, something I wasn't aware of
that leads to a bug:
entry.S will unmask interrupts for instruction/data aborts that came from a
context with interrupts enabled. This makes sense for get_user() and friends...
For do_sea() we pull nmi_enter() as this can interrupt interrupts-masked code,
such as APEI, but if we end up in here with interrupts unmasked we can take an
IRQ from this 'NMI' context, which will inherit the in_nmi() and could lead to
the deadlock we were originally trying to avoid.

Teaching entry.S to spot external aborts is messy. I think the two choices are
to either mask interrupts when calling nmi_enter() (as these things should be
mutually exclusive), or to conditionally call nmi_enter() based on
interrupts_enabled(regs). I prefer the second one as it matches the notify_sea()
while interruptible that happens when KVM takes one of these.



Thanks,

James

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 10/11] trace, ras: add ARM processor error trace event
  2017-04-18 23:05 ` [PATCH V15 10/11] trace, ras: add ARM processor error trace event Tyler Baicar
@ 2017-05-08 17:34   ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-05-08 17:34 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:22PM -0600, Tyler Baicar wrote:
> Currently there are trace events for the various RAS
> errors with the exception of ARM processor type errors.
> Add a new trace event for such errors so that the user
> will know when they occur. These trace events are
> consistent with the ARM processor error section type
> defined in UEFI 2.6 spec section N.2.4.4.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
> ---
>  drivers/acpi/apei/ghes.c    |  8 +++++++-
>  drivers/firmware/efi/cper.c |  1 +
>  drivers/ras/ras.c           |  1 +
>  include/ras/ras_event.h     | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 3d9f63b..612deb3 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -518,7 +518,13 @@ static void ghes_do_proc(struct ghes *ghes,
>  		}
>  #endif
>  #ifdef CONFIG_RAS
> -		else if (trace_unknown_sec_event_enabled()) {
> +		else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
> +			 trace_arm_event_enabled()) {
> +			struct cper_sec_proc_arm *arm_err;
> +
> +			arm_err = acpi_hest_get_payload(gdata);
> +			trace_arm_event(arm_err);
> +		} else if (trace_unknown_sec_event_enabled()) {
>  			void *unknown_err = acpi_hest_get_payload(gdata);
>  
>  			trace_unknown_sec_event(&sec_type,

As with the previous patch, move that code to ras.c so that you can get rid of
the ifdeffery here.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support
  2017-04-18 23:05 ` [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support Tyler Baicar
@ 2017-05-08 17:40   ` Borislav Petkov
  2017-05-08 19:54     ` Baicar, Tyler
  0 siblings, 1 reply; 41+ messages in thread
From: Borislav Petkov @ 2017-05-08 17:40 UTC (permalink / raw)
  To: Tyler Baicar
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Tue, Apr 18, 2017 at 05:05:23PM -0600, Tyler Baicar wrote:
> Currently external aborts are unsupported by the guest abort
> handling. Add handling for SEAs so that the host kernel reports
> SEAs which occur in the guest kernel.
> 
> When an SEA occurs in the guest kernel, the guest exits and is
> routed to kvm_handle_guest_abort(). Prior to this patch, a print
> message of an unsupported FSC would be printed and nothing else
> would happen. With this patch, the code gets routed to the APEI
> handling of SEAs in the host kernel to report the SEA information.
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@linaro.org>

...

> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 582a972..fd11855 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -29,6 +29,7 @@
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/virt.h>
> +#include <asm/system_misc.h>
>  
>  #include "trace.h"
>  
> @@ -1418,6 +1419,24 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>  		kvm_set_pfn_accessed(pfn);
>  }
>  
> +static bool is_abort_sea(unsigned long fault_status) {

ERROR: open brace '{' following function declarations go on the next line
#107: FILE: arch/arm/kvm/mmu.c:1422:
+static bool is_abort_sea(unsigned long fault_status) {

...

> @@ -611,6 +612,24 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  };
>  
>  /*
> + * Handle Synchronous External Aborts that occur in a guest kernel.
> + *
> + * The return value will be zero if the SEA was successfully handled
> + * and non-zero if there was an error processing the error or there was
> + * no error to process.
> + */
> +int handle_guest_sea(phys_addr_t addr, unsigned int esr)
> +{
> +	int ret = -ENOENT;
> +
> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> +		ret = ghes_notify_sea();
> +	}

WARNING: braces {} are not necessary for single statement blocks
#239: FILE: arch/arm64/mm/fault.c:625:
+       if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+               ret = ghes_notify_sea();
+       }


> +
> +	return ret;
> +}
> +
> +/*
>   * Dispatch a data abort to the relevant handler.
>   */
>  asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 612deb3..d286248 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -812,17 +812,18 @@ static int ghes_notify_sci(struct notifier_block *this,
>  #ifdef CONFIG_ACPI_APEI_SEA
>  static LIST_HEAD(ghes_sea);
>  
> -void ghes_notify_sea(void)
> +int ghes_notify_sea(void)
>  {
>  	struct ghes *ghes;
> +	int ret = -ENOENT;
>  
> -	/*
> -	 * synchronize_rcu() will wait for nmi_exit(), so no need to
> -	 * rcu_read_lock().
> -	 */
> +	rcu_read_lock();
>  	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
> -		ghes_proc(ghes);
> +		if(!ghes_proc(ghes))
> +			ret = 0;

What is the idea here: the first time ghes_proc() returns 0, ret is set
to 0 and all errors after it will be practically ignored. Looks like it
needs more love.

Also:

ERROR: space required before the open parenthesis '('
#271: FILE: drivers/acpi/apei/ghes.c:822:
+               if(!ghes_proc(ghes))

Please integrate scripts/checkpatch.pl in your patch creation workflow.
Some of the warnings/errors *actually* make sense.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support
  2017-05-08 17:40   ` Borislav Petkov
@ 2017-05-08 19:54     ` Baicar, Tyler
  2017-05-08 20:22       ` Borislav Petkov
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-05-08 19:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On 5/8/2017 11:40 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 05:05:23PM -0600, Tyler Baicar wrote:
>> Currently external aborts are unsupported by the guest abort
>> handling. Add handling for SEAs so that the host kernel reports
>> SEAs which occur in the guest kernel.
>>
>> When an SEA occurs in the guest kernel, the guest exits and is
>> routed to kvm_handle_guest_abort(). Prior to this patch, a print
>> message of an unsupported FSC would be printed and nothing else
>> would happen. With this patch, the code gets routed to the APEI
>> handling of SEAs in the host kernel to report the SEA information.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
>> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
>> Acked-by: Christoffer Dall <cdall@linaro.org>
...
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>>    * Dispatch a data abort to the relevant handler.
>>    */
>>   asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 612deb3..d286248 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -812,17 +812,18 @@ static int ghes_notify_sci(struct notifier_block *this,
>>   #ifdef CONFIG_ACPI_APEI_SEA
>>   static LIST_HEAD(ghes_sea);
>>   
>> -void ghes_notify_sea(void)
>> +int ghes_notify_sea(void)
>>   {
>>   	struct ghes *ghes;
>> +	int ret = -ENOENT;
>>   
>> -	/*
>> -	 * synchronize_rcu() will wait for nmi_exit(), so no need to
>> -	 * rcu_read_lock().
>> -	 */
>> +	rcu_read_lock();
>>   	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
>> -		ghes_proc(ghes);
>> +		if(!ghes_proc(ghes))
>> +			ret = 0;
> What is the idea here: the first time ghes_proc() returns 0, ret is set
> to 0 and all errors after it will be practically ignored. Looks like it
> needs more love.
This was discussed in the v12 and v13 patch series. There is existing 
code in kvm_handle_guest_abort
for injecting an abort back into the guest. We only want to do that if 
it was an abort that was not
handled by the firmware first handling. So here we verify that at least 
one of the SEA error sources
successfully reported an error record sent from the firmware. If there 
were no errors reported by
firmware, then we want to continue with the current implementation that 
will inject the virtual
abort. (kvm_inject_vabt)

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-05-08 17:28   ` James Morse
@ 2017-05-08 19:59     ` Baicar, Tyler
  2017-05-12 16:45       ` James Morse
  0 siblings, 1 reply; 41+ messages in thread
From: Baicar, Tyler @ 2017-05-08 19:59 UTC (permalink / raw)
  To: James Morse
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
	sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On 5/8/2017 11:28 AM, James Morse wrote:
> Hi Tyler,
>
> On 19/04/17 00:05, Tyler Baicar wrote:
>> ARM APEI extension proposal added SEA (Synchronous External Abort)
>> notification type for ARMv8.
>> Add a new GHES error source handling function for SEA. If an error
>> source's notification type is SEA, then this function can be registered
>> into the SEA exception handler. That way GHES will parse and report
>> SEA exceptions when they occur.
>> An SEA can interrupt code that had interrupts masked and is treated as
>> an NMI. To aid this the page of address space for mapping APEI buffers
>> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
>> changed to use the helper methods to find the prot_t to map with in
>> the same way as ghes_ioremap_pfn_irq().
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index b74d8b7..10013ff 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>   	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>>   		inf->name, esr, addr);
>>   
>> +	/*
>> +	 * Synchronous aborts may interrupt code which had interrupts masked.
>> +	 * Before calling out into the wider kernel tell the interested
>> +	 * subsystems.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
>> +		nmi_enter();
>> +		ghes_notify_sea();
>> +		nmi_exit();
>> +	}
>> +
>>   	info.si_signo = SIGBUS;
>>   	info.si_errno = 0;
>>   	info.si_code  = 0;
>
> I was tidying up the masking/unmasking in entry.S, something I wasn't aware of
> that leads to a bug:
> entry.S will unmask interrupts for instruction/data aborts that came from a
> context with interrupts enabled. This makes sense for get_user() and friends...
> For do_sea() we pull nmi_enter() as this can interrupt interrupts-masked code,
> such as APEI, but if we end up in here with interrupts unmasked we can take an
> IRQ from this 'NMI' context, which will inherit the in_nmi() and could lead to
> the deadlock we were originally trying to avoid.
>
> Teaching entry.S to spot external aborts is messy. I think the two choices are
> to either mask interrupts when calling nmi_enter() (as these things should be
> mutually exclusive), or to conditionally call nmi_enter() based on
> interrupts_enabled(regs). I prefer the second one as it matches the notify_sea()
> while interruptible that happens when KVM takes one of these.
Hello James,

So it would need to be like this?

         if(interrupts_enabled(regs))
             nmi_enter();
         ghes_notify_sea();
         if(interrupts_enabled(regs))
             nmi_exit();

Thanks,
Tyler

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support
  2017-05-08 19:54     ` Baicar, Tyler
@ 2017-05-08 20:22       ` Borislav Petkov
  0 siblings, 0 replies; 41+ messages in thread
From: Borislav Petkov @ 2017-05-08 20:22 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
	eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, rafael, tony.luck, gengdongjiu,
	xiexiuqi

On Mon, May 08, 2017 at 01:54:44PM -0600, Baicar, Tyler wrote:
> This was discussed in the v12 and v13 patch series. There is existing
> code in kvm_handle_guest_abort for injecting an abort back into the
> guest. We only want to do that if it was an abort that was not handled
> by the firmware first handling. So here we verify that at least one of
> the SEA error sources successfully reported an error record sent from
> the firmware. If there were no errors reported by firmware, then we
> want to continue with the current implementation that will inject the
> virtual abort. (kvm_inject_vabt)

So this needs to be in a comment there. This is generic code in the
sense that it is in drivers/acpi/ and it should say why it is doing that
special thing. I know, SEA is ARM-only, as far as I'm gathering from
reviewing this, but this behavior needs to be documented as it is not
obvious.

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-05-08 19:59     ` Baicar, Tyler
@ 2017-05-12 16:45       ` James Morse
  0 siblings, 0 replies; 41+ messages in thread
From: James Morse @ 2017-05-12 16:45 UTC (permalink / raw)
  To: Baicar, Tyler
  Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
	catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
	lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
	sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
	paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
	kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
	Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
	john.garry, shiju.jose, joe, bp, rafael, tony.luck, gengdongjiu,
	xiexiuqi

Hi Tyler,

On 08/05/17 20:59, Baicar, Tyler wrote:
> On 5/8/2017 11:28 AM, James Morse wrote:
>> I was tidying up the masking/unmasking in entry.S, something I wasn't aware of
>> that leads to a bug:
>> entry.S will unmask interrupts for instruction/data aborts that came from a
>> context with interrupts enabled. This makes sense for get_user() and friends...
>> For do_sea() we pull nmi_enter() as this can interrupt interrupts-masked code,
>> such as APEI, but if we end up in here with interrupts unmasked we can take an
>> IRQ from this 'NMI' context, which will inherit the in_nmi() and could lead to
>> the deadlock we were originally trying to avoid.
>>
>> Teaching entry.S to spot external aborts is messy. I think the two choices are
>> to either mask interrupts when calling nmi_enter() (as these things should be
>> mutually exclusive), or to conditionally call nmi_enter() based on
>> interrupts_enabled(regs). I prefer the second one as it matches the notify_sea()
>> while interruptible that happens when KVM takes one of these.

Thinking about this some more: the KVM case is different as we know it was a
guest that triggered the external abort. Nothing the host kernel does is likely
to trigger either the same error or a related one.

But I can't think of a way this would trip twice on the host... yes your
suggestion looks fine.

(When we add SError/SEI support too we will need to change it as SEA may
interrupt SEI, and nmi_enter() has a BUG_ON(in_nmi()), so this nesting will need
explicitly checking.)


Thanks,

James

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
  2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
  2017-04-25 17:21   ` Borislav Petkov
  2017-05-08 17:28   ` James Morse
@ 2017-08-14  7:55   ` Xiongfeng Wang
  2 siblings, 0 replies; 41+ messages in thread
From: Xiongfeng Wang @ 2017-08-14  7:55 UTC (permalink / raw)
  To: Tyler Baicar, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
	linux, catalin.marinas, will.deacon, rjw, lenb, matt,
	robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
	james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
	shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
	bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
	linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
	hanjun.guo, john.garry, shiju.jose, joe, bp, rafael, tony.luck,
	gengdongjiu, xiexiuqi

Hi Tyler,

On 2017/4/19 7:05, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().
> 
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
> Reviewed-by: James Morse <james.morse@arm.com>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> ---

> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>  	pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>  		inf->name, esr, addr);
>  
> +	/*
> +	 * Synchronous aborts may interrupt code which had interrupts masked.
> +	 * Before calling out into the wider kernel tell the interested
> +	 * subsystems.
> +	 */
> +	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> +		nmi_enter();
> +		ghes_notify_sea();
> +		nmi_exit();
> +	}
> +
>  	info.si_signo = SIGBUS;
>  	info.si_errno = 0;
>  	info.si_code  = 0;
For instruction abort, if there exists memory section in ghes, we will call memory_failure() in ghes_notify_sea()
and reread the instruction from the disk. In this case, we don't have to send SIGBUS to the application.
But memory_failure() is scheduled in a work queue, we don't what the result of memory_failure will be when ghes_notify_sea() returned.
Do you have any idea about how to fix this, so we don't have to kill the application in the instruction abort case.

Thanks,
Wang Xiongfeng


> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index b0140c8..de14d49 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -39,6 +39,21 @@ config ACPI_APEI_PCIEAER
>  	  PCIe AER errors may be reported via APEI firmware first mode.
>  	  Turn on this option to enable the corresponding support.
>  
> +config ACPI_APEI_SEA
> +	bool "APEI Synchronous External Abort logging/recovering support"
> +	depends on ARM64 && ACPI_APEI_GHES
> +	default y
> +	help
> +	  This option should be enabled if the system supports
> +	  firmware first handling of SEA (Synchronous External Abort).
> +	  SEA happens with certain faults of data abort or instruction
> +	  abort synchronous exceptions on ARMv8 systems. If a system
> +	  supports firmware first handling of SEA, the platform analyzes
> +	  and handles hardware error notifications from SEA, and it may then
> +	  form a HW error record for the OS to parse and handle. This
> +	  option allows the OS to look for such hardware error record, and
> +	  take appropriate action.
> +
>  config ACPI_APEI_MEMORY_FAILURE
>  	bool "APEI memory error recovering support"
>  	depends on ACPI_APEI && MEMORY_FAILURE
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index dfb7dd2..2d387f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -115,11 +115,7 @@
>   * Two virtual pages are used, one for IRQ/PROCESS context, the other for
>   * NMI context (optionally).
>   */
> -#ifdef CONFIG_HAVE_ACPI_APEI_NMI
>  #define GHES_IOREMAP_PAGES           2
> -#else
> -#define GHES_IOREMAP_PAGES           1
> -#endif
>  #define GHES_IOREMAP_IRQ_PAGE(base)	(base)
>  #define GHES_IOREMAP_NMI_PAGE(base)	((base) + PAGE_SIZE)
>  
> @@ -158,10 +154,14 @@ static void ghes_ioremap_exit(void)
>  static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
>  {
>  	unsigned long vaddr;
> +	phys_addr_t paddr;
> +	pgprot_t prot;
>  
>  	vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
> -	ioremap_page_range(vaddr, vaddr + PAGE_SIZE,
> -			   pfn << PAGE_SHIFT, PAGE_KERNEL);
> +
> +	paddr = pfn << PAGE_SHIFT;
> +	prot = arch_apei_get_mem_attribute(paddr);
> +	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
>  
>  	return (void __iomem *)vaddr;
>  }
> @@ -771,6 +771,50 @@ static int ghes_notify_sci(struct notifier_block *this,
>  	.notifier_call = ghes_notify_sci,
>  };
>  
> +#ifdef CONFIG_ACPI_APEI_SEA
> +static LIST_HEAD(ghes_sea);
> +
> +void ghes_notify_sea(void)
> +{
> +	struct ghes *ghes;
> +
> +	/*
> +	 * synchronize_rcu() will wait for nmi_exit(), so no need to
> +	 * rcu_read_lock().
> +	 */
> +	list_for_each_entry_rcu(ghes, &ghes_sea, list) {
> +		ghes_proc(ghes);
> +	}
> +}
> +
> +static void ghes_sea_add(struct ghes *ghes)
> +{
> +	mutex_lock(&ghes_list_mutex);
> +	list_add_rcu(&ghes->list, &ghes_sea);
> +	mutex_unlock(&ghes_list_mutex);
> +}
> +
> +static void ghes_sea_remove(struct ghes *ghes)
> +{
> +	mutex_lock(&ghes_list_mutex);
> +	list_del_rcu(&ghes->list);
> +	mutex_unlock(&ghes_list_mutex);
> +	synchronize_rcu();
> +}
> +#else /* CONFIG_ACPI_APEI_SEA */
> +static inline void ghes_sea_add(struct ghes *ghes)
> +{
> +	pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
> +	       ghes->generic->header.source_id);
> +}
> +
> +static inline void ghes_sea_remove(struct ghes *ghes)
> +{
> +	pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
> +	       ghes->generic->header.source_id);
> +}
> +#endif /* CONFIG_ACPI_APEI_SEA */
> +
>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>  /*
>   * printk is not safe in NMI context.  So in NMI handler, we allocate
> @@ -1016,6 +1060,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  	case ACPI_HEST_NOTIFY_EXTERNAL:
>  	case ACPI_HEST_NOTIFY_SCI:
>  		break;
> +	case ACPI_HEST_NOTIFY_SEA:
> +		if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> +			pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
> +				generic->header.source_id);
> +			rc = -ENOTSUPP;
> +			goto err;
> +		}
> +		break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
>  			pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
> @@ -1081,6 +1133,9 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  		list_add_rcu(&ghes->list, &ghes_sci);
>  		mutex_unlock(&ghes_list_mutex);
>  		break;
> +	case ACPI_HEST_NOTIFY_SEA:
> +		ghes_sea_add(ghes);
> +		break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		ghes_nmi_add(ghes);
>  		break;
> @@ -1124,6 +1179,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  		mutex_unlock(&ghes_list_mutex);
>  		synchronize_rcu();
>  		break;
> +	case ACPI_HEST_NOTIFY_SEA:
> +		ghes_sea_remove(ghes);
> +		break;
>  	case ACPI_HEST_NOTIFY_NMI:
>  		ghes_nmi_remove(ghes);
>  		break;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index b89361a..ef0040893 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -1,3 +1,6 @@
> +#ifndef GHES_H
> +#define GHES_H
> +
>  #include <acpi/apei.h>
>  #include <acpi/hed.h>
>  
> @@ -95,3 +98,7 @@ static inline void *acpi_hest_get_payload(struct acpi_hest_generic_data *gdata)
>  
>  	return gdata + 1;
>  }
> +
> +void ghes_notify_sea(void);
> +
> +#endif /* GHES_H */
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2017-08-14  7:59 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-18 23:05 [PATCH V15 00/11] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2017-04-18 23:05 ` [PATCH V15 01/11] acpi: apei: read ack upon ghes record consumption Tyler Baicar
2017-04-19 18:31   ` Borislav Petkov
2017-04-19 20:31     ` Baicar, Tyler
2017-04-19 20:41       ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 02/11] ras: acpi/apei: cper: add support for generic data v3 structure Tyler Baicar
2017-04-20 11:14   ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 03/11] cper: add timestamp print to CPER status printing Tyler Baicar
2017-04-21 12:21   ` Borislav Petkov
2017-04-21 16:04     ` Baicar, Tyler
2017-04-21 17:26       ` Borislav Petkov
2017-04-21 18:08         ` Baicar, Tyler
2017-04-21 18:12           ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 04/11] efi: parse ARM processor error Tyler Baicar
2017-04-21 17:55   ` Borislav Petkov
2017-04-21 18:22     ` Baicar, Tyler
2017-04-24 17:52       ` Borislav Petkov
2017-04-25 16:05         ` Baicar, Tyler
2017-04-25 16:31           ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 05/11] arm64: exception: handle Synchronous External Abort Tyler Baicar
2017-04-18 23:05 ` [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
2017-04-25 17:21   ` Borislav Petkov
2017-04-25 17:41     ` Baicar, Tyler
2017-04-25 17:46       ` Borislav Petkov
2017-05-08 17:28   ` James Morse
2017-05-08 19:59     ` Baicar, Tyler
2017-05-12 16:45       ` James Morse
2017-08-14  7:55   ` Xiongfeng Wang
2017-04-18 23:05 ` [PATCH V15 07/11] acpi: apei: panic OS with fatal error status block Tyler Baicar
2017-04-28 13:07   ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 08/11] efi: print unrecognized CPER section Tyler Baicar
2017-05-05 13:27   ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 09/11] ras: acpi / apei: generate trace event for " Tyler Baicar
2017-05-05 17:53   ` Borislav Petkov
2017-05-05 18:44   ` Steven Rostedt
2017-04-18 23:05 ` [PATCH V15 10/11] trace, ras: add ARM processor error trace event Tyler Baicar
2017-05-08 17:34   ` Borislav Petkov
2017-04-18 23:05 ` [PATCH V15 11/11] arm/arm64: KVM: add guest SEA support Tyler Baicar
2017-05-08 17:40   ` Borislav Petkov
2017-05-08 19:54     ` Baicar, Tyler
2017-05-08 20:22       ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).