* [PATCH V12 01/10] acpi: apei: read ack upon ghes record consumption
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1 Tyler Baicar
From: Tyler Baicar @ 2017-03-06 20:44 UTC
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with the OS,
and may generate error records for the OS to consume.
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.
The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.
Also add support for parsing GHESv2 sub-tables.
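The acknowledgment performed by ghes_ack_error() below is a read-modify-write of the Read Ack register: keep only the bits selected by Read Ack Preserve, then OR in Read Ack Write, with both masks shifted by the register's bit offset. The following user-space sketch isolates that arithmetic. The register is simulated by a plain value, and `struct read_ack_desc` and `ack_value()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GHESv2 read-ack fields. */
struct read_ack_desc {
	uint8_t  bit_offset;    /* bit offset from the Read Ack Register GAS */
	uint64_t preserve_mask; /* Read Ack Preserve */
	uint64_t write_mask;    /* Read Ack Write */
};

/* Compute the value to write back, following the same steps as ghes_ack_error(). */
static uint64_t ack_value(uint64_t reg, const struct read_ack_desc *d)
{
	uint64_t val = reg;

	val &= d->preserve_mask << d->bit_offset; /* keep only preserved bits */
	val |= d->write_mask << d->bit_offset;    /* set the acknowledge bits */
	return val;
}
```

For example, with a zero bit offset, a preserve mask of all ones except bit 0 and a write mask of 0x1, a register reading 0xA0 is written back as 0xA1.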
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
drivers/acpi/apei/ghes.c | 49 +++++++++++++++++++++++++++++++++++++++++++++---
drivers/acpi/apei/hest.c | 7 +++++--
include/acpi/ghes.h | 5 ++++-
3 files changed, 55 insertions(+), 6 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index e53bef6..5e1ec41 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,6 +45,7 @@
#include <linux/aer.h>
#include <linux/nmi.h>
+#include <acpi/actbl1.h>
#include <acpi/ghes.h>
#include <acpi/apei.h>
#include <asm/tlbflush.h>
@@ -79,6 +80,10 @@
((struct acpi_hest_generic_status *) \
((struct ghes_estatus_node *)(estatus_node) + 1))
+#define IS_HEST_TYPE_GENERIC_V2(ghes) \
+ (((struct acpi_hest_header *)(ghes)->generic)->type == \
+ ACPI_HEST_TYPE_GENERIC_ERROR_V2)
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -248,10 +253,18 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
if (!ghes)
return ERR_PTR(-ENOMEM);
+
ghes->generic = generic;
+ if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
+ rc = apei_map_generic_address(
+ &ghes->generic_v2->read_ack_register);
+ if (rc)
+ goto err_free;
+ }
+
rc = apei_map_generic_address(&generic->error_status_address);
if (rc)
- goto err_free;
+ goto err_unmap_read_ack_addr;
error_block_length = generic->error_block_length;
if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
pr_warning(FW_WARN GHES_PFX
@@ -263,13 +276,17 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
if (!ghes->estatus) {
rc = -ENOMEM;
- goto err_unmap;
+ goto err_unmap_status_addr;
}
return ghes;
-err_unmap:
+err_unmap_status_addr:
apei_unmap_generic_address(&generic->error_status_address);
+err_unmap_read_ack_addr:
+ if (IS_HEST_TYPE_GENERIC_V2(ghes))
+ apei_unmap_generic_address(
+ &ghes->generic_v2->read_ack_register);
err_free:
kfree(ghes);
return ERR_PTR(rc);
@@ -279,6 +296,9 @@ static void ghes_fini(struct ghes *ghes)
{
kfree(ghes->estatus);
apei_unmap_generic_address(&ghes->generic->error_status_address);
+ if (IS_HEST_TYPE_GENERIC_V2(ghes))
+ apei_unmap_generic_address(
+ &ghes->generic_v2->read_ack_register);
}
static inline int ghes_severity(int severity)
@@ -648,6 +668,23 @@ static void ghes_estatus_cache_add(
rcu_read_unlock();
}
+static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
+{
+ int rc;
+ u64 val = 0;
+
+ rc = apei_read(&val, &generic_v2->read_ack_register);
+ if (rc)
+ return rc;
+ val &= generic_v2->read_ack_preserve <<
+ generic_v2->read_ack_register.bit_offset;
+ val |= generic_v2->read_ack_write <<
+ generic_v2->read_ack_register.bit_offset;
+ rc = apei_write(val, &generic_v2->read_ack_register);
+
+ return rc;
+}
+
static int ghes_proc(struct ghes *ghes)
{
int rc;
@@ -660,6 +697,12 @@ static int ghes_proc(struct ghes *ghes)
ghes_estatus_cache_add(ghes->generic, ghes->estatus);
}
ghes_do_proc(ghes, ghes->estatus);
+
+ if (IS_HEST_TYPE_GENERIC_V2(ghes)) {
+ rc = ghes_ack_error(ghes->generic_v2);
+ if (rc)
+ return rc;
+ }
out:
ghes_clear_estatus(ghes);
return rc;
diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
index 8f2a98e..456b488 100644
--- a/drivers/acpi/apei/hest.c
+++ b/drivers/acpi/apei/hest.c
@@ -52,6 +52,7 @@
[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
+ [ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
};
static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
@@ -141,7 +142,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
{
int *count = data;
- if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
+ if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
+ hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
(*count)++;
return 0;
}
@@ -152,7 +154,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
struct ghes_arr *ghes_arr = data;
int rc, i;
- if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
+ if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
+ hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
return 0;
if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 720446c..68f088a 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -13,7 +13,10 @@
#define GHES_EXITING 0x0002
struct ghes {
- struct acpi_hest_generic *generic;
+ union {
+ struct acpi_hest_generic *generic;
+ struct acpi_hest_generic_v2 *generic_v2;
+ };
struct acpi_hest_generic_status *estatus;
u64 buffer_paddr;
unsigned long flags;
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* [PATCH V12 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 01/10] acpi: apei: read ack upon ghes record consumption Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 03/10] efi: parse ARM processor error Tyler Baicar
From: Tyler Baicar @ 2017-03-06 20:44 UTC
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
Currently, RAS errors are reported without a timestamp.
ACPI 6.1 adds a timestamp field to the Generic Error Data
Entry v3 structure. When that field is valid, report the time
at which the firmware generated the error.
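cper_estatus_print_section_v300() below treats the 8-byte timestamp as packed BCD: byte 0 seconds, 1 minutes, 2 hours, 3 flags (bit 0 set means "precise"), 4 day, 5 month, 6 year and 7 century. The user-space sketch below reproduces that decoding and the same printk format string so the byte order is easy to check. `bcd_to_bin()` and `format_cper_timestamp()` are hypothetical helpers standing in for the kernel's bcd2bin() and printk().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode one packed-BCD byte (same result as the kernel's bcd2bin()). */
static unsigned int bcd_to_bin(uint8_t v)
{
	return (v & 0x0f) + (v >> 4) * 10;
}

/*
 * Format an 8-byte CPER timestamp with the format string used by
 * cper_estatus_print_section_v300(). Byte layout as read there:
 * 0 sec, 1 min, 2 hour, 3 flags (bit 0 = precise),
 * 4 day, 5 month, 6 year, 7 century.
 */
static int format_cper_timestamp(const uint8_t ts[8], char *buf, size_t len)
{
	return snprintf(buf, len, "%7s %02u%02u-%02u-%02u %02u:%02u:%02u",
			(ts[3] & 0x01) ? "precise" : "",
			bcd_to_bin(ts[7]), bcd_to_bin(ts[6]),
			bcd_to_bin(ts[5]), bcd_to_bin(ts[4]),
			bcd_to_bin(ts[2]), bcd_to_bin(ts[1]),
			bcd_to_bin(ts[0]));
}
```

With the bytes { 0x05, 0x44, 0x20, 0x01, 0x06, 0x03, 0x17, 0x20 } this produces "precise 2017-03-06 20:44:05"; when bit 0 of byte 3 is clear, "%7s" pads the empty qualifier to seven spaces instead.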
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
drivers/acpi/apei/ghes.c | 9 ++++---
drivers/firmware/efi/cper.c | 63 +++++++++++++++++++++++++++++++++++----------
include/acpi/ghes.h | 22 ++++++++++++++++
3 files changed, 77 insertions(+), 17 deletions(-)
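With this change, the loops in cper_estatus_print() and cper_estatus_check() size each entry from its own revision instead of assuming sizeof(struct acpi_hest_generic_data). The following is a minimal user-space sketch of the same bounds-checked walk. `struct gdata_v2`, `struct gdata_v3` and `count_gdata_entries()` are hypothetical stand-ins that are assumed to mirror the ACPICA layouts of acpi_hest_generic_data and acpi_hest_generic_data_v300, and the sketch relies on the GCC/Clang `packed` attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirrors of the generic error data entry headers. */
struct gdata_v2 {
	uint8_t  section_type[16];
	uint32_t error_severity;
	uint16_t revision;          /* major version in the high byte */
	uint8_t  validation_bits;
	uint8_t  flags;
	uint32_t error_data_length; /* payload bytes following the header */
	uint8_t  fru_id[16];
	uint8_t  fru_text[20];
} __attribute__((packed));

struct gdata_v3 {
	struct gdata_v2 base;
	uint64_t time_stamp;        /* added in version 3 (ACPI 6.1) */
} __attribute__((packed));

/* Header size chosen per entry, like acpi_hest_generic_data_size(). */
static size_t gdata_hdr_size(const struct gdata_v2 *g)
{
	return (g->revision >> 8) >= 3 ? sizeof(struct gdata_v3)
				       : sizeof(struct gdata_v2);
}

/*
 * Walk entries as cper_estatus_check() does: return the entry count,
 * or -1 if a payload overruns the block or bytes are left over.
 */
static int count_gdata_entries(const uint8_t *buf, size_t data_len)
{
	int n = 0;

	while (data_len >= sizeof(struct gdata_v2)) {
		struct gdata_v2 g;
		size_t hdr;

		memcpy(&g, buf, sizeof(g)); /* copy out to avoid unaligned access */
		hdr = gdata_hdr_size(&g);
		if (data_len < hdr || g.error_data_length > data_len - hdr)
			return -1;
		buf += hdr + g.error_data_length;
		data_len -= hdr + g.error_data_length;
		n++;
	}
	return data_len ? -1 : n;
}
```

A block containing one version 2 entry with a 4-byte payload and one version 3 entry with an 8-byte payload is 64 + 4 + 72 + 8 bytes long. Truncating it by a single byte makes the second payload overrun the block, so the walk reports an error.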
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 5e1ec41..b25e7cf 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -420,7 +420,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
struct cper_sec_mem_err *mem_err;
- mem_err = (struct cper_sec_mem_err *)(gdata + 1);
+
+ mem_err = acpi_hest_generic_data_payload(gdata);
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
@@ -457,7 +458,8 @@ static void ghes_do_proc(struct ghes *ghes,
if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err;
- mem_err = (struct cper_sec_mem_err *)(gdata+1);
+
+ mem_err = acpi_hest_generic_data_payload(gdata);
ghes_edac_report_mem_error(ghes, sev, mem_err);
arch_apei_report_mem_error(sev, mem_err);
@@ -467,7 +469,8 @@ static void ghes_do_proc(struct ghes *ghes,
else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
CPER_SEC_PCIE)) {
struct cper_sec_pcie *pcie_err;
- pcie_err = (struct cper_sec_pcie *)(gdata+1);
+
+ pcie_err = acpi_hest_generic_data_payload(gdata);
if (sev == GHES_SEV_RECOVERABLE &&
sec_sev == GHES_SEV_RECOVERABLE &&
pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index d425374..8fa4e23 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -32,6 +32,9 @@
#include <linux/acpi.h>
#include <linux/pci.h>
#include <linux/aer.h>
+#include <linux/printk.h>
+#include <linux/bcd.h>
+#include <acpi/ghes.h>
#define INDENT_SP " "
@@ -386,13 +389,37 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
pfx, pcie->bridge.secondary_status, pcie->bridge.control);
}
+static void cper_estatus_print_section_v300(const char *pfx,
+ const struct acpi_hest_generic_data_v300 *gdata)
+{
+ __u8 hour, min, sec, day, mon, year, century, *timestamp;
+
+ if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
+ timestamp = (__u8 *)&(gdata->time_stamp);
+ sec = bcd2bin(timestamp[0]);
+ min = bcd2bin(timestamp[1]);
+ hour = bcd2bin(timestamp[2]);
+ day = bcd2bin(timestamp[4]);
+ mon = bcd2bin(timestamp[5]);
+ year = bcd2bin(timestamp[6]);
+ century = bcd2bin(timestamp[7]);
+ printk("%stime: %7s %02d%02d-%02d-%02d %02d:%02d:%02d\n", pfx,
+ 0x01 & *(timestamp + 3) ? "precise" : "", century,
+ year, mon, day, hour, min, sec);
+ }
+}
+
static void cper_estatus_print_section(
- const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
+ const char *pfx, struct acpi_hest_generic_data *gdata, int sec_no)
{
uuid_le *sec_type = (uuid_le *)gdata->section_type;
__u16 severity;
char newpfx[64];
+ if (acpi_hest_generic_data_version(gdata) >= 3)
+ cper_estatus_print_section_v300(pfx,
+ (const struct acpi_hest_generic_data_v300 *)gdata);
+
severity = gdata->error_severity;
printk("%s""Error %d, type: %s\n", pfx, sec_no,
cper_severity_str(severity));
@@ -403,14 +430,18 @@ static void cper_estatus_print_section(
snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
- struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
+ struct cper_sec_proc_generic *proc_err;
+
+ proc_err = acpi_hest_generic_data_payload(gdata);
printk("%s""section_type: general processor error\n", newpfx);
if (gdata->error_data_length >= sizeof(*proc_err))
cper_print_proc_generic(newpfx, proc_err);
else
goto err_section_too_small;
} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
- struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
+ struct cper_sec_mem_err *mem_err;
+
+ mem_err = acpi_hest_generic_data_payload(gdata);
printk("%s""section_type: memory error\n", newpfx);
if (gdata->error_data_length >=
sizeof(struct cper_sec_mem_err_old))
@@ -419,7 +450,9 @@ static void cper_estatus_print_section(
else
goto err_section_too_small;
} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
- struct cper_sec_pcie *pcie = (void *)(gdata + 1);
+ struct cper_sec_pcie *pcie;
+
+ pcie = acpi_hest_generic_data_payload(gdata);
printk("%s""section_type: PCIe error\n", newpfx);
if (gdata->error_data_length >= sizeof(*pcie))
cper_print_pcie(newpfx, pcie, gdata);
@@ -438,7 +471,7 @@ void cper_estatus_print(const char *pfx,
const struct acpi_hest_generic_status *estatus)
{
struct acpi_hest_generic_data *gdata;
- unsigned int data_len, gedata_len;
+ unsigned int data_len;
int sec_no = 0;
char newpfx[64];
__u16 severity;
@@ -451,12 +484,13 @@ void cper_estatus_print(const char *pfx,
printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
data_len = estatus->data_length;
gdata = (struct acpi_hest_generic_data *)(estatus + 1);
+
snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
- while (data_len >= sizeof(*gdata)) {
- gedata_len = gdata->error_data_length;
+
+ while (data_len >= acpi_hest_generic_data_size(gdata)) {
cper_estatus_print_section(newpfx, gdata, sec_no);
- data_len -= gedata_len + sizeof(*gdata);
- gdata = (void *)(gdata + 1) + gedata_len;
+ data_len -= acpi_hest_generic_data_record_size(gdata);
+ gdata = acpi_hest_generic_data_next(gdata);
sec_no++;
}
}
@@ -486,12 +520,13 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
return rc;
data_len = estatus->data_length;
gdata = (struct acpi_hest_generic_data *)(estatus + 1);
- while (data_len >= sizeof(*gdata)) {
- gedata_len = gdata->error_data_length;
- if (gedata_len > data_len - sizeof(*gdata))
+
+ while (data_len >= acpi_hest_generic_data_size(gdata)) {
+ gedata_len = acpi_hest_generic_data_error_length(gdata);
+ if (gedata_len > data_len - acpi_hest_generic_data_size(gdata))
return -EINVAL;
- data_len -= gedata_len + sizeof(*gdata);
- gdata = (void *)(gdata + 1) + gedata_len;
+ data_len -= gedata_len + acpi_hest_generic_data_size(gdata);
+ gdata = acpi_hest_generic_data_next(gdata);
}
if (data_len)
return -EINVAL;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 68f088a..6ae318b 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -12,6 +12,18 @@
#define GHES_TO_CLEAR 0x0001
#define GHES_EXITING 0x0002
+#define acpi_hest_generic_data_error_length(gdata) \
+ (((struct acpi_hest_generic_data *)(gdata))->error_data_length)
+#define acpi_hest_generic_data_size(gdata) \
+ ((acpi_hest_generic_data_version(gdata) >= 3) ? \
+ sizeof(struct acpi_hest_generic_data_v300) : \
+ sizeof(struct acpi_hest_generic_data))
+#define acpi_hest_generic_data_record_size(gdata) \
+ (acpi_hest_generic_data_size(gdata) + \
+ acpi_hest_generic_data_error_length(gdata))
+#define acpi_hest_generic_data_next(gdata) \
+ ((void *)(gdata) + acpi_hest_generic_data_record_size(gdata))
+
struct ghes {
union {
struct acpi_hest_generic *generic;
@@ -73,3 +85,13 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
{
}
#endif
+
+#define acpi_hest_generic_data_version(gdata) \
+ ((gdata)->revision >> 8)
+
+static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
+{
+ return acpi_hest_generic_data_version(gdata) >= 3 ?
+ (void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
+ gdata + 1;
+}
* [PATCH V12 03/10] efi: parse ARM processor error
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 01/10] acpi: apei: read ack upon ghes record consumption Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 02/10] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1 Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 04/10] arm64: exception: handle Synchronous External Abort Tyler Baicar
From: Tyler Baicar @ 2017-03-06 20:44 UTC
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
Add support for the ARM processor error section of the
Common Platform Error Record (CPER). The UEFI 2.6
specification adds support for reporting ARM-specific
processor error information as part of CPER records.
This provides more detail in processor error logs.
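cper_print_proc_arm() below budgets the section as follows. After the fixed section header and err_info_num error info structures, each context info record consumes its own header plus its register payload, and whatever remains is printed as vendor-specific data. The sketch below performs only that accounting. The sizes (40, 32 and 8 bytes) are derived from the packed structures this patch adds to include/linux/cper.h, and `arm_vendor_bytes()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes of the packed structures added to include/linux/cper.h. */
enum {
	ARM_SEC_HDR_SIZE  = 40, /* struct cper_sec_proc_arm */
	ARM_ERR_INFO_SIZE = 32, /* struct cper_arm_err_info */
	ARM_CTX_HDR_SIZE  = 8,  /* struct cper_arm_ctx_info */
};

/*
 * Return the number of vendor-specific bytes left after the error info
 * and context info records, or -1 if section_length is too small.
 * ctx_sizes[i] is the register payload size of context record i.
 */
static long arm_vendor_bytes(uint32_t section_length, uint16_t err_info_num,
			     const uint32_t *ctx_sizes, uint16_t ctx_num)
{
	long len = (long)section_length -
		   (ARM_SEC_HDR_SIZE + (long)err_info_num * ARM_ERR_INFO_SIZE);

	if (len < 0)
		return -1; /* header plus error info does not fit */
	for (uint16_t i = 0; i < ctx_num; i++) {
		long size = ARM_CTX_HDR_SIZE + (long)ctx_sizes[i];

		if (len < size)
			return -1; /* context record overruns the section */
		len -= size;
	}
	return len;
}
```

For example, a section with two error info structures, one 16-byte context record and 4 trailing bytes declares section_length = 40 + 2 * 32 + (8 + 16) + 4 = 132, and the sketch reports 4 vendor-specific bytes.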
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
drivers/firmware/efi/cper.c | 133 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/cper.h | 54 ++++++++++++++++++
2 files changed, 187 insertions(+)
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 8fa4e23..56aa516 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -110,12 +110,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
static const char * const proc_type_strs[] = {
"IA32/X64",
"IA64",
+ "ARM",
};
static const char * const proc_isa_strs[] = {
"IA32",
"IA64",
"X64",
+ "ARM A32/T32",
+ "ARM A64",
};
static const char * const proc_error_type_strs[] = {
@@ -139,6 +142,18 @@ void cper_print_bits(const char *pfx, unsigned int bits,
"corrected",
};
+static const char * const arm_reg_ctx_strs[] = {
+ "AArch32 general purpose registers",
+ "AArch32 EL1 context registers",
+ "AArch32 EL2 context registers",
+ "AArch32 secure context registers",
+ "AArch64 general purpose registers",
+ "AArch64 EL1 context registers",
+ "AArch64 EL2 context registers",
+ "AArch64 EL3 context registers",
+ "Misc. system register structure",
+};
+
static void cper_print_proc_generic(const char *pfx,
const struct cper_sec_proc_generic *proc)
{
@@ -184,6 +199,114 @@ static void cper_print_proc_generic(const char *pfx,
printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
}
+static void cper_print_proc_arm(const char *pfx,
+ const struct cper_sec_proc_arm *proc)
+{
+ int i, len, max_ctx_type;
+ struct cper_arm_err_info *err_info;
+ struct cper_arm_ctx_info *ctx_info;
+ char newpfx[64];
+
+ printk("%ssection length: %d\n", pfx, proc->section_length);
+ printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
+
+ len = proc->section_length - (sizeof(*proc) +
+ proc->err_info_num * (sizeof(*err_info)));
+ if (len < 0) {
+ printk("%ssection length is too small\n", pfx);
+ printk("%sfirmware-generated error record is incorrect\n", pfx);
+ printk("%sERR_INFO_NUM is %d\n", pfx, proc->err_info_num);
+ return;
+ }
+
+ if (proc->validation_bits & CPER_ARM_VALID_MPIDR)
+ printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);
+ if (proc->validation_bits & CPER_ARM_VALID_AFFINITY_LEVEL)
+ printk("%serror affinity level: %d\n", pfx,
+ proc->affinity_level);
+ if (proc->validation_bits & CPER_ARM_VALID_RUNNING_STATE) {
+ printk("%srunning state: 0x%x\n", pfx, proc->running_state);
+ printk("%sPSCI state: %d\n", pfx, proc->psci_state);
+ }
+
+ snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
+
+ err_info = (struct cper_arm_err_info *)(proc + 1);
+ for (i = 0; i < proc->err_info_num; i++) {
+ printk("%sError info structure %d:\n", pfx, i);
+ printk("%sversion:%d\n", newpfx, err_info->version);
+ printk("%slength:%d\n", newpfx, err_info->length);
+ if (err_info->validation_bits &
+ CPER_ARM_INFO_VALID_MULTI_ERR) {
+ if (err_info->multiple_error == 0)
+ printk("%ssingle error\n", newpfx);
+ else if (err_info->multiple_error == 1)
+ printk("%smultiple errors\n", newpfx);
+ else
+ printk("%smultiple errors count:%u\n",
+ newpfx, err_info->multiple_error);
+ }
+ if (err_info->validation_bits & CPER_ARM_INFO_VALID_FLAGS) {
+ if (err_info->flags & CPER_ARM_INFO_FLAGS_FIRST)
+ printk("%sfirst error captured\n", newpfx);
+ if (err_info->flags & CPER_ARM_INFO_FLAGS_LAST)
+ printk("%slast error captured\n", newpfx);
+ if (err_info->flags & CPER_ARM_INFO_FLAGS_PROPAGATED)
+ printk("%spropagated error captured\n",
+ newpfx);
+ if (err_info->flags & CPER_ARM_INFO_FLAGS_OVERFLOW)
+ printk("%soverflow occurred, error info is incomplete\n",
+ newpfx);
+ }
+ printk("%serror_type: %d, %s\n", newpfx, err_info->type,
+ err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
+ proc_error_type_strs[err_info->type] : "unknown");
+ if (err_info->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+ printk("%serror_info: 0x%016llx\n", newpfx,
+ err_info->error_info);
+ if (err_info->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+ printk("%svirtual fault address: 0x%016llx\n",
+ newpfx, err_info->virt_fault_addr);
+ if (err_info->validation_bits &
+ CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+ printk("%sphysical fault address: 0x%016llx\n",
+ newpfx, err_info->physical_fault_addr);
+ err_info += 1;
+ }
+ ctx_info = (struct cper_arm_ctx_info *)err_info;
+ max_ctx_type = ARRAY_SIZE(arm_reg_ctx_strs) - 1;
+ for (i = 0; i < proc->context_info_num; i++) {
+ int size = sizeof(*ctx_info) + ctx_info->size;
+
+ printk("%sContext info structure %d:\n", pfx, i);
+ if (len < size) {
+ printk("%ssection length is too small\n", newpfx);
+ printk("%sfirmware-generated error record is incorrect\n", pfx);
+ return;
+ }
+ if (ctx_info->type > max_ctx_type) {
+ printk("%sInvalid context type: %d\n", newpfx,
+ ctx_info->type);
+ printk("%sMax context type: %d\n", newpfx,
+ max_ctx_type);
+ return;
+ }
+ printk("%sregister context type %d: %s\n", newpfx,
+ ctx_info->type, arm_reg_ctx_strs[ctx_info->type]);
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+ (ctx_info + 1), ctx_info->size, 0);
+ len -= size;
+ ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + size);
+ }
+
+ if (len > 0) {
+ printk("%sVendor specific error info has %u bytes:\n", pfx,
+ len);
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4, ctx_info,
+ len, true);
+ }
+}
+
static const char * const mem_err_type_strs[] = {
"unknown",
"no error",
@@ -458,6 +581,16 @@ static void cper_estatus_print_section(
cper_print_pcie(newpfx, pcie, gdata);
else
goto err_section_too_small;
+ } else if ((IS_ENABLED(CONFIG_ARM64) || IS_ENABLED(CONFIG_ARM)) &&
+ !uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARM)) {
+ struct cper_sec_proc_arm *arm_err;
+
+ arm_err = acpi_hest_generic_data_payload(gdata);
+ printk("%ssection_type: ARM processor error\n", newpfx);
+ if (gdata->error_data_length >= sizeof(*arm_err))
+ cper_print_proc_arm(newpfx, arm_err);
+ else
+ goto err_section_too_small;
} else
printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index dcacb1a..85450f3 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -180,6 +180,10 @@ enum {
#define CPER_SEC_PROC_IPF \
UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00, \
0x80, 0xC7, 0x3C, 0x88, 0x81)
+/* Processor Specific: ARM */
+#define CPER_SEC_PROC_ARM \
+ UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05, \
+ 0x1D, 0x5D, 0x46, 0xB0)
/* Platform Memory */
#define CPER_SEC_PLATFORM_MEM \
UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
@@ -255,6 +259,22 @@ enum {
#define CPER_PCIE_SLOT_SHIFT 3
+#define CPER_ARM_VALID_MPIDR 0x00000001
+#define CPER_ARM_VALID_AFFINITY_LEVEL 0x00000002
+#define CPER_ARM_VALID_RUNNING_STATE 0x00000004
+#define CPER_ARM_VALID_VENDOR_INFO 0x00000008
+
+#define CPER_ARM_INFO_VALID_MULTI_ERR 0x0001
+#define CPER_ARM_INFO_VALID_FLAGS 0x0002
+#define CPER_ARM_INFO_VALID_ERR_INFO 0x0004
+#define CPER_ARM_INFO_VALID_VIRT_ADDR 0x0008
+#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR 0x0010
+
+#define CPER_ARM_INFO_FLAGS_FIRST 0x0001
+#define CPER_ARM_INFO_FLAGS_LAST 0x0002
+#define CPER_ARM_INFO_FLAGS_PROPAGATED 0x0004
+#define CPER_ARM_INFO_FLAGS_OVERFLOW 0x0008
+
/*
* All tables and structs must be byte-packed to match CPER
* specification, since the tables are provided by the system BIOS
@@ -340,6 +360,40 @@ struct cper_ia_proc_ctx {
__u64 mm_reg_addr;
};
+/* ARM Processor Error Section */
+struct cper_sec_proc_arm {
+ __u32 validation_bits;
+ __u16 err_info_num; /* Number of Processor Error Info */
+ __u16 context_info_num; /* Number of Processor Context Info Records */
+ __u32 section_length;
+ __u8 affinity_level;
+ __u8 reserved[3]; /* must be zero */
+ __u64 mpidr;
+ __u64 midr;
+ __u32 running_state; /* Bit 0 set - Processor running. PSCI = 0 */
+ __u32 psci_state;
+};
+
+/* ARM Processor Error Information Structure */
+struct cper_arm_err_info {
+ __u8 version;
+ __u8 length;
+ __u16 validation_bits;
+ __u8 type;
+ __u16 multiple_error;
+ __u8 flags;
+ __u64 error_info;
+ __u64 virt_fault_addr;
+ __u64 physical_fault_addr;
+};
+
+/* ARM Processor Context Information Structure */
+struct cper_arm_ctx_info {
+ __u16 version;
+ __u16 type;
+ __u32 size;
+};
+
/* Old Memory Error Section UEFI 2.1, 2.2 */
struct cper_sec_mem_err_old {
__u64 validation_bits;
* [PATCH V12 04/10] arm64: exception: handle Synchronous External Abort
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 03/10] efi: parse ARM processor error Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
From: Tyler Baicar @ 2017-03-06 20:44 UTC
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
Synchronous External Abort (SEA) exceptions are often caused by
an uncorrected hardware error. They are identified by specific
Fault Status Code values in the data abort and instruction abort
exception classes.
When an SEA occurs, report the error in the kernel log before
killing the process.
Update the SEA entries in fault_info[] so that the new SEA
handler is used.
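fault_info[] is indexed by the low six bits of the ESR (the DFSC/IFSC field). The entries this patch switches to do_sea() are therefore fault status codes 0x10, 0x14 to 0x18, and 0x1c to 0x1f. do_sea() reports a zero address when the FnV bit says the FAR is not valid. The following sketch expresses both rules as plain functions. The macro and function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_FSC_MASK 0x3fu      /* DFSC/IFSC field, ESR bits [5:0] */
#define ESR_FNV      (1u << 10) /* FAR not valid, same bit as ESR_ELx_FnV */

/* True for the fault status codes that this patch routes to do_sea(). */
static bool fsc_uses_do_sea(uint32_t esr)
{
	uint32_t fsc = esr & ESR_FSC_MASK;

	return fsc == 0x10 ||                  /* synchronous external abort */
	       (fsc >= 0x14 && fsc <= 0x17) || /* SEA on table walk, levels 0-3 */
	       fsc == 0x18 ||                  /* synchronous parity or ECC error */
	       (fsc >= 0x1c && fsc <= 0x1f);   /* parity error on table walk */
}

/* Address that do_sea() reports in si_addr: 0 when FnV is set. */
static uintptr_t sea_si_addr(uint32_t esr, uintptr_t far)
{
	return (esr & ESR_FNV) ? 0 : far;
}
```

For example, an ESR of 0x96000010 has fault status code 0x10 and is handled by do_sea(). A level 3 translation fault (0x07) still goes through do_page_fault.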
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
arch/arm64/include/asm/esr.h | 1 +
arch/arm64/mm/fault.c | 43 +++++++++++++++++++++++++++++++++----------
2 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index d14c478..f20c64a 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -83,6 +83,7 @@
#define ESR_ELx_WNR (UL(1) << 6)
/* Shared ISS field definitions for Data/Instruction aborts */
+#define ESR_ELx_FnV (UL(1) << 10)
#define ESR_ELx_EA (UL(1) << 9)
#define ESR_ELx_S1PTW (UL(1) << 7)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 156169c..d178dc0 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -487,6 +487,29 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
return 1;
}
+/*
+ * This abort handler deals with Synchronous External Abort.
+ * It calls notifiers, and then returns "fault".
+ */
+static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+ struct siginfo info;
+
+ pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+ fault_name(esr), esr, addr);
+
+ info.si_signo = SIGBUS;
+ info.si_errno = 0;
+ info.si_code = 0;
+ if (esr & ESR_ELx_FnV)
+ info.si_addr = 0;
+ else
+ info.si_addr = (void __user *)addr;
+ arm64_notify_die("", regs, &info, esr);
+
+ return 0;
+}
+
static const struct fault_info {
int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
int sig;
@@ -509,22 +532,22 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
- { do_bad, SIGBUS, 0, "synchronous external abort" },
+ { do_sea, SIGBUS, 0, "synchronous external abort" },
{ do_bad, SIGBUS, 0, "unknown 17" },
{ do_bad, SIGBUS, 0, "unknown 18" },
{ do_bad, SIGBUS, 0, "unknown 19" },
- { do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error" },
+ { do_sea, SIGBUS, 0, "level 0 (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 1 (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 2 (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 3 (translation table walk)" },
+ { do_sea, SIGBUS, 0, "synchronous parity or ECC error" },
{ do_bad, SIGBUS, 0, "unknown 25" },
{ do_bad, SIGBUS, 0, "unknown 26" },
{ do_bad, SIGBUS, 0, "unknown 27" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 0 synchronous parity error (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 1 synchronous parity error (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 2 synchronous parity error (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 3 synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "unknown 32" },
{ do_alignment_fault, SIGBUS, BUS_ADRALN, "alignment fault" },
{ do_bad, SIGBUS, 0, "unknown 34" },
* [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
2017-03-06 20:44 ` [PATCH V12 04/10] arm64: exception: handle Synchronous External Abort Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-07 11:37 ` James Morse
2017-03-17 16:43 ` James Morse
2017-03-06 20:44 ` [PATCH V12 06/10] acpi: apei: panic OS with fatal error status block Tyler Baicar
From: Tyler Baicar @ 2017-03-06 20:44 UTC
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
The ARM APEI extension proposal added the SEA (Synchronous External Abort)
notification type for ARMv8.
Add a new GHES error source handling function for SEA. If an error
source's notification type is SEA, the source is added to a list that
this function walks when it is called from the SEA exception handler.
That way GHES will parse and report SEA exceptions when they occur.
An SEA can interrupt code that had interrupts masked, so it is treated as
an NMI. To support this, the page of address space used for mapping APEI
buffers while in_nmi() is now always reserved, and ghes_ioremap_pfn_nmi()
is changed to use the helper methods to find the pgprot_t to map with, in
the same way as ghes_ioremap_pfn_irq().
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 13 ++++++++
drivers/acpi/apei/Kconfig | 15 +++++++++
drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++++++++++++++++++++++----
include/acpi/ghes.h | 7 +++++
5 files changed, 107 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1117421..fca4dc1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -88,6 +88,7 @@ config ARM64
select HAVE_IRQ_TIME_ACCOUNTING
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP if NUMA
+ select HAVE_NMI if ACPI_APEI_SEA
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
select HAVE_PERF_REGS
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d178dc0..b2d57fc 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -41,6 +41,8 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
+#include <acpi/ghes.h>
+
static const char *fault_name(unsigned int esr);
#ifdef CONFIG_KPROBES
@@ -498,6 +500,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
fault_name(esr), esr, addr);
+ /*
+ * Synchronous aborts may interrupt code which had interrupts masked.
+ * Before calling out into the wider kernel tell the interested
+ * subsystems.
+ */
+ if (IS_ENABLED(ACPI_APEI_SEA)) {
+ nmi_enter();
+ ghes_notify_sea();
+ nmi_exit();
+ }
+
info.si_signo = SIGBUS;
info.si_errno = 0;
info.si_code = 0;
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index b0140c8..c545dd1 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -39,6 +39,21 @@ config ACPI_APEI_PCIEAER
PCIe AER errors may be reported via APEI firmware first mode.
Turn on this option to enable the corresponding support.
+config ACPI_APEI_SEA
+ bool "APEI Synchronous External Abort logging/recovering support"
+ depends on ARM64 && ACPI_APEI && ACPI_APEI_GHES
+ default y
+ help
+ This option should be enabled if the system supports
+ firmware first handling of SEA (Synchronous External Abort).
+ SEA happens with certain faults of data abort or instruction
+ abort synchronous exceptions on ARMv8 systems. If a system
+ supports firmware first handling of SEA, the platform analyzes
+ and handles hardware error notifications from SEA, and it may then
+ form a HW error record for the OS to parse and handle. This
+ option allows the OS to look for such hardware error record, and
+ take appropriate action.
+
config ACPI_APEI_MEMORY_FAILURE
bool "APEI memory error recovering support"
depends on ACPI_APEI && MEMORY_FAILURE
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b25e7cf..b0596ba 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,11 +114,7 @@
* Two virtual pages are used, one for IRQ/PROCESS context, the other for
* NMI context (optionally).
*/
-#ifdef CONFIG_HAVE_ACPI_APEI_NMI
#define GHES_IOREMAP_PAGES 2
-#else
-#define GHES_IOREMAP_PAGES 1
-#endif
#define GHES_IOREMAP_IRQ_PAGE(base) (base)
#define GHES_IOREMAP_NMI_PAGE(base) ((base) + PAGE_SIZE)
@@ -157,10 +153,14 @@ static void ghes_ioremap_exit(void)
static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
{
unsigned long vaddr;
+ phys_addr_t paddr;
+ pgprot_t prot;
vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
- ioremap_page_range(vaddr, vaddr + PAGE_SIZE,
- pfn << PAGE_SHIFT, PAGE_KERNEL);
+
+ paddr = pfn << PAGE_SHIFT;
+ prot = arch_apei_get_mem_attribute(paddr);
+ ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
return (void __iomem *)vaddr;
}
@@ -767,6 +767,50 @@ static int ghes_notify_sci(struct notifier_block *this,
.notifier_call = ghes_notify_sci,
};
+#ifdef CONFIG_ACPI_APEI_SEA
+static LIST_HEAD(ghes_sea);
+
+void ghes_notify_sea(void)
+{
+ struct ghes *ghes;
+
+ /*
+ * synchronize_rcu() will wait for nmi_exit(), so no need to
+ * rcu_read_lock().
+ */
+ list_for_each_entry_rcu(ghes, &ghes_sea, list) {
+ ghes_proc(ghes);
+ }
+}
+
+static void ghes_sea_add(struct ghes *ghes)
+{
+ mutex_lock(&ghes_list_mutex);
+ list_add_rcu(&ghes->list, &ghes_sea);
+ mutex_unlock(&ghes_list_mutex);
+}
+
+static void ghes_sea_remove(struct ghes *ghes)
+{
+ mutex_lock(&ghes_list_mutex);
+ list_del_rcu(&ghes->list);
+ mutex_unlock(&ghes_list_mutex);
+ synchronize_rcu();
+}
+#else /* CONFIG_ACPI_APEI_SEA */
+static inline void ghes_sea_add(struct ghes *ghes)
+{
+ pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
+ ghes->generic->header.source_id);
+}
+
+static inline void ghes_sea_remove(struct ghes *ghes)
+{
+ pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
+ ghes->generic->header.source_id);
+}
+#endif /* CONFIG_ACPI_APEI_SEA */
+
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
/*
* printk is not safe in NMI context. So in NMI handler, we allocate
@@ -1012,6 +1056,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
case ACPI_HEST_NOTIFY_EXTERNAL:
case ACPI_HEST_NOTIFY_SCI:
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+ pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
+ generic->header.source_id);
+ rc = -ENOTSUPP;
+ goto err;
+ }
+ break;
case ACPI_HEST_NOTIFY_NMI:
if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1023,6 +1075,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
generic->header.source_id);
goto err;
+ case ACPI_HEST_NOTIFY_GPIO:
+ case ACPI_HEST_NOTIFY_SEI:
+ case ACPI_HEST_NOTIFY_GSIV:
+ pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
+ generic->header.source_id, generic->header.source_id);
+ rc = -ENOTSUPP;
+ goto err;
default:
pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
generic->notify.type, generic->header.source_id);
@@ -1077,6 +1136,9 @@ static int ghes_probe(struct platform_device *ghes_dev)
list_add_rcu(&ghes->list, &ghes_sci);
mutex_unlock(&ghes_list_mutex);
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ ghes_sea_add(ghes);
+ break;
case ACPI_HEST_NOTIFY_NMI:
ghes_nmi_add(ghes);
break;
@@ -1119,6 +1181,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
unregister_acpi_hed_notifier(&ghes_notifier_sci);
mutex_unlock(&ghes_list_mutex);
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ ghes_sea_remove(ghes);
+ break;
case ACPI_HEST_NOTIFY_NMI:
ghes_nmi_remove(ghes);
break;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 6ae318b..18bc935 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -1,3 +1,6 @@
+#ifndef GHES_H
+#define GHES_H
+
#include <acpi/apei.h>
#include <acpi/hed.h>
@@ -95,3 +98,7 @@ static inline void *acpi_hest_generic_data_payload(struct acpi_hest_generic_data
(void *)(((struct acpi_hest_generic_data_v300 *)(gdata)) + 1) :
gdata + 1;
}
+
+void ghes_notify_sea(void);
+
+#endif /* GHES_H */
* Re: [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8
2017-03-06 20:44 ` [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
@ 2017-03-07 11:37 ` James Morse
2017-03-07 16:40 ` Baicar, Tyler
2017-03-17 16:43 ` James Morse
1 sibling, 1 reply; 30+ messages in thread
From: James Morse @ 2017-03-07 11:37 UTC (permalink / raw)
To: Tyler Baicar
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hi Tyler,
On 06/03/17 20:44, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index d178dc0..b2d57fc 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -41,6 +41,8 @@
> #include <asm/pgtable.h>
> #include <asm/tlbflush.h>
>
> +#include <acpi/ghes.h>
> +
> static const char *fault_name(unsigned int esr);
>
> #ifdef CONFIG_KPROBES
> @@ -498,6 +500,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
> fault_name(esr), esr, addr);
>
> + /*
> + * Synchronous aborts may interrupt code which had interrupts masked.
> + * Before calling out into the wider kernel tell the interested
> + * subsystems.
> + */
> + if (IS_ENABLED(ACPI_APEI_SEA)) {
IS_ENABLED() needs the CONFIG_ version of the symbols, otherwise this doesn't
get built.
(I guess the testing from the previous always-enabled version is still valid)
> + nmi_enter();
> + ghes_notify_sea();
> + nmi_exit();
> + }
> +
> info.si_signo = SIGBUS;
> info.si_errno = 0;
> info.si_code = 0;
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index b0140c8..c545dd1 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -39,6 +39,21 @@ config ACPI_APEI_PCIEAER
> PCIe AER errors may be reported via APEI firmware first mode.
> Turn on this option to enable the corresponding support.
>
> +config ACPI_APEI_SEA
> + bool "APEI Synchronous External Abort logging/recovering support"
> + depends on ARM64 && ACPI_APEI && ACPI_APEI_GHES
Nit: ACPI_APEI_GHES already depends on ACPI_APEI
> + default y
> + help
> + This option should be enabled if the system supports
> + firmware first handling of SEA (Synchronous External Abort).
> + SEA happens with certain faults of data abort or instruction
> + abort synchronous exceptions on ARMv8 systems. If a system
> + supports firmware first handling of SEA, the platform analyzes
> + and handles hardware error notifications from SEA, and it may then
> + form a HW error record for the OS to parse and handle. This
> + option allows the OS to look for such hardware error record, and
> + take appropriate action.
> +
> config ACPI_APEI_MEMORY_FAILURE
> bool "APEI memory error recovering support"
> depends on ACPI_APEI && MEMORY_FAILURE
Reviewed-by: James Morse <james.morse@arm.com>
Thanks,
James
* Re: [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8
2017-03-07 11:37 ` James Morse
@ 2017-03-07 16:40 ` Baicar, Tyler
0 siblings, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-07 16:40 UTC (permalink / raw)
To: James Morse
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hello James,
On 3/7/2017 4:37 AM, James Morse wrote:
> On 06/03/17 20:44, Tyler Baicar wrote:
>> ARM APEI extension proposal added SEA (Synchronous External Abort)
>> notification type for ARMv8.
>> Add a new GHES error source handling function for SEA. If an error
>> source's notification type is SEA, then this function can be registered
>> into the SEA exception handler. That way GHES will parse and report
>> SEA exceptions when they occur.
>> An SEA can interrupt code that had interrupts masked and is treated as
>> an NMI. To aid this the page of address space for mapping APEI buffers
>> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
>> changed to use the helper methods to find the prot_t to map with in
>> the same way as ghes_ioremap_pfn_irq().
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d178dc0..b2d57fc 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -41,6 +41,8 @@
>> #include <asm/pgtable.h>
>> #include <asm/tlbflush.h>
>>
>> +#include <acpi/ghes.h>
>> +
>> static const char *fault_name(unsigned int esr);
>>
>> #ifdef CONFIG_KPROBES
>> @@ -498,6 +500,17 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>> pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>> fault_name(esr), esr, addr);
>>
>> + /*
>> + * Synchronous aborts may interrupt code which had interrupts masked.
>> + * Before calling out into the wider kernel tell the interested
>> + * subsystems.
>> + */
>> + if (IS_ENABLED(ACPI_APEI_SEA)) {
> IS_ENABLED() needs the CONFIG_ version of the symbols, otherwise this doesn't
> get built.
>
> (I guess the testing from the previous always-enabled version is still valid)
Okay, I will use CONFIG_ACPI_APEI_SEA in the next patch set.
>
>> + nmi_enter();
>> + ghes_notify_sea();
>> + nmi_exit();
>> + }
>> +
>> info.si_signo = SIGBUS;
>> info.si_errno = 0;
>> info.si_code = 0;
>> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
>> index b0140c8..c545dd1 100644
>> --- a/drivers/acpi/apei/Kconfig
>> +++ b/drivers/acpi/apei/Kconfig
>> @@ -39,6 +39,21 @@ config ACPI_APEI_PCIEAER
>> PCIe AER errors may be reported via APEI firmware first mode.
>> Turn on this option to enable the corresponding support.
>>
>> +config ACPI_APEI_SEA
>> + bool "APEI Synchronous External Abort logging/recovering support"
>> + depends on ARM64 && ACPI_APEI && ACPI_APEI_GHES
> Nit: ACPI_APEI_GHES already depends on ACPI_APEI
I can remove ACPI_APEI here then.
>> + default y
>> + help
>> + This option should be enabled if the system supports
>> + firmware first handling of SEA (Synchronous External Abort).
>> + SEA happens with certain faults of data abort or instruction
>> + abort synchronous exceptions on ARMv8 systems. If a system
>> + supports firmware first handling of SEA, the platform analyzes
>> + and handles hardware error notifications from SEA, and it may then
>> + form a HW error record for the OS to parse and handle. This
>> + option allows the OS to look for such hardware error record, and
>> + take appropriate action.
>> +
>> config ACPI_APEI_MEMORY_FAILURE
>> bool "APEI memory error recovering support"
>> depends on ACPI_APEI && MEMORY_FAILURE
>
> Reviewed-by: James Morse <james.morse@arm.com>
>
Thanks!
Tyler
* Re: [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8
2017-03-06 20:44 ` [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
2017-03-07 11:37 ` James Morse
@ 2017-03-17 16:43 ` James Morse
2017-03-21 19:19 ` Baicar, Tyler
1 sibling, 1 reply; 30+ messages in thread
From: James Morse @ 2017-03-17 16:43 UTC (permalink / raw)
To: Tyler Baicar
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hi Tyler,
On 06/03/17 20:44, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b25e7cf..b0596ba 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -1023,6 +1075,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
> pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
> generic->header.source_id);
> goto err;
> + case ACPI_HEST_NOTIFY_GPIO:
> + case ACPI_HEST_NOTIFY_SEI:
> + case ACPI_HEST_NOTIFY_GSIV:
> + pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
> + generic->header.source_id, generic->header.source_id);
> + rc = -ENOTSUPP;
> + goto err;
> default:
> pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
> generic->notify.type, generic->header.source_id);
This hunk will conflict with Shiju Jose's patch[0] that adds GPIO and GSIV
support. Can we remove it?
Thanks,
James
[0] https://www.spinics.net/lists/linux-acpi/msg72654.html
* Re: [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8
2017-03-17 16:43 ` James Morse
@ 2017-03-21 19:19 ` Baicar, Tyler
0 siblings, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-21 19:19 UTC (permalink / raw)
To: James Morse
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hello James,
On 3/17/2017 10:43 AM, James Morse wrote:
> On 06/03/17 20:44, Tyler Baicar wrote:
>> ARM APEI extension proposal added SEA (Synchronous External Abort)
>> notification type for ARMv8.
>> Add a new GHES error source handling function for SEA. If an error
>> source's notification type is SEA, then this function can be registered
>> into the SEA exception handler. That way GHES will parse and report
>> SEA exceptions when they occur.
>> An SEA can interrupt code that had interrupts masked and is treated as
>> an NMI. To aid this the page of address space for mapping APEI buffers
>> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
>> changed to use the helper methods to find the prot_t to map with in
>> the same way as ghes_ioremap_pfn_irq().
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index b25e7cf..b0596ba 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -1023,6 +1075,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>> pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
>> generic->header.source_id);
>> goto err;
>> + case ACPI_HEST_NOTIFY_GPIO:
>> + case ACPI_HEST_NOTIFY_SEI:
>> + case ACPI_HEST_NOTIFY_GSIV:
>> + pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
>> + generic->header.source_id, generic->header.source_id);
>> + rc = -ENOTSUPP;
>> + goto err;
>> default:
>> pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
>> generic->notify.type, generic->header.source_id);
>
> This hunk will conflict with Shiju Jose's patch[0] that adds GPIO and GSIV
> support. Can we remove it?
Yes, I was planning on removing this when I saw Shiju's patch. It will
be removed in my v13.
Thanks,
Tyler
>
> [0] https://www.spinics.net/lists/linux-acpi/msg72654.html
* [PATCH V12 06/10] acpi: apei: panic OS with fatal error status block
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (4 preceding siblings ...)
2017-03-06 20:44 ` [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8 Tyler Baicar
@ 2017-03-06 20:44 ` Tyler Baicar
2017-03-06 20:45 ` [PATCH V12 07/10] efi: print unrecognized CPER section Tyler Baicar
` (4 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: Tyler Baicar @ 2017-03-06 20:44 UTC (permalink / raw)
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
From: "Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>
Even if an error status block's severity is fatal, the kernel does not
honor that severity level and does not panic.
With the firmware first model, the platform could inform the OS about a
fatal hardware error through a non-NMI GHES notification type. The OS
should panic when it receives a hardware error record with this
severity.
If the severity is fatal, call panic() after the CPER data in the error
status block has been printed and before any error section is handled.
Signed-off-by: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
drivers/acpi/apei/ghes.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index b0596ba..d6a3b9f 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -133,6 +133,8 @@
static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;
+static int ghes_panic_timeout __read_mostly = 30;
+
static int ghes_ioremap_init(void)
{
ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -688,6 +690,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
return rc;
}
+static void __ghes_call_panic(void)
+{
+ if (panic_timeout == 0)
+ panic_timeout = ghes_panic_timeout;
+ panic("Fatal hardware error!");
+}
+
static int ghes_proc(struct ghes *ghes)
{
int rc;
@@ -695,6 +704,10 @@ static int ghes_proc(struct ghes *ghes)
rc = ghes_read_estatus(ghes, 0);
if (rc)
goto out;
+ if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+ __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
+ __ghes_call_panic();
+ }
if (!ghes_estatus_cached(ghes->estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
ghes_estatus_cache_add(ghes->generic, ghes->estatus);
@@ -831,8 +844,6 @@ static inline void ghes_sea_remove(struct ghes *ghes)
static LIST_HEAD(ghes_nmi);
-static int ghes_panic_timeout __read_mostly = 30;
-
static void ghes_proc_in_irq(struct irq_work *irq_work)
{
struct llist_node *llnode, *next;
@@ -925,9 +936,7 @@ static void __ghes_panic(struct ghes *ghes)
__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
/* reboot to log the error! */
- if (panic_timeout == 0)
- panic_timeout = ghes_panic_timeout;
- panic("Fatal hardware error!");
+ __ghes_call_panic();
}
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
* [PATCH V12 07/10] efi: print unrecognized CPER section
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (5 preceding siblings ...)
2017-03-06 20:44 ` [PATCH V12 06/10] acpi: apei: panic OS with fatal error status block Tyler Baicar
@ 2017-03-06 20:45 ` Tyler Baicar
2017-03-06 21:05 ` Joe Perches
2017-03-06 20:45 ` [PATCH V12 08/10] ras: acpi / apei: generate trace event for " Tyler Baicar
` (3 subsequent siblings)
10 siblings, 1 reply; 30+ messages in thread
From: Tyler Baicar @ 2017-03-06 20:45 UTC (permalink / raw)
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
The UEFI spec allows for non-standard sections in the Common Platform
Error Record (CPER). This is defined in section N.2.3 of UEFI version 2.5.
Currently, if the CPER section's type (UUID) does not match one of the
section types that the kernel knows how to parse, only the section type
is printed and the section data is skipped. Therefore, the user is not
able to see such CPER data, for instance, an error record in a
non-standard section.
For this case, this change prints the raw data in hex to the dmesg
buffer. The data length is taken from the Error Data Length field of the
Generic Error Data Entry.
Following is a sample output from dmesg:
[ 115.771702] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 115.779042] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 115.787456] {1}[Hardware Error]: event severity: corrected
[ 115.792927] {1}[Hardware Error]: Error 0, type: corrected
[ 115.798415] {1}[Hardware Error]: fru_id: 00000000-0000-0000-0000-000000000000
[ 115.805596] {1}[Hardware Error]: fru_text:
[ 115.816105] {1}[Hardware Error]: section type: d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 115.823880] {1}[Hardware Error]: section length: 88
[ 115.828779] {1}[Hardware Error]: 00000000: 01000001 00000002 5f434345 525f4543
[ 115.836153] {1}[Hardware Error]: 00000010: 0000574d 00000000 00000000 00000000
[ 115.843531] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 115.850908] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
[ 115.858288] {1}[Hardware Error]: 00000040: fe800000 00000000 00000004 5f434345
[ 115.865665] {1}[Hardware Error]: 00000050: 525f4543 0000574d
The raw data from the error can then be decoded using vendor-specific
tools.
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
Reviewed-by: James Morse <james.morse@arm.com>
---
drivers/firmware/efi/cper.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 56aa516..545a6c2 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -591,8 +591,16 @@ static void cper_estatus_print_section(
cper_print_proc_arm(newpfx, arm_err);
else
goto err_section_too_small;
- } else
- printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
+ } else {
+ const void *unknown_err;
+
+ unknown_err = acpi_hest_generic_data_payload(gdata);
+ printk("%ssection type: unknown, %pUl\n", newpfx, sec_type);
+ printk("%ssection length: %d\n", newpfx,
+ gdata->error_data_length);
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+ unknown_err, gdata->error_data_length, true);
+ }
return;
* Re: [PATCH V12 07/10] efi: print unrecognized CPER section
2017-03-06 20:45 ` [PATCH V12 07/10] efi: print unrecognized CPER section Tyler Baicar
@ 2017-03-06 21:05 ` Joe Perches
2017-03-07 16:39 ` Baicar, Tyler
0 siblings, 1 reply; 30+ messages in thread
From: Joe Perches @ 2017-03-06 21:05 UTC (permalink / raw)
To: Tyler Baicar, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose
On Mon, 2017-03-06 at 13:45 -0700, Tyler Baicar wrote:
> UEFI spec allows for non-standard section in Common Platform Error
> Record. This is defined in section N.2.3 of UEFI version 2.5.
>
> Currently if the CPER section's type (UUID) does not match with
> one of the section types that the kernel knows how to parse, the
> section is skipped. Therefore, user is not able to see
> such CPER data, for instance, error record of non-standard section.
>
> For above mentioned case, this change prints out the raw data in
> hex in dmesg buffer. Data length is taken from Error Data length
> field of Generic Error Data Entry.
Hi Tyler.
Trivia: (probably not worth resubmitting for this)
There's a slight mismatch between logging output and commit
message. Now there's an ASCII block after the output.
Another suggestion below.
> Following is a sample output from dmesg:
> [ 115.771702] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [ 115.779042] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
> [ 115.787456] {1}[Hardware Error]: event severity: corrected
> [ 115.792927] {1}[Hardware Error]: Error 0, type: corrected
> [ 115.798415] {1}[Hardware Error]: fru_id: 00000000-0000-0000-0000-000000000000
> [ 115.805596] {1}[Hardware Error]: fru_text:
> [ 115.816105] {1}[Hardware Error]: section type: d2e2621c-f936-468d-0d84-15a4ed015c8b
> [ 115.823880] {1}[Hardware Error]: section length: 88
> [ 115.828779] {1}[Hardware Error]: 00000000: 01000001 00000002 5f434345 525f4543
> [ 115.836153] {1}[Hardware Error]: 00000010: 0000574d 00000000 00000000 00000000
> [ 115.843531] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
> [ 115.850908] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
> [ 115.858288] {1}[Hardware Error]: 00000040: fe800000 00000000 00000004 5f434345
> [ 115.865665] {1}[Hardware Error]: 00000050: 525f4543 0000574d
[]
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
[]
> @@ -591,8 +591,16 @@ static void cper_estatus_print_section(
> cper_print_proc_arm(newpfx, arm_err);
> else
> goto err_section_too_small;
> - } else
> - printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
> + } else {
> + const void *unknown_err;
> +
> + unknown_err = acpi_hest_generic_data_payload(gdata);
> + printk("%ssection type: unknown, %pUl\n", newpfx, sec_type);
> + printk("%ssection length: %d\n", newpfx,
> + gdata->error_data_length);
It might be nice to output this as
printk("%ssection length: %d (%#x)\n",
newpfx, gdata->error_data_length, gdata->error_data_length);
so it's easy to know the appropriate hex buffer length too.
> + print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
> + unknown_err, gdata->error_data_length, true);
> + }
>
> return;
>
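To make the sample output above concrete, here is a user-space sketch that approximates one print_hex_dump() row for the parameters used in the hunk (DUMP_PREFIX_OFFSET, rowsize 16, groupsize 4, ascii = true), together with the suggested "length (hex length)" line. It assumes little-endian 32-bit groups, as on arm64, and a row length that is a multiple of 4 (true for the 88-byte sample). format_section_len() and format_hex_row() are hypothetical helpers rather than kernel functions, and they omit the newpfx prefix.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the decimal-plus-hex length line suggested above. */
static int format_section_len(char *out, size_t outlen, unsigned int len)
{
	return snprintf(out, outlen, "section length: %u (%#x)", len, len);
}

/*
 * Hypothetical approximation of one print_hex_dump() row with
 * DUMP_PREFIX_OFFSET, rowsize 16, groupsize 4 and ascii = true.
 * Assumes little-endian 32-bit groups (as on arm64) and a row length
 * that is a multiple of 4. Returns the line length, or -1 on bad input.
 */
static int format_hex_row(const uint8_t *buf, size_t len, size_t offset,
			  char *out, size_t outlen)
{
	size_t pos, hex_start, i;

	if (outlen < 80 || len > 16 || len % 4 != 0)
		return -1;

	pos = (size_t)snprintf(out, outlen, "%08zx: ", offset);
	hex_start = pos;

	/* Each group of 4 bytes is printed as one little-endian u32. */
	for (i = 0; i < len; i += 4) {
		uint32_t v = (uint32_t)buf[i] | (uint32_t)buf[i + 1] << 8 |
			     (uint32_t)buf[i + 2] << 16 |
			     (uint32_t)buf[i + 3] << 24;

		pos += (size_t)snprintf(out + pos, outlen - pos, "%s%08x",
					i ? " " : "", (unsigned int)v);
	}

	/* Pad to the ASCII column: 16 * 2 hex digits + 16 / 4 + 1 = 37. */
	while (pos - hex_start < 37)
		out[pos++] = ' ';

	/* Printable ASCII bytes as-is, everything else as '.'. */
	for (i = 0; i < len; i++)
		out[pos++] = (buf[i] < 0x80 && isprint(buf[i])) ? (char)buf[i] : '.';
	out[pos] = '\0';
	return (int)pos;
}
```

Fed the first 16 bytes of the sample section, format_hex_row() yields "00000000: 01000001 00000002 5f434345 525f4543  ........ECC_CE_R", which shows where the ASCII block now appears relative to the commit message's sample.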
* Re: [PATCH V12 07/10] efi: print unrecognized CPER section
2017-03-06 21:05 ` Joe Perches
@ 2017-03-07 16:39 ` Baicar, Tyler
0 siblings, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-07 16:39 UTC (permalink / raw)
To: Joe Perches, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose
On 3/6/2017 2:05 PM, Joe Perches wrote:
> On Mon, 2017-03-06 at 13:45 -0700, Tyler Baicar wrote:
>> UEFI spec allows for non-standard section in Common Platform Error
>> Record. This is defined in section N.2.3 of UEFI version 2.5.
>>
>> Currently if the CPER section's type (UUID) does not match with
>> one of the section types that the kernel knows how to parse, the
>> section is skipped. Therefore, user is not able to see
>> such CPER data, for instance, error record of non-standard section.
>>
>> For above mentioned case, this change prints out the raw data in
>> hex in dmesg buffer. Data length is taken from Error Data length
>> field of Generic Error Data Entry.
> Hi Tyler.
>
> Trivia: (probably not worth resubmitting for this)
>
> There's a slight mismatch between logging output and commit
> message. Now there's an ASCII block after the output.
>
> Another suggestion below.
True, I will update this commit message for the next patch set.
>
>> Following is a sample output from dmesg:
>> [ 115.771702] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [ 115.779042] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [ 115.787456] {1}[Hardware Error]: event severity: corrected
>> [ 115.792927] {1}[Hardware Error]: Error 0, type: corrected
>> [ 115.798415] {1}[Hardware Error]: fru_id: 00000000-0000-0000-0000-000000000000
>> [ 115.805596] {1}[Hardware Error]: fru_text:
>> [ 115.816105] {1}[Hardware Error]: section type: d2e2621c-f936-468d-0d84-15a4ed015c8b
>> [ 115.823880] {1}[Hardware Error]: section length: 88
>> [ 115.828779] {1}[Hardware Error]: 00000000: 01000001 00000002 5f434345 525f4543
>> [ 115.836153] {1}[Hardware Error]: 00000010: 0000574d 00000000 00000000 00000000
>> [ 115.843531] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
>> [ 115.850908] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
>> [ 115.858288] {1}[Hardware Error]: 00000040: fe800000 00000000 00000004 5f434345
>> [ 115.865665] {1}[Hardware Error]: 00000050: 525f4543 0000574d
> []
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> []
>> @@ -591,8 +591,16 @@ static void cper_estatus_print_section(
>> cper_print_proc_arm(newpfx, arm_err);
>> else
>> goto err_section_too_small;
>> - } else
>> - printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
>> + } else {
>> + const void *unknown_err;
>> +
>> + unknown_err = acpi_hest_generic_data_payload(gdata);
>> + printk("%ssection type: unknown, %pUl\n", newpfx, sec_type);
>> + printk("%ssection length: %d\n", newpfx,
>> + gdata->error_data_length);
> It might be nice to output this as
>
> printk("%ssection length: %d (%#x)\n",
> newpfx, gdata->error_data_length, gdata->error_data_length);
>
> so it's easy to know the appropriate hex buffer length too.
I will make this change in the next patch set.
Thanks,
Tyler
>
>> + print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
>> + unknown_err, gdata->error_data_length, true);
>> + }
>>
>> return;
>>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* [PATCH V12 08/10] ras: acpi / apei: generate trace event for unrecognized CPER section
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (6 preceding siblings ...)
2017-03-06 20:45 ` [PATCH V12 07/10] efi: print unrecognized CPER section Tyler Baicar
@ 2017-03-06 20:45 ` Tyler Baicar
2017-03-06 20:45 ` [PATCH V12 09/10] trace, ras: add ARM processor error trace event Tyler Baicar
` (2 subsequent siblings)
10 siblings, 0 replies; 30+ messages in thread
From: Tyler Baicar @ 2017-03-06 20:45 UTC (permalink / raw)
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
The UEFI spec allows for non-standard sections in the Common Platform
Error Record. This is defined in section N.2.3 of UEFI version 2.5.
Currently, if the CPER section's type (UUID) does not match any
section type that the kernel knows how to parse, no trace event is
generated for that section. As a result, the user has no way to learn
that such a hardware error occurred, including error records of
non-standard sections.
This commit generates a trace event containing the raw error data
for unrecognized CPER sections.
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
CC: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
---
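The ghes.c hunk picks fru_id and fru_text only when the matching validation bit is set, and otherwise passes an all-zero UUID and an empty string to the trace event. The sketch below isolates that check in user space. The bit values are assumed to mirror CPER_SEC_VALID_FRU_ID (0x1) and CPER_SEC_VALID_FRU_TEXT (0x2), and struct sketch_gdata is a hypothetical, trimmed-down stand-in whose layout does not match struct acpi_hest_generic_data. One difference is deliberate: this sketch resets the defaults for every section, whereas in the hunk fru_id and fru_text are declared once outside apei_estatus_for_each_section(), so a value picked up from an earlier section carries over to later ones.

```c
#include <stdint.h>
#include <string.h>

/* Assumed to mirror CPER_SEC_VALID_FRU_ID / CPER_SEC_VALID_FRU_TEXT. */
#define SKETCH_VALID_FRU_ID   0x1
#define SKETCH_VALID_FRU_TEXT 0x2

/* Hypothetical, trimmed-down stand-in for struct acpi_hest_generic_data;
 * field order and sizes do not match the real ACPI table layout. */
struct sketch_gdata {
	uint8_t validation_bits;
	uint8_t fru_id[16];
	char fru_text[20];
};

/* All-zero UUID, playing the role of NULL_UUID_LE. */
static const uint8_t sketch_null_uuid[16];

/* Pick the FRU id/text for one section, falling back to defaults. */
static void sketch_pick_fru(const struct sketch_gdata *g,
			    const uint8_t **fru_id, const char **fru_text)
{
	*fru_id = sketch_null_uuid;
	*fru_text = "";

	if (g->validation_bits & SKETCH_VALID_FRU_ID)
		*fru_id = g->fru_id;
	if (g->validation_bits & SKETCH_VALID_FRU_TEXT)
		*fru_text = g->fru_text;
}
```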
drivers/acpi/apei/ghes.c | 24 ++++++++++++++++++++++--
drivers/ras/ras.c | 1 +
include/ras/ras_event.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 2 deletions(-)
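unknown_sec_event stores the raw payload with __dynamic_array(u8, buf, len), so the variable-length data follows the fixed fields in the ring-buffer entry and is located through a 32-bit __data_loc word. As understood from include/trace/trace_events.h of this period, the low 16 bits hold the offset from the start of the entry and the high 16 bits hold the length. The user-space sketch below packs and unpacks a record that way. struct sketch_entry and the sketch_* helpers are hypothetical, fru_text (itself a __string) is omitted for brevity, and the real entry also carries the common trace header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* __data_loc-style word: low 16 bits = offset from the start of the
 * entry, high 16 bits = length in bytes. */
static uint32_t sketch_data_loc(uint16_t offset, uint16_t len)
{
	return (uint32_t)offset | (uint32_t)len << 16;
}

/* Hypothetical flattened entry loosely modelled on unknown_sec_event. */
struct sketch_entry {
	uint8_t sec_type[16];
	uint8_t fru_id[16];
	uint8_t sev;
	uint32_t len;
	uint32_t loc_buf;	/* location of the raw payload */
};

/* Pack the fixed fields, then the payload right after them.
 * Returns the bytes used, or 0 if the record does not fit. */
static size_t sketch_pack(uint8_t *rec, size_t cap,
			  const uint8_t sec_type[16], const uint8_t fru_id[16],
			  uint8_t sev, const uint8_t *err, uint16_t len)
{
	struct sketch_entry e;
	const size_t off = sizeof(e);

	if (off + len > cap)
		return 0;
	memset(&e, 0, sizeof(e));
	memcpy(e.sec_type, sec_type, sizeof(e.sec_type));
	memcpy(e.fru_id, fru_id, sizeof(e.fru_id));
	e.sev = sev;
	e.len = len;
	e.loc_buf = sketch_data_loc((uint16_t)off, len);
	memcpy(rec, &e, sizeof(e));
	memcpy(rec + off, err, len);
	return off + len;
}

/* Counterpart of __get_dynamic_array(): find the payload via loc_buf. */
static const uint8_t *sketch_get_buf(const uint8_t *rec, uint16_t *len)
{
	struct sketch_entry e;

	memcpy(&e, rec, sizeof(e));
	*len = (uint16_t)(e.loc_buf >> 16);
	return rec + (e.loc_buf & 0xffff);
}
```

Because the length lives in 16 bits, a single dynamic array in this scheme cannot describe more than 0xffff bytes.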
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d6a3b9f..842c0cc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -44,11 +44,13 @@
#include <linux/pci.h>
#include <linux/aer.h>
#include <linux/nmi.h>
+#include <linux/uuid.h>
#include <acpi/actbl1.h>
#include <acpi/ghes.h>
#include <acpi/apei.h>
#include <asm/tlbflush.h>
+#include <ras/ras_event.h>
#include "apei-internal.h"
@@ -453,11 +455,21 @@ static void ghes_do_proc(struct ghes *ghes,
{
int sev, sec_sev;
struct acpi_hest_generic_data *gdata;
+ uuid_le sec_type;
+ uuid_le *fru_id = &NULL_UUID_LE;
+ char *fru_text = "";
sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_sev = ghes_severity(gdata->error_severity);
- if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+ sec_type = *(uuid_le *)gdata->section_type;
+
+ if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+ fru_id = (uuid_le *)gdata->fru_id;
+ if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+ fru_text = gdata->fru_text;
+
+ if (!uuid_le_cmp(sec_type,
CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err;
@@ -468,7 +480,7 @@ static void ghes_do_proc(struct ghes *ghes,
ghes_handle_memory_failure(gdata, sev);
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
- else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+ else if (!uuid_le_cmp(sec_type,
CPER_SEC_PCIE)) {
struct cper_sec_pcie *pcie_err;
@@ -501,6 +513,14 @@ static void ghes_do_proc(struct ghes *ghes,
}
#endif
+#ifdef CONFIG_RAS
+ else if (trace_unknown_sec_event_enabled()) {
+ void *unknown_err = acpi_hest_generic_data_payload(gdata);
+ trace_unknown_sec_event(&sec_type,
+ fru_id, fru_text, sec_sev,
+ unknown_err, gdata->error_data_length);
+ }
+#endif
}
}
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index b67dd36..fb2500b 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -27,3 +27,4 @@ static int __init ras_init(void)
EXPORT_TRACEPOINT_SYMBOL_GPL(extlog_mem_event);
#endif
EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 1791a12..5861b6f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,51 @@
);
/*
+ * Unknown Section Report
+ *
+ * This event is generated when hardware detected a hardware
+ * error event, which may be of non-standard section as defined
+ * in UEFI spec appendix "Common Platform Error Record", or may
+ * be of sections for which TRACE_EVENT is not defined.
+ *
+ */
+TRACE_EVENT(unknown_sec_event,
+
+ TP_PROTO(const uuid_le *sec_type,
+ const uuid_le *fru_id,
+ const char *fru_text,
+ const u8 sev,
+ const u8 *err,
+ const u32 len),
+
+ TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
+
+ TP_STRUCT__entry(
+ __array(char, sec_type, 16)
+ __array(char, fru_id, 16)
+ __string(fru_text, fru_text)
+ __field(u8, sev)
+ __field(u32, len)
+ __dynamic_array(u8, buf, len)
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->sec_type, sec_type, sizeof(uuid_le));
+ memcpy(__entry->fru_id, fru_id, sizeof(uuid_le));
+ __assign_str(fru_text, fru_text);
+ __entry->sev = sev;
+ __entry->len = len;
+ memcpy(__get_dynamic_array(buf), err, len);
+ ),
+
+ TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",
+ __entry->sev, __entry->sec_type,
+ __entry->fru_id, __get_str(fru_text),
+ __entry->len,
+ __print_hex(__get_dynamic_array(buf), __entry->len))
+);
+
+/*
* PCIe AER Trace event
*
* These events are generated when hardware detects a corrected or
* [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (7 preceding siblings ...)
2017-03-06 20:45 ` [PATCH V12 08/10] ras: acpi / apei: generate trace event for " Tyler Baicar
@ 2017-03-06 20:45 ` Tyler Baicar
2017-03-09 9:41 ` Xie XiuQi
2017-03-06 20:45 ` [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support Tyler Baicar
2017-03-07 11:37 ` [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 James Morse
10 siblings, 1 reply; 30+ messages in thread
From: Tyler Baicar @ 2017-03-06 20:45 UTC (permalink / raw)
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
Currently there are trace events for the various RAS
errors with the exception of ARM processor type errors.
Add a new trace event for such errors so that the user
will know when they occur. These trace events are
consistent with the ARM processor error section type
defined in UEFI 2.6 spec section N.2.4.4.
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
---
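arm_event records MPIDR as a raw 64-bit value. As a sketch of how a consumer (rasdaemon, for example) might split it into per-level affinity fields, the helper below uses the MPIDR_EL1 field positions from the Arm architecture. sketch_mpidr_aff() is a hypothetical name; whether the kernel or userspace should perform this split is left open here.

```c
#include <stdint.h>

/*
 * Affinity fields of MPIDR_EL1 as laid out by the Arm architecture:
 * Aff0 = bits[7:0], Aff1 = bits[15:8], Aff2 = bits[23:16],
 * Aff3 = bits[39:32]. Bits[31:24] hold flags such as MT and U and
 * are skipped.
 */
static uint8_t sketch_mpidr_aff(uint64_t mpidr, unsigned int level)
{
	static const unsigned int shift[4] = { 0, 8, 16, 32 };

	if (level > 3)
		return 0;	/* no such affinity level */
	return (uint8_t)(mpidr >> shift[level]);
}
```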
drivers/acpi/apei/ghes.c | 8 +++++++-
drivers/firmware/efi/cper.c | 1 +
drivers/ras/ras.c | 1 +
include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 842c0cc..81d7b79 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -514,7 +514,13 @@ static void ghes_do_proc(struct ghes *ghes,
}
#endif
#ifdef CONFIG_RAS
- else if (trace_unknown_sec_event_enabled()) {
+ else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
+ trace_arm_event_enabled()) {
+ struct cper_sec_proc_arm *arm_err;
+
+ arm_err = acpi_hest_generic_data_payload(gdata);
+ trace_arm_event(arm_err);
+ } else if (trace_unknown_sec_event_enabled()) {
void *unknown_err = acpi_hest_generic_data_payload(gdata);
trace_unknown_sec_event(&sec_type,
fru_id, fru_text, sec_sev,
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 545a6c2..e9fb56a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -35,6 +35,7 @@
#include <linux/printk.h>
#include <linux/bcd.h>
#include <acpi/ghes.h>
+#include <ras/ras_event.h>
#define INDENT_SP " "
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index fb2500b..8ba5a94 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -28,3 +28,4 @@ static int __init ras_init(void)
#endif
EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(arm_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5861b6f..b36db48 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,40 @@
);
/*
+ * ARM Processor Events Report
+ *
+ * This event is generated when hardware detects an ARM processor error
+ * has occurred. UEFI 2.6 spec section N.2.4.4.
+ */
+TRACE_EVENT(arm_event,
+
+ TP_PROTO(const struct cper_sec_proc_arm *proc),
+
+ TP_ARGS(proc),
+
+ TP_STRUCT__entry(
+ __field(u64, mpidr)
+ __field(u64, midr)
+ __field(u32, running_state)
+ __field(u32, psci_state)
+ __field(u8, affinity)
+ ),
+
+ TP_fast_assign(
+ __entry->affinity = proc->affinity_level;
+ __entry->mpidr = proc->mpidr;
+ __entry->midr = proc->midr;
+ __entry->running_state = proc->running_state;
+ __entry->psci_state = proc->psci_state;
+ ),
+
+ TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
+ "running state: %d; PSCI state: %d",
+ __entry->affinity, __entry->mpidr, __entry->midr,
+ __entry->running_state, __entry->psci_state)
+);
+
+/*
* Unknown Section Report
*
* This event is generated when hardware detected a hardware
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-06 20:45 ` [PATCH V12 09/10] trace, ras: add ARM processor error trace event Tyler Baicar
@ 2017-03-09 9:41 ` Xie XiuQi
2017-03-10 18:23 ` Baicar, Tyler
0 siblings, 1 reply; 30+ messages in thread
From: Xie XiuQi @ 2017-03-09 9:41 UTC (permalink / raw)
To: Tyler Baicar, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe
Hi Tyler Baicar,
On 2017/3/7 4:45, Tyler Baicar wrote:
> Currently there are trace events for the various RAS
> errors with the exception of ARM processor type errors.
> Add a new trace event for such errors so that the user
> will know when they occur. These trace events are
> consistent with the ARM processor error section type
> defined in UEFI 2.6 spec section N.2.4.4.
>
> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> drivers/acpi/apei/ghes.c | 8 +++++++-
> drivers/firmware/efi/cper.c | 1 +
> drivers/ras/ras.c | 1 +
> include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
> 4 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 842c0cc..81d7b79 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -514,7 +514,13 @@ static void ghes_do_proc(struct ghes *ghes,
> }
> #endif
> #ifdef CONFIG_RAS
> - else if (trace_unknown_sec_event_enabled()) {
> + else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
> + trace_arm_event_enabled()) {
> + struct cper_sec_proc_arm *arm_err;
> +
> + arm_err = acpi_hest_generic_data_payload(gdata);
> + trace_arm_event(arm_err);
> + } else if (trace_unknown_sec_event_enabled()) {
> void *unknown_err = acpi_hest_generic_data_payload(gdata);
> trace_unknown_sec_event(&sec_type,
> fru_id, fru_text, sec_sev,
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 545a6c2..e9fb56a 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -35,6 +35,7 @@
> #include <linux/printk.h>
> #include <linux/bcd.h>
> #include <acpi/ghes.h>
> +#include <ras/ras_event.h>
>
> #define INDENT_SP " "
>
> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
> index fb2500b..8ba5a94 100644
> --- a/drivers/ras/ras.c
> +++ b/drivers/ras/ras.c
> @@ -28,3 +28,4 @@ static int __init ras_init(void)
> #endif
> EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
> EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(arm_event);
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 5861b6f..b36db48 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -162,6 +162,40 @@
> );
>
> /*
> + * ARM Processor Events Report
> + *
> + * This event is generated when hardware detects an ARM processor error
> + * has occurred. UEFI 2.6 spec section N.2.4.4.
> + */
> +TRACE_EVENT(arm_event,
> +
> + TP_PROTO(const struct cper_sec_proc_arm *proc),
> +
> + TP_ARGS(proc),
> +
> + TP_STRUCT__entry(
> + __field(u64, mpidr)
> + __field(u64, midr)
> + __field(u32, running_state)
> + __field(u32, psci_state)
> + __field(u8, affinity)
> + ),
> +
> + TP_fast_assign(
> + __entry->affinity = proc->affinity_level;
> + __entry->mpidr = proc->mpidr;
> + __entry->midr = proc->midr;
> + __entry->running_state = proc->running_state;
> + __entry->psci_state = proc->psci_state;
> + ),
> +
> + TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
> + "running state: %d; PSCI state: %d",
> + __entry->affinity, __entry->mpidr, __entry->midr,
> + __entry->running_state, __entry->psci_state)
> +);
> +
I don't think these fields are enough. We also need to export the ARM processor
error information (UEFI 2.6 spec section N.2.4.4.1), or at least the error type,
address, etc., so that userspace tools (such as rasdaemon) can tell what error
occurred.
Thanks,
Xie XiuQi
> +/*
> * Unknown Section Report
> *
> * This event is generated when hardware detected a hardware
>
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-09 9:41 ` Xie XiuQi
@ 2017-03-10 18:23 ` Baicar, Tyler
2017-03-13 2:31 ` Xie XiuQi
0 siblings, 1 reply; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-10 18:23 UTC (permalink / raw)
To: Xie XiuQi, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe
Hello Xie XiuQi,
On 3/9/2017 2:41 AM, Xie XiuQi wrote:
> On 2017/3/7 4:45, Tyler Baicar wrote:
>> Currently there are trace events for the various RAS
>> errors with the exception of ARM processor type errors.
>> Add a new trace event for such errors so that the user
>> will know when they occur. These trace events are
>> consistent with the ARM processor error section type
>> defined in UEFI 2.6 spec section N.2.4.4.
>>
>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>> Acked-by: Steven Rostedt <rostedt@goodmis.org>
>> ---
>> drivers/acpi/apei/ghes.c | 8 +++++++-
>> drivers/firmware/efi/cper.c | 1 +
>> drivers/ras/ras.c | 1 +
>> include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
>> 4 files changed, 43 insertions(+), 1 deletion(-)
>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>> index 5861b6f..b36db48 100644
>> --- a/include/ras/ras_event.h
>> +++ b/include/ras/ras_event.h
>> @@ -162,6 +162,40 @@
>> );
>>
>> /*
>> + * ARM Processor Events Report
>> + *
>> + * This event is generated when hardware detects an ARM processor error
>> + * has occurred. UEFI 2.6 spec section N.2.4.4.
>> + */
>> +TRACE_EVENT(arm_event,
>> +
>> + TP_PROTO(const struct cper_sec_proc_arm *proc),
>> +
>> + TP_ARGS(proc),
>> +
>> + TP_STRUCT__entry(
>> + __field(u64, mpidr)
>> + __field(u64, midr)
>> + __field(u32, running_state)
>> + __field(u32, psci_state)
>> + __field(u8, affinity)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->affinity = proc->affinity_level;
>> + __entry->mpidr = proc->mpidr;
>> + __entry->midr = proc->midr;
>> + __entry->running_state = proc->running_state;
>> + __entry->psci_state = proc->psci_state;
>> + ),
>> +
>> + TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
>> + "running state: %d; PSCI state: %d",
>> + __entry->affinity, __entry->mpidr, __entry->midr,
>> + __entry->running_state, __entry->psci_state)
>> +);
>> +
> I think these fields are not enough, we need also export arm processor error
> information (UEFI 2.6 spec section N.2.4.4.1), or at least the error type,
> address, etc. So that the userspace (such as rasdaemon tool) could know what
> error occurred.
This is something I plan to add later. At this point it is not clear to
me how to do it. If you look at the spec, there is not a single error
information structure: there is at least one, and possibly many. There
is also a variable number of context information structures. In "Table
260. ARM Processor Error Section", ERR_INFO_NUM and CONTEXT_INFO_NUM
give the number of these structures. I think separate trace events will
need to be added for each of these structures, because I don't think a
single trace event can hold a variable number of structures.
The ARM processor error section also has a vendor-specific error info
buffer which will need to be exposed to userspace. It may be able to
reuse the unknown section type trace event, or it may need its own
trace event.
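One trace event per structure implies walking ERR_INFO_NUM entries that follow the fixed section header. The sketch below does this in user space with a callback standing in for the per-structure trace event, and checks the advertised count against the section length before touching any entry. The sizes (a 40-byte section header with ERR_INFO_NUM as a little-endian u16 at offset 4, followed by 32-byte error information structures) are assumptions taken from UEFI 2.6 Tables 260 and 261. All sketch_* names are hypothetical, and context information structures are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Sizes assumed from UEFI 2.6 Tables 260 and 261. */
#define SKETCH_ARM_SEC_HDR_LEN  40	/* fixed part of the ARM section */
#define SKETCH_ARM_ERR_INFO_LEN 32	/* one error information structure */

/* Hypothetical per-structure hook, standing in for one trace event. */
typedef void (*sketch_err_info_cb)(const uint8_t *info, void *ctx);

/* Example hook: count the structures visited. */
static void sketch_count_err_info(const uint8_t *info, void *ctx)
{
	(void)info;
	(*(int *)ctx)++;
}

/* Visit each error information structure after the section header.
 * Returns the number visited, or -1 if the section is too short for
 * the advertised ERR_INFO_NUM. */
static int sketch_walk_err_info(const uint8_t *sec, size_t sec_len,
				sketch_err_info_cb cb, void *ctx)
{
	size_t need;
	unsigned int n, i;

	if (sec_len < SKETCH_ARM_SEC_HDR_LEN)
		return -1;

	/* ERR_INFO_NUM: little-endian u16 at byte offset 4 */
	n = (unsigned int)sec[4] | (unsigned int)sec[5] << 8;
	need = SKETCH_ARM_SEC_HDR_LEN + (size_t)n * SKETCH_ARM_ERR_INFO_LEN;
	if (need > sec_len)
		return -1;

	for (i = 0; i < n; i++)
		cb(sec + SKETCH_ARM_SEC_HDR_LEN +
		   (size_t)i * SKETCH_ARM_ERR_INFO_LEN, ctx);
	return (int)n;
}
```

The length check matters because ERR_INFO_NUM comes from firmware. A count larger than the section can hold would otherwise make the loop read past the buffer.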
Thanks,
Tyler
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-10 18:23 ` Baicar, Tyler
@ 2017-03-13 2:31 ` Xie XiuQi
2017-03-13 9:00 ` Xie XiuQi
2017-03-14 19:29 ` Baicar, Tyler
0 siblings, 2 replies; 30+ messages in thread
From: Xie XiuQi @ 2017-03-13 2:31 UTC (permalink / raw)
To: Baicar, Tyler, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe
Hi Baicar Tyler,
On 2017/3/11 2:23, Baicar, Tyler wrote:
> Hello Xie XiuQi,
>
>
> On 3/9/2017 2:41 AM, Xie XiuQi wrote:
>> On 2017/3/7 4:45, Tyler Baicar wrote:
>>> Currently there are trace events for the various RAS
>>> errors with the exception of ARM processor type errors.
>>> Add a new trace event for such errors so that the user
>>> will know when they occur. These trace events are
>>> consistent with the ARM processor error section type
>>> defined in UEFI 2.6 spec section N.2.4.4.
>>>
>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>> Acked-by: Steven Rostedt <rostedt@goodmis.org>
>>> ---
>>> drivers/acpi/apei/ghes.c | 8 +++++++-
>>> drivers/firmware/efi/cper.c | 1 +
>>> drivers/ras/ras.c | 1 +
>>> include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
>>> 4 files changed, 43 insertions(+), 1 deletion(-)
>
>>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>>> index 5861b6f..b36db48 100644
>>> --- a/include/ras/ras_event.h
>>> +++ b/include/ras/ras_event.h
>>> @@ -162,6 +162,40 @@
>>> );
>>> /*
>>> + * ARM Processor Events Report
>>> + *
>>> + * This event is generated when hardware detects an ARM processor error
>>> + * has occurred. UEFI 2.6 spec section N.2.4.4.
>>> + */
>>> +TRACE_EVENT(arm_event,
>>> +
>>> + TP_PROTO(const struct cper_sec_proc_arm *proc),
>>> +
>>> + TP_ARGS(proc),
>>> +
>>> + TP_STRUCT__entry(
>>> + __field(u64, mpidr)
>>> + __field(u64, midr)
>>> + __field(u32, running_state)
>>> + __field(u32, psci_state)
>>> + __field(u8, affinity)
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __entry->affinity = proc->affinity_level;
>>> + __entry->mpidr = proc->mpidr;
>>> + __entry->midr = proc->midr;
>>> + __entry->running_state = proc->running_state;
>>> + __entry->psci_state = proc->psci_state;
>>> + ),
>>> +
>>> + TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
>>> + "running state: %d; PSCI state: %d",
>>> + __entry->affinity, __entry->mpidr, __entry->midr,
>>> + __entry->running_state, __entry->psci_state)
>>> +);
>>> +
>> I think these fields are not enough, we need also export arm processor error
>> information (UEFI 2.6 spec section N.2.4.4.1), or at least the error type,
>> address, etc. So that the userspace (such as rasdaemon tool) could know what
>> error occurred.
>
> This is something I am planning on adding in later. It is not clear to me how to
> actually do this at this point. If you look at the spec, there is not a single
> error information structure. There is at least one, but possibly a lot. There is
> also an unknown amount of context information structures. In "Table 260. ARM Processor
> Error Section" there are ERR_INFO_NUM and CONTEXT_INFO_NUM which give the number of these
> structures. I think there will need to be separate trace events added in for each of
> these structures because I don't think there is a way to have variable amounts of
> structures inside of a trace event.
Yes, I agree.
Additionally, cper_sec_proc_arm has validation bits, which indicate whether each of the
fields in this section is valid. How could we show that in this trace event? If a field
is invalid, we would report a wrong value here.
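Two options come to mind: carry validation_bits in the event and let userspace decide, or replace fields that are not marked valid with a sentinel before recording them. The sketch below shows the second option. The bit positions (bit 0 MPIDR, bit 1 affinity level, bit 2 running state) are assumptions based on the ARM processor error section's validation bits in UEFI 2.6. Tying PSCI state to the running-state bit and using all-ones as the sentinel are arbitrary choices made for this sketch. struct sketch_arm_rec and sketch_mask_invalid() are hypothetical.

```c
#include <stdint.h>

/* Bit assignments assumed from the UEFI 2.6 ARM processor error section. */
#define SKETCH_ARM_VALID_MPIDR          (1u << 0)
#define SKETCH_ARM_VALID_AFFINITY_LEVEL (1u << 1)
#define SKETCH_ARM_VALID_RUNNING_STATE  (1u << 2)

/* Hypothetical trimmed copy of the fields arm_event records. */
struct sketch_arm_rec {
	uint32_t validation_bits;
	uint8_t affinity;
	uint64_t mpidr;
	uint32_t running_state;
	uint32_t psci_state;
};

/* Overwrite fields not marked valid with an all-ones sentinel. */
static void sketch_mask_invalid(struct sketch_arm_rec *r)
{
	if (!(r->validation_bits & SKETCH_ARM_VALID_MPIDR))
		r->mpidr = UINT64_MAX;
	if (!(r->validation_bits & SKETCH_ARM_VALID_AFFINITY_LEVEL))
		r->affinity = UINT8_MAX;
	/* Assumption for this sketch: PSCI state follows the running-state bit. */
	if (!(r->validation_bits & SKETCH_ARM_VALID_RUNNING_STATE)) {
		r->running_state = UINT32_MAX;
		r->psci_state = UINT32_MAX;
	}
}
```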
--
Thanks,
Xie XiuQi
>
> The ARM processor error section also has a vendor specific error info buffer which will need to be exposed to userspace. This may be something that can reuse the unknown section type trace event or have it's own trace event for.
>
> Thanks,
> Tyler
>
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-13 2:31 ` Xie XiuQi
@ 2017-03-13 9:00 ` Xie XiuQi
2017-03-13 13:58 ` Steven Rostedt
2017-03-14 19:29 ` Baicar, Tyler
1 sibling, 1 reply; 30+ messages in thread
From: Xie XiuQi @ 2017-03-13 9:00 UTC (permalink / raw)
To: Baicar, Tyler, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe
Cc: wangxiongfeng2, Guo Hanjun, Zhengqiang (turing)
Hi Baicar Tyler,
On 2017/3/13 10:31, Xie XiuQi wrote:
> Hi Baicar Tyler,
>
> On 2017/3/11 2:23, Baicar, Tyler wrote:
>> Hello Xie XiuQi,
>>
>>
>> On 3/9/2017 2:41 AM, Xie XiuQi wrote:
>>> On 2017/3/7 4:45, Tyler Baicar wrote:
>>>> Currently there are trace events for the various RAS
>>>> errors with the exception of ARM processor type errors.
>>>> Add a new trace event for such errors so that the user
>>>> will know when they occur. These trace events are
>>>> consistent with the ARM processor error section type
>>>> defined in UEFI 2.6 spec section N.2.4.4.
>>>>
>>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>>> Acked-by: Steven Rostedt <rostedt@goodmis.org>
>>>> ---
>>>> drivers/acpi/apei/ghes.c | 8 +++++++-
>>>> drivers/firmware/efi/cper.c | 1 +
>>>> drivers/ras/ras.c | 1 +
>>>> include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
>>>> 4 files changed, 43 insertions(+), 1 deletion(-)
>>
>>>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>>>> index 5861b6f..b36db48 100644
>>>> --- a/include/ras/ras_event.h
>>>> +++ b/include/ras/ras_event.h
>>>> @@ -162,6 +162,40 @@
>>>> );
>>>> /*
>>>> + * ARM Processor Events Report
>>>> + *
>>>> + * This event is generated when hardware detects an ARM processor error
>>>> + * has occurred. UEFI 2.6 spec section N.2.4.4.
>>>> + */
>>>> +TRACE_EVENT(arm_event,
>>>> +
>>>> + TP_PROTO(const struct cper_sec_proc_arm *proc),
>>>> +
>>>> + TP_ARGS(proc),
>>>> +
>>>> + TP_STRUCT__entry(
>>>> + __field(u64, mpidr)
>>>> + __field(u64, midr)
>>>> + __field(u32, running_state)
>>>> + __field(u32, psci_state)
>>>> + __field(u8, affinity)
>>>> + ),
>>>> +
>>>> + TP_fast_assign(
>>>> + __entry->affinity = proc->affinity_level;
>>>> + __entry->mpidr = proc->mpidr;
>>>> + __entry->midr = proc->midr;
>>>> + __entry->running_state = proc->running_state;
>>>> + __entry->psci_state = proc->psci_state;
>>>> + ),
>>>> +
>>>> + TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
>>>> + "running state: %d; PSCI state: %d",
>>>> + __entry->affinity, __entry->mpidr, __entry->midr,
>>>> + __entry->running_state, __entry->psci_state)
>>>> +);
>>>> +
>>> I think these fields are not enough, we need also export arm processor error
>>> information (UEFI 2.6 spec section N.2.4.4.1), or at least the error type,
>>> address, etc. So that the userspace (such as rasdaemon tool) could know what
>>> error occurred.
>>
>> This is something I am planning to add later. It is not clear to me how to
>> actually do this at this point. If you look at the spec, there is not a single
>> error information structure. There is at least one, but possibly many. There is
>> also an unknown number of context information structures. In "Table 260. ARM Processor
>> Error Section" there are ERR_INFO_NUM and CONTEXT_INFO_NUM which give the number of these
>> structures. I think there will need to be separate trace events added for each of
>> these structures because I don't think there is a way to have a variable number of
>> structures inside a trace event.
I have a patch below that adds a trace event to expose ARM processor error information
to user space. Would you take it into your series, or into a later series if possible?
Any comments are welcome.
This patch has only been compile-tested; I have no ARM box for testing right now.
If anyone can help test it, that would be greatly appreciated.
Thanks.
From e591570eecc6cd70e18d8f8ae75534b55a22f7ba Mon Sep 17 00:00:00 2001
From: Xie XiuQi <xiexiuqi@huawei.com>
Date: Mon, 13 Mar 2017 15:46:06 +0800
Subject: [PATCH] trace: ras: add ARM processor error information trace event
Add a new trace event for ARM processor error information, so that
the user will know what error occurred. With this information the
user may take appropriate action.
These trace events are consistent with the ARM processor error
information table defined in UEFI 2.6 spec section N.2.4.4.1.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
---
drivers/acpi/apei/ghes.c | 8 +++++
include/linux/cper.h | 5 +++
include/ras/ras_event.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 100 insertions(+)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 251d7e0..6d34c26 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -518,9 +518,17 @@ static void ghes_do_proc(struct ghes *ghes,
else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
trace_arm_event_enabled()) {
struct cper_sec_proc_arm *arm_err;
+ struct cper_arm_err_info *err_info;
+ int i;
arm_err = acpi_hest_generic_data_payload(gdata);
trace_arm_event(arm_err);
+
+ err_info = (struct cper_arm_err_info *)(arm_err + 1);
+ for (i = 0; i < arm_err->err_info_num; i++) {
+ trace_arm_proc_err(err_info);
+ err_info += 1;
+ }
} else if (trace_unknown_sec_event_enabled()) {
void *unknown_err = acpi_hest_generic_data_payload(gdata);
trace_unknown_sec_event(&sec_type,
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 85450f3..0cae900 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -270,6 +270,11 @@ enum {
#define CPER_ARM_INFO_VALID_VIRT_ADDR 0x0008
#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR 0x0010
+#define CPER_ARM_INFO_TYPE_CACHE 0
+#define CPER_ARM_INFO_TYPE_TLB 1
+#define CPER_ARM_INFO_TYPE_BUS 2
+#define CPER_ARM_INFO_TYPE_UARCH 3
+
#define CPER_ARM_INFO_FLAGS_FIRST 0x0001
#define CPER_ARM_INFO_FLAGS_LAST 0x0002
#define CPER_ARM_INFO_FLAGS_PROPAGATED 0x0004
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index b36db48..72c6a06 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -195,6 +195,93 @@
__entry->running_state, __entry->psci_state)
);
+#define ARM_PROC_ERR_TYPE \
+ EM ( CPER_ARM_INFO_TYPE_CACHE, "cache error" ) \
+ EM ( CPER_ARM_INFO_TYPE_TLB, "TLB error" ) \
+ EM ( CPER_ARM_INFO_TYPE_BUS, "bus error" ) \
+ EMe ( CPER_ARM_INFO_TYPE_UARCH, "micro-architectural error" )
+
+#define ARM_PROC_ERR_FLAGS \
+ EM ( CPER_ARM_INFO_FLAGS_FIRST, "First error captured" ) \
+ EM ( CPER_ARM_INFO_FLAGS_LAST, "Last error captured" ) \
+ EM ( CPER_ARM_INFO_FLAGS_PROPAGATED, "Propagated" ) \
+ EMe ( CPER_ARM_INFO_FLAGS_OVERFLOW, "Overflow" )
+
+/*
+ * First define the enums in ARM_PROC_ERR_TYPE and ARM_PROC_ERR_FLAGS
+ * to be exported to userspace via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+ARM_PROC_ERR_TYPE
+ARM_PROC_ERR_FLAGS
+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) { a, b },
+#define EMe(a, b) { a, b }
+
+TRACE_EVENT(arm_proc_error,
+
+ TP_PROTO(const struct cper_arm_err_info *err),
+
+ TP_ARGS(err),
+
+ TP_STRUCT__entry(
+ __field(u8, type)
+ __field(u16, multiple_error)
+ __field(u8, flags)
+ __field(u64, error_info)
+ __field(u64, virt_fault_addr)
+ __field(u64, physical_fault_addr)
+ ),
+
+ TP_fast_assign(
+ __entry->type = err->type;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_MULTI_ERR)
+ __entry->multiple_error = err->multiple_error;
+ else
+ __entry->multiple_error = ~0;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_FLAGS)
+ __entry->flags = err->flags;
+ else
+ __entry->flags = ~0;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+ __entry->error_info = err->error_info;
+ else
+ __entry->error_info = 0ULL;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+ __entry->virt_fault_addr = err->virt_fault_addr;
+ else
+ __entry->virt_fault_addr = 0ULL;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+ __entry->physical_fault_addr = err->physical_fault_addr;
+ else
+ __entry->physical_fault_addr = 0ULL;
+ ),
+
+ TP_printk("ARM Processor Error: type %s; count: %u; flags: %s;"
+ " error info: %016llx; virtual address: %016llx;"
+ " physical address: %016llx",
+ __print_symbolic(__entry->type, ARM_PROC_ERR_TYPE),
+ __entry->multiple_error,
+ __print_symbolic(__entry->flags, ARM_PROC_ERR_FLAGS),
+ __entry->error_info, __entry->virt_fault_addr,
+ __entry->physical_fault_addr)
+);
+
/*
* Unknown Section Report
*
--
1.8.3.1
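The EM()/EMe() redefinitions in the patch above are the "X-macro" idiom: one list macro is written once and expanded several times with different definitions of its element macros. A minimal userspace sketch of the same idea, outside the tracing framework; the names `DEMO_ERR_TYPE`, `demo_err_type_names` and `demo_err_type_str()` are hypothetical, and the numeric values simply mirror the CPER_ARM_INFO_TYPE_* defines added by this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values mirror the CPER_ARM_INFO_TYPE_* defines added by the patch. */
#define DEMO_TYPE_CACHE	0
#define DEMO_TYPE_TLB	1
#define DEMO_TYPE_BUS	2
#define DEMO_TYPE_UARCH	3

/* The (value, string) list is written exactly once. */
#define DEMO_ERR_TYPE					\
	EM(DEMO_TYPE_CACHE, "cache error")		\
	EM(DEMO_TYPE_TLB, "TLB error")			\
	EM(DEMO_TYPE_BUS, "bus error")			\
	EMe(DEMO_TYPE_UARCH, "micro-architectural error")

/* First expansion: count the entries (the kernel uses TRACE_DEFINE_ENUM here). */
#define EM(a, b)	+ 1
#define EMe(a, b)	+ 1
enum { DEMO_ERR_TYPE_COUNT = 0 DEMO_ERR_TYPE };
#undef EM
#undef EMe

/* Second expansion: build the value-to-string table. */
#define EM(a, b)	{ a, b },
#define EMe(a, b)	{ a, b }

struct demo_sym {
	unsigned long val;
	const char *name;
};

static const struct demo_sym demo_err_type_names[] = { DEMO_ERR_TYPE };
#undef EM
#undef EMe

/* Exact-value lookup, similar in spirit to __print_symbolic(). */
static const char *demo_err_type_str(unsigned long val)
{
	size_t i;

	for (i = 0; i < sizeof(demo_err_type_names) / sizeof(demo_err_type_names[0]); i++)
		if (demo_err_type_names[i].val == val)
			return demo_err_type_names[i].name;
	return "unknown";
}
```

Because both expansions come from the same list, adding a type to `DEMO_ERR_TYPE` updates the count and the table together. The list macro name handed to the printing helper has to be the one that was actually defined, otherwise the build fails with an undeclared identifier.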
>
> Yes, I agree.
>
> Additionally, cper_sec_proc_arm has validation bits, which indicate whether or not each of
> the fields in this section is valid. How could we show that in this trace event? If a field
> is invalid, we would get a wrong value here.
>
> --
> Thanks,
> Xie XiuQi
>
>>
>> The ARM processor error section also has a vendor-specific error info buffer which will need to be exposed to userspace. This may be something that can reuse the unknown section type trace event, or it may need its own trace event.
>>
>> Thanks,
>> Tyler
>>
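Tyler's point above is that the number of error information structures is only known at run time (ERR_INFO_NUM), and the ghes.c hunk in the patch walks them with plain pointer arithmetic from the end of the section header. A hedged userspace sketch of the same walk that also checks the claimed count against the section length, so a corrupt count cannot run past the buffer. The struct layouts are deliberately simplified stand-ins, not the kernel's `cper_sec_proc_arm`/`cper_arm_err_info`, and `demo_walk_err_info()` is hypothetical; the real records also carry a per-record length field, whereas this sketch assumes fixed-size records:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the section header: only the count matters here. */
struct demo_sec_hdr {
	uint32_t validation_bits;
	uint16_t err_info_num;
	uint16_t context_info_num;
	uint32_t section_length;
};

/* Simplified stand-in for one error information record. */
struct demo_err_info {
	uint8_t  version;
	uint8_t  length;
	uint16_t validation_bits;
	uint8_t  type;
	uint8_t  flags;
	uint16_t multiple_error;
	uint64_t error_info;
};

/*
 * Visit each error information record following the header (visit may be
 * NULL). Returns the number of records walked, or -1 if the buffer is too
 * short for the header or for the number of records the header claims.
 */
static int demo_walk_err_info(const uint8_t *buf, size_t len,
			      void (*visit)(const struct demo_err_info *))
{
	struct demo_sec_hdr hdr;
	struct demo_err_info info;
	size_t off, i;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));	/* memcpy avoids alignment issues */

	/* Reject counts that would read past the end of the section. */
	if (hdr.err_info_num > (len - sizeof(hdr)) / sizeof(info))
		return -1;

	off = sizeof(hdr);
	for (i = 0; i < hdr.err_info_num; i++) {
		memcpy(&info, buf + off, sizeof(info));
		if (visit)
			visit(&info);
		off += sizeof(info);
	}
	return (int)hdr.err_info_num;
}
```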
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-13 9:00 ` Xie XiuQi
@ 2017-03-13 13:58 ` Steven Rostedt
2017-03-14 9:35 ` Xie XiuQi
0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2017-03-13 13:58 UTC (permalink / raw)
To: Xie XiuQi
Cc: Baicar, Tyler, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, bristot,
linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe, wangxiongfeng2,
Guo Hanjun, Zhengqiang (turing)
On Mon, 13 Mar 2017 17:00:59 +0800
Xie XiuQi <xiexiuqi@huawei.com> wrote:
> ---
> drivers/acpi/apei/ghes.c | 8 +++++
> include/linux/cper.h | 5 +++
> include/ras/ras_event.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 100 insertions(+)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 251d7e0..6d34c26 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -518,9 +518,17 @@ static void ghes_do_proc(struct ghes *ghes,
> else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
> trace_arm_event_enabled()) {
> struct cper_sec_proc_arm *arm_err;
> + struct cper_arm_err_info *err_info;
> + int i;
>
> arm_err = acpi_hest_generic_data_payload(gdata);
> trace_arm_event(arm_err);
> +
if (trace_arm_proc_err_enabled()) {
> + err_info = (struct cper_arm_err_info *)(arm_err + 1);
> + for (i = 0; i < arm_err->err_info_num; i++) {
> + trace_arm_proc_err(err_info);
> + err_info += 1;
> + }
}
-- Steve
> } else if (trace_unknown_sec_event_enabled()) {
> void *unknown_err = acpi_hest_generic_data_payload(gdata);
> trace_unknown_sec_event(&sec_type,
> diff --git a/include/linux/cper.h b/include/linux/cper.h
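Steven's suggestion is to test once, before the loop, whether the per-record event is enabled, so nothing is walked when only `arm_event` is being traced. The tracepoint call itself is already cheap when disabled; the gain is skipping the loop and pointer arithmetic. A minimal sketch of that control flow, with a plain `bool` standing in for the `trace_arm_proc_err_enabled()` static-key check and a hypothetical `demo_emit()` standing in for `trace_arm_proc_err()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for trace_arm_proc_err_enabled(); in the kernel this is a static key. */
static bool demo_proc_err_enabled;

/* Counts emitted records, in place of calling trace_arm_proc_err(). */
static size_t demo_emitted;

static void demo_emit(const int *rec)
{
	(void)rec;
	demo_emitted++;
}

static void demo_report(const int *recs, size_t n)
{
	size_t i;

	/* Check once: when the event is off, skip the walk entirely. */
	if (!demo_proc_err_enabled)
		return;

	for (i = 0; i < n; i++)
		demo_emit(&recs[i]);
}
```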
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-13 13:58 ` Steven Rostedt
@ 2017-03-14 9:35 ` Xie XiuQi
0 siblings, 0 replies; 30+ messages in thread
From: Xie XiuQi @ 2017-03-14 9:35 UTC (permalink / raw)
To: Steven Rostedt, Baicar, Tyler
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, bristot, linux-arm-kernel, kvmarm,
kvm, linux-kernel, linux-acpi, linux-efi, devel, Suzuki.Poulose,
punit.agrawal, astone, harba, hanjun.guo, john.garry, shiju.jose,
joe, wangxiongfeng2, Guo Hanjun, Zhengqiang (turing)
Hi Steven,
Thanks for your comments. I've changed it in v2 as you suggested.
From c29c14c3b960f55beb4c4b22b7aced64fa7daf9a Mon Sep 17 00:00:00 2001
From: Xie XiuQi <xiexiuqi@huawei.com>
Date: Mon, 13 Mar 2017 15:46:06 +0800
Subject: [PATCH v2] trace: ras: add ARM processor error information trace event
Add a new trace event for ARM processor error information, so that
the user will know what error occurred. With this information the
user may take appropriate action.
These trace events are consistent with the ARM processor error
information table defined in UEFI 2.6 spec section N.2.4.4.1.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
---
v2: add a trace-enabled check, as Steven suggested.
    fix a typo in the trace event name (arm_proc_error -> arm_proc_err).
---
drivers/acpi/apei/ghes.c | 10 ++++++
include/linux/cper.h | 5 +++
include/ras/ras_event.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 102 insertions(+)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 251d7e0..bcb8160 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -518,9 +518,19 @@ static void ghes_do_proc(struct ghes *ghes,
else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
trace_arm_event_enabled()) {
struct cper_sec_proc_arm *arm_err;
+ struct cper_arm_err_info *err_info;
+ int i;
arm_err = acpi_hest_generic_data_payload(gdata);
trace_arm_event(arm_err);
+
+ if (trace_arm_proc_err_enabled()) {
+ err_info = (struct cper_arm_err_info *)(arm_err + 1);
+ for (i = 0; i < arm_err->err_info_num; i++) {
+ trace_arm_proc_err(err_info);
+ err_info += 1;
+ }
+ }
} else if (trace_unknown_sec_event_enabled()) {
void *unknown_err = acpi_hest_generic_data_payload(gdata);
trace_unknown_sec_event(&sec_type,
diff --git a/include/linux/cper.h b/include/linux/cper.h
index 85450f3..0cae900 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -270,6 +270,11 @@ enum {
#define CPER_ARM_INFO_VALID_VIRT_ADDR 0x0008
#define CPER_ARM_INFO_VALID_PHYSICAL_ADDR 0x0010
+#define CPER_ARM_INFO_TYPE_CACHE 0
+#define CPER_ARM_INFO_TYPE_TLB 1
+#define CPER_ARM_INFO_TYPE_BUS 2
+#define CPER_ARM_INFO_TYPE_UARCH 3
+
#define CPER_ARM_INFO_FLAGS_FIRST 0x0001
#define CPER_ARM_INFO_FLAGS_LAST 0x0002
#define CPER_ARM_INFO_FLAGS_PROPAGATED 0x0004
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index b36db48..0f9f46e 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -195,6 +195,93 @@
__entry->running_state, __entry->psci_state)
);
+#define ARM_PROC_ERR_TYPE \
+ EM ( CPER_ARM_INFO_TYPE_CACHE, "cache error" ) \
+ EM ( CPER_ARM_INFO_TYPE_TLB, "TLB error" ) \
+ EM ( CPER_ARM_INFO_TYPE_BUS, "bus error" ) \
+ EMe ( CPER_ARM_INFO_TYPE_UARCH, "micro-architectural error" )
+
+#define ARM_PROC_ERR_FLAGS \
+ EM ( CPER_ARM_INFO_FLAGS_FIRST, "First error captured" ) \
+ EM ( CPER_ARM_INFO_FLAGS_LAST, "Last error captured" ) \
+ EM ( CPER_ARM_INFO_FLAGS_PROPAGATED, "Propagated" ) \
+ EMe ( CPER_ARM_INFO_FLAGS_OVERFLOW, "Overflow" )
+
+/*
+ * First define the enums in ARM_PROC_ERR_TYPE and ARM_PROC_ERR_FLAGS
+ * to be exported to userspace via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b) TRACE_DEFINE_ENUM(a);
+
+ARM_PROC_ERR_TYPE
+ARM_PROC_ERR_FLAGS
+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) { a, b },
+#define EMe(a, b) { a, b }
+
+TRACE_EVENT(arm_proc_err,
+
+ TP_PROTO(const struct cper_arm_err_info *err),
+
+ TP_ARGS(err),
+
+ TP_STRUCT__entry(
+ __field(u8, type)
+ __field(u16, multiple_error)
+ __field(u8, flags)
+ __field(u64, error_info)
+ __field(u64, virt_fault_addr)
+ __field(u64, physical_fault_addr)
+ ),
+
+ TP_fast_assign(
+ __entry->type = err->type;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_MULTI_ERR)
+ __entry->multiple_error = err->multiple_error;
+ else
+ __entry->multiple_error = ~0;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_FLAGS)
+ __entry->flags = err->flags;
+ else
+ __entry->flags = ~0;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+ __entry->error_info = err->error_info;
+ else
+ __entry->error_info = 0ULL;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+ __entry->virt_fault_addr = err->virt_fault_addr;
+ else
+ __entry->virt_fault_addr = 0ULL;
+
+ if (err->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+ __entry->physical_fault_addr = err->physical_fault_addr;
+ else
+ __entry->physical_fault_addr = 0ULL;
+ ),
+
+ TP_printk("ARM Processor Error: type %s; count: %u; flags: %s;"
+ " error info: %016llx; virtual address: %016llx;"
+ " physical address: %016llx",
+ __print_symbolic(__entry->type, ARM_PROC_ERR_TYPE),
+ __entry->multiple_error,
+ __print_symbolic(__entry->flags, ARM_PROC_ERR_FLAGS),
+ __entry->error_info, __entry->virt_fault_addr,
+ __entry->physical_fault_addr)
+);
+
/*
* Unknown Section Report
*
--
1.8.3.1
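The CPER_ARM_INFO_FLAGS_* values are individual bits (0x0001, 0x0002, 0x0004, ...), so more than one can be set in a single record. The kernel tracing helper for that case is `__print_flags()`, whereas `__print_symbolic()` matches only a single exact value. A userspace sketch of bitmask-style decoding, assuming CPER_ARM_INFO_FLAGS_OVERFLOW is 0x0008 (its definition is not shown in these hunks); `demo_decode_flags()` and the DEMO_* names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* FIRST/LAST/PROPAGATED as in include/linux/cper.h; OVERFLOW assumed to be 0x0008. */
#define DEMO_FLAG_FIRST		0x0001u
#define DEMO_FLAG_LAST		0x0002u
#define DEMO_FLAG_PROPAGATED	0x0004u
#define DEMO_FLAG_OVERFLOW	0x0008u

static const struct {
	unsigned int bit;
	const char *name;
} demo_flag_names[] = {
	{ DEMO_FLAG_FIRST, "First error captured" },
	{ DEMO_FLAG_LAST, "Last error captured" },
	{ DEMO_FLAG_PROPAGATED, "Propagated" },
	{ DEMO_FLAG_OVERFLOW, "Overflow" },
};

/*
 * Write the name of every set bit into 'out' (len > 0), separated by '|'.
 * Leftover unknown bits are appended in hex, roughly as __print_flags()
 * does. Output is silently truncated if 'out' is too small.
 */
static void demo_decode_flags(unsigned int flags, char *out, size_t len)
{
	size_t i, used = 0;
	int n;

	out[0] = '\0';
	for (i = 0; i < sizeof(demo_flag_names) / sizeof(demo_flag_names[0]); i++) {
		if (!(flags & demo_flag_names[i].bit))
			continue;
		flags &= ~demo_flag_names[i].bit;
		n = snprintf(out + used, len - used, "%s%s",
			     used ? "|" : "", demo_flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return;	/* truncated: stop here */
		used += (size_t)n;
	}
	if (flags)
		snprintf(out + used, len - used, "%s0x%x", used ? "|" : "", flags);
}
```

One design consequence worth weighing: with a `~0` "invalid" sentinel stored in a u8 flags field, a bitmask decoder would report every flag plus leftover bits, so validity may need to be signalled separately.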
On 2017/3/13 21:58, Steven Rostedt wrote:
> On Mon, 13 Mar 2017 17:00:59 +0800
> Xie XiuQi <xiexiuqi@huawei.com> wrote:
>
>> ---
>> drivers/acpi/apei/ghes.c | 8 +++++
>> include/linux/cper.h | 5 +++
>> include/ras/ras_event.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 100 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 251d7e0..6d34c26 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -518,9 +518,17 @@ static void ghes_do_proc(struct ghes *ghes,
>> else if (!uuid_le_cmp(sec_type, CPER_SEC_PROC_ARM) &&
>> trace_arm_event_enabled()) {
>> struct cper_sec_proc_arm *arm_err;
>> + struct cper_arm_err_info *err_info;
>> + int i;
>>
>> arm_err = acpi_hest_generic_data_payload(gdata);
>> trace_arm_event(arm_err);
>> +
>
> if (trace_arm_proc_err_enabled()) {
>
>> + err_info = (struct cper_arm_err_info *)(arm_err + 1);
>> + for (i = 0; i < arm_err->err_info_num; i++) {
>> + trace_arm_proc_err(err_info);
>> + err_info += 1;
>> + }
>
> }
>
> -- Steve
>
>> } else if (trace_unknown_sec_event_enabled()) {
>> void *unknown_err = acpi_hest_generic_data_payload(gdata);
>> trace_unknown_sec_event(&sec_type,
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
--
Thanks,
Xie XiuQi
* Re: [PATCH V12 09/10] trace, ras: add ARM processor error trace event
2017-03-13 2:31 ` Xie XiuQi
2017-03-13 9:00 ` Xie XiuQi
@ 2017-03-14 19:29 ` Baicar, Tyler
1 sibling, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-14 19:29 UTC (permalink / raw)
To: Xie XiuQi, christoffer.dall, marc.zyngier, pbonzini, rkrcmar,
linux, catalin.marinas, will.deacon, rjw, lenb, matt,
robert.moore, lv.zheng, nkaje, zjzhang, mark.rutland,
james.morse, akpm, eun.taik.lee, sandeepa.s.prabhu, labbott,
shijie.huang, rruigrok, paul.gortmaker, tn, fu.wei, rostedt,
bristot, linux-arm-kernel, kvmarm, kvm, linux-kernel, linux-acpi,
linux-efi, devel, Suzuki.Poulose, punit.agrawal, astone, harba,
hanjun.guo, john.garry, shiju.jose, joe
Hello Xie XiuQi,
On 3/12/2017 8:31 PM, Xie XiuQi wrote:
> Hi Baicar Tyler,
>
> On 2017/3/11 2:23, Baicar, Tyler wrote:
>> Hello Xie XiuQi,
>>
>>
>> On 3/9/2017 2:41 AM, Xie XiuQi wrote:
>>> On 2017/3/7 4:45, Tyler Baicar wrote:
>>>> Currently there are trace events for the various RAS
>>>> errors with the exception of ARM processor type errors.
>>>> Add a new trace event for such errors so that the user
>>>> will know when they occur. These trace events are
>>>> consistent with the ARM processor error section type
>>>> defined in UEFI 2.6 spec section N.2.4.4.
>>>>
>>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>>> Acked-by: Steven Rostedt <rostedt@goodmis.org>
>>>> ---
>>>> drivers/acpi/apei/ghes.c | 8 +++++++-
>>>> drivers/firmware/efi/cper.c | 1 +
>>>> drivers/ras/ras.c | 1 +
>>>> include/ras/ras_event.h | 34 ++++++++++++++++++++++++++++++++++
>>>> 4 files changed, 43 insertions(+), 1 deletion(-)
>>>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>>>> index 5861b6f..b36db48 100644
>>>> --- a/include/ras/ras_event.h
>>>> +++ b/include/ras/ras_event.h
>>>> @@ -162,6 +162,40 @@
>>>> );
>>>> /*
>>>> + * ARM Processor Events Report
>>>> + *
>>>> + * This event is generated when hardware detects an ARM processor error
>>>> + * has occurred. UEFI 2.6 spec section N.2.4.4.
>>>> + */
>>>> +TRACE_EVENT(arm_event,
>>>> +
>>>> + TP_PROTO(const struct cper_sec_proc_arm *proc),
>>>> +
>>>> + TP_ARGS(proc),
>>>> +
>>>> + TP_STRUCT__entry(
>>>> + __field(u64, mpidr)
>>>> + __field(u64, midr)
>>>> + __field(u32, running_state)
>>>> + __field(u32, psci_state)
>>>> + __field(u8, affinity)
>>>> + ),
>>>> +
>>>> + TP_fast_assign(
>>>> + __entry->affinity = proc->affinity_level;
>>>> + __entry->mpidr = proc->mpidr;
>>>> + __entry->midr = proc->midr;
>>>> + __entry->running_state = proc->running_state;
>>>> + __entry->psci_state = proc->psci_state;
>>>> + ),
>>>> +
>>>> + TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
>>>> + "running state: %d; PSCI state: %d",
>>>> + __entry->affinity, __entry->mpidr, __entry->midr,
>>>> + __entry->running_state, __entry->psci_state)
>>>> +);
>>>> +
>>> I think these fields are not enough; we also need to export the ARM processor error
>>> information (UEFI 2.6 spec section N.2.4.4.1), or at least the error type,
>>> address, etc., so that userspace (such as the rasdaemon tool) can know what
>>> error occurred.
>> This is something I am planning to add later. It is not clear to me how to
>> actually do this at this point. If you look at the spec, there is not a single
>> error information structure. There is at least one, but possibly many. There is
>> also an unknown number of context information structures. In "Table 260. ARM Processor
>> Error Section" there are ERR_INFO_NUM and CONTEXT_INFO_NUM which give the number of these
>> structures. I think there will need to be separate trace events added for each of
>> these structures because I don't think there is a way to have a variable number of
>> structures inside a trace event.
> Yes, I agree.
>
> Additionally, cper_sec_proc_arm has validation bits, which indicate whether or not each of
> the fields in this section is valid. How could we show that in this trace event? If a field
> is invalid, we would get a wrong value here.
>
I will add in checks for whether the fields are valid similar to what
you did for the error info patch.
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
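One way to act on the section-level validation bits is the pattern Xie's error info patch already uses: copy a field only when its bit is set and store a sentinel otherwise. A hedged sketch for the `arm_event` fields; the bit positions are assumed to follow the UEFI 2.6 ARM processor error section (MPIDR, affinity level, running state), gating `psci_state` on the running-state bit is an assumption of this sketch, and all struct and macro names are hypothetical stand-ins rather than the kernel's definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed validation bit layout; check against the spec before use. */
#define DEMO_VALID_MPIDR		(1u << 0)
#define DEMO_VALID_AFFINITY_LEVEL	(1u << 1)
#define DEMO_VALID_RUNNING_STATE	(1u << 2)

/* Simplified stand-in for the firmware-provided section header. */
struct demo_proc_arm {
	uint32_t validation_bits;
	uint8_t  affinity_level;
	uint64_t mpidr;
	uint64_t midr;
	uint32_t running_state;
	uint32_t psci_state;
};

/* What the trace event would record. */
struct demo_arm_entry {
	uint8_t  affinity;
	uint64_t mpidr;
	uint64_t midr;
	uint32_t running_state;
	uint32_t psci_state;
};

static void demo_fill_entry(struct demo_arm_entry *e,
			    const struct demo_proc_arm *p)
{
	/* Sentinels mark fields the firmware did not flag as valid. */
	e->affinity = (p->validation_bits & DEMO_VALID_AFFINITY_LEVEL) ?
		      p->affinity_level : (uint8_t)~0;
	e->mpidr = (p->validation_bits & DEMO_VALID_MPIDR) ? p->mpidr : 0;

	/* MIDR is copied unconditionally (assumed to have no validation bit). */
	e->midr = p->midr;

	/* Assumption: psci_state shares the running-state bit in this sketch. */
	if (p->validation_bits & DEMO_VALID_RUNNING_STATE) {
		e->running_state = p->running_state;
		e->psci_state = p->psci_state;
	} else {
		e->running_state = ~0u;
		e->psci_state = ~0u;
	}
}
```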
* [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (8 preceding siblings ...)
2017-03-06 20:45 ` [PATCH V12 09/10] trace, ras: add ARM processor error trace event Tyler Baicar
@ 2017-03-06 20:45 ` Tyler Baicar
2017-03-07 11:48 ` James Morse
2017-03-07 11:37 ` [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 James Morse
10 siblings, 1 reply; 30+ messages in thread
From: Tyler Baicar @ 2017-03-06 20:45 UTC (permalink / raw)
To: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, james.morse, akpm,
eun.taik.lee, sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Cc: Tyler Baicar
Currently external aborts are unsupported by the guest abort
handling. Add handling for SEAs so that the host kernel reports
SEAs which occur in the guest kernel.
Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
arch/arm/include/asm/kvm_arm.h | 10 ++++++++++
arch/arm/include/asm/system_misc.h | 5 +++++
arch/arm/kvm/mmu.c | 36 ++++++++++++++++++++++++++++++++++--
arch/arm64/include/asm/kvm_arm.h | 10 ++++++++++
arch/arm64/include/asm/system_misc.h | 2 ++
arch/arm64/mm/fault.c | 18 ++++++++++++++++++
6 files changed, 79 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index e22089f..a1a3dff 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -187,6 +187,16 @@
#define FSC_FAULT (0x04)
#define FSC_ACCESS (0x08)
#define FSC_PERM (0x0c)
+#define FSC_SEA (0x10)
+#define FSC_SEA_TTW0 (0x14)
+#define FSC_SEA_TTW1 (0x15)
+#define FSC_SEA_TTW2 (0x16)
+#define FSC_SEA_TTW3 (0x17)
+#define FSC_SECC (0x18)
+#define FSC_SECC_TTW0 (0x1c)
+#define FSC_SECC_TTW1 (0x1d)
+#define FSC_SECC_TTW2 (0x1e)
+#define FSC_SECC_TTW3 (0x1f)
/* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
#define HPFAR_MASK (~0xf)
diff --git a/arch/arm/include/asm/system_misc.h b/arch/arm/include/asm/system_misc.h
index a3d61ad..ea45d94 100644
--- a/arch/arm/include/asm/system_misc.h
+++ b/arch/arm/include/asm/system_misc.h
@@ -24,4 +24,9 @@
#endif /* !__ASSEMBLY__ */
+static inline int handle_guest_sea(unsigned long addr, unsigned int esr)
+{
+ return -1;
+}
+
#endif /* __ASM_ARM_SYSTEM_MISC_H */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index a5265ed..f3608c9 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -29,6 +29,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/virt.h>
+#include <asm/system_misc.h>
#include "trace.h"
@@ -1409,6 +1410,24 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
kvm_set_pfn_accessed(pfn);
}
+static bool is_abort_synchronous(unsigned long fault_status) {
+ switch (fault_status) {
+ case FSC_SEA:
+ case FSC_SEA_TTW0:
+ case FSC_SEA_TTW1:
+ case FSC_SEA_TTW2:
+ case FSC_SEA_TTW3:
+ case FSC_SECC:
+ case FSC_SECC_TTW0:
+ case FSC_SECC_TTW1:
+ case FSC_SECC_TTW2:
+ case FSC_SECC_TTW3:
+ return true;
+ default:
+ return false;
+ }
+}
+
/**
* kvm_handle_guest_abort - handles all 2nd stage aborts
* @vcpu: the VCPU pointer
@@ -1444,8 +1463,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
/* Check the stage-2 fault is trans. fault or write fault */
fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
- if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
- fault_status != FSC_ACCESS) {
+
+ /* The host kernel will handle the synchronous external abort. There
+ * is no need to pass the error into the guest.
+ */
+ if (is_abort_synchronous(fault_status)) {
+ if(handle_guest_sea((unsigned long)fault_ipa,
+ kvm_vcpu_get_hsr(vcpu))) {
+ kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
+ kvm_vcpu_trap_get_class(vcpu),
+ (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
+ (unsigned long)kvm_vcpu_get_hsr(vcpu));
+ return -EFAULT;
+ }
+ } else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
+ fault_status != FSC_ACCESS) {
kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
kvm_vcpu_trap_get_class(vcpu),
(unsigned long)kvm_vcpu_trap_get_fault(vcpu),
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 2a2752b..dcacdbb 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -201,6 +201,16 @@
#define FSC_FAULT ESR_ELx_FSC_FAULT
#define FSC_ACCESS ESR_ELx_FSC_ACCESS
#define FSC_PERM ESR_ELx_FSC_PERM
+#define FSC_SEA ESR_ELx_FSC_EXTABT
+#define FSC_SEA_TTW0 (0x14)
+#define FSC_SEA_TTW1 (0x15)
+#define FSC_SEA_TTW2 (0x16)
+#define FSC_SEA_TTW3 (0x17)
+#define FSC_SECC (0x18)
+#define FSC_SECC_TTW0 (0x1c)
+#define FSC_SECC_TTW1 (0x1d)
+#define FSC_SECC_TTW2 (0x1e)
+#define FSC_SECC_TTW3 (0x1f)
/* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
#define HPFAR_MASK (~UL(0xf))
diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index bc81243..5b2cecd1 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -58,4 +58,6 @@ void hook_debug_fault_code(int nr, int (*fn)(unsigned long, unsigned int,
#endif /* __ASSEMBLY__ */
+int handle_guest_sea(unsigned long addr, unsigned int esr);
+
#endif /* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b2d57fc..31c5171 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -602,6 +602,24 @@ static const char *fault_name(unsigned int esr)
}
/*
+ * Handle Synchronous External Aborts that occur in a guest kernel.
+ */
+int handle_guest_sea(unsigned long addr, unsigned int esr)
+{
+ /*
+ * synchronize_rcu() will wait for nmi_exit(), so no need to
+ * rcu_read_lock().
+ */
+ if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
+ rcu_read_lock();
+ ghes_notify_sea();
+ rcu_read_unlock();
+ }
+
+ return 0;
+}
+
+/*
* Dispatch a data abort to the relevant handler.
*/
asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
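`IS_ENABLED()` from include/linux/kconfig.h evaluates to 1 when the given Kconfig option is set to y or m, and it expects the full symbol name including the `CONFIG_` prefix. Handed any other token it does not fail to build; it quietly evaluates to 0. The userspace sketch below reproduces the placeholder trick behind the kernel macro in simplified form (built-in half only, with an extra trailing argument for strict C99 conformance) to show that effect; `DEMO_IS_BUILTIN` and `CONFIG_DEMO_SEA` are hypothetical:

```c
#include <assert.h>

/*
 * A symbol defined to 1 pastes into DEMO_ARG_PLACEHOLDER_1, which expands
 * to "0,", shifting the literal 1 into the "val" slot. Any other token
 * pastes into an undefined identifier and leaves 0 in that slot.
 */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second_arg(ignored, val, ...) val
#define demo_is_defined(x) demo_is_defined_(x)
#define demo_is_defined_(val) demo_is_defined__(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined__(arg1_or_junk) demo_take_second_arg(arg1_or_junk 1, 0, 0)
#define DEMO_IS_BUILTIN(option) demo_is_defined(option)

/* Hypothetical Kconfig output for "DEMO_SEA=y". */
#define CONFIG_DEMO_SEA 1

/* Full symbol name: evaluates to 1. */
static const int demo_with_prefix = DEMO_IS_BUILTIN(CONFIG_DEMO_SEA);

/* Prefix missing: DEMO_SEA is not a macro, so this quietly evaluates to 0. */
static const int demo_without_prefix = DEMO_IS_BUILTIN(DEMO_SEA);
```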
* Re: [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support
2017-03-06 20:45 ` [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support Tyler Baicar
@ 2017-03-07 11:48 ` James Morse
2017-03-07 17:58 ` Baicar, Tyler
0 siblings, 1 reply; 30+ messages in thread
From: James Morse @ 2017-03-07 11:48 UTC (permalink / raw)
To: Tyler Baicar
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hi Tyler,
On 06/03/17 20:45, Tyler Baicar wrote:
> Currently external aborts are unsupported by the guest abort
> handling. Add handling for SEAs so that the host kernel reports
> SEAs which occur in the guest kernel.
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index e22089f..a1a3dff 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -187,6 +187,16 @@
> #define FSC_FAULT (0x04)
> #define FSC_ACCESS (0x08)
> #define FSC_PERM (0x0c)
> +#define FSC_SEA (0x10)
> +#define FSC_SEA_TTW0 (0x14)
> +#define FSC_SEA_TTW1 (0x15)
> +#define FSC_SEA_TTW2 (0x16)
> +#define FSC_SEA_TTW3 (0x17)
> +#define FSC_SECC (0x18)
> +#define FSC_SECC_TTW0 (0x1c)
AArch32 doesn't have either of these 'TTW0' values; they are unused encodings.
(However ...)
> +#define FSC_SECC_TTW1 (0x1d)
> +#define FSC_SECC_TTW2 (0x1e)
> +#define FSC_SECC_TTW3 (0x1f)
>
> /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
> #define HPFAR_MASK (~0xf)
> #endif /* __ASM_ARM_SYSTEM_MISC_H */
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index a5265ed..f3608c9 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1444,8 +1463,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>
> /* Check the stage-2 fault is trans. fault or write fault */
> fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
> - if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
> - fault_status != FSC_ACCESS) {
> +
> + /* The host kernel will handle the synchronous external abort. There
> + * is no need to pass the error into the guest.
> + */
> + if (is_abort_synchronous(fault_status)) {
> + if(handle_guest_sea((unsigned long)fault_ipa,
> + kvm_vcpu_get_hsr(vcpu))) {
... Looking further up in this function:
> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
> kvm_inject_vabt(vcpu);
> return 1;
> }
... so external data aborts will have already been 'claimed' by kvm and dealt
with, and we already have a helper for spotting external aborts. (sorry I didn't
spot it earlier).
We need to do the handle_guest_sea() before this code.
kvm_inject_vabt() makes an SError interrupt pending for the guest. This makes a
synchronous error asynchronous as the guest may have SError interrupts masked.
I guess this was the best that could be done at the time of (4055710baca8
"arm/arm64: KVM: Inject virtual abort when guest exits on external abort"), but
in the light of this firmware-first handling, I'm not sure it's the right thing
to do.
Is it possible for handle_guest_sea() to return whether it actually found any
work to do? If there was none I think we should keep this kvm_inject_vabt() as
it is the existing behaviour.
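The suggestion above amounts to a decision order: offer an external abort to firmware-first (GHES) handling before the existing `kvm_inject_vabt()` path, and fall back to injecting the virtual SError only when that handler reports it found nothing to do. A minimal sketch of that order, with hypothetical stand-ins (`demo_handle_guest_sea()`, an outcome enum) in place of the real KVM and APEI functions; the "found work" return convention is exactly what is being asked about, so it is assumed here:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_outcome {
	DEMO_HANDLED_BY_FIRMWARE_FIRST,	/* GHES found and processed a record */
	DEMO_INJECTED_VABT,		/* existing behaviour: pend a virtual SError */
	DEMO_NORMAL_STAGE2_FAULT,	/* not an external abort */
};

/* Stand-in for handle_guest_sea(): assumed to return true if it found work. */
static bool demo_handle_guest_sea(bool firmware_has_record)
{
	return firmware_has_record;
}

static enum demo_outcome demo_guest_abort(bool is_ext_abort,
					  bool firmware_has_record)
{
	if (!is_ext_abort)
		return DEMO_NORMAL_STAGE2_FAULT;

	/* Ask firmware-first handling before KVM claims the abort. */
	if (demo_handle_guest_sea(firmware_has_record))
		return DEMO_HANDLED_BY_FIRMWARE_FIRST;

	/* Nothing to do there: keep the existing kvm_inject_vabt() path. */
	return DEMO_INJECTED_VABT;
}
```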
> + kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
> + kvm_vcpu_trap_get_class(vcpu),
> + (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
> + (unsigned long)kvm_vcpu_get_hsr(vcpu));
> + return -EFAULT;
> + }
> + } else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
> + fault_status != FSC_ACCESS) {
> kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
> kvm_vcpu_trap_get_class(vcpu),
> (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index b2d57fc..31c5171 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -602,6 +602,24 @@ static const char *fault_name(unsigned int esr)
> }
>
> /*
> + * Handle Synchronous External Aborts that occur in a guest kernel.
> + */
> +int handle_guest_sea(unsigned long addr, unsigned int esr)
> +{
> + /*
> + * synchronize_rcu() will wait for nmi_exit(), so no need to
> + * rcu_read_lock().
> + */
This comment has a life of its own! Given that we don't always call ghes_notify_sea()
when we have interrupted uninterruptible code, it's not always true. I think the
rcu_read_{,un}lock() should go around the list walk (so it looks like the
examples), and ditch the comment!
> + if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> + rcu_read_lock();
> + ghes_notify_sea();
> + rcu_read_unlock();
> + }
> +
> + return 0;
> +}
> +
> +/*
> * Dispatch a data abort to the relevant handler.
> */
> asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
>
Thanks,
James
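The `is_abort_synchronous()` helper in the patch is a set-membership test on the fault status code. A userspace sketch of the same classification that also reflects the AArch32 point raised in the review: the values come from the FSC_* defines in the patch, and the AArch32 variant simply excludes the two TTW0 encodings (0x14 and 0x1c) that the review says AArch32 does not use. Function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the FSC_* definitions added by the patch. */
#define DEMO_FSC_SEA		0x10
#define DEMO_FSC_SEA_TTW0	0x14
#define DEMO_FSC_SEA_TTW3	0x17
#define DEMO_FSC_SECC		0x18
#define DEMO_FSC_SECC_TTW0	0x1c
#define DEMO_FSC_SECC_TTW3	0x1f

static bool demo_is_sea_arm64(unsigned long fsc)
{
	return fsc == DEMO_FSC_SEA || fsc == DEMO_FSC_SECC ||
	       (fsc >= DEMO_FSC_SEA_TTW0 && fsc <= DEMO_FSC_SEA_TTW3) ||
	       (fsc >= DEMO_FSC_SECC_TTW0 && fsc <= DEMO_FSC_SECC_TTW3);
}

static bool demo_is_sea_aarch32(unsigned long fsc)
{
	/* Per the review, the TTW0 encodings are unused on AArch32. */
	if (fsc == DEMO_FSC_SEA_TTW0 || fsc == DEMO_FSC_SECC_TTW0)
		return false;
	return demo_is_sea_arm64(fsc);
}
```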
* Re: [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support
2017-03-07 11:48 ` James Morse
@ 2017-03-07 17:58 ` Baicar, Tyler
2017-03-08 16:09 ` James Morse
0 siblings, 1 reply; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-07 17:58 UTC (permalink / raw)
To: James Morse
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hello James,
On 3/7/2017 4:48 AM, James Morse wrote:
> On 06/03/17 20:45, Tyler Baicar wrote:
>> Currently external aborts are unsupported by the guest abort
>> handling. Add handling for SEAs so that the host kernel reports
>> SEAs which occur in the guest kernel.
>> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
>> index e22089f..a1a3dff 100644
>> --- a/arch/arm/include/asm/kvm_arm.h
>> +++ b/arch/arm/include/asm/kvm_arm.h
>> @@ -187,6 +187,16 @@
>> #define FSC_FAULT (0x04)
>> #define FSC_ACCESS (0x08)
>> #define FSC_PERM (0x0c)
>> +#define FSC_SEA (0x10)
>> +#define FSC_SEA_TTW0 (0x14)
>> +#define FSC_SEA_TTW1 (0x15)
>> +#define FSC_SEA_TTW2 (0x16)
>> +#define FSC_SEA_TTW3 (0x17)
>> +#define FSC_SECC (0x18)
>> +#define FSC_SECC_TTW0 (0x1c)
> AArch32 doesn't have either of these 'TW0' values; it's an unused encoding.
> (However ...)
>
>> +#define FSC_SECC_TTW1 (0x1d)
>> +#define FSC_SECC_TTW2 (0x1e)
>> +#define FSC_SECC_TTW3 (0x1f)
>>
>> /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>> #define HPFAR_MASK (~0xf)
>> #endif /* __ASM_ARM_SYSTEM_MISC_H */
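The quoted hunks do not show the patch's is_abort_synchronous() helper. A minimal sketch of such a check, assuming it simply matches the synchronous external abort and parity/ECC encodings defined above; is_sea_fsc() is a hypothetical name. Per James's note, the two TTW0 encodings are unused on AArch32, so on that architecture those two cases are effectively dead.

```c
#include <stdbool.h>

/* FSC encodings copied from the quoted kvm_arm.h hunk. */
#define FSC_SEA        (0x10)
#define FSC_SEA_TTW0   (0x14)
#define FSC_SEA_TTW1   (0x15)
#define FSC_SEA_TTW2   (0x16)
#define FSC_SEA_TTW3   (0x17)
#define FSC_SECC       (0x18)
#define FSC_SECC_TTW0  (0x1c)
#define FSC_SECC_TTW1  (0x1d)
#define FSC_SECC_TTW2  (0x1e)
#define FSC_SECC_TTW3  (0x1f)

/*
 * Hypothetical helper: true if the fault status code reports a
 * synchronous external abort, either on the access itself or on a
 * translation table walk, with or without a parity/ECC error.
 */
static bool is_sea_fsc(unsigned long fault_status)
{
	switch (fault_status) {
	case FSC_SEA:
	case FSC_SEA_TTW0:
	case FSC_SEA_TTW1:
	case FSC_SEA_TTW2:
	case FSC_SEA_TTW3:
	case FSC_SECC:
	case FSC_SECC_TTW0:
	case FSC_SECC_TTW1:
	case FSC_SECC_TTW2:
	case FSC_SECC_TTW3:
		return true;
	default:
		return false;	/* translation, access, permission, ... */
	}
}
```

A switch keeps every accepted encoding visible next to the defines, which makes it easy to drop the TTW0 cases for an AArch32-only build if that is ever wanted.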
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index a5265ed..f3608c9 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -1444,8 +1463,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>
>> /* Check the stage-2 fault is trans. fault or write fault */
>> fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
>> - if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>> - fault_status != FSC_ACCESS) {
>> +
>> + /* The host kernel will handle the synchronous external abort. There
>> + * is no need to pass the error into the guest.
>> + */
>> + if (is_abort_synchronous(fault_status)) {
>> + if(handle_guest_sea((unsigned long)fault_ipa,
>> + kvm_vcpu_get_hsr(vcpu))) {
>
> ... Looking further up in this function:
>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>> if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
>> kvm_inject_vabt(vcpu);
>> return 1;
>> }
> ... so external data aborts will have already been 'claimed' by kvm and dealt
> with, and we already have a helper for spotting external aborts. (sorry I didn't
> spot it earlier).
>
> We need to do the handle_guest_sea() before this code.
>
> kvm_inject_vabt() makes an SError interrupt pending for the guest. This makes a
> synchronous error asynchronous as the guest may have SError interrupts masked.
>
> I guess this was the best that could be done at the time of (4055710baca8
> "arm/arm64: KVM: Inject virtual abort when guest exits on external abort"), but
> in the light of this firmware-first handling, I'm not sure it's the right thing
> to do.
>
> Is it possible for handle_guest_sea() to return whether it actually found any
> work to do? If there was none I think we should keep this kvm_inject_vabt() as
> it is the existing behaviour.
Yes, I'll move the handle_guest_sea() call above this. For some reason my
testing never entered that if statement; it reached the handle_guest_sea()
call successfully.
If there is no work for the GHES code to do, it will return, and the caller
could still make the kvm_inject_vabt() call. It will also return and do the
same thing if the error was not fatal in GHES. Would that be an issue?
>
>> + kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>> + kvm_vcpu_trap_get_class(vcpu),
>> + (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>> + (unsigned long)kvm_vcpu_get_hsr(vcpu));
>> + return -EFAULT;
>> + }
>> + } else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>> + fault_status != FSC_ACCESS) {
>> kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>> kvm_vcpu_trap_get_class(vcpu),
>> (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index b2d57fc..31c5171 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -602,6 +602,24 @@ static const char *fault_name(unsigned int esr)
>> }
>>
>> /*
>> + * Handle Synchronous External Aborts that occur in a guest kernel.
>> + */
>> +int handle_guest_sea(unsigned long addr, unsigned int esr)
>> +{
>> + /*
>> + * synchronize_rcu() will wait for nmi_exit(), so no need to
>> + * rcu_read_lock().
>> + */
> This comment has a life of its own! Given that we don't always call ghes_notify_sea()
> when we have interrupted uninterruptible code, it's not always true. I think the
> rcu_read_{,un}lock() should go around the list walk (so it looks like the
> examples), and ditch the comment!
:) Okay I will move the rcu calls into ghes_notify_sea() and remove the
comments.
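A minimal userspace sketch of the arrangement agreed here: the RCU read-side section moves into the notifier, around the list walk, and handle_guest_sea() loses both the lock calls and the comment. rcu_read_lock()/rcu_read_unlock() are replaced by hypothetical counting stubs, and struct ghes_src, ghes_sea_list and the *_sketch names are invented for illustration. APEI_SEA_ENABLED stands in for the Kconfig test; in kernel code that test would be spelled IS_ENABLED(CONFIG_ACPI_APEI_SEA), because IS_ENABLED() takes the full Kconfig symbol name, so the IS_ENABLED(ACPI_APEI_SEA) in the quoted patch always evaluates to false.

```c
#include <stddef.h>

/*
 * Userspace stand-ins (hypothetical): in the kernel these would be the
 * real rcu_read_lock()/rcu_read_unlock() and list_for_each_entry_rcu().
 */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

/* Hypothetical model of one SEA-notified GHES error source. */
struct ghes_src {
	int has_record;             /* firmware has populated a record */
	int handled;                /* records this source has processed */
	struct ghes_src *next;
};

static struct ghes_src *ghes_sea_list;  /* stand-in for the SEA source list */

/* The read-side critical section brackets the list walk itself. */
static void ghes_notify_sea_sketch(void)
{
	struct ghes_src *g;

	rcu_read_lock();
	for (g = ghes_sea_list; g != NULL; g = g->next) {
		if (g->has_record) {
			g->has_record = 0;
			g->handled++;
		}
	}
	rcu_read_unlock();
}

/* Stand-in for IS_ENABLED(CONFIG_ACPI_APEI_SEA); note the CONFIG_ prefix. */
#define APEI_SEA_ENABLED 1

/* The caller no longer touches RCU and needs no comment about it. */
static int handle_guest_sea_sketch(unsigned long addr, unsigned int esr)
{
	(void)addr;
	(void)esr;
	if (APEI_SEA_ENABLED)
		ghes_notify_sea_sketch();
	return 0;
}
```

Keeping the lock next to the walk means every caller of the notifier is covered, whichever context it runs in.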
Thanks,
Tyler
>
>> + if(IS_ENABLED(ACPI_APEI_SEA)) {
>> + rcu_read_lock();
>> + ghes_notify_sea();
>> + rcu_read_unlock();
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> * Dispatch a data abort to the relevant handler.
>> */
>> asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
>>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
* Re: [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support
2017-03-07 17:58 ` Baicar, Tyler
@ 2017-03-08 16:09 ` James Morse
2017-03-10 18:15 ` Baicar, Tyler
0 siblings, 1 reply; 30+ messages in thread
From: James Morse @ 2017-03-08 16:09 UTC (permalink / raw)
To: Baicar, Tyler
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
On 07/03/17 17:58, Baicar, Tyler wrote:
> On 3/7/2017 4:48 AM, James Morse wrote:
>> On 06/03/17 20:45, Tyler Baicar wrote:
>>> Currently external aborts are unsupported by the guest abort
>>> handling. Add handling for SEAs so that the host kernel reports
>>> SEAs which occur in the guest kernel.
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index a5265ed..f3608c9 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -1444,8 +1463,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>> /* Check the stage-2 fault is trans. fault or write fault */
>>> fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
>>> - if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>>> - fault_status != FSC_ACCESS) {
>>> +
>>> + /* The host kernel will handle the synchronous external abort. There
>>> + * is no need to pass the error into the guest.
>>> + */
>>> + if (is_abort_synchronous(fault_status)) {
>>> + if(handle_guest_sea((unsigned long)fault_ipa,
>>> + kvm_vcpu_get_hsr(vcpu))) {
>>
>> ... Looking further up in this function:
>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>> if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
>>> kvm_inject_vabt(vcpu);
>>> return 1;
>>> }
>> ... so external data aborts will have already been 'claimed' by kvm and dealt
>> with, and we already have a helper for spotting external aborts. (sorry I didn't
>> spot it earlier).
>>
>> We need to do the handle_guest_sea() before this code.
>>
>> kvm_inject_vabt() makes an SError interrupt pending for the guest. This makes a
>> synchronous error asynchronous as the guest may have SError interrupts masked.
>>
>> I guess this was the best that could be done at the time of (4055710baca8
>> "arm/arm64: KVM: Inject virtual abort when guest exits on external abort"), but
>> in the light of this firmware-first handling, I'm not sure it's the right thing
>> to do.
>>
>> Is it possible for handle_guest_sea() to return whether it actually found any
>> work to do? If there was none I think we should keep this kvm_inject_vabt() as
>> it is the existing behaviour.
> Yes, I'll move the handle_guest_sea() call above this. For some reason my
> testing never entered that if statement; it reached the handle_guest_sea()
> call successfully.
I guess you're not using data aborts for testing then.
> If there is no work for the GHES code to do, it will return, and the caller
> could still make the kvm_inject_vabt() call. It will also return and do the
> same thing if the error was not fatal in GHES. Would that be an issue?
We might inject a superfluous SError Interrupt in that case.
For memory errors we may do the whole unmapping and signalling thing to handle
the fault. For recoverable faults, QEMU can generate its own CPER records for
the guest and do the work to notify the guest. Everything looks fine until the
guest gets the extra SError interrupt.
If there is firmware-first RAS data, we should skip the injected SError
Interrupt, as the host's RAS code should choose what to do. If this Synchronous
External Abort had nothing to do with RAS, we should inject the SError interrupt,
as it's the existing behaviour and not all platforms will have this Synchronous
External Abort mechanism.
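The behaviour proposed here reduces to a small decision. A sketch, assuming handle_guest_sea() is changed to return 0 when GHES consumed a firmware-first record and -ENOENT otherwise; that return convention, the enum and classify_guest_abort() are all hypothetical. It models only the external data abort case discussed in this thread, with the RAS check made before the existing kvm_inject_vabt() path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes of the guest-abort path, for illustration. */
enum abort_action {
	ABORT_HANDLED_BY_HOST_RAS,  /* firmware-first record consumed by the host */
	ABORT_INJECT_VABT,          /* existing behaviour: pend a virtual SError */
	ABORT_CONTINUE,             /* not an external abort: normal stage-2 path */
};

/*
 * sea_ret stands for the (assumed) return value of handle_guest_sea():
 * 0 if GHES found and processed a record, -ENOENT if there was none.
 */
static enum abort_action classify_guest_abort(bool is_extabt, int sea_ret)
{
	if (!is_extabt)
		return ABORT_CONTINUE;
	if (sea_ret == 0)
		return ABORT_HANDLED_BY_HOST_RAS;  /* skip the superfluous SError */
	return ABORT_INJECT_VABT;               /* nothing RAS-related: old path */
}
```

Asking the RAS code first is what avoids the extra SError: the virtual abort is only injected when no firmware-first record accounts for the external abort.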
x86's APEI NMI needs to know if the NMI was due to RAS, so we can probably
borrow the same trick. ghes_notify_nmi() calls ghes_read_estatus() for each
struct ghes. If they all return an error, there was no work to do.
(I assume firmware will only generate one of these at a time, so there is no
risk of one list-walker processing two entries, then the second finding nothing
to do?)
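The ghes_notify_nmi() trick can be sketched in isolation: try every source, and report "no work" only when every read of the error status block fails. struct estatus_src, read_estatus_sketch() and notify_any_work() are hypothetical stand-ins for struct ghes and ghes_read_estatus(); an empty source is modelled here as a zero block_status.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical error status block: nonzero block_status means a record. */
struct estatus_src {
	unsigned int block_status;
};

/* Stand-in for ghes_read_estatus(): -ENOENT when the source is empty. */
static int read_estatus_sketch(const struct estatus_src *src)
{
	return src->block_status ? 0 : -ENOENT;
}

/*
 * Try every source; return 0 if at least one had a record,
 * -ENOENT if every read failed (there was no work to do).
 */
static int notify_any_work(const struct estatus_src *srcs, size_t n)
{
	int ret = -ENOENT;
	size_t i;

	for (i = 0; i < n; i++) {
		if (read_estatus_sketch(&srcs[i]))
			continue;       /* nothing here; try the next source */
		/* A real handler would log and process the record here. */
		ret = 0;
	}
	return ret;
}
```

The loop deliberately does not stop at the first hit, so one notification drains every source that has a record.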
Thanks,
James
>>> + kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>>> + kvm_vcpu_trap_get_class(vcpu),
>>> + (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>>> + (unsigned long)kvm_vcpu_get_hsr(vcpu));
>>> + return -EFAULT;
>>> + }
>>> + } else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>>> + fault_status != FSC_ACCESS) {
>>> kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>>> kvm_vcpu_trap_get_class(vcpu),
>>> (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>>
* Re: [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support
2017-03-08 16:09 ` James Morse
@ 2017-03-10 18:15 ` Baicar, Tyler
0 siblings, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-10 18:15 UTC (permalink / raw)
To: James Morse
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hello James,
On 3/8/2017 9:09 AM, James Morse wrote:
> On 07/03/17 17:58, Baicar, Tyler wrote:
>> On 3/7/2017 4:48 AM, James Morse wrote:
>>> On 06/03/17 20:45, Tyler Baicar wrote:
>>>> Currently external aborts are unsupported by the guest abort
>>>> handling. Add handling for SEAs so that the host kernel reports
>>>> SEAs which occur in the guest kernel.
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index a5265ed..f3608c9 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -1444,8 +1463,21 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>> /* Check the stage-2 fault is trans. fault or write fault */
>>>> fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
>>>> - if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>>>> - fault_status != FSC_ACCESS) {
>>>> +
>>>> + /* The host kernel will handle the synchronous external abort. There
>>>> + * is no need to pass the error into the guest.
>>>> + */
>>>> + if (is_abort_synchronous(fault_status)) {
>>>> + if(handle_guest_sea((unsigned long)fault_ipa,
>>>> + kvm_vcpu_get_hsr(vcpu))) {
>>> ... Looking further up in this function:
>>>> is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
>>>> if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
>>>> kvm_inject_vabt(vcpu);
>>>> return 1;
>>>> }
>>> ... so external data aborts will have already been 'claimed' by kvm and dealt
>>> with, and we already have a helper for spotting external aborts. (sorry I didn't
>>> spot it earlier).
>>>
>>> We need to do the handle_guest_sea() before this code.
>>>
>>> kvm_inject_vabt() makes an SError interrupt pending for the guest. This makes a
>>> synchronous error asynchronous as the guest may have SError interrupts masked.
>>>
>>> I guess this was the best that could be done at the time of (4055710baca8
>>> "arm/arm64: KVM: Inject virtual abort when guest exits on external abort"), but
>>> in the light of this firmware-first handling, I'm not sure it's the right thing
>>> to do.
>>>
>>> Is it possible for handle_guest_sea() to return whether it actually found any
>>> work to do? If there was none I think we should keep this kvm_inject_vabt() as
>>> it is the existing behaviour.
>> Yes, I'll move the handle_guest_sea() call above this. For some reason my
>> testing never entered that if statement; it reached the handle_guest_sea()
>> call successfully.
> I guess you're not using data aborts for testing then.
>
>
>> If there is no work for the GHES code to do, it will return, and the caller
>> could still make the kvm_inject_vabt() call. It will also return and do the
>> same thing if the error was not fatal in GHES. Would that be an issue?
> We might inject a superfluous SError Interrupt in that case.
>
> For memory errors we may do the whole unmapping and signalling thing to handle
> the fault. For recoverable faults, QEMU can generate its own CPER records for
> the guest and do the work to notify the guest. Everything looks fine until the
> guest gets the extra SError interrupt.
>
> If there is firmware-first RAS data, we should skip the injected SError
> Interrupt, as the host's RAS code should choose what to do. If this Synchronous
> External Abort had nothing to do with RAS, we should inject the SError interrupt,
> as it's the existing behaviour and not all platforms will have this Synchronous
> External Abort mechanism.
>
> x86's APEI NMI needs to know if the NMI was due to RAS, so we can probably
> borrow the same trick. ghes_notify_nmi() calls ghes_read_estatus() for each
> struct ghes. If they all return an error, there was no work to do.
>
>
> (I assume firmware will only generate one of these at a time, so there is no
> risk of one list-walker processing two entries, then the second finding nothing
> to do?)
>
Okay, I will add this functionality to avoid injecting an SEI when a
GHES error has actually been populated.
Firmware that supports the new specs should only generate one of these
at a time; it waits for the ack from the kernel before sending a second
error (see patch 1 of this series).
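The one-at-a-time guarantee rests on the GHESv2 read-ack handshake from patch 1: firmware does not write a new record until the OS acknowledges the current one. A sketch of that acknowledgment, assuming the ACPI 6.1 semantics of the Read Ack Preserve and Read Ack Write masks (keep the preserved bits, then set the write bits). The register is simulated by a plain variable, the bit offset from its Generic Address Structure is ignored, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical model of the GHESv2 acknowledgment fields: the Read Ack
 * Register (simulated by a pointer to memory) plus its two masks.
 */
struct ghes_v2_ack {
	uint64_t *read_ack_register;
	uint64_t read_ack_preserve;   /* bits to keep when writing */
	uint64_t read_ack_write;      /* bits to set when writing */
};

/* Acknowledge consumption so firmware may write the next record. */
static void ghes_v2_ack_sketch(const struct ghes_v2_ack *ack)
{
	uint64_t val = *ack->read_ack_register;   /* read current value */

	val &= ack->read_ack_preserve;            /* keep preserved bits */
	val |= ack->read_ack_write;               /* set the ack bits */
	*ack->read_ack_register = val;            /* write it back */
}
```

For example, with the register at 0xF0, a preserve mask of 0x30 and a write mask of 0x01, the register ends up as 0x31.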
Thanks,
Tyler
>
>
>>>> + kvm_err("Failed to handle guest SEA, FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>>>> + kvm_vcpu_trap_get_class(vcpu),
>>>> + (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
>>>> + (unsigned long)kvm_vcpu_get_hsr(vcpu));
>>>> + return -EFAULT;
>>>> + }
>>>> + } else if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
>>>> + fault_status != FSC_ACCESS) {
>>>> kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>>>> kvm_vcpu_trap_get_class(vcpu),
>>>> (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
* Re: [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
2017-03-06 20:44 [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 Tyler Baicar
` (9 preceding siblings ...)
2017-03-06 20:45 ` [PATCH V12 10/10] arm/arm64: KVM: add guest SEA support Tyler Baicar
@ 2017-03-07 11:37 ` James Morse
2017-03-07 16:37 ` Baicar, Tyler
10 siblings, 1 reply; 30+ messages in thread
From: James Morse @ 2017-03-07 11:37 UTC (permalink / raw)
To: Tyler Baicar
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
Hi Tyler,
On 06/03/17 20:44, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware-first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is uncorrectable, the firmware will notify the OS immediately, since
> the error needs to be handled as soon as possible. The OS can then take the
> appropriate action, such as offlining a page. If the error is corrected,
> the firmware will not interrupt the OS immediately. Instead, the OS will see
> and report the error the next time its GHES timer expires. The kernel first
> parses the GHES structures and reports the errors through the kernel log,
> then notifies user space through RAS trace events. This allows user space
> applications such as RAS Daemon to see the errors and report them however
> the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
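The corrected-versus-uncorrected split described in the cover letter amounts to a per-severity delivery policy on the firmware side. A sketch, assuming the CPER severity encodings from the UEFI specification; the delivery enum and choose_delivery() are hypothetical, and grouping informational records with corrected ones is an assumption the cover letter does not state.

```c
/* CPER error severities (UEFI specification encodings). */
enum cper_sev {
	SEV_RECOVERABLE   = 0,   /* uncorrected, non-fatal */
	SEV_FATAL         = 1,
	SEV_CORRECTED     = 2,
	SEV_INFORMATIONAL = 3,
};

/* Hypothetical delivery choices for the example platform policy. */
enum delivery {
	NOTIFY_NOW,       /* interrupt the OS immediately */
	LEAVE_FOR_POLL,   /* OS finds it when its GHES poll timer expires */
};

static enum delivery choose_delivery(enum cper_sev sev)
{
	switch (sev) {
	case SEV_CORRECTED:
	case SEV_INFORMATIONAL:   /* assumption: treated like corrected */
		return LEAVE_FOR_POLL;
	default:
		return NOTIFY_NOW;    /* uncorrected errors must be handled ASAP */
	}
}
```

Polling for corrected errors keeps their cost off the fast path, while anything the OS may need to act on (such as offlining a page) is delivered straight away.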
This series doesn't apply cleanly to v4.11-rc1; what did you base it on?
Please base this on a v4.11 release candidate if you want it considered for v4.12.
Thanks,
James
* Re: [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
2017-03-07 11:37 ` [PATCH V12 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64 James Morse
@ 2017-03-07 16:37 ` Baicar, Tyler
0 siblings, 0 replies; 30+ messages in thread
From: Baicar, Tyler @ 2017-03-07 16:37 UTC (permalink / raw)
To: James Morse
Cc: christoffer.dall, marc.zyngier, pbonzini, rkrcmar, linux,
catalin.marinas, will.deacon, rjw, lenb, matt, robert.moore,
lv.zheng, nkaje, zjzhang, mark.rutland, akpm, eun.taik.lee,
sandeepa.s.prabhu, labbott, shijie.huang, rruigrok,
paul.gortmaker, tn, fu.wei, rostedt, bristot, linux-arm-kernel,
kvmarm, kvm, linux-kernel, linux-acpi, linux-efi, devel,
Suzuki.Poulose, punit.agrawal, astone, harba, hanjun.guo,
john.garry, shiju.jose, joe
On 3/7/2017 4:37 AM, James Morse wrote:
> Hi Tyler,
>
> On 06/03/17 20:44, Tyler Baicar wrote:
>> When a memory error, CPU error, PCIe error, or other type of hardware error
>> that's covered by RAS occurs, firmware should populate the shared GHES memory
>> location with the proper GHES structures to notify the OS of the error.
>> For example, platforms that implement firmware-first handling may implement
>> separate GHES sources for corrected errors and uncorrected errors. If the
>> error is uncorrectable, the firmware will notify the OS immediately, since
>> the error needs to be handled as soon as possible. The OS can then take the
>> appropriate action, such as offlining a page. If the error is corrected,
>> the firmware will not interrupt the OS immediately. Instead, the OS will see
>> and report the error the next time its GHES timer expires. The kernel first
>> parses the GHES structures and reports the errors through the kernel log,
>> then notifies user space through RAS trace events. This allows user space
>> applications such as RAS Daemon to see the errors and report them however
>> the user desires. This patchset extends the
>> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
>> ACPI 6.1 specifications.
> This series doesn't apply cleanly to v4.11-rc1; what did you base it on?
>
> Please base this on a v4.11 release candidate if you want it considered for v4.12.
>
This was based on 4.10; I will base it on 4.11-rc1 for the next patch set.
Thanks,
Tyler