* [PATCH 0/2] EDAC, skx: Provide more machine specific location detail @ 2019-09-13 22:13 Tony Luck 2019-09-13 22:13 ` [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode Tony Luck ` (3 more replies) 0 siblings, 4 replies; 11+ messages in thread From: Tony Luck @ 2019-09-13 22:13 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Qiuxu Zhuo, Aristeu Rozanski, Mauro Carvalho Chehab, linux-edac First patch refactors code so that second can work on systems with and without the ACPI ADXL address translation code. Perhaps has some value on its own as the code is, IMHO, a little cleaner. Second is in RFC state. I'm looking for input on whether to just print the extra information to the console log (as the patch does now) or whether to tag it onto the long string that we push through the EDAC reporting path. Tony Luck (2): EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode. EDAC, skx: Retrieve and print retry_rd_err_log registers drivers/edac/skx_base.c | 38 +++++++++++++++++++++++++-- drivers/edac/skx_common.c | 55 +++++++++++++++++++++------------------ drivers/edac/skx_common.h | 4 ++- 3 files changed, 68 insertions(+), 29 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode. 2019-09-13 22:13 [PATCH 0/2] EDAC, skx: Provide more machine specific location detail Tony Luck @ 2019-09-13 22:13 ` Tony Luck 2019-09-18 10:40 ` Mauro Carvalho Chehab 2019-09-13 22:13 ` [RFC PATCH 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers Tony Luck ` (2 subsequent siblings) 3 siblings, 1 reply; 11+ messages in thread From: Tony Luck @ 2019-09-13 22:13 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Qiuxu Zhuo, Aristeu Rozanski, Mauro Carvalho Chehab, linux-edac Simplifies the code a little. Signed-off-by: Tony Luck <tony.luck@intel.com> --- drivers/edac/skx_common.c | 48 +++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index d8ff63d91b86..58b8348d0f71 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -100,6 +100,7 @@ void __exit skx_adxl_put(void) static bool skx_adxl_decode(struct decoded_addr *res) { + struct skx_dev *d; int i, len = 0; if (res->addr >= skx_tohm || (res->addr >= skx_tolm && @@ -118,6 +119,24 @@ static bool skx_adxl_decode(struct decoded_addr *res) res->channel = (int)adxl_values[component_indices[INDEX_CHANNEL]]; res->dimm = (int)adxl_values[component_indices[INDEX_DIMM]]; + if (res->imc > NUM_IMC - 1) { + skx_printk(KERN_ERR, "Bad imc %d\n", res->imc); + return false; + } + + list_for_each_entry(d, &dev_edac_list, list) { + if (d->imc[0].src_id == res->socket) { + res->dev = d; + break; + } + } + + if (!res->dev) { + skx_printk(KERN_ERR, "No device for src_id %d imc %d\n", + res->socket, res->imc); + return false; + } + for (i = 0; i < adxl_component_count; i++) { if (adxl_values[i] == ~0x0ull) continue; @@ -452,24 +471,6 @@ static void skx_unregister_mci(struct skx_imc *imc) edac_mc_free(mci); } -static struct mem_ctl_info *get_mci(int src_id, int lmc) -{ - struct skx_dev *d; - - if (lmc > 
NUM_IMC - 1) { - skx_printk(KERN_ERR, "Bad lmc %d\n", lmc); - return NULL; - } - - list_for_each_entry(d, &dev_edac_list, list) { - if (d->imc[0].src_id == src_id) - return d->imc[lmc].mci; - } - - skx_printk(KERN_ERR, "No mci for src_id %d lmc %d\n", src_id, lmc); - return NULL; -} - static void skx_mce_output_error(struct mem_ctl_info *mci, const struct mce *m, struct decoded_addr *res) @@ -583,15 +584,12 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, if (adxl_component_count) { if (!skx_adxl_decode(&res)) return NOTIFY_DONE; - - mci = get_mci(res.socket, res.imc); - } else { - if (!skx_decode || !skx_decode(&res)) - return NOTIFY_DONE; - - mci = res.dev->imc[res.imc].mci; + } else if (!skx_decode || !skx_decode(&res)) { + return NOTIFY_DONE; } + mci = res.dev->imc[res.imc].mci; + if (!mci) return NOTIFY_DONE; -- 2.20.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode. 2019-09-13 22:13 ` [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode Tony Luck @ 2019-09-18 10:40 ` Mauro Carvalho Chehab 2019-09-23 23:36 ` Luck, Tony 0 siblings, 1 reply; 11+ messages in thread From: Mauro Carvalho Chehab @ 2019-09-18 10:40 UTC (permalink / raw) To: Tony Luck; +Cc: Borislav Petkov, Qiuxu Zhuo, Aristeu Rozanski, linux-edac On Fri, 13 Sep 2019 15:13:43 -0700 Tony Luck <tony.luck@intel.com> wrote: > Simplifies the code a little. > > Signed-off-by: Tony Luck <tony.luck@intel.com> Patch itself looks good... > --- > drivers/edac/skx_common.c | 48 +++++++++++++++++++-------------------- > 1 file changed, 23 insertions(+), 25 deletions(-) > > diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c > index d8ff63d91b86..58b8348d0f71 100644 > --- a/drivers/edac/skx_common.c > +++ b/drivers/edac/skx_common.c > @@ -100,6 +100,7 @@ void __exit skx_adxl_put(void) > > static bool skx_adxl_decode(struct decoded_addr *res) > { > + struct skx_dev *d; > int i, len = 0; > > if (res->addr >= skx_tohm || (res->addr >= skx_tolm && > @@ -118,6 +119,24 @@ static bool skx_adxl_decode(struct decoded_addr *res) > res->channel = (int)adxl_values[component_indices[INDEX_CHANNEL]]; > res->dimm = (int)adxl_values[component_indices[INDEX_DIMM]]; > > + if (res->imc > NUM_IMC - 1) { > + skx_printk(KERN_ERR, "Bad imc %d\n", res->imc); I would report this via EDAC as well. > + return false; > + } > + > + list_for_each_entry(d, &dev_edac_list, list) { > + if (d->imc[0].src_id == res->socket) { > + res->dev = d; > + break; > + } > + } > + > + if (!res->dev) { > + skx_printk(KERN_ERR, "No device for src_id %d imc %d\n", > + res->socket, res->imc); I would report this via EDAC as well. 
> + return false; > + } > + > for (i = 0; i < adxl_component_count; i++) { > if (adxl_values[i] == ~0x0ull) > continue; > @@ -452,24 +471,6 @@ static void skx_unregister_mci(struct skx_imc *imc) > edac_mc_free(mci); > } > > -static struct mem_ctl_info *get_mci(int src_id, int lmc) > -{ > - struct skx_dev *d; > - > - if (lmc > NUM_IMC - 1) { > - skx_printk(KERN_ERR, "Bad lmc %d\n", lmc); > - return NULL; > - } > - > - list_for_each_entry(d, &dev_edac_list, list) { > - if (d->imc[0].src_id == src_id) > - return d->imc[lmc].mci; > - } > - > - skx_printk(KERN_ERR, "No mci for src_id %d lmc %d\n", src_id, lmc); > - return NULL; > -} > - > static void skx_mce_output_error(struct mem_ctl_info *mci, > const struct mce *m, > struct decoded_addr *res) > @@ -583,15 +584,12 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, > if (adxl_component_count) { > if (!skx_adxl_decode(&res)) > return NOTIFY_DONE; > - > - mci = get_mci(res.socket, res.imc); > - } else { > - if (!skx_decode || !skx_decode(&res)) > - return NOTIFY_DONE; > - > - mci = res.dev->imc[res.imc].mci; > + } else if (!skx_decode || !skx_decode(&res)) { > + return NOTIFY_DONE; > } > > + mci = res.dev->imc[res.imc].mci; > + > if (!mci) > return NOTIFY_DONE; > Thanks, Mauro ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode. 2019-09-18 10:40 ` Mauro Carvalho Chehab @ 2019-09-23 23:36 ` Luck, Tony 0 siblings, 0 replies; 11+ messages in thread From: Luck, Tony @ 2019-09-23 23:36 UTC (permalink / raw) To: Mauro Carvalho Chehab Cc: Borislav Petkov, Zhuo, Qiuxu, Aristeu Rozanski, linux-edac >> + if (res->imc > NUM_IMC - 1) { >> + skx_printk(KERN_ERR, "Bad imc %d\n", res->imc); > > I would report this via EDAC as well. It would be nice, but I don't see how. This function is trying to figure out which memory controller (and thus which EDAC struct mem_ctl_info) is connected to this error. If it fails, then we don't know where to report it. On the plus side this error (and the other one you flagged) "can't happen"(TM) so we shouldn't expend too much effort to solve this. Code must give up here rather than trigger out of bounds array accesses later. If we did want to solve this, we could invent a mechanism for EDAC drivers to log errors not related to a particular memory controller (by passing NULL to edac_mc_handle_error()???). -Tony ^ permalink raw reply [flat|nested] 11+ messages in thread
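The point made above, that the decode step must give up rather than let a bad index reach an array access later, can be illustrated with a small user-space sketch. This is not the driver code: `struct dev`, `struct imc`, `struct mci`, `find_dev()` and the value chosen for `NUM_IMC` are stand-ins invented for illustration, and a plain singly linked list replaces the kernel list helpers.

```c
#include <stddef.h>

#define NUM_IMC 2	/* assumed value, for illustration only */

struct mci { int id; };			/* stand-in for struct mem_ctl_info */
struct imc { int src_id; struct mci *mci; };
struct dev {
	struct imc imc[NUM_IMC];
	struct dev *next;		/* stand-in for the kernel list linkage */
};

/*
 * Find the device that owns a decoded (socket, imc) pair.
 * Returns NULL when the imc index is out of range or when no device
 * matches the socket, so a caller can stop before indexing d->imc[].
 */
static struct dev *find_dev(struct dev *head, int socket, int imc)
{
	struct dev *d;

	if (imc < 0 || imc > NUM_IMC - 1)
		return NULL;

	for (d = head; d; d = d->next)
		if (d->imc[0].src_id == socket)
			return d;

	return NULL;
}
```

A caller would only evaluate `d->imc[imc].mci` after `find_dev()` returned a non-NULL pointer, which mirrors the single `mci = res.dev->imc[res.imc].mci` lookup left in skx_mce_check_error() by the patch.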
* [RFC PATCH 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers 2019-09-13 22:13 [PATCH 0/2] EDAC, skx: Provide more machine specific location detail Tony Luck 2019-09-13 22:13 ` [PATCH 1/2] EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode Tony Luck @ 2019-09-13 22:13 ` Tony Luck 2019-09-18 10:52 ` Mauro Carvalho Chehab 2019-09-17 20:05 ` [PATCH 0/2] EDAC, skx: Provide more machine specific location detail Aristeu Rozanski 2019-09-25 13:51 ` Aristeu Rozanski 3 siblings, 1 reply; 11+ messages in thread From: Tony Luck @ 2019-09-13 22:13 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Qiuxu Zhuo, Aristeu Rozanski, Mauro Carvalho Chehab, linux-edac Skylake logs some additional useful information in per-channel registers in addition to the architectural status/addr/misc logged in the machine check bank. Pick up this information and print it. retry_rd_err_log[five 32-bit register values] correrrcnt[four hex values] Note that if additional errors are logged while these registers are being read, you may see a jumble of values, some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank (two 16-bit values in each register). If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. 
Signed-off-by: Tony Luck <tony.luck@intel.com> --- drivers/edac/skx_base.c | 38 ++++++++++++++++++++++++++++++++++++-- drivers/edac/skx_common.c | 7 ++++++- drivers/edac/skx_common.h | 4 +++- 3 files changed, 45 insertions(+), 4 deletions(-) diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index 0fcf3785e8f3..e0c0366fdc84 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -46,7 +46,8 @@ static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) } enum munittype { - CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD + CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD, + ERRCHAN0, ERRCHAN1, ERRCHAN2, }; struct munit { @@ -68,6 +69,9 @@ static const struct munit skx_all_munits[] = { { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, + { 0x2043, { PCI_DEVFN(10, 3), PCI_DEVFN(12, 3) }, 2, 2, ERRCHAN0 }, + { 0x2047, { PCI_DEVFN(10, 7), PCI_DEVFN(12, 7) }, 2, 2, ERRCHAN1 }, + { 0x204b, { PCI_DEVFN(11, 3), PCI_DEVFN(13, 3) }, 2, 2, ERRCHAN2 }, { 0x208e, { }, 1, 0, SAD }, { } }; @@ -108,6 +112,10 @@ static int get_all_munits(const struct munit *m) pci_dev_get(pdev); d->imc[i].chan[m->mtype].cdev = pdev; break; + case ERRCHAN0: case ERRCHAN1: case ERRCHAN2: + pci_dev_get(pdev); + d->imc[i].chan[m->mtype - ERRCHAN0].edev = pdev; + break; case SAD_ALL: pci_dev_get(pdev); d->sad_all = pdev; @@ -216,6 +224,32 @@ static int skx_get_dimm_config(struct mem_ctl_info *mci) #define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) #define SKX_ILV_TARGET(tgt) ((tgt) & 7) +static void skx_show_retry_rd_err_log(struct decoded_addr *res) +{ + u32 log0, log1, log2, log3, log4; + u32 corr0, corr1, corr2, corr3; + struct pci_dev *edev; + + edev = res->dev->imc[res->imc].chan[res->channel].edev; + + pci_read_config_dword(edev, 0x154, &log0); + pci_read_config_dword(edev, 0x148, &log1); + pci_read_config_dword(edev, 0x150, &log2); + 
pci_read_config_dword(edev, 0x15c, &log3); + pci_read_config_dword(edev, 0x114, &log4); + + dev_err(&edev->dev, "retry_rd_err_log[%.8x %.8x %.8x %.8x %.8x]\n", + log0, log1, log2, log3, log4); + + pci_read_config_dword(edev, 0x104, &corr0); + pci_read_config_dword(edev, 0x108, &corr1); + pci_read_config_dword(edev, 0x10c, &corr2); + pci_read_config_dword(edev, 0x110, &corr3); + + dev_err(&edev->dev, "correrrcnt[%.8x %.8x %.8x %.8x]\n", + corr0, corr1, corr2, corr3); +} + static bool skx_sad_decode(struct decoded_addr *res) { struct skx_dev *d = list_first_entry(skx_edac_list, typeof(*d), list); @@ -659,7 +693,7 @@ static int __init skx_init(void) } } - skx_set_decode(skx_decode); + skx_set_decode(skx_decode, skx_show_retry_rd_err_log); if (nvdimm_count && skx_adxl_get() == -ENODEV) skx_printk(KERN_NOTICE, "Only decoding DDR4 address!\n"); diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 58b8348d0f71..982154a899ce 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -37,6 +37,7 @@ static char *adxl_msg; static char skx_msg[MSG_SIZE]; static skx_decode_f skx_decode; +static skx_show_retry_log_f skx_show_retry_rd_err_log; static u64 skx_tolm, skx_tohm; static LIST_HEAD(dev_edac_list); @@ -150,9 +151,10 @@ static bool skx_adxl_decode(struct decoded_addr *res) return true; } -void skx_set_decode(skx_decode_f decode) +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log) { skx_decode = decode; + skx_show_retry_rd_err_log = show_retry_log; } int skx_get_src_id(struct skx_dev *d, int off, u8 *id) @@ -611,6 +613,9 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, "%u APIC 0x%x\n", mce->cpuvendor, mce->cpuid, mce->time, mce->socketid, mce->apicid); + if (skx_show_retry_rd_err_log) + skx_show_retry_rd_err_log(&res); + skx_mce_output_error(mci, mce, &res); return NOTIFY_DONE; diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index 08cc971a50ea..25209321ea0d 100644 --- 
a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -64,6 +64,7 @@ struct skx_dev { u8 src_id, node_id; struct skx_channel { struct pci_dev *cdev; + struct pci_dev *edev; struct skx_dimm { u8 close_pg; u8 bank_xor_enable; @@ -113,10 +114,11 @@ struct decoded_addr { typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci); typedef bool (*skx_decode_f)(struct decoded_addr *res); +typedef void (*skx_show_retry_log_f)(struct decoded_addr *res); int __init skx_adxl_get(void); void __exit skx_adxl_put(void); -void skx_set_decode(skx_decode_f decode); +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log); int skx_get_src_id(struct skx_dev *d, int off, u8 *id); int skx_get_node_id(struct skx_dev *d, u8 *id); -- 2.20.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
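The commit message of this patch suggests a consistency check: if the correrrcnt counts for a channel moved by exactly one since the previous logged error, the retry_rd_err_log values can be taken as describing a single error. A rough user-space sketch of that check follows. The names `new_errors()` and `snapshot_is_coherent()` are hypothetical, and the code assumes each 16-bit half is a plain counter that wraps modulo 2^16, which the thread does not confirm.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORR_REGS 4		/* four correrrcnt registers per channel */

/*
 * Count how many corrected errors were added between two snapshots of
 * the correrrcnt registers. Each register is treated as two 16-bit
 * per-rank counters; wraparound is handled by 16-bit subtraction
 * (an assumption, see the text above).
 */
static unsigned int new_errors(const uint32_t prev[NUM_CORR_REGS],
			       const uint32_t cur[NUM_CORR_REGS])
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < NUM_CORR_REGS; i++) {
		uint16_t lo = (uint16_t)((cur[i] & 0xffff) - (prev[i] & 0xffff));
		uint16_t hi = (uint16_t)((cur[i] >> 16) - (prev[i] >> 16));

		total += lo;
		total += hi;
	}

	return total;
}

/* True when exactly one error was counted since the previous snapshot. */
static bool snapshot_is_coherent(const uint32_t prev[NUM_CORR_REGS],
				 const uint32_t cur[NUM_CORR_REGS])
{
	return new_errors(prev, cur) == 1;
}
```

A consumer of the log would keep the previous snapshot per channel and treat the retry_rd_err_log values with suspicion whenever `snapshot_is_coherent()` returned false.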
* Re: [RFC PATCH 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers 2019-09-13 22:13 ` [RFC PATCH 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers Tony Luck @ 2019-09-18 10:52 ` Mauro Carvalho Chehab 2019-09-23 23:57 ` Luck, Tony 2019-09-24 21:52 ` [PATCHv2 " Tony Luck 0 siblings, 2 replies; 11+ messages in thread From: Mauro Carvalho Chehab @ 2019-09-18 10:52 UTC (permalink / raw) To: Tony Luck; +Cc: Borislav Petkov, Qiuxu Zhuo, Aristeu Rozanski, linux-edac On Fri, 13 Sep 2019 15:13:44 -0700 Tony Luck <tony.luck@intel.com> wrote: > Skylake logs some additional useful information in per-channel > registers in addition to the architectural status/addr/misc > logged in the machine check bank. > > Pick up this information and print it. > retry_rd_err_log[five 32-bit register values] > correrrcnt[four hex values] > > Note that if additional errors are logged while these registers > are being read, you may see a jumble of values, some from earlier > errors, others from later errors (since the registers report the > most recent logged error). The correrrcnt registers provide error > counts per possible rank (two 16-bit values in each register). If > these counts only change by one since the previous error logged > for this channel, then it is safe to assume that the registers > logged provide a coherent view of one error. 
> > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > drivers/edac/skx_base.c | 38 ++++++++++++++++++++++++++++++++++++-- > drivers/edac/skx_common.c | 7 ++++++- > drivers/edac/skx_common.h | 4 +++- > 3 files changed, 45 insertions(+), 4 deletions(-) > > diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c > index 0fcf3785e8f3..e0c0366fdc84 100644 > --- a/drivers/edac/skx_base.c > +++ b/drivers/edac/skx_base.c > @@ -46,7 +46,8 @@ static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) > } > > enum munittype { > - CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD > + CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD, > + ERRCHAN0, ERRCHAN1, ERRCHAN2, > }; > > struct munit { > @@ -68,6 +69,9 @@ static const struct munit skx_all_munits[] = { > { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, > { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, > { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, > + { 0x2043, { PCI_DEVFN(10, 3), PCI_DEVFN(12, 3) }, 2, 2, ERRCHAN0 }, > + { 0x2047, { PCI_DEVFN(10, 7), PCI_DEVFN(12, 7) }, 2, 2, ERRCHAN1 }, > + { 0x204b, { PCI_DEVFN(11, 3), PCI_DEVFN(13, 3) }, 2, 2, ERRCHAN2 }, > { 0x208e, { }, 1, 0, SAD }, > { } > }; > @@ -108,6 +112,10 @@ static int get_all_munits(const struct munit *m) > pci_dev_get(pdev); > d->imc[i].chan[m->mtype].cdev = pdev; > break; > + case ERRCHAN0: case ERRCHAN1: case ERRCHAN2: I would place each case on a separate line, in order to make easier to read it, and to follow the Kernel coding style. 
> + pci_dev_get(pdev); > + d->imc[i].chan[m->mtype - ERRCHAN0].edev = pdev; > + break; > case SAD_ALL: > pci_dev_get(pdev); > d->sad_all = pdev; > @@ -216,6 +224,32 @@ static int skx_get_dimm_config(struct mem_ctl_info *mci) > #define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) > #define SKX_ILV_TARGET(tgt) ((tgt) & 7) > > +static void skx_show_retry_rd_err_log(struct decoded_addr *res) > +{ > + u32 log0, log1, log2, log3, log4; > + u32 corr0, corr1, corr2, corr3; > + struct pci_dev *edev; > + > + edev = res->dev->imc[res->imc].chan[res->channel].edev; > + > + pci_read_config_dword(edev, 0x154, &log0); > + pci_read_config_dword(edev, 0x148, &log1); > + pci_read_config_dword(edev, 0x150, &log2); > + pci_read_config_dword(edev, 0x15c, &log3); > + pci_read_config_dword(edev, 0x114, &log4); > + > + dev_err(&edev->dev, "retry_rd_err_log[%.8x %.8x %.8x %.8x %.8x]\n", > + log0, log1, log2, log3, log4); > + > + pci_read_config_dword(edev, 0x104, &corr0); > + pci_read_config_dword(edev, 0x108, &corr1); > + pci_read_config_dword(edev, 0x10c, &corr2); > + pci_read_config_dword(edev, 0x110, &corr3); > + > + dev_err(&edev->dev, "correrrcnt[%.8x %.8x %.8x %.8x]\n", > + corr0, corr1, corr2, corr3); I would report both dev_err above via EDAC. Btw, can't those be output on a way that wouldn't require someone to look at the datasheet for the meaning of those registers? "retry_rd_err_log" and "correrrcnt" sounds too obscure for me to understand what they mean without reading the entire driver's code and read the datasheets. 
> +} > + > static bool skx_sad_decode(struct decoded_addr *res) > { > struct skx_dev *d = list_first_entry(skx_edac_list, typeof(*d), list); > @@ -659,7 +693,7 @@ static int __init skx_init(void) > } > } > > - skx_set_decode(skx_decode); > + skx_set_decode(skx_decode, skx_show_retry_rd_err_log); > > if (nvdimm_count && skx_adxl_get() == -ENODEV) > skx_printk(KERN_NOTICE, "Only decoding DDR4 address!\n"); > diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c > index 58b8348d0f71..982154a899ce 100644 > --- a/drivers/edac/skx_common.c > +++ b/drivers/edac/skx_common.c > @@ -37,6 +37,7 @@ static char *adxl_msg; > > static char skx_msg[MSG_SIZE]; > static skx_decode_f skx_decode; > +static skx_show_retry_log_f skx_show_retry_rd_err_log; > static u64 skx_tolm, skx_tohm; > static LIST_HEAD(dev_edac_list); > > @@ -150,9 +151,10 @@ static bool skx_adxl_decode(struct decoded_addr *res) > return true; > } > > -void skx_set_decode(skx_decode_f decode) > +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log) > { > skx_decode = decode; > + skx_show_retry_rd_err_log = show_retry_log; > } > > int skx_get_src_id(struct skx_dev *d, int off, u8 *id) > @@ -611,6 +613,9 @@ int skx_mce_check_error(struct notifier_block *nb, unsigned long val, > "%u APIC 0x%x\n", mce->cpuvendor, mce->cpuid, > mce->time, mce->socketid, mce->apicid); > > + if (skx_show_retry_rd_err_log) > + skx_show_retry_rd_err_log(&res); > + > skx_mce_output_error(mci, mce, &res); > > return NOTIFY_DONE; > diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h > index 08cc971a50ea..25209321ea0d 100644 > --- a/drivers/edac/skx_common.h > +++ b/drivers/edac/skx_common.h > @@ -64,6 +64,7 @@ struct skx_dev { > u8 src_id, node_id; > struct skx_channel { > struct pci_dev *cdev; > + struct pci_dev *edev; > struct skx_dimm { > u8 close_pg; > u8 bank_xor_enable; > @@ -113,10 +114,11 @@ struct decoded_addr { > > typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci); > typedef 
bool (*skx_decode_f)(struct decoded_addr *res); > +typedef void (*skx_show_retry_log_f)(struct decoded_addr *res); > > int __init skx_adxl_get(void); > void __exit skx_adxl_put(void); > -void skx_set_decode(skx_decode_f decode); > +void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log); > > int skx_get_src_id(struct skx_dev *d, int off, u8 *id); > int skx_get_node_id(struct skx_dev *d, u8 *id); Thanks, Mauro ^ permalink raw reply [flat|nested] 11+ messages in thread
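One way to make the correrrcnt output easier to read without a datasheet, along the lines requested above, is to print one value per rank instead of four packed registers. The commit message states that each register holds two 16-bit counts, so a minimal sketch of the unpacking could look like this. `split_rank_counts()` is a hypothetical helper, and the assumption that the low half of register i belongs to the lower-numbered rank is not confirmed anywhere in the thread.

```c
#include <stdint.h>

#define NUM_CORR_REGS	4			/* correrrcnt registers per channel */
#define NUM_RANKS	(2 * NUM_CORR_REGS)	/* two 16-bit counts per register */

/*
 * Unpack four 32-bit correrrcnt values into eight per-rank counts.
 * ranks[2 * i] takes the low 16 bits of regs[i] and ranks[2 * i + 1]
 * the high 16 bits (assumed ordering).
 */
static void split_rank_counts(const uint32_t regs[NUM_CORR_REGS],
			      uint16_t ranks[NUM_RANKS])
{
	int i;

	for (i = 0; i < NUM_CORR_REGS; i++) {
		ranks[2 * i] = (uint16_t)(regs[i] & 0xffff);
		ranks[2 * i + 1] = (uint16_t)(regs[i] >> 16);
	}
}
```

The eight values can then be printed with a `%.4x` format each, which keeps the line short while showing directly which rank counted an error.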
* Re: [RFC PATCH 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers 2019-09-18 10:52 ` Mauro Carvalho Chehab @ 2019-09-23 23:57 ` Luck, Tony 2019-09-24 21:52 ` [PATCHv2 " Tony Luck 1 sibling, 0 replies; 11+ messages in thread From: Luck, Tony @ 2019-09-23 23:57 UTC (permalink / raw) To: Mauro Carvalho Chehab Cc: Borislav Petkov, Qiuxu Zhuo, Aristeu Rozanski, linux-edac On Wed, Sep 18, 2019 at 07:52:46AM -0300, Mauro Carvalho Chehab wrote: > > break; > > + case ERRCHAN0: case ERRCHAN1: case ERRCHAN2: > > I would place each case on a separate line, in order to make easier > to read it, and to follow the Kernel coding style. This follows the pattern in this driver a couple of lines earlier: case CHAN0: case CHAN1: case CHAN2: It's not explicitly disallowed by Documentation/process/coding-style.rst which just says to indent the "case" at the same level as the "switch". (Though the example does put each case on a new line). > > + pci_read_config_dword(edev, 0x154, &log0); > > + pci_read_config_dword(edev, 0x148, &log1); > > + pci_read_config_dword(edev, 0x150, &log2); > > + pci_read_config_dword(edev, 0x15c, &log3); > > + pci_read_config_dword(edev, 0x114, &log4); > > + > > + dev_err(&edev->dev, "retry_rd_err_log[%.8x %.8x %.8x %.8x %.8x]\n", > > + log0, log1, log2, log3, log4); > > + > > + pci_read_config_dword(edev, 0x104, &corr0); > > + pci_read_config_dword(edev, 0x108, &corr1); > > + pci_read_config_dword(edev, 0x10c, &corr2); > > + pci_read_config_dword(edev, 0x110, &corr3); > > + > > + dev_err(&edev->dev, "correrrcnt[%.8x %.8x %.8x %.8x]\n", > > + corr0, corr1, corr2, corr3); > > I would report both dev_err above via EDAC. I was concerned about how big the buffer was ... but I see that MSG_SIZE is 1024 ... so plenty of space for this extra information. I will move this into the EDAC report in next version. > Btw, can't those be output on a way that wouldn't require someone > to look at the datasheet for the meaning of those registers? 
> "retry_rd_err_log" and "correrrcnt" sounds too obscure for me to > understand what they mean without reading the entire driver's code and > read the datasheets. I did put a note about correrrcnt in the commit comment. Each value contains a pair of 16-bit values for the per-rank corrected error counters (max 8 with a pair of quad-rank DIMMs in a channel). I suppose it would be better to print as 8 per-rank values instead of 4 paired values. Intel isn't keen on doing the detailed decode of the retry_rd_err_log (it took some arm twisting to get folks to let me print them in hex). -Tony ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCHv2 2/2] EDAC, skx: Retrieve and print retry_rd_err_log registers 2019-09-18 10:52 ` Mauro Carvalho Chehab 2019-09-23 23:57 ` Luck, Tony @ 2019-09-24 21:52 ` Tony Luck 1 sibling, 0 replies; 11+ messages in thread From: Tony Luck @ 2019-09-24 21:52 UTC (permalink / raw) To: Borislav Petkov Cc: Tony Luck, Qiuxu Zhuo, Aristeu Rozanski, Mauro Carvalho Chehab, linux-edac Skylake logs some additional useful information in per-channel registers in addition to the architectural status/addr/misc logged in the machine check bank. Pick up this information and add it to the EDAC log: retry_rd_err_log[five 32-bit register values] Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems. correrrcnt[per rank corrected error counts] Note that if additional errors are logged while these registers are being read, you may see a jumble of values, some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. With this change EDAC logs look like this: EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000]) Signed-off-by: Tony Luck <tony.luck@intel.com> --- Changes since v1(RFC): - Fixed case statements in switch to be one per line - Changed from printing extra information to console to appending to EDAC log - Decode the corrected error registers into the two 16-bit values corresponding to the two ranks covered by each. 
- Update commit comment with apology about lack of decode for the retry_rd_err_log registers drivers/edac/skx_base.c | 51 ++++++++++++++++++++++++++++++++++++--- drivers/edac/skx_common.c | 12 ++++++--- drivers/edac/skx_common.h | 4 ++- 3 files changed, 60 insertions(+), 7 deletions(-) diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c index 0fcf3785e8f3..a8853e724d1f 100644 --- a/drivers/edac/skx_base.c +++ b/drivers/edac/skx_base.c @@ -46,7 +46,8 @@ static struct skx_dev *get_skx_dev(struct pci_bus *bus, u8 idx) } enum munittype { - CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD + CHAN0, CHAN1, CHAN2, SAD_ALL, UTIL_ALL, SAD, + ERRCHAN0, ERRCHAN1, ERRCHAN2, }; struct munit { @@ -68,6 +69,9 @@ static const struct munit skx_all_munits[] = { { 0x2040, { PCI_DEVFN(10, 0), PCI_DEVFN(12, 0) }, 2, 2, CHAN0 }, { 0x2044, { PCI_DEVFN(10, 4), PCI_DEVFN(12, 4) }, 2, 2, CHAN1 }, { 0x2048, { PCI_DEVFN(11, 0), PCI_DEVFN(13, 0) }, 2, 2, CHAN2 }, + { 0x2043, { PCI_DEVFN(10, 3), PCI_DEVFN(12, 3) }, 2, 2, ERRCHAN0 }, + { 0x2047, { PCI_DEVFN(10, 7), PCI_DEVFN(12, 7) }, 2, 2, ERRCHAN1 }, + { 0x204b, { PCI_DEVFN(11, 3), PCI_DEVFN(13, 3) }, 2, 2, ERRCHAN2 }, { 0x208e, { }, 1, 0, SAD }, { } }; @@ -104,10 +108,18 @@ static int get_all_munits(const struct munit *m) } switch (m->mtype) { - case CHAN0: case CHAN1: case CHAN2: + case CHAN0: + case CHAN1: + case CHAN2: pci_dev_get(pdev); d->imc[i].chan[m->mtype].cdev = pdev; break; + case ERRCHAN0: + case ERRCHAN1: + case ERRCHAN2: + pci_dev_get(pdev); + d->imc[i].chan[m->mtype - ERRCHAN0].edev = pdev; + break; case SAD_ALL: pci_dev_get(pdev); d->sad_all = pdev; @@ -216,6 +228,39 @@ static int skx_get_dimm_config(struct mem_ctl_info *mci) #define SKX_ILV_REMOTE(tgt) (((tgt) & 8) == 0) #define SKX_ILV_TARGET(tgt) ((tgt) & 7) +static void skx_show_retry_rd_err_log(struct decoded_addr *res, + char *msg, int len) +{ + u32 log0, log1, log2, log3, log4; + u32 corr0, corr1, corr2, corr3; + struct pci_dev *edev; + int n; + + edev = 
res->dev->imc[res->imc].chan[res->channel].edev;
+
+	pci_read_config_dword(edev, 0x154, &log0);
+	pci_read_config_dword(edev, 0x148, &log1);
+	pci_read_config_dword(edev, 0x150, &log2);
+	pci_read_config_dword(edev, 0x15c, &log3);
+	pci_read_config_dword(edev, 0x114, &log4);
+
+	n = snprintf(msg, len, " retry_rd_err_log[%.8x %.8x %.8x %.8x %.8x]",
+		     log0, log1, log2, log3, log4);
+
+	pci_read_config_dword(edev, 0x104, &corr0);
+	pci_read_config_dword(edev, 0x108, &corr1);
+	pci_read_config_dword(edev, 0x10c, &corr2);
+	pci_read_config_dword(edev, 0x110, &corr3);
+
+	if (len - n > 0)
+		snprintf(msg + n, len - n,
+			 " correrrcnt[%.4x %.4x %.4x %.4x %.4x %.4x %.4x %.4x]",
+			 corr0 & 0xffff, corr0 >> 16,
+			 corr1 & 0xffff, corr1 >> 16,
+			 corr2 & 0xffff, corr2 >> 16,
+			 corr3 & 0xffff, corr3 >> 16);
+}
+
 static bool skx_sad_decode(struct decoded_addr *res)
 {
 	struct skx_dev *d = list_first_entry(skx_edac_list, typeof(*d), list);
@@ -659,7 +704,7 @@ static int __init skx_init(void)
 		}
 	}
 
-	skx_set_decode(skx_decode);
+	skx_set_decode(skx_decode, skx_show_retry_rd_err_log);
 
 	if (nvdimm_count && skx_adxl_get() == -ENODEV)
 		skx_printk(KERN_NOTICE, "Only decoding DDR4 address!\n");
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 58b8348d0f71..9174836ba85d 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -37,6 +37,7 @@ static char *adxl_msg;
 
 static char skx_msg[MSG_SIZE];
 static skx_decode_f skx_decode;
+static skx_show_retry_log_f skx_show_retry_rd_err_log;
 static u64 skx_tolm, skx_tohm;
 static LIST_HEAD(dev_edac_list);
 
@@ -150,9 +151,10 @@ static bool skx_adxl_decode(struct decoded_addr *res)
 	return true;
 }
 
-void skx_set_decode(skx_decode_f decode)
+void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log)
 {
 	skx_decode = decode;
+	skx_show_retry_rd_err_log = show_retry_log;
 }
 
 int skx_get_src_id(struct skx_dev *d, int off, u8 *id)
@@ -481,6 +483,7 @@ static void skx_mce_output_error(struct mem_ctl_info *mci,
 	bool overflow = GET_BITFIELD(m->status, 62, 62);
 	bool uncorrected_error = GET_BITFIELD(m->status, 61, 61);
 	bool recoverable;
+	int len;
 	u32 core_err_cnt = GET_BITFIELD(m->status, 38, 52);
 	u32 mscod = GET_BITFIELD(m->status, 16, 31);
 	u32 errcode = GET_BITFIELD(m->status, 0, 15);
@@ -540,12 +543,12 @@ static void skx_mce_output_error(struct mem_ctl_info *mci,
 		}
 	}
 	if (adxl_component_count) {
-		snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x %s",
+		len = snprintf(skx_msg, MSG_SIZE, "%s%s err_code:0x%04x:0x%04x %s",
 			 overflow ? " OVERFLOW" : "",
 			 (uncorrected_error && recoverable) ? " recoverable" : "",
 			 mscod, errcode, adxl_msg);
 	} else {
-		snprintf(skx_msg, MSG_SIZE,
+		len = snprintf(skx_msg, MSG_SIZE,
 			 "%s%s err_code:0x%04x:0x%04x socket:%d imc:%d rank:%d bg:%d ba:%d row:0x%x col:0x%x",
 			 overflow ? " OVERFLOW" : "",
 			 (uncorrected_error && recoverable) ? " recoverable" : "",
@@ -554,6 +557,9 @@ static void skx_mce_output_error(struct mem_ctl_info *mci,
 			 res->bank_group, res->bank_address, res->row, res->column);
 	}
 
+	if (skx_show_retry_rd_err_log)
+		skx_show_retry_rd_err_log(res, skx_msg + len, MSG_SIZE - len);
+
 	edac_dbg(0, "%s\n", skx_msg);
 
 	/* Call the helper to output message */
diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h
index 08cc971a50ea..60d1ea669afd 100644
--- a/drivers/edac/skx_common.h
+++ b/drivers/edac/skx_common.h
@@ -64,6 +64,7 @@ struct skx_dev {
 		u8 src_id, node_id;
 		struct skx_channel {
 			struct pci_dev	*cdev;
+			struct pci_dev	*edev;
 			struct skx_dimm {
 				u8 close_pg;
 				u8 bank_xor_enable;
@@ -113,10 +114,11 @@ struct decoded_addr {
 
 typedef int (*get_dimm_config_f)(struct mem_ctl_info *mci);
 typedef bool (*skx_decode_f)(struct decoded_addr *res);
+typedef void (*skx_show_retry_log_f)(struct decoded_addr *res, char *msg, int len);
 
 int __init skx_adxl_get(void);
 void __exit skx_adxl_put(void);
-void skx_set_decode(skx_decode_f decode);
+void skx_set_decode(skx_decode_f decode, skx_show_retry_log_f show_retry_log);
 
 int skx_get_src_id(struct skx_dev *d, int off, u8 *id);
 int skx_get_node_id(struct skx_dev *d, u8 *id);
-- 
2.20.1
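The correrrcnt formatting in the patch above prints each 32-bit register as two four-digit hex fields (low half first, then high half) and appends them to whatever is already in the buffer. A minimal userspace sketch of that step follows, under stated assumptions: append_corr_counts() is a hypothetical helper, not a function in the driver, and the register values are passed in as an array instead of being read with pci_read_config_dword(). Because snprintf() returns the length the output would have had without truncation, the sketch clamps the return value so a caller can keep appending safely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the driver): format four 32-bit
 * registers as eight 16-bit hex fields, low half before high half,
 * and write them into msg, which has room for len bytes.
 *
 * Returns the number of characters actually stored, excluding the
 * terminating NUL, so the caller can advance its write position.
 */
static int append_corr_counts(char *msg, int len, const uint32_t corr[4])
{
	int n;

	/* Nothing can be written into an empty or invalid buffer. */
	if (len <= 0)
		return 0;

	n = snprintf(msg, (size_t)len,
		     " correrrcnt[%.4x %.4x %.4x %.4x %.4x %.4x %.4x %.4x]",
		     (unsigned int)(corr[0] & 0xffff), (unsigned int)(corr[0] >> 16),
		     (unsigned int)(corr[1] & 0xffff), (unsigned int)(corr[1] >> 16),
		     (unsigned int)(corr[2] & 0xffff), (unsigned int)(corr[2] >> 16),
		     (unsigned int)(corr[3] & 0xffff), (unsigned int)(corr[3] >> 16));
	if (n < 0)
		return 0;

	/* snprintf() reports the untruncated length; clamp to what fit. */
	return n >= len ? len - 1 : n;
}
```

In kernel code the same clamping is usually obtained by calling scnprintf() instead of snprintf(); whether that is wanted here is a question for the patch author.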
* Re: [PATCH 0/2] EDAC, skx: Provide more machine specific location detail
From: Aristeu Rozanski @ 2019-09-17 20:05 UTC
To: Tony Luck; +Cc: Borislav Petkov, Qiuxu Zhuo, Mauro Carvalho Chehab, linux-edac

Hi Tony,

On Fri, Sep 13, 2019 at 03:13:42PM -0700, Tony Luck wrote:
> First patch refactors code so that second can work on systems
> with and without the ACPI ADXL address translation code. Perhaps
> has some value on its own as the code is, IMHO, a little cleaner.
>
> Second is in RFC state. I'm looking for input on whether to just print
> the extra information to the console log (as the patch does now) or
> whether to tag it onto the long string that we push through the EDAC
> reporting path.

I believe it'll be more interesting for users that only care about error
counts to keep this out of the console. For those who care about the extra
information, having it available with rasdaemon or equivalent will be
easier than having to look at both stored errors and kernel logs.

-- 
Aristeu
* Re: [PATCH 0/2] EDAC, skx: Provide more machine specific location detail
From: Mauro Carvalho Chehab @ 2019-09-18 10:30 UTC
To: Aristeu Rozanski; +Cc: Tony Luck, Borislav Petkov, Qiuxu Zhuo, linux-edac

On Tue, 17 Sep 2019 16:05:04 -0400 Aristeu Rozanski <aris@redhat.com> wrote:

> Hi Tony,
>
> On Fri, Sep 13, 2019 at 03:13:42PM -0700, Tony Luck wrote:
> > First patch refactors code so that second can work on systems
> > with and without the ACPI ADXL address translation code. Perhaps
> > has some value on its own as the code is, IMHO, a little cleaner.
> >
> > Second is in RFC state. I'm looking for input on whether to just print
> > the extra information to the console log (as the patch does now) or
> > whether to tag it onto the long string that we push through the EDAC
> > reporting path.
>
> I believe it'll be more interesting for users that only care about error
> counts to keep this out of the console. For those who care about the extra
> information, having it available with rasdaemon or equivalent will be
> easier than having to look at both stored errors and kernel logs.

I agree with Aris here: the best approach is to report the extra info via
EDAC, as a monitoring tool like rasdaemon will store it in a database
and/or report it via some mechanism like ABRT.

I would expect someone interested in monitoring hardware errors to want
all relevant details in the same place.

So, between a more detailed print and a more complete EDAC report, I would
do the latter. Yet nothing prevents doing both.

Thanks,
Mauro
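The "tag it onto the long string" option discussed above can be pictured with a small userspace sketch. Everything in it is an assumption made for illustration: show_detail_f, set_show_detail(), demo_detail() and build_report() are hypothetical names that only mirror the shape of skx_show_retry_log_f and skx_set_decode() in the patch, and the register value is a made-up constant. The point is that one optional hook appends machine-specific detail to the same message that would then travel through the reporting path, so a monitoring tool sees everything in one record.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 128	/* hypothetical size, chosen for the example */

/* Hypothetical hook type: append extra detail into msg (len bytes free). */
typedef void (*show_detail_f)(char *msg, int len);

static show_detail_f show_detail;	/* NULL when no hook is registered */
static char report_msg[MSG_SIZE];

/* Register (or clear, with NULL) the optional detail hook. */
static void set_show_detail(show_detail_f fn)
{
	show_detail = fn;
}

/* Example hook: appends one made-up register value. */
static void demo_detail(char *msg, int len)
{
	snprintf(msg, (size_t)len, " retry_rd_err_log[%.8x]", 0xdeadbeefu);
}

/* Build the single message that would be handed to the reporting path. */
static const char *build_report(const char *location)
{
	int len = snprintf(report_msg, MSG_SIZE, "err_code:0x%04x %s",
			   0x0090u, location);

	/* snprintf() returns the untruncated length; clamp before reuse. */
	if (len < 0)
		len = 0;
	if (len >= MSG_SIZE)
		len = MSG_SIZE - 1;

	/* Only call the hook if one is set and there is room left. */
	if (show_detail && MSG_SIZE - len > 1)
		show_detail(report_msg + len, MSG_SIZE - len);

	return report_msg;
}
```

With no hook registered the message is unchanged, which matches the patch's behaviour on systems that do not provide the extra registers. Printing the returned string to the console as well would cover the "do both" case.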
* Re: [PATCH 0/2] EDAC, skx: Provide more machine specific location detail
From: Aristeu Rozanski @ 2019-09-25 13:51 UTC
To: Tony Luck; +Cc: Borislav Petkov, Qiuxu Zhuo, Mauro Carvalho Chehab, linux-edac

On Fri, Sep 13, 2019 at 03:13:42PM -0700, Tony Luck wrote:
> First patch refactors code so that second can work on systems
> with and without the ACPI ADXL address translation code. Perhaps
> has some value on its own as the code is, IMHO, a little cleaner.
>
> Second is in RFC state. Im looking for input on whether to just print
> the extra information to the console log (as the patch does now) or
> whether to tag it onto the long string that we push though the EDAC
> reporting path.
>
> Tony Luck (2):
>   EDAC, skx_common: Refactor so that we initialize "dev" in result of
>     adxl decode.
>   EDAC, skx: Retrieve and print retry_rd_err_log registers

with v2:
Acked-by: Aristeu Rozanski <aris@redhat.com>

-- 
Aristeu