From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Subject: [v1] EDAC, armv8: Add Cache Error Reporting driver for ARMv8 processors From: James Morse Message-Id: <5A620BD5.9030205@arm.com> Date: Fri, 19 Jan 2018 15:16:37 +0000 To: Kyle Yan Cc: Mark Rutland , linux-edac@vger.kernel.org, mchehab@kernel.org, bp@alien8.de, linux-arm-kernel@lists.infradead.org, suzuki.poulose@arm.com List-ID: SGkgS3lsZSwKCk9uIDE4LzAxLzE4IDAxOjE5LCBLeWxlIFlhbiB3cm90ZToKPiBPbiAyMDE4LTAx LTE2IDA1OjExLCBKYW1lcyBNb3JzZSB3cm90ZToKPj4gT24gMTYvMDEvMTggMDE6MDAsIEt5bGUg WWFuIHdyb3RlOgo+Pj4gT24gMjAxOC0wMS0xNSAwNjo0NCwgTWFyayBSdXRsYW5kIHdyb3RlOgo+ Pj4+IE9uIEZyaSwgSmFuIDEyLCAyMDE4IGF0IDA0OjUwOjI2UE0gLTA4MDAsIEt5bGUgWWFuIHdy b3RlOgo+Pj4+PiBJbnRlcnJ1cHQgYmFzZWQgRURBQyBkcml2ZXIgZm9yIEFSTXY4IHByb2Nlc3Nv cnMgdGhhdCBpbXBsZW1lbnQKPj4+Pj4gUkFTIGZvciBlcnJvciBkZXRlY3Rpb24gb2YgQ1BVIGNh Y2hlcyBhbmQgbHNvIGFsbG93cyBvcHRpb25hbCBwb2xsaW5nCj4+Pj4+IG9mIGVycm9yIHN5bmRy b21lIHJlZ2lzdGVycyBpZiBpbnRlcnJ1cHRzIGFyZSBub3Qgc3VwcG9ydGVkLgo+Pj4+Cj4+Pj4g SXMgdGhpcyB1c2luZyB0aGUgYXJjaGl0ZWN0dXJhbCBSQVMgZXh0ZW5zaW9ucywgb3Igc29tZXRo aW5nIHNwZWNpZmljIHRvCj4+Pj4gUUMgQ1BVcz8gSXQgd291bGQgYmUgZ29vZCB0byBjYWxsIHRo aXMgb3V0IGV4cGxpY2l0bHkuCj4+Pj4KPj4+PiBJZiBpdCdzIHRoZSBsYXR0ZXIsIHRoaXMgbmVl ZHMgcmV3b3JkaW5nIHRvIGJlIGNsZWFyIHRoYXQgdGhpcyBpcwo+Pj4+IHNwZWNpZmljIHRvIFFD IENQVXMsIGFuZCBpcyBub3QgYSBnZW5lcmljIEFSTXY4IGZlYXR1cmUuCj4+Cj4+PiBUaGlzIGlz IG5vdCBzcGVjaWZpYyB0byBRQyBDUFVzIGFuZCBsb29rcyBhdCB0aGUgc3lzdGVtIHJlZ2lzdGVy cyBkZWZpbmVkIGJ5IHRoZQo+Pj4gUkFTIGV4dGVuc2lvbnMuCgpbLi4uXQoKPj4+Pj4gZGlmZiAt LWdpdCBhL2RyaXZlcnMvZWRhYy9hcm12OF9lZGFjLmMgYi9kcml2ZXJzL2VkYWMvYXJtdjhfZWRh Yy5jCj4+Pj4+IG5ldyBmaWxlIG1vZGUgMTAwNjQ0Cgo+Pj4+PiArc3RhdGljIHZvaWQgaW5pdF9y ZWdzX29uX2NwdShib29sIGFsbF9jcHVzKQoKWy4uLl0KCj4+Pj4+ICsgICAgd3JpdGVfZXJyc2Vs cl9lbDEoMSk7Cj4+Pj4+ICsgICAgaW5pdGlhbGl6ZV9yZWdpc3RlcnMoTlVMTCk7Cj4+Cj4+IFlv dSBzZWVtIHRvIG1hZ2ljYWxseS1rbm93IHRoYXQgdGhpcyAncmVjb3JkIDEnIGlzIHNoYXJlZCBi ZXR3ZWVuIGFsbCBDUFVzIGFuZAo+PiB2aXNpYmxlIHZpYSB0aGUgY3B1IGludGVyZmFjZS4KPj4K PiBZZXMgdGhpcyBwb3J0aW9uIChhbmQgY29taW5nIHRvIHRoZSByZWFsaXphdGlvbiB0aGF0IGEg bG90IG9mIG90aGVyIHBhcnRzKSBvZiB0aGUKPiBjb2RlIGlzIHNwZWNpZmljIHRvIG91ciB0b3Bv bG9neS4gSXMgdGhlcmUgYW55IHdheSB0byBkZXRlY3QgdGhpcz8KCk5vLCB3ZSBhcmUgZ29pbmcg dG8gbmVlZCBEVC9BQ1BJIGRhdGEgdG8gdGVsbCB1cyB0aGUgdG9wb2xvZ3kuCgpUbyBrbm93IGlm IGEgcmVjb3JkIGlzIHNoYXJlZCBvciBDUFUtcHJpdmF0ZSwgd2UgaGF2ZSB0aGUgRVJSREVWQUZG ICdEZXZpY2UKQWZmaW5pdHkgUmVnaXN0ZXInLCBidXQgdGhpcyBpcyBvbmx5IGFjY2Vzc2libGUg dmlhIHRoZSBtZW1vcnktbWFwLCB3aGljaCBhbHNvCmdyb3VwcyByZWNvcmRzIGJ5IG5vZGUvZGV2 aWNlLiBEVC9BQ1BJIHdpbGwgaGF2ZSB0byB0ZWxsIHVzIHdoZXJlIHRoZSBlbnRyaWVzIGluCnRo ZSBtZW1vcnktbWFwIGFyZS4KClBhaXJpbmcgdXAgbWVtb3J5LW1hcCBub2RlL2RldmljZXMgd2l0 aCBpbmRleGVzIGluIHRoZSBjcHUgaW50ZXJmYWNlIHJlZ2lzdGVycwppcyBzb21ldGhpbmcgSSdt IGN1cnJlbnRseSB0cnlpbmcgdG8gZ2V0IG15IGhlYWQgYXJvdW5kLgoKCj4gRVJSSURSIHJlcG9y dHMKPiBob3cgbWFueSByZWNvcmRzIHRoZXJlIGFyZSBidXQgbm90IHdoYXQgZWFjaCByZWNvcmQg aXMgZm9yLiBXaXRob3V0IHRoaXMKPiBpbmZvcm1hdGlvbiwKPiBub3Qgc3VyZSBob3cgdGhlcmUg Y291bGQgYmUgYSBnZW5lcmljIGludGVycnVwdCBiYXNlZCBkcml2ZXIuCgpXaGljaCBpbnRlcnJ1 cHRzIGEgbm9kZS9kZXZpY2UgZ2VuZXJhdGVzIChhbmQgaG93KSBpcyBhbm90aGVyIHBpZWNlIG9m IHBlcgpub2RlL2RldmljZSBkYXRhIHdlIG5lZWQuCgpEbyB3ZSBuZWVkIHRvIGtub3cgd2hhdCBz b3J0IG9mIG5vZGUvZGV2aWNlIHRoZSByZWNvcmRzIGFyZSBmb3IgaW4gYWR2YW5jZT8KCldlIHdp bGwgcHJvYmFibHkgbmVlZCBzb21lIGluZnJhc3RydWN0dXJlIHRvIHBhcnNlIHRoZSBEVC9BQ1BJ IHRhYmxlcyBhbmQgZW5hYmxlCmFuZCByZWdpc3RlciB0aGUgYXBwcm9wcmlhdGUgaW50ZXJydXB0 cy4gSSB3b3VsZCBsaWtlIGFueSBrZXJuZWwtZmlyc3QKZXh0ZXJuYWwtYWJvcnQgUkFTIGNvZGUg dG8gaGF2ZSBhIGxpc3Qgb2YgdGhlIG5vZGVzIHRoYXQgZ2VuZXJhdGUgJ2luIGJhbmQnCm5vdGlm aWNhdGlvbnMsIHNvIHdlIGRvbid0IGhhdmUgdG8gd2FsayB0aGVtIGFsbC4gSWYgd2UgY2FuIGtu b3cgdGhlCmNwdS1hZmZpbml0eSwgd2UgY2FuIG1ha2UgdGhlIGxpc3Qgc2hvcnRlciBhbmQgcGVy LWNwdS4KCk9uY2Ugd2UgaGF2ZSB0aGF0IGluZnJhc3RydWN0dXJlLCBhIGRyaXZlciBjYW4gcmVn aXN0ZXIgdG8gaGFuZGxlIHRoZQphcmNoaXRlY3RlZCBjYWNoZSBlcnJvcnMuIFdlIGRvbid0IG5l ZWQgdG8ga25vdyB3aGF0IHNvcnQgb2YgcmVjb3JkcyBhCm5vZGUvZGV2aWNlIGdlbmVyYXRlcyB1 bnRpbCBvbmUgc2hvd3MgdXAuIChJIGFzc3VtZSBjYWNoZXMgY2FuIGFsc28gZ2VuZXJhdGUKdGlt ZW91dHMgYW5kIGJ1cyBlcnJvcnMpLgoKSSB0aGluayB5b3UgbmVlZCB0byBrbm93IHRoZSByZWNv cmQgdHlwZXMgaW4gYWR2YW5jZSBiZWNhdXNlIHlvdSBhcmUgcG9sbGluZwp0aG9zZSByZWNvcmRz LiBJIGNhbid0IHNlZSBhIHdheSBvZiBkZXRlcm1pbmluZyB3aGljaCBub2RlL2RldmljZXMgYXJl IGNhY2hlcywKc28gaWYgd2UgbmVlZCB0aGlzIGluZm9ybWF0aW9uLCBpdCB3aWxsIGhhdmUgdG8g Y29tZSBmcm9tIEFDUEkvRFQuCgoKWyAuLi5dCgo+IEkgc2hvdWxkIHByb2JhYmx5IGFtZW5kIGEg cHJldmlvdXMgc3RhdGVtZW50IGFib3V0IHRoaXMgYWxsIGJlaW5nIGRlZmluZWQgaW4KPiBSQVMu IERpZG4ndCByZWFsaXplIHRoaXMgaW5pdGlhbGx5Cj4gYnV0IGFzIHlvdSBoYXZlIG1lbnRpb25l ZCwgZXZlbiB0aG91Z2ggdGhlIHJlZ2lzdGVycyBhcmUgYXJjaGl0ZWN0ZWQsIHRoZQo+IGRlY29k aW5nL2JpdHMgd2l0aGluIHRoZSByZWdpc3RlcnMgYXJlCj4gaW1wbGVtZW50YXRpb24gZGVmaW5l ZC4KPiBTbyB3aXRoaW4gdGhlIGRyaXZlciwgc29tZSBvZiB0aGUgIm1hZ2ljYWxseSBrbm93biIK PiB2YWx1ZXMgYXJlIGJlY2F1c2UgYXMgaXQgaXMgcmlnaHQgbm93LAo+IHRoZSBkZWNvZGluZ3Mg YW5kIHNvbWUgYmVoYXZpb3IgaXMgc3BlY2lmaWMgdG8gb3VyIGltcGxlbWVudGF0aW9uLgoKKGFz IGlzIHRoZSB0b3Bsb2d5KSwgdGhpcyBpcyB3aHkgaXRzIHByZWZlcmFibGUgdG8gaGF2ZSB0aGlz IGluIGZpcm13YXJlLCBhbmQKdXNlIHNvbWV0aGluZyBsaWtlIEFQRUkuCgpZb3VyIGxldmVsIGFu ZCB3YXkgcGFyYW1ldGVycyBhcmVuJ3QgYXJjaGl0ZWN0ZWQsIGRvZXMgdGhlIGVkYWMgZHJpdmVy IGRvCmFueXRoaW5nIG90aGVyIHRoYW4gcHJpbnQgdGhlc2Ugb3V0PwoKSSBkb24ndCBzZWUgaG93 IHNvZnR3YXJlIGNhbiBhdm9pZCB1c2luZyB3YXktNiBpbiBMMSwgYWxsIGl0IGNhbiBkbyBpcyB3 YXRjaCB0aGUKKGFyY2hpdGVjdGVkKSBlcnJvciBjb3VudGVyIGFuZCBkZWNpZGUgJ3RvbyBtdWNo IScgYXQgc29tZSB0aHJlc2hvbGQuIEZ1dHVyZQpzbWFydHMgY291bGQgbGV0IHVzIG9mZmxpbmUg dGhlIGFmZmVjdGVkIENQVXMsIGZvciB3aGljaCB3ZSB3b3VsZCBuZWVkIHRvIGtub3cKdGhlIGFm ZmluaXR5IG9mIHRoZSBkZXZpY2Uvbm9kZSB0aGF0IGdlbmVyYXRlcyB0aGVzZSByZWNvcmRzLiAo d2hpY2ggd2UgaGF2ZSwKZnJvbSB0aGUgbWVtb3J5IG1hcCkuCgoKPiBUbyBtb3JlIGdlbmVyYWxp emUgdGhpcywgd291bGQgaXQgYmUgYmV0dGVyIHRvIChhZnRlciBjaGVja2luZyB0aGUgSURSL0ZS Cj4gcmVnaXN0ZXJzKSB0byBpbnN0ZWFkIGp1c3QgZHVtcCBvdXQgdGhlCj4gRVJSWFNUQVRVUy9F UlJYTUlTQyBkaXJlY3RseSwgYW5kIHRoZW4gaGF2ZSByZWdpc3RlcmVkIGNhbGxiYWNrcyB0byBh Y3R1YWxseQo+IHByb3Blcmx5IGRlY29kZSB0aGUgdmFsdWVzIGludG8gc29tZXRoaW5nIG1vcmUg cmVhZGFibGU/CgpJJ20gbm90IGNvbnZpbmNlZCB3ZSBuZWVkIHRoZSBpbXBkZWYgdmFsdWVzLCB3 ZSBjYW4ndCBkbyBhbnl0aGluZyB3aXRoIHRoZQppbmZvcm1hdGlvbi4KSWYgd2UgaGF2ZSBzb21l IGluZnJhc3RydWN0dXJlIGZvciB0aGUgdXNlcnMgb2YgdGhlc2UgUkFTIEVSUi1yZWdpc3RlcnMv bm9kZXMKYW5kIGRyaXZlcnMgY2FuIHJlZ2lzdGVyIHRvIG1vbml0b3IgY2xhc3NlcyBvZiBlcnJv ciwgdGhlbiB3ZSd2ZSBrZXB0IHRoZSBkb29yCm9wZW4gZm9yIFNvQy1zcGVjaWZpYyBjb21wYXRp YmxlcyBmb3IgZHJpdmVycyB0byBwb2tlIGFyb3VuZCBpbiB0aGUgaW1wZGVmIHBhcnRzCm9mIHRo ZSByZWdpc3RlcnMgaWYgaXQgdHVybnMgb3V0IHRvIGJlIG5lY2Vzc2FyeS4KClRoaXMgaW5mcmFz dHJ1Y3R1cmUgd291bGQgYWxzbyBsZXRzIHVzIHBva2UgdGhlIGVkYWMgZHJpdmVyIGZvciBjYWNo ZS1lcnJvcnMgZm9yCmluLWJhbmQgc2lnbmFsbGVkIGVycm9ycy4KCgpUaGFua3MsCgpKYW1lcwot LS0KVG8gdW5zdWJzY3JpYmUgZnJvbSB0aGlzIGxpc3Q6IHNlbmQgdGhlIGxpbmUgInVuc3Vic2Ny aWJlIGxpbnV4LWVkYWMiIGluCnRoZSBib2R5IG9mIGEgbWVzc2FnZSB0byBtYWpvcmRvbW9Admdl ci5rZXJuZWwub3JnCk1vcmUgbWFqb3Jkb21vIGluZm8gYXQgIGh0dHA6Ly92Z2VyLmtlcm5lbC5v cmcvbWFqb3Jkb21vLWluZm8uaHRtbAo= From mboxrd@z Thu Jan 1 00:00:00 1970 From: james.morse@arm.com (James Morse) Date: Fri, 19 Jan 2018 15:16:37 +0000 Subject: [PATCH v1] EDAC, armv8: Add Cache Error Reporting driver for ARMv8 processors In-Reply-To: <6c3d84bd102d49aa78591caa5adf7bfc@codeaurora.org> References: <1515804626-21254-1-git-send-email-kyan@codeaurora.org> <20180115144441.d6dlwl7herq24byv@lakrids.cambridge.arm.com> <3d3da849d32b1a674d1bcd560eeba261@codeaurora.org> <5A5DF9F4.90101@arm.com> <6c3d84bd102d49aa78591caa5adf7bfc@codeaurora.org> Message-ID: <5A620BD5.9030205@arm.com> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hi Kyle, On 18/01/18 01:19, Kyle Yan wrote: > On 2018-01-16 05:11, James Morse wrote: >> On 16/01/18 01:00, Kyle Yan wrote: >>> On 2018-01-15 06:44, Mark Rutland wrote: >>>> On Fri, Jan 12, 2018 at 04:50:26PM -0800, Kyle Yan wrote: >>>>> Interrupt based EDAC driver for ARMv8 processors that implement >>>>> RAS for error detection of CPU caches and lso allows optional polling >>>>> of error syndrome registers if interrupts are not supported. >>>> >>>> Is this using the architectural RAS extensions, or something specific to >>>> QC CPUs? It would be good to call this out explicitly. >>>> >>>> If it's the latter, this needs rewording to be clear that this is >>>> specific to QC CPUs, and is not a generic ARMv8 feature. >> >>> This is not specific to QC CPUs and looks at the system registers defined by the >>> RAS extensions. [...] >>>>> diff --git a/drivers/edac/armv8_edac.c b/drivers/edac/armv8_edac.c >>>>> new file mode 100644 >>>>> +static void init_regs_on_cpu(bool all_cpus) [...] >>>>> + write_errselr_el1(1); >>>>> + initialize_registers(NULL); >> >> You seem to magically-know that this 'record 1' is shared between all CPUs and >> visible via the cpu interface. >> > Yes this portion (and coming to the realization that a lot of other parts) of the > code is specific to our topology. Is there any way to detect this? No, we are going to need DT/ACPI data to tell us the topology. To know if a record is shared or CPU-private, we have the ERRDEVAFF 'Device Affinity Register', but this is only accessible via the memory-map, which also groups records by node/device. DT/ACPI will have to tell us where the entries in the memory-map are. Pairing up memory-map node/devices with indexes in the cpu interface registers is something I'm currently trying to get my head around. > ERRIDR reports > how many records there are but not what each record is for. Without this > information, > not sure how there could be a generic interrupt based driver. Which interrupts a node/device generates (and how) is another piece of per node/device data we need. Do we need to know what sort of node/device the records are for in advance? We will probably need some infrastructure to parse the DT/ACPI tables and enable and register the appropriate interrupts. I would like any kernel-first external-abort RAS code to have a list of the nodes that generate 'in band' notifications, so we don't have to walk them all. If we can know the cpu-affinity, we can make the list shorter and per-cpu. Once we have that infrastructure, a driver can register to handle the architected cache errors. We don't need to know what sort of records a node/device generates until one shows up. (I assume caches can also generate timeouts and bus errors). I think you need to know the record types in advance because you are polling those records. I can't see a way of determining which node/devices are caches, so if we need this information, it will have to come from ACPI/DT. [ ...] > I should probably amend a previous statement about this all being defined in > RAS. Didn't realize this initially > but as you have mentioned, even though the registers are architected, the > decoding/bits within the registers are > implementation defined. > So within the driver, some of the "magically known" > values are because as it is right now, > the decodings and some behavior is specific to our implementation. (as is the toplogy), this is why its preferable to have this in firmware, and use something like APEI. Your level and way parameters aren't architected, does the edac driver do anything other than print these out? I don't see how software can avoid using way-6 in L1, all it can do is watch the (architected) error counter and decide 'too much!' at some threshold. Future smarts could let us offline the affected CPUs, for which we would need to know the affinity of the device/node that generates these records. (which we have, from the memory map). > To more generalize this, would it be better to (after checking the IDR/FR > registers) to instead just dump out the > ERRXSTATUS/ERRXMISC directly, and then have registered callbacks to actually > properly decode the values into something more readable? I'm not convinced we need the impdef values, we can't do anything with the information. If we have some infrastructure for the users of these RAS ERR-registers/nodes and drivers can register to monitor classes of error, then we've kept the door open for SoC-specific compatibles for drivers to poke around in the impdef parts of the registers if it turns out to be necessary. This infrastructure would also lets us poke the edac driver for cache-errors for in-band signalled errors. Thanks, James