linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: "Wiebe, Wladislav (Nokia - DE/Ulm)" <wladislav.wiebe@nokia.com>
To: James Morse <james.morse@arm.com>, Borislav Petkov <bp@alien8.de>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>,
	"devicetree@vger.kernel.org" <devicetree@vger.kernel.org>,
	"arnd@arndb.de" <arnd@arndb.de>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"robh+dt@kernel.org" <robh+dt@kernel.org>,
	"Sverdlin,
	Alexander \(Nokia - DE/Ulm\)" <alexander.sverdlin@nokia.com>,
	"mchehab+samsung@kernel.org" <mchehab+samsung@kernel.org>,
	"Wiebe, Wladislav \(Nokia - DE/Ulm\)" <wladislav.wiebe@nokia.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: RE: [PATCH 2/2] EDAC: add ARM Cortex A15 L2 internal asynchronous error detection driver
Date: Wed, 9 Jan 2019 14:44:26 +0000	[thread overview]
Message-ID: <AM6PR0702MB37997C0FA1670DD1AF4C25C7FA8B0@AM6PR0702MB3799.eurprd07.prod.outlook.com> (raw)
In-Reply-To: <e1458b95-6d17-21be-5dd8-ee7ff8d8b3fb@arm.com>

Hi James,

first of all thanks a lot for the constructive and fast feedback!

> -----Original Message-----
> From: James Morse <james.morse@arm.com>
> Sent: Tuesday, January 08, 2019 6:57 PM
> 
> Hi Boris, Wladislav,
> 
> On 08/01/2019 10:42, Borislav Petkov wrote:
> > + James and leaving in the rest for reference.
> 
> (thanks!)
> 
> > So the first thing to figure out here is how generic is this and if
> > so, to make it a cortex_a15_edac.c driver which contains all the RAS
> > functionality for A15. Definitely not an EDAC driver per functional
> > unit but rather per vendor or even ARM core.
> 
> This is implementation-defined/specific-to-A15 and is documented in the
> TRM [0].
> (On the 'all the RAS functionality for A15' front: there are two more registers:
> L2MERRSR and CPUMERRSR. These are both accessible from the normal-
> world, and don't appear to need enabling.)
> 
> 
> But we have the usual pre-v8.2 problems, and in addition cluster-interrupts,
> as this signal might be per-cluster, or it might be combined.
> 
> Wladislav, I'm afraid we've had a few attempts at pre-8.2 EDAC drivers, the
> below list of problems is what we've learnt along the way. The upshot is that
> before the architected RAS extensions, the expectation is firmware will
> handle all this, as its difficult for the OS to deal with.
> 
> 
> My first question is how useful is a 'something bad happened' edac event?

We experienced sometimes random user-space crashes where we didn't 
expect a bug in the application code. If there would be a notification 
by such edac event, we would at least know that something bad happened before.

> Before the v8.2 extensions with its classification of errors, we don't know
> anything more.
> 
> The usual suspects are, (partly taken from the thread at [1]):
> * A15 exists in big/little configurations. We need to know which CPUs are
> A15.
> * We need to know we aren't running under a hypervisor, (a hypervisor can
> trap
>   accesses to these imp-def register, KVM does).
> * Nothing else should be clearing these bits, e.g. secure-world software, or
>   another CPU.
> * Secure-world needs to enable write-access to L2ECTLR, and we need to
>   know its done it. This needs doing on every CPU, and needs to not 'go
> missing'
>   over cpu-hotplug or cpu-idle.
> 
> These are things that don't naturally live in the DT.
> 
> 
> The new-one is these cluster-interrupts: How do we know which set of CPUs
> each interrupt goes with? What happens if user-space tries to rebalance
> them?

Valid question - so far, I didn't consider this case.
 
> Another SoC with A15 may combine all the cluster-interrupts into a single
> 'something bad happened' interrupt. Done like this, we would need to cross-
> call to the other CPUs when we take an interrupt - which is not something we
> can do.
> 
> Is this a level or edge interrupt? Is it necessary to clear that bit in the register
> to lower the interrupt line?
> The TRM talks about 'pending L2 internal asynchronous error', pending
> makes me suspect this is at least possible. If it is, a level-interrupt to one
> CPU, that can only be cleared by another leads to deadlock.
> 
> 
> Thanks,
> 
> James
> 
> > On Tue, Jan 08, 2019 at 08:10:45AM +0000, Wiebe, Wladislav (Nokia -
> DE/Ulm) wrote:
> >> This driver adds support for L2 internal asynchronous error detection
> >> caused by L2 RAM double-bit ECC error or illegal writes to the
> >> Interrupt Controller memory-map region on the Cortex A15.
> 
> >> diff --git a/drivers/edac/cortex_a15_l2_async_edac.c
> >> b/drivers/edac/cortex_a15_l2_async_edac.c
> >> new file mode 100644
> >> index 000000000000..26252568e961
> >> --- /dev/null
> >> +++ b/drivers/edac/cortex_a15_l2_async_edac.c
> >> @@ -0,0 +1,134 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (C) 2018 Nokia Corporation
> 
> (boiler plate not needed with the SPDX header)
> 
> >> + */
> >> +
> >> +#include <linux/module.h>
> >> +#include <linux/interrupt.h>
> >> +#include <linux/platform_device.h>
> >> +#include <linux/of.h>
> >> +
> >> +#include "edac_module.h"
> >> +
> >> +#define DRIVER_NAME "cortex_a15_l2_async_edac"
> >> +
> >> +#define L2ECTLR_L2_ASYNC_ERR BIT(30)
> >> +
> >> +static irqreturn_t cortex_a15_l2_async_edac_err_handler(int irq,
> >> +void *dev_id) {
> >> +	struct edac_device_ctl_info *dci = dev_id;
> >> +	u32 status = 0;
> >> +
> >> +	/*
> >> +	 * Read and clear L2ECTLR L2 ASYNC error bit caused by INTERRIRQ.
> >> +	 * Reason could be a L2 RAM double-bit ECC error or illegal writes
> >> +	 * to the Interrupt Controller memory-map region.
> >> +	 */
> >> +	asm("mrc p15, 1, %0, c9, c0, 3" : "=r" (status));
> 
> "L2 internal asynchronous error caused by L2 RAM double-bit ECC error"
> doesn't tell us if a CPU consumed the error, or if the error has caused a write
> to go missing. Without the classification, this means 'something bad
> happened'.
> 
> I'd prefer to panic() when we see one of these. I'd like it even more if
> firmware rebooted for us.

The EDAC subsystem allows to configure a panic() from userspace/sysfs.
So we can be flexible at this point I think.

> 
> >> +	if (status & L2ECTLR_L2_ASYNC_ERR) {
> >> +		status &= ~L2ECTLR_L2_ASYNC_ERR;
> >> +		asm("mcr p15, 1, %0, c9, c0, 3" : : "r" (status));
> 
> 4.3.49 "L2 Extended Control Register" of the A15 TRM says this field can be
> read-only/write-ignored for the normal world if NSACR.NS_L2ERR is 0.
> 
> How do we know if firmware has set this bit on all CPUs? We can't clear the
> error otherwise.

Valid point!
 
> 
> >> +		edac_printk(KERN_EMERG, DRIVER_NAME,
> >> +			    "L2 internal asynchronous error occurred!\n");
> >> +		edac_device_handle_ue(dci, 0, 0, dci->ctl_name);
> 
> >> +
> >> +		return IRQ_HANDLED;
> >> +	}
> >> +
> >> +	return IRQ_NONE;
> >> +}
> >> +
> >> +static int cortex_a15_l2_async_edac_probe(struct platform_device
> >> +*pdev) {
> >> +	struct edac_device_ctl_info *dci;
> >> +	struct device_node *np = pdev->dev.of_node;
> >> +	char *ctl_name = (char *)np->name;
> >> +	int i = 0, ret = 0, err_irq = 0, irq_count = 0;
> >> +
> >> +	/* We can have multiple CPU clusters with one INTERRIRQ per cluster
> >> +*/
> 
> Surely this an integration choice?
> 
> You're accessing the cluster through a cpu register in the handler, what
> happens if the interrupt is delivered to the wrong cluster?
> How do we know which interrupt maps to which cluster?
> How do we stop user-space 'balancing' the interrupts?

You are right, based on all your inputs I think we can stop using this driver
as generic A15 solution (at least I would need more time to do
the refactoring considering all points you stated and experienced already).

Thanks a lot!

- Wladislav

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2019-01-09 14:44 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-08  8:10 [PATCH 2/2] EDAC: add ARM Cortex A15 L2 internal asynchronous error detection driver Wiebe, Wladislav (Nokia - DE/Ulm)
2019-01-08 10:42 ` Borislav Petkov
2019-01-08 17:57   ` James Morse
2019-01-08 18:12     ` gregkh
2019-01-09  9:57       ` James Morse
2019-01-09 14:44     ` Wiebe, Wladislav (Nokia - DE/Ulm) [this message]
2019-01-11 18:11       ` James Morse
2019-01-11 18:29         ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AM6PR0702MB37997C0FA1670DD1AF4C25C7FA8B0@AM6PR0702MB3799.eurprd07.prod.outlook.com \
    --to=wladislav.wiebe@nokia.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.sverdlin@nokia.com \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=devicetree@vger.kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=james.morse@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mchehab+samsung@kernel.org \
    --cc=mchehab@kernel.org \
    --cc=robh+dt@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).