From: James Morse <james.morse@arm.com>
To: "Shenhar, Talel" <talel@amazon.com>
Cc: Marc Zyngier <maz@kernel.org>,
robh+dt@kernel.org, tglx@linutronix.de, jason@lakedaemon.net,
mark.rutland@arm.com, nicolas.ferre@microchip.com,
mchehab+samsung@kernel.org, shawn.lin@rock-chips.com,
gregkh@linuxfoundation.org, dwmw@amazon.co.uk,
benh@kernel.crashing.org, linux-kernel@vger.kernel.org,
devicetree@vger.kernel.org
Subject: Re: [UNVERIFIED SENDER] Re: [UNVERIFIED SENDER] Re: [PATCH v2 2/3] soc: amazon: al-pos: Introduce Amazon's Annapurna Labs POS driver
Date: Thu, 19 Sep 2019 15:42:21 +0100 [thread overview]
Message-ID: <93e5ac72-13e1-4672-16f0-62ee6b8a8390@arm.com> (raw)
In-Reply-To: <36f19b3f-46d3-f6d0-3681-71e3a9bb52ce@amazon.com>
Hi guys,
On 12/09/2019 10:19, Shenhar, Talel wrote:
> On 9/12/2019 11:50 AM, Marc Zyngier wrote:
>> On Thu, 12 Sep 2019 07:50:03 +0100,
>> "Shenhar, Talel" <talel@amazon.com> wrote:
>>> On 9/11/2019 5:15 PM, Marc Zyngier wrote:
>>>> On Tue, 10 Sep 2019 20:05:09 +0100,
>>>> Talel Shenhar <talel@amazon.com> wrote:
>>>>> + if (!FIELD_GET(AL_POS_ERROR_LOG_1_VALID, log1))
>>>>> + return IRQ_NONE;
>>>>> +
>>>>> + log0 = readl(pos->mmio_base + AL_POS_ERROR_LOG_0);
>>>>> + writel(0, pos->mmio_base + AL_POS_ERROR_LOG_1);
>>>>> +
>>>>> + addr = FIELD_GET(AL_POS_ERROR_LOG_0_ADDR_LOW, log0);
>>>>> + addr |= (FIELD_GET(AL_POS_ERROR_LOG_1_ADDR_HIGH, log1) << 32);
>>>>> + request_id = FIELD_GET(AL_POS_ERROR_LOG_1_REQUEST_ID, log1);
>>>>> + bresp = FIELD_GET(AL_POS_ERROR_LOG_1_BRESP, log1);
>>>>> +
>>>>> + dev_err(&pdev->dev, "addr=0x%llx request_id=0x%x bresp=0x%x\n",
>>>>> + addr, request_id, bresp);
>>>> What is this information? How do we make use of it? Given that this is
>>>> asynchronous, how do we correlate it to the offending software?
>>> Indeed this information arriving from the HW is asynchronous.
>>>
>>> There is no direct method to get the offending software.
>>>
>>> There are all kinds of hacks we do to find the offending software once
>>> we find this error. most of the time its a new patch introduced but
>>> some of the time is just digging.
>> OK, so that the moment, this is more of a debug tool than anything
>> else, right?
> Not sure what do you mean by debug tool. this is used to capture iliigle access and allow
> panic() based on them or simple logging.
Plumbing this into edac as a 'device' gives you the existing/standard interface for
user-space. For example the 'panic_on_ue' that is exposed via sysfs, this saves you having
another interface to toggle it for your driver. You can then use the existing distro tools
to drive/monitor/sample it.
>>>> The whole think looks to me like a poor man's EDAC handling, and I'd
>>>> expect to be plugged in that subsystem instead. Any reason why this
>>>> isn't the case? It would certainly make the handling uniform for the
>>>> user.
>>> This logic was not plugged into EDAC as there is no "Correctable"
>>> error here. its just error event. Not all errors are EDAC in the sense
>>> of Error Detection And *Correction*. There are no correctable errors
>>> for this driver.
>> I'd argue the opposite! Because you obviously don't let a read-only
>> register being written to, the error has been corrected, and you
>> signal the correction status.
> Not the meaning of corrected from my point of view - the system as a whole (sw&hw) are not
> working state. a driver thinks it configured the system to do A while the system doesn't
> really do that. and the critical part is that the driver that did operation A doesn't even
> have a way to know it.
>
> So I would not call this corrected.
I don't think corrected/uncorrected helps here. If the register is read-only, and software
writes to it, its a software bug.
(from the v8.2's RAS extensions view, its somewhere between unrecoverable and uncontained)
>>> So plugging it under EDAC seems like abusing the EDAC system.
If EDAC doesn't do what you need, it can always be extended.
>>> Now that I've emphasize the reason for not putting this under EDAC,
>>> what do you think? should this "only uncorrectable event" driver
>>> should be part of EDAC?
Sure, (its for memory controllers, but:) enum edac_type has a EDAC_EC: "Error Checking, no
correction". This wouldn't be the only device that only reports uncorrectable errors.
>> My choice would be to plug it into the EDAC subsystem, and report all
>> interrupts as "Corrected" events. Optionally, and only if you are
>> debugging something that requires it, report the error as
>> "Uncorrectable", in which case the EDAC subsystem should trigger a
>> panic.
>> At least you'd get the infrastructure, logging and tooling that the
>> EDAC subsystem offers (parsing the kernel log doesn't really count).
> I see what you say. However, I don't see too much added value in plugging this to EDAC and
> feel like it would abuse EDAC framework.
> James, will love your input from EDAC point of view, does it make sense to plug
> un-correctable only event to EDAC?
I think this device is an example of something like a "Fabric switch units" in
Documentation/driver-api/edac.rst. It makes sense that it should be described as a
'device' to edac. You can then use the existing user-space tools to control/report/monitor
the values.
Thanks,
James
next prev parent reply other threads:[~2019-09-19 14:42 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-10 19:05 [PATCH v2 0/3] Amazon's Annapurna Labs POS Driver Talel Shenhar
2019-09-10 19:05 ` [PATCH v2 1/3] dt-bindings: soc: al-pos: Amazon's Annapurna Labs POS Talel Shenhar
2019-09-18 13:32 ` Rob Herring
2019-09-18 13:44 ` Shenhar, Talel
2019-09-10 19:05 ` [PATCH v2 2/3] soc: amazon: al-pos: Introduce Amazon's Annapurna Labs POS driver Talel Shenhar
2019-09-11 14:15 ` Marc Zyngier
2019-09-12 6:50 ` [UNVERIFIED SENDER] " Shenhar, Talel
2019-09-12 8:50 ` Marc Zyngier
2019-09-12 9:19 ` [UNVERIFIED SENDER] " Shenhar, Talel
2019-09-19 14:42 ` James Morse [this message]
2019-09-22 6:55 ` Shenhar, Talel
2019-09-10 19:05 ` [PATCH v2 3/3] soc: amazon: al-pos: cast to u64 before left shifting Talel Shenhar
2019-09-11 14:18 ` Marc Zyngier
2019-09-12 6:51 ` [UNVERIFIED SENDER] " Shenhar, Talel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=93e5ac72-13e1-4672-16f0-62ee6b8a8390@arm.com \
--to=james.morse@arm.com \
--cc=benh@kernel.crashing.org \
--cc=devicetree@vger.kernel.org \
--cc=dwmw@amazon.co.uk \
--cc=gregkh@linuxfoundation.org \
--cc=jason@lakedaemon.net \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=mchehab+samsung@kernel.org \
--cc=nicolas.ferre@microchip.com \
--cc=robh+dt@kernel.org \
--cc=shawn.lin@rock-chips.com \
--cc=talel@amazon.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).