All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Hawa, Hanna" <hhhawa@amazon.com>
To: James Morse <james.morse@arm.com>
Cc: <robh+dt@kernel.org>, <mark.rutland@arm.com>, <bp@alien8.de>,
	<mchehab@kernel.org>, <davem@davemloft.net>,
	<gregkh@linuxfoundation.org>, <nicolas.ferre@microchip.com>,
	<paulmck@linux.ibm.com>, <dwmw@amazon.co.uk>, <benh@amazon.com>,
	<ronenk@amazon.com>, <talel@amazon.com>, <jonnyc@amazon.com>,
	<hanochu@amazon.com>, <linux-edac@vger.kernel.org>,
	<devicetree@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Date: Mon, 17 Jun 2019 16:00:45 +0300	[thread overview]
Message-ID: <bbb9b41d-8ffa-d4c5-c199-2400695cce8d@amazon.com> (raw)
In-Reply-To: <fdc3b458-96eb-1734-c294-2463f37f2244@arm.com>


>>>> +static void al_a57_edac_l2merrsr(void *arg)
>>>> +{
>>>
>>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>>
>>> How do we know this is corrected?
> 
>>> If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
>>> this what you are depending on here?
> 
>> No - not on this. Reporting all the errors as corrected seems to be bad.
>>
>> Can i be depends on fatal field?
> 
> That is described as "set to 1 on the first memory error that caused a Data Abort". I
> assume this is one of the parity-error external-aborts.
> 
> If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
> these errors caused an abort. But it could have been all three. The repeat counter only
> matches against the RAMID and friends, otherwise the error is counted in 'other'.
> 
> I don't think there is a right thing to do here, (other than increase the scrubbing
> frequency). As you can only feed one error into edac at a time then:
> 
>> if (fatal)
>>      edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
>> else
>>      edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
> 
> seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just
> go missing.
I had print the values of 'other/repeat' to be noticed.

> 
> 
>> How can L2CTLR_EL1[20] force fatal?
> 
> I don't think it can, on a second reading, it looks to be even more complicated than I
> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
> means in this context.)
> I was looking for reasons you could 'know' that any reported error was corrected. This was
> just a bad suggestion!
Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?

In case no interrupt, can we use die-notifier subsystem to check if any 
error had occur while system shutdown?

>>>> +        cluster = topology_physical_package_id(cpu);
>>>
>>> Hmm, I'm not sure cluster==package is guaranteed to be true forever.
>>>
>>> If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
>>> pulling out the DT using something like the arch code's parse_cluster().
> 
>> I rely on that it's alpine SoC specific driver.
> 
> ... and that the topology code hasn't changed to really know what a package is:
> https://lore.kernel.org/lkml/20190529211340.17087-2-atish.patra@wdc.com/T/#u
> 
> As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(),
> would struct cacheinfo's shared_cpu_map be a better fit?
> 
> This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's
> you've visited. It removes the dependency on package==L2, and insulates you from the
> cpu-numbering not being exactly as you expect.
I'll add dt property that point to L2-cache node (phandle), then it'll 
be easy to create cpu-mask with all cores that point to same l2 cache.

Thanks,
Hanna



WARNING: multiple messages have this Message-ID (diff)
From: "Hawa, Hanna" <hhhawa@amazon.com>
To: James Morse <james.morse@arm.com>
Cc: robh+dt@kernel.org, mark.rutland@arm.com, bp@alien8.de,
	mchehab@kernel.org, davem@davemloft.net,
	gregkh@linuxfoundation.org, nicolas.ferre@microchip.com,
	paulmck@linux.ibm.com, dwmw@amazon.co.uk, benh@amazon.com,
	ronenk@amazon.com, talel@amazon.com, jonnyc@amazon.com,
	hanochu@amazon.com, linux-edac@vger.kernel.org,
	devicetree@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Date: Mon, 17 Jun 2019 16:00:45 +0300	[thread overview]
Message-ID: <bbb9b41d-8ffa-d4c5-c199-2400695cce8d@amazon.com> (raw)
In-Reply-To: <fdc3b458-96eb-1734-c294-2463f37f2244@arm.com>


>>>> +static void al_a57_edac_l2merrsr(void *arg)
>>>> +{
>>>
>>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>>
>>> How do we know this is corrected?
> 
>>> If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
>>> this what you are depending on here?
> 
>> No - not on this. Reporting all the errors as corrected seems to be bad.
>>
>> Can i be depends on fatal field?
> 
> That is described as "set to 1 on the first memory error that caused a Data Abort". I
> assume this is one of the parity-error external-aborts.
> 
> If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
> these errors caused an abort. But it could have been all three. The repeat counter only
> matches against the RAMID and friends, otherwise the error is counted in 'other'.
> 
> I don't think there is a right thing to do here, (other than increase the scrubbing
> frequency). As you can only feed one error into edac at a time then:
> 
>> if (fatal)
>>      edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
>> else
>>      edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
> 
> seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just
> go missing.
I had print the values of 'other/repeat' to be noticed.

> 
> 
>> How can L2CTLR_EL1[20] force fatal?
> 
> I don't think it can, on a second reading, it looks to be even more complicated than I
> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
> means in this context.)
> I was looking for reasons you could 'know' that any reported error was corrected. This was
> just a bad suggestion!
Is there interrupt for un-correctable error?
Does 'asynchronous errors' in L2 used to report UE?

In case no interrupt, can we use die-notifier subsystem to check if any 
error had occur while system shutdown?

>>>> +        cluster = topology_physical_package_id(cpu);
>>>
>>> Hmm, I'm not sure cluster==package is guaranteed to be true forever.
>>>
>>> If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
>>> pulling out the DT using something like the arch code's parse_cluster().
> 
>> I rely on that it's alpine SoC specific driver.
> 
> ... and that the topology code hasn't changed to really know what a package is:
> https://lore.kernel.org/lkml/20190529211340.17087-2-atish.patra@wdc.com/T/#u
> 
> As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(),
> would struct cacheinfo's shared_cpu_map be a better fit?
> 
> This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's
> you've visited. It removes the dependency on package==L2, and insulates you from the
> cpu-numbering not being exactly as you expect.
I'll add dt property that point to L2-cache node (phandle), then it'll 
be easy to create cpu-mask with all cores that point to same l2 cache.

Thanks,
Hanna

  parent reply	other threads:[~2019-06-17 13:01 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-30 10:15 [PATCH 0/2] Add support for Amazon's Annapurna Labs EDAC for L1/L2 Hanna Hawa
2019-05-30 10:15 ` Hanna Hawa
2019-05-30 10:15 ` [PATCH 1/2] dt-bindings: EDAC: add Amazon Annapurna Labs EDAC binding Hanna Hawa
2019-05-30 10:15   ` Hanna Hawa
2019-05-30 11:54   ` Greg KH
2019-05-31  0:35     ` Borislav Petkov
2019-06-03  7:24       ` Woodhouse, David
2019-05-30 10:15 ` [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC Hanna Hawa
2019-05-30 10:15   ` Hanna Hawa
2019-05-30 11:57   ` Greg KH
2019-05-30 12:52     ` hhhawa
2019-05-30 12:52       ` hhhawa
2019-05-30 13:04       ` Joe Perches
2019-05-30 18:19   ` Boris Petkov
2019-05-30 18:19     ` Boris Petkov
2019-05-31  1:15     ` Herrenschmidt, Benjamin
2019-05-31  5:14       ` Borislav Petkov
2019-06-05 15:13         ` James Morse
2019-06-06  7:53         ` Hawa, Hanna
2019-06-06 10:03           ` Borislav Petkov
2019-06-06 10:03             ` Borislav Petkov
2019-06-06 10:33           ` James Morse
2019-06-06 11:22             ` Borislav Petkov
2019-06-06 11:22               ` Borislav Petkov
2019-06-06 11:37             ` Shenhar, Talel
2019-06-07 15:11               ` James Morse
2019-06-07 15:11                 ` James Morse
2019-06-08  0:22                 ` Benjamin Herrenschmidt
2019-06-08  0:16             ` Benjamin Herrenschmidt
2019-06-08  9:05               ` Borislav Petkov
2019-06-08  9:05                 ` Borislav Petkov
2019-06-11  5:50                 ` Benjamin Herrenschmidt
2019-06-11  5:50                   ` Benjamin Herrenschmidt
2019-06-11  7:21                   ` Benjamin Herrenschmidt
2019-06-11  7:21                     ` Benjamin Herrenschmidt
2019-06-11 11:56                     ` Borislav Petkov
2019-06-11 11:56                       ` Borislav Petkov
2019-06-11 22:25                       ` Benjamin Herrenschmidt
2019-06-11 22:25                         ` Benjamin Herrenschmidt
2019-06-12  3:48                         ` Borislav Petkov
2019-06-12  8:29                           ` Benjamin Herrenschmidt
2019-06-12 10:42                             ` Borislav Petkov
2019-06-12 10:42                               ` Borislav Petkov
2019-06-12 23:54                               ` Benjamin Herrenschmidt
2019-06-12 23:54                                 ` Benjamin Herrenschmidt
2019-06-13  7:44                                 ` Borislav Petkov
2019-06-13  7:44                                   ` Borislav Petkov
2019-06-14 10:53                                 ` Borislav Petkov
2019-06-14 10:53                                   ` Borislav Petkov
2019-06-12 10:42                             ` Mauro Carvalho Chehab
2019-06-12 11:00                               ` Borislav Petkov
2019-06-12 11:00                                 ` Borislav Petkov
2019-06-12 11:42                                 ` Mauro Carvalho Chehab
2019-06-12 11:42                                   ` Mauro Carvalho Chehab
2019-06-12 11:57                                   ` Benjamin Herrenschmidt
2019-06-12 12:25                                     ` Borislav Petkov
2019-06-12 12:25                                       ` Borislav Petkov
2019-06-12 12:35                                       ` Hawa, Hanna
2019-06-12 15:34                                         ` Borislav Petkov
2019-06-12 15:34                                           ` Borislav Petkov
2019-06-12 23:57                                       ` Benjamin Herrenschmidt
2019-06-12 23:57                                         ` Benjamin Herrenschmidt
2019-06-12 23:56                                 ` Benjamin Herrenschmidt
2019-06-11  7:29                   ` Hawa, Hanna
2019-06-11 11:59                     ` Borislav Petkov
2019-06-11 11:59                       ` Borislav Petkov
2019-06-11 11:47                   ` Borislav Petkov
2019-06-11 11:47                     ` Borislav Petkov
2019-06-03  6:56       ` Hawa, Hanna
2019-06-05 15:16   ` James Morse
2019-06-11 19:56     ` Hawa, Hanna
2019-06-11 19:56       ` Hawa, Hanna
2019-06-13 17:05       ` James Morse
2019-06-14 10:49         ` James Morse
2019-06-17 13:00         ` Hawa, Hanna [this message]
2019-06-17 13:00           ` Hawa, Hanna
2019-06-19 17:22           ` James Morse
2019-06-19 17:22             ` James Morse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bbb9b41d-8ffa-d4c5-c199-2400695cce8d@amazon.com \
    --to=hhhawa@amazon.com \
    --cc=benh@amazon.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=devicetree@vger.kernel.org \
    --cc=dwmw@amazon.co.uk \
    --cc=gregkh@linuxfoundation.org \
    --cc=hanochu@amazon.com \
    --cc=james.morse@arm.com \
    --cc=jonnyc@amazon.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mchehab@kernel.org \
    --cc=nicolas.ferre@microchip.com \
    --cc=paulmck@linux.ibm.com \
    --cc=robh+dt@kernel.org \
    --cc=ronenk@amazon.com \
    --cc=talel@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.