linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: James Morse <james.morse@arm.com>
To: "Hawa, Hanna" <hhhawa@amazon.com>
Cc: robh+dt@kernel.org, mark.rutland@arm.com, bp@alien8.de,
	mchehab@kernel.org, davem@davemloft.net,
	gregkh@linuxfoundation.org, nicolas.ferre@microchip.com,
	paulmck@linux.ibm.com, dwmw@amazon.co.uk, benh@amazon.com,
	ronenk@amazon.com, talel@amazon.com, jonnyc@amazon.com,
	hanochu@amazon.com, linux-edac@vger.kernel.org,
	devicetree@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
Date: Thu, 13 Jun 2019 18:05:14 +0100	[thread overview]
Message-ID: <fdc3b458-96eb-1734-c294-2463f37f2244@arm.com> (raw)
In-Reply-To: <4514bfa2-68b2-2074-b817-2f5037650c4e@amazon.com>

Hi Hawa,

On 11/06/2019 20:56, Hawa, Hanna wrote:
> James Morse wrote:
>> Hawa, Hanna wrote:
>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>
>> How do we know this was corrected?
>>
>> 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in a
>> paragraph talking about the L1 memory system.

> I'll check fatal field to check if it's uncorrected/corrected.


>>> +    edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
>>> +            cpu, (fatal) ? "Fatal " : "");
>>> +    edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
>>> +
>>> +    switch (ramid) {
>>> +    case ARM_CA57_L1_I_TAG_RAM:
>>> +        pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
>>> +        break;
>>> +    case ARM_CA57_L1_I_DATA_RAM:
>>> +        pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way);
>>> +        break;
>>
>> Is index/way information really useful? I can't replace way-3 on the system, nor can I
>> stop it being used. If its useless, I'd rather we don't bother parsing and printing it out.

> I'll remove the index/way information, and keep CPUMERRSR_EL1 value print (who want this
> information can parse the value and get the index/bank status)

Good idea, just print it raw.


>>> +    pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other,
>>> +        val);
>>
>> 'other' here is another error, but we don't know the ramid.
>> 'repeat' is another error for the same ramid.
>>
>> Could we still feed this stuff into edac? This would make the counters accurate if the
>> polling frequency isn't quite fast enough.

> There is no API in EDAC to increase the counters, I can increase by accessing the
> ce_count/ue_count from edac_device_ctl_info/edac_device_instance structures if it's okay..

Ah, sorry, I was thinking of the edac_mc_handle_error(), which has an error_count parameter.

(I wouldn't go poking in the structures directly).


>>> +static void al_a57_edac_l2merrsr(void *arg)
>>> +{
>>
>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>
>> How do we know this is corrected?

>> If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
>> this what you are depending on here?

> No - not on this. Reporting all the errors as corrected seems to be bad.
> 
> Can i be depends on fatal field?

That is described as "set to 1 on the first memory error that caused a Data Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
these errors caused an abort. But it could have been all three. The repeat counter only
matches against the RAMID and friends, otherwise the error is counted in 'other'.

I don't think there is a right thing to do here, (other than increase the scrubbing
frequency). As you can only feed one error into edac at a time then:

> if (fatal)
>     edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
> else
>     edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

seems reasonable. You're reporting the most severe, and 'other/repeat' counter values just
go missing.


> How can L2CTLR_EL1[20] force fatal?

I don't think it can, on a second reading, it looks to be even more complicated than I
thought! That bit is described as disabling forwarding of uncorrected data, but it looks
like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was corrected. This was
just a bad suggestion!


>> (it would be good to have a list of integration-time and firmware dependencies this driver
>> has, for the next person who tries to enable it on their system and complains it doesn't
>> work for them)

> Where can i add such information?

The mailing list archive is good enough. I'll ask about any obvious dependency I spot from
the TRM, (e.g. that list at the top of my first reply). If you know of anything weird
please call it out.


>>> +    pr_cont(", cpuid/way=%d, repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)\n",
>>> +        way, repeat, other, val);
>>
>> cpuid could be useful if you can map it back to the cpu number linux has.
>> If you can spot that cpu-7 is experiencing more errors than it should, you can leave it
>> offline.
>>
>> To do this you'd need to map each L2MERRSR_EL1's '0-3' range back to the CPUs they
>> actually are. The gic's 'ppi-partitions' does this with phandles, e.g.
>> Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml. You could add a
>> similar shaped thing to the l2-cacheX node in the DT, (or in your edac node, but it is a
>> property of the cache integration).

> As in L1 prints, I'll remove non-relevant prints.

Fair enough.


>>> +    /*
>>> +     * Use get_online_cpus/put_online_cpus to prevent the online CPU map
>>> +     * changing while reads the L1/L2 error status
>>
>> For walking the list of offline cpus, this makes sense. But you schedule work without
>> waiting, it may get run after you drop the cpus_read_lock()...,

> Will update the smp_call_function_single() call function to wait.


>>> +    for_each_online_cpu(cpu) {
>>> +        /* Check L1 errors */
>>> +        smp_call_function_single(cpu, al_a57_edac_cpumerrsr, edac_dev,
>>> +                     0);
>>
>> As you aren't testing for big/little, wouldn't on_each_cpu() here be simpler?

> Could be simpler for L1, how can be implemented for L2?

You'd need bitmasks for each cluster to feed to smp_call_function_any().


>>> +        cluster = topology_physical_package_id(cpu);
>>
>> Hmm, I'm not sure cluster==package is guaranteed to be true forever.
>>
>> If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
>> pulling out the DT using something like the arch code's parse_cluster().

> I rely on that it's alpine SoC specific driver.

... and that the topology code hasn't changed to really know what a package is:
https://lore.kernel.org/lkml/20190529211340.17087-2-atish.patra@wdc.com/T/#u

As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(),
would struct cacheinfo's shared_cpu_map be a better fit?

This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's
you've visited. It removes the dependency on package==L2, and insulates you from the
cpu-numbering not being exactly as you expect.


>>> +        if (cluster != last_cluster) {
>>> +            smp_call_function_single(cpu, al_a57_edac_l2merrsr,
>>> +                         edac_dev, 0);
>>> +            last_cluster = cluster;
>>> +        }
>>
>> Here you depend on the CPUs being listed in cluster-order in the DT. I'm fairly sure the
>> numbering is arbitrary: On my Juno 0,3,4,5 are the A53 cluster, and 1,2 are the A57
>> cluster.
>>
>> If 1,3,5 were cluster-a and 2,4,6 were cluster-b, you would end up calling
>> al_a57_edac_l2merrsr() for each cpu. As you don't wait, they could race.
>>
>> If you can get a cpu-mask for each cluster, smp_call_function_any() would to the
>> pick-one-online-cpu work for you.

> Again, I rely on that it's alpine SoC specific driver.
> How can I get cpu-mask for each cluster? from DT?

Its not cluster you want, its the L2. Cacheinfo has this for online CPUs, and you're
already holding the cpus_read_lock().


>>> +static int al_a57_edac_remove(struct platform_device *pdev)
>>> +{
>>> +    struct edac_device_ctl_info *edac_dev = platform_get_drvdata(pdev);
>>> +
>>> +    edac_device_del_device(edac_dev->dev);
>>> +    edac_device_free_ctl_info(edac_dev);
>>
>> Your poll function schedule work on other CPUs and didn't wait, is it possible
>> al_a57_edac_l2merrsr() is still using this memory when you free it?

> This will be okay, after using wait in smp_call_function_single().

Yup.


Thanks,

James

  reply	other threads:[~2019-06-13 17:05 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-30 10:15 [PATCH 0/2] Add support for Amazon's Annapurna Labs EDAC for L1/L2 Hanna Hawa
2019-05-30 10:15 ` [PATCH 1/2] dt-bindings: EDAC: add Amazon Annapurna Labs EDAC binding Hanna Hawa
2019-05-30 11:54   ` Greg KH
2019-05-31  0:35     ` Borislav Petkov
2019-05-30 10:15 ` [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC Hanna Hawa
2019-05-30 11:57   ` Greg KH
2019-05-30 12:52     ` hhhawa
2019-05-30 13:04       ` Joe Perches
2019-05-30 18:19   ` Boris Petkov
2019-05-31  1:15     ` Herrenschmidt, Benjamin
2019-05-31  5:14       ` Borislav Petkov
2019-06-05 15:13         ` James Morse
2019-06-06  7:53         ` Hawa, Hanna
2019-06-06 10:03           ` Borislav Petkov
2019-06-06 10:33           ` James Morse
2019-06-06 11:22             ` Borislav Petkov
2019-06-06 11:37             ` Shenhar, Talel
2019-06-07 15:11               ` James Morse
2019-06-08  0:22                 ` Benjamin Herrenschmidt
2019-06-08  0:16             ` Benjamin Herrenschmidt
2019-06-08  9:05               ` Borislav Petkov
2019-06-11  5:50                 ` Benjamin Herrenschmidt
2019-06-11  7:21                   ` Benjamin Herrenschmidt
2019-06-11 11:56                     ` Borislav Petkov
2019-06-11 22:25                       ` Benjamin Herrenschmidt
2019-06-12  3:48                         ` Borislav Petkov
2019-06-12  8:29                           ` Benjamin Herrenschmidt
2019-06-12 10:42                             ` Borislav Petkov
2019-06-12 23:54                               ` Benjamin Herrenschmidt
2019-06-13  7:44                                 ` Borislav Petkov
2019-06-14 10:53                                 ` Borislav Petkov
2019-06-12 10:42                             ` Mauro Carvalho Chehab
2019-06-12 11:00                               ` Borislav Petkov
2019-06-12 11:42                                 ` Mauro Carvalho Chehab
2019-06-12 11:57                                   ` Benjamin Herrenschmidt
2019-06-12 12:25                                     ` Borislav Petkov
2019-06-12 12:35                                       ` Hawa, Hanna
2019-06-12 15:34                                         ` Borislav Petkov
2019-06-12 23:57                                       ` Benjamin Herrenschmidt
2019-06-12 23:56                                 ` Benjamin Herrenschmidt
2019-06-11  7:29                   ` Hawa, Hanna
2019-06-11 11:59                     ` Borislav Petkov
2019-06-11 11:47                   ` Borislav Petkov
2019-06-03  6:56       ` Hawa, Hanna
2019-06-05 15:16   ` James Morse
2019-06-11 19:56     ` Hawa, Hanna
2019-06-13 17:05       ` James Morse [this message]
2019-06-14 10:49         ` James Morse
2019-06-17 13:00         ` Hawa, Hanna
2019-06-19 17:22           ` James Morse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fdc3b458-96eb-1734-c294-2463f37f2244@arm.com \
    --to=james.morse@arm.com \
    --cc=benh@amazon.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=devicetree@vger.kernel.org \
    --cc=dwmw@amazon.co.uk \
    --cc=gregkh@linuxfoundation.org \
    --cc=hanochu@amazon.com \
    --cc=hhhawa@amazon.com \
    --cc=jonnyc@amazon.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mchehab@kernel.org \
    --cc=nicolas.ferre@microchip.com \
    --cc=paulmck@linux.ibm.com \
    --cc=robh+dt@kernel.org \
    --cc=ronenk@amazon.com \
    --cc=talel@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).