All of lore.kernel.org
 help / color / mirror / Atom feed
From: James Morse <james.morse@arm.com>
To: Robert Richter <rrichter@marvell.com>
Cc: Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver
Date: Tue, 4 Jun 2019 18:15:50 +0100	[thread overview]
Message-ID: <1fac170a-f461-a779-9e82-5b4a0fa2c154@arm.com> (raw)
In-Reply-To: <20190603131005.e23lovwyvii53vzo@rric.localdomain>

Hi Robert,

On 03/06/2019 14:10, Robert Richter wrote:
> On 29.05.19 16:12:38, James Morse wrote:
>> On 29/05/2019 09:44, Robert Richter wrote:
>>> Almost duplicate code, remove it.
>>
>>> Note: there is a difference in the calculation of the grain_bits,
>>> using the edac_mc's version here.
>>
>> But is it the right thing to do?
>>
>> Is this an off-by-one bug being papered over as some cleanup?
>> If so could you post a separate fix that can be picked up for an rc.
>>
>> Do Marvell have firmware that populates this field?
>>
>> ...
>>
>> Unless the argument is no one cares about this...
>>
>> >From ghes_edac_report_mem_error():
>> |	/* Error grain */
>> |	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> |		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
>>
>> Fishy, why would the kernel page-size be relevant here?
> 
> That looked broken to me too, I did not put to much effort in fixing
> the grain yet. So I just took the edac_mc version first in the
> assumption, that one is working.

(Ah, it would have been good to note this in the commit-message)


> It looks like the intention here is to limit the grain to the page
> size.
I'm not convinced that makes sense. If some architecture let you configure the page-size,
(as arm64 does), and your hypervisor had a bigger page-size, then any hardware fault would
be rounded up to hypervisor's page-size.

The kernel's page-size has very little to do with the error, it only matters for when we
go unmapping stuff in memory_failure().


> But right, the calculation is wrong here. I am also going to
> reply to your patch you sent on this.

Thanks!


>> If physical_addr_mask were the same as PAGE_MASK this wouldn't this always give ~0?
>> (masking logic like this always does my head in)
>>
>> /me gives it ago:
>> | {1}[Hardware Error]:   physical_address: 0x00000000deadbeef
>> | {1}[Hardware Error]:   physical_address_mask: 0xffffffffffff0000
>> | {1}[Hardware Error]:   error_type: 6, master abort
>> | EDAC MC0: 1 CE Master abort on unknown label ( page:0xdead offset:0xbeef
>> | grain:-1 syndrome:0x0 - status(0x0000000000000001): reserved)
>>
>> That 'grain:-1' is because the calculated e->grain was an unlikely 0xffffffffffffffff.
>> Patch incoming, if you could test it on your platform that'd be great.
>>
>> I don't think ghes_edac.c wants this '+1'.
> 
> The +1 looks odd to me also for the edac_mc driver, but I need to take
> a closer look here as well as some logs suggest the grain is
> calculated correctly.

My theory on this is that ghes_edac.c is generating a grain like 0x1000, fls() does the
right thing. Other edac drivers are generating a grain like 0xfff to describe the same
size, fls() is now off-by-one, hence the addition.
I don't have a platform where I can trigger any other edac driver to test this though.

The way round this would be to put the grain_bits in struct edac_raw_error_desc so that
ghes_edac.c can calculate it directly.


Thanks,

James

  reply	other threads:[~2019-06-04 17:15 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-29  8:44 [PATCH 00/21] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Robert Richter
2019-05-29  8:44 ` [PATCH 01/21] EDAC, mc: Fix edac_mc_find() in case no device is found Robert Richter
2019-05-29  8:44 ` [PATCH 02/21] EDAC: Fixes to use put_device() after device_add() errors Robert Richter
2019-06-11 17:28   ` Borislav Petkov
2019-06-12 17:17     ` Robert Richter
2019-05-29  8:44 ` [PATCH 03/21] EDAC: Kill EDAC_DIMM_PTR() macro Robert Richter
2019-05-29  8:44 ` [PATCH 04/21] EDAC: Kill EDAC_DIMM_OFF() macro Robert Richter
2019-05-29  8:44 ` [PATCH 05/21] EDAC: Introduce mci_for_each_dimm() iterator Robert Richter
2019-05-29  8:44 ` [PATCH 06/21] EDAC, mc: Cleanup _edac_mc_free() code Robert Richter
2019-05-29  8:44 ` [PATCH 07/21] EDAC, mc: Remove per layer counters Robert Richter
2019-05-29  8:44 ` [PATCH 08/21] EDAC, mc: Rework edac_raw_mc_handle_error() to use struct dimm_info Robert Richter
2019-05-29  8:44 ` [PATCH 09/21] EDAC, ghes: Use standard kernel macros for page calculations Robert Richter
2019-05-29 15:13   ` James Morse
2019-05-29  8:44 ` [PATCH 10/21] EDAC, ghes: Remove pvt->detail_location string Robert Richter
2019-05-29 15:13   ` James Morse
2019-06-12 18:13     ` Robert Richter
2019-05-29  8:44 ` [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver Robert Richter
2019-05-29 15:12   ` James Morse
2019-06-03 13:10     ` Robert Richter
2019-06-04 17:15       ` James Morse [this message]
2019-06-13 22:23         ` Robert Richter
2019-05-29  8:44 ` [PATCH 12/21] EDAC, ghes: Add support for legacy API counters Robert Richter
2019-05-29 15:13   ` James Morse
2019-06-12 18:41     ` Robert Richter
2019-06-19 17:22       ` James Morse
2019-06-20  6:55         ` Robert Richter
2019-06-26  9:33           ` James Morse
2019-06-26 10:27             ` Robert Richter
2019-05-29  8:44 ` [PATCH 13/21] EDAC, ghes: Rework memory hierarchy detection Robert Richter
2019-05-29 15:06   ` James Morse
2019-05-31 13:41     ` Robert Richter
2019-05-29  8:44 ` [PATCH 14/21] EDAC, ghes: Extract numa node information for each dimm Robert Richter
2019-05-29 17:51   ` James Morse
2019-06-13 20:52     ` Robert Richter
2019-05-29  8:44 ` [PATCH 15/21] EDAC, ghes: Moving code around ghes_edac_register() Robert Richter
2019-05-29  8:44 ` [PATCH 16/21] EDAC, ghes: Create one memory controller device per node Robert Richter
2019-05-29  8:44 ` [PATCH 17/21] EDAC, ghes: Fill sysfs with the DMI DIMM label information Robert Richter
2019-05-29  8:44 ` [PATCH 18/21] EDAC, mc: Introduce edac_mc_alloc_by_dimm() for per dimm allocation Robert Richter
2019-05-29  8:44 ` [PATCH 19/21] EDAC, ghes: Identify dimm by node, card, module and handle Robert Richter
2019-05-29  8:44 ` [PATCH 20/21] EDAC, ghes: Enable per-layer reporting based on card/module Robert Richter
2019-05-29  8:44 ` [PATCH 21/21] EDAC, Documentation: Describe CPER module definition and DIMM ranks Robert Richter
2019-05-29 14:54 ` [PATCH 00/21] EDAC, mc, ghes: Fixes and updates to improve memory error reporting Borislav Petkov
2019-05-31 14:48   ` Robert Richter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1fac170a-f461-a779-9e82-5b4a0fa2c154@arm.com \
    --to=james.morse@arm.com \
    --cc=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=rrichter@marvell.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.