[v2,02/24] EDAC, ghes: Fix grain calculation

Message ID 20190624150758.6695-3-rrichter@marvell.com
State New, archived
Series
  • EDAC, mc, ghes: Fixes and updates to improve memory error reporting

Commit Message

Robert Richter June 24, 2019, 3:08 p.m. UTC
The conversion from the physical address mask to a grain (defined as
granularity in bytes) is broken:

	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);

E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
instead the grain is wrong with the upper bits always set. We also
remove the limitation to the page size as the granularity is unrelated
to the page size used in the system. We fix this with:

	e->grain = ~mem_err->physical_addr_mask + 1;

Note: We need to adapt the grain_bits calculation as e->grain is now a
power of 2 and no longer a bit mask. The formula is now the same as in
edac_mc and can later be unified.

Signed-off-by: Robert Richter <rrichter@marvell.com>
---
 drivers/edac/ghes_edac.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
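
For illustration, the difference between the two formulas can be
reproduced in user space. This is a minimal sketch, not the driver code:
it assumes 4 KiB pages and a 64-bit mask, and EXAMPLE_PAGE_MASK,
old_grain() and new_grain() are hypothetical names introduced here.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; mirrors how the kernel defines PAGE_MASK. */
#define EXAMPLE_PAGE_SIZE 0x1000ULL
#define EXAMPLE_PAGE_MASK (~(EXAMPLE_PAGE_SIZE - 1))

/* Old formula: only the in-page bits survive the AND, then get inverted. */
static uint64_t old_grain(uint64_t physical_addr_mask)
{
	return ~(physical_addr_mask & ~EXAMPLE_PAGE_MASK);
}

/* New formula: invert the mask and add one to get the size in bytes. */
static uint64_t new_grain(uint64_t physical_addr_mask)
{
	return ~physical_addr_mask + 1;
}
```

With a mask of ~0xfff, old_grain() returns a value with all bits set,
while new_grain() returns 0x1000. A mask of 0 makes new_grain() wrap
around to 0, which is the case the WARN_ON_ONCE() fallback in the patch
covers.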

Comments

Borislav Petkov Aug. 9, 2019, 1:15 p.m. UTC | #1
On Mon, Jun 24, 2019 at 03:08:57PM +0000, Robert Richter wrote:
> The conversion from the physical address mask to a grain (defined as
> granularity in bytes) is broken:
> 
> 	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> 
> E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
> instead the grain is wrong with the upper bits always set. We also
> remove the limitation to the page size as the granularity is unrelated
> to the page size used in the system. We fix this with:
> 
> 	e->grain = ~mem_err->physical_addr_mask + 1;
> 
> Note: We need to adapt the grain_bits calculation as e->grain is now a
> power of 2 and no longer a bit mask. The formula is now the same as in
> edac_mc and can later be unified.

Please refrain from using "We", "I", or other personal pronouns in a
commit message and in the code comments below.

From Documentation/process/submitting-patches.rst:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Please fix all your other commit messages for the next submission.

> Signed-off-by: Robert Richter <rrichter@marvell.com>
> ---
>  drivers/edac/ghes_edac.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 7f19f1c672c3..d095d98d6a8d 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -222,6 +222,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	/* Cleans the error report buffer */
>  	memset(e, 0, sizeof (*e));
>  	e->error_count = 1;
> +	e->grain = 1;
>  	strcpy(e->label, "unknown label");
>  	e->msg = pvt->msg;
>  	e->other_detail = pvt->other_detail;
> @@ -317,7 +318,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  
>  	/* Error grain */
>  	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> -		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> +		e->grain = ~mem_err->physical_addr_mask + 1;

This is assuming that ->physical_addr_mask is contiguous but I
don't trust any firmware. I guess we can leave it like that for now
until some "inventive" firmware actually does it.

>  
>  	/* Memory error location, mapped on e->location */
>  	p = e->location;
> @@ -433,8 +434,15 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	if (p > pvt->other_detail)
>  		*(p - 1) = '\0';
>  
> +	/*
> +	 * We expect the hw to report a reasonable grain, fallback to
> +	 * 1 byte granularity otherwise.
> +	 */
> +	if (WARN_ON_ONCE(!e->grain))

Please move that WARN_ON_ONCE in the

	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)

branch above because you're presetting grain to 1 so the warn should be
close to where it could happen, i.e., when coming from the firmware.

Thx.
Robert Richter Aug. 12, 2019, 6:42 a.m. UTC | #2
On 09.08.19 15:15:59, Borislav Petkov wrote:
> On Mon, Jun 24, 2019 at 03:08:57PM +0000, Robert Richter wrote:
> > The conversion from the physical address mask to a grain (defined as
> > granularity in bytes) is broken:
> > 
> > 	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> > 
> > E.g., a physical address mask of ~0xfff should give a grain of 0x1000,
> > instead the grain is wrong with the upper bits always set. We also
> > remove the limitation to the page size as the granularity is unrelated
> > to the page size used in the system. We fix this with:
> > 
> > 	e->grain = ~mem_err->physical_addr_mask + 1;
> > 
> > Note: We need to adapt the grain_bits calculation as e->grain is now a
> > power of 2 and no longer a bit mask. The formula is now the same as in
> > edac_mc and can later be unified.
> 
> Please refrain from using "We", "I", or other personal pronouns in a
> commit message and in the code comments below.
> 
> From Documentation/process/submitting-patches.rst:
> 
>  "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>   instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>   to do frotz", as if you are giving orders to the codebase to change
>   its behaviour."
> 
> Please fix all your other commit messages for the next submission.

Sure, will reword.

I have seen that you actively promote this style guideline. I was not
even aware of it, thanks for the pointer.

> 
> > Signed-off-by: Robert Richter <rrichter@marvell.com>
> > ---
> >  drivers/edac/ghes_edac.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> > index 7f19f1c672c3..d095d98d6a8d 100644
> > --- a/drivers/edac/ghes_edac.c
> > +++ b/drivers/edac/ghes_edac.c
> > @@ -222,6 +222,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >  	/* Cleans the error report buffer */
> >  	memset(e, 0, sizeof (*e));
> >  	e->error_count = 1;
> > +	e->grain = 1;
> >  	strcpy(e->label, "unknown label");
> >  	e->msg = pvt->msg;
> >  	e->other_detail = pvt->other_detail;
> > @@ -317,7 +318,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >  
> >  	/* Error grain */
> >  	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> > -		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
> > +		e->grain = ~mem_err->physical_addr_mask + 1;
> 
> This is assuming that ->physical_addr_mask is contiguous but I
> don't trust any firmware. I guess we can leave it like that for now
> until some "inventive" firmware actually does it.

With the grain_bits calculation the mask is rounded up to the next
power of 2 value. I therefore don't see any issues for non-contiguous
bit masks. I have updated the patch description.

> 
> >  
> >  	/* Memory error location, mapped on e->location */
> >  	p = e->location;
> > @@ -433,8 +434,15 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
> >  	if (p > pvt->other_detail)
> >  		*(p - 1) = '\0';
> >  
> > +	/*
> > +	 * We expect the hw to report a reasonable grain, fallback to
> > +	 * 1 byte granularity otherwise.
> > +	 */
> > +	if (WARN_ON_ONCE(!e->grain))
> 
> Please move that WARN_ON_ONCE in the
> 
> 	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> 
> branch above because you're presetting grain to 1 so the warn should be
> close to where it could happen, i.e., when coming from the firmware.

The reason this is here is because this check will be moved to
edac_raw_mc_handle_error() to unify edac_mc and ghes code (see patch
#4). I understand the warn should be close to its source; on the other
hand, we need the check for all the drivers that set up the grain. Thus,
it cannot be in the driver that is setting up the grain.

Thanks,

-Robert
Borislav Petkov Aug. 12, 2019, 7:32 a.m. UTC | #3
On Mon, Aug 12, 2019 at 06:42:00AM +0000, Robert Richter wrote:
> I have seen that you actively promote this style guideline. I was not
> even aware of it, thanks for the pointer.

It is about time we started writing proper commit messages. How long are
we trying, 20 years...?

> With the grain_bits calculation the mask is rounded up to the next
> power of 2 value.

mask	  = 0xffffffffff00ff00
~mask	  = 0x0000000000ff00ff
~mask + 1 = 0x0000000000ff0100

Your "trick" of adding a 1 to get to the most significant bit simply
doesn't work here. Thus:

"I guess we can leave it like that for now until some "inventive"
firmware actually does it."

> The reason this is here is because this check will be moved to
> edac_raw_mc_handle_error() to unify edac_mc and ghes code (see patch
> #4).

Ok.

Thx.
Robert Richter Aug. 12, 2019, 12:05 p.m. UTC | #4
On 12.08.19 09:32:22, Borislav Petkov wrote:
> On Mon, Aug 12, 2019 at 06:42:00AM +0000, Robert Richter wrote:

> > With the grain_bits calculation the mask is rounded up to the next
> > power of 2 value.
> 
> mask	  = 0xffffffffff00ff00

grain = ~mask + 1

> ~mask	  = 0x0000000000ff00ff
> ~mask + 1 = 0x0000000000ff0100

grain_bits = fls_long(e->grain - 1);
grain_bits = 24

grain = 1 << grain_bits
grain = 0x1000000

So for masks in the range from 0xffffffffff000000 to
0xffffffffff7fffff we have grain_bits set to 24, which corresponds to
a grain of 0x1000000. Looks good to me.
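
The arithmetic above can be checked with a small user-space sketch. It
is not the kernel code: example_fls64() is a hypothetical stand-in for
fls_long() on a 64-bit value, and grain_bits_from_mask() is a
hypothetical helper that combines the steps from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's fls_long() on 64-bit values:
 * returns the 1-based position of the most significant set bit, or 0
 * if no bit is set.
 */
static unsigned int example_fls64(uint64_t x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Mask -> grain -> grain_bits, following the steps in the patch. */
static unsigned int grain_bits_from_mask(uint64_t mask)
{
	uint64_t grain = ~mask + 1;

	if (!grain)		/* same 1-byte fallback as the patch */
		grain = 1;
	return example_fls64(grain - 1);
}
```

For the mask 0xffffffffff00ff00 this yields 24, and the same result
comes out for both ends of the range 0xffffffffff000000 to
0xffffffffff7fffff. Note that this only shows what the rounding does; it
says nothing about whether such a mask is meaningful.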

> 
> Your "trick" of adding a 1 to get to the most significant bit simply
> doesn't work here. Thus:
> 
> "I guess we can leave it like that for now until some "inventive"
> firmware actually does it."

Fine with me.

-Robert
Borislav Petkov Aug. 12, 2019, 12:38 p.m. UTC | #5
On Mon, Aug 12, 2019 at 12:05:25PM +0000, Robert Richter wrote:
> So for masks in the range from 0xffffffffff000000 to
> 0xffffffffff7fffff we have grain_bits set to 24, which corresponds to
> a grain of 0x1000000.

I don't think you're reading what I'm trying to say so let me go into
more detail:

I'm very suspicious about any and all information we get from firmware.
I think it is clear why by now.

If we get an address mask, we better sanity-check that mask. For
example, whether it is contiguous or whether the set bits in it are even
making any sense and so on.

What you're doing is assuming the firmware will give you a sensible mask
and you start working with it without checking it.

For example, if you get a mask of 0xffffffffff00ff00, how do you know
that the grain bits are really 24? Says who? There's a hole in the damn
mask so it could just as well be *anything* *but* an address mask. Hell,
it can be some random garbage.

Do you catch my drift now?

But, since we don't use the grain all too much and don't depend on it
yet, we keep it simple and lazy for now:

> > "I guess we can leave it like that for now until some "inventive"
> > firmware actually does it."

Patch

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 7f19f1c672c3..d095d98d6a8d 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -222,6 +222,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	/* Cleans the error report buffer */
 	memset(e, 0, sizeof (*e));
 	e->error_count = 1;
+	e->grain = 1;
 	strcpy(e->label, "unknown label");
 	e->msg = pvt->msg;
 	e->other_detail = pvt->other_detail;
@@ -317,7 +318,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 
 	/* Error grain */
 	if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
-		e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
+		e->grain = ~mem_err->physical_addr_mask + 1;
 
 	/* Memory error location, mapped on e->location */
 	p = e->location;
@@ -433,8 +434,15 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	if (p > pvt->other_detail)
 		*(p - 1) = '\0';
 
+	/*
+	 * We expect the hw to report a reasonable grain, fallback to
+	 * 1 byte granularity otherwise.
+	 */
+	if (WARN_ON_ONCE(!e->grain))
+		e->grain = 1;
+	grain_bits = fls_long(e->grain - 1);
+
 	/* Generate the trace event */
-	grain_bits = fls_long(e->grain);
 	snprintf(pvt->detail_location, sizeof(pvt->detail_location),
 		 "APEI location: %s %s", e->location, e->other_detail);
 	trace_mc_event(type, e->msg, e->label, e->error_count,