Linux-EDAC Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 1/1] edac: fsl_ddr_edac: fix expected data message
@ 2020-07-24 11:18 Gregor Herburger
  2020-08-17  9:53 ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Gregor Herburger @ 2020-07-24 11:18 UTC (permalink / raw)
  To: york.sun, bp, mchehab, tony.luck, james.morse, rrichter
  Cc: linux-edac, Gregor Herburger

In some cases a wrong 'Expected Data' is calculated and reported.
When comparing Expected/Captured Data this looks like dual bit errors when
only a single bit error occurred.

On my aarch64 machine it prints something similar to this:
[  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
[  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffff / 0x80000059
[  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59

Fix this by only shift the register where the error occurred.

Signed-off-by: Gregor Herburger <gregor.herburger@ew.tq-group.com>
---
 drivers/edac/fsl_ddr_edac.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/fsl_ddr_edac.c b/drivers/edac/fsl_ddr_edac.c
index 6d8ea226010d..4b6989cf1947 100644
--- a/drivers/edac/fsl_ddr_edac.c
+++ b/drivers/edac/fsl_ddr_edac.c
@@ -343,9 +343,9 @@ static void fsl_mc_check(struct mem_ctl_info *mci)
 
 		fsl_mc_printk(mci, KERN_ERR,
 			"Expected Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
-			cap_high ^ (1 << (bad_data_bit - 32)),
-			cap_low ^ (1 << bad_data_bit),
-			syndrome ^ (1 << bad_ecc_bit));
+			(bad_data_bit > 31) ? cap_high ^ (1 << (bad_data_bit - 32)) : cap_high,
+			(bad_data_bit <= 31) ? cap_low ^ (1 << (bad_data_bit)) : cap_low,
+			(bad_ecc_bit != -1) ? syndrome ^ (1 << (bad_ecc_bit)) : syndrome);
 	}
 
 	fsl_mc_printk(mci, KERN_ERR,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] edac: fsl_ddr_edac: fix expected data message
  2020-07-24 11:18 [PATCH 1/1] edac: fsl_ddr_edac: fix expected data message Gregor Herburger
@ 2020-08-17  9:53 ` Borislav Petkov
  2020-08-27  7:56   ` [PATCH v2 " Gregor Herburger
  0 siblings, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2020-08-17  9:53 UTC (permalink / raw)
  To: Gregor Herburger
  Cc: york.sun, mchehab, tony.luck, james.morse, rrichter, linux-edac

On Fri, Jul 24, 2020 at 01:18:46PM +0200, Gregor Herburger wrote:
> In some cases a wrong 'Expected Data' is calculated and reported.

In some cases? Which cases?

You need to expand that sentence with more details as to what the
problem is because I'm not getting any smarter from it.

> When comparing Expected/Captured Data this looks like dual bit errors when
> only a single bit error occurred.
> 
> On my aarch64 machine it prints something similar to this:
> [  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
> [  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffff / 0x80000059
> [  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59

Is that output before or after your change?

0xffffffef is with bit 4 XORed and cap_high was -1 before, cap_low is -1
too. The expected data syndrome has bit 31 set?!

Yeah, I'm confused. Please explain the issue in greater detail, try
structuring it this way:

Problem is A.

It happens because of B.

Fix it by doing C.

(Potentially do D).

For more detailed info, see
Documentation/process/submitting-patches.rst, Section "2) Describe your
changes".

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/1] edac: fsl_ddr_edac: fix expected data message
  2020-08-17  9:53 ` Borislav Petkov
@ 2020-08-27  7:56   ` Gregor Herburger
  2020-09-03 10:58     ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Gregor Herburger @ 2020-08-27  7:56 UTC (permalink / raw)
  To: york.sun, bp, mchehab, tony.luck, james.morse, rrichter
  Cc: linux-edac, linux-kernel, Gregor Herburger

When a correctable single bit error occurs, the driver calculates the
bad_data_bit respectively the bad_ecc_bit. If there is no error in the
corresponding data, the value becomes -1. With this the expected data
message is calculated.

In the case of an error in the lower 32 bits or no error (-1) the right
side operand of the bit-shift becomes negative which is undefined
behavior.

This can result in wrong and misleading messages like this:
[  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
[  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffff / 0x80000059
[  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59

Fix this by only calculating the expected data where the error occurred.

With the fix the dmesg output looks like this:
[  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
[  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffef / 0x59
[  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59

Signed-off-by: Gregor Herburger <gregor.herburger@ew.tq-group.com>
---
 drivers/edac/fsl_ddr_edac.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/fsl_ddr_edac.c b/drivers/edac/fsl_ddr_edac.c
index 6d8ea226010d..4b6989cf1947 100644
--- a/drivers/edac/fsl_ddr_edac.c
+++ b/drivers/edac/fsl_ddr_edac.c
@@ -343,9 +343,9 @@ static void fsl_mc_check(struct mem_ctl_info *mci)
 
 		fsl_mc_printk(mci, KERN_ERR,
 			"Expected Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
-			cap_high ^ (1 << (bad_data_bit - 32)),
-			cap_low ^ (1 << bad_data_bit),
-			syndrome ^ (1 << bad_ecc_bit));
+			(bad_data_bit > 31) ? cap_high ^ (1 << (bad_data_bit - 32)) : cap_high,
+			(bad_data_bit <= 31) ? cap_low ^ (1 << (bad_data_bit)) : cap_low,
+			(bad_ecc_bit != -1) ? syndrome ^ (1 << (bad_ecc_bit)) : syndrome);
 	}
 
 	fsl_mc_printk(mci, KERN_ERR,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/1] edac: fsl_ddr_edac: fix expected data message
  2020-08-27  7:56   ` [PATCH v2 " Gregor Herburger
@ 2020-09-03 10:58     ` Borislav Petkov
  0 siblings, 0 replies; 6+ messages in thread
From: Borislav Petkov @ 2020-09-03 10:58 UTC (permalink / raw)
  To: Gregor Herburger
  Cc: york.sun, mchehab, tony.luck, james.morse, rrichter, linux-edac,
	linux-kernel

On Thu, Aug 27, 2020 at 09:56:00AM +0200, Gregor Herburger wrote:
> When a correctable single bit error occurs, the driver calculates the
> bad_data_bit respectively the bad_ecc_bit. If there is no error in the
> corresponding data, the value becomes -1. With this the expected data
> message is calculated.
> 
> In the case of an error in the lower 32 bits or no error (-1) the right
> side operand of the bit-shift becomes negative which is undefined
> behavior.
> 
> This can result in wrong and misleading messages like this:
> [  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
> [  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffff / 0x80000059
> [  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59
> 
> Fix this by only calculating the expected data where the error occurred.
> 
> With the fix the dmesg output looks like this:
> [  311.103794] EDAC FSL_DDR MC0: Faulty Data bit: 36
> [  311.108490] EDAC FSL_DDR MC0: Expected Data / ECC:   0xffffffef_ffffffef / 0x59
> [  311.116135] EDAC FSL_DDR MC0: Captured Data / ECC:   0xffffffff_ffffffef / 0x59
> 
> Signed-off-by: Gregor Herburger <gregor.herburger@ew.tq-group.com>
> ---
>  drivers/edac/fsl_ddr_edac.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/edac/fsl_ddr_edac.c b/drivers/edac/fsl_ddr_edac.c
> index 6d8ea226010d..4b6989cf1947 100644
> --- a/drivers/edac/fsl_ddr_edac.c
> +++ b/drivers/edac/fsl_ddr_edac.c
> @@ -343,9 +343,9 @@ static void fsl_mc_check(struct mem_ctl_info *mci)
>  
>  		fsl_mc_printk(mci, KERN_ERR,
>  			"Expected Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
> -			cap_high ^ (1 << (bad_data_bit - 32)),
> -			cap_low ^ (1 << bad_data_bit),
> -			syndrome ^ (1 << bad_ecc_bit));
> +			(bad_data_bit > 31) ? cap_high ^ (1 << (bad_data_bit - 32)) : cap_high,
> +			(bad_data_bit <= 31) ? cap_low ^ (1 << (bad_data_bit)) : cap_low,

But if bad_data_bit is -1, this check above will hit and you'd still
shift by -1, IINM.

How about you fix it properly, clean it up and make it more readable in
the process (pasting the code directly instead of a diff because a diff
is less readable):

        if ((err_detect & DDR_EDE_SBE) && (bus_width == 64)) {
                sbe_ecc_decode(cap_high, cap_low, syndrome,
                                &bad_data_bit, &bad_ecc_bit);

                if (bad_data_bit != -1) {
                        if (bad_data_bit > 31)
                                cap_high ^= 1 << (bad_data_bit - 32);
                        else
                                cap_low  ^= 1 << bad_data_bit;

                        fsl_mc_printk(mci, KERN_ERR, "Faulty Data bit: %d\n", bad_data_bit);
                        fsl_mc_printk(mci, KERN_ERR, "Expected Data: %#8.8x_%08x\n",
                                      cap_high, cap_low);
                }

                if (bad_ecc_bit != -1) {
                        fsl_mc_printk(mci, KERN_ERR, "Faulty ECC bit: %d\n", bad_ecc_bit);
                        fsl_mc_printk(mci, KERN_ERR, "Expected ECC: %#2.2x\n",
                                      syndrome ^ (1 << bad_ecc_bit));
                }
        }

This way you print only when the respective faulty bits have been
properly found and not print anything otherwise.

Hmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/1] edac: fsl_ddr_edac: fix expected data message
  2020-09-04 13:32   ` Gregor Herburger
@ 2020-09-08 19:24     ` Borislav Petkov
  0 siblings, 0 replies; 6+ messages in thread
From: Borislav Petkov @ 2020-09-08 19:24 UTC (permalink / raw)
  To: Gregor Herburger
  Cc: york.sun, mchehab, tony.luck, james.morse, rrichter, linux-edac,
	linux-kernel

On Fri, Sep 04, 2020 at 03:32:58PM +0200, Gregor Herburger wrote:
> That shouldn't happen. The whole if-block is only executed when a single 
> bit correctable error has occured (DDR_EDE_SBE). So we always should have
> bad_data_bit or bad_ecc_bit (exclusively).

Ooh, that sbe_ecc_decode() function would give you either the data bit
- if that one is in error - and if not the data bit, then the ECC bit.
Aha.

Ok, so what the driver should do, IMO, is this:

	if (bad_data_bit != -1) {
		...

		fsl_mc_printk("Single-bit data error, ... ", bad_data_bit);
		fsl_mc_printk("Expected Data/Captured Data, ... ", exp_high, exp_low, cap_high, cap_low);
	}

	if (bad_ecc_bit != -1) {
		...

		fsl_mc_printk("Single-bit ECC error, ... ", bad_ecc_bit);
		fsl_mc_printk("Expected ECC/Captured ECC, ... ", exp_syndrome, syndrome);
	}

This way you only print either the data or the ECC value which was in
error but not both.

Makes sense?

> Also i just noticed in the kernel log is no hint that this is an
> single bit error. Maybe we should add this too?

Yap, see above.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/1] edac: fsl_ddr_edac: fix expected data message
  2020-09-04  9:17 ` Borislav Petkov
@ 2020-09-04 13:32   ` Gregor Herburger
  2020-09-08 19:24     ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Gregor Herburger @ 2020-09-04 13:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: york.sun, mchehab, tony.luck, james.morse, rrichter, linux-edac,
	linux-kernel, gregor.herburger

On Fri, Sep 04, 2020 at 11:17:18AM +0200, Borislav Petkov wrote:
> Your mail client broke threading...
> 
Indeed. Guess I have to change the mail client. Sorry for that.
> On Fri, Sep 04, 2020 at 06:52:24AM +0000, Gregor Herburger wrote:
> 
> > The cap_low, cap_high and syndrome are used in the printk following the if-Block.
> > This will make expected data / captured data look the same.
> 
> Right.
> 
> > @@ -334,18 +337,32 @@ static void fsl_mc_check(struct mem_ctl_info *mci)
> >                 sbe_ecc_decode(cap_high, cap_low, syndrome,
> >                                 &bad_data_bit, &bad_ecc_bit);
> > 
> > +               exp_high = cap_high;
> > +               exp_low = cap_low;
> > +               exp_syndrome = syndrome;
> > +
> >                 if (bad_data_bit != -1)
> > +               {
> 
> Opening brace is on the same line for if-statements.
> 
> >                         fsl_mc_printk(mci, KERN_ERR,
> >                                 "Faulty Data bit: %d\n", bad_data_bit);
> > +
> > +                       if (bad_data_bit < 32)
> > +                               exp_low = cap_low ^ (1 << bad_data_bit);
> > +                       else
> > +                               exp_high = cap_high ^ (1 << (bad_data_bit - 32));
> > +               }
> > +
> >                 if (bad_ecc_bit != -1)
> > +               {
> 
> Ditto.
> 
> >                         fsl_mc_printk(mci, KERN_ERR,
> >                                 "Faulty ECC bit: %d\n", bad_ecc_bit);
> > 
> > +                       exp_syndrome = syndrome ^ (1 << bad_ecc_bit);
> > +               }
> > +
> >                 fsl_mc_printk(mci, KERN_ERR,
> >                         "Expected Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
> > -                       cap_high ^ (1 << (bad_data_bit - 32)),
> > -                       cap_low ^ (1 << bad_data_bit),
> > -                       syndrome ^ (1 << bad_ecc_bit));
> > +                       exp_high, exp_low, exp_syndrome);
> >         }
> > 
> >           fsl_mc_printk(mci, KERN_ERR,
> >                           "Captured Data / ECC:\t%#8.8x_%08x / %#2.2x\n",
> >                           cap_high, cap_low, syndrome);
> > 
> > How about something like this?
> 
> My only concern here is that you'll be printing "Expected Data ..."
> unconditionally even if either or both - bad_data_bit and bad_ecc_bit
> - are -1.
That shouldn't happen. The whole if-block is only executed when a single 
bit correctable error has occured (DDR_EDE_SBE). So we always should have
bad_data_bit or bad_ecc_bit (exclusively).

> 
> If the driver cannot decode the data and/or ECC syndrome bits, then it
> should say so - not dump expected data and claim that it is a valid
> information.
> 
Ok, that is reaonable. But that shouldn't that go into sbe_ecc_decode()?.
Currently sbe_ecc_decude() returns on the first error it finds. So we would
have to rework this function.

> So maybe in addition to the above:
> 
> 	if (bad_data_bit != -1) {
> 		...
> 	} else {
> 		fsl_mc_printk(..., "Unable to decode the Faulty Data bit");
> 	}
> 
> and the same for the ECC bit.
> 
I suggest adding such an message to sbe_ecc_decode(). Also to add an
return 0 on success and to check that before printing infos about single
bit errors.

> And then print only the expected data for the bit which sbe_ecc_decode()
> found correctly and not say anything otherwise.
> 
Also i just noticed in the kernel log is no hint that this is an
single bit error. Maybe we should add this too?

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-24 11:18 [PATCH 1/1] edac: fsl_ddr_edac: fix expected data message Gregor Herburger
2020-08-17  9:53 ` Borislav Petkov
2020-08-27  7:56   ` [PATCH v2 " Gregor Herburger
2020-09-03 10:58     ` Borislav Petkov
2020-09-04  6:52 (EXT) " Gregor Herburger
2020-09-04  9:17 ` Borislav Petkov
2020-09-04 13:32   ` Gregor Herburger
2020-09-08 19:24     ` Borislav Petkov

Linux-EDAC Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-edac/0 linux-edac/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-edac linux-edac/ https://lore.kernel.org/linux-edac \
		linux-edac@vger.kernel.org
	public-inbox-index linux-edac

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-edac


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git