All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
@ 2009-09-18  2:09 Keith Mannthey
  2009-09-18 14:42 ` Borislav Petkov
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Mannthey @ 2009-09-18  2:09 UTC (permalink / raw)
  To: lkml; +Cc: borislav.petkov, dougthompson

I tested 2.6.31 and the mainline amd64_edac driver.  Once errors were
getting reported I noticed that a good amount of valid looking system
addressed were not being correctly decoded to the csrow level.  I was
only able to correctly decode errors on channel 0 of a given csrow.
Errors on channel 1 were unable to be mapped. 

After some digging I realized that the there was an issue with handling
of Dram Chip Select Masks.  Specifically that amd64_map_to_dcs_mask was
returning incorrect values on my Rev F based system.  See AMD #32559,
4.5.4, starting on pg 90 for a Rev F explanation of the correct mapping
of DRAM CS Base and DRAM CS Mask regs.  This lead to the code below.  I
can provide further explanation if needed. 


I have tested this code on Rev F based system with ecc debug dimms and
fully expect it to work on earlier and later cpus.  I am now able to
correctly decode and map errors to csrows rows on this system. 


Submitted-by:  Keith Mannthey<kmannth@us.ibm.com>
---

diff -urN linux-2.6.31/drivers/edac/amd64_edac.c linux-2.6.31-fixed/drivers/edac/amd64_edac.c
--- linux-2.6.31/drivers/edac/amd64_edac.c	2009-09-09 15:13:59.000000000 -0700
+++ linux-2.6.31-fixed/drivers/edac/amd64_edac.c	2009-09-17 22:32:09.000000000 -0700
@@ -1,6 +1,8 @@
-#include "amd64_edac.h"
+#include <linux/log2.h>
 #include <asm/k8.h>
 
+#include "amd64_edac.h"
+
 static struct edac_pci_ctl_info *amd64_ctl_pci;
 
 static int report_gart_errors;
@@ -132,7 +134,7 @@
 /* Map from a CSROW entry to the mask entry that operates on it */
 static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
 {
-	return csrow >> (pvt->num_dcsm >> 3);
+	return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));
 }
 
 /* return the 'base' address the i'th CS entry of the 'dct' DRAM controller */
diff -urN linux-2.6.31/drivers/edac/amd64_edac.h linux-2.6.31-fixed/drivers/edac/amd64_edac.h
--- linux-2.6.31/drivers/edac/amd64_edac.h	2009-09-17 22:22:18.000000000 -0700
+++ linux-2.6.31-fixed/drivers/edac/amd64_edac.h	2009-09-17 22:48:50.000000000 -0700
@@ -526,7 +526,7 @@
 	/*
 	 * The following fields are set at (load) run time, after CPU revision
 	 * has been determined, since the dct_base and dct_mask registers vary
-	 * based on revision
+	 * based on revision.  num_dcsm is assumed to be a power of 2. 
 	 */
 	u32 dcsb_base;		/* DCSB base bits */
 	u32 dcsm_mask;		/* DCSM mask bits */



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-18  2:09 [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask Keith Mannthey
@ 2009-09-18 14:42 ` Borislav Petkov
  2009-09-18 17:28   ` Keith Mannthey
  0 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2009-09-18 14:42 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: lkml, dougthompson

On Thu, Sep 17, 2009 at 07:09:41PM -0700, Keith Mannthey wrote:
> I tested 2.6.31 and the mainline amd64_edac driver.  Once errors were
> getting reported I noticed that a good amount of valid looking system
> addressed were not being correctly decoded to the csrow level.  I was
> only able to correctly decode errors on channel 0 of a given csrow.
> Errors on channel 1 were unable to be mapped. 
> 
> After some digging I realized that the there was an issue with handling
> of Dram Chip Select Masks.  Specifically that amd64_map_to_dcs_mask was
> returning incorrect values on my Rev F based system.  See AMD #32559,
> 4.5.4, starting on pg 90 for a Rev F explanation of the correct mapping
> of DRAM CS Base and DRAM CS Mask regs.  This lead to the code below.  I
> can provide further explanation if needed. 
> 
> 
> I have tested this code on Rev F based system with ecc debug dimms and
> fully expect it to work on earlier and later cpus.  I am now able to
> correctly decode and map errors to csrows rows on this system. 
> 
> 
> Submitted-by:  Keith Mannthey<kmannth@us.ibm.com>
> ---

This whole DSC[BM] handling is rather long-winded and overengineered. It
is on my to-be-rewritten-and-simplified list.

> 
> diff -urN linux-2.6.31/drivers/edac/amd64_edac.c linux-2.6.31-fixed/drivers/edac/amd64_edac.c
> --- linux-2.6.31/drivers/edac/amd64_edac.c	2009-09-09 15:13:59.000000000 -0700
> +++ linux-2.6.31-fixed/drivers/edac/amd64_edac.c	2009-09-17 22:32:09.000000000 -0700
> @@ -1,6 +1,8 @@
> -#include "amd64_edac.h"
> +#include <linux/log2.h>
>  #include <asm/k8.h>
>  
> +#include "amd64_edac.h"
> +
>  static struct edac_pci_ctl_info *amd64_ctl_pci;
>  
>  static int report_gart_errors;
> @@ -132,7 +134,7 @@
>  /* Map from a CSROW entry to the mask entry that operates on it */
>  static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
>  {
> -	return csrow >> (pvt->num_dcsm >> 3);
> +	return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));

Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
this way you get wrong DCSM offsets for F11h. A dirty fix would be:

if (boot_cpu_data.x86 == 0xf && pvt->ext_model < OPTERON_CPU_REV_E) {
	return csrow;
else
	return csrow >> 1;

The problem is, the csrow thing still goes over 0..7 which is obviously
wrong on F11h but I'll fix that later. Care to redo your patch according
to these and the comments from my previous mail and resend?

By the way, your patches made me look harder at that code region and
I've found some more problems with it which I've fixed. Would you like
to test the whole bunch of fixes on your setup?

Thanks.

-- 
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-18 14:42 ` Borislav Petkov
@ 2009-09-18 17:28   ` Keith Mannthey
  2009-09-19 14:08     ` Borislav Petkov
  2009-09-21 14:55     ` Borislav Petkov
  0 siblings, 2 replies; 7+ messages in thread
From: Keith Mannthey @ 2009-09-18 17:28 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: lkml, dougthompson

On Fri, 2009-09-18 at 16:42 +0200, Borislav Petkov wrote:
> On Thu, Sep 17, 2009 at 07:09:41PM -0700, Keith Mannthey wrote:

<snip>

> >  /* Map from a CSROW entry to the mask entry that operates on it */
> >  static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
> >  {
> > -	return csrow >> (pvt->num_dcsm >> 3);
> > +	return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));
> 
> Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
> this way you get wrong DCSM offsets for F11h. A dirty fix would be:

I think this will still be ok for F11. 


ilog2(2) = 1 

1 + 1 == 2 

8 >> 2 == 2 

csrow >> 2 

This would be ok rev F11 assuming 8 total. 

Am I missing something else? 

> if (boot_cpu_data.x86 == 0xf && pvt->ext_model < OPTERON_CPU_REV_E) {
> 	return csrow;
> else
> 	return csrow >> 1;

Still busted for F11. 

> The problem is, the csrow thing still goes over 0..7 which is obviously
> wrong on F11h but I'll fix that later. Care to redo your patch according
> to these and the comments from my previous mail and resend?

Are there more than 8 csrows any any version (I don't currently have F11
specs).  Maybe should just move to a map rather than a math trick to get
to the right index? 


> By the way, your patches made me look harder at that code region and
> I've found some more problems with it which I've fixed. Would you like
> to test the whole bunch of fixes on your setup?

Yes please send any changes you have.  I have a decent test setup for
live errors. 

Thanks,
  Keith 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-18 17:28   ` Keith Mannthey
@ 2009-09-19 14:08     ` Borislav Petkov
  2009-09-21 14:55     ` Borislav Petkov
  1 sibling, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2009-09-19 14:08 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: lkml, dougthompson

On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
> > this way you get wrong DCSM offsets for F11h. A dirty fix would be:
> 
> I think this will still be ok for F11. 
> 
> 
> ilog2(2) = 1 
> 
> 1 + 1 == 2 
> 
> 8 >> 2 == 2 
> 
> csrow >> 2 
> 
> This would be ok rev F11 assuming 8 total. 
> 
> Am I missing something else?

Yes, F11h has only 4 DCSB and 2 DCSM registers. So the first
two DCSB registers F2x[1,0]40 and F2x[1,0]44 use F2x[1,0]60 as
a mask register and F2x[1,0]48 and F2x[1,0]4C use F2x[1,0]64.
You can look at F11h as a F10h but with only the half of the
DSC[BM] registers present. You can find the F11h BKDG at
http://support.amd.com/us/Processor_TechDocs/41256.pdf and especially
Table 24 on page 115.

So, in that case csrow >> 1 is still valid but csrow going beyond 3 is
out of range that's why it needs to be fixed differently.

> Are there more than 8 csrows any any version (I don't currently have
> F11 specs). Maybe should just move to a map rather than a math trick
> to get to the right index?

I'll think up something on Monday.

> > By the way, your patches made me look harder at that code region and
> > I've found some more problems with it which I've fixed. Would you
> > like to test the whole bunch of fixes on your setup?
>
> Yes please send any changes you have. I have a decent test setup for
> live errors.

Cool, I'll get back to you when I have them ready, thanks.

-- 
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-18 17:28   ` Keith Mannthey
  2009-09-19 14:08     ` Borislav Petkov
@ 2009-09-21 14:55     ` Borislav Petkov
  2009-09-21 23:50       ` Keith Mannthey
  1 sibling, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2009-09-21 14:55 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: lkml, dougthompson

Hi Keith,

On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > By the way, your patches made me look harder at that code region and
> > I've found some more problems with it which I've fixed. Would you like
> > to test the whole bunch of fixes on your setup?
> 
> Yes please send any changes you have.  I have a decent test setup for
> live errors. 

The patch queue is at

git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git

the 'edac-fixes' branch.

Please do test and let me know of any problems you might have.

Thanks a lot.

-- 
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-21 14:55     ` Borislav Petkov
@ 2009-09-21 23:50       ` Keith Mannthey
  2009-09-22  7:14         ` Borislav Petkov
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Mannthey @ 2009-09-21 23:50 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: lkml, dougthompson

On Mon, 2009-09-21 at 16:55 +0200, Borislav Petkov wrote:
> Hi Keith,
> 
> On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > > By the way, your patches made me look harder at that code region and
> > > I've found some more problems with it which I've fixed. Would you like
> > > to test the whole bunch of fixes on your setup?
> > 
> > Yes please send any changes you have.  I have a decent test setup for
> > live errors. 
> 
> The patch queue is at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
> 
> the 'edac-fixes' branch.

Please check you have synced with that git tree.  When I pull the tree
it is 2.6.30-rc5 without the amd64_edac driver.  It has no branch
outside of "master". 


Thanks,
  Keith 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask
  2009-09-21 23:50       ` Keith Mannthey
@ 2009-09-22  7:14         ` Borislav Petkov
  0 siblings, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2009-09-22  7:14 UTC (permalink / raw)
  To: Keith Mannthey; +Cc: lkml, dougthompson

On Mon, Sep 21, 2009 at 04:50:24PM -0700, Keith Mannthey wrote:
> On Mon, 2009-09-21 at 16:55 +0200, Borislav Petkov wrote:
> > Hi Keith,
> > 
> > On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > > > By the way, your patches made me look harder at that code region and
> > > > I've found some more problems with it which I've fixed. Would you like
> > > > to test the whole bunch of fixes on your setup?
> > > 
> > > Yes please send any changes you have.  I have a decent test setup for
> > > live errors. 
> > 
> > The patch queue is at
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
> > 
> > the 'edac-fixes' branch.
> 
> Please check you have synced with that git tree.  When I pull the tree
> it is 2.6.30-rc5 without the amd64_edac driver.  It has no branch
> outside of "master". 

This happens because it is a bare repository without a working tree.
You'll have to add it as a remote repository to an already existing
(with a working tree) one you have locally like so:

(in your repository do:)

git remote add edac git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git

git fetch edac

git checkout -b edac-test edac/edac-fixes

Now you should be on the branch with the patchset.

Let me know how it goes.

Thanks.

-- 
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-09-22  7:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-18  2:09 [Patch] AMD64_EDAC: Fix amd64_map_to_dcs_mask Keith Mannthey
2009-09-18 14:42 ` Borislav Petkov
2009-09-18 17:28   ` Keith Mannthey
2009-09-19 14:08     ` Borislav Petkov
2009-09-21 14:55     ` Borislav Petkov
2009-09-21 23:50       ` Keith Mannthey
2009-09-22  7:14         ` Borislav Petkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.