linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	bp@suse.de, tony.luck@intel.com, x86@kernel.org,
	Smita.KoralahalliChannabasappa@amd.com
Subject: Re: [PATCH 1/2] x86/MCE/AMD, EDAC/mce_amd: Use AMD NodeId for Family17h+ DRAM Decode
Date: Mon, 17 Aug 2020 13:35:05 -0500	[thread overview]
Message-ID: <20200817183401.GA1304633@yaz-nikka.amd.com> (raw)
In-Reply-To: <20200815084212.GA2444151@gmail.com>

On Sat, Aug 15, 2020 at 10:42:12AM +0200, Ingo Molnar wrote:
> 
> * Yazen Ghannam <Yazen.Ghannam@amd.com> wrote:
> 
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> > 
> > The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and
> > later systems. This function is used in amd64_edac_mod to do
> > system-specific decoding for DRAM ECC errors. The function takes a
> > "NodeId" as a parameter.
> > 
> > In AMD documentation, NodeId is used to identify a physical die in a
> > system. This can be used to identify a node in the AMD_NB code and also
> > it is used with umc_normaddr_to_sysaddr().
> > 
> > However, the input used for decode_dram_ecc() is currently the NUMA node
> > of a logical CPU. In the default configuration, the NUMA node and
> > physical die will be equivalent, so this doesn't have an impact. But the
> > NUMA node configuration can be adjusted with optional memory
> > interleaving schemes. This will cause the NUMA node enumeration to not
> > match the physical die enumeration. The mismatch will cause the address
> > translation function to fail or report incorrect results.
> > 
> > Save the "NodeId" as a percpu value during init in AMD MCE code. Export
> > a function to return the value which can be used from modules like
> > edac_mce_amd.
> > 
> > Fixes: fbe63acf62f5 ("EDAC, mce_amd: Use cpu_to_node() to find the node ID")
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> >  arch/x86/include/asm/mce.h    |  2 ++
> >  arch/x86/kernel/cpu/mce/amd.c | 11 +++++++++++
> >  drivers/edac/mce_amd.c        |  2 +-
> >  3 files changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > index cf503824529c..92527cc9ed06 100644
> > --- a/arch/x86/include/asm/mce.h
> > +++ b/arch/x86/include/asm/mce.h
> > @@ -343,6 +343,8 @@ extern struct smca_bank smca_banks[MAX_NR_BANKS];
> >  extern const char *smca_get_long_name(enum smca_bank_types t);
> >  extern bool amd_mce_is_memory_error(struct mce *m);
> >  
> > +extern u8 amd_cpu_to_node(unsigned int cpu);
> > +
> >  extern int mce_threshold_create_device(unsigned int cpu);
> >  extern int mce_threshold_remove_device(unsigned int cpu);
> >  
> > diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> > index 99be063fcb1b..524edf81e287 100644
> > --- a/arch/x86/kernel/cpu/mce/amd.c
> > +++ b/arch/x86/kernel/cpu/mce/amd.c
> > @@ -202,6 +202,9 @@ static DEFINE_PER_CPU(unsigned int, bank_map);
> >  /* Map of banks that have more than MCA_MISC0 available. */
> >  static DEFINE_PER_CPU(u32, smca_misc_banks_map);
> >  
> > +/* CPUID_Fn8000001E_ECX[NodeId] used to identify a physical node/die. */
> > +static DEFINE_PER_CPU(u8, node_id);
> > +
> >  static void amd_threshold_interrupt(void);
> >  static void amd_deferred_error_interrupt(void);
> >  
> > @@ -233,6 +236,12 @@ static void smca_set_misc_banks_map(unsigned int bank, unsigned int cpu)
> >  
> >  }
> >  
> > +u8 amd_cpu_to_node(unsigned int cpu)
> > +{
> > +	return per_cpu(node_id, cpu);
> > +}
> > +EXPORT_SYMBOL_GPL(amd_cpu_to_node);
> > +
> >  static void smca_configure(unsigned int bank, unsigned int cpu)
> >  {
> >  	unsigned int i, hwid_mcatype;
> > @@ -240,6 +249,8 @@ static void smca_configure(unsigned int bank, unsigned int cpu)
> >  	u32 high, low;
> >  	u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank);
> >  
> > +	this_cpu_write(node_id, cpuid_ecx(0x8000001e) & 0xFF);
> 
> So we already have this magic number used for a similar purpose, in 
> amd_get_topology():
> 
>                 cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
> 
>                 node_id  = ecx & 0xff;
>

Yes, that's right. I did have a patch that tried to leverage the
existing topology variables. But it wasn't working for all targeted
systems. So I thought to have something local to the AMD MCA code in
order to avoid messing with the topology code just for this feature.

> Firstly, could we please at least give 0x8000001e a proper symbolic 
> name, use it in hygon.c too (which AFAIK is derived from AMD anyway), 
> and then use it in these new patches?
> 

Sure, but all places that use a symbolic name for a CPUID leaf define it
locally. Should the same be done here? Or should there be common place
for all the defines like in <asm/cpufeatures.h> or maybe a new header
file?

> Secondly, why not stick node_id into struct cpuinfo_x86, where the MCA 
> code can then use it without having to introduce a new percpu data 
> structure?
> 

I think this would be the simplest approach. I can write it. Also, the
amd_get_nb_id() function could then be replaced with this.

> There's also the underlying assumption that there's only ever going to 
> be 256 nodes, which limitation I'm sure we'll hear about in a couple 
> of years as not being quite enough. ;-)
> 

Yeah, CPU topology seems very fractal-like. :)

> So less hardcoding and more generalizations please.
> 

Will do.

Thanks,
Yazen

  reply	other threads:[~2020-08-17 18:35 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-14 19:14 [PATCH 0/2] AMD MCA Address Translation Updates Yazen Ghannam
2020-08-14 19:14 ` [PATCH 1/2] x86/MCE/AMD, EDAC/mce_amd: Use AMD NodeId for Family17h+ DRAM Decode Yazen Ghannam
2020-08-15  8:42   ` Ingo Molnar
2020-08-17 18:35     ` Yazen Ghannam [this message]
2020-08-14 19:14 ` [PATCH 2/2] x86/MCE/AMD Support new memory interleaving schemes during address translation Yazen Ghannam
2020-08-15  9:13   ` Ingo Molnar
2020-08-18 14:02     ` Yazen Ghannam

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200817183401.GA1304633@yaz-nikka.amd.com \
    --to=yazen.ghannam@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=bp@suse.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).