From: Borislav Petkov <bp@alien8.de>
To: Naveen Krishna Chatradhi <nchatrad@amd.com>
Cc: linux-edac@vger.kernel.org, x86@kernel.org,
linux-kernel@vger.kernel.org, mingo@redhat.com,
mchehab@kernel.org, yazen.ghannam@amd.com,
Muralidhara M K <muralimk@amd.com>
Subject: Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes
Date: Thu, 11 Nov 2021 14:12:12 +0100 [thread overview]
Message-ID: <YY0WrKjnQ20IjrhB@zn.tnic> (raw)
In-Reply-To: <20211028130106.15701-6-nchatrad@amd.com>
On Thu, Oct 28, 2021 at 06:31:06PM +0530, Naveen Krishna Chatradhi wrote:
> On newer heterogeneous systems with AMD CPUs, the data fabrics of the GPUs
> are connected directly via custom links.
>
> One such system, where Aldebaran GPU nodes are connected to the
> Family 19h, model 30h family of CPU nodes, the Aldebaran GPUs can report
> memory errors via SMCA banks.
>
> Aldebaran GPU support was added to DRM framework
> https://lists.freedesktop.org/archives/amd-gfx/2021-February/059694.html
>
> The GPU nodes comes with HBM2 memory in-built, ECC support is enabled by
> default and the UMCs on GPU node are different from the UMCs on CPU nodes.
>
> GPU specific ops routines are defined to extend the amd64_edac
> module to enumerate HBM memory leveraging the existing edac and the
> amd64 specific data structures.
>
> Note: The UMC Phys on GPU nodes are enumerated as csrows and the UMC
> channels connected to HBM banks are enumerated as ranks.
For all your future commit messages, from
Documentation/process/submitting-patches.rst:
"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour."
Also, do not talk about what your patch does - that should hopefully be
visible in the diff itself. Rather, talk about *why* you're doing what
you're doing.
>
> Cc: Yazen Ghannam <yazen.ghannam@amd.com>
> Co-developed-by: Muralidhara M K <muralimk@amd.com>
> Signed-off-by: Muralidhara M K <muralimk@amd.com>
> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
> Link: https://lkml.kernel.org/r/20210823185437.94417-4-nchatrad@amd.com
> ---
> Changes since v5:
> Removed else condition in per_family_init for 19h family
>
> Changes since v4:
> Split "f17_addr_mask_to_cs_size" instead as did in 3rd patch earlier
>
> Changes since v3:
> 1. Bifurcated the GPU code from v2
>
> Changes since v2:
> 1. Restored line deletions and handled minor comments
> 2. Modified commit message and some of the function comments
> 3. variable df_inst_id is introduced instead of umc_num
>
> Changes since v1:
> 1. Modifed the commit message
> 2. Change the edac_cap
> 3. kept sizes of both cpu and noncpu together
> 4. return success if the !F3 condition true and remove unnecessary validation
>
> drivers/edac/amd64_edac.c | 298 +++++++++++++++++++++++++++++++++-----
> drivers/edac/amd64_edac.h | 27 ++++
> 2 files changed, 292 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 7953ffe9d547..b404fa5b03ce 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -1121,6 +1121,20 @@ static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
> }
> }
>
> +static void debug_display_dimm_sizes_gpu(struct amd64_pvt *pvt, u8 ctrl)
> +{
> + int size, cs = 0, cs_mode;
> +
> + edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
> +
> + cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
> +
> + for_each_chip_select(cs, ctrl, pvt) {
> + size = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs);
> + amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
> + }
> +}
> +
> static void __dump_misc_regs_df(struct amd64_pvt *pvt)
> {
> struct amd64_umc *umc;
> @@ -1165,6 +1179,27 @@ static void __dump_misc_regs_df(struct amd64_pvt *pvt)
> pvt->dhar, dhar_base(pvt));
> }
>
> +static void __dump_misc_regs_gpu(struct amd64_pvt *pvt)
The function pointer this gets assigned to is called *get*_misc_regs.
But this function is is doing *dump*.
When I see __dump_misc_regs_gpu() I'd expect this function to be called
by a higher level dump_misc_regs() as the "__" denotes layering usually.
There is, in fact, dump_misc_regs() but that one calls
->get_misc_regs().
So this all needs fixing - right now I see a mess.
Also, there's <function_name>_gpu and gpu_<function_name> with the "gpu"
being either prefix or suffix. You need to fix the current ones to be
only prefixes - in a pre-patch - and then add yours as prefixes only too.
And in talking about pre-patches - this one is doing a bunch of things
at once and needs splitting. You do the preparatory work like carving
out common functionality and other refactoring in pre-patches, and then
you add the new functionality ontop.
Also, I don't like ->is_gpu one bit, it being sprinkled like that around
the code. This says that the per-family attrs and ops assignment is
lacking.
> @@ -2982,7 +3132,17 @@ static void decode_umc_error(int node_id, struct mce *m)
>
> pvt->ops->get_umc_err_info(m, &err);
>
> - if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
> + /*
> + * GPU node has #phys[X] which has #channels[Y] each.
> + * On GPUs, df_inst_id = [X] * num_ch_per_phy + [Y].
> + * On CPUs, "Channel"="UMC Number"="DF Instance ID".
> + */
This comment doesn't look like it is destined for human beings to read
but maybe to be parsed by programs?
> + if (pvt->is_gpu)
> + df_inst_id = (err.csrow * pvt->channel_count / mci->nr_csrows) + err.channel;
I'm not sure we want to log ECCs from the GPUs: what is going to happen
to them, how does the further error recovery look like?
Also, EDAC sysfs structure currently has only DIMMs, below's from my
box.
I don't see how that structure fits the GPUs so here's the $10^6
question: why does EDAC need to know about GPUs?
What is the strategy here?
Your 0th message talks about the "what" but what gets added is not
important - the "why" is.
$ grep -r . /sys/devices/system/edac/mc/ 2>/dev/null
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
/sys/devices/system/edac/mc/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/power/control:auto
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/control:auto
/sys/devices/system/edac/mc/mc0/rank7/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank7/size:8192
/sys/devices/system/edac/mc/mc0/rank7/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_label:mc#0csrow#3channel#1
/sys/devices/system/edac/mc/mc0/rank7/dimm_location:csrow 3 channel 1
/sys/devices/system/edac/mc/mc0/rank7/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/topmem:0x00000000e0000000
/sys/devices/system/edac/mc/mc0/mc_name:F17h
/sys/devices/system/edac/mc/mc0/rank5/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/control:auto
/sys/devices/system/edac/mc/mc0/rank5/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank5/size:8192
/sys/devices/system/edac/mc/mc0/rank5/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_label:mc#0csrow#2channel#1
/sys/devices/system/edac/mc/mc0/rank5/dimm_location:csrow 2 channel 1
/sys/devices/system/edac/mc/mc0/rank5/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/dram_hole:0 0 0
/sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
next prev parent reply other threads:[~2021-11-11 13:12 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-28 13:01 [PATCH v6 0/5] x86/edac/amd64: Add heterogeneous node support Naveen Krishna Chatradhi
2021-10-28 13:01 ` [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-11-01 17:28 ` Borislav Petkov
2021-11-02 18:03 ` Borislav Petkov
2021-11-04 13:18 ` Chatradhi, Naveen Krishna
2021-11-08 13:34 ` Borislav Petkov
2021-11-08 16:53 ` Chatradhi, Naveen Krishna
2021-11-08 19:03 ` Borislav Petkov
2021-11-09 11:30 ` Chatradhi, Naveen Krishna
2021-11-09 20:41 ` Borislav Petkov
2021-11-04 13:21 ` Chatradhi, Naveen Krishna
2021-10-28 13:01 ` [PATCH v6 2/5] EDAC/mce_amd: Extract node id from MCA_IPID Naveen Krishna Chatradhi
2021-11-08 13:37 ` Borislav Petkov
2021-10-28 13:01 ` [PATCH v6 3/5] EDAC/amd64: Extend family ops functions Naveen Krishna Chatradhi
2021-11-10 17:45 ` Borislav Petkov
2021-11-11 16:23 ` Chatradhi, Naveen Krishna
2021-11-11 18:05 ` Borislav Petkov
2021-11-12 20:59 ` Yazen Ghannam
2021-11-13 11:58 ` Borislav Petkov
2021-10-28 13:01 ` [PATCH v6 4/5] EDAC/amd64: Move struct fam_type into amd64_pvt structure Naveen Krishna Chatradhi
2021-11-11 12:39 ` Borislav Petkov
2021-11-11 16:26 ` Chatradhi, Naveen Krishna
2021-10-28 13:01 ` [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes Naveen Krishna Chatradhi
2021-11-11 13:12 ` Borislav Petkov [this message]
2021-11-15 15:24 ` Chatradhi, Naveen Krishna
2021-11-15 16:04 ` Borislav Petkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YY0WrKjnQ20IjrhB@zn.tnic \
--to=bp@alien8.de \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mchehab@kernel.org \
--cc=mingo@redhat.com \
--cc=muralimk@amd.com \
--cc=nchatrad@amd.com \
--cc=x86@kernel.org \
--cc=yazen.ghannam@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).