From: "Chatradhi, Naveen Krishna" <nchatrad@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-edac@vger.kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, mchehab@kernel.org,
	Yazen.Ghannam@amd.com, Muralidhara M K <muralimk@amd.com>
Subject: Re: [PATCH v3 1/3] x86/amd_nb: Add support for northbridges on Aldebaran
Date: Mon, 11 Oct 2021 19:56:34 +0530
Message-ID: <dcc7c537-0370-d190-9ca7-ce60fa29d68c@amd.com>
In-Reply-To: <YSYeo6S2OSZbBpb4@zn.tnic>

Hi Boris,

Apologies for the late reply. We have been reworking the code based on
the conversation between you and Yazen.

I would like to answer some of your questions before submitting the next 
version of the patch-set.

On 8/25/2021 4:12 PM, Borislav Petkov wrote:
>
> On Tue, Aug 24, 2021 at 12:24:35AM +0530, Naveen Krishna Chatradhi wrote:
>> From: Muralidhara M K <muralimk@amd.com>
>>
>> On newer systems the CPUs manage MCA errors reported from the GPUs.
>> Enumerate the GPU nodes with the AMD NB framework to support EDAC.
>>
>> This patch adds necessary code to manage the Aldebaran nodes along with
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
>
> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.
Sure, thanks.
>
> Also, what are the "Aldebaran nodes"?
>
> Something on a star which is 65 light years away?

Aldebaran is an AMD GPU codename. The support code was submitted as
"[PATCH 000/159] Aldebaran support"
<https://lists.freedesktop.org/archives/amd-gfx/2021-February/059694.html>
and is part of the DRM framework.

>
>> the CPU nodes.
>>
>> The GPU nodes are enumerated in sequential order based on the
>> PCI hierarchy, and the first GPU node is assumed to have an "AMD Node
>> ID" value of 8 (the second GPU node has 9, etc.).
> What does that mean? The GPU nodes are simply numerically after the CPU
> nodes or how am I to understand this nomenclature?
>
>> Each Aldebaran GPU
>> package has 2 Data Fabrics, which are enumerated as 2 nodes.
>> With this implementation detail, the Data Fabric on the GPU nodes can be
>> accessed the same way as the Data Fabric on CPU nodes.
>>
>> Signed-off-by: Muralidhara M K <muralimk@amd.com>
>> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> ---
>> Changes since v2: Added Reviewed-by Yazen Ghannam
>>
>>   arch/x86/include/asm/amd_nb.h | 10 ++++++
>>   arch/x86/kernel/amd_nb.c      | 63 ++++++++++++++++++++++++++++++++---
>>   include/linux/pci_ids.h       |  1 +
>>   3 files changed, 69 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h
>> index 455066a06f60..09905f6c7218 100644
>> --- a/arch/x86/include/asm/amd_nb.h
>> +++ b/arch/x86/include/asm/amd_nb.h
>> @@ -80,6 +80,16 @@ struct amd_northbridge_info {
>>
>>   #ifdef CONFIG_AMD_NB
>>
>> +/*
>> + * On newer heterogeneous systems the data fabrics of the CPUs and GPUs
>> + * are connected directly via a custom links, like is done with
> s/ a //
>
>> + * 2 socket CPU systems and also within a socket for Multi-chip Module
>> + * (MCM) CPUs like Naples.
>> + * The first GPU node(non cpu) is assumed to have an "AMD Node ID" value
> In all your text:
>
> s/cpu/CPU/g
>
>> + * of 8 (the second GPU node has 9, etc.).
>> + */
>> +#define NONCPU_NODE_INDEX    8
> Why is this assumed? Can it instead be read from the hardware somewhere?
> Or there simply won't be more than 8 CPU nodes anyway? Not at least in
> the near future?
>
> I'd prefer stuff to be read out directly from the hardware so that when
> the hardware changes, the code just works instead of doing assumptions
> which get invalidated later.
>
>> +
>>   u16 amd_nb_num(void);
>>   bool amd_nb_has_feature(unsigned int feature);
>>   struct amd_northbridge *node_to_amd_nb(int node);
>> diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
>> index 23dda362dc0f..6ad5664a18aa 100644
>> --- a/arch/x86/kernel/amd_nb.c
>> +++ b/arch/x86/kernel/amd_nb.c
>> @@ -26,6 +26,8 @@
>>   #define PCI_DEVICE_ID_AMD_17H_M70H_DF_F4 0x1444
>>   #define PCI_DEVICE_ID_AMD_19H_DF_F4  0x1654
>>   #define PCI_DEVICE_ID_AMD_19H_M50H_DF_F4 0x166e
>> +#define PCI_DEVICE_ID_AMD_ALDEBARAN_ROOT     0x14bb
>> +#define PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F4    0x14d4
> You see how those defines are aligned vertically, right?
>
> If your new defines don't fit, you realign them all vertically too - you
> don't just slap them there.
>
> And if it wasn't clear above, that Aldebaran GPU chip name means
> something only to AMD folks. If this is correct
>
> https://en.wikipedia.org/wiki/Video_Core_Next
>
> then that Aldebaran thing is VCN 2.6 although this is only the video
> encoding stuff and GPU I guess is more than that.
>
> IOW, what I'm trying to say is, just like we name the CPUs using their
> families, you should name the GPUs nomenclature with GPU families (I
> guess there's stuff like that) and not use the marketing crap.

The Aldebaran GPU might be a later variant of gfx9; these GPUs are
connected to the CPU sockets via custom xGMI links.

I could not find any family number associated with the GPUs. The DRM
driver code uses the following define and does not expose the value to
other frameworks in Linux:

+#define CHIP_ALDEBARAN 25

in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm

> If you need an example, here's how we did it for the Intel marketing
> pile of bullsh*t:
>
> arch/x86/include/asm/intel-family.h
>
> Please provide a similar way of referring to the GPU chips.
>
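Understood, will do. As a starting point, something along the lines of
intel-family.h could be added for the GPU chips. Below is only a rough
sketch; the header and the AMD_GPU_* macro names and values are
hypothetical, and the only value taken from existing code is the
Aldebaran chip id (25) from the amdkfd source referenced above:

/*
 * Hypothetical sketch of an amd-gpu-family.h header, mirroring
 * arch/x86/include/asm/intel-family.h. The file name and macro
 * names are assumptions, not existing kernel symbols.
 */
#ifndef _ASM_X86_AMD_GPU_FAMILY_H
#define _ASM_X86_AMD_GPU_FAMILY_H

#define AMD_GPU_FAMILY_GFX9		9
#define AMD_GPU_CHIP_ALDEBARAN		25	/* CHIP_ALDEBARAN in amdkfd */

#endif /* _ASM_X86_AMD_GPU_FAMILY_H */

EDAC and amd_nb code could then refer to AMD_GPU_CHIP_ALDEBARAN rather
than the marketing name.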
>>   /* Protect the PCI config register pairs used for SMN and DF indirect access. */
>>   static DEFINE_MUTEX(smn_mutex);
>> @@ -94,6 +96,21 @@ static const struct pci_device_id hygon_nb_link_ids[] = {
>>        {}
>>   };
>>
>> +static const struct pci_device_id amd_noncpu_root_ids[] = {
> Why is that "noncpu" thing everywhere? Is this thing going to be
> anything else besides a GPU?
>
> If not, you can simply call it
>
>          amd_gpu_root_ids
>
> to mean *exactly* what they are. PCI IDs on the GPU.
>
>> +     { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_ROOT) },
>> +     {}
>> +};
>> +
>> +static const struct pci_device_id amd_noncpu_nb_misc_ids[] = {
>> +     { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F3) },
>> +     {}
>> +};
>> +
>> +static const struct pci_device_id amd_noncpu_nb_link_ids[] = {
>> +     { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F4) },
>> +     {}
>> +};
>> +
>>   const struct amd_nb_bus_dev_range amd_nb_bus_dev_ranges[] __initconst = {
>>        { 0x00, 0x18, 0x20 },
>>        { 0xff, 0x00, 0x20 },
>> @@ -230,11 +247,16 @@ int amd_cache_northbridges(void)
>>        const struct pci_device_id *misc_ids = amd_nb_misc_ids;
>>        const struct pci_device_id *link_ids = amd_nb_link_ids;
>>        const struct pci_device_id *root_ids = amd_root_ids;
>> +
>> +     const struct pci_device_id *noncpu_misc_ids = amd_noncpu_nb_misc_ids;
>> +     const struct pci_device_id *noncpu_link_ids = amd_noncpu_nb_link_ids;
>> +     const struct pci_device_id *noncpu_root_ids = amd_noncpu_root_ids;
>> +
>>        struct pci_dev *root, *misc, *link;
>>        struct amd_northbridge *nb;
>>        u16 roots_per_misc = 0;
>> -     u16 misc_count = 0;
>> -     u16 root_count = 0;
>> +     u16 misc_count = 0, misc_count_noncpu = 0;
>> +     u16 root_count = 0, root_count_noncpu = 0;
>>        u16 i, j;
>>
>>        if (amd_northbridges.num)
>> @@ -253,10 +275,16 @@ int amd_cache_northbridges(void)
>>        if (!misc_count)
>>                return -ENODEV;
>>
>> +     while ((misc = next_northbridge(misc, noncpu_misc_ids)) != NULL)
>> +             misc_count_noncpu++;
>> +
>>        root = NULL;
>>        while ((root = next_northbridge(root, root_ids)) != NULL)
>>                root_count++;
>>
>> +     while ((root = next_northbridge(root, noncpu_root_ids)) != NULL)
>> +             root_count_noncpu++;
>> +
>>        if (root_count) {
>>                roots_per_misc = root_count / misc_count;
>>
>> @@ -270,15 +298,28 @@ int amd_cache_northbridges(void)
>>                }
>>        }
>>
>> -     nb = kcalloc(misc_count, sizeof(struct amd_northbridge), GFP_KERNEL);
>> +     if (misc_count_noncpu) {
>> +             /*
>> +              * The first non-CPU Node ID starts at 8 even if there are fewer
>> +              * than 8 CPU nodes. To maintain the AMD Node ID to Linux amd_nb
>> +              * indexing scheme, allocate the number of GPU nodes plus 8.
>> +              * Some allocated amd_northbridge structures will go unused when
>> +              * the number of CPU nodes is less than 8, but this tradeoff is to
>> +              * keep things relatively simple.
> Why simple?
>
> What's wrong with having
>
> [node IDs][GPU node IDs]
>
> i.e., the usual nodes come first and the GPU ones after it.
>
> You enumerate everything properly here so you can control what goes
> where. Which means, you don't need this NONCPU_NODE_INDEX non-sense at
> all.
>
> Hmmm?
>
>> +              */
>> +             amd_northbridges.num = NONCPU_NODE_INDEX + misc_count_noncpu;
>> +     } else {
>> +             amd_northbridges.num = misc_count;
>> +     }
>> +
>> +     nb = kcalloc(amd_northbridges.num, sizeof(struct amd_northbridge), GFP_KERNEL);
>>        if (!nb)
>>                return -ENOMEM;
>>
>>        amd_northbridges.nb = nb;
>> -     amd_northbridges.num = misc_count;
>>
>>        link = misc = root = NULL;
>> -     for (i = 0; i < amd_northbridges.num; i++) {
>> +     for (i = 0; i < misc_count; i++) {
>>                node_to_amd_nb(i)->root = root =
>>                        next_northbridge(root, root_ids);
>>                node_to_amd_nb(i)->misc = misc =
>> @@ -299,6 +340,18 @@ int amd_cache_northbridges(void)
>>                        root = next_northbridge(root, root_ids);
>>        }
>>
>> +     if (misc_count_noncpu) {
>> +             link = misc = root = NULL;
>> +             for (i = NONCPU_NODE_INDEX; i < NONCPU_NODE_INDEX + misc_count_noncpu; i++) {
> So this is not keeping things relatively simple - this is making you
> jump to the GPU nodes to prepare them too which is making them special.
>
>> +                     node_to_amd_nb(i)->root = root =
>> +                             next_northbridge(root, noncpu_root_ids);
>> +                     node_to_amd_nb(i)->misc = misc =
>> +                             next_northbridge(misc, noncpu_misc_ids);
>> +                     node_to_amd_nb(i)->link = link =
>> +                             next_northbridge(link, noncpu_link_ids);
> And seeing how you put those pointers in ->root, ->misc and ->link,
> you can just as well drop those noncpu_*_ids and put the aldebaran
> PCI IDs simply in amd_root_ids, amd_nb_misc_ids and amd_nb_link_ids
> respectively.
>
> Because to this code, the RAS functionality is no different than any
> other CPU because, well, the interface is those PCI devices. So the
> thing doesn't care if it is GPU or not.
>
> So you don't need any of that separation between GPU and CPU nodes when
> it comes to the RAS code.

The roots_per_misc count is different for the CPU nodes and the GPU
nodes. We tried to address your comment without introducing
pci_device_id arrays for the GPU roots, miscs and links, but
introducing separate GPU ID arrays looks cleaner. Let me submit the
revised code and we can revisit this point.
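
To illustrate (rough sketch only, with made-up variable names mirroring
the CPU-side counters, not the final patch), the GPU enumeration would
compute its own ratio the same way the existing CPU path does:

	/*
	 * Sketch: roots_per_misc_noncpu, root_count_noncpu and
	 * misc_count_noncpu are hypothetical counterparts of the
	 * existing CPU-side variables in amd_cache_northbridges().
	 */
	if (misc_count_noncpu) {
		roots_per_misc_noncpu = root_count_noncpu / misc_count_noncpu;

		/* Every GPU misc device is expected to have a root. */
		if (!roots_per_misc_noncpu ||
		    root_count_noncpu % misc_count_noncpu) {
			pr_info("Unsupported AMD GPU DF topology\n");
			return -ENODEV;
		}
	}

This keeps the existing sanity check for the CPU nodes untouched and
applies the same check, with a separate ratio, to the GPU nodes.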

>
> Makes sense?
>
> --
> Regards/Gruss,
>      Boris.

Regards,

Naveen

>
> https://people.kernel.org/tglx/notes-about-netiquette

Thread overview: 67+ messages
2021-06-30 15:28 [PATCH 0/7] x86/edac/amd64: Add support for noncpu nodes Naveen Krishna Chatradhi
2021-06-30 15:28 ` [PATCH 1/7] x86/amd_nb: Add Aldebaran device to PCI IDs Naveen Krishna Chatradhi
2021-07-19 19:28   ` Yazen Ghannam
2021-08-10 12:45     ` Chatradhi, Naveen Krishna
2021-08-10 13:54       ` Borislav Petkov
2021-06-30 15:28 ` [PATCH 2/7] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-07-19 20:25   ` Yazen Ghannam
2021-08-10 12:45     ` Chatradhi, Naveen Krishna
2021-06-30 15:28 ` [PATCH 3/7] EDAC/mc: Add new HBM2 memory type Naveen Krishna Chatradhi
2021-07-19 20:47   ` Yazen Ghannam
2021-07-19 21:16     ` Luck, Tony
2021-07-20 12:13       ` Zhuo, Qiuxu
2021-07-20 16:30         ` [PATCH] EDAC/skx_common: Set the memory type correctly for HBM memory Luck, Tony
2021-07-20 16:33     ` [PATCH 3/7] EDAC/mc: Add new HBM2 memory type Luck, Tony
2021-06-30 15:28 ` [PATCH 4/7] EDAC/mce_amd: extract node id from InstanceHi in IPID Naveen Krishna Chatradhi
2021-07-29 16:32   ` Yazen Ghannam
2021-08-10 12:45     ` Chatradhi, Naveen Krishna
2021-06-30 15:28 ` [PATCH 5/7] EDAC/amd64: Enumerate memory on noncpu nodes Naveen Krishna Chatradhi
2021-07-29 17:55   ` Yazen Ghannam
2021-08-10 12:49     ` Chatradhi, Naveen Krishna
2021-06-30 15:28 ` [PATCH 6/7] EDAC/amd64: Add address translation support for DF3.5 Naveen Krishna Chatradhi
2021-07-29 17:59   ` Yazen Ghannam
2021-07-29 18:13     ` Chatradhi, Naveen Krishna
2021-06-30 15:28 ` [PATCH 7/7] EDAC/amd64: Add fixed UMC to CS mapping Naveen Krishna Chatradhi
2021-08-06  7:43 ` [PATCH v2 0/3] x86/edac/amd64: Add support for noncpu nodes Naveen Krishna Chatradhi
2021-08-06  7:43   ` [PATCH v2 1/3] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-08-20 15:36     ` Yazen Ghannam
2021-08-06  7:43   ` [PATCH v2 2/3] EDAC/mce_amd: Extract node id from InstanceHi in IPID Naveen Krishna Chatradhi
2021-08-20 15:43     ` Yazen Ghannam
2021-08-06  7:43   ` [PATCH v2 3/3] EDAC/amd64: Enumerate memory on noncpu nodes Naveen Krishna Chatradhi
2021-08-20 17:02     ` Yazen Ghannam
2021-08-23 16:56       ` Chatradhi, Naveen Krishna
2021-08-23 18:54     ` [PATCH v3 0/3] x86/edac/amd64: Add support for " Naveen Krishna Chatradhi
2021-08-23 18:54       ` [PATCH v3 1/3] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-08-25 10:42         ` Borislav Petkov
2021-09-01 18:17           ` Yazen Ghannam
2021-09-02 17:30             ` Borislav Petkov
2021-09-13 18:07               ` Yazen Ghannam
2021-09-16 16:36                 ` Borislav Petkov
2021-10-11 14:26           ` Chatradhi, Naveen Krishna [this message]
2021-10-11 18:08             ` Borislav Petkov
2021-08-23 18:54       ` [PATCH v3 2/3] EDAC/mce_amd: Extract node id from MCA_IPID Naveen Krishna Chatradhi
2021-08-27 10:24         ` Borislav Petkov
2021-09-01 18:18           ` Yazen Ghannam
2021-08-23 18:54       ` [PATCH v3 3/3] EDAC/amd64: Enumerate memory on noncpu nodes Naveen Krishna Chatradhi
2021-08-27 11:30         ` Borislav Petkov
2021-09-01 18:42           ` Yazen Ghannam
2021-09-08 18:41             ` Borislav Petkov
2021-09-13 18:19               ` Yazen Ghannam
2021-10-14 18:50       ` [PATCH v4 0/4] x86/edac/amd64: Add support for " Naveen Krishna Chatradhi
2021-10-14 18:50         ` [PATCH v4 1/4] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-10-14 18:50         ` [PATCH v4 2/4] EDAC/mce_amd: Extract node id from MCA_IPID Naveen Krishna Chatradhi
2021-10-14 18:50         ` [PATCH 3/4] EDAC/amd64: Extend family ops functions Naveen Krishna Chatradhi
2021-10-14 18:50         ` [PATCH 4/4] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes Naveen Krishna Chatradhi
2021-10-14 18:53       ` [PATCH v4 0/4] x86/edac/amd64: Add support for noncpu nodes Naveen Krishna Chatradhi
2021-10-14 18:53         ` [PATCH v4 1/4] x86/amd_nb: Add support for northbridges on Aldebaran Naveen Krishna Chatradhi
2021-10-21 15:52           ` Yazen Ghannam
2021-10-14 18:53         ` [PATCH v4 2/4] EDAC/mce_amd: Extract node id from MCA_IPID Naveen Krishna Chatradhi
2021-10-21 15:55           ` Yazen Ghannam
2021-10-14 18:53         ` [PATCH v4 3/4] EDAC/amd64: Extend family ops functions Naveen Krishna Chatradhi
2021-10-21 16:27           ` Yazen Ghannam
2021-10-14 18:54         ` [PATCH v4 4/4] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes Naveen Krishna Chatradhi
2021-10-21 16:40           ` Yazen Ghannam
2021-10-14 19:53         ` [PATCH v4 0/4] x86/edac/amd64: Add support for noncpu nodes Borislav Petkov
2021-10-15 12:18           ` Chatradhi, Naveen Krishna
2021-10-15 19:25             ` Borislav Petkov
2021-08-20 17:07   ` [PATCH v2 0/3] " Yazen Ghannam
