linux-edac.vger.kernel.org archive mirror
* [GIT PULL] EDAC updates for v6.9
@ 2024-03-11 15:57 Borislav Petkov
  2024-03-12  1:12 ` Linus Torvalds
  2024-03-12  1:30 ` pr-tracker-bot
  0 siblings, 2 replies; 9+ messages in thread
From: Borislav Petkov @ 2024-03-11 15:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: x86-ml, lkml, linux-edac

Hi Linus,

please pull EDAC updates for 6.9.

Due to the topology changes from tip, a one-liner needs to be applied
as part of the merge commit:

diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
index 08c6dbd44c62..59b6169093f7 100644
--- a/drivers/ras/amd/atl/umc.c
+++ b/drivers/ras/amd/atl/umc.c
@@ -315,7 +315,7 @@ static u8 get_die_id(struct atl_err *err)
 	 * For CPUs, this is the AMD Node ID modulo the number
 	 * of AMD Nodes per socket.
 	 */
-	return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
+	return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
 }
 
 #define UMC_CHANNEL_NUM	GENMASK(31, 20)
---

linux-next did test with a similar diff carried forward:

https://lore.kernel.org/r/20240227134352.6deda860@canb.auug.org.au

but we very recently realized that
s/topology_die_id/topology_amd_node_id/ needs to happen too.

That's not a big deal, though, as these are all new drivers for new
hardware which pretty much no one has yet so there's no risk of breaking
any existing machines out there.

Thx.

---

The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:

  Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/edac_updates_for_v6.9

for you to fetch changes up to af65545a0f82d7336f62e34f69d3c644806f5f95:

  Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9 (2024-03-11 16:24:20 +0100)

----------------------------------------------------------------
 - Add a FRU (Field Replaceable Unit) memory poison manager which
   collects and manages previously encountered hw errors in order to
   save them to persistent storage across reboots. Previously recorded
   errors are "replayed" upon reboot in order to poison memory which has
   caused said errors in the past.

   The main use case is stacked, on-chip memory which cannot simply be
   replaced so poisoning faulty areas of it and thus making them
   inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the
   reported addresses of hw errors into system physical addresses in
   order to be used by other subsystems like memory failure, for
   example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

----------------------------------------------------------------
Borislav Petkov (AMD) (3):
      Documentation: Move RAS section to admin-guide
      RAS: Export helper to get ras_debugfs_dir
      Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9

Dan Carpenter (2):
      RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
      RAS/AMD/FMPM: Fix off by one when unwinding on error

Lili Li (1):
      EDAC/igen6: Add one more Intel Alder Lake-N SoC support

Muralidhara M K (1):
      RAS/AMD/ATL: Add MI300 support

Qiuxu Zhuo (1):
      EDAC/i10nm: Add Intel Grand Ridge micro-server support

Shubhrajyoti Datta (1):
      EDAC/versal: Make the bit position of injected errors configurable

Uwe Kleine-König (1):
      EDAC/versal: Convert to platform remove callback returning void

Yangtao Li (1):
      EDAC/synopsys: Convert to devm_platform_ioremap_resource()

Yazen Ghannam (9):
      RAS: Introduce AMD Address Translation Library
      EDAC/amd64: Use new AMD Address Translation Library
      Documentation: RAS: Add index and address translation section
      RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
      RAS/AMD/ATL: Add MI300 row retirement support
      RAS: Introduce a FRU memory poison manager
      RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
      RAS/AMD/FMPM: Save SPA values
      RAS/AMD/FMPM: Add debugfs interface to print record entries

 .../admin-guide/RAS/address-translation.rst        |   24 +
 .../ras.rst => admin-guide/RAS/error-decoding.rst} |   11 +-
 Documentation/admin-guide/RAS/index.rst            |    7 +
 .../admin-guide/{ras.rst => RAS/main.rst}          |   10 +-
 Documentation/admin-guide/index.rst                |    2 +-
 Documentation/index.rst                            |    1 -
 MAINTAINERS                                        |   15 +-
 drivers/edac/Kconfig                               |    1 +
 drivers/edac/amd64_edac.c                          |  286 +-----
 drivers/edac/i10nm_base.c                          |    1 +
 drivers/edac/igen6_edac.c                          |    2 +
 drivers/edac/synopsys_edac.c                       |    4 +-
 drivers/edac/versal_edac.c                         |  199 +++-
 drivers/ras/Kconfig                                |   13 +
 drivers/ras/Makefile                               |    3 +
 drivers/ras/amd/atl/Kconfig                        |   21 +
 drivers/ras/amd/atl/Makefile                       |   18 +
 drivers/ras/amd/atl/access.c                       |  133 +++
 drivers/ras/amd/atl/core.c                         |  225 +++++
 drivers/ras/amd/atl/dehash.c                       |  500 ++++++++++
 drivers/ras/amd/atl/denormalize.c                  |  718 ++++++++++++++
 drivers/ras/amd/atl/internal.h                     |  306 ++++++
 drivers/ras/amd/atl/map.c                          |  682 +++++++++++++
 drivers/ras/amd/atl/reg_fields.h                   |  606 ++++++++++++
 drivers/ras/amd/atl/system.c                       |  288 ++++++
 drivers/ras/amd/atl/umc.c                          |  341 +++++++
 drivers/ras/amd/fmpm.c                             | 1013 ++++++++++++++++++++
 drivers/ras/cec.c                                  |   10 +-
 drivers/ras/debugfs.c                              |    8 +-
 drivers/ras/debugfs.h                              |    2 +-
 drivers/ras/ras.c                                  |   31 +
 include/linux/ras.h                                |   18 +
 32 files changed, 5164 insertions(+), 335 deletions(-)
 create mode 100644 Documentation/admin-guide/RAS/address-translation.rst
 rename Documentation/{RAS/ras.rst => admin-guide/RAS/error-decoding.rst} (73%)
 create mode 100644 Documentation/admin-guide/RAS/index.rst
 rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)
 create mode 100644 drivers/ras/amd/atl/Kconfig
 create mode 100644 drivers/ras/amd/atl/Makefile
 create mode 100644 drivers/ras/amd/atl/access.c
 create mode 100644 drivers/ras/amd/atl/core.c
 create mode 100644 drivers/ras/amd/atl/dehash.c
 create mode 100644 drivers/ras/amd/atl/denormalize.c
 create mode 100644 drivers/ras/amd/atl/internal.h
 create mode 100644 drivers/ras/amd/atl/map.c
 create mode 100644 drivers/ras/amd/atl/reg_fields.h
 create mode 100644 drivers/ras/amd/atl/system.c
 create mode 100644 drivers/ras/amd/atl/umc.c
 create mode 100644 drivers/ras/amd/fmpm.c


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-11 15:57 [GIT PULL] EDAC updates for v6.9 Borislav Petkov
@ 2024-03-12  1:12 ` Linus Torvalds
  2024-03-12  2:24   ` Randy Dunlap
                     ` (2 more replies)
  2024-03-12  1:30 ` pr-tracker-bot
  1 sibling, 3 replies; 9+ messages in thread
From: Linus Torvalds @ 2024-03-12  1:12 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86-ml, lkml, linux-edac

On Mon, 11 Mar 2024 at 08:57, Borislav Petkov <bp@alien8.de> wrote:
>
> -       return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
> +       return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();

Ho humm. Lookie here:

    static inline unsigned int topology_amd_nodes_per_pkg(void)
    { return 0; };

that's the UP case.

Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
potentially do that modulus by zero.

So I made the merge also change that UP case of
topology_amd_nodes_per_pkg() to return 1.

Because dammit, not only is a mod-by-zero wrong, a UP system most
definitely has one node per package, not zero.

               Linus
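
A minimal user-space sketch of the problem described above, not kernel
code. The names up_nodes_per_pkg() and die_id_from_node() are hypothetical
stand-ins chosen for illustration; only the modulo arithmetic and the
"return 1" choice for UP come from the thread. The divisor is passed in
explicitly so the zero case can be observed instead of trapping.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the UP stub after the merge fixup:
 * a uniprocessor system has one node per package, never zero.
 */
static inline unsigned int up_nodes_per_pkg(void)
{
	return 1;
}

/*
 * Same arithmetic as get_die_id() in the diff (node ID modulo nodes per
 * package), with the divisor as a parameter. Returns false instead of
 * dividing when the divisor is zero, since "x % 0" is undefined
 * behavior in C.
 */
static bool die_id_from_node(unsigned int node_id,
			     unsigned int nodes_per_pkg,
			     unsigned int *die_id)
{
	if (nodes_per_pkg == 0)
		return false;

	*die_id = node_id % nodes_per_pkg;
	return true;
}
```

With a stub that returns 1, every node ID maps to die 0, which matches the
"one node per package" reading of a UP system. With a stub that returns 0,
the unguarded expression in get_die_id() would be a modulus by zero.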


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-11 15:57 [GIT PULL] EDAC updates for v6.9 Borislav Petkov
  2024-03-12  1:12 ` Linus Torvalds
@ 2024-03-12  1:30 ` pr-tracker-bot
  1 sibling, 0 replies; 9+ messages in thread
From: pr-tracker-bot @ 2024-03-12  1:30 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linus Torvalds, x86-ml, lkml, linux-edac

The pull request you sent on Mon, 11 Mar 2024 16:57:11 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/edac_updates_for_v6.9

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b0402403e54ae9eb94ce1cbb53c7def776e97426

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  1:12 ` Linus Torvalds
@ 2024-03-12  2:24   ` Randy Dunlap
  2024-03-12  2:25     ` Linus Torvalds
  2024-03-12  7:45   ` Borislav Petkov
  2024-03-12 10:07   ` Thomas Gleixner
  2 siblings, 1 reply; 9+ messages in thread
From: Randy Dunlap @ 2024-03-12  2:24 UTC (permalink / raw)
  To: Linus Torvalds, Borislav Petkov; +Cc: x86-ml, lkml, linux-edac



On 3/11/24 18:12, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 08:57, Borislav Petkov <bp@alien8.de> wrote:
>>
>> -       return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
>> +       return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
> 
> Ho humm. Lookie here:
> 
>     static inline unsigned int topology_amd_nodes_per_pkg(void)
>     { return 0; };
> 

and there's an extra/trailing ';'.

> that's the UP case.
> 
> Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
> potentially do that modulus by zero.
> 
> So I made the merge also change that UP case of
> topology_amd_nodes_per_pkg() to return 1.
> 
> Because dammit, not only is a mod-by-zero wrong, a UP system most
> definitely has one node per package, not zero.
> 
>                Linus
> 

-- 
#Randy


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  2:24   ` Randy Dunlap
@ 2024-03-12  2:25     ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2024-03-12  2:25 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Borislav Petkov, x86-ml, lkml, linux-edac

On Mon, 11 Mar 2024 at 19:24, Randy Dunlap <rdunlap@infradead.org> wrote:
>
> and there's an extra/trailing ';'.

Ayup, I fixed that too while I was in there anyway.

              Linus


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  1:12 ` Linus Torvalds
  2024-03-12  2:24   ` Randy Dunlap
@ 2024-03-12  7:45   ` Borislav Petkov
  2024-03-12  9:16     ` Ingo Molnar
  2024-03-12 10:07   ` Thomas Gleixner
  2 siblings, 1 reply; 9+ messages in thread
From: Borislav Petkov @ 2024-03-12  7:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: x86-ml, lkml, linux-edac

On Mon, Mar 11, 2024 at 06:12:54PM -0700, Linus Torvalds wrote:
> Ho humm. Lookie here:
> 
>     static inline unsigned int topology_amd_nodes_per_pkg(void)
>     { return 0; };
> 
> that's the UP case.
> 
> Yeah, I'm assuming nobody tests this for UP,

Unless it gets randomly enabled in my randconfig builds once in a blue
moon, I'd say pretty seldom. I've heard people raise the question
multiple times whether we should simply make CONFIG_SMP default y on x86
and frankly, it'll get rid of a whole bunch of stupid corner cases like
that...

> but it's clearly wrong to potentially do that modulus by zero.

Yep.

> So I made the merge also change that UP case of
> topology_amd_nodes_per_pkg() to return 1.
> 
> Because dammit, not only is a mod-by-zero wrong, a UP system most
> definitely has one node per package, not zero.

Yap, that's the straightforward thing to do, thanks for fixing it!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  7:45   ` Borislav Petkov
@ 2024-03-12  9:16     ` Ingo Molnar
  2024-03-12  9:29       ` Borislav Petkov
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2024-03-12  9:16 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Linus Torvalds, x86-ml, lkml, linux-edac


* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Mar 11, 2024 at 06:12:54PM -0700, Linus Torvalds wrote:
> > Ho humm. Lookie here:
> > 
> >     static inline unsigned int topology_amd_nodes_per_pkg(void)
> >     { return 0; };
> > 
> > that's the UP case.
> > 
> > Yeah, I'm assuming nobody tests this for UP,
> 
> Unless it gets randomly enabled in my randconfig builds once in a blue
> moon, I'd say pretty seldom. I've heard people raise the question
> multiple times whether we should simply make CONFIG_SMP default y on x86
> and frankly, it'll get rid of a whole bunch of stupid corner cases like
> that...

Making it 'default y' in the Kconfig alone changes very little, as people & 
bots will still stumble on !SMP via allnoconfig or randconfig builds.

If you mean forcing CONFIG_SMP via 'select SMP' on x86 on the other hand, 
that's worth considering - although I think there will be a ton of extra 
cross-build breakage as most patches still get created & tested on x86.

In other words, the x86 UP build basically has the side-effect utility of 
covering a lot of UP cross-build scenarios in generic code.

I think the most viable approach would be to make SMP the only model all 
across the kernel (and eventually removing the CONFIG_SMP option), while 
propagating UP data structures and locking primitives to the UP arch level, 
instead of having CONFIG_SMP #ifdefs in generic code.

Maybe not today, but certainly in a few years.

Thanks,

	Ingo
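
One possible reading of the "SMP as the only model" idea above, as a
user-space sketch rather than kernel code. Every name here (demo_lock,
demo_arch_lock(), DEMO_ARCH_MULTI_CPU and so on) is hypothetical and made
up for illustration. The point is only the layering: the "generic" function
at the bottom has no configuration #ifdef, and a single-CPU "arch" supplies
trivial primitives behind the same interface.

```c
#ifdef DEMO_ARCH_MULTI_CPU
#include <stdatomic.h>

/* Multi-CPU "arch" layer: a real spinning lock built on a C11 atomic flag. */
struct demo_lock { atomic_flag flag; };
#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

static inline void demo_arch_lock(struct demo_lock *l)
{
	while (atomic_flag_test_and_set(&l->flag))
		;	/* spin until the holder clears the flag */
}

static inline void demo_arch_unlock(struct demo_lock *l)
{
	atomic_flag_clear(&l->flag);
}
#else
/* Single-CPU "arch" layer: same interface, nothing to contend with. */
struct demo_lock { char unused; };
#define DEMO_LOCK_INIT { 0 }

static inline void demo_arch_lock(struct demo_lock *l)
{
	(void)l;	/* no other CPU can race with us */
}

static inline void demo_arch_unlock(struct demo_lock *l)
{
	(void)l;
}

static inline unsigned int demo_arch_nr_cpus(void)
{
	return 1;	/* "UP is an SMP machine with only 1 CPU" */
}
#endif

/* "Generic" code: identical for both layers, no #ifdef in sight. */
static unsigned long demo_counter;
static struct demo_lock demo_counter_lock = DEMO_LOCK_INIT;

static unsigned long demo_counter_inc(void)
{
	unsigned long val;

	demo_arch_lock(&demo_counter_lock);
	val = ++demo_counter;
	demo_arch_unlock(&demo_counter_lock);
	return val;
}
```

Built without DEMO_ARCH_MULTI_CPU, the lock calls compile down to nothing,
while the generic function stays unchanged. Whether the kernel would
structure such a change this way is not stated in the thread.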


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  9:16     ` Ingo Molnar
@ 2024-03-12  9:29       ` Borislav Petkov
  0 siblings, 0 replies; 9+ messages in thread
From: Borislav Petkov @ 2024-03-12  9:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, x86-ml, lkml, linux-edac

On Tue, Mar 12, 2024 at 10:16:10AM +0100, Ingo Molnar wrote:
> If you mean forcing CONFIG_SMP via 'select SMP' on x86 on the other
> hand, that's worth considering

Yeah, that.

> - although I think there will be a ton of extra cross-build breakage
> as most patches still get created & tested on x86.

I wanna say "this better be build-tested on the target architecture too"
but I can certainly see the use case of having to cross-build a UP
config.

> I think the most viable approach would be to make SMP the only model
> all across the kernel (and eventually removing the CONFIG_SMP option),
> while propagating UP data structures and locking primitives to the UP
> arch level, instead of having CONFIG_SMP #ifdefs in generic code.

Right, UP is an SMP machine with only 1 CPU. It should just work. :-P

> Maybe not today, but certainly in a few years.

It makes sense to aim for such a model, yap. Let's do it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [GIT PULL] EDAC updates for v6.9
  2024-03-12  1:12 ` Linus Torvalds
  2024-03-12  2:24   ` Randy Dunlap
  2024-03-12  7:45   ` Borislav Petkov
@ 2024-03-12 10:07   ` Thomas Gleixner
  2 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2024-03-12 10:07 UTC (permalink / raw)
  To: Linus Torvalds, Borislav Petkov; +Cc: x86-ml, lkml, linux-edac

On Mon, Mar 11 2024 at 18:12, Linus Torvalds wrote:

> On Mon, 11 Mar 2024 at 08:57, Borislav Petkov <bp@alien8.de> wrote:
>>
>> -       return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
>> +       return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
>
> Ho humm. Lookie here:
>
>     static inline unsigned int topology_amd_nodes_per_pkg(void)
>     { return 0; };
>
> that's the UP case.
>
> Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
> potentially do that modulus by zero.

Duh. I clearly was not thinking at all when I wrote this.

Thanks for spotting it.


       tglx

