Linux-EDAC Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
@ 2019-12-04 21:23 Aristeu Rozanski
  2019-12-09 21:52 ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Aristeu Rozanski @ 2019-12-04 21:23 UTC (permalink / raw)
  To: linux-edac; +Cc: Tony Luck, Borislav Petkov, Mauro Carvalho Chehab

Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being
available which leads the module to be automatically loaded in virtual machines
as well. That will fail due the missing PCI devices. In both drivers the first
function to make use of the PCI devices is skx_get_hi_lo() will simply print

	EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 95662a4ff4c4..99bbaf629b8d 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -256,7 +256,7 @@ int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm)
 
 	pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, NULL);
 	if (!pdev) {
-		skx_printk(KERN_ERR, "Can't get tolm/tohm\n");
+		edac_dbg(2, "Can't get tolm/tohm\n");
 		return -ENODEV;
 	}
 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2019-12-04 21:23 [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device Aristeu Rozanski
@ 2019-12-09 21:52 ` Luck, Tony
  2019-12-10  0:02   ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2019-12-09 21:52 UTC (permalink / raw)
  To: Aristeu Rozanski, linux-edac; +Cc: Borislav Petkov, Mauro Carvalho Chehab

>	EDAC skx: Can't get tolm/tohm
>
> for each CPU core, which is noisy. This patch makes it a debug message.

This looks like we call skx_init() once per core. Do we keep calling it because
the calls are failing?  Or do we do that even when calls succeed?

I was only really expecting that skx_init() would be called once.

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2019-12-09 21:52 ` Luck, Tony
@ 2019-12-10  0:02   ` Luck, Tony
  2019-12-10  9:00     ` Borislav Petkov
  2019-12-10 14:18     ` Aristeu Rozanski
  0 siblings, 2 replies; 13+ messages in thread
From: Luck, Tony @ 2019-12-10  0:02 UTC (permalink / raw)
  To: Aristeu Rozanski, linux-edac; +Cc: Borislav Petkov, Mauro Carvalho Chehab

> This looks like we call skx_init() once per core. Do we keep calling it because
> the calls are failing?  Or do we do that even when calls succeed?
>
> I was only really expecting that skx_init() would be called once.

So (by experimentation) it seems that if the module load fails it
will be retried num_online_cpus times (though not bound to each
CPU in turn ... it will maybe try the init call on the same CPU multiple
times, but miss running on some CPUs).

If the load succeeds, then whoever is repeating the load decides
to stop.

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2019-12-10  0:02   ` Luck, Tony
@ 2019-12-10  9:00     ` Borislav Petkov
  2020-01-06 15:12       ` Aristeu Rozanski
  2019-12-10 14:18     ` Aristeu Rozanski
  1 sibling, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2019-12-10  9:00 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Aristeu Rozanski, linux-edac, Mauro Carvalho Chehab

On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing?  Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
> 
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
> 
> If the load succeeds, then whoever is repeating the load decides
> to stop.

That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
models. So it tries once on each CPU:

https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic

I have no clean solution for this except maybe remembering the return
value of the first instance probing in the edac core module and then
asking it... it ain't pretty though.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2019-12-10  0:02   ` Luck, Tony
  2019-12-10  9:00     ` Borislav Petkov
@ 2019-12-10 14:18     ` Aristeu Rozanski
  1 sibling, 0 replies; 13+ messages in thread
From: Aristeu Rozanski @ 2019-12-10 14:18 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-edac, Borislav Petkov, Mauro Carvalho Chehab

On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing?  Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
> 
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
> 
> If the load succeeds, then whoever is repeating the load decides
> to stop.

Or silently fails to load the module again for all online cpus.

-- 
Aristeu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2019-12-10  9:00     ` Borislav Petkov
@ 2020-01-06 15:12       ` Aristeu Rozanski
  2020-01-06 16:17         ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Aristeu Rozanski @ 2020-01-06 15:12 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Tue, Dec 10, 2019 at 10:00:13AM +0100, Borislav Petkov wrote:
> On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > > This looks like we call skx_init() once per core. Do we keep calling it because
> > > the calls are failing?  Or do we do that even when calls succeed?
> > >
> > > I was only really expecting that skx_init() would be called once.
> > 
> > So (by experimentation) it seems that if the module load fails it
> > will be retried num_online_cpus times (though not bound to each
> > CPU in turn ... it will maybe try the init call on the same CPU multiple
> > times, but miss running on some CPUs).
> > 
> > If the load succeeds, then whoever is repeating the load decides
> > to stop.
> 
> That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
> models. So it tries once on each CPU:
> 
> https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic
> 
> I have no clean solution for this except maybe remembering the return
> value of the first instance probing in the edac core module and then
> asking it... it ain't pretty though.

The other option I can think about is just allowing the module to load
in this situation: the CPU is present but you have no memory controller
PCI devices present. Some drivers will allow loading without having a
device present without errors. It's not clean as having to come up with
something modutils can pick up as hint to not try to load more than once.

Or could just downgrade this specific message since it's a very common
case and let the more unusual situations report more than once.

-- 
Aristeu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-06 15:12       ` Aristeu Rozanski
@ 2020-01-06 16:17         ` Borislav Petkov
  2020-01-06 16:20           ` Aristeu Rozanski
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2020-01-06 16:17 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> The other option I can think about is just allowing the module to load
> in this situation: the CPU is present but you have no memory controller
> PCI devices present. Some drivers will allow loading without having a
> device present without errors. It's not clean as having to come up with
> something modutils can pick up as hint to not try to load more than once.

Yeah, but having a driver loaded when there's no hw to drive is also not
pretty.

> Or could just downgrade this specific message since it's a very common
> case and let the more unusual situations report more than once.

Yap, we did a similar thing for adm64_edac:

7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")

but instead of downgrading prio we actually went and killed it directly.

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-06 16:17         ` Borislav Petkov
@ 2020-01-06 16:20           ` Aristeu Rozanski
  2020-01-06 16:23             ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Aristeu Rozanski @ 2020-01-06 16:20 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Mon, Jan 06, 2020 at 05:17:32PM +0100, Borislav Petkov wrote:
> On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> > The other option I can think about is just allowing the module to load
> > in this situation: the CPU is present but you have no memory controller
> > PCI devices present. Some drivers will allow loading without having a
> > device present without errors. It's not clean as having to come up with
> > something modutils can pick up as hint to not try to load more than once.
> 
> Yeah, but having a driver loaded when there's no hw to drive is also not
> pretty.
> 
> > Or could just downgrade this specific message since it's a very common
> > case and let the more unusual situations report more than once.
> 
> Yap, we did a similar thing for adm64_edac:
> 
> 7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")
> 
> but instead of downgrading prio we actually went and killed it directly.
> 
> :-)

OK, will resubmit this patch just removing the messages then.

Thanks!

-- 
Aristeu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-06 16:20           ` Aristeu Rozanski
@ 2020-01-06 16:23             ` Borislav Petkov
  2020-01-07 15:51               ` Aristeu Rozanski
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2020-01-06 16:23 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Mon, Jan 06, 2020 at 11:20:14AM -0500, 'Aristeu Rozanski' wrote:
> OK, will resubmit this patch just removing the messages then.

I'm not saying you should blindly remove them. They might be useful for
debugging purposes so you should consider that usage angle first. In the
AMD case, the message was really useless.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-06 16:23             ` Borislav Petkov
@ 2020-01-07 15:51               ` Aristeu Rozanski
  2020-01-07 16:45                 ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Aristeu Rozanski @ 2020-01-07 15:51 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> I'm not saying you should blindly remove them. They might be useful for
> debugging purposes so you should consider that usage angle first. In the
> AMD case, the message was really useless.

Ah, I thought you had an objection to this patch :)
Do you mind considering it for inclusion?

-- 
Aristeu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-07 15:51               ` Aristeu Rozanski
@ 2020-01-07 16:45                 ` Borislav Petkov
  2020-01-07 21:43                   ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2020-01-07 16:45 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: Luck, Tony, linux-edac, Mauro Carvalho Chehab

On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > I'm not saying you should blindly remove them. They might be useful for
> > debugging purposes so you should consider that usage angle first. In the
> > AMD case, the message was really useless.
> 
> Ah, I thought you had an objection to this patch :)
> Do you mind considering it for inclusion?

That's Tony's call as he's maintaining the Intel side of EDAC.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-07 16:45                 ` Borislav Petkov
@ 2020-01-07 21:43                   ` Luck, Tony
  2020-01-08 14:54                     ` Aristeu Rozanski
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2020-01-07 21:43 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Aristeu Rozanski, linux-edac, Mauro Carvalho Chehab

On Tue, Jan 07, 2020 at 05:45:28PM +0100, Borislav Petkov wrote:
> On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> > On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > > I'm not saying you should blindly remove them. They might be useful for
> > > debugging purposes so you should consider that usage angle first. In the
> > > AMD case, the message was really useless.
> > 
> > Ah, I thought you had an objection to this patch :)
> > Do you mind considering it for inclusion?
> 
> That's Tony's call as he's maintaining the Intel side of EDAC.

Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
in edac-for-next branch.  Sorry, should have sent you an "Applied" message.

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device
  2020-01-07 21:43                   ` Luck, Tony
@ 2020-01-08 14:54                     ` Aristeu Rozanski
  0 siblings, 0 replies; 13+ messages in thread
From: Aristeu Rozanski @ 2020-01-08 14:54 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-edac, Mauro Carvalho Chehab

On Tue, Jan 07, 2020 at 01:43:10PM -0800, Luck, Tony wrote:
> Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
> in edac-for-next branch.  Sorry, should have sent you an "Applied" message.

Thanks Tony.

-- 
Aristeu


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-04 21:23 [PATCH] EDAC: skx_common: downgrade message importance on missing PCI device Aristeu Rozanski
2019-12-09 21:52 ` Luck, Tony
2019-12-10  0:02   ` Luck, Tony
2019-12-10  9:00     ` Borislav Petkov
2020-01-06 15:12       ` Aristeu Rozanski
2020-01-06 16:17         ` Borislav Petkov
2020-01-06 16:20           ` Aristeu Rozanski
2020-01-06 16:23             ` Borislav Petkov
2020-01-07 15:51               ` Aristeu Rozanski
2020-01-07 16:45                 ` Borislav Petkov
2020-01-07 21:43                   ` Luck, Tony
2020-01-08 14:54                     ` Aristeu Rozanski
2019-12-10 14:18     ` Aristeu Rozanski

Linux-EDAC Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-edac/0 linux-edac/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-edac linux-edac/ https://lore.kernel.org/linux-edac \
		linux-edac@vger.kernel.org
	public-inbox-index linux-edac

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-edac


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git