linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] amd64_edac: Build module on x86-32
@ 2014-11-02 10:22 Tomasz Pala
  2014-11-02 10:33 ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Tomasz Pala @ 2014-11-02 10:22 UTC (permalink / raw)
  To: Borislav Petkov, linux-edac; +Cc: linux-kernel, trivial

While both K8 and F10h are AMD64 CPUs, EDAC doesn't require them to run
in long mode; AMD_NB dependency is enough.

Fixes: 3d373290450b ("amd64_edac: do not enable module by default").

Signed-off-by: Tomasz Pala <gotar@polanet.pl>
---
 drivers/edac/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 7072c28..af9c9fa 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -106,7 +106,7 @@ config EDAC_GHES
 
 config EDAC_AMD64
 	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
-	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
+	depends on EDAC_MM_EDAC && AMD_NB && EDAC_DECODE_MCE
 	help
 	  Support for error detection and correction of DRAM ECC errors on
 	  the AMD64 families of memory controllers (K8 and F10h)
-- 
1.6.0.4



-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-02 10:22 [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
@ 2014-11-02 10:33 ` Borislav Petkov
  2014-11-02 10:45   ` [PATCH] amd64_edac: Document why it is 64-bit only Borislav Petkov
  2014-11-02 12:11   ` [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
  0 siblings, 2 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-11-02 10:33 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-edac, linux-kernel, trivial

On Sun, Nov 02, 2014 at 11:22:12AM +0100, Tomasz Pala wrote:
> While both K8 and F10h are AMD64 CPUs, EDAC doesn't require them to run
> in long mode; AMD_NB dependency is enough.

Not enabling it on 32-bit was a conscious decision for the simple reason
that with the current DIMM sizes, you can have 1 or 2 DIMMs tops which
you can use on 32-bit and having a fat driver mapping memory errors to
DIMMs in that case does seem like a waste of time, energy, resources...
you name it.

I guess I'll add a note about this in the Kconfig text because I keep
getting patches about this once every couple of months :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] amd64_edac: Document why it is 64-bit only
  2014-11-02 10:33 ` Borislav Petkov
@ 2014-11-02 10:45   ` Borislav Petkov
  2014-11-02 12:11   ` [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
  1 sibling, 0 replies; 10+ messages in thread
From: Borislav Petkov @ 2014-11-02 10:45 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-edac, linux-kernel, trivial, Aravind Gopalakrishnan

On Sun, Nov 02, 2014 at 11:33:00AM +0100, Borislav Petkov wrote:
> I guess I'll add a note about this in the Kconfig text because I keep
> getting patches about this once every couple of months :-)

---
From: Borislav Petkov <bp@suse.de>
Subject: [PATCH] amd64_edac: Document why it is 64-bit only

I keep getting requests for this so let's hold it down why it doesn't
make sense. Also, drop incorrect statement about K8 and F10h - this
driver supports all relevant AMD CPUs.

Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/Kconfig | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 7072c2892d63..4be205741221 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -105,11 +105,16 @@ config EDAC_GHES
 	  In doubt, say 'Y'.
 
 config EDAC_AMD64
-	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
+	tristate "AMD64 (Opteron, Athlon64)"
 	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
 	help
 	  Support for error detection and correction of DRAM ECC errors on
-	  the AMD64 families of memory controllers (K8 and F10h)
+	  the AMD64 families of memory controllers, everything >= K8.
+
+	  Note: this driver is 64-bit only on purpose because having a big
+	  driver report errors for 1 or 2 DIMMs max on 32-bit, considering
+	  current DIMM sizes, does not make a lot of sense to be worth the
+	  effort and energy.
 
 config EDAC_AMD64_ERROR_INJECTION
 	bool "Sysfs HW Error injection facilities"
-- 
2.0.0


-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-02 10:33 ` Borislav Petkov
  2014-11-02 10:45   ` [PATCH] amd64_edac: Document why it is 64-bit only Borislav Petkov
@ 2014-11-02 12:11   ` Tomasz Pala
  2014-11-02 12:35     ` Borislav Petkov
  1 sibling, 1 reply; 10+ messages in thread
From: Tomasz Pala @ 2014-11-02 12:11 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

On Sun, Nov 02, 2014 at 11:33:00 +0100, Borislav Petkov wrote:

> Not enabling it on 32-bit was a conscious decision for the simple reason
> that with the current DIMM sizes, you can have 1 or 2 DIMMs tops which
> you can use on 32-bit and having a fat driver mapping memory errors to
> DIMMs in that case does seem like a waste of time, energy, resources...
> you name it.

In my case it's not about mapping but Detection.

=== begin story ===

Recently my PostgreSQL db failed with:

invalid page header in block 240 of relation base/49095/161613

which was fortunately 'fixed' by:

echo 1 > /proc/sys/vm/drop_caches

It turned out that there were on-disk differences between RAID1 (md)
components, not only shown by next run of mdadm-checkarray, but also
visible in actual filesystem after splitting RAID1 into separate
volumes. There were no problems registered in S.M.A.R.T. logs, but
_somehow_ my data got corrupted and I got not a single diagnostic tool
available. There were no power outages or any other abrupt events, it
just happened, without any reason. I've found some page cache corruption
reports on the net, but none of those matched my conditions.

Currently I'm using checksums at application level (available since
PostgreSQL 9.3) and FS level (BTRFS) and EDAC for 4x1 GB ECC UDIMM
(I did replace 2x2 GB non-ECC with these).

If I could I'd use block-level checksumming or setup RAID1 to
scrub-on-read mode, as this system has very low usage volume and I don't
care about performance at all. Unfortunately SATA T13 didn't made it to
the market, and SCSI drives with DIF/DIX are overkill for this system.

=== end story ===


There is absolutely no reason for you to forbid me using EDAC.

And your reasoning is flawn because:

1. I got 4 pieces of 1 GB ECC UDIMM, not 1 or 2 as you stated; isn't it
   supported config? or maybe you would like to replace my modules with
   4 GB one free of charge (shipping included)?
2. It is my time, energy and resources, it's not up to you to decide how
   I'm going to waste them. What next, removing support for power-hungry
   CPUs? Anyway Poland recently got free CO2 emmisions in EU commision;)
3. If _that_ was the reason, why didn't you made it straightforward
   by depending on HIGHMEM64G? Because such ridiculous condition would
   be soon removed?
4. You could apply the same logic to all the other EDAC modules - next
   to the system mentioned above I got some real server boards, some of
   them running 32-bit kernel with ECC FBDIMMs and first from the top:
	config EDAC_I5000
	depends on EDAC_MM_EDAC && X86 && PCI
   Actually there are only 2 X86_64 dependencies in drivers/edac/Kconfig:
   EDAC_SBRIDGE and EDAC_AMD64 - would you 'fix' every X86 as pointless?
5. If you want to prevent this module from loading when only 1 or 2
   DIMMs are installed, just wire this into the module; I got 4 modules.
   Anyway, even with just 1 module installed, I'd like to know error
   rates to be aware of memory module/controller quality and replace it
   when failing too often.

> I guess I'll add a note about this in the Kconfig text because I keep
> getting patches about this once every couple of months :-)

...so, didn't you think that maybe someone needs this?!

Once again: the circuits are working, there is no technical reason not
to use them. It's up to the owner to decide whether it makes sense.

regards,
-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-02 12:11   ` [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
@ 2014-11-02 12:35     ` Borislav Petkov
  2014-11-02 14:08       ` Tomasz Pala
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2014-11-02 12:35 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-edac, linux-kernel

On Sun, Nov 02, 2014 at 01:11:39PM +0100, Tomasz Pala wrote:
> On Sun, Nov 02, 2014 at 11:33:00 +0100, Borislav Petkov wrote:
> 
> > Not enabling it on 32-bit was a conscious decision for the simple reason
> > that with the current DIMM sizes, you can have 1 or 2 DIMMs tops which
> > you can use on 32-bit and having a fat driver mapping memory errors to
> > DIMMs in that case does seem like a waste of time, energy, resources...
> > you name it.
> 
> In my case it's not about mapping but Detection.

Detection of what? Which DIMMs or simply error reporting? Because you
can get reported errors with simply enabling CONFIG_EDAC_DECODE_MCE -
you don't really need amd64_edac for that.

Or do you want for amd64_edac to try to pinpoint which DIMMs are causing
the errors too?

> === begin story ===
> 
> Recently my PostgreSQL db failed with:
> 
> invalid page header in block 240 of relation base/49095/161613
> 
> which was fortunately 'fixed' by:
> 
> echo 1 > /proc/sys/vm/drop_caches
> 
> It turned out that there were on-disk differences between RAID1 (md)
> components, not only shown by next run of mdadm-checkarray, but also
> visible in actual filesystem after splitting RAID1 into separate
> volumes. There were no problems registered in S.M.A.R.T. logs, but
> _somehow_ my data got corrupted and I got not a single diagnostic tool
> available. There were no power outages or any other abrupt events, it
> just happened, without any reason. I've found some page cache corruption
> reports on the net, but none of those matched my conditions.
> 
> Currently I'm using checksums at application level (available since
> PostgreSQL 9.3) and FS level (BTRFS) and EDAC for 4x1 GB ECC UDIMM
> (I did replace 2x2 GB non-ECC with these).
> 
> If I could I'd use block-level checksumming or setup RAID1 to
> scrub-on-read mode, as this system has very low usage volume and I don't
> care about performance at all. Unfortunately SATA T13 didn't made it to
> the market, and SCSI drives with DIF/DIX are overkill for this system.

So were you able to confirm that those errors went away after replacing
the DIMMs?

> There is absolutely no reason for you to forbid me using EDAC.
> 
> And your reasoning is flawn because:
...

First of all, you need to relax yourself. Just calm down a bit, maybe
take a walk first. Take a deep breath, whatever helps.

No one is forbidding you anything - we're simply talking here. And since
you haven't heard my point yet, acting offended for no apparent reason
is simply waste of energy on your part.

Now, to the technical side:

I'm not talking about your time, energy and resources but about mine! I
don't have 32-bit configurations to test 32-bit amd64_edac and am not
willing to go buy any. So let me flip your question: are you going to
test amd64_edac on 32-bit and fix issues when people report them?

If so, I'll gladly enable it there and bounce all such bugs to you for
fixing after people start using it. Oh, and also, all fixes for 32-bit
should *not* break 64-bit amd64_edac so you'll have to test that too.

And this is the main reason why it isn't enabled on 32-bit: lack of
resources and desire to maintain. Very simple.

And finally, if you're only interested in error rates,
CONFIG_EDAC_DECODE_MCE is enough - you get each error reported but
without the amd64_edac output.

> Once again: the circuits are working, there is no technical reason not
> to use them. It's up to the owner to decide whether it makes sense.

Not only to the owner, as I've stated above.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-02 12:35     ` Borislav Petkov
@ 2014-11-02 14:08       ` Tomasz Pala
  2014-11-03 10:55         ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Tomasz Pala @ 2014-11-02 14:08 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

On Sun, Nov 02, 2014 at 13:35:38 +0100, Borislav Petkov wrote:

> Or do you want for amd64_edac to try to pinpoint which DIMMs are causing
> the errors too?

Yes - when error happens, it would be desirable to locate failing module.

> So were you able to confirm that those errors went away after replacing
> the DIMMs?

Can't say - such error (noticed) happened to me only once, how many silent bit
rots I've missed is hard to say, as I haven't got data checksums before.
The previous modules were well tested in this motherboard, so I can't
blame them nor any other component - it's a 'cosmic ray' situation.

OK, with EDAC_DECODE_MCE I would know if I should blame RAM or not. But
if UCE rate is 1/year I can't randomly remove modules and wait if the
problem is gone. Any single UCE should result in action that narrows
down the possibile causes. Other than 'replace entire RAM' obviously.

> First of all, you need to relax yourself. Just calm down a bit, maybe
> take a walk first. Take a deep breath, whatever helps.

OK, done. Sorry for being rude.

> I'm not talking about your time, energy and resources but about mine! I
> don't have 32-bit configurations to test 32-bit amd64_edac and am not
> willing to go buy any. So let me flip your question: are you going to
> test amd64_edac on 32-bit and fix issues when people report them?

1. Yes, I'm going to test, but no, I'm not capable of fixing it, sorry.
1a. There were other reporters you said, maybe some of them are capable.
2. Were there any op-mode specific issues in this code till now? Does
   this differ from not having e.g. F10h hardware? If that happens, I
   might grant remote access to such machine, but that's unfortunatelly all.
3. Didn't know that lack of resources to support discrepancies that might
   occur (but not occuring right now) is valid reason for disabling module
   entirely. After all, there are many parts that are not maintained
   actively at all and nobody removes them preemptively. Back then it
   could be (X86_64 || EXPERIMENTAL), couldn't now it be just a note in
   the description?
4. To be honest I think that more people are abandoning x86-32 than
   enabling ECC on them, so I wouldn't worry about people starting to use this
   and report 32-bit related errors. If you got reports on this once per a few
   months that's the order of magnitude we are talking about.

So, if 32-bit related error are real threat, not just an excuse, ENOTIME
for handling them is fair enough - people determined to have this
running like me will find their way. But please don't say it's not
_worth_ it, Kconfig descriptions are not a place to make such judgements
(as it's YOUR time vs MY data). I'd go for something more objective,
like "this driver might be run on 32-bit kernel, however no complains
would be accepted due to lack of resources to handle 32-bit specific
bugs").

Oh, and one more thing about the proposed description - I've noticed before:

[PATCH 01/16] amd64_edac: Remove F11h support
Fri, 26 Nov 2010 20:04:08 +0100

F11h doesn't support DRAM ECC so whack it away.

and I see F10h, F15h and F16h families only mentioned in amd64_edac.c.

regards,
-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-02 14:08       ` Tomasz Pala
@ 2014-11-03 10:55         ` Borislav Petkov
  2014-11-05 12:03           ` Tomasz Pala
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2014-11-03 10:55 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-edac, linux-kernel

On Sun, Nov 02, 2014 at 03:08:39PM +0100, Tomasz Pala wrote:
> Can't say - such error (noticed) happened to me only once, how many silent bit
> rots I've missed is hard to say, as I haven't got data checksums before.
> The previous modules were well tested in this motherboard, so I can't
> blame them nor any other component - it's a 'cosmic ray' situation.

So we still don't know. I wouldn't throw away the old DIMMs if it is a
single failure only.

> OK, with EDAC_DECODE_MCE I would know if I should blame RAM or not. But
> if UCE rate is 1/year I can't randomly remove modules and wait if the
> problem is gone. Any single UCE should result in action that narrows
> down the possibile causes. Other than 'replace entire RAM' obviously.

I think with UCE you mean uncorrectable error?

If so, those normally announce themselves by a bunch of correctable
errors prior. Which roughly means, a DIMM usually announces that it is
going to get bad soon.

So yes, enabling EDAC_DECODE_MCE should be a good first step as it will
tell you when errors occur.

Btw, I forgot to ask, why are you even running 32-bit? Do you have some
old K8 CPU which is not 64-bit capable?

As a matter of fact, can you apply your patch, enable CONFIG_EDAC_DEBUG
and catch dmesg and send it to me, privately is fine too.

> 1. Yes, I'm going to test, but no, I'm not capable of fixing it, sorry.
> 1a. There were other reporters you said, maybe some of them are capable.

Well, I haven't seen anything besides this one patch, similar to yours.
And they've claimed that it works. But the devil is in the detail as
always so what I'm afraid of is we enable this and then months later,
once it trickles down to more users, bug reports start coming in. Bug
reports which no one would have time to address and maybe fix.

I guess we can add this hunk to your patch, albeit a bit controversial:

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 1a1d7c43a20f..17638d7cf5c2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3035,6 +3035,11 @@ static int __init amd64_edac_init(void)
 		goto err_no_instances;
 
 	setup_pci_device();
+
+#ifdef CONFIG_X86_32
+	amd64_err("%s on 32-bit is unsupported. USE AT YOUR OWN RISK!\n", EDAC_MOD_STR);
+#endif
+
 	return 0;
 
 err_no_instances:

> 2. Were there any op-mode specific issues in this code till now? Does
>    this differ from not having e.g. F10h hardware? If that happens, I
>    might grant remote access to such machine, but that's unfortunatelly all.


> 3. Didn't know that lack of resources to support discrepancies that might
>    occur (but not occuring right now) is valid reason for disabling module
>    entirely. After all, there are many parts that are not maintained
>    actively at all and nobody removes them preemptively. Back then it
>    could be (X86_64 || EXPERIMENTAL), couldn't now it be just a note in
>    the description?

EXPERIMENTAL is gone from the kernel.

And as I explained already, I just don't have the bandwidth and am not
really persuaded supporting 32-bit is worth the effort to do it. It is a
conscious decision. That's why I say, if people want it, they can send
me fixes but cannot expect me to fix stuff.

> 4. To be honest I think that more people are abandoning x86-32 than
>    enabling ECC on them, so I wouldn't worry about people starting to use this
>    and report 32-bit related errors. If you got reports on this once per a few
>    months that's the order of magnitude we are talking about.

Probably. And yes, people should abandon 32-bit if they haven't done so! :-D

> So, if 32-bit related error are real threat, not just an excuse, ENOTIME
> for handling them is fair enough - people determined to have this
> running like me will find their way. But please don't say it's not
> _worth_ it, Kconfig descriptions are not a place to make such judgements
> (as it's YOUR time vs MY data). I'd go for something more objective,
> like "this driver might be run on 32-bit kernel, however no complains
> would be accepted due to lack of resources to handle 32-bit specific
> bugs").

Fair enough. How about the warning above? It will issue upon successful
loading on 32-bit.

But I'd still like to know what is the reason you're not moving to 64-bit.

> Oh, and one more thing about the proposed description - I've noticed before:
> 
> [PATCH 01/16] amd64_edac: Remove F11h support
> Fri, 26 Nov 2010 20:04:08 +0100
> 
> F11h doesn't support DRAM ECC so whack it away.
> 
> and I see F10h, F15h and F16h families only mentioned in amd64_edac.c.

I don't understand what you mean here...

The driver supports everything from K8 on which can do ECC. Family 11h
doesn't support ECC so no need for an EDAC driver. I hope this answers
your question.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-03 10:55         ` Borislav Petkov
@ 2014-11-05 12:03           ` Tomasz Pala
  2014-11-05 14:56             ` Borislav Petkov
  0 siblings, 1 reply; 10+ messages in thread
From: Tomasz Pala @ 2014-11-05 12:03 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

On Mon, Nov 03, 2014 at 11:55:08 +0100, Borislav Petkov wrote:

>> The previous modules were well tested in this motherboard, so I can't
>> blame them nor any other component - it's a 'cosmic ray' situation.
> 
> So we still don't know. I wouldn't throw away the old DIMMs if it is a
> single failure only.

They found their place in some workstations, for less critical usage.

> Btw, I forgot to ask, why are you even running 32-bit? Do you have some
> old K8 CPU which is not 64-bit capable?

This system was backed-up by some Intel one without 64-bit support and
it needed to be fully binary-compatible (including databases storage).

Over the time, as older hardware is disposed, it might eventually be
upgraded to 64-bit kernel running 32-bit userland in compat mode (full
transition is not going to happen soon as costs of such operation, i.e.
dumping and restoring all the data, application tests etc. greatly
overweight any benefits), but even the kernel change is not trivial due
to many quirks that happened every time before (and it was really hard
to find some stable configuration). Thus, until there is some bigger
maintaince undergoing or the hardware reaches it's lifetime, noone is
going to "pay" (allocate time, people at night shifts etc.) for such
change.

> As a matter of fact, can you apply your patch, enable CONFIG_EDAC_DEBUG
> and catch dmesg and send it to me, privately is fine too.

There's not much of if related (system is running 3.14.4):

MCE: In-kernel MCE decoding enabled.
EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: K8 revF or later detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0:  2048MB 1:     0MB
EDAC amd64: MC: 2:  2048MB 3:     0MB
EDAC amd64: MC: 4:     0MB 5:     0MB
EDAC amd64: MC: 6:     0MB 7:     0MB
EDAC amd64: CS0: Unbuffered DDR2 RAM
EDAC amd64: CS2: Unbuffered DDR2 RAM
EDAC MC0: Giving out device to module amd64_edac controller K8: DEV 0000:00:18.2 (INTERRUPT)
EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)

(there are 4 modules 1 GB each, I haven't tested if above changes with
ganged/unganged mode.)

> Fair enough. How about the warning above? It will issue upon successful
> loading on 32-bit.

That is decent solution IMHO. There is a warning visible in logs (not
only in sources or during configuration), so everyone interested would
be informed in the first place they should start reading after any
possible error happens.

> But I'd still like to know what is the reason you're not moving to 64-bit.

Mostly because "If it ain't broke, don't fix it" rule. These are systems
with a few hundreds days uptime (e.g. 3 weeks ago some malfunction
caused 1,5 half year uptime machine to reboot, 500-900 days are not so
uncommon, I remember my pain rebooting machines over 1200 days online).
Restarting them usually causes some minor troubles (not saved changes),
changing software leads to compat troubles (that might be tested before
going to production), but changing kernel makes uncertainty about entire
platform, so it is avoided until necessary (and these running are
polished as much as possible, with backported bugfixes etc.) So
replacing rock-solid kernel with some other is a no-go, even preserving
the current sources (there might always be some 64-bit related bugs).

> The driver supports everything from K8 on which can do ECC. Family 11h
> doesn't support ECC so no need for an EDAC driver. I hope this answers

I mean your '[PATCH] amd64_edac: Document why it is 64-bit only':

-         the AMD64 families of memory controllers (K8 and F10h)
+         the AMD64 families of memory controllers, everything >= K8.

"everything >= K8" mislead me.

best regards,
-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-05 12:03           ` Tomasz Pala
@ 2014-11-05 14:56             ` Borislav Petkov
  2014-11-17 11:15               ` Tomasz Pala
  0 siblings, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2014-11-05 14:56 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: linux-edac, linux-kernel

On Wed, Nov 05, 2014 at 01:03:24PM +0100, Tomasz Pala wrote:
> This system was backed-up by some Intel one without 64-bit support and
> it needed to be fully binary-compatible (including databases storage).
> 
> Over the time, as older hardware is disposed, it might eventually be
> upgraded to 64-bit kernel running 32-bit userland in compat mode (full
> transition is not going to happen soon as costs of such operation, i.e.
> dumping and restoring all the data, application tests etc. greatly
> overweight any benefits), but even the kernel change is not trivial due
> to many quirks that happened every time before (and it was really hard
> to find some stable configuration). Thus, until there is some bigger
> maintaince undergoing or the hardware reaches it's lifetime, noone is
> going to "pay" (allocate time, people at night shifts etc.) for such
> change.

Ok, understood.

> There's not much of if related (system is running 3.14.4):

CONFIG_EDAC_DEBUG gives additional debugging output and this is without
it but it doesn't matter - I see you have a K8 box.

> That is decent solution IMHO. There is a warning visible in logs (not
> only in sources or during configuration), so everyone interested would
> be informed in the first place they should start reading after any
> possible error happens.

I hope :)

> Mostly because "If it ain't broke, don't fix it" rule. These are systems
> with a few hundreds days uptime (e.g. 3 weeks ago some malfunction
> caused 1,5 half year uptime machine to reboot, 500-900 days are not so
> uncommon, I remember my pain rebooting machines over 1200 days online).
> Restarting them usually causes some minor troubles (not saved changes),
> changing software leads to compat troubles (that might be tested before
> going to production), but changing kernel makes uncertainty about entire
> platform, so it is avoided until necessary (and these running are
> polished as much as possible, with backported bugfixes etc.) So
> replacing rock-solid kernel with some other is a no-go, even preserving
> the current sources (there might always be some 64-bit related bugs).

I see.

Just FYI though, one serious advantage of 64-bit is that gets orders of
magnitude more testing than 32-bit so if you still are contemplating a
64-bit switch someday, remember that fact. :)

> > The driver supports everything from K8 on which can do ECC. Family 11h
> > doesn't support ECC so no need for an EDAC driver. I hope this answers
> 
> I mean your '[PATCH] amd64_edac: Document why it is 64-bit only':
> 
> -         the AMD64 families of memory controllers (K8 and F10h)
> +         the AMD64 families of memory controllers, everything >= K8.
> 
> "everything >= K8" mislead me.

Ok, it is supposed to say, on everything K8 and later. K8 is what you
have. What would make it more understandable?

So here's an updated version of your patch:

---
From: Tomasz Pala <gotar@polanet.pl>
Subject: [PATCH] amd64_edac: Build module on x86-32

By popular demand, enable amd64_edac on 32-bit too.

Boris:
 - update Kconfig text.
 - add a warning on load which states that 32-bit configurations are unsupported.

Signed-off-by: Tomasz Pala <gotar@polanet.pl>
Link: http://lkml.kernel.org/r/20141102102212.GA7034@polanet.pl
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/Kconfig      | 6 +++---
 drivers/edac/amd64_edac.c | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 7072c2892d63..4316c9e955b3 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -105,11 +105,11 @@ config EDAC_GHES
 	  In doubt, say 'Y'.
 
 config EDAC_AMD64
-	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
-	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
+	tristate "AMD64 (Opteron, Athlon64)"
+	depends on EDAC_MM_EDAC && AMD_NB && EDAC_DECODE_MCE
 	help
 	  Support for error detection and correction of DRAM ECC errors on
-	  the AMD64 families of memory controllers (K8 and F10h)
+	  the AMD64 families (>= K8) of memory controllers.
 
 config EDAC_AMD64_ERROR_INJECTION
 	bool "Sysfs HW Error injection facilities"
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 1a1d7c43a20f..17638d7cf5c2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3035,6 +3035,11 @@ static int __init amd64_edac_init(void)
 		goto err_no_instances;
 
 	setup_pci_device();
+
+#ifdef CONFIG_X86_32
+	amd64_err("%s on 32-bit is unsupported. USE AT YOUR OWN RISK!\n", EDAC_MOD_STR);
+#endif
+
 	return 0;
 
 err_no_instances:
-- 
2.0.0


Thanks.



-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] amd64_edac: Build module on x86-32
  2014-11-05 14:56             ` Borislav Petkov
@ 2014-11-17 11:15               ` Tomasz Pala
  0 siblings, 0 replies; 10+ messages in thread
From: Tomasz Pala @ 2014-11-17 11:15 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

On Wed, Nov 05, 2014 at 15:56:03 +0100, Borislav Petkov wrote:

>> There's not much of if related (system is running 3.14.4):
> 
> CONFIG_EDAC_DEBUG gives additional debugging output and this is without
> it but it doesn't matter - I see you have a K8 box.

Oh, I've misread the 'DEBUG' part. Unfortunately I've got only K8s to
test.

> Just FYI though, one serious advantage of 64-bit is that gets orders of
> magnitude more testing than 32-bit so if you still are contemplating a
> 64-bit switch someday, remember that fact. :)

Yes, I'm aware of this and won't put 32-bit on new pieces of hardware.
Well, it has always been a pain to support Old, Running, Important Systems;)

>> -         the AMD64 families of memory controllers (K8 and F10h)
>> +         the AMD64 families of memory controllers, everything >= K8.
>> 
>> "everything >= K8" mislead me.
> 
> Ok, it is supposed to say, on everything K8 and later. K8 is what you
> have. What would make it more understandable?

"K8, K10 and everything >= 15h"? To emphase there is a gap? Dunno...

> So here's an updated version of your patch:

That looks fine, thank you for your assistance and patience!

regards,

> ---
> From: Tomasz Pala <gotar@polanet.pl>
> Subject: [PATCH] amd64_edac: Build module on x86-32
> 
> By popular demand, enable amd64_edac on 32-bit too.
> 
> Boris:
>  - update Kconfig text.
>  - add a warning on load which states that 32-bit configurations are unsupported.
> 
> Signed-off-by: Tomasz Pala <gotar@polanet.pl>
> Link: http://lkml.kernel.org/r/20141102102212.GA7034@polanet.pl
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  drivers/edac/Kconfig      | 6 +++---
>  drivers/edac/amd64_edac.c | 5 +++++
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 7072c2892d63..4316c9e955b3 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -105,11 +105,11 @@ config EDAC_GHES
>  	  In doubt, say 'Y'.
>  
>  config EDAC_AMD64
> -	tristate "AMD64 (Opteron, Athlon64) K8, F10h"
> -	depends on EDAC_MM_EDAC && AMD_NB && X86_64 && EDAC_DECODE_MCE
> +	tristate "AMD64 (Opteron, Athlon64)"
> +	depends on EDAC_MM_EDAC && AMD_NB && EDAC_DECODE_MCE
>  	help
>  	  Support for error detection and correction of DRAM ECC errors on
> -	  the AMD64 families of memory controllers (K8 and F10h)
> +	  the AMD64 families (>= K8) of memory controllers.
>  
>  config EDAC_AMD64_ERROR_INJECTION
>  	bool "Sysfs HW Error injection facilities"
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 1a1d7c43a20f..17638d7cf5c2 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3035,6 +3035,11 @@ static int __init amd64_edac_init(void)
>  		goto err_no_instances;
>  
>  	setup_pci_device();
> +
> +#ifdef CONFIG_X86_32
> +	amd64_err("%s on 32-bit is unsupported. USE AT YOUR OWN RISK!\n", EDAC_MOD_STR);
> +#endif
> +
>  	return 0;
>  
>  err_no_instances:

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-11-17 11:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-02 10:22 [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
2014-11-02 10:33 ` Borislav Petkov
2014-11-02 10:45   ` [PATCH] amd64_edac: Document why it is 64-bit only Borislav Petkov
2014-11-02 12:11   ` [PATCH] amd64_edac: Build module on x86-32 Tomasz Pala
2014-11-02 12:35     ` Borislav Petkov
2014-11-02 14:08       ` Tomasz Pala
2014-11-03 10:55         ` Borislav Petkov
2014-11-05 12:03           ` Tomasz Pala
2014-11-05 14:56             ` Borislav Petkov
2014-11-17 11:15               ` Tomasz Pala

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).