From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754413AbcGHK1B (ORCPT <rfc822;w@1wt.eu>);
	Fri, 8 Jul 2016 06:27:01 -0400
Received: from mail-qt0-f194.google.com ([209.85.216.194]:33415 "EHLO
	mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753018AbcGHK0w (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 8 Jul 2016 06:26:52 -0400
Date: Fri, 8 Jul 2016 12:26:48 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Borislav Petkov <bp@alien8.de>
Cc: LKML <linux-kernel@vger.kernel.org>, Yazen Ghannam <Yazen.Ghannam@amd.com>,
        Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register
Message-ID: <20160708102648.GA22597@gmail.com>
References: <1467968983-4874-1-git-send-email-bp@alien8.de>
 <1467968983-4874-4-git-send-email-bp@alien8.de>
 <20160708092659.GB13849@gmail.com>
 <20160708093731.GC3808@pd.tnic>
 <20160708094653.GC13849@gmail.com>
 <20160708101452.GD3808@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160708101452.GD3808@pd.tnic>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> > I'm not sure I can parse that: how can a reported error have bits corrupted?
> 
> No, it is about the actual bits in memory the ECC error is generated
> for. So, for example, if an ECC error reports that memory location X had
> some bit flips, the syndrome value which gets reported together with
> the same ECC error shows which actual bits have flipped.
> 
> Here's an example from the AMD BKDG, maybe that'll make it more clear:
> 
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> 
> Go to page 246, there it says this:
> 
> "For example, assume the ECC syndrome is 03EAh. First search row EAh
> for the complete syndrome. Since it is not found, search row 03h for
> the complete syndrome. It is found in column 9h, so symbol 9h has the
> error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
> within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
> corrupted bits are 72 and 73 of the line."
> 
> So you basically search the table of x8 ECC correctable syndromes, first
> in row EAh (the second syndrome byte) and, if you don't find the complete
> syndrome there, you search row 03h for it.
> 
> It is in column 9 and that means symbol 9. There are 16 data symbols,
> one for each byte in a 128-bit DRAM word, plus 3 special symbols for the
> ECC bits.
> 
> The row number 3h is also the error bitmask, so bits 0 and 1 are the
> ones which are corrupted.
> 
> Which means, when you look at the value in DRAM at the address the error
> was reported for, you need to go to symbol 9. That's 9*8 = 72, i.e. bits
> 72-79, and the first 2 bits in that byte are bits 72 and 73.
> 
> So if you want to correct them, you simply flip them as the syndrome
> tells you that those 2 are corrupted.
> 
> Ok?
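
The arithmetic in the quoted BKDG example can be sketched in C roughly as
below. This is only an illustration, not the kernel's decoder: the syndrome
table lookup is left out, so the symbol number and error bitmask are taken as
inputs. The helper names are hypothetical, and the sketch assumes that bit 0
of the bitmask corresponds to the lowest-numbered bit of the symbol, as the
example implies.

```c
/* x8 ECC: each symbol covers one byte, as in the quoted BKDG example. */
#define BITS_PER_SYMBOL 8

/*
 * Hypothetical helper: translate a symbol number and an error bitmask into
 * absolute bit positions within the DRAM line. Stores the positions in
 * 'bits' and returns how many bits are flagged as corrupted.
 */
static int corrupted_bit_positions(unsigned int symbol, unsigned int bitmask,
				   unsigned int bits[BITS_PER_SYMBOL])
{
	unsigned int i;
	int n = 0;

	for (i = 0; i < BITS_PER_SYMBOL; i++) {
		if (bitmask & (1u << i))
			bits[n++] = symbol * BITS_PER_SYMBOL + i;
	}

	return n;
}

/*
 * Hypothetical helper: correct the error by flipping the flagged bits.
 * 'line' is the DRAM line viewed as an array of bytes, one byte per symbol
 * (assumption: byte N of the array holds symbol N).
 */
static void flip_corrupted_bits(unsigned char *line, unsigned int symbol,
				unsigned int bitmask)
{
	line[symbol] ^= (unsigned char)bitmask;
}
```

For the example syndrome 03EAh, i.e. symbol 9h with error bitmask 3h,
corrupted_bit_positions(0x9, 0x3, bits) fills in bits 72 and 73.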

So is 'ECC syndrome' a fancy word for a (complicated) process that identifies
which data got corrupted, more precisely than what we had before?

Because previously we already had a memory address of the memory corruption, 
right?

What is the typical 'scope' of that memory corruption address - a cache line, a 
machine word, a byte or maybe a variable unit that is memory hardware dependent?

Thanks,

	Ingo