From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754413AbcGHK1B (ORCPT );
	Fri, 8 Jul 2016 06:27:01 -0400
Received: from mail-qt0-f194.google.com ([209.85.216.194]:33415 "EHLO
	mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753018AbcGHK0w (ORCPT );
	Fri, 8 Jul 2016 06:26:52 -0400
Date: Fri, 8 Jul 2016 12:26:48 +0200
From: Ingo Molnar
To: Borislav Petkov
Cc: LKML, Yazen Ghannam, Thomas Gleixner, "H. Peter Anvin", Peter Zijlstra
Subject: Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register
Message-ID: <20160708102648.GA22597@gmail.com>
References: <1467968983-4874-1-git-send-email-bp@alien8.de>
	<1467968983-4874-4-git-send-email-bp@alien8.de>
	<20160708092659.GB13849@gmail.com>
	<20160708093731.GC3808@pd.tnic>
	<20160708094653.GC13849@gmail.com>
	<20160708101452.GD3808@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160708101452.GD3808@pd.tnic>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Borislav Petkov wrote:

> On Fri, Jul 08, 2016 at 11:46:53AM +0200, Ingo Molnar wrote:
> > I'm not sure I can parse that: how can a reported error have bits corrupted?
>
> No, it is about the actual bits in memory the ECC error is generated
> for. So, for example, if an ECC error reports that memory location X had
> some bit flips, the syndrome value which gets reported together with
> same ECC error shows which actual bits have flipped.
>
> Here's an example from the AMD BKDG, maybe that'll make it more clear:
>
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
>
> Go to page 246, there it says this:
>
> "For example, assume the ECC syndrome is 03EAh. First search row EAh
> for the complete syndrome. Since it is not found, search row 03h for
> the complete syndrome.
> It is found in column 9h, so symbol 9h has the
> error. Since the error bitmask indicates value 3h (0011b), bits 0 and 1
> within that symbol are corrupted. Symbol 9h maps to bits 72-79, so the
> corrupted bits are 72 and 73 of the line."
>
> So you basically search the table of x8 ECC correctable syndromes, first
> in row EAh (second syndrome byte) and if you don't find the complete
> syndrome there, you search row 03 for it.
>
> It is in column 9 and that means symbol 9. The symbols are 16 - one
> symbol for each byte in a 128bit DRAM word + 3 special symbols for the
> ECC bits.
>
> The row number 3h is also the error bitmask, so bits 0 and 1 are the
> ones which are corrupted.
>
> Which means, when you look at the value in DRAM at the address the error
> was reported, you need to go to symbol 9, that's 9*8 = 72 which means,
> bits 72-79 and the first 2 in that byte are bits 72 and 73.
>
> So if you want to correct them, you simply flip them as the syndrome
> tells you that those 2 are corrupted.
>
> Ok?

So is 'ECC syndrome' a fancy word and a complicated process for identifying what
data got corrupted, in a more accurate fashion than what we had before?

Because previously we already had a memory address of the memory corruption,
right? What is the typical 'scope' of that memory corruption address - a cache
line, a machine word, a byte or maybe a variable unit that is memory hardware
dependent?

Thanks,

	Ingo
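
For reference, the arithmetic in the quoted BKDG example (symbol 9h, error bitmask 3h, hence bits 72 and 73) can be sketched roughly as below. This is only an illustration in Python, not kernel code: the function names are made up for this sketch, and the syndrome table lookup itself is not reproduced here, so the symbol number and the error bitmask are assumed to be already known from the table search.

```python
# Hypothetical helpers illustrating the quoted x8 ECC example.
# The BKDG syndrome table is NOT included; symbol and bitmask are
# taken as inputs, as if the table search had already been done.

def split_syndrome(syndrome):
    """Split a 16-bit syndrome into (high byte, low byte).

    For 03EAh this gives (03h, EAh), the two rows searched in the example.
    """
    return (syndrome >> 8) & 0xFF, syndrome & 0xFF


def corrupted_bits(symbol, error_bitmask, bits_per_symbol=8):
    """Return the bit positions in the line flagged by the bitmask.

    A symbol covers bits_per_symbol consecutive bits, so symbol 9 with
    8-bit symbols starts at bit 9 * 8 = 72.
    """
    base = symbol * bits_per_symbol
    return [base + i for i in range(bits_per_symbol) if error_bitmask & (1 << i)]


def flip_bits(word, bit_positions):
    """Flip the given bit positions in an integer holding the DRAM word."""
    for pos in bit_positions:
        word ^= 1 << pos
    return word


if __name__ == "__main__":
    high, low = split_syndrome(0x03EA)
    print(hex(high), hex(low))        # 0x3 0xea
    print(corrupted_bits(0x9, 0x3))   # [72, 73]
```

Flipping the same positions a second time restores the original value, which matches the "simply flip them" description in the quoted text.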