From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751058AbeABV1V (ORCPT + 1 other);
	Tue, 2 Jan 2018 16:27:21 -0500
Received: from mail.skyhub.de ([5.9.137.197]:45616 "EHLO mail.skyhub.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750960AbeABV1U (ORCPT );
	Tue, 2 Jan 2018 16:27:20 -0500
Date: Tue, 2 Jan 2018 22:27:06 +0100
From: Borislav Petkov
To: Meelis Roos
Cc: Linux Kernel list, x86@kernel.org, linux-edac@vger.kernel.org,
	Tom Lendacky
Subject: Re: 4.15-rc6 PTI regression: L1 TLB mismatch MCE on Athlon64
Message-ID: <20180102212706.5gtevvg4rr7rfy5o@pd.tnic>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Return-Path:

On Tue, Jan 02, 2018 at 10:49:16PM +0200, Meelis Roos wrote:
> This is on a socket 939 Athlon64 3500+, with PTI enabled.

LOL.

> [ 316.384669] mce: [Hardware Error]: Machine check events logged
> [ 316.384698] [Hardware Error]: Corrected error, no action required.
> [ 316.384719] [Hardware Error]: CPU:0 (f:2f:2) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000010011
> [ 316.384742] [Hardware Error]: Error Addr: 0x0000ffff81e000e0

That's the [47:12] slice of the virtual address which it tried to
execute.

According to our map in mm.txt:

ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor

vs

ffff81e000e0...

which makes me think: WTF now?! I don't see any hypervisor happening in
dmesg...

> [ 316.384757] [Hardware Error]: MC1 Error: L1 TLB multimatch.
> [ 316.384774] [Hardware Error]: cache level: L1, tx: INSN
>
> These MCE-s do not happen on 4.14 and 4.15.0-rc4-00041-gace52288edf0.
> They do happen on each boot into 4.15-rc6. Will try to bisect.

Please do. And try -rc5 too. And then Linus' pti merges:

52c90f2d32bfa7d6eccd66a56c44ace1f78fbadd
5aa90a84589282b87666f92b6c3c917c8080a9bf
caf9a82657b313106aae8f4a35936c116a152299
64a48099b3b31568ac45716b7fafcb74a0c2fcfe

> I understand there exist patches that turn off PTI on AMD CPUs but the
> MCE-s seem still interesting.

Yes, there is:

https://lkml.kernel.org/r/20171227054354.20369.94587.stgit@tlendack-t1.amdoffice.net

>
> Same kernel with "nopti" boot command line option does not show the
> MCE-s either.
>
> When the MCE-s happen, they happen with 5 minute interval or slightly
> more, like this (excerpt from grep mce: /var/log/kern.log, not full
> dmesg). The first ones always happen at 316 and 627 seconds after
> bootup.

That's the 5 minute default check interval for corrected errors. You can
do

# echo 10 > /sys/devices/system/machinecheck/machinecheck0/check_interval

to decrease it.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
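
For anyone who wants to redo the two checks above by hand, here is a rough Python sketch, not kernel code. It decodes the flag bits of the quoted MC1_STATUS value into the same bracket layout as the log line, and repeats the prefix comparison against the guard-hole range from mm.txt. The bit positions are the architectural MCi_STATUS ones. The function names are made up for illustration, and padding the ffff81e000e0 prefix with zeros is only an assumption that mirrors how the reply reads the address.

```python
# Architectural MCi_STATUS flag bits. Val and En are listed for
# completeness; the bracket in the quoted log line does not print them.
MCI_STATUS_VAL = 1 << 63
MCI_STATUS_OVER = 1 << 62
MCI_STATUS_UC = 1 << 61
MCI_STATUS_EN = 1 << 60
MCI_STATUS_MISCV = 1 << 59
MCI_STATUS_ADDRV = 1 << 58
MCI_STATUS_PCC = 1 << 57

# Guard hole range as quoted from mm.txt in the mail above.
GUARD_HOLE_START = 0xFFFF800000000000
GUARD_HOLE_END = 0xFFFF87FFFFFFFFFF


def format_status_flags(status):
    """Hypothetical helper: imitate the [Over|UE/CE|MiscV|PCC|AddrV] bracket."""
    fields = [
        "Over" if status & MCI_STATUS_OVER else "-",
        "UE" if status & MCI_STATUS_UC else "CE",
        "MiscV" if status & MCI_STATUS_MISCV else "-",
        "PCC" if status & MCI_STATUS_PCC else "-",
        "AddrV" if status & MCI_STATUS_ADDRV else "-",
    ]
    return "[" + "|".join(fields) + "]"


def in_guard_hole(addr):
    """Hypothetical helper: is a 64-bit virtual address inside the guard hole?"""
    return GUARD_HOLE_START <= addr <= GUARD_HOLE_END


if __name__ == "__main__":
    status = 0x9400000000010011          # MC1_STATUS from the quoted log
    print(format_status_flags(status))   # flags in the log's bracket layout

    # Assumption: treat ffff81e000e0 as the top 12 hex digits of the
    # address and zero-fill the rest, as the comparison in the reply does.
    padded = 0xFFFF81E000E0 << 16
    print(hex(padded), in_guard_hole(padded))
```

The range check is kept separate from the flag decoding so that a different reading of the logged address can be plugged in without touching the rest.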