From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753969AbdKIQ6j (ORCPT ); Thu, 9 Nov 2017 11:58:39 -0500
Received: from youngberry.canonical.com ([91.189.89.112]:58571 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753224AbdKIQ6i (ORCPT ); Thu, 9 Nov 2017 11:58:38 -0500
Date: Thu, 9 Nov 2017 10:58:29 -0600 (CST)
From: Manoj Iyer
X-X-Sender: manjo@lazy
To: James Morse
cc: Manoj Iyer, Shanker Donthineni, Will Deacon, Marc Zyngier, linux-arm-kernel@lists.infradead.org, Catalin Marinas, Ard Biesheuvel, Matt Fleming, Christoffer Dall, linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org, kvmarm@lists.cs.columbia.edu
Subject: Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
In-Reply-To:
Message-ID:
References: <1509679664-3749-4-git-send-email-shankerd@codeaurora.org> <5A04369A.2020405@arm.com>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

James,

Looks like my VM test raised a false alarm. I retested with the stock
Artful 4.13 kernel (no erratum 1041 patches applied).

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time

In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

And I am able to reproduce the system reset issue I previously reported.
I think the problem I reported with VMs might have nothing to do with the
erratum 1041 patches, and probably needs to be root-caused separately.

With stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login: [ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
[ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
[ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
[ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
[ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
[ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
[ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
[ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
[ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
[ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
[ 465.283028] ACPI CPPC: PCC check channel failed. Status=0

SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

On Thu, 9 Nov 2017, Manoj Iyer wrote:

>
>
>
> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>
>>
>> James,
>>
>> (sorry for top-posting)
>>
>> Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>>
>> - Start 20 VMs one at a time
>>
>> In a loop:
>> - Stop (virsh destroy) 20 VMs one at a time
>> - Start (virsh start) 20 VMs one at a time.
>
> Fixing some confusion I might have introduced in my prev email.
>
> - Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>
> - Created 20 VMs one at a time
>
> In a loop:
> - Stop (virsh destroy) 20 VMs one at a time
> - Start (virsh start) 20 VMs one at a time.
>
>>
>> The system resets itself after starting the last VM on the 1st loop
>> displaying the following:
>>
>> awrep6 login: [ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
>>
>> SYS_DBG: Running SDI image (immediate mode)
>> SYS_DBG: Ram Dump Init
>> SYS_DBG: Failed to init SD card
>> SYS_DBG: Resetting system!
>>
>> Followed by the following messages on system reboot:
>> [ 6.616891] BERT: Error records from previous boot:
>> [ 6.621655] [Hardware Error]: event severity: fatal
>> [ 6.626516] [Hardware Error]: imprecise tstamp: 0000-00-00 00:00:00
>> [ 6.632851] [Hardware Error]: Error 0, type: fatal
>> [ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
>> [ 6.646045] [Hardware Error]: section length: 0x238
>> [ 6.651082] [Hardware Error]: 00000000: 72724502 5220726f 6f736165 6e55206e  .Error Reason Un
>> [ 6.659761] [Hardware Error]: 00000010: 776f6e6b 0000006e 00000000 00000000  known...........
>> [ 6.668442] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000  ................
>> [ 6.677122] [Hardware Error]: 00000030: 00000000 00000000 00000000 00000000  ................
>>
>>
>> On Thu, 9 Nov 2017, James Morse wrote:
>>
>>> Hi Manoj,
>>>
>>> On 08/11/17 19:05, Manoj Iyer wrote:
>>>> On Thu, 2 Nov 2017, Shanker Donthineni wrote:
>>>>> The ARM architecture defines the memory locations that are permitted
>>>>> to be accessed as the result of a speculative instruction fetch from
>>>>> an exception level for which all stages of translation are disabled.
>>>>> Specifically, the core is permitted to speculatively fetch from the
>>>>> 4KB region containing the current program counter and next 4KB.
>>>>>
>>>>> When translation is changed from enabled to disabled for the running
>>>>> exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
>>>>> Falkor core may errantly speculatively access memory locations outside
>>>>> of the 4KB region permitted by the architecture. The errant memory
>>>>> access may lead to one of the following unexpected behaviors.
>>>
>>>> I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and
>>>> ran stress-ng cpu tests on QDF2400 server
>>>
>>> [...]
>>>
>>>> Where stress-ng would spawn N workers and test cpu offline/online,
>>>> perform matrix operations, do rapid context switches, and anonymous
>>>> mmaps. Although I was not able to reproduce the erratum on the stock
>>>> 4.13 kernel using the same test case, the patched kernel did not seem
>>>> to introduce any regressions either. I ran the stress-ng tests for
>>>> over 8hrs and found the system to be stable.
>>>
>>>
>>> Could you throw kexec and KVM into the mix? This issue only shows up
>>> when we disable the MMU, which we almost never do.
>>>
>>> For CPU offline/online we make the PSCI 'offline' call with the MMU
>>> enabled. When the CPU comes back firmware has reset the EL2/EL1 SCTLR
>>> from a higher exception level, so it won't hit this issue.
>>>
>>> One place we do this is kexec, where we drop into purgatory with the
>>> MMU disabled.
>>>
>>> The other is KVM unloading itself to return to the hyp stub. You can
>>> stress this by starting and stopping a VM. When the number of VMs
>>> reaches 0 KVM should unload via 'kvm_arch_hardware_disable()'.
>>>
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>
>> --
>> ============================
>> Manoj Iyer
>> Ubuntu/Canonical
>> ARM Servers - Cloud
>> ============================
>>
>>
>
> --
> ============================
> Manoj Iyer
> Ubuntu/Canonical
> ARM Servers - Cloud
> ============================
>
>

--
============================
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
============================
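
For anyone wanting to repeat the stop/start cycle described in this thread, it could be scripted roughly as below. This is only a minimal sketch under stated assumptions, not the script used for the runs reported above: the guest names (vm1 to vm20), the loop count, and the helper names cycle_commands and run_cycles are hypothetical. Only the `virsh destroy` and `virsh start` steps and their ordering come from the report.

```python
import subprocess
from typing import Callable, List, Sequence


def cycle_commands(domains: Sequence[str]) -> List[List[str]]:
    """Build the virsh commands for one pass: stop every guest, then start every guest."""
    stops = [["virsh", "destroy", name] for name in domains]
    starts = [["virsh", "start", name] for name in domains]
    return stops + starts


def run_cycles(domains: Sequence[str], loops: int,
               runner: Callable[..., object] = subprocess.run) -> None:
    """Run the stop/start pass `loops` times, one guest at a time.

    `runner` defaults to subprocess.run; check=True aborts on the first
    virsh command that exits non-zero.
    """
    for _ in range(loops):
        for cmd in cycle_commands(domains):
            runner(cmd, check=True)


# Example invocation (assumed guest names and loop count, adjust to the setup):
#   guests = [f"vm{i}" for i in range(1, 21)]
#   run_cycles(guests, loops=10)
```

Stopping all guests before starting any of them drives the number of running VMs to zero on every pass, which is the condition James describes for KVM unloading via 'kvm_arch_hardware_disable()'.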