From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758203AbdKOPM6 (ORCPT ); Wed, 15 Nov 2017 10:12:58 -0500
Received: from youngberry.canonical.com ([91.189.89.112]:55321 "EHLO
	youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755088AbdKOPMs (ORCPT );
	Wed, 15 Nov 2017 10:12:48 -0500
Date: Wed, 15 Nov 2017 09:12:33 -0600 (CST)
From: Manoj Iyer
X-X-Sender: manjo@hungry
To: Shanker Donthineni, James Morse
cc: Manoj Iyer, Will Deacon, Marc Zyngier,
	linux-arm-kernel@lists.infradead.org, Catalin Marinas,
	Ard Biesheuvel, Matt Fleming, Christoffer Dall,
	linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org,
	kvmarm@lists.cs.columbia.edu
Subject: Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
In-Reply-To:
Message-ID:
References: <1509679664-3749-4-git-send-email-shankerd@codeaurora.org>
	<5A04369A.2020405@arm.com>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 10 Nov 2017, Manoj Iyer wrote:

> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>
>>
>> James,
>>
>> Looks like my VM test raised a false alarm. I retested stock Artful 4.13
>> kernel (No erratum 1041 patches applied).
>>
>
> James, an update on the crash (false alarm). We suspect this is a firmware
> crash due to a possible fw bug. Once this is addressed I will be able to send
> you the test results you requested on VM start/stop with the erratum 1041
> patches applied.
>

James/Shanker, I can report that VM start/stop/restart tests worked with
the patches applied to Ubuntu 4.13 (Artful) kernel on the qdf2400 hardware.

Host: Ubuntu 4.13 with Erratum 1041 patches applied
Guest: Stock Ubuntu 4.13 kernel

- Create 20 VMs one at a time

10 iterations of:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

Tested-by: Manoj Iyer

>
>> Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
>> Guest: Ubuntu Zesty (4.10) kernel.
>>
>> - Created 20 VMs one at a time
>>
>> In a loop:
>> - Stop (virsh destroy) 20 VMs one at a time
>> - Start (virsh start) 20 VMs one at a time.
>>
>> And, I am able to reproduce the system reset issue I previously reported. I
>> think the problem I reported with VMs might have nothing to do with the
>> erratum 1041 patches, and probably needs to be root caused separately.
>>
>> With stock 4.13 kernel (no erratum 1041 patches applied):
>>
>> awrep6 login: [ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
>> [ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
>> [ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
>> [ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
>> [ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
>> [ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
>> [ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
>> [ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
>> [ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
>> [ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
>> [ 465.283028] ACPI CPPC: PCC check channel failed. Status=0
>>
>>
>> SYS_DBG: Running SDI image (immediate mode)
>> SYS_DBG: Ram Dump Init
>> SYS_DBG: Failed to init SD card
>> SYS_DBG: Resetting system!
>>
>>
>> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>>
>>>
>>>
>>>
>>> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>>>
>>>>
>>>> James,
>>>>
>>>> (sorry for top-posting)
>>>>
>>>> Applied 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>>>>
>>>> - Start 20 VMs one at a time
>>>>
>>>> In a loop:
>>>> - Stop (virsh destroy) 20 VMs one at a time
>>>> - Start (virsh start) 20 VMs one at a time.
>>>
>>> Fixing some confusion I might have introduced in my prev email.
>>>
>>> - Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>>>
>>> - Created 20 VMs one at a time
>>>
>>> In a loop:
>>> - Stop (virsh destroy) 20 VMs one at a time
>>> - Start (virsh start) 20 VMs one at a time.
>>>
>>>>
>>>> The system resets itself after starting the last VM on the 1st loop,
>>>> displaying the following:
>>>>
>>>> awrep6 login: [ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
>>>> [ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
>>>> [ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
>>>> [ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
>>>> [ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
>>>>
>>>> SYS_DBG: Running SDI image (immediate mode)
>>>> SYS_DBG: Ram Dump Init
>>>> SYS_DBG: Failed to init SD card
>>>> SYS_DBG: Resetting system!
>>>>
>>>> Followed by the following messages on system reboot:
>>>> [ 6.616891] BERT: Error records from previous boot:
>>>> [ 6.621655] [Hardware Error]: event severity: fatal
>>>> [ 6.626516] [Hardware Error]: imprecise tstamp: 0000-00-00 00:00:00
>>>> [ 6.632851] [Hardware Error]: Error 0, type: fatal
>>>> [ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
>>>> [ 6.646045] [Hardware Error]: section length: 0x238
>>>> [ 6.651082] [Hardware Error]: 00000000: 72724502 5220726f 6f736165 6e55206e .Error Reason Un
>>>> [ 6.659761] [Hardware Error]: 00000010: 776f6e6b 0000006e 00000000 00000000 known...........
>>>> [ 6.668442] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
>>>> [ 6.677122] [Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
>>>>
>>>>
>>>> On Thu, 9 Nov 2017, James Morse wrote:
>>>>
>>>>> Hi Manoj,
>>>>>
>>>>> On 08/11/17 19:05, Manoj Iyer wrote:
>>>>>> On Thu, 2 Nov 2017, Shanker Donthineni wrote:
>>>>>>> The ARM architecture defines the memory locations that are permitted
>>>>>>> to be accessed as the result of a speculative instruction fetch from
>>>>>>> an exception level for which all stages of translation are disabled.
>>>>>>> Specifically, the core is permitted to speculatively fetch from the
>>>>>>> 4KB region containing the current program counter and next 4KB.
>>>>>>>
>>>>>>> When translation is changed from enabled to disabled for the running
>>>>>>> exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
>>>>>>> Falkor core may errantly speculatively access memory locations outside
>>>>>>> of the 4KB region permitted by the architecture. The errant memory
>>>>>>> access may lead to one of the following unexpected behaviors.
>>>>>
>>>>>> I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and
>>>>>> ran stress-ng cpu tests on QDF2400 server
>>>>>
>>>>> [...]
>>>>>
>>>>>> Where stress-ng would spawn N workers and test cpu offline/online,
>>>>>> perform matrix operations, do rapid context switches, and anonymous
>>>>>> mmaps. Although I was not able to reproduce the erratum on the stock
>>>>>> 4.13 kernel using the same test case, the patched kernel did not seem
>>>>>> to introduce any regressions either. I ran the stress-ng tests for
>>>>>> over 8hrs and found the system to be stable.
>>>>>
>>>>>
>>>>> Could you throw kexec and KVM into the mix? This issue only shows up
>>>>> when we disable the MMU, which we almost never do.
>>>>>
>>>>> For CPU offline/online we make the PSCI 'offline' call with the MMU
>>>>> enabled.
>>>>> When the CPU comes back firmware has reset the EL2/EL1 SCTLR from a
>>>>> higher exception level, so it won't hit this issue.
>>>>>
>>>>> One place we do this is kexec, where we drop into purgatory with the
>>>>> MMU disabled.
>>>>>
>>>>> The other is KVM unloading itself to return to the hyp stub. You can
>>>>> stress this by starting and stopping a VM. When the number of VMs
>>>>> reaches 0 KVM should unload via 'kvm_arch_hardware_disable()'.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>
>>>> --
>>>> ============================
>>>> Manoj Iyer
>>>> Ubuntu/Canonical
>>>> ARM Servers - Cloud
>>>> ============================
>>>>
>>>>
>>>
>>> --
>>> ============================
>>> Manoj Iyer
>>> Ubuntu/Canonical
>>> ARM Servers - Cloud
>>> ============================
>>>
>>>
>>
>> --
>> ============================
>> Manoj Iyer
>> Ubuntu/Canonical
>> ARM Servers - Cloud
>> ============================
>>
>>
>
> --
> ============================
> Manoj Iyer
> Ubuntu/Canonical
> ARM Servers - Cloud
> ============================
>
>
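
For anyone wanting to repeat the VM start/stop test described in this
thread, the loop could be scripted roughly as below. This is only a sketch
under assumptions, not the script used for the reported results: the domain
names (vm01..vm20) and the helper names build_commands() and run() are
hypothetical, and the 20 guests are assumed to be defined in libvirt
already. Only "virsh destroy" and "virsh start" come from the test
description. It defaults to a dry run that prints the commands instead of
executing them.

```python
#!/usr/bin/env python3
"""Sketch of the VM stop/start stress loop from the test report."""
import subprocess


def build_commands(vm_names, iterations):
    """Return the virsh command sequence for the stress loop.

    Each iteration stops every VM one at a time (virsh destroy) and then
    starts every VM one at a time (virsh start).
    """
    commands = []
    for _ in range(iterations):
        for name in vm_names:
            commands.append(["virsh", "destroy", name])
        for name in vm_names:
            commands.append(["virsh", "start", name])
    return commands


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the loop on the first failing virsh call.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical domain names; substitute the names of the real guests.
    vms = [f"vm{i:02d}" for i in range(1, 21)]
    run(build_commands(vms, iterations=10), dry_run=True)
```

Stopping all guests before starting any of them matters here: per James's
note, KVM only unloads (and so only disables the MMU at EL2) once the number
of running VMs reaches 0.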