From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753969AbdKIQ6j (ORCPT <rfc822;w@1wt.eu>);
        Thu, 9 Nov 2017 11:58:39 -0500
Received: from youngberry.canonical.com ([91.189.89.112]:58571 "EHLO
        youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753224AbdKIQ6i (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 9 Nov 2017 11:58:38 -0500
Date: Thu, 9 Nov 2017 10:58:29 -0600 (CST)
From: Manoj Iyer <manoj.iyer@canonical.com>
X-X-Sender: manjo@lazy
To: James Morse <james.morse@arm.com>
cc: Manoj Iyer <manoj.iyer@canonical.com>,
        Shanker Donthineni <shankerd@codeaurora.org>,
        Will Deacon <will.deacon@arm.com>, Marc Zyngier <marc.zyngier@arm.com>,
        linux-arm-kernel@lists.infradead.org,
        Catalin Marinas <catalin.marinas@arm.com>,
        Ard Biesheuvel <ard.biesheuvel@linaro.org>,
        Matt Fleming <matt@codeblueprint.co.uk>,
        Christoffer Dall <christoffer.dall@linaro.org>,
        linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org,
        kvmarm@lists.cs.columbia.edu
Subject: Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
In-Reply-To: <alpine.DEB.2.20.1711091010570.15101@lazy>
Message-ID: <alpine.DEB.2.20.1711091041180.15101@lazy>
References: <1509679664-3749-4-git-send-email-shankerd@codeaurora.org> <alpine.DEB.2.20.1711081305310.26324@lazy> <5A04369A.2020405@arm.com> <alpine.DEB.2.20.1711090949110.15101@lazy> <alpine.DEB.2.20.1711091010570.15101@lazy>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


James,

It looks like my VM test raised a false alarm. I retested with the stock 
Artful 4.13 kernel (no erratum 1041 patches applied).

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time

In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.
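
The stop/start loop above can be sketched roughly as follows. This is a
minimal sketch, not the script actually used for the test: the domain names
(vm1..vm20), the iteration count, and the helper names are assumptions made
for illustration. Only the 'virsh destroy' and 'virsh start' commands come
from the steps listed above.

```python
import subprocess


def vm_names(count=20, prefix="vm"):
    """Return hypothetical domain names vm1..vmN (the naming is an assumption)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def cycle_commands(names):
    """Build one loop iteration: destroy each VM in turn, then start each in turn."""
    stops = [["virsh", "destroy", name] for name in names]
    starts = [["virsh", "start", name] for name in names]
    return stops + starts


def run_cycles(names, iterations, runner=subprocess.run):
    """Issue the stop/start cycle `iterations` times, one command at a time.

    `runner` defaults to subprocess.run; pass a stub to dry-run the sequence.
    Returns the number of commands issued.
    """
    issued = 0
    for _ in range(iterations):
        for cmd in cycle_commands(names):
            runner(cmd, check=True)  # raises CalledProcessError if virsh fails
            issued += 1
    return issued


# Example (assumed values): run_cycles(vm_names(), iterations=10)
```

Passing a stub as `runner` prints or records the command sequence without
touching libvirt, which is handy for checking the ordering before running it
on the host.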

With this setup I am able to reproduce the system reset issue I previously 
reported. So the problem I reported with VMs probably has nothing to do 
with the erratum 1041 patches and needs to be root-caused separately.

With stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login: [  461.881379] ACPI CPPC: PCC check channel failed. Status=0
[  462.051194] ACPI CPPC: PCC check channel failed. Status=0
[  462.223137] ACPI CPPC: PCC check channel failed. Status=0
[  462.633790] ACPI CPPC: PCC check channel failed. Status=0
[  463.231971] ACPI CPPC: PCC check channel failed. Status=0
[  463.403163] ACPI CPPC: PCC check channel failed. Status=0
[  463.822936] ACPI CPPC: PCC check channel failed. Status=0
[  463.995222] ACPI CPPC: PCC check channel failed. Status=0
[  464.130962] ACPI CPPC: PCC check channel failed. Status=0
[  464.258973] ACPI CPPC: PCC check channel failed. Status=0
[  465.283028] ACPI CPPC: PCC check channel failed. Status=0


SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!


On Thu, 9 Nov 2017, Manoj Iyer wrote:

>
>
>
> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>
>> 
>> James,
>> 
>> (sorry for top-posting)
>> 
>> Applied the 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>> 
>> - Start 20 VMs one at a time
>> 
>> In a loop:
>> - Stop (virsh destroy) 20 VMs one at a time
>> - Start (virsh start) 20 VMs one at a time.
>
> Fixing some confusion I might have introduced in my prev email.
>
> - Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>
> - Created 20 VMs one at a time
>
> In a loop:
> - Stop (virsh destroy) 20 VMs one at a time
> - Start (virsh start) 20 VMs one at a time.
>
>> 
>> The system resets itself after starting the last VM on the 1st loop, 
>> displaying the following:
>> 
>> awrep6 login: [ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
>> 
>> SYS_DBG: Running SDI image (immediate mode)
>> SYS_DBG: Ram Dump Init
>> SYS_DBG: Failed to init SD card
>> SYS_DBG: Resetting system!
>> 
>> Followed by the following messages on system reboot:
>> [ 6.616891] BERT: Error records from previous boot:
>> [ 6.621655] [Hardware Error]: event severity: fatal
>> [ 6.626516] [Hardware Error]: imprecise tstamp: 0000-00-00 00:00:00
>> [ 6.632851] [Hardware Error]: Error 0, type: fatal
>> [ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
>> [ 6.646045] [Hardware Error]: section length: 0x238
>> [ 6.651082] [Hardware Error]: 00000000: 72724502 5220726f 6f736165 6e55206e .Error Reason Un
>> [ 6.659761] [Hardware Error]: 00000010: 776f6e6b 0000006e 00000000 00000000 known...........
>> [ 6.668442] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
>> [ 6.677122] [Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
>> 
>> 
>> On Thu, 9 Nov 2017, James Morse wrote:
>> 
>>> Hi Manoj,
>>> 
>>> On 08/11/17 19:05, Manoj Iyer wrote:
>>>> On Thu, 2 Nov 2017, Shanker Donthineni wrote:
>>>>> The ARM architecture defines the memory locations that are permitted
>>>>> to be accessed as the result of a speculative instruction fetch from
>>>>> an exception level for which all stages of translation are disabled.
>>>>> Specifically, the core is permitted to speculatively fetch from the
>>>>> 4KB region containing the current program counter and next 4KB.
>>>>> 
>>>>> When translation is changed from enabled to disabled for the running
>>>>> exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
>>>>> Falkor core may errantly speculatively access memory locations outside
>>>>> of the 4KB region permitted by the architecture. The errant memory
>>>>> access may lead to one of the following unexpected behaviors.
>>> 
>>>> I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and
>>>> ran stress-ng cpu tests on QDF2400 server
>>> 
>>> [...]
>>> 
>>>> Where stress-ng would spawn N workers and test cpu offline/online,
>>>> perform matrix operations, do rapid context switches, and anonymous
>>>> mmaps. Although I was not able to reproduce the erratum on the stock
>>>> 4.13 kernel using the same test case, the patched kernel did not seem
>>>> to introduce any regressions either. I ran the stress-ng tests for
>>>> over 8hrs and found the system to be stable.
>>> 
>>> 
>>> Could you throw kexec and KVM into the mix? This issue only shows up
>>> when we disable the MMU, which we almost never do.
>>> 
>>> For CPU offline/online we make the PSCI 'offline' call with the MMU
>>> enabled. When the CPU comes back firmware has reset the EL2/EL1 SCTLR
>>> from a higher exception level, so it won't hit this issue.
>>> 
>>> One place we do this is kexec, where we drop into purgatory with the
>>> MMU disabled.
>>> 
>>> The other is KVM unloading itself to return to the hyp stub. You can
>>> stress this by starting and stopping a VM. When the number of VMs
>>> reaches 0 KVM should unload via 'kvm_arch_hardware_disable()'.
>>> 
>>> 
>>> Thanks,
>>> 
>>> James
>>> 
>>> 

--
============================
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
============================