From mboxrd@z Thu Jan  1 00:00:00 1970
From: anup@brainfault.org (Anup Patel)
Date: Fri, 16 Aug 2013 23:41:51 +0530
Subject: [PATCH] ARM64: KVM: Fix coherent_icache_guest_page() for host
 with external L3-cache.
In-Reply-To: <20130816175034.GE20246@cbox>
References: <3ebe469fcf451cc7396dd2a5d3f01272@www.loen.fr>
 <CAAhSdy3WwZcLW6s05mV=Mn8=w8TcubTdYrgNqAChqiHxXC53eg@mail.gmail.com>
 <747a0675165da4ef147bbda4e140549b@www.loen.fr>
 <CAAhSdy3t1B_94qiYiLeR2-imQU0KCct6+76kjW_FS=QgAtTb3g@mail.gmail.com>
 <5935339137684ecf90dd484cc5739548@www.loen.fr>
 <20130815165344.GA3853@cbox>
 <CAAhSdy0mMAkziHQOYBw1yT1-h2N6vpQQdn_f25BEJpv2qYOHxA@mail.gmail.com>
 <CAAhSdy3W+q4yk992ajnHBb23SVDqmfe_4BiGfH9XTYkwfVmC-A@mail.gmail.com>
 <20130816171912.GB20246@cbox>
 <CAAhSdy2bk6Ou=326cYJW2wKvm9pffbRRmO2ABFwF9dmqwWdKqQ@mail.gmail.com>
 <20130816175034.GE20246@cbox>
Message-ID: <CAAhSdy3B5JTcd7==u_QzS3V5k7pYkymtiLa7X=yZH+=eSBAw8w@mail.gmail.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
>> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
>> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
>> >> > <christoffer.dall@linaro.org> wrote:
>> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>> >> >>> On 2013-08-15 16:13, Anup Patel wrote:
>> >> >>> > Hi Marc,
>> >> >>> >
>> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> >> >>> > wrote:
>> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>> >> >>> >>> <marc.zyngier@arm.com>
>> >> >>> >>> wrote:
>> >> >>> >>>>
>> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> Hi Marc,
>> >> >>> >>>>>
>> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>> >> >>> >>>>> <marc.zyngier@arm.com>
>> >> >>> >>>>> wrote:
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> Hi Anup,
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>> >> >>> >>>>>>> <marc.zyngier@arm.com>
>> >> >>> >>>>>>> wrote:
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Hi Pranav,
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>> >> >>> >>>>>>>>> dirty
>> >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>> >> >>> >>>>>>>>> this,
>> >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>> >> >>> >>>>>>>>> guest
>> >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>> >> >>> >>>>>>>>> disabled.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>> >> >>> >>>>>>>>> L3-cache.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>> >> >>> >>>>>>>>> <pranavkumar@linaro.org>
>> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> >> >>> >>>>>>>>> ---
>> >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>> >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> index efe609c..5129038 100644
>> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>> >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>> >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>> >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>> >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
>> >> >>> >>>>>>>>> caches. */
>> >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>> >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>> >> >>> >>>>>>>>> ASID-tagged
>> >> >>> >>>>>>>>> VIVT
>> >> >>> >>>>>>>>> */
>> >> >>> >>>>>>>>>               /* any kind of VIPT cache */
>> >> >>> >>>>>>>>>               __flush_icache_all();
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>> >> >>> >>>>>>>> past]
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>> >> >>> >>>>>>>> the
>> >> >>> >>>>>>>> data
>> >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>> >> >>> >>>>>>>> kvmtools
>> >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
>> >> >>> >>>>>>>> can do a
>> >> >>> >>>>>>>> "DC
>> >> >>> >>>>>>>> DVAC" on this region.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>> >> >>> >>>>>>> I am
>> >> >>> >>>>>>> sure
>> >> >>> >>>>>>> other architectures don't have such cache cleaning in
>> >> >>> >>>>>>> user-space for
>> >> >>> >>>>>>> KVM.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>> >> >>> >>>>>>>> - but
>> >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
>> >> >>> >>>>>>>> guest
>> >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>> >> >>> >>>>>>> VCPU was
>> >> >>> >>>>>>> our second option but current approach seemed very simple hence
>> >> >>> >>>>>>> we went for this.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
>> >> >>> >>>>>>> VCPU then
>> >> >>> >>>>>>> we should revise this patch.
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>> >> >>> >>>>>> boot-tested
>> >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
>> >> >>> >>>>>> your
>> >> >>> >>>>>> issue.
>> >> >>> >>>>>>
>> >> >>> >>>>>> As far as I can see, it should fix it without any additional
>> >> >>> >>>>>> flushing.
>> >> >>> >>>>>>
>> >> >>> >>>>>> Please let me know how it goes.
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>> >> >>> >>>>> Write-Allocate,
>> >> >>> >>>>> Outer Write-Back Write-Allocate memory"
>> >> >>> >>>>>
>> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> >> >>> >>>>> treated as "Strongly-ordered device memory"
>> >> >>> >>>>>
>> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>> >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>> >> >>> >>>> be
>> >> >>> >>>> accessed as device memory.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>> >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>> >> >>> >>>>> KVM ARM64 because we use VGIC.
>> >> >>> >>>>>
>> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> >> >>> >>>>> when Stage1 MMU is off.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> See above. My understanding is that HCR.DC controls the default
>> >> >>> >>>> output of
>> >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> >> >>> >>> and wanted guest to decide the memory attribute. In other words,
>> >> >>> >>> you
>> >> >>> >>> did not want to enforce any memory attribute in Stage2.
>> >> >>> >>>
>> >> >>> >>> Please refer to this patch
>> >> >>> >>> https://patchwork.kernel.org/patch/2543201/
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> This patch has never been merged. If you carry on following the
>> >> >>> >> discussion,
>> >> >>> >> you will certainly notice it was dropped for a very good reason:
>> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> >> >>> >>
>> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
>> >> >>> >> they are
>> >> >>> >> essential to the patch I sent this morning.
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>         M.
>> >> >>> >> --
>> >> >>> >> Fast, cheap, reliable. Pick two.
>> >> >>> >
>> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>> >> >>> > is
>> >> >>> > provided a DMA-capable device in pass-through mode. The reason being
>> >> >>> > bootloader/firmware typically don't enable MMU and such
>> >> >>> > bootloader/firmware
>> >> >>> > will programme a pass-through DMA-capable device without any flushes
>> >> >>> > to
>> >> >>> > guest RAM (because it has not enabled MMU).
>> >> >>> >
>> >> >>> > A good example of such a device would be SATA AHCI controller given
>> >> >>> > to a
>> >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
>> >> >>> > bootloader/firmware
>> >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
>> >> >>> > disk.
>> >> >>> > In this situation, we will have to hack Guest bootloader/firmware
>> >> >>> > AHCI driver to
>> >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>> >> >>>
>> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>> >> >>> curiosity: is that a made up example or something you actually have?
>> >> >>>
>> >> >>> Back to square one:
>> >> >>> Can you please benchmark the various cache cleaning options (global at
>> >> >>> startup time, per-page on S2 translation fault, and user-space)?
>> >> >>>
>> >> >> Eh, why is this a more valid argument than the vgic?  The device
>> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
>> >> >> attributes to configure that memory region as device memory.  Why is it
>> >> >> relevant if the device is DMA-capable in this context?
>> >> >>
>> >> >> -Christoffer
>> >> >
>> >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
>> >> > explicit cache flushes when programming DMA-capable device. Now If
>> >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
>> >> > will not be visible to DMA-capable device.
>> >> >
>> >> > --Anup
>> >>
>> >> The approach of flushing d-cache by set/way upon first run of VCPU will
>> >> not work because for set/way operations ARM ARM says: "For set/way
>> >> operations, and for All (entire cache) operations, the point is defined to be
>> >> to the next level of caching". In other words, set/way operations work upto
>> >> point of unification.
>> >>
>> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
>> >> at bootup time which does not work for us when L3-cache is enabled so,
>> >> there is no point is adding one more __flush_dcache_all() upon first run of
>> >> VCPU in KVM ARM64.
>> >
>> > When did anyone suggest using a cache cleaning method that doesn't apply
>> > to this case.  I'm a little confused about your comment here.
>>
>> Please refer last reply from Marc Z where he says:
>> "Can you please benchmark the various cache cleaning options (global at
>> startup time, per-page on S2 translation fault, and user-space)?".
>>
>> Rather doing __flush_dcache_range() on each page in
>> coherent_icache_guest_page() we could also flush entire d-cache upon
>> first VCPU run. This only issue in flushing d-cache upon first VCPU run
>> is that we cannot flush d-cache by set/way because as per ARM ARM
>> all operations by set/way are upto PoU and not PoC. In presence of
>> L3-Cache we need to flush d-cache upto PoC so we have to use
>> __flush_dache_range()
>>
>> >
>> >>
>> >> IMHO, we are left with following options:
>> >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>> >>     upon first run of VCPU
>> >
>> > We could do that, but we have to ensure that no other memory regions can
>> > be added later.  Is this the case?  I don't remember.
>>
>> Yes, memory regions being added later be problem for this option.
>>
>> >
>> >> 2. Implement outer-cache framework for ARM64 and flush all
>> >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
>> >
>> > What's the difference between (2) and (1)?
>>
>> Linux ARM (i.e. 32-bit port) has a framework for having outer
>> caches (i.e. caches that are not part of the CPU but exist as
>> separate entity between CPU and Memory for performance
>> improvement). Using this framework we can flush all CPU caches
>> and outer cache.
>>
>
> And we don't have such a framework in arm64?  But __flush_dcache_range

Yes, for now we don't have outer-cache framework in arm64.

> does nevertheless flush outer caches as well?

Yes, __flush_dcache_range() flushes the outer cache too.

The __flush_dcache_range() works for us because it flushes (i.e.
clean & invalidate) by VA upto PoC. All operation upto PoC on a
cache line will force eviction of the cache line from all CPU caches
(i.e. both L1 & L2) and also from external L3-cache to ensure that
RAM has updated contents of cache line.

Now, the outer caches (such as L3-cache) typically provide a
mechanism to flush entire L3-cache using L3-cache registers. If
we have an outer-cache framework then we can flush all caches
using __flush_dcache_all() + outer cache flush all.

>
>> >
>> >> 3. Use an alternate version of flush_icache_range() which will
>> >>     flush d-cache by PoC instead of PoU. We can also ensure
>> >>     that coherent_icache_guest_page() function will be called
>> >>     upon Stage2 prefetch aborts only.
>> >>
>> >
>> > I'm sorry, but I'm not following this discussion.  Are we not mixing a
>> > discussion along two different axis here?  As I see it there are two
>> > separate issues:
>> >
>> >  (A) When/Where to flush the memory regions
>> >  (B) How to flush the memory regions, including the hardware method and
>> >      the kernel software engineering aspect.
>>
>> Discussion here is about getting KVM ARM64 working in-presence
>> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
>> user-space typically loads images to guest RAM so, in-presence of
>> huge L3-cache (few MBs). When the VCPU starts running some of the
>> contents guest RAM will be still in L3-cache and VCPU runs with
>> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
>> see incorrect contents. To solve this problem we need to flush the
>> guest RAM contents before they are accessed by first time by VCPU.
>>
> ok, I'm with you that far.
>
> But is it also not true that we need to decide between:
>
>   A.1: Flush the entire guest RAM before running the VCPU
>   A.2: Flush the pages as we fault them in

Yes, thats the decision we have to make.

>
> And (independently):
>
>   B.1: Use __flush_dcache_range
>   B.2: Use something else + outer cache framework for arm64

This would be __flush_dcache_all() + outer cache flush all.

>   B.3: Use flush_icache_range

Actually modified version of flush_icache_range() which uses
"dc cvac" and not "dc cvau".

>
> Or do these all interleave somehow?  If so, I don't understand why.  Can
> you help?
>
> -Christoffer

--Anup