From mboxrd@z Thu Jan 1 00:00:00 1970
From: christoffer.dall@linaro.org (Christoffer Dall)
Date: Fri, 16 Aug 2013 11:20:32 -0700
Subject: [PATCH] ARM64: KVM: Fix coherent_icache_guest_page() for host with external L3-cache.
In-Reply-To: 
References: <747a0675165da4ef147bbda4e140549b@www.loen.fr> <5935339137684ecf90dd484cc5739548@www.loen.fr> <20130815165344.GA3853@cbox> <20130816171912.GB20246@cbox> <20130816175034.GE20246@cbox>
Message-ID: <20130816182032.GG20246@cbox>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Fri, Aug 16, 2013 at 11:41:51PM +0530, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall wrote:
> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> >> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall wrote:
> >> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
> >> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel wrote:
> >> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall wrote:
> >> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> >> >> >>> On 2013-08-15 16:13, Anup Patel wrote:
> >> >> >>> > Hi Marc,
> >> >> >>> >
> >> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier wrote:
> >> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
> >> >> >>> >>>
> >> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier wrote:
> >> >> >>> >>>>
> >> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >> >> >>> >>>>>
> >> >> >>> >>>>> Hi Marc,
> >> >> >>> >>>>>
> >> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier wrote:
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Hi Anup,
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier wrote:
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Hi Pranav,
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Systems with a large external L3-cache (a few MBs) might have
> >> >> >>> >>>>>>>>> dirty content belonging to the guest page in L3-cache. To tackle
> >> >> >>> >>>>>>>>> this, we need to flush such dirty content from d-cache so that
> >> >> >>> >>>>>>>>> the guest will see correct contents of the guest page when the
> >> >> >>> >>>>>>>>> guest MMU is disabled.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> >> >> >>> >>>>>>>>> L3-cache.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> >> >> >>> >>>>>>>>> Signed-off-by: Anup Patel
> >> >> >>> >>>>>>>>> ---
> >> >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h | 2 ++
> >> >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> index efe609c..5129038 100644
> >> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >> >> >>> >>>>>>>>>  	if (!icache_is_aliasing()) {	/* PIPT */
> >> >> >>> >>>>>>>>>  		unsigned long hva = gfn_to_hva(kvm, gfn);
> >> >> >>> >>>>>>>>>  		flush_icache_range(hva, hva + PAGE_SIZE);
> >> >> >>> >>>>>>>>> +		/* Flush d-cache for systems with external caches. */
> >> >> >>> >>>>>>>>> +		__flush_dcache_area((void *) hva, PAGE_SIZE);
> >> >> >>> >>>>>>>>>  	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
> >> >> >>> >>>>>>>>>  		/* any kind of VIPT cache */
> >> >> >>> >>>>>>>>>  		__flush_icache_all();
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the past]
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit the
> >> >> >>> >>>>>>>> data cache on each page mapping. It looks overkill.
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >> >> >>> >>>>>>>> kvmtools knows which bits of the guest memory have been touched,
> >> >> >>> >>>>>>>> and can do a "DC DVAC" on this region.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning in user-space. I am
> >> >> >>> >>>>>>> sure other architectures don't have such cache cleaning in
> >> >> >>> >>>>>>> user-space for KVM.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>> The alternative is to do it in the kernel before running any vcpu -
> >> >> >>> >>>>>>>> but that's not very nice either (have to clean the whole of the
> >> >> >>> >>>>>>>> guest memory, which makes a full dcache clean more appealing).
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> Actually, cleaning the full d-cache by set/way upon first run of a
> >> >> >>> >>>>>>> VCPU was our second option, but the current approach seemed very
> >> >> >>> >>>>>>> simple, hence we went for this.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> If more people vote for a full d-cache clean upon first run of a
> >> >> >>> >>>>>>> VCPU then we should revise this patch.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >> >> >>> >>>>>> boot-tested it on a model, but of course I can't really verify that
> >> >> >>> >>>>>> it fixes your issue.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> As far as I can see, it should fix it without any additional
> >> >> >>> >>>>>> flushing.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Please let me know how it goes.
> >> >> >>> >>>>>
> >> >> >>> >>>>> HCR_EL2.DC=1 means all memory accesses with the Stage1 MMU off are
> >> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
> >> >> >>> >>>>> Outer Write-Back Write-Allocate memory"
> >> >> >>> >>>>>
> >> >> >>> >>>>> HCR_EL2.DC=0 means all memory accesses with the Stage1 MMU off are
> >> >> >>> >>>>> treated as "Strongly-ordered device memory"
> >> >> >>> >>>>>
> >> >> >>> >>>>> Now if a Guest/VM accesses hardware MMIO devices directly (such as
> >> >> >>> >>>>> the VGIC CPU interface) with the MMU off, then MMIO devices will be
> >> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >> >> >>> >>>>
> >> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to be
> >> >> >>> >>>> accessed as device memory.
> >> >> >>> >>>>
> >> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software-emulated
> >> >> >>> >>>>> devices for the Guest/VM, which is not true for KVM ARM or KVM ARM64
> >> >> >>> >>>>> because we use the VGIC.
> >> >> >>> >>>>>
> >> >> >>> >>>>> IMHO, this patch enforces an incorrect memory attribute for the
> >> >> >>> >>>>> Guest/VM when the Stage1 MMU is off.
> >> >> >>> >>>>
> >> >> >>> >>>> See above. My understanding is that HCR.DC controls the default
> >> >> >>> >>>> output of Stage-1, and Stage-2 overrides still apply.
> >> >> >>> >>>
> >> >> >>> >>> You had mentioned that the PAGE_S2_DEVICE attribute was redundant
> >> >> >>> >>> and wanted the guest to decide the memory attribute. In other words,
> >> >> >>> >>> you did not want to enforce any memory attribute in Stage2.
> >> >> >>> >>>
> >> >> >>> >>> Please refer to this patch
> >> >> >>> >>> https://patchwork.kernel.org/patch/2543201/
> >> >> >>> >>
> >> >> >>> >> This patch has never been merged. If you carry on following the
> >> >> >>> >> discussion, you will certainly notice it was dropped for a very good
> >> >> >>> >> reason:
> >> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >> >> >>> >>
> >> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
> >> >> >>> >> they are essential to the patch I sent this morning.
> >> >> >>> >>
> >> >> >>> >> M.
> >> >> >>> >> --
> >> >> >>> >> Fast, cheap, reliable. Pick two.
> >> >> >>> >
> >> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> >> >> >>> > is provided a DMA-capable device in pass-through mode. The reason
> >> >> >>> > being that bootloaders/firmware typically don't enable the MMU, and
> >> >> >>> > such bootloader/firmware will program a pass-through DMA-capable
> >> >> >>> > device without any flushes to guest RAM (because it has not enabled
> >> >> >>> > the MMU).
> >> >> >>> >
> >> >> >>> > A good example of such a device would be a SATA AHCI controller given
> >> >> >>> > to a Guest/VM as direct access (using a SystemMMU), with the Guest
> >> >> >>> > bootloader/firmware accessing this SATA AHCI controller to load
> >> >> >>> > kernel images from a SATA disk. In this situation, we will have to
> >> >> >>> > hack the Guest bootloader/firmware AHCI driver to do explicit flushes
> >> >> >>> > to Guest RAM (because we have HCR_EL2.DC=1).
> >> >> >>>
> >> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> >> >> >>> curiosity: is that a made-up example or something you actually have?
> >> >> >>>
> >> >> >>> Back to square one:
> >> >> >>> Can you please benchmark the various cache cleaning options (global at
> >> >> >>> startup time, per-page on S2 translation fault, and user-space)?
> >> >> >>
> >> >> >> Eh, why is this a more valid argument than the vgic? The device
> >> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
> >> >> >> attributes to configure that memory region as device memory. Why is it
> >> >> >> relevant if the device is DMA-capable in this context?
> >> >> >>
> >> >> >> -Christoffer
> >> >> >
> >> >> > Most ARM bootloaders/firmware run with the MMU off; hence, they will
> >> >> > not do explicit cache flushes when programming a DMA-capable device.
> >> >> > Now if HCR_EL2.DC=1, then Guest RAM updates done by the
> >> >> > bootloader/firmware will not be visible to the DMA-capable device.
> >> >> >
> >> >> > --Anup
> >> >>
> >> >> The approach of flushing d-cache by set/way upon first run of a VCPU
> >> >> will not work because for set/way operations the ARM ARM says: "For
> >> >> set/way operations, and for All (entire cache) operations, the point
> >> >> is defined to be to the next level of caching". In other words,
> >> >> set/way operations work up to the point of unification.
> >> >>
> >> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> >> >> at boot time, which does not work for us when the L3-cache is enabled,
> >> >> so there is no point in adding one more __flush_dcache_all() upon
> >> >> first run of a VCPU in KVM ARM64.
> >> >
> >> > When did anyone suggest using a cache cleaning method that doesn't
> >> > apply to this case? I'm a little confused about your comment here.
> >>
> >> Please refer to the last reply from Marc Z where he says:
> >> "Can you please benchmark the various cache cleaning options (global at
> >> startup time, per-page on S2 translation fault, and user-space)?".
> >>
> >> Rather than doing __flush_dcache_range() on each page in
> >> coherent_icache_guest_page(), we could also flush the entire d-cache
> >> upon first VCPU run. The only issue in flushing d-cache upon first
> >> VCPU run is that we cannot flush d-cache by set/way because, as per
> >> the ARM ARM, all operations by set/way are up to PoU and not PoC. In
> >> the presence of an L3-cache we need to flush d-cache up to PoC, so we
> >> have to use __flush_dcache_range().
> >>
> >> >
> >> >> IMHO, we are left with the following options:
> >> >> 1. Flush all RAM regions of the VCPU using __flush_dcache_range()
> >> >>    upon first run of the VCPU
> >> >
> >> > We could do that, but we have to ensure that no other memory regions can
> >> > be added later. Is this the case? I don't remember.
> >>
> >> Yes, memory regions being added later would be a problem for this option.
> >>
> >> >
> >> >> 2. Implement outer-cache framework for ARM64 and flush all
> >> >>    caches + outer cache (i.e.
L3-cache) upon first run of VCPU
> >> >
> >> > What's the difference between (2) and (1)?
> >>
> >> Linux ARM (i.e. the 32-bit port) has a framework for having outer
> >> caches (i.e. caches that are not part of the CPU but exist as a
> >> separate entity between the CPU and memory for performance
> >> improvement). Using this framework we can flush all CPU caches
> >> and the outer cache.
> >>
>
> > And we don't have such a framework in arm64? But __flush_dcache_range
>
> Yes, for now we don't have an outer-cache framework in arm64.
>
> > does nevertheless flush outer caches as well?
>
> Yes, __flush_dcache_range() flushes the outer cache too.
>
> The __flush_dcache_range() works for us because it flushes (i.e.
> cleans & invalidates) by VA up to PoC. All operations up to PoC on a
> cache line will force eviction of the cache line from all CPU caches
> (i.e. both L1 & L2) and also from the external L3-cache, to ensure
> that RAM has the updated contents of the cache line.
>
> Now, the outer caches (such as an L3-cache) typically provide a
> mechanism to flush the entire L3-cache using L3-cache registers. If
> we have an outer-cache framework then we can flush all caches
> using __flush_dcache_all() + outer cache flush all.
>
> >> >
> >> >> 3. Use an alternate version of flush_icache_range() which will
> >> >>    flush d-cache to PoC instead of PoU. We can also ensure
> >> >>    that the coherent_icache_guest_page() function will be called
> >> >>    upon Stage2 prefetch aborts only.
> >> >
> >> > I'm sorry, but I'm not following this discussion. Are we not mixing a
> >> > discussion along two different axes here? As I see it there are two
> >> > separate issues:
> >> >
> >> > (A) When/Where to flush the memory regions
> >> > (B) How to flush the memory regions, including the hardware method and
> >> >     the kernel software engineering aspect.
> >>
> >> The discussion here is about getting KVM ARM64 working in the presence
> >> of an external L3-cache (i.e. one not part of the CPU). Before starting
> >> a VCPU, user-space typically loads images to guest RAM, so in the
> >> presence of a huge L3-cache (a few MBs) some of the contents of guest
> >> RAM will still be in the L3-cache when the VCPU starts running. The
> >> VCPU runs with the MMU off (i.e. caching off), hence it will bypass
> >> the L3-cache and see incorrect contents. To solve this problem we need
> >> to flush the guest RAM contents before they are accessed for the first
> >> time by the VCPU.
> >>
> > ok, I'm with you that far.
> >
> > But is it also not true that we need to decide between:
> >
> > A.1: Flush the entire guest RAM before running the VCPU
> > A.2: Flush the pages as we fault them in
>
> Yes, that's the decision we have to make.
>
> > And (independently):
> >
> > B.1: Use __flush_dcache_range
> > B.2: Use something else + outer cache framework for arm64
>
> This would be __flush_dcache_all() + outer cache flush all.
>
> > B.3: Use flush_icache_range
>
> Actually, a modified version of flush_icache_range() which uses
> "dc cvac" and not "dc cvau".
>
> > Or do these all interleave somehow? If so, I don't understand why. Can
> > you help?
> >
Hi Anup,

Thanks for the explanation, it's crystal now.

-Christoffer