From: Ard Biesheuvel <ardb@kernel.org>
To: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Marc Zyngier <maz@kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH] ARM: decompressor: cover BSS in cache clean and reorder with MMU disable on v7
Date: Sun, 24 Jan 2021 16:45:14 +0100
Message-ID: <CAMj1kXEc3e64QweYLf=izp2HHk_aQBriSgiY-8d9Hxp3pZ02ug@mail.gmail.com>
In-Reply-To: <20210124152127.GA1551@shell.armlinux.org.uk>

On Sun, 24 Jan 2021 at 16:21, Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Sun, Jan 24, 2021 at 02:35:31PM +0100, Ard Biesheuvel wrote:
> > So what I think is happening is the following:
> >
> > In v5.7 and before, the set/way operations trap into KVM, which sets
> > another trap bit to ensure that second trap occurs the next time the
> > MMU is disabled. So if any cachelines are allocated after the call to
> > cache_clean_flush(), they will be invalidated again when KVM
> > invalidates the VM's entire IPA space.
> >
> > According to DDI0406C.d paragraph B3.2.1, it is implementation defined
> > whether non-cacheable accesses that occur with MMU/caches disabled may
> > hit in the data cache.
> >
> > So after v5.7, without set/way instructions being issued, the second
> > trap is never set, and so the only cache clean+invalidate that occurs
> > is the one that the decompressor performs itself, and the one that KVM
> > does on the guest's behalf at cache_off() time is omitted. This
> > results in clean cachelines being allocated that shadow the
> > mini-stack, which are hit by the non-cacheable accesses that occur
> > before the kernel proper enables the MMU again.
> >
> > Reordering the clean+invalidate with the MMU/cache disabling prevents
> > the issue, as disabling the MMU and caches first disables any mappings
> > that the cache could perform speculative linefills from, and so the
> > mini-stack memory access cannot be served from the cache.
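
(Schematically, using the decompressor's own routine names - this is
C-shaped pseudocode illustrating the ordering, not the actual head.S
assembly:)

/* Decompressor routines: by-VA clean+invalidate, and MMU/dcache off. */
extern void cache_clean_flush(void);
extern void cache_off(void);

static void old_order(void)
{
	cache_clean_flush();	/* clean lines can be (re)allocated after this... */
	cache_off();		/* ...and still shadow the mini-stack afterwards */
}

static void new_order(void)
{
	cache_off();		/* no mappings left to speculate linefills from */
	cache_clean_flush();	/* so the clean+invalidate now sticks */
}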
>
> This may be part of the story, but it doesn't explain all of the
> observed behaviour.
>
> First, some background...
>
> We have three levels of cache on the Armada 8040 - there are the two
> levels inside the A72 clusters, as designed by Arm Ltd. There is a
> third level, designed by Marvell, that is common to all CPUs and is
> an exclusive cache. This means that if the higher levels of cache
> contain a cache line, the L3 cache will not.
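
(As a toy model of the exclusive property only - nothing below is
Marvell's implementation, and all names are invented; the point is
that a line held in L1/L2 is not also held in L3, and only moves into
L3 when evicted from above.)

#include <stdbool.h>

struct line { bool in_l1l2, in_l3; };

/* A linefill from DDR allocates into L1/L2 only. */
static void fill_from_ddr(struct line *l)
{
	l->in_l1l2 = true;
	l->in_l3 = false;
}

/* Eviction from L1/L2 moves the line down into the exclusive L3. */
static void evict_from_l1l2(struct line *l)
{
	l->in_l1l2 = false;
	l->in_l3 = true;
}

/* An L3 hit migrates the line back up, removing it from L3. */
static void hit_in_l3(struct line *l)
{
	l->in_l3 = false;
	l->in_l1l2 = true;
}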
>
> Next, consider the state leading up to this point inside the guest:
>
> - the decompressor code has been copied, overlapping the BSS and the
>   mini-stack.
> - the decompressor code and data have been cleaned+invalidated (C+I)
>   using the by-MVA instructions. This should push the data out to DDR.
> - the decompressor has run, writing a large amount of data (that being
>   the decompressed kernel image).
>
> At this precise point, where we write to the mini-stack, the data cache
> and MMU are both turned off, but the instruction cache is left enabled.
>
> The action involves writing the following hex values to the
> mini-stack, located at 0x40e69420 - note its alignment:
>
>    ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
>
> It has been observed that immediately after writing, the values read
> back can be incorrect; here are a couple of examples:
>
>    ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30 (1)
>    ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090 (2)
>
> and after v7_invalidate_l1, it always seems to be:
>
>    ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> v7_invalidate_l1 operates by issuing set/way instructions that target
> only the L1 cache - its purpose is to initialise the at-reset undefined
> state of the L1 cache. These invalidates must not target lower level
> caches, since these may contain valid data from other CPUs already
> brought up in the system.
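
(A minimal C sketch of how set/way maintenance is confined to a single
cache level - the real routine is assembly in arch/arm/mm/cache-v7.S
and reads the cache geometry from CCSIDR; the helper names here are
invented:)

#include <stdint.h>

/* DCISW: invalidate one data cache line by set/way (cp15 c7, c6, 2). */
static inline void dcisw(uint32_t val)
{
	__asm__ volatile("mcr p15, 0, %0, c7, c6, 2" : : "r"(val) : "memory");
}

/*
 * Invalidate every line of one data cache level by set/way.
 * level is 0-based (0 == L1), line_shift is log2(line size in bytes),
 * and ways > 1 is assumed. Because the level is encoded in the
 * operand, lower cache levels are never touched.
 */
static void invalidate_dcache_level(uint32_t level, uint32_t sets,
				    uint32_t ways, uint32_t line_shift)
{
	/* The way index occupies the uppermost bits of the operand. */
	uint32_t way_shift = __builtin_clz(ways - 1);

	for (uint32_t way = 0; way < ways; way++)
		for (uint32_t set = 0; set < sets; set++)
			dcisw((way << way_shift) |
			      (set << line_shift) |
			      (level << 1));
}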
>
> To be absolutely clear about these two observed cases:
>
> case 1:
> write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
> read : ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30
> read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> case 2:
> write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
> read : ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090
> read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> If we look at the captured data above, there are a few things to note:
> 1) the point at which we read back wrong data is part way through
>    a cache line.
> 2) case 2 shows only one value is wrong initially, mid-way through the
>    stack.
> 3) after v7_invalidate_l1, it seems that all data is incorrect. This
>    could be a result of the actions of v7_invalidate_l1, or merely
>    due to time passing and there being pressure from other system
>    activity to evict lines from the various levels of caches.
>
> Considering your theory that there are clean cache lines overlapping
> the mini-stack, and that non-cacheable accesses hit those cache lines,
> then the stmia write should hit those cache lines and mark them dirty.
> The subsequent read-back should also hit those cache lines, and return
> consistent data. If the cache lines are evicted back to RAM, then a
> read will not hit any cache lines, and should still return the data
> that was written. Therefore, we should not be seeing any effects at
> all, and the data should be consistent. This does not fit with the
> observations.
>
> Now consider an alternative theory: there are clean cache lines
> overlapping the mini-stack, and non-cacheable accesses do not hit the
> cache lines. This means that the stmia write bypasses the caches and
> hits the RAM directly, and reads would also fetch from the RAM. The
> only way in this case that we would see data change is if the cache
> line were in fact dirty, and it gets written back to RAM between our
> non-cacheable write and a subsequent non-cacheable read. This also does
> not fit the observations, particularly case (2) that I highlight above
> where only _one_ value was seen to be incorrect.
>
> There is another theory along these lines, though: L1 and L2 behave
> differently from L3 with respect to non-cacheable accesses, and when a
> clean cache line is discarded from L1/L2, it is placed in L3 - for
> example, non-cacheable accesses might bypass L1 and L2 but not L3. Now
> we have a _possibility_ to explain this behaviour. Initially, L1/L2
> contain a clean cache line overlapping this area. Accesses initially
> bypass the clean cache line, until it gets evicted into L3, where
> accesses hit it instead. When it gets evicted from L3, as it was clean,
> it doesn't get written back, and we see the in-DDR data. The reverse
> could also be true - L1/L2 could be hit by an uncached access but not
> L3, and I'd suggest similar effects would be possible. However, this
> does not fully explain case (2).
>
> So, I don't think we have a full and proper idea of what is really
> behind this.
>

Fair enough. I don't think we will ever be able to get to the bottom of this.

But I do stand by my conclusion that the likely reason this patch
fixes the issue is that it reinstates the additional cache
clean+invalidate occurring *after* cache_off(), which was previously
carried out on the guest's behalf, due to the way KVM handles set/way
operations and the subsequent disabling of the MMU. I.e.,

kvm_set_way_flush() in arch/arm64/kvm/mmu.c, which is called every
time the guest performs a cache op by set/way:

/*
 * If this is the first time we do a S/W operation
 * (i.e. HCR_TVM not set) flush the whole memory, and set the
 * VM trapping.
 *
 * Otherwise, rely on the VM trapping to wait for the MMU +
 * Caches to be turned off. At that point, we'll be able to
 * clean the caches again.
 */

(so the significance here is *not* that it flushes the whole memory
but that it sets HCR_TVM - this is why the experiment we did with
adding a single set/way op fixed the issue but changing it to another
mcr instruction that is also known to trap to KVM did not make a
difference)
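
For reference, the logic around that comment looks roughly like this
(paraphrased from arch/arm64/kvm/mmu.c as of the v5.x series;
tracepoints omitted, and details may differ between versions):

void kvm_set_way_flush(struct kvm_vcpu *vcpu)
{
	unsigned long hcr = *vcpu_hcr(vcpu);

	if (!(hcr & HCR_TVM)) {
		/* First S/W op: flush the guest's whole IPA space... */
		stage2_flush_vm(vcpu->kvm);
		/* ...and start trapping SCTLR writes (MMU on/off). */
		*vcpu_hcr(vcpu) = hcr | HCR_TVM;
	}
}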

Then, in kvm_toggle_cache(),

/*
 * If switching the MMU+caches on, need to invalidate the caches.
 * If switching it off, need to clean the caches.
 * Clean + invalidate does the trick always.
 */

which is why cache_off() used to result in a full clean+invalidate
under KVM, but not after my by-MVA patch was applied.
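
Again paraphrasing the surrounding code, with the same caveats:

void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/* MMU+caches toggled in either direction: clean+invalidate. */
	if (now_enabled != was_enabled)
		stage2_flush_vm(vcpu->kvm);

	/* Caches back on: stop trapping VM ops until the next S/W op. */
	if (now_enabled)
		*vcpu_hcr(vcpu) &= ~HCR_TVM;
}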

So on the basis that this patch adds back a full cache
clean+invalidate that was inadvertently removed by my previous patch,
I think we should apply this patch. Given that this change can be
traced back to commit 401b368caaecdce1cf8f05bab448172752230cb0, we
should probably include

Fixes: 401b368caaec ("ARM: decompressor: switch to by-VA cache maintenance for v7 cores")

Are you on board with that?

