Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

From: Daniel Axtens <dja@axtens.net>
To: Mark Rutland <mark.rutland@arm.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com, linux-mm@kvack.org, x86@kernel.org,
	glider@google.com, luto@kernel.org, linux-kernel@vger.kernel.org,
	dvyukov@google.com, christophe.leroy@c-s.fr,
	linuxppc-dev@lists.ozlabs.org, gor@linux.ibm.com
Subject: Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory
Date: Mon, 28 Oct 2019 12:26:23 +1100	[thread overview]
Message-ID: <87eeyx8xts.fsf@dja-thinkpad.axtens.net> (raw)
In-Reply-To: <20191016132233.GA46264@lakrids.cambridge.arm.com>

Hi Mark and Andrey,

I've spent some quality time with the barrier documentation and
all of your emails.

I'm still trying to puzzle out the barrier. The memory model
documentation doesn't talk about how synchronisation works when a
page-table walk is involved, so that's making things hard. However, I
think I have something for the spurious fault case. Apologies for the
length, and for any mistakes!

I am assuming here that the poison and zeros and PTEs are correctly
being stored and we're just concerned about whether an architecturally
correct load can cause a spurious fault on x86.

> There is the risk (as laid out in [1]) that CPU 1 attempts to hoist the
> loads of the shadow memory above the load of the PTE, samples a stale
> (faulting) status from the TLB, then performs the load of the PTE and
> sees a valid value. In this case (on arm64) a spurious fault could be
> taken when the access is architecturally performed.
>
> It is possible on arm64 to use a barrier here to prevent the spurious
> fault, but this is not smp_read_barrier_depends(), as that does nothing
> for everyone but alpha. On arm64 We have a spurious fault handler to fix
> this up.

Will's email has the following example:

	CPU 0				CPU 1
	-----				-----
	spin_lock(&lock);		spin_lock(&lock);
	set_fixmap(0, paddr, prot);	if (mapped)
	mapped = true;				foo = *fix_to_virt(0);
	spin_unlock(&lock);		spin_unlock(&lock);

If I understand the following properly, it's because of a quirk in
ARM, the translation of fix_to_virt(0) can escape outside the lock:

>   DDI0487E_a, B2-125:
> 
>   | DMB and DSB instructions affect reads and writes to the memory system
>   | generated by Load/Store instructions and data or unified cache maintenance
>   | instructions being executed by the PE. Instruction fetches or accesses
>   | caused by a hardware translation table access are not explicit accesses.
> 
> which appears to claim that the DSB alone is insufficient. Unfortunately,
> some CPU designers have followed the second clause above, whereas in Linux
> we've been relying on the first. This means that our mapping sequence:
> 
> 	MOV	X0, <valid pte> 
> 	STR	X0, [Xptep]	// Store new PTE to page table
> 	DSB	ISHST
> 	LDR	X1, [X2]	// Translates using the new PTE
> 
> can actually raise a translation fault on the load instruction because the
> translation can be performed speculatively before the page table update and
> then marked as "faulting" by the CPU. For user PTEs, this is ok because we
> can handle the spurious fault, but for kernel PTEs and intermediate table
> entries this results in a panic().

So the DSB isn't sufficient to stop the CPU speculating the
_translation_ above the page table store - to do that you need an
ISB. [I'm not an ARM person so apologies if I've butchered this!] Then
the load then uses the speculated translation and faults.

So, do we need to do something to protect ourselves against the case of
these sorts of spurious faults on x86? I'm also not an x86 person, so
again apologies in advance if I've butchered anything.

Firstly, it's not trivial to get a fixed address from the vmalloc
infrastructure - you have to do something like
__vmalloc_node_range(size, align, fixed_start_address, fixed_start_address + size, ...)
I don't see any callers doing that. But we press on just in case.

Section 4.10.2.3 of Book 3 of the Intel Developers Manual says:

 | The processor may cache translations required for prefetches and for
 | accesses that are a result of speculative execution that would never
 | actually occur in the executed code path.

That's all it says, it doesn't say if it will cache a negative or
faulting lookup in the speculative case. However, if you _could_ cache
a negative result, you'd hope the documentation on when to invalidate
would tell you. That's in 4.10.4.

4.10.4.3 Optional Invalidations includes:

 | The read of a paging-structure entry in translating an address being
 | used to fetch an instruction may appear to execute before an earlier
 | write to that paging-structure entry if there is no serializing
 | instruction between the write and the instruction fetch. Note that
 | the invalidating instructions identified in Section 4.10.4.1 are all
 | serializing instructions.

That only applies to _instruction fetch_, not data fetch. There's no
corresponding dot point for data fetch, suggesting that data fetches
aren't subject to this.

Lastly, arch/x86's native_set_pte_at() performs none of the extra
barriers that ARM does - this also suggests to me that this isn't a
concern on x86. Perhaps page-table walking for data fetches is able to
snoop the store queues, and that's how they get around it.

Given that analysis, that x86 has generally strong memory ordering, and
the lack of response to Will's email from x86ers, I think we probably do
not need a spurious fault handler on x86. (Although I'd love to hear
from any actual x86 experts on this!) Other architecture enablement will
have to do their own analysis.

As I said up top, I'm still puzzling through the smp_wmb() discussion
and I hope to have something for that soon.

Regards,
Daniel

>
> Thanks,
> Mark.
>
> [1] https://lore.kernel.org/linux-arm-kernel/20190827131818.14724-1-will@kernel.org/
> [2] https://lore.kernel.org/linux-mm/20191014152717.GA20438@lakrids.cambridge.arm.com/