Re: [PATCH v2 1/2] MIPS: ptrace: disallow setting watchpoints in kernel address space

From: James Hogan <james.hogan@imgtec.com>
To: "Maciej W. Rozycki" <macro@imgtec.com>
Cc: Marcin Nowakowski <marcin.nowakowski@imgtec.com>,
	Ralf Baechle <ralf@linux-mips.org>, <linux-mips@linux-mips.org>
Subject: Re: [PATCH v2 1/2] MIPS: ptrace: disallow setting watchpoints in kernel address space
Date: Tue, 24 Jan 2017 22:05:54 +0000	[thread overview]
Message-ID: <20170124220554.GM29015@jhogan-linux.le.imgtec.org> (raw)
In-Reply-To: <alpine.DEB.2.00.1701241909540.13564@tp.orcam.me.uk>

[-- Attachment #1: Type: text/plain, Size: 5026 bytes --]

On Tue, Jan 24, 2017 at 08:52:22PM +0000, Maciej W. Rozycki wrote:
> On Tue, 24 Jan 2017, James Hogan wrote:
> 
> > >  Can't for example the low-level exception handling entry/exit code be 
> > > moved out of the way of the EVA overlap range and then all watchpoints 
> > > masked for the duration of kernel mode execution?  This would be quite 
> > > expensive, however it could only be executed if a task flag indicates 
> > > watchpoints are being used.
> > 
> > That doesn't cover data watches. RAM would still need accessing, e.g. to
> > save/restore the watch state from the thread context, or even to read
> > the task flag, and stack accesses in C code.
> 
>  All the critical data structures would have to be outside the EVA 
> overlap.

This in itself is awkward. If a SoC supports multiple RAM sizes, e.g.
up to 2GB, you might want a single EVA kernel that could support both.
Normally you could just go with a 2GB compatible layout (for the sake of
argument, lets say RAM cached @ kernel VA 0x40000000 .. 0xBFFFFFFF,
ignoring BEV overlays for now), but if less than 1GB is fitted then none
of that RAM is outside of the user overlap range.

> 
> > The only safe way for it to work would be to somehow disable or inhibit 
> > watchpoints before clearing EXL, and re-enable them after setting EXL, 
> > though you'd still get a loop of deferred watchpoints if it hit on the 
> > way out to user mode unless cleared at the last moment before ERET.
> 
>  Ah, I forgot about CP0.Cause.WP -- is it not enough to clear the bit to 
> have any pending exception cancelled?  If so (and the architecture manual 
> is actually clear that it is) then it looks like we have a solution and we 
> don't have to place any code or data specially, although it'll have to be 
> carefully coded.
> 
> > > Alternatively perhaps we could clobber 
> > > CP0.EntryHi.ASID, at least temporarily; that would be cheaper.
> > 
> > Kernel mode still needs to access the user address space.
> 
>  Sure, that's why it would have to be temporary.  Low-level exception 
> entry/exit code is not supposed to have a need to access user memory.

Ah I see, really temporary.

> 
>  So we can put aside a certain ASID value, say 0 (for easy pasting with 
> INS from $0), and never use it for a real context.  Then it can be cleared 
> right away at the general exception entry point if EVA is in use, say:
> 
> 	<d>mfc0	$k0, $10
> 	<d>ins	$k0, $zero, 0, 10
> 	<d>mtc0	$k0, $10
> 
> (there'll be a hazard here, but we can clear it later on if needed).  
> There is no neeed to save the old ASID as we can retrieve the original 
> from our data structures.
> 
>  Then we can proceed with the usual switch to the kernel mode, switching 
> stacks, saving registers, etc.  We can then check CP0.Cause.WP and stash 
> it away for further processing if needed (though discarding it would I 
> think be the usual if not only choice) and clear, with a hazard barrier, 
> right before CP0.Status.EXL is cleared.
> 
>  Now that we're in the regular kernel mode, with ASID still set to 0, we 
> can check if process tracing has been enabled and if so, then iterate over 
> the watchpoints registers masking them all.  At this point we can restore 
> the correct ASID in CP0.EntryHi and proceed with the exception handler.
> 
>  And then we'd do the reverse in the exception epilogue, only restoring 
> the ASID as the last instruction before final ERET.
> 
> > Alternatively we could set WatchHi.ASID to a reserved one, and only
> > clear/set the WatchHi.G bit (to bypass the ASID match) at the first/last
> > moment while EXL=1. It still wouldn't protect against code watches
> > around there exposing the kernel address of that code by the resulting
> > hang though, so would need to move the ebase out of the overlap range
> > too (which would have to be platform specific).
> 
>  You'd still have to iterate over all WatchHi registers, a variable number 
> up to 8 architecturally, which I think would be too expensive for the 
> common exception path.
> 
>  Poking at ASID as I described above is just a couple of instructions at 
> entry and exit, and the rest would only be done if tracing is active.  
> Plus you don't actually have to move anything away, except from the final 
> ERET, though likely not even that, owing to the delayed nature of an ASID 
> update.

Probably, so long as you ignore QEMU.

> 
>  So can you find a flaw in my proposal so far?

not a functional one.

> We'll have to think about 
> the TLB refill handler yet though.

A deferred watch from refill handler (e.g. page tables) would I think
trigger an immediate watch exception on eret, and get cleared / ignored.
It would probably make enough of a timing difference for userland to
reliably detect (in order to probe where the process' page tables are in
kernel virtual memory, to be able to mount a more successful attack
given some other vulnerability).

Cheers
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]