Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

From: Andy Lutomirski <luto@kernel.org>
To: madvenka@linux.microsoft.com
Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>,
	 Linux API <linux-api@vger.kernel.org>,
	 linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	 Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	 linux-integrity <linux-integrity@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	 LSM List <linux-security-module@vger.kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,  X86 ML <x86@kernel.org>
Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
Date: Tue, 28 Jul 2020 10:31:59 -0700	[thread overview]
Message-ID: <CALCETrVy5OMuUx04-wWk9FJbSxkrT2vMfN_kANinudrDwC4Cig@mail.gmail.com> (raw)
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>

> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>

> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is an non-executable
page.  I can think of at least two alternative approaches, and I'd
like to know the pros and cons.

1. Entirely userspace: a return trampoline would be something like:

1:
pushq %rax
pushq %rbc
pushq %rcx
...
pushq %r15
movq %rsp, %rdi # pointer to saved regs
leaq 1b(%rip), %rsi # pointer to the trampoline itself
callq trampoline_handler # see below

You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed.  The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping.  This will be *much*
faster than trampfd. How much of your use case would it cover?  For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.

2. Use existing kernel functionality.  Raise a signal, modify the
state, and return from the signal.  This is very flexible and may not
be all that much slower than trampfd.

3. Use a syscall.  Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:

pushq %rax
movq __NR_magic_trampoline, %rax
syscall

with some adjustment if the stack slot you're clobbering is important.

Also, will using trampfd cause issues with various unwinders?  I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.

All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code.  This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc.  A cleverly designed JIT interface could function without
seriailization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games.  At its very simplest, this could be:

void *jit_create_code(const void *source, size_t len);

and the result would be a new anonymous mapping that contains exactly
the code requested.  There could also be:

int jittfd_create(...);

that does something similar but creates a memfd.  A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region.  On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page wthout any
synchronization.  One approach would be to start with:

<some code>
0xcc
0xcc
...
0xcc

and to create a whole new page like:

<some code>
<some more code>
0xcc
...
0xcc

so that the only difference is that some code changed to some more
code.  Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs.  This may not work if <some more code> spans a page
boundary.  The #BP fixup would zap the TLB and retry.  Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that fetch I$ fetch
boundaries.  I'm not sure to what extent I$ snooping helps.

--Andy