Re: [PATCH v3 0/6] Static calls

From: Nadav Amit <namit@vmware.com>
To: "hpa@zytor.com" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	the arch/x86 maintainers <x86@kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Jason Baron <jbaron@akamai.com>, Jiri Kosina <jkosina@suse.cz>,
	David Laight <David.Laight@aculab.com>,
	Borislav Petkov <bp@alien8.de>, Julia Cartwright <julia@ni.com>,
	Jessica Yu <jeyu@kernel.org>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Edward Cree <ecree@solarflare.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>
Subject: Re: [PATCH v3 0/6] Static calls
Date: Fri, 11 Jan 2019 19:33:30 +0000	[thread overview]
Message-ID: <F39DA47E-BFCA-49F2-B2F4-C19F860CC79F@vmware.com> (raw)
In-Reply-To: <12578A17-E695-4DD5-AEC7-E29FAB2C8322@zytor.com>

> On Jan 11, 2019, at 11:23 AM, hpa@zytor.com wrote:
> 
> On January 11, 2019 11:03:30 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoimboe@redhat.com>
>> wrote:
>>>> Now, in the int3 handler can you take the faulting RIP and search
>> for it in
>>>> the “static-calls” table, writing the RIP+5 (offset) into R10
>> (return
>>>> address) and the target into R11. You make the int3 handler to
>> divert the
>>>> code execution by changing pt_regs->rip to point to a new function
>> that does:
>>>>      push R10
>>>>      jmp __x86_indirect_thunk_r11
>>>> 
>>>> And then you are done. No?
>>> 
>>> IIUC, that sounds pretty much like what Steven proposed:
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.home&amp;data=02%7C01%7Cnamit%40vmware.com%7Ca665a074940b4630e3fc08d677fa753b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828314949874019&amp;sdata=fs2uo%2BjL%2FV3rpzIHJ%2B3QoyHg4KhV%2B%2FUPmUOLpy8S8p8%3D&amp;reserved=0
>>> I liked the idea, BUT, how would it work for callee-saved PV ops?  In
>>> that case there's only one clobbered register to work with (rax).
>> 
>> Actually, there's a much simpler model now that I think about it.
>> 
>> The BP fixup just fixes up %rip to to point to "bp_int3_handler".
>> 
>> And that's just a random text address set up by "text_poke_bp()".
>> 
>> So how about the static call rewriting simply do this:
>> 
>> - for each static call:
>> 
>> 1)   create a fixup code stub that does
>> 
>>       push $returnaddressforTHIScall
>>       jmp targetforTHIScall
>> 
>> 2) do
>> 
>>       on_each_cpu(do_sync_core, NULL, 1);
>> 
>>    to make sure all CPU's see this generated code
>> 
>> 3) do
>> 
>>       text_poke_bp(addr, newcode, newlen, generatedcode);
>> 
>> Ta-daa! Done.
>> 
>> In fact, it turns out that even the extra "do_sync_core()" in #2 is
>> unnecessary, because taking the BP will be serializing on the CPU that
>> takes it, so we can skip it.
>> 
>> End result: the text_poke_bp() function will do the two do_sync_core
>> IPI's that guarantee that by the time it returns, no other CPU is
>> using the generated code any more, so it can be re-used for the next
>> static call fixup.
>> 
>> Notice? No odd emulation, no need to adjust the stack in the BP
>> handler, just the regular "return to a different IP".
>> 
>> Now, there is a nasty special case with that stub, though.
>> 
>> So nasty thing with the whole "generate a stub for each call" case:
>> because it's dynamic and because of the re-use of the stub, you could
>> be in the situation where:
>> 
>> CPU1                  CPU2
>> ----                  ----
>> 
>> generate a stub
>> on_each_cpu(do_sync_core..)
>> text_poke_bp()
>> ...
>> 
>> rewrite to BP
>>                       trigger the BP
>>                       return to the stub
>>                       fun the first instruction of the stub
>>                       *INTERRUPT causes rescheduling*
>> 
>> on_each_cpu(do_sync_core..)
>> rewrite to good instruction
>> on_each_cpu(do_sync_core..)
>> 
>> free or re-generate the stub
>> 
>>                       !! The stub is still in use !!
>> 
>> So that simple "just generate the stub dynamically" isn't so simple
>> after all.
>> 
>> But it turns out that that is really simple to handle too. How do we do
>> that?
>> 
>> We do that by giving the BP handler *two* code sequences, and we make
>> the BP handler pick one depending on whether it is returning to a
>> "interrupts disabled" or "interrupts enabled" case.
>> 
>> So the BP handler does this:
>> 
>> - if we're returning with interrupts disabled, pick the simple stub
>> 
>> - if we're returning with interrupts enabled, clkear IF in the return
>> %rflags, and pick a *slightly* more complex stub:
>> 
>>       push $returnaddressforTHIScall
>>       sti
>>       jmp targetforTHIScall
>> 
>> and now the STI shadow will mean that this sequence is uninterruptible.
>> 
>> So we'd not do complex emulation of the call instruction at BP time,
>> but we'd do that *trivial* change at BP time.
>> 
>> This seems simple, doesn't need any temporary registers at all, and
>> doesn't need any extra stack magic. It literally needs just a trivial
>> sequence in poke_int3_handler().
>> 
>> The we'd change the end of poke_int3_handler() to do something like
>> this instead:
>> 
>>       void *newip = bp_int3_handler;
>>       ..
>>       if (new == magic_static_call_bp_int3_handler) {
>>               if (regs->flags &X86_FLAGS_IF) {
>>                       newip = magic_static_call_bp_int3_handler_sti;
>>                       regs->flags &= ~X86_FLAGS_IF;
>>       }
>>       regs->ip = (unsigned long) newip;
>>       return 1;
>> 
>> AAND now we're *really* done.
>> 
>> Does anybody see any issues in this?
>> 
>>             Linus
> 
> I still don't see why can't simply spin in the #BP handler until the patch is complete.
> 
> We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe. The simplest way to spin it to just IRET; that both serializes and will re-take the exception if the patch is still in progress.
> 
> It requires exactly *no* awareness in the #BP handler, allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction.)
> 
> If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it.

One thing that comes to mind is that text_poke_bp() runs after patching the
int3 and before patching in the instruction:

	on_each_cpu(do_sync_core, NULL, 1);

If IRQs are disabled when the BP is hit, spinning can cause the system to
hang.