Date: Wed, 20 Jul 2022 23:13:16 +0200
From: Peter Zijlstra
To: Sami Tolvanen
Cc: Linus Torvalds, Thomas Gleixner, Joao Moreira, LKML,
 the arch/x86 maintainers, Tim Chen, Josh Poimboeuf, "Cooper, Andrew",
 Pawan Gupta, Johannes Wikner, Alyssa Milburn, Jann Horn, "H.J. Lu",
 "Moreira, Joao", "Nuzman, Joseph", Steven Rostedt, "Gross, Jurgen",
 Masami Hiramatsu, Alexei Starovoitov, Daniel Borkmann,
 Peter Collingbourne, Kees Cook
Subject: Re: [patch 00/38] x86/retbleed: Call depth tracking mitigation
References: <87lesqukm5.ffs@tglx>
 <2f7f899cb75b79b08b0662ff4d2cb877@overdrivepizza.com>
 <87fsiyuhyz.ffs@tglx>

On Tue, Jul 19, 2022 at 10:19:18AM -0700, Sami Tolvanen wrote:
> Clang's current CFI implementation is somewhat similar to this. It
> creates separate thunks for address-taken functions and changes
> function addresses in C code to point to the thunks instead.
>
> While this works, it creates painful situations when interacting with
> assembly (e.g. a function address taken in assembly cannot be used
> for indirect calls in C as it doesn't point to the thunk) and needs
> unpleasant hacks when we want to take the actual function address in
> C (i.e. scattering the code with function_nocfi() calls).
>
> I have to agree with Peter on this, I would rather avoid messing with
> function pointers in KCFI to avoid these issues.
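
(For illustration, the thunk situation hand-simulated in plain C --
made-up names, not Clang's actual codegen; the point is that C sees the
thunk's address while asm sees the real one:)

	#include <stdio.h>

	static void foo(void)
	{
		puts("foo");
	}

	/* stand-in for the compiler-generated CFI jump thunk */
	static void foo_thunk(void)
	{
		foo();
	}

	int main(void)
	{
		/* C code taking &foo would really get the thunk... */
		void (*from_c)(void) = foo_thunk;
		/* ...while asm referencing the symbol gets foo itself */
		void (*from_asm)(void) = foo;

		/* the two "addresses of foo" do not compare equal */
		printf("equal: %d\n", from_c == from_asm);
		from_c();
		return 0;
	}
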
It is either this (and I think I can avoid the worst of it; see below)
or growing the indirect callsites to obscure the immediate (as Linus
suggested). There are around 16k indirect callsites in a defconfig-ish
kernel, so growing them isn't too horrible, but it isn't nice either.

The prettiest option to obscure the immediate at the callsite I could
conjure up is something like:

kcfi_caller_linus:
	movl	$0x12345600, %r10d
	movb	$0x78, %r10b
	cmpl	%r10d, -OFFSET(%r11)
	je	1f
	ud2
1:	call	__x86_indirect_thunk_r11

Which comes to around 22 bytes (+5 over the original).

Joao suggested putting part of that in the retpoline thunk like:

kcfi_caller_joao:
	movl	$0x12345600, %r10d
	movb	$0x78, %r10b
	call	__x86_indirect_thunk_cfi

__x86_indirect_thunk_cfi:
	cmpl	%r10d, -OFFSET(%r11)
	je	1f
	ud2
1:	call	1f
	int3
1:	mov	%r11, (%rsp)
	ret
	int3

The only downside there is that eIBRS hardware doesn't need retpolines
(given we currently default to ignoring Spectre-BHB), and as such this
doesn't really work nicely (we don't want to re-introduce funneling).

The other option I came up with, alluded to above, is below; having
written it out, I'm pretty sure I favour just growing the indirect
callsite as per Linus' option above.

Suppose:

indirect_callsite:
	cmpl	$0x12345678, -6(%r11)		# 8
	je	1f				# 2
	ud2					# 2
	call	__x86_indirect_thunk_r11	# 5	(-> .retpoline_sites)

__cfi_\func:
	movl	$0x12345678, %eax		# 5
	int3					# 1
	int3					# 1
\func:						# aligned 16
	endbr					# 4
	nop12					# 12
	call	__fentry__			# 5

	...

And for functions that do not get their address taken:

\func:						# aligned 16
	nop16					# 16
	call	__fentry__			# 5

	...

Instead, extend the objtool .call_sites to also include tail-calls,
and for:

 - regular (!SKL, !IBT) systems:

   * patch all direct calls/jmps to +16 (.call_sites)
   * static_call/ftrace/etc.. can trivially add the +16
   * retpolines can do +16 for the indirect calls
   * return thunks are patched to ret;int3 (.return_sites)

   (indirect calls for eIBRS, which don't use retpolines, simply eat
   the nops)

 - SKL systems:

   * patch the first 16 bytes into:

	nop6
	sarq	$5, PER_CPU_VAR(__x86_call_depth)

   * patch all direct calls to +6 (.call_sites)
   * patch all direct jumps to +16 (.call_sites)
   * static_call/ftrace adjust to +6/+16 depending on instruction type
   * retpolines are split between call/jmp and do +6/+16 resp.
   * return thunks are patched to x86_return_skl (.return_sites)

 - IBT systems:

   * patch the first 16 bytes to:

	endbr				# 4
	xorl	$0x12345678, %r10d	# 7
	je	1f			# 2
	ud2				# 2
	nop				# 1
1:

   * patch the callsites to: (.retpoline_sites)

	movl	$0x12345678, %r10d	# 7
	call	*%r11			# 3
	nop7				# 7

   * patch all the direct calls/jmps to +16 (.call_sites)
   * static_call/ftrace/etc.. add +16
   * return thunks are patched to ret;int3 (.return_sites)

Yes, frobbing the address for static_call/ftrace/etc.. is a bit
horrible, but at least &sym remains exactly that address and not
something magical.

Note: it is possible to shift the __fentry__ call, but that would mean
that we lose alignment or get to carry .call_sites at runtime (and it
is *huge*).
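
For the "+16" bits above: retargeting a direct call from func to
func+16 is just bumping the rel32 of the E8 call instruction. A rough
userspace-style sketch (made-up helper, not the actual kernel patching
code, which would also have to handle jmp/tail-calls and text_poke the
result):

	#include <stdint.h>
	#include <string.h>

	#define CALL_INSN_OPCODE	0xE8	/* direct near call, rel32 */
	#define SKIP_BYTES		16	/* size of the skipped preamble */

	/*
	 * Retarget the direct call at @insn from func to func+16 by
	 * adding 16 to its relative displacement; the target symbol
	 * itself stays where it is.
	 */
	static void patch_call_site(uint8_t *insn)
	{
		int32_t rel;

		if (insn[0] != CALL_INSN_OPCODE)
			return;

		memcpy(&rel, insn + 1, sizeof(rel));	/* rel32 is unaligned */
		rel += SKIP_BYTES;
		memcpy(insn + 1, &rel, sizeof(rel));
	}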