From: Kyle Huey
Date: Sat, 3 Dec 2016 07:37:32 -0800
Subject: Re: [PATCH v13 0/8] x86/arch_prctl Add ARCH_[GET|SET]_CPUID for controlling the CPUID instruction
To: Ingo Molnar
Cc: "Robert O'Callahan", Thomas Gleixner, Andy Lutomirski, Ingo Molnar,
    "H. Peter Anvin", "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)",
    Paolo Bonzini, Radim Krčmář, Jeff Dike, Richard Weinberger,
    Alexander Viro, Shuah Khan, Dave Hansen, Borislav Petkov,
    Peter Zijlstra, Boris Ostrovsky, Len Brown, "Rafael J. Wysocki",
    Dmitry Safonov, David Matlack, Nadav Amit, Andi Kleen, open list,
    "open list:USER-MODE LINUX (UML)",
    "open list:FILESYSTEMS (VFS and infrastructure)",
    "open list:KERNEL SELFTEST FRAMEWORK", kvm list

On Fri, Dec 2, 2016 at 2:29 AM, Ingo Molnar wrote:
>
> * Kyle Huey wrote:
>
>> rr (http://rr-project.org/), a userspace record-and-replay reverse-
>> execution debugger, would like to trap and emulate the CPUID instruction.
>> This would allow us to a) mask away certain hardware features that rr does
>> not support (e.g. RDRAND) and b) enable trace portability across machines
>> by providing constant results.
>>
>> Newer Intel CPUs (Ivy Bridge and later) can fault when CPUID is executed at
>> CPL > 0. Expose this capability to userspace as a new pair of arch_prctls,
>> ARCH_GET_CPUID and ARCH_SET_CPUID.
>>
>> Since v12:
>> Patch 4: x86/syscalls/32: Wire up arch_prctl on x86-32
>>   - compat_sys_arch_prctl prototype has argument names.
>
> So while I am fine with the feature, I'm still unconvinced about the
> implementation:
>
> 1)
>
> As I pointed out before, the arbitrary 'code' argument name x86-ism should be
> changed to 'option', like the canonical argument name for core kernel
> prctls().
>
> This is still unfixed.

Yeah, it'll be fixed next time around.

> 2)
>
> As I complained about in my first review, the TIF_NOCPUID flag is too far
> removed from the value that will be written into the MSR.
>
> The result is poor code generation on 64-bit defconfig+CONFIG_PREEMPT=y:
>
>         if (test_tsk_thread_flag(prev_p, TIF_NOCPUID) ^
>             test_tsk_thread_flag(next_p, TIF_NOCPUID)) {
>                 set_cpuid_faulting(test_tsk_thread_flag(next_p, TIF_NOCPUID));
>
> is compiled as:
>
>  476:   49 8b 06                mov    (%r14),%rax
>  479:   49 8b 55 00             mov    0x0(%r13),%rdx
>  47d:   48 c1 e8 0f             shr    $0xf,%rax
>  481:   48 c1 ea 0f             shr    $0xf,%rdx
>  485:   83 e2 01                and    $0x1,%edx
>  488:   83 e0 01                and    $0x1,%eax
>  48b:   38 c2                   cmp    %al,%dl
>  48d:   74 10                   je     49f <__switch_to_xtra+0x9f>
>  48f:   49 8b 7d 00             mov    0x0(%r13),%rdi
>  493:   48 c1 ef 0f             shr    $0xf,%rdi
>  497:   83 e7 01                and    $0x1,%edi
>  49a:   e8 61 fb ff ff          callq  0
>
> ... the first 7 instructions burden all __switch_to_xtra() users, not just the
> faulting-CPUID users.

That's fair. This is certainly suboptimal, and it's suboptimal for
TIF_BLOCKSTEP and TIF_NOTSC too, which generate essentially identical code. A
much better code sequence here would be:

  mov (%r14),%rax
  xor (%r13),%rax
  test $0x80,%ah
  jz <done>
  /* do cpuid faulting work */

We could do this by introducing a test_tsk_thread_flag_differs(...), and
supporting infrastructure, that XORs the flags of the two tasks before doing
the bit test. Once we do that, the non-faulting case is pretty much equivalent
to the mov, mov, cmp, je sequence that would be needed if we stored the MSR
values in the task_struct. The faulting case becomes a straightforward
time-vs-space tradeoff, and I'm inclined to think that calling
set_cpuid_faulting() (which I don't think is as bad as you suggest, see below)
is better than taking up 8 bytes in every task_struct for an uncommon feature.
And, as a bonus, we can improve the TIF_BLOCKSTEP and TIF_NOTSC cases too.
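Concretely, such a helper could look something like this (a rough sketch only;
the name, the exact form, and reaching the flag words via task_thread_info()
are all up for discussion, not a final implementation):

  /*
   * Test whether a thread flag differs between two tasks. XORing the
   * two flag words before the bit test lets the compiler emit a single
   * mov/xor/test sequence instead of extracting each bit separately.
   */
  static inline int test_tsk_thread_flag_differs(struct task_struct *p,
                                                 struct task_struct *q,
                                                 int flag)
  {
          return ((task_thread_info(p)->flags ^
                   task_thread_info(q)->flags) >> flag) & 1;
  }

so that the switch path becomes:

  if (test_tsk_thread_flag_differs(prev_p, next_p, TIF_NOCPUID))
          set_cpuid_faulting(test_tsk_thread_flag(next_p, TIF_NOCPUID));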
> The set_cpuid_faulting() call is also unnecessary, and set_cpuid_faulting()
> compiles into an obscene sequence of:
>
> 0000000000000000 <set_cpuid_faulting>:
>    0:   8b 15 00 00 00 00       mov    0x0(%rip),%edx        # 6
>    6:   55                      push   %rbp
>    7:   48 89 e5                mov    %rsp,%rbp
>    a:   53                      push   %rbx
>    b:   40 0f b6 df             movzbl %dil,%ebx
>    f:   85 d2                   test   %edx,%edx
>   11:   75 07                   jne    1a
>   13:   9c                      pushfq
>   14:   58                      pop    %rax
>   15:   f6 c4 02                test   $0x2,%ah
>   18:   75 48                   jne    62
>   1a:   65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax        # 22
>   21:   00
>   22:   48 83 e0 fe             and    $0xfffffffffffffffe,%rax
>   26:   b9 40 01 00 00          mov    $0x140,%ecx
>   2b:   48 09 d8                or     %rbx,%rax
>   2e:   48 89 c2                mov    %rax,%rdx
>   31:   65 48 89 05 00 00 00    mov    %rax,%gs:0x0(%rip)        # 39
>   38:   00
>   39:   48 c1 ea 20             shr    $0x20,%rdx
>   3d:   0f 30                   wrmsr
>   3f:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   44:   5b                      pop    %rbx
>   45:   5d                      pop    %rbp
>   46:   c3                      retq
>   47:   48 c1 e2 20             shl    $0x20,%rdx
>   4b:   89 c0                   mov    %eax,%eax
>   4d:   bf 40 01 00 00          mov    $0x140,%edi
>   52:   48 09 d0                or     %rdx,%rax
>   55:   31 d2                   xor    %edx,%edx
>   57:   48 89 c6                mov    %rax,%rsi
>   5a:   e8 00 00 00 00          callq  5f
>   5f:   5b                      pop    %rbx
>   60:   5d                      pop    %rbp
>   61:   c3                      retq
>   62:   e8 00 00 00 00          callq  67
>   67:   85 c0                   test   %eax,%eax
>   69:   74 af                   je     1a
>   6b:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 71
>   71:   85 c0                   test   %eax,%eax
>   73:   75 a5                   jne    1a
>   75:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
>   7c:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
>   83:   be b9 00 00 00          mov    $0xb9,%esi
>   88:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
>   8f:   e8 00 00 00 00          callq  94
>   94:   eb 84                   jmp    1a
>   96:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
>   9d:   00 00 00

I don't know why you're getting that. Locally (with gcc 5.4), with
CONFIG_PARAVIRT=y, I have:

0000000000000000 <set_cpuid_faulting>:
   0:   e8 00 00 00 00          callq  5
   5:   55                      push   %rbp
   6:   65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # e
   d:   00
   e:   48 89 d0                mov    %rdx,%rax
  11:   40 0f b6 d7             movzbl %dil,%edx
  15:   48 89 e5                mov    %rsp,%rbp
  18:   48 83 e0 fe             and    $0xfffffffffffffffe,%rax
  1c:   bf 40 01 00 00          mov    $0x140,%edi
  21:   48 09 c2                or     %rax,%rdx
  24:   89 d6                   mov    %edx,%esi
  26:   65 48 89 15 00 00 00    mov    %rdx,%gs:0x0(%rip)        # 2e
  2d:   00
  2e:   48 c1 ea 20             shr    $0x20,%rdx
  32:   ff 14 25 00 00 00 00    callq  *0x0
  39:   5d                      pop    %rbp
  3a:   c3                      retq
  3b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

and with CONFIG_PARAVIRT=n, set_cpuid_faulting() gets inlined into
__switch_to_xtra(), producing:

 4da:   48 8b 03                mov    (%rbx),%rax
 4dd:   49 33 04 24             xor    (%r12),%rax
 4e2:   f6 c4 80                test   $0x80,%ah
 4e5:   0f 85 ee 00 00 00       jne    5d9 <__switch_to_xtra+0x179>
        ...
 5d9:   48 8b 33                mov    (%rbx),%rsi
 5dc:   b9 40 01 00 00          mov    $0x140,%ecx
 5e1:   65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax        # 5e9 <__switch_to_xtra+0x189>
 5e8:   00
 5e9:   48 83 e0 fe             and    $0xfffffffffffffffe,%rax
 5ed:   48 89 c2                mov    %rax,%rdx
 5f0:   48 89 f0                mov    %rsi,%rax
 5f3:   48 c1 e8 0f             shr    $0xf,%rax
 5f7:   83 e0 01                and    $0x1,%eax
 5fa:   48 09 d0                or     %rdx,%rax
 5fd:   48 89 c2                mov    %rax,%rdx
 600:   65 48 89 05 00 00 00    mov    %rax,%gs:0x0(%rip)        # 608 <__switch_to_xtra+0x1a8>
 607:   00
 608:   48 c1 ea 20             shr    $0x20,%rdx
 60c:   0f 30                   wrmsr
 60e:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
 613:   e9 d3 fe ff ff          jmpq   4eb <__switch_to_xtra+0x8b>

both of which are much saner code.
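For reference, all of those sequences are doing is a read-modify-write of bit
0 of MSR 0x140 through a per-cpu shadow copy, so the switch path never has to
rdmsr. Roughly, it is consistent with a function like this (a sketch; the
shadow variable name and the bare 0x140 rather than a named constant are
illustrative):

  /* Per-cpu shadow of MSR 0x140, so we never need to rdmsr it. */
  static DEFINE_PER_CPU(u64, msr_misc_features_shadow);

  static void set_cpuid_faulting(bool on)
  {
          u64 msrval;

          msrval = this_cpu_read(msr_misc_features_shadow);
          msrval &= ~1ULL;        /* clear the CPUID-faulting enable bit */
          msrval |= on;           /* set it again if requested */
          this_cpu_write(msr_misc_features_shadow, msrval);
          wrmsrl(0x140, msrval);
  }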
> The affected object file code size blows up as well, by 17%:
>
>   arch/x86/kernel/process.o:
>
>    text    data     bss     dec     hex filename
>    3325    8577      32   11934    2e9e process.o.before
>    3889    8609      32   12530    30f2 process.o.after
>
> A good deal of this overhead and complexity comes from the implementation
> inefficiency I pointed out, and all this can be avoided with the method I
> suggested in my previous review, by caching the per-task MSR value in the
> thread struct.

It's 12% here, where .before is before any of my changes, .as-is is with these
patches as submitted, and .after is with the changes described above:

  size process.o*
     text    data     bss     dec     hex filename
     4669    8521      96   13286    33e6 process.o.after
     4685    8521      96   13302    33f6 process.o.as-is
     4197    8506      96   12799    31ff process.o.before

> So sorry, NAK for this implementation - especially considering how relatively
> straightforward the changes I suggested are to implement.

Would these proposed changes satisfy you? Obviously I want to get this into
the kernel or I wouldn't be here, so if you insist on caching the MSR in the
task_struct I'll do it, but I think this is at least as good an approach.

- Kyle