From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756571AbcDGOkN (ORCPT ); Thu, 7 Apr 2016 10:40:13 -0400
Received: from mail-oi0-f53.google.com ([209.85.218.53]:35732 "EHLO
        mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1756516AbcDGOjr (ORCPT );
        Thu, 7 Apr 2016 10:39:47 -0400
MIME-Version: 1.0
In-Reply-To: <57064E6C.2030202@virtuozzo.com>
References: <1459960170-4454-1-git-send-email-dsafonov@virtuozzo.com>
        <1459960170-4454-2-git-send-email-dsafonov@virtuozzo.com>
        <57064E6C.2030202@virtuozzo.com>
From: Andy Lutomirski
Date: Thu, 7 Apr 2016 07:39:26 -0700
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to
        change compatible mode
To: Dmitry Safonov
Cc: Thomas Gleixner, Dmitry Safonov <0x7f454c46@gmail.com>, Dave Hansen,
        Ingo Molnar, Shuah Khan, Borislav Petkov, X86 ML,
        khorenko@virtuozzo.com, Andrew Morton, xemul@virtuozzo.com,
        linux-kselftest@vger.kernel.org, Cyrill Gorcunov,
        "linux-kernel@vger.kernel.org", "H. Peter Anvin"
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Apr 7, 2016 5:12 AM, "Dmitry Safonov" wrote:
>
> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>
>> [cc Dave Hansen for MPX]
>>
>> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" wrote:
>>>
>>> Now each process that runs natively on x86_64 may execute 32-bit code
>>> by properly setting its CS selector: either from LDT or reuse Linux's
>>> USER32_CS. The vice-versa is also valid: running 64-bit code in
>>> compatible task is also possible by choosing USER_CS.
>>> So we may switch between 32 and 64 bit code execution in any process.
>>> Linux will choose the right syscall numbers in entries for those
>>> processes. But it still will consider them native/compat by the
>>> personality, that elf loader set on launch.
>>> This affects, e.g., the ptrace syscall on those tasks:
>>> PTRACE_GETREGSET will return 64/32-bit regset
>>> according to process's mode (that's how strace detects task's
>>> personality from 4.8 version).
>>>
>>> This patch adds arch_prctl calls for x86 that make it possible to tell
>>> Linux kernel in which mode the application is running currently.
>>> Mainly, this is needed for CRIU: restoring compatible & native
>>> applications both from 64-bit restorer. By that reason I wrapped all
>>> the code in CONFIG_CHECKPOINT_RESTORE.
>>> This patch solves also a problem for running 64-bit code in 32-bit elf
>>> (and reverse), that you have only 32-bit elf vdso for fast syscalls.
>>> When switching between native <-> compat mode by arch_prctl, it will
>>> remap needed vdso binary blob for target mode.
>>
>> General comments first:
>
> Thanks for your comments.
>>
>> You forgot about x32.
>
> Will add x32 support for v2.
>
>> I think that you should separate vdso remapping from "personality".
>> vdso remapping should be available even on native 32-bit builds, which
>> means that either you can't use arch_prctl for it or you'll have to
>> wire up arch_prctl as a 32-bit syscall.
>
> I can't say I got your point. Do you mean by vdso remapping
> mremap for vdso/vvar pages? I think, it should work now.

For 32-bit, the vdso *must* exist in memory at the address that the
kernel thinks it's at. Even if you had a pure 32-bit restore stub,
you would still need vdso remap, because there's a chance the vdso
could land at an unusable address, say one page off from where you
want it. You couldn't map a wrapper because there wouldn't be any
space for it without moving the real vdso out of the way.

Remember, you *cannot* mremap() the 32-bit vdso because you will
crash. It works by luck for 64-bit, but it's plausible that we'd want
to change that some day. (I have awful patches that speed a bunch of
things up at the cost of a vdso trampoline for 64-bit code and a bunch
of other hacks.
Those patches will never go in for real, but something else might
want the ability to use 64-bit vdso trampolines.)

> I did remapping for vdso as blob for native x86_64 task differs
> to compatible task. So it's just changing blobs, address value
> is there for convenience - I may omit it and just remap
> different vdso blob at the same place where was previous vdso.
> I'm not sure, why do we need possibility to map 64-bit vdso blob
> on native 32-bit builds?

That would fail, but I think the API should exist. But a native
32-bit program should be able to remap the 32-bit vdso.

IOW, I think you should be able to do, roughly:

map_new_vdso(VDSO_32BIT, addr);

on any kernel.

Am I making sense?

>
>> For "personality", someone needs to enumerate all of the various things
>> that try to track bitness and see how many of them even make sense.
>> On brief inspection:
>>
>> - TIF_IA32: affects signal format and does something to ptrace. I
>> suspect that whatever it does to ptrace is nonsensical, and I don't
>> know whether we're stuck with it.
>>
>> - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
>> isn't even done in a sensible way).
>>
>> - is_64bit_mm affects MPX and uprobes.
>>
>> On even more brief inspection:
>>
>> - uprobes using is_64bit_mm is buggy.
>>
>> - I doubt that having TASK_SIZE vary serves any purpose. Does anyone
>> know why TASK_SIZE is different for different tasks? It would save
>> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>> - Using TIF_IA32 for signal processing is IMO suboptimal. Instead,
>> we should record which syscall installed the signal handler and use
>> the corresponding frame format.
>
> Oh, I like it, will do.
>
>> - Using TIF_IA32 of the *target* for ptrace is nonsense. Having
>> strace figure out syscall type using that is actively buggy, and I ran
>> into that bug a few days ago and cursed at it. strace should inspect
>> TS_COMPAT (I don't know how, but that's what should happen).
>> We may be stuck with this for ABI reasons.
>
> ptrace may check seg_32bit for code selector, what do you think?

Not sure. I have never fully wrapped my head around ptrace.
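
As an illustration of the vdso point earlier in this message: a process
can ask where the kernel thinks its vdso is by reading AT_SYSINFO_EHDR
from the auxiliary vector, and the ELF class byte of that image tells
whether it is the 32-bit or the 64-bit blob. This is only a minimal
sketch, assuming Linux with glibc's getauxval(). The helper names
elf_bitness() and current_vdso_bitness() are made up for this example;
they are not kernel or libc interfaces, and this is not the
map_new_vdso() API suggested above.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/*
 * Hypothetical helper: classify an ELF image by its e_ident bytes.
 * Returns 32 or 64 for a recognised ELF class, 0 otherwise.
 */
static int elf_bitness(const unsigned char *ident)
{
	if (ident == NULL || memcmp(ident, ELFMAG, SELFMAG) != 0)
		return 0;

	switch (ident[EI_CLASS]) {
	case ELFCLASS32:
		return 32;
	case ELFCLASS64:
		return 64;
	default:
		return 0;
	}
}

/*
 * Hypothetical helper: report the bitness of the vdso the kernel
 * advertised to this process.  Stores the advertised base address in
 * *base_out (if non-NULL).  Returns 0 if no vdso is advertised or the
 * header is not a recognisable ELF header.
 */
static int current_vdso_bitness(unsigned long *base_out)
{
	/* AT_SYSINFO_EHDR is the address the kernel recorded for the vdso. */
	unsigned long base = getauxval(AT_SYSINFO_EHDR);

	if (base_out != NULL)
		*base_out = base;
	if (base == 0)
		return 0;

	return elf_bitness((const unsigned char *)base);
}
```

The auxv value is a snapshot taken at exec time, so it only says where
the kernel placed the vdso, not where userspace may have moved it
since. That is the same mismatch between the mapping and the kernel's
idea of its address that makes mremap() of the 32-bit vdso unsafe.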