linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Andy Lutomirski <luto@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH] x86_64: uninline TASK_SIZE
Date: Tue, 23 Apr 2019 13:12:58 +0200	[thread overview]
Message-ID: <20190423111258.GA23410@gmail.com> (raw)
In-Reply-To: <CALCETrWqPStQqv2xCJZLzsqTE-9L9U08CMApWyKD1zwjrrH4ow@mail.gmail.com>


* Andy Lutomirski <luto@kernel.org> wrote:

> > Saving 2KB on a defconfig is quite a lot.
> 
> Saving 2kB of text by adding 8 bytes to thread_info seems rather
> dubious to me.  You only need 256 tasks before you lose.  My
> not-particularly-loaded laptop has 865 tasks right now.

I was suggesting current->task_size or thread_info->task_size as a way to 
100% avoid the function call overhead. Worth a tiny amount of RAM - even 
with 1 million tasks it's only 4MB of RAM. ;-)

Some TASK_SIZE users are prominent syscalls: mmap(), 

> As a general principle, the mere existence of TIF_ADDR32 is a bug. The 
> value of that flag is *wrong* under the 32-bit variant of CRIU. How 
> about instead making some more progress toward getting rid of dubious 
> TASK_SIZE users?  I'm working on a little series to get rid of most of 
> them.  Meanwhile: it sure looks like a large fraction of the users are 
> confused as to whether TASK_SIZE is the highest user address or the 
> lowest non-user address.

I really like that, replacing TASK_SIZE with *nothing* would be even 
faster.

In fact instead of just reducing its usage I'd suggest removing TASK_SIZE 
altogether and renaming TASK_SIZE_MAX back to TASK_SIZE, or something 
like that - the confusion from the deceptively macro-ish naming of 
TASK_SIZE is real.

The original commit of making TASK_SIZE dynamic on the task's compat flag 
was done in:

  84929801e14d: [PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes

Here's the justification given:

    Appended patch will setup compatibility mode TASK_SIZE properly.  This will
    fix atleast three known bugs that can be encountered while running
    compatibility mode apps.
    
    a) A malicious 32bit app can have an elf section at 0xffffe000.  During
       exec of this app, we will have a memory leak as insert_vm_struct() is
       not checking for return value in syscall32_setup_pages() and thus not
       freeing the vma allocated for the vsyscall page.  And instead of exec
       failing (as it has addresses > TASK_SIZE), we were allowing it to
       succeed previously.
    
    b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area
       may return addresses beyond 32bits, ultimately causing corruption
       because of wrap-around and resulting in SEGFAULT, instead of returning
       ENOMEM.
    
    c) 32bit app doing this below mmap will now fail.
    
      mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE,
            MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0);

I believe a) is addressed by getting rid of the vsyscall page - but it 
might also not be a current problem because the vsyscall page has its own 
gate-vma now.

b) shouldn't be an issue if the mmap allocator correctly treats the 
compat bit - this doesn't require generic TASK_SIZE variations though, as 
the mmap allocator is already specific to arch/x86/.

c) is a variant of a) I believe, which should be fixed by now.

I just looked through some of the current TASK_SIZE users, and *ALL* of 
them seem dubious to me, with the exception of the mmap allocators. In 
fact some of them seem to be active bugs:

kernel/:

 - PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE, prctl_set_mm() et al. Ugh, what a 
   nasty prctl()! But besides that, the TASK_SIZE restriction to the ABI 
   looks questionable: if we offer this CRIU functionality then why 
   should it be impossible for a 32-bit CRIU task to switch to 64-bit?

 - kernel/events/core.c: TASK_SIZE_MAX should be a fine filter here, in 
   fact it's probably *wrong* to restrict the profiling data here just 
   because the task happens to be in 32-bit compat mode.

 - kernel/rseq.c: is this TASK_SIZE restriction even required, wouldn't 
   TASK_SIZE_MAX be sufficient?

mm/:

 - GUP's get_get_area() et al looks really weird - why do we special-case 
   vsyscalls:

      - Can we get rid of the vsyscall page in modern kernels?

      - I don't think anyone runs those ancient glibc versions with a 
        fresh kernel anymore - can we start generating a WARN()ing 
        perhaps to see whether there's any complaints?

      - Or at least pretend it doesn't exist in terms of a GUP target page?

 - mm/kasan/generic_report.c:get_wild_bug_type() - this can use 
   TASK_SIZE_MAX just fine IMHO.

 - mm/mempolicy.c:mpol_shared_policy_init() - unsure, but I suspect we 
   can just create the pseudo-vma with a TASK_SIZE_MAX vm_end just fine.

 - mm/mlock.c:mlockall() - I believe it could be considered an outright 
   *bug* if there any pages outside the 32-bit area and don't get mlocked 
   by mlockall, just because this is a compat task. Especially with the 
   CRIU prctl() having 64-bit vmas outside the 32-bit mappings is a real 
   possibility, right? I.e. TASK_SIZE_MAX would be the right solution 
   here.

To turn the argument around: beyond the memory allocators, which includes 
the mmap and huge-mmap variants plus the SysV shmem allocator, can we 
list all the places that absolutely *rely* on TASK_SIZE being TIF_ADDR32 
restricted on compat tasks? I couldn't find any.

So I concur 100% that most TASK_SIZE uses are questionable. In fact think 
84929801e14d was a mistake, and we should effectively revert it 
carefully, by:

 - First by moving almost all TASK_SIZE users over to TASK_SIZE_MAX, 
   analyzing and justifying the impact case by case.

 - Then making the mmap allocators compat compatible (ha) without relying 
   on TASK_SIZE.

 - Renaming TASK_SIZE back to TASK_SIZE_MAX and getting rid of the 
   TASK_SIZE and TASK_SIZE_MAX differentiation.

Or am I missing some complication?

Thanks,

	Ingo

  reply	other threads:[~2019-04-23 11:13 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-21 16:06 Alexey Dobriyan
2019-04-21 18:28 ` Ingo Molnar
2019-04-21 20:07   ` hpa
2019-04-21 21:10     ` Alexey Dobriyan
2019-04-22 10:34       ` Ingo Molnar
2019-04-22 14:30         ` Andy Lutomirski
2019-04-22 22:09           ` Alexey Dobriyan
2019-04-23  0:54             ` Andy Lutomirski
2019-04-23 11:12               ` Ingo Molnar [this message]
2019-04-23 17:21                 ` Andy Lutomirski
2019-04-24 10:38                   ` Ingo Molnar
2019-04-22 22:04         ` Alexey Dobriyan
2019-04-21 22:07   ` [PATCH v2] " Alexey Dobriyan
2019-04-22 22:40 ` [PATCH] " Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190423111258.GA23410@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=adobriyan@gmail.com \
    --cc=bp@alien8.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=x86@kernel.org \
    --subject='Re: [PATCH] x86_64: uninline TASK_SIZE' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).