Re: [GIT PULL] x86/mm for 6.2

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Dave Hansen <dave.hansen@intel.com>
Cc: kirill.shutemov@linux.intel.com,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org, x86@kernel.org,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Borislav Petkov <bp@alien8.de>
Subject: Re: [GIT PULL] x86/mm for 6.2
Date: Thu, 15 Dec 2022 14:46:11 -0800
Message-ID: <CAHk-=whKuB=mno0a5i9g7hPGdKhz3d5DErTZZGs3FjMW4ap4GA@mail.gmail.com>
In-Reply-To: <242daeb2-b96b-d0dd-5597-ebf5fb2dfeca@intel.com>

On Thu, Dec 15, 2022 at 1:53 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Back in the MPX days, we had some users pop up and tell us that MPX
> wasn't working for them on certain threads.  Those threads ended up
> having been clone()'d even _before_ MPX got enabled which was via some
> pre-main() startup code that the compiler inserted.  Since these early
> threads' FPU state was copied before MPX setup was done, they never got
> MPX-enabled.

Yeah, I can see that happening, but I do think:

> Maybe we call those "early thread" folks too crazy to get LAM.

I think we should at least *start* that way.

Yes, I can imagine some early LD linkage thing deciding "this early
dynamic linking is very expensive, I will thread it".

I personally think that's a bit crazy - if your early linking is that
expensive, you have some serious design issues - but even if it does
happen, I'd rather start with very strict rules, see if that works
out, and then if we hit problems we have other alternatives.

Those other alternatives could involve relaxing the rules later and
saying "ok, we'll allow you to enable LAM even in this case, because
you only did Xyz".

But another alternative could also be to just have that LAM enabled
even earlier by adding an ELF flag, so that the address masking is
essentially set up at execve() time.

And independently of that, there's another issue: I suspect you want
to separate the concept of "the kernel will mask virtual addresses in
certain system calls" from "the CPU will ignore those masked bits".

And I say that because that's basically what arm64 does.

On arm64, the 'untagged_addr()' masking is always enabled, but whether
the actual CPU hardware ignores the tag bits on a memory access is
a separate thing that you need to do the prctl() for.

Put another way: you can always pass tagged addresses to things like
mmap() and friends, *even if* those tagged addresses would cause a
fault when accessed normally, and wouldn't work for "read()" and
"write()" and friends (because read/write uses regular CPU accesses).
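
A minimal user-space sketch of that separation, not kernel code. It
assumes an 8-bit tag in the top byte and untags by sign-extending from
bit 55, which is how arm64's untagged_addr() is commonly described. All
names here (untag_sketch, syscall_sees, cpu_access_ok) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT 56
#define TAG_MASK  (0xFFULL << TAG_SHIFT)   /* assumed: tag lives in bits 63:56 */

/* Sign-extend from bit 55, done with explicit masks to stay portable. */
static uint64_t untag_sketch(uint64_t addr)
{
	if (addr & (1ULL << 55))
		return addr | TAG_MASK;    /* kernel-half address: fill with ones */
	return addr & ~TAG_MASK;           /* user-half address: clear the tag */
}

/* What an mmap()-like path would look at: always the untagged address. */
static uint64_t syscall_sees(uint64_t addr)
{
	return untag_sketch(addr);
}

/*
 * Whether a plain CPU load/store through 'addr' would work in this toy
 * model: only if the hardware ignores the tag, or there is no tag at all.
 */
static bool cpu_access_ok(uint64_t addr, bool hw_ignores_tag)
{
	return hw_ignores_tag || addr == untag_sketch(addr);
}
```

In this model a tagged pointer is fine for syscall_sees() regardless of
the hardware setting, while cpu_access_ok() depends entirely on the
separate hw_ignores_tag switch.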

Now, the Intel LAM model is more complicated, because the mask is
dynamic, and because there are people who want to use the full linear
address space with no masking.

But the whole "want to use the full linear address space" issue is
actually in some ways closer to the "use five-level page tables"
decision - in that by *default* we don't use VA space above 47 bits, and
you have to basically actively *ask* for that 5-level address space.

And I think the LAM model might want to take that same approach: maybe
by *default*, untagged_addr() always does the masking, and x86 takes
the arm64 approach.

And then the truly *special* apps that want unmasked addresses are the
ones who mark themselves special, kind of the same way they already
have to if they want 5-level paging.
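
As a rough illustration of "mask by default, opt out if you are
special", here is a user-space sketch with a per-process mask. It is
not the kernel implementation: struct mm_sketch and every function name
are hypothetical, and the default of clearing bits 62:57 is an
assumption modeled on the LAM_U57 tag layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process state; not a real kernel structure. */
struct mm_sketch {
	uint64_t untag_mask;   /* address bits that survive untagging */
};

/* Build a mask that clears bits hi..lo (inclusive) and keeps the rest. */
static uint64_t clear_bits_mask(unsigned int hi, unsigned int lo)
{
	assert(lo <= hi && hi < 64);
	unsigned int width = hi - lo + 1;
	uint64_t field = (width == 64) ? ~0ULL : ((1ULL << width) - 1) << lo;
	return ~field;
}

/* Default: masking is on (assumed tag bits 62:57). */
static void mm_init_default(struct mm_sketch *mm)
{
	mm->untag_mask = clear_bits_mask(62, 57);
}

/* The "special" app asks for the full address space: no masking. */
static void mm_opt_out(struct mm_sketch *mm)
{
	mm->untag_mask = ~0ULL;
}

/* The dynamic equivalent of untagged_addr(): one AND with the mask. */
static uint64_t untag(const struct mm_sketch *mm, uint64_t addr)
{
	return addr & mm->untag_mask;
}
```

A process that never opts out gets its tag bits stripped on every
untag() call; one that calls mm_opt_out() sees its addresses passed
through unchanged.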

And again: untagged_addr() doing masking does *not* necessarily mean
that the CPU does. I think arm64 does actually set the TBI bit by
default, but there's very much a prctl() to enable/disable it. But
that "TBI is disabled" doesn't actually mean that the mmap() address
masking is disabled.

I dunno. I'm really just saying "look, arm64 does this very
differently, and it seems to work fine there".

                    Linus