From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Ingo Molnar <mingo@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	x86@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Arnd Bergmann <arnd@arndb.de>,
	"H. Peter Anvin" <hpa@zytor.com>, Andi Kleen <ak@linux.intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Andy Lutomirski <luto@amacapital.net>,
	linux-arch@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC, PATCHv1 00/28] 5-level paging
Date: Fri, 9 Dec 2016 13:37:22 +0300
Message-ID: <20161209103722.GE30380@node.shutemov.name>
In-Reply-To: <20161209050130.GC2595@gmail.com>

On Fri, Dec 09, 2016 at 06:01:30AM +0100, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> 
> > x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
> > of physical address space. We are already bumping into this limit: some
> > vendors offer servers with 64 TiB of memory today.
> > 
> > To overcome the limitation, upcoming hardware will introduce support for
> > 5-level paging[1]. It is a straightforward extension of the current page
> > table structure, adding one more layer of translation.
> > 
> > It bumps the limits to 128 PiB of virtual address space and 4 PiB of
> > physical address space. This "ought to be enough for anybody" ©.
> > 
> > This patchset is still very early. There are a number of things missing
> > that we have to do before asking anyone to merge it (listed below).
> > It would be great if folks can start testing applications now (in QEMU) to
> > look for breakage.
> > Any early comments on the design or the patches would be appreciated as
> > well.
> > 
> > More details on the design and what’s left to implement are below.
> 
> The patches don't look too painful, so no big complaints from me - kudos!

Thanks.

> > There is still work to do:
> > 
> >   - Boot-time switch between 4- and 5-level paging.
> > 
> >     We assume that distributions will be keen to avoid returning to the
> >     i386 days where we shipped one kernel binary for each page table
> >     layout.
> 
> Absolutely.
> 
> >     As the page table format is the same for 4- and 5-level paging, it
> >     should be possible to have a single kernel binary and switch between
> >     them at boot time without too much hassle.
> > 
> >     For now I have only implemented a compile-time switch.
> > 
> >     I hope to bring this feature in a separate patchset once basic
> >     enabling is upstream.
> > 
> >     Is it okay?
> 
> LGTM, but we would eventually want to convert this kind of crazy open coding:
> 
>         pgd_t *pgd, *pgd_ref;
>         p4d_t *p4d, *p4d_ref;
>         pud_t *pud, *pud_ref;
>         pmd_t *pmd, *pmd_ref;
>         pte_t *pte, *pte_ref;
> 
> To something saner that iterates and navigates the page table hierarchy in an 
> extensible fashion. That would also make it (much) easier to make the paging depth 
> boot time switchable.

Yes, it would be nice to replace all these p??_t with something more
flexible. But there's no obviously right design for such a transition.

I would rather not tie it to the boot-time switch for paging, but have
a separate experimental patchset. One day...
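
To illustrate what "iterating the hierarchy" could mean, here is a minimal
sketch, not taken from the patchset, that computes the per-level table
indices in a loop instead of naming each level. The helpers table_index()
and walk_indices() are hypothetical, and the sketch assumes 4 KiB pages and
9 index bits per level, as on x86-64.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* 4 KiB pages */
#define BITS_PER_LEVEL   9  /* 512 entries per table on x86-64 */

/*
 * Index into the table at `level` for virtual address `va`.
 * Level 0 is the leaf (PTE) table; level `levels - 1` is the top (PGD).
 * Hypothetical helper for illustration, not a kernel API.
 */
static unsigned int table_index(uint64_t va, int level, int levels)
{
	assert(levels == 4 || levels == 5);
	assert(level >= 0 && level < levels);
	return (va >> (PAGE_SHIFT + BITS_PER_LEVEL * level)) &
	       ((1u << BITS_PER_LEVEL) - 1);
}

/* Fill idx[0..levels-1], top level first, in the order a walker visits them. */
static void walk_indices(uint64_t va, int levels, unsigned int *idx)
{
	for (int i = 0; i < levels; i++)
		idx[i] = table_index(va, levels - 1 - i, levels);
}
```

With the depth passed as a parameter, the same loop serves both 4- and
5-level paging. A real walker would also have to handle huge pages and
missing entries at each level, which this sketch ignores.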

> Somehow I'm quite certain we'll see requests for more than 4 PiB memory in our 
> lifetimes.
> 
> In a decade or two once global warming really gets going, especially after Trump & 
> Republicans & Old Energy implement their billionaire welfare policies to mine, 
> sell and burn even more coal & oil without paying for the damage caused, the U.S. 
> meteorology clusters tracking Category 6 hurricanes in the Atlantic (capable of 1+ 
> trillion dollars damage) in near real time at 1 meter resolution will have to run 
> on something capable, right?
> 
> >   - Handle opt-in wider address space for userspace.
> > 
> >     Not all userspace is ready to handle addresses wider than the current
> >     47 bits. At least some JIT compilers make use of the upper bits to
> >     encode their own information.
> > 
> >     We need to have an interface to opt in to wider addresses from
> >     userspace to avoid regressions.
> > 
> >     For now, I've included a testing-only patch which bumps TASK_SIZE to
> >     56 bits. This can be handy for testing to see what breaks if we max
> >     out the size of the virtual address space.
> 
> So this is just a detail - but it sounds a bit limiting to me to provide an 'opt 
> in' flag for something that will work just fine on the vast majority of 64-bit 
> software.
> 
> Please make this an opt out compatibility flag instead: similar to how we handle 
> address space layout limitations/quirks ABI details, such as ADDR_LIMIT_32BIT, 
> ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.

Well, it's true that most userspace can handle wide addresses just fine.
But even by simply booting Fedora in QEMU I see one SIGSEGV for this
reason: libmozjs-17.0.so cannot handle it (polkitd is linked with it, hell
knows why).

I think keeping software from crashing is kind of a priority in this
transition.
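
As an illustration of the failure mode, here is a minimal sketch of a
pointer-tagging scheme that assumes user addresses fit in 47 bits. It is a
hypothetical scheme, not taken from libmozjs or any specific JIT; the names
box(), unbox_addr() and round_trips() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PTR_BITS 47
#define PTR_MASK ((UINT64_C(1) << PTR_BITS) - 1)

/* Pack a tag into bits 47..63, assuming the address fits in 47 bits. */
static uint64_t box(uint64_t addr, uint64_t tag)
{
	assert(tag < (UINT64_C(1) << (64 - PTR_BITS)));
	return (tag << PTR_BITS) | (addr & PTR_MASK);
}

/* Recover the address by dropping everything above bit 46. */
static uint64_t unbox_addr(uint64_t boxed)
{
	return boxed & PTR_MASK;
}

/* Nonzero if the address survives a box/unbox round trip. */
static int round_trips(uint64_t addr, uint64_t tag)
{
	return unbox_addr(box(addr, tag)) == addr;
}
```

An address below 2^47 round-trips, while one handed out from a wider
address space silently loses its upper bits, and the first dereference of
the truncated pointer can then fault.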

Beyond that, most software would not benefit much from a large virtual
address space. Okay, there are more bits for ASLR, but that's it.

On the other hand, a large virtual address space would put more pressure on
the cache -- at least one more page table per process, if we make 56-bit VA
the default.
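
For reference, the sizes quoted in this thread follow from the address
widths. The sketch below assumes 4 KiB pages and 9 index bits per
page-table level; the physical widths of 46 and 52 bits are derived from
the 64 TiB and 4 PiB figures above. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12  /* 4 KiB base pages */
#define BITS_PER_LEVEL   9  /* 512 eight-byte entries per 4 KiB table */

/* Virtual address width covered by `levels` levels of page tables. */
static unsigned int va_bits(unsigned int levels)
{
	return PAGE_SHIFT + BITS_PER_LEVEL * levels;
}

/* Size in TiB of an address space that is `bits` wide (40 <= bits <= 63). */
static uint64_t size_in_tib(unsigned int bits)
{
	assert(bits >= 40 && bits < 64);
	return UINT64_C(1) << (bits - 40);
}
```

Four levels give 48 bits (256 TiB) and five levels give 57 bits
(131072 TiB, i.e. 128 PiB); a 56-bit user address space is 64 PiB.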

-- 
 Kirill A. Shutemov
