linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "Torvalds, Linus" <torvalds@linux-foundation.org>
Cc: "songliubraving@fb.com" <songliubraving@fb.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"hch@infradead.org" <hch@infradead.org>,
	"ast@kernel.org" <ast@kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"Kernel-team@fb.com" <Kernel-team@fb.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"song@kernel.org" <song@kernel.org>,
	"pmladek@suse.com" <pmladek@suse.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"dborkman@redhat.com" <dborkman@redhat.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"bp@alien8.de" <bp@alien8.de>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"mbenes@suse.cz" <mbenes@suse.cz>,
	"npiggin@gmail.com" <npiggin@gmail.com>,
	"imbrenda@linux.ibm.com" <imbrenda@linux.ibm.com>
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
Date: Fri, 22 Apr 2022 16:54:52 +0000	[thread overview]
Message-ID: <e00b452952e7aaef0d94bc25a32261aafeeff7ea.camel@intel.com> (raw)
In-Reply-To: <CAHk-=wg+xn7WbSEb1boSCj+AEUwwAGmXf5Hvb0822BHyBwRoDw@mail.gmail.com>

On Thu, 2022-04-21 at 19:47 -0700, Linus Torvalds wrote:
> I don't disagree, but I think the real problem is that the whole "oen
> page_order per vmalloc() area" itself is a bit broken.

Yea. It is the main reason it has to round up to huge page sizes
AFAICT. I'd really like to see it use memory a little more efficiently
if it is going to be an opt-out thing again.

> 
> For example, AMD already does this "automatic TLB size" thing for
> when
> you have multiple contiguous PTE entries (shades of the old alpha
> "page size hint" thing, except it's automatic and doesn't have
> explicit hints).
> 
> And I'm hoping Intel will do something similar in the future.
> 
> End result? It would actually be really good to just map contiguous
> pages, but it doesn't have anything to do with the 2MB PMD size.
> 
> And there's no "fixed order" needed either. If you have mapping that
> is 17 pages in size, it would still be good to allocate them as a
> block of 16 pages ("page_order = 4") and as a single page, because
> just laying them out in the page tables that way will already allow
> AMD to use a 64kB TLB entry for that 16-page block.
> 
> But it would also work to just do the allocations as a set of 8, 4, 4
> and 1.

Hmm, that's neat.

> 
> But the whole "one page order for one vmalloc" means that doesn't
> work
> very well.
> 
> Where I disagree (violently) with Nick is his contention that (a)
> this
> is x86-specific and (b) this is somehow trivial to fix.
> 
> Let's face it - the current code is broken. I think the sub-page
> issue
> is not entirely trivial, and the current design isn't even very good
> for it.
> 
> But the *easy* cases are the ones that simply don't care - the ones
> that powerpc has actually been testing.
> 
> So for 5.18, I think it's quite likely reasonable to re-enable
> large-page vmalloc for the easy case (ie those big hash tables).
> 
> Re-enabling it *all*, considering how broken it has been, and how
> little testing it has clearly gotten? And potentially not enabling it
> on x86 because x86 is so much better at showing issues? That's not
> what I want to do.
> 
> If the code is so broken that it can't be used on x86, then it's too
> broken to be enabled on powerpc and s390 too. Never mind that those
> architectures might have so limited use that they never realized how
> broken they were..

I think there is another cross-arch issue here that we shouldn't lose
sight of. There are not enough warnings in the code about the
assumptions made on the arch's. The other issues are x86 specific in
terms of who gets affected in rc1, but I dug up this prophetic
assessment:

https://lore.kernel.org/lkml/4488d39f-0698-7bfd-b81c-1e609821818f@intel.com/

That is pretty much what happened. Song came along and, in its current
state, took it as a knob that could just be flipped. Seems pretty
reasonable that it could happen again.

So IMHO, the other general issue is the lack of guard rails or warnings
for the next arch that comes along. Probably VM_FLUSH_RESET_PERMS
should get some warnings as well.

I kind of like the idea in that thread of making functions or configs
for arch's to be forced to declare they have specific properties. Does
it seem reasonable at this point? Probably not necessary as a 5.18 fix.


  reply	other threads:[~2022-04-22 16:55 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20220415164413.2727220-1-song@kernel.org>
     [not found] ` <20220415164413.2727220-2-song@kernel.org>
2022-04-15 17:43   ` [PATCH v4 bpf 1/4] vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP Rik van Riel
     [not found] ` <20220415164413.2727220-3-song@kernel.org>
2022-04-15 17:43   ` [PATCH v4 bpf 2/4] page_alloc: use vmalloc_huge for large system hash Rik van Riel
2022-04-25  7:07     ` Geert Uytterhoeven
2022-04-25  8:17       ` Linus Torvalds
2022-04-25  8:24         ` Geert Uytterhoeven
     [not found] ` <20220415164413.2727220-4-song@kernel.org>
2022-04-15 18:06   ` [PATCH v4 bpf 3/4] module: introduce module_alloc_huge Rik van Riel
2022-06-16 16:10   ` Dave Hansen
2022-04-15 19:05 ` [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP Luis Chamberlain
2022-04-16  1:34   ` Song Liu
2022-04-16  1:42     ` Luis Chamberlain
2022-04-16  1:43       ` Luis Chamberlain
2022-04-16  5:08   ` Christoph Hellwig
2022-04-16 19:55     ` Song Liu
2022-04-16 20:30       ` Linus Torvalds
2022-04-16 22:26         ` Song Liu
2022-04-18 10:06           ` Mike Rapoport
2022-04-19  0:44             ` Luis Chamberlain
2022-04-19  1:56               ` Edgecombe, Rick P
2022-04-19  5:36                 ` Song Liu
2022-04-19 18:42                   ` Mike Rapoport
2022-04-19 19:20                     ` Linus Torvalds
2022-04-20  2:03                       ` Alexei Starovoitov
2022-04-20  2:18                         ` Linus Torvalds
2022-04-20 14:42                           ` Song Liu
2022-04-20 18:28                             ` Luis Chamberlain
2022-04-21  3:25                       ` Nicholas Piggin
2022-04-21  5:48                         ` Linus Torvalds
2022-04-21  6:02                           ` Linus Torvalds
2022-04-21  9:07                             ` Nicholas Piggin
2022-04-21  8:57                           ` Nicholas Piggin
2022-04-21 15:44                             ` Linus Torvalds
2022-04-21 23:30                               ` Nicholas Piggin
2022-04-22  0:49                                 ` Linus Torvalds
2022-04-22  1:51                                   ` Nicholas Piggin
2022-04-22  2:31                                     ` Linus Torvalds
2022-04-22  2:57                                       ` Nicholas Piggin
2022-04-21 15:47                             ` Edgecombe, Rick P
2022-04-21 16:15                               ` Linus Torvalds
2022-04-22  0:12                                 ` Nicholas Piggin
2022-04-22  2:29                                   ` Edgecombe, Rick P
2022-04-22  2:47                                     ` Linus Torvalds
2022-04-22 16:54                                       ` Edgecombe, Rick P [this message]
2022-04-22  3:08                                     ` Nicholas Piggin
2022-04-22  4:31                                       ` Nicholas Piggin
2022-04-22 17:10                                         ` Edgecombe, Rick P
2022-04-22 20:22                                           ` Edgecombe, Rick P
2022-04-22  3:33                                     ` Nicholas Piggin
2022-04-21  9:47                           ` Nicholas Piggin
2022-04-19 21:24                 ` Luis Chamberlain
2022-04-19 23:58                   ` Edgecombe, Rick P
2022-04-20  7:58                   ` Petr Mladek
2022-04-19 18:20               ` Mike Rapoport
2022-04-24 17:43       ` Linus Torvalds
2022-04-25  6:48         ` Song Liu
2022-04-21  3:19     ` Nicholas Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e00b452952e7aaef0d94bc25a32261aafeeff7ea.camel@intel.com \
    --to=rick.p.edgecombe@intel.com \
    --cc=Kernel-team@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=ast@kernel.org \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dborkman@redhat.com \
    --cc=edumazet@google.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbenes@suse.cz \
    --cc=mcgrof@kernel.org \
    --cc=npiggin@gmail.com \
    --cc=pmladek@suse.com \
    --cc=rppt@kernel.org \
    --cc=song@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).