linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Song Liu <songliubraving@fb.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "rppt@kernel.org" <rppt@kernel.org>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"ast@kernel.org" <ast@kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"Torvalds, Linus" <torvalds@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"song@kernel.org" <song@kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	"pmladek@suse.com" <pmladek@suse.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"dborkman@redhat.com" <dborkman@redhat.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"bp@alien8.de" <bp@alien8.de>, "mbenes@suse.cz" <mbenes@suse.cz>,
	"imbrenda@linux.ibm.com" <imbrenda@linux.ibm.com>
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
Date: Tue, 19 Apr 2022 05:36:45 +0000	[thread overview]
Message-ID: <B20F8051-301C-4DE4-A646-8A714AF8450C@fb.com> (raw)
In-Reply-To: <88eafc9220d134d72db9eb381114432e71903022.camel@intel.com>

Hi Mike, Luis, and Rick,

Thanks for sharing your work and findings in the space. I didn't 
realize we were looking at the same set of problems. 

> On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
>>> There are use-cases that require 4K pages with non-default
>>> permissions in
>>> the direct map and the pages not necessarily should be executable.
>>> There
>>> were several suggestions to implement caches of 4K pages backed by
>>> 2M
>>> pages.
>> 
>> Even if we just focus on the executable side of the story... there
>> may
>> be users who can share this too.
>> 
>> I've gone down memory lane now at least down to year 2005 in kprobes
>> to see why the heck module_alloc() was used. At first glance there
>> are
>> some old comments about being within the 2 GiB text kernel range...
>> But
>> some old tribal knowledge is still lost. The real hints come from
>> kprobe work
>> since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of
>> line
>> - take2"), so that the "For the %rip-relative displacement fixups to
>> be
>> doable"... but this got me wondering, would other users who *do* want
>> similar funcionality benefit from a cache. If the space is limited
>> then
>> using a cache makes sense. Specially if architectures tend to require
>> hacks for some of this to all work.
> 
> Yea, that was my understanding. X86 modules have to be linked within
> 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> to be within 2GB of the kernel text.
> 
> 
> I think of two types of caches we could have: caches of unmapped pages
> on the direct map and caches of virtual memory mappings. Caches of
> pages on the direct map reduce breakage of the large pages (and is
> somewhat x86 specific problem). Caches of virtual memory mappings
> reduce shootdowns, and are also required to share huge pages. I'll plug
> my old RFC, where I tried to work towards enabling both:
> 
> https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
> 
> Since then Mike has taken a lot further the direct map cache piece.

These are really interesting work. With this landed, we won't need 
the bpf_prog_pack work at all (I think). OTOH, this looks like a 
long term project, as some of the work in bpf_prog_pack took quite 
some time to discuss/debate, and that was just a subset of the 
whole thing. 

I really like the two types of cache concept. But there are some 
details I cannot figure out about them:

1. Is "caches of unmapped pages on direct map" (cache #1) 
   sufficient to fix all direct map fragmentation? IIUC, pages in
   the cache may still be used by other allocation (with some 
   memory pressure). If the system runs for long enough, there 
   may be a lot of direct map fragmentation. Is this right?
2. If we have "cache of virtual memory mappings" (cache #2), do we
   still need cache #1? I know cache #2 alone may waste some 
   memory, but I still think 2MB within noise for modern systems. 
3. If we do need both caches, what would be the right APIs? 

Thanks,
Song



> Yea, probably a lot of JIT's are way smaller than a page, but there is
> also hopefully some performance benefit of reduced ITLB pressure and
> TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
> its own cache of a page for putting very small trampolines.
> 
>> 
>> Then, since it seems since the vmalloc area was not initialized,
>> wouldn't that break the old JIT spray fixes, refer to commit
>> 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying
>> attacks")?
> 
> Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> allocator could just text_poke() invalid instructions on "free" of the
> jit.
> 
>> 
>> Is that sort of work not needed anymore? If in doubt I at least made
>> the
>> old proof of concept JIT spray stuff compile on recent kernels [0],
>> but
>> I haven't tried out your patches yet. If this is not needed anymore,
>> why not?
> 
> IIRC this got addressed in two ways, randomizing of the jit offset
> inside the vmalloc allocation, and "constant blinding", such that the
> specific attack of inserting unaligned instructions as immediate
> instruction data did not work. Neither of those mitigations seem
> unworkable with a large page caching allocator.
> 
>> 
>> The collection of tribal knowedge around these sorts of things would
>> be
>> good to not loose and if we can share, even better.
> 
> Totally agree here. I think the abstraction I was exploring in that RFC
> could remove some of the special permission memory tribal knowledge
> that is lurking in in the cross-arch module.c. I wonder if you have any
> thoughts on something like that? The normal modules proved the hardest.
> 


  reply	other threads:[~2022-04-19  5:36 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20220415164413.2727220-1-song@kernel.org>
     [not found] ` <20220415164413.2727220-2-song@kernel.org>
2022-04-15 17:43   ` [PATCH v4 bpf 1/4] vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP Rik van Riel
     [not found] ` <20220415164413.2727220-3-song@kernel.org>
2022-04-15 17:43   ` [PATCH v4 bpf 2/4] page_alloc: use vmalloc_huge for large system hash Rik van Riel
2022-04-25  7:07     ` Geert Uytterhoeven
2022-04-25  8:17       ` Linus Torvalds
2022-04-25  8:24         ` Geert Uytterhoeven
     [not found] ` <20220415164413.2727220-4-song@kernel.org>
2022-04-15 18:06   ` [PATCH v4 bpf 3/4] module: introduce module_alloc_huge Rik van Riel
2022-06-16 16:10   ` Dave Hansen
2022-04-15 19:05 ` [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP Luis Chamberlain
2022-04-16  1:34   ` Song Liu
2022-04-16  1:42     ` Luis Chamberlain
2022-04-16  1:43       ` Luis Chamberlain
2022-04-16  5:08   ` Christoph Hellwig
2022-04-16 19:55     ` Song Liu
2022-04-16 20:30       ` Linus Torvalds
2022-04-16 22:26         ` Song Liu
2022-04-18 10:06           ` Mike Rapoport
2022-04-19  0:44             ` Luis Chamberlain
2022-04-19  1:56               ` Edgecombe, Rick P
2022-04-19  5:36                 ` Song Liu [this message]
2022-04-19 18:42                   ` Mike Rapoport
2022-04-19 19:20                     ` Linus Torvalds
2022-04-20  2:03                       ` Alexei Starovoitov
2022-04-20  2:18                         ` Linus Torvalds
2022-04-20 14:42                           ` Song Liu
2022-04-20 18:28                             ` Luis Chamberlain
2022-04-21  3:25                       ` Nicholas Piggin
2022-04-21  5:48                         ` Linus Torvalds
2022-04-21  6:02                           ` Linus Torvalds
2022-04-21  9:07                             ` Nicholas Piggin
2022-04-21  8:57                           ` Nicholas Piggin
2022-04-21 15:44                             ` Linus Torvalds
2022-04-21 23:30                               ` Nicholas Piggin
2022-04-22  0:49                                 ` Linus Torvalds
2022-04-22  1:51                                   ` Nicholas Piggin
2022-04-22  2:31                                     ` Linus Torvalds
2022-04-22  2:57                                       ` Nicholas Piggin
2022-04-21 15:47                             ` Edgecombe, Rick P
2022-04-21 16:15                               ` Linus Torvalds
2022-04-22  0:12                                 ` Nicholas Piggin
2022-04-22  2:29                                   ` Edgecombe, Rick P
2022-04-22  2:47                                     ` Linus Torvalds
2022-04-22 16:54                                       ` Edgecombe, Rick P
2022-04-22  3:08                                     ` Nicholas Piggin
2022-04-22  4:31                                       ` Nicholas Piggin
2022-04-22 17:10                                         ` Edgecombe, Rick P
2022-04-22 20:22                                           ` Edgecombe, Rick P
2022-04-22  3:33                                     ` Nicholas Piggin
2022-04-21  9:47                           ` Nicholas Piggin
2022-04-19 21:24                 ` Luis Chamberlain
2022-04-19 23:58                   ` Edgecombe, Rick P
2022-04-20  7:58                   ` Petr Mladek
2022-04-19 18:20               ` Mike Rapoport
2022-04-24 17:43       ` Linus Torvalds
2022-04-25  6:48         ` Song Liu
2022-04-21  3:19     ` Nicholas Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=B20F8051-301C-4DE4-A646-8A714AF8450C@fb.com \
    --to=songliubraving@fb.com \
    --cc=Kernel-team@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=ast@kernel.org \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dborkman@redhat.com \
    --cc=edumazet@google.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbenes@suse.cz \
    --cc=mcgrof@kernel.org \
    --cc=pmladek@suse.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rppt@kernel.org \
    --cc=song@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).