All of lore.kernel.org
 help / color / mirror / Atom feed
From: Luis Chamberlain <mcgrof@kernel.org>
To: Song Liu <songliubraving@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	"David S. Miller" <davem@davemloft.net>,
	Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>,
	Kees Cook <keescook@chromium.org>, Song Liu <song@kernel.org>,
	bpf <bpf@vger.kernel.org>, Christoph Hellwig <hch@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	lkml <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"rick.p.edgecombe@intel.com" <rick.p.edgecombe@intel.com>,
	"linux-modules@vger.kernel.org" <linux-modules@vger.kernel.org>
Subject: Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup
Date: Tue, 12 Jul 2022 12:04:29 -0700	[thread overview]
Message-ID: <Ys3FvYnASr2v9iPc@bombadil.infradead.org> (raw)
In-Reply-To: <E23B6EB1-AFFA-4B65-963E-B44BA0F2142D@fb.com>

On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> > On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> > I believe you are mentioning requiring text_poke() because the way
> > eBPF code uses the module_alloc() is different. Correct me if I'm
> > wrong, but from what I gather is you use the text_poke_copy() as the data
> > is already RO+X, contrary module_alloc() use cases. You do this since your
> > bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> > module_alloc() and before you can use this memory. This is a different type
> > of allocator. And, again please correct me if I'm wrong but now you want to
> > share *one* 2 MiB huge-page for multiple BPF programs to help with the
> > impact of TLB misses.
> 
> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
> much value on top of current module_alloc(). 

Thanks for the clarification.

> > A vmalloc_ro_exec() by definition would imply a text_poke().
> > 
> > Can kprobes, ftrace and modules use it too? It would be nice
> > so to not have to deal with the loose semantics on the user to
> > have to use set_vm_flush_reset_perms() on ro+x later, but
> > I think this can be addressed separately on a case by case basis.
> 
> I am pretty confident that kprobe and ftrace can share huge pages with 
> BPF programs.

Then wonderful, we know where to go in terms of a new API then as it
can be shared in the future for sure and there are gains.

> I haven't looked into all the details with modules, but 
> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
> possible.

Sure.

> Once this is done, a regular system (without huge BPF program or huge
> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
> and bpf programs. 

That would be nice, if possible, however modules will require likely its
own thing, on my system I see about 57 MiB used on coresize alone.

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
60001272

And so perhaps we need such a pool size to be configurable.

> > But a vmalloc_ro_exec() with a respective free can remove the
> > requirement to do set_vm_flush_reset_perms().
> 
> Removing the requirement to set_vm_flush_reset_perms() is the other
> reason to go directly to vmalloc_ro_exec(). 

Yes fantastic.

> My current version looks like this:
> 
> void *vmalloc_exec(unsigned long size);
> void vfree_exec(void *ptr, unsigned int size);
> 
> ro is eliminated as there is no rw version of the API. 

Alright.

I am not sure if 2 MiB will suffice given what I mentioned above, and
what to do to ensure this grows at a reasonable pace. Then, at least for
usage for all architectures since not all will support text_poke() we
will want to consider a way to make it easy to users to use non huge
page fallbacks, but that would be up to those users, so we can wait for
that.

> The ugly part is @size for vfree_exec(). We need it to share huge 
> pages. 

I suppose this will become evident during patch review.

> Under the hood, it looks similar to current bpf_prog_pack_alloc
> and bpf_prog_pack_free. 

Groovy.

  Luis

  reply	other threads:[~2022-07-12 19:28 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-07 22:35 [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 1/5] module: introduce module_alloc_huge Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 2/5] bpf: use module_alloc_huge for bpf_prog_pack Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 3/5] vmalloc: WARN for set_vm_flush_reset_perms() on huge pages Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 4/5] vmalloc: introduce huge_vmalloc_supported Song Liu
2022-07-07 22:35 ` [PATCH v6 bpf-next 5/5] bpf: simplify select_bpf_prog_pack_size Song Liu
2022-07-07 22:59 ` [PATCH v6 bpf-next 0/5] bpf_prog_pack followup Luis Chamberlain
2022-07-07 23:52   ` Song Liu
2022-07-08  0:53     ` Luis Chamberlain
2022-07-08  1:36       ` Song Liu
2022-07-08 15:58         ` Luis Chamberlain
2022-07-08 19:58           ` Song Liu
2022-07-08 22:24             ` Luis Chamberlain
2022-07-09  1:14               ` Song Liu
2022-07-12  4:18                 ` Luis Chamberlain
2022-07-12  4:24                   ` Luis Chamberlain
2022-07-12  5:49                   ` Song Liu
2022-07-12 19:04                     ` Luis Chamberlain [this message]
2022-07-12 23:12                       ` Song Liu
2022-07-12 23:42                         ` Luis Chamberlain
2022-07-13  1:00                           ` Song Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ys3FvYnASr2v9iPc@bombadil.infradead.org \
    --to=mcgrof@kernel.org \
    --cc=Kernel-team@fb.com \
    --cc=anil.s.keshavamurthy@intel.com \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=davem@davemloft.net \
    --cc=hch@infradead.org \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-modules@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=naveen.n.rao@linux.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=song@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.