All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
       [not found] ` <20211116071347.520327-3-songliubraving@fb.com>
@ 2021-11-16  8:00   ` Peter Zijlstra
  2021-11-17 21:36     ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-16  8:00 UTC (permalink / raw)
  To: Song Liu
  Cc: x86, bpf, netdev, tglx, mingo, bp, dave.hansen, ast, daniel,
	andrii, kernel-team

On Mon, Nov 15, 2021 at 11:13:42PM -0800, Song Liu wrote:
> These allow setting ro/x for module_alloc() mapping, while leave the
> linear mapping rw/nx.

This needs a very strong rationale for *why*. How does this not
trivially circumvent W^X ?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-16  8:00   ` [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias Peter Zijlstra
@ 2021-11-17 21:36     ` Song Liu
  2021-11-17 22:01       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-11-17 21:36 UTC (permalink / raw)
  To: Peter Zijlstra, Johannes Weiner
  Cc: the arch/x86 maintainers, bpf, netdev, tglx, mingo, bp,
	dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 16, 2021, at 12:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Nov 15, 2021 at 11:13:42PM -0800, Song Liu wrote:
>> These allow setting ro/x for module_alloc() mapping, while leave the
>> linear mapping rw/nx.
> 
> This needs a very strong rationale for *why*. How does this not
> trivially circumvent W^X ?

In this case, we want to have multiple BPF programs sharing the 2MB page. 
When the JIT engine is working on one program, we would rather existing
BPF programs on the same page stay on RO+X mapping (the module_alloc() 
address). The solution in this version is to let the JIT engine write to 
the page via linear address. 

An alternative is to only use the module_alloc() address, and flip the 
read-only bit (of the whole 2MB page) back and forth. However, this 
requires some serialization among different JIT jobs. 

Johannes also noticed that set_memory_[ro|x] for kernel modules and BPF 
programs causes splitting the 1GB linear mapping. This leads to visible 
performance degradation in the tests. CC'ing him for more details on this. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-17 21:36     ` Song Liu
@ 2021-11-17 22:01       ` Peter Zijlstra
  2021-11-17 23:57         ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-17 22:01 UTC (permalink / raw)
  To: Song Liu
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team

On Wed, Nov 17, 2021 at 09:36:27PM +0000, Song Liu wrote:
> 
> 
> > On Nov 16, 2021, at 12:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Mon, Nov 15, 2021 at 11:13:42PM -0800, Song Liu wrote:
> >> These allow setting ro/x for module_alloc() mapping, while leave the
> >> linear mapping rw/nx.
> > 
> > This needs a very strong rationale for *why*. How does this not
> > trivially circumvent W^X ?
> 
> In this case, we want to have multiple BPF programs sharing the 2MB page. 
> When the JIT engine is working on one program, we would rather existing
> BPF programs on the same page stay on RO+X mapping (the module_alloc() 
> address). The solution in this version is to let the JIT engine write to 
> the page via linear address. 
> 
> An alternative is to only use the module_alloc() address, and flip the 
> read-only bit (of the whole 2MB page) back and forth. However, this 
> requires some serialization among different JIT jobs. 

Neither options seem acceptible to me as they both violate W^X.

Please have a close look at arch/x86/kernel/alternative.c:__text_poke()
for how we modify active text. I think that or something very similar is
the only option. By having an alias in a special (user) address space
that is not accessible by any other CPU, only the poking CPU can expoit
this (temporary) hole, which is a much larger ask than any of the
proposed options.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-17 22:01       ` Peter Zijlstra
@ 2021-11-17 23:57         ` Song Liu
  2021-11-18  0:11           ` Song Liu
  2021-11-18  7:54           ` Peter Zijlstra
  0 siblings, 2 replies; 13+ messages in thread
From: Song Liu @ 2021-11-17 23:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 17, 2021, at 2:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 17, 2021 at 09:36:27PM +0000, Song Liu wrote:
>> 
>> 
>>> On Nov 16, 2021, at 12:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Mon, Nov 15, 2021 at 11:13:42PM -0800, Song Liu wrote:
>>>> These allow setting ro/x for module_alloc() mapping, while leave the
>>>> linear mapping rw/nx.
>>> 
>>> This needs a very strong rationale for *why*. How does this not
>>> trivially circumvent W^X ?
>> 
>> In this case, we want to have multiple BPF programs sharing the 2MB page. 
>> When the JIT engine is working on one program, we would rather existing
>> BPF programs on the same page stay on RO+X mapping (the module_alloc() 
>> address). The solution in this version is to let the JIT engine write to 
>> the page via linear address. 
>> 
>> An alternative is to only use the module_alloc() address, and flip the 
>> read-only bit (of the whole 2MB page) back and forth. However, this 
>> requires some serialization among different JIT jobs. 
> 
> Neither options seem acceptible to me as they both violate W^X.
> 
> Please have a close look at arch/x86/kernel/alternative.c:__text_poke()
> for how we modify active text. I think that or something very similar is
> the only option. By having an alias in a special (user) address space
> that is not accessible by any other CPU, only the poking CPU can expoit
> this (temporary) hole, which is a much larger ask than any of the
> proposed options.

I would agree that __text_poke() is a safer option. But in this case, we 
will need the temporary hole to be 2MB in size. Also, we will probably 
hold the temporary mapping for longer time (the whole JITing process). 
Does this sound reasonable?

Thanks,
Song


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-17 23:57         ` Song Liu
@ 2021-11-18  0:11           ` Song Liu
  2021-11-18  7:54           ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Song Liu @ 2021-11-18  0:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 17, 2021, at 3:57 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Nov 17, 2021, at 2:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Wed, Nov 17, 2021 at 09:36:27PM +0000, Song Liu wrote:
>>> 
>>> 
>>>> On Nov 16, 2021, at 12:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Mon, Nov 15, 2021 at 11:13:42PM -0800, Song Liu wrote:
>>>>> These allow setting ro/x for module_alloc() mapping, while leave the
>>>>> linear mapping rw/nx.
>>>> 
>>>> This needs a very strong rationale for *why*. How does this not
>>>> trivially circumvent W^X ?
>>> 
>>> In this case, we want to have multiple BPF programs sharing the 2MB page. 
>>> When the JIT engine is working on one program, we would rather existing
>>> BPF programs on the same page stay on RO+X mapping (the module_alloc() 
>>> address). The solution in this version is to let the JIT engine write to 
>>> the page via linear address. 
>>> 
>>> An alternative is to only use the module_alloc() address, and flip the 
>>> read-only bit (of the whole 2MB page) back and forth. However, this 
>>> requires some serialization among different JIT jobs. 
>> 
>> Neither options seem acceptible to me as they both violate W^X.
>> 
>> Please have a close look at arch/x86/kernel/alternative.c:__text_poke()
>> for how we modify active text. I think that or something very similar is
>> the only option. By having an alias in a special (user) address space
>> that is not accessible by any other CPU, only the poking CPU can expoit
>> this (temporary) hole, which is a much larger ask than any of the
>> proposed options.
> 
> I would agree that __text_poke() is a safer option. But in this case, we 
> will need the temporary hole to be 2MB in size. Also, we will probably 
> hold the temporary mapping for longer time (the whole JITing process). 
> Does this sound reasonable?

Actually, the hole is probably not always 2MB in size. But it could be up 
to 2MB in size. 

Song


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-17 23:57         ` Song Liu
  2021-11-18  0:11           ` Song Liu
@ 2021-11-18  7:54           ` Peter Zijlstra
  2021-11-18 17:16             ` Song Liu
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-18  7:54 UTC (permalink / raw)
  To: Song Liu
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team

On Wed, Nov 17, 2021 at 11:57:12PM +0000, Song Liu wrote:

> I would agree that __text_poke() is a safer option. But in this case, we 
> will need the temporary hole to be 2MB in size. Also, we will probably 
> hold the temporary mapping for longer time (the whole JITing process). 
> Does this sound reasonable?

No :-)

Jit to a buffer, then copy the buffer into the 2M page using 4k aliases.
IIRC each program is still smaller than a single page, right? So at no
point do you need more than 2 pages mapped anyway.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-18  7:54           ` Peter Zijlstra
@ 2021-11-18 17:16             ` Song Liu
  2021-11-18 18:28               ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-11-18 17:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 17, 2021, at 11:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Nov 17, 2021 at 11:57:12PM +0000, Song Liu wrote:
> 
>> I would agree that __text_poke() is a safer option. But in this case, we 
>> will need the temporary hole to be 2MB in size. Also, we will probably 
>> hold the temporary mapping for longer time (the whole JITing process). 
>> Does this sound reasonable?
> 
> No :-)
> 
> Jit to a buffer, then copy the buffer into the 2M page using 4k aliases.
> IIRC each program is still smaller than a single page, right? So at no
> point do you need more than 2 pages mapped anyway.

JITing to a separate buffer adds complexity to the JIT process, as we 
need to redo some offsets before the copy to match the final location of 
the program. I don't have much experience with the JIT engine, so I am
not very sure how much work it gonna be. 

The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
instructions (BPF instructions). So it could easily go beyond a few 
pages. Mapping the 2MB page all together should make the logic simpler. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-18 17:16             ` Song Liu
@ 2021-11-18 18:28               ` Peter Zijlstra
  2021-11-18 18:39                 ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-18 18:28 UTC (permalink / raw)
  To: Song Liu
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team

On Thu, Nov 18, 2021 at 05:16:24PM +0000, Song Liu wrote:
> 
> 
> > On Nov 17, 2021, at 11:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Wed, Nov 17, 2021 at 11:57:12PM +0000, Song Liu wrote:
> > 
> >> I would agree that __text_poke() is a safer option. But in this case, we 
> >> will need the temporary hole to be 2MB in size. Also, we will probably 
> >> hold the temporary mapping for longer time (the whole JITing process). 
> >> Does this sound reasonable?
> > 
> > No :-)
> > 
> > Jit to a buffer, then copy the buffer into the 2M page using 4k aliases.
> > IIRC each program is still smaller than a single page, right? So at no
> > point do you need more than 2 pages mapped anyway.
> 
> JITing to a separate buffer adds complexity to the JIT process, as we 
> need to redo some offsets before the copy to match the final location of 
> the program. I don't have much experience with the JIT engine, so I am
> not very sure how much work it gonna be. 

You're going to have to do that anyway if you're going to write to the
directmap while executing from the alias.

> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
> instructions (BPF instructions). So it could easily go beyond a few 
> pages. Mapping the 2MB page all together should make the logic simpler. 

Then copy it in smaller chunks I suppose.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-18 18:28               ` Peter Zijlstra
@ 2021-11-18 18:39                 ` Song Liu
  2021-11-18 18:58                   ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-11-18 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 18, 2021, at 10:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Nov 18, 2021 at 05:16:24PM +0000, Song Liu wrote:
>> 
>> 
>>> On Nov 17, 2021, at 11:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Wed, Nov 17, 2021 at 11:57:12PM +0000, Song Liu wrote:
>>> 
>>>> I would agree that __text_poke() is a safer option. But in this case, we 
>>>> will need the temporary hole to be 2MB in size. Also, we will probably 
>>>> hold the temporary mapping for longer time (the whole JITing process). 
>>>> Does this sound reasonable?
>>> 
>>> No :-)
>>> 
>>> Jit to a buffer, then copy the buffer into the 2M page using 4k aliases.
>>> IIRC each program is still smaller than a single page, right? So at no
>>> point do you need more than 2 pages mapped anyway.
>> 
>> JITing to a separate buffer adds complexity to the JIT process, as we 
>> need to redo some offsets before the copy to match the final location of 
>> the program. I don't have much experience with the JIT engine, so I am
>> not very sure how much work it gonna be. 
> 
> You're going to have to do that anyway if you're going to write to the
> directmap while executing from the alias.

Not really. If you look at current version 7/7, the logic is mostly 
straightforward. We just make all the writes to the directmap, while 
calculate offset from the alias. 

> 
>> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
>> instructions (BPF instructions). So it could easily go beyond a few 
>> pages. Mapping the 2MB page all together should make the logic simpler. 
> 
> Then copy it in smaller chunks I suppose.

How fast/slow is the __text_poke routine? I guess we cannot do it thousands
of times per BPF program (in chunks of a few bytes)? 

Thanks,
Song

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-18 18:39                 ` Song Liu
@ 2021-11-18 18:58                   ` Peter Zijlstra
  2021-11-19  4:14                     ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-18 18:58 UTC (permalink / raw)
  To: Song Liu
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team

On Thu, Nov 18, 2021 at 06:39:49PM +0000, Song Liu wrote:

> > You're going to have to do that anyway if you're going to write to the
> > directmap while executing from the alias.
> 
> Not really. If you look at current version 7/7, the logic is mostly 
> straightforward. We just make all the writes to the directmap, while 
> calculate offset from the alias. 

Then you can do the exact same thing but do the writes to a temp buffer,
no different.

> >> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
> >> instructions (BPF instructions). So it could easily go beyond a few 
> >> pages. Mapping the 2MB page all together should make the logic simpler. 
> > 
> > Then copy it in smaller chunks I suppose.
> 
> How fast/slow is the __text_poke routine? I guess we cannot do it thousands
> of times per BPF program (in chunks of a few bytes)? 

You can copy in at least 4k chunks since any 4k will at most use 2
pages, which is what it does. If that's not fast enough we can look at
doing bigger chunks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-18 18:58                   ` Peter Zijlstra
@ 2021-11-19  4:14                     ` Song Liu
  2021-11-19  9:35                       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-11-19  4:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team



> On Nov 18, 2021, at 10:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Nov 18, 2021 at 06:39:49PM +0000, Song Liu wrote:
> 
>>> You're going to have to do that anyway if you're going to write to the
>>> directmap while executing from the alias.
>> 
>> Not really. If you look at current version 7/7, the logic is mostly 
>> straightforward. We just make all the writes to the directmap, while 
>> calculate offset from the alias. 
> 
> Then you can do the exact same thing but do the writes to a temp buffer,
> no different.

There will be some extra work, but I guess I will give it a try. 

> 
>>>> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
>>>> instructions (BPF instructions). So it could easily go beyond a few 
>>>> pages. Mapping the 2MB page all together should make the logic simpler. 
>>> 
>>> Then copy it in smaller chunks I suppose.
>> 
>> How fast/slow is the __text_poke routine? I guess we cannot do it thousands
>> of times per BPF program (in chunks of a few bytes)? 
> 
> You can copy in at least 4k chunks since any 4k will at most use 2
> pages, which is what it does. If that's not fast enough we can look at
> doing bigger chunks.

If we do JIT in a buffer first, 4kB chunks should be fast enough. 

Another side of this issue is the split of linear mapping (1GB => 
many 4kB). If we only split to PMD, but not PTE, we can probably 
recover most of the regression. I will check this with Johannes. 

Thanks,
Song




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-19  4:14                     ` Song Liu
@ 2021-11-19  9:35                       ` Peter Zijlstra
  2021-11-19 16:06                         ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2021-11-19  9:35 UTC (permalink / raw)
  To: Song Liu
  Cc: Johannes Weiner, the arch/x86 maintainers, bpf, netdev, tglx,
	mingo, bp, dave.hansen, ast, daniel, andrii, Kernel Team

On Fri, Nov 19, 2021 at 04:14:46AM +0000, Song Liu wrote:
> 
> 
> > On Nov 18, 2021, at 10:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Thu, Nov 18, 2021 at 06:39:49PM +0000, Song Liu wrote:
> > 
> >>> You're going to have to do that anyway if you're going to write to the
> >>> directmap while executing from the alias.
> >> 
> >> Not really. If you look at current version 7/7, the logic is mostly 
> >> straightforward. We just make all the writes to the directmap, while 
> >> calculate offset from the alias. 
> > 
> > Then you can do the exact same thing but do the writes to a temp buffer,
> > no different.
> 
> There will be some extra work, but I guess I will give it a try. 
> 
> > 
> >>>> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
> >>>> instructions (BPF instructions). So it could easily go beyond a few 
> >>>> pages. Mapping the 2MB page all together should make the logic simpler. 
> >>> 
> >>> Then copy it in smaller chunks I suppose.
> >> 
> >> How fast/slow is the __text_poke routine? I guess we cannot do it thousands
> >> of times per BPF program (in chunks of a few bytes)? 
> > 
> > You can copy in at least 4k chunks since any 4k will at most use 2
> > pages, which is what it does. If that's not fast enough we can look at
> > doing bigger chunks.
> 
> If we do JIT in a buffer first, 4kB chunks should be fast enough. 
> 
> Another side of this issue is the split of linear mapping (1GB => 
> many 4kB). If we only split to PMD, but not PTE, we can probably 
> recover most of the regression. I will check this with Johannes. 

__text_poke() shouldn't affect the fragmentation of the kernel
mapping, it's a user-space alias into the same physical memory. For all
it cares we're poking into GB pages.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias
  2021-11-19  9:35                       ` Peter Zijlstra
@ 2021-11-19 16:06                         ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2021-11-19 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, the arch/x86 maintainers, bpf, netdev, tglx, mingo, bp,
	dave.hansen, ast, daniel, andrii, Kernel Team

On Fri, Nov 19, 2021 at 10:35:56AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 19, 2021 at 04:14:46AM +0000, Song Liu wrote:
> > 
> > 
> > > On Nov 18, 2021, at 10:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > On Thu, Nov 18, 2021 at 06:39:49PM +0000, Song Liu wrote:
> > > 
> > >>> You're going to have to do that anyway if you're going to write to the
> > >>> directmap while executing from the alias.
> > >> 
> > >> Not really. If you look at current version 7/7, the logic is mostly 
> > >> straightforward. We just make all the writes to the directmap, while 
> > >> calculate offset from the alias. 
> > > 
> > > Then you can do the exact same thing but do the writes to a temp buffer,
> > > no different.
> > 
> > There will be some extra work, but I guess I will give it a try. 
> > 
> > > 
> > >>>> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
> > >>>> instructions (BPF instructions). So it could easily go beyond a few 
> > >>>> pages. Mapping the 2MB page all together should make the logic simpler. 
> > >>> 
> > >>> Then copy it in smaller chunks I suppose.
> > >> 
> > >> How fast/slow is the __text_poke routine? I guess we cannot do it thousands
> > >> of times per BPF program (in chunks of a few bytes)? 
> > > 
> > > You can copy in at least 4k chunks since any 4k will at most use 2
> > > pages, which is what it does. If that's not fast enough we can look at
> > > doing bigger chunks.
> > 
> > If we do JIT in a buffer first, 4kB chunks should be fast enough. 
> > 
> > Another side of this issue is the split of linear mapping (1GB => 
> > many 4kB). If we only split to PMD, but not PTE, we can probably 
> > recover most of the regression. I will check this with Johannes. 
> 
> __text_poke() shouldn't affect the fragmentation of the kernel
> mapping, it's a user-space alias into the same physical memory. For all
> it cares we're poking into GB pages.

Right, __text_poke won't, it's the initial set_memory_ro/x against the
vmap space that does it, since the linear mapping is updated as an
alias with the same granularity.

However, my guess would also be that once we stop doing that with 4k
pages for every single bpf program, and batch into shared 2M pages
instead, the problem will be much smaller. Maybe negligible.*

So ro linear mapping + __text_poke() sounds like a good idea to me.

[ If not, we could consider putting the linear mapping updates behind
  a hardening option similar to RETPOLINE, PAGE_TABLE_ISOLATION,
  STRICT_MODULE_RWX. The __text_poke() method would mean independence
  from the linear mapping and we can do with that what we want
  then. Many machines aren't exposed enough to necessarily care about
  W^X if the price is too high - and the impact of losing the GB
  mappings is actually significant on central workloads right now.

  But yeah, hopefully we won't have to go there. ]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-11-19 16:06 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20211116071347.520327-1-songliubraving@fb.com>
     [not found] ` <20211116071347.520327-3-songliubraving@fb.com>
2021-11-16  8:00   ` [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]_noalias Peter Zijlstra
2021-11-17 21:36     ` Song Liu
2021-11-17 22:01       ` Peter Zijlstra
2021-11-17 23:57         ` Song Liu
2021-11-18  0:11           ` Song Liu
2021-11-18  7:54           ` Peter Zijlstra
2021-11-18 17:16             ` Song Liu
2021-11-18 18:28               ` Peter Zijlstra
2021-11-18 18:39                 ` Song Liu
2021-11-18 18:58                   ` Peter Zijlstra
2021-11-19  4:14                     ` Song Liu
2021-11-19  9:35                       ` Peter Zijlstra
2021-11-19 16:06                         ` Johannes Weiner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.