bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] BPF: Disable on PREEMPT_RT
@ 2019-10-17  9:05 Sebastian Andrzej Siewior
  2019-10-17 14:53 ` Daniel Borkmann
  0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-17  9:05 UTC (permalink / raw)
  To: bpf
  Cc: Thomas Gleixner, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Peter Zijlstra

Disable BPF on PREEMPT_RT because
- it allocates and frees memory in atomic context
- it uses up_read_non_owner()
- BPF_PROG_RUN() expects to be invoked in non-preemptible context

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

I tried to fix the memory allocations in 
  http://lkml.kernel.org/r/20190410143025.11997-1-bigeasy@linutronix.de

but I have no idea how to address the other two issues.

 init/Kconfig    |    1 +
 net/kcm/Kconfig |    1 +
 2 files changed, 2 insertions(+)

--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1629,6 +1629,7 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
+	depends on !PREEMPT_RT
 	select BPF
 	select IRQ_WORK
 	default n
--- a/net/kcm/Kconfig
+++ b/net/kcm/Kconfig
@@ -3,6 +3,7 @@
 config AF_KCM
 	tristate "KCM sockets"
 	depends on INET
+	depends on !PREEMPT_RT
 	select BPF_SYSCALL
 	select STREAM_PARSER
 	---help---

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17  9:05 [PATCH] BPF: Disable on PREEMPT_RT Sebastian Andrzej Siewior
@ 2019-10-17 14:53 ` Daniel Borkmann
  2019-10-17 15:40   ` Sebastian Andrzej Siewior
  2019-10-17 17:26   ` David Miller
  0 siblings, 2 replies; 26+ messages in thread
From: Daniel Borkmann @ 2019-10-17 14:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Thomas Gleixner, Alexei Starovoitov, Martin KaFai Lau,
	Song Liu, Yonghong Song, Peter Zijlstra

On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
> Disable BPF on PREEMPT_RT because
> - it allocates and frees memory in atomic context
> - it uses up_read_non_owner()
> - BPF_PROG_RUN() expects to be invoked in non-preemptible context

For the latter you'd also need to disable seccomp-BPF and everything
cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...

> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> I tried to fix the memory allocations in 
>   http://lkml.kernel.org/r/20190410143025.11997-1-bigeasy@linutronix.de
> 
> but I have no idea how to address the other two issues.
> 
>  init/Kconfig    |    1 +
>  net/kcm/Kconfig |    1 +
>  2 files changed, 2 insertions(+)
> 
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1629,6 +1629,7 @@ config KALLSYMS_BASE_RELATIVE
>  # syscall, maps, verifier
>  config BPF_SYSCALL
>  	bool "Enable bpf() system call"
> +	depends on !PREEMPT_RT
>  	select BPF
>  	select IRQ_WORK
>  	default n
> --- a/net/kcm/Kconfig
> +++ b/net/kcm/Kconfig
> @@ -3,6 +3,7 @@
>  config AF_KCM
>  	tristate "KCM sockets"
>  	depends on INET
> +	depends on !PREEMPT_RT
>  	select BPF_SYSCALL
>  	select STREAM_PARSER
>  	---help---

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 14:53 ` Daniel Borkmann
@ 2019-10-17 15:40   ` Sebastian Andrzej Siewior
  2019-10-17 17:25     ` David Miller
  2019-10-17 17:26   ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-17 15:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, Thomas Gleixner, Alexei Starovoitov, Martin KaFai Lau,
	Song Liu, Yonghong Song, Peter Zijlstra

On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:
> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
> > Disable BPF on PREEMPT_RT because
> > - it allocates and frees memory in atomic context
> > - it uses up_read_non_owner()
> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context
> 
> For the latter you'd also need to disable seccomp-BPF and everything
> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...

I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
NET_TEAM_MODE_LOADBALANCE and probably more.  I didn't find a symbol for
seccomp-BPF. 
Would it make sense to override BPF_PROG_RUN() and make each caller fail
instead? Other recommendations?
 
Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 15:40   ` Sebastian Andrzej Siewior
@ 2019-10-17 17:25     ` David Miller
  2019-10-17 21:54       ` Thomas Gleixner
  2019-10-17 22:11       ` Thomas Gleixner
  0 siblings, 2 replies; 26+ messages in thread
From: David Miller @ 2019-10-17 17:25 UTC (permalink / raw)
  To: bigeasy; +Cc: daniel, bpf, tglx, ast, kafai, songliubraving, yhs, peterz

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Thu, 17 Oct 2019 17:40:21 +0200

> On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:
>> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
>> > Disable BPF on PREEMPT_RT because
>> > - it allocates and frees memory in atomic context
>> > - it uses up_read_non_owner()
>> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context
>> 
>> For the latter you'd also need to disable seccomp-BPF and everything
>> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...
> 
> I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
> do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
> NET_TEAM_MODE_LOADBALANCE and probably more.  I didn't find a symbol for
> seccomp-BPF. 
> Would it make sense to override BPF_PROG_RUN() and make each caller fail
> instead? Other recommendations?

I hope you understand that basically you are disabling any packet sniffing
on the system with this patch you are proposing.

This means no tcpdump, not wireshark, etc.  They will all become
non-functional.

Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is
absolutely essential functionality for a Linux system at this point.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 14:53 ` Daniel Borkmann
  2019-10-17 15:40   ` Sebastian Andrzej Siewior
@ 2019-10-17 17:26   ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2019-10-17 17:26 UTC (permalink / raw)
  To: daniel; +Cc: bigeasy, bpf, tglx, ast, kafai, songliubraving, yhs, peterz

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 17 Oct 2019 16:53:58 +0200

> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
>> Disable BPF on PREEMPT_RT because
>> - it allocates and frees memory in atomic context
>> - it uses up_read_non_owner()
>> - BPF_PROG_RUN() expects to be invoked in non-preemptible context
> 
> For the latter you'd also need to disable seccomp-BPF and everything
> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...

+1  Right, we can't do this, no way.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 17:25     ` David Miller
@ 2019-10-17 21:54       ` Thomas Gleixner
  2019-10-17 22:13         ` David Miller
                           ` (2 more replies)
  2019-10-17 22:11       ` Thomas Gleixner
  1 sibling, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-17 21:54 UTC (permalink / raw)
  To: David Miller
  Cc: Sebastian Sewior, daniel, bpf, ast, kafai, songliubraving, yhs,
	Peter Zijlstra, Clark Williams

[-- Attachment #1: Type: text/plain, Size: 2178 bytes --]

On Thu, 17 Oct 2019, David Miller wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Thu, 17 Oct 2019 17:40:21 +0200
> 
> > On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:
> >> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
> >> > Disable BPF on PREEMPT_RT because
> >> > - it allocates and frees memory in atomic context
> >> > - it uses up_read_non_owner()
> >> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context
> >> 
> >> For the latter you'd also need to disable seccomp-BPF and everything
> >> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...
> > 
> > I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
> > do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
> > NET_TEAM_MODE_LOADBALANCE and probably more.  I didn't find a symbol for
> > seccomp-BPF. 
> > Would it make sense to override BPF_PROG_RUN() and make each caller fail
> > instead? Other recommendations?
> 
> I hope you understand that basically you are disabling any packet sniffing
> on the system with this patch you are proposing.
> 
> This means no tcpdump, not wireshark, etc.  They will all become
> non-functional.
> 
> Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is
> absolutely essential functionality for a Linux system at this point.

I'm all ears for an alternative solution. Here are the pain points:

  #1) BPF disables preemption unconditionally with no way to do a proper RT
      substitution like most other infrastructure in the kernel provides
      via spinlocks or other locking primitives.

  #2) BPF does allocations in atomic contexts, which is a dubious decision
      even for non RT. That's related to #1

  #3) BPF uses the up_read_non_owner() hackery which was only invented to
      deal with already existing horrors and not meant to be proliferated.

      Yes, I know it's a existing facility ....

TBH, I have no idea how to deal with those things. So the only way forward
for RT right now is to disable the whole thing.

Clark might have some insight from the product side for you how much that
impacts usability.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 17:25     ` David Miller
  2019-10-17 21:54       ` Thomas Gleixner
@ 2019-10-17 22:11       ` Thomas Gleixner
  2019-10-17 22:23         ` David Miller
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-17 22:11 UTC (permalink / raw)
  To: David Miller
  Cc: Sebastian Sewior, daniel, bpf, ast, kafai, songliubraving, yhs,
	Peter Zijlstra, Clark Williams

[-- Attachment #1: Type: text/plain, Size: 1480 bytes --]

On Thu, 17 Oct 2019, David Miller wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Thu, 17 Oct 2019 17:40:21 +0200
> 
> > On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:
> >> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:
> >> > Disable BPF on PREEMPT_RT because
> >> > - it allocates and frees memory in atomic context
> >> > - it uses up_read_non_owner()
> >> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context
> >> 
> >> For the latter you'd also need to disable seccomp-BPF and everything
> >> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...
> > 
> > I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
> > do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
> > NET_TEAM_MODE_LOADBALANCE and probably more.  I didn't find a symbol for
> > seccomp-BPF. 
> > Would it make sense to override BPF_PROG_RUN() and make each caller fail
> > instead? Other recommendations?
> 
> I hope you understand that basically you are disabling any packet sniffing
> on the system with this patch you are proposing.
> 
> This means no tcpdump, not wireshark, etc.  They will all become
> non-functional.

Just for the record.

tcpdump and wireshark work perfectly fine on a BPF disabled kernel at least
in the limited way I am using them.

They might become non functional in a decade from now but I assume that we
find a solution for those problems until then.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 21:54       ` Thomas Gleixner
@ 2019-10-17 22:13         ` David Miller
  2019-10-17 23:50           ` Thomas Gleixner
  2019-10-17 23:27         ` Alexei Starovoitov
  2019-10-18  2:49         ` Clark Williams
  2 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2019-10-17 22:13 UTC (permalink / raw)
  To: tglx
  Cc: bigeasy, daniel, bpf, ast, kafai, songliubraving, yhs, peterz, williams

From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 17 Oct 2019 23:54:07 +0200 (CEST)

> Clark might have some insight from the product side for you how much that
> impacts usability.

You won't even be able to load systemd, it uses bpf.

We're moving to the point where even LSM modules will be implemented in bpf.
IR drivers require bpf:

	https://lwn.net/Articles/759188/

I understand the problems, and realize they are non-trivial, but this hammer
is really destructive on a fundamental level.

To a certain extent, those who insert bpf programs are explicitly
changing the kernel so really the onus is on them to make sure the
programs complete in time which is not only finite (which is
guaranteed by BPF) but also within the RT constraints.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 22:11       ` Thomas Gleixner
@ 2019-10-17 22:23         ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2019-10-17 22:23 UTC (permalink / raw)
  To: tglx
  Cc: bigeasy, daniel, bpf, ast, kafai, songliubraving, yhs, peterz, williams

From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 18 Oct 2019 00:11:38 +0200 (CEST)

> tcpdump and wireshark work perfectly fine on a BPF disabled kernel at least
> in the limited way I am using them.

Yes it works, but with every packet flowing through the system getting
copied into userspace.  This takes us back to 1992 :-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 21:54       ` Thomas Gleixner
  2019-10-17 22:13         ` David Miller
@ 2019-10-17 23:27         ` Alexei Starovoitov
  2019-10-18  0:22           ` Thomas Gleixner
  2019-10-18  2:49         ` Clark Williams
  2 siblings, 1 reply; 26+ messages in thread
From: Alexei Starovoitov @ 2019-10-17 23:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Thu, Oct 17, 2019 at 2:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I'm all ears for an alternative solution. Here are the pain points:

Let's talk about them one by one.

>   #1) BPF disables preemption unconditionally with no way to do a proper RT
>       substitution like most other infrastructure in the kernel provides
>       via spinlocks or other locking primitives.

Kernel has a ton of code that disables preemption.
Why BPF is somehow special?
Are you saying RT kernel doesn't disable preemption at all?
I'm complete noob in RT.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 22:13         ` David Miller
@ 2019-10-17 23:50           ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-17 23:50 UTC (permalink / raw)
  To: David Miller
  Cc: Sebastian Sewior, daniel, bpf, ast, kafai, songliubraving, yhs,
	Peter Zijlstra, Clark Williams

David,

On Thu, 17 Oct 2019, David Miller wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 17 Oct 2019 23:54:07 +0200 (CEST)
> 
> > Clark might have some insight from the product side for you how much that
> > impacts usability.
> 
> You won't even be able to load systemd, it uses bpf.

As I said before: At some point in the future from now.

Right now I'm writing this mail from a Debian testing system which runs a
kernel with Sebastians patch applied. That means a halfways recent systemd
started just fine and everything works.

You surely made your point.
 
> We're moving to the point where even LSM modules will be implemented in bpf.

Emphasis on 'We're moving'. Which means this is in progress and not after
the fact. 

> IR drivers require bpf:
> 
> 	https://lwn.net/Articles/759188/

The fact that IR drivers require BPF is not a real convincing argument
either.

  Guess how many RT systems depend on functional IR drivers?

  Guess how many other subsystems are not RT safe and disabled on RT and
  still RT is successfully deployed in production?

Quoting from the other end of that thread just to avoid fragmentation:

> > tcpdump and wireshark work perfectly fine on a BPF disabled kernel at least
> > in the limited way I am using them.
>
> Yes it works, but with every packet flowing through the system getting
> copied into userspace.  This takes us back to 1992 :-)

Guess what? RT real world deployments survived for the past 15 years on the
packet sniffing state of 1992. There is a world outside of networking...

> I understand the problems, and realize they are non-trivial, but this hammer
> is really destructive on a fundamental level.

The fundamentally desctructive component is that this whole thread does not
provide any form of constructive feedback.

 - Sebastians changelog has a list of the issues
 - I expanded on those

All we got as a reply is a destructive NO along with a long list of half
baken arguments why temporary disabling of this functionality solely for RT
is unacceptable.

It's probably also solely my (our / RT folks) problem that BPF made design
decisions which are focussed on (network) performance without considering
any other already existing constraints.

Sure we have the usual policy that we don't care about out of tree stuff
and it's the problem of the out of tree folks to deal with that, but I
politely ask you to think hard about this in the context of RT.

I'm going to shut up for now and wait for constructive and reasonable
feedback how to tackle these issues on a technical level.

Thanks,

	Thomas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 23:27         ` Alexei Starovoitov
@ 2019-10-18  0:22           ` Thomas Gleixner
  2019-10-18  5:52             ` Alexei Starovoitov
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-18  0:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Thu, 17 Oct 2019, Alexei Starovoitov wrote:
> On Thu, Oct 17, 2019 at 2:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > I'm all ears for an alternative solution. Here are the pain points:
> 
> Let's talk about them one by one.
> 
> >   #1) BPF disables preemption unconditionally with no way to do a proper RT
> >       substitution like most other infrastructure in the kernel provides
> >       via spinlocks or other locking primitives.
> 
> Kernel has a ton of code that disables preemption.
> Why BPF is somehow special?
> Are you saying RT kernel doesn't disable preemption at all?
> I'm complete noob in RT.

The basic principle of RT is to break up the arbitrary long
preemption/interrupt disabled sections of the mainline kernel.

Most preempt/interrupt disabled sections are implicit by taking locks
(spinlock, rwlock). Just a few are explicit by issuing
preempt/local_irq_disable()

RT substitutes spinlock/rwlock with RT aware counterparts which

 - Do not disable preemption/interrupts

 - Prevent migration to keep the implicit migrate disable semantics
   of preempt disable

 - Convert the underlying lock primitive to a priority inheritance aware
   mechanism, aka. rtmutex.

In order to make the above work, RT forces interrupt and soft interrupt
processing into thread context except for interrupts which are explicitely
marked as interrupt safe (IRQF_NOTHREAD).

As a consequence most of the kernel code becomes fully preemptible. Of
course there are still code parts which require that preemption/interrupts
are hard disabled. That's pretty much initial low level entry code, hard
interrupt handling code (which just wakes up the threads), context switch
code and some other rather low level functions (vmenter/exit ....).

That also requires that we have still locks which disable
preemption/interrupts. That's why we have raw_spinlock and
spinlock. spinlock is substituted with a RT primitive while raw_spinlock
behaves like the traditional spinlock on a non RT kernel (disables
preemption/interrupts).

But that also means any code which explcitely disables preemption or
interrupts without taking a spin/rw lock can trigger the following issues:

  - Calling into code which requires to be preemtible/sleepable on RT
    results in a might sleep splat.

  - Has in RT terms potentially unbound or undesired runtime length without
    any chance for the scheduler to control it.

Aside of that RT has a more strict view vs. lock ownership because almost
all lock primitives except real counting semaphores are substituted by
priority inheritance aware counterparts. PI aware locks have not only the
requirement that they can only be taken in preemptible context (see above),
they also have a strict locker == unlocker requirement for obvious reasons.
up_read_non_owner() can't obviously fulfil that requirement.

I surely answered more than your initial question and probably not enough,
so feel free to ask for clarification.

Thanks for caring!

       Thomas



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-17 21:54       ` Thomas Gleixner
  2019-10-17 22:13         ` David Miller
  2019-10-17 23:27         ` Alexei Starovoitov
@ 2019-10-18  2:49         ` Clark Williams
  2019-10-18  4:57           ` David Miller
  2019-10-18  8:46           ` Thomas Gleixner
  2 siblings, 2 replies; 26+ messages in thread
From: Clark Williams @ 2019-10-18  2:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

+acme

On Thu, 17 Oct 2019 23:54:07 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 17 Oct 2019, David Miller wrote:
> 
> > From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Date: Thu, 17 Oct 2019 17:40:21 +0200
> >   
> > > On 2019-10-17 16:53:58 [+0200], Daniel Borkmann wrote:  
> > >> On Thu, Oct 17, 2019 at 11:05:01AM +0200, Sebastian Andrzej Siewior wrote:  
> > >> > Disable BPF on PREEMPT_RT because
> > >> > - it allocates and frees memory in atomic context
> > >> > - it uses up_read_non_owner()
> > >> > - BPF_PROG_RUN() expects to be invoked in non-preemptible context  
> > >> 
> > >> For the latter you'd also need to disable seccomp-BPF and everything
> > >> cBPF related as they are /all/ invoked via BPF_PROG_RUN() ...  
> > > 
> > > I looked at tracing and it depended on BPF_SYSCALL so I assumed they all
> > > do… Now looking for BPF_PROG_RUN() there is PPP_FILTER,
> > > NET_TEAM_MODE_LOADBALANCE and probably more.  I didn't find a symbol for
> > > seccomp-BPF. 
> > > Would it make sense to override BPF_PROG_RUN() and make each caller fail
> > > instead? Other recommendations?  
> > 
> > I hope you understand that basically you are disabling any packet sniffing
> > on the system with this patch you are proposing.
> > 
> > This means no tcpdump, not wireshark, etc.  They will all become
> > non-functional.
> > 
> > Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is
> > absolutely essential functionality for a Linux system at this point.  
> 
> I'm all ears for an alternative solution. Here are the pain points:
> 
>   #1) BPF disables preemption unconditionally with no way to do a proper RT
>       substitution like most other infrastructure in the kernel provides
>       via spinlocks or other locking primitives.

As I understand it, BPF programs cannot loop and are limited to 4096 instructions.
Has anyone done any timing to see just how much having preemption off while a
BPF program executes is going to affect us? Are we talking 1us or 50us? or longer?
I wonder if there's some instrumentation we could use to determine the maximum time
spent running a BPF program. Maybe some perf mojo...

> 
>   #2) BPF does allocations in atomic contexts, which is a dubious decision
>       even for non RT. That's related to #1

I guess my question here is, are the allocations done on behalf of an about-to-run
BPF program, or as a result of executing BPF code?  Is it something we might be able
to satisfy from a pre-allocated pool rather than kmalloc()? Ok, I need to go dive
into BPF a bit deeper.

> 
>   #3) BPF uses the up_read_non_owner() hackery which was only invented to
>       deal with already existing horrors and not meant to be proliferated.
> 
>       Yes, I know it's a existing facility ....

I'm sure I'll regret asking this, but why is up_read_non_owner() a horror? I mean,
I get the fundamental wrongness of having someone that's not the owner of a semaphore
performing an 'up' on it, but is there an RT-specific reason that it's bad? Is it
totally a blocker for using BPF with RT or is it something we should fix over time?

> 
> TBH, I have no idea how to deal with those things. So the only way forward
> for RT right now is to disable the whole thing.
> 
> Clark might have some insight from the product side for you how much that
> impacts usability.
> 
> Thanks,
> 
> 	tglx


Clark is only just starting his journey with BPF, so not an expert.

I do think that we (RT) are going to have to co-exist with BPF, if only due to the
increased use of XDP. I also think that other sub-systems will start to
employ BPF for production purposes (as opposed to debug/analysis which is
how we generally look at tracing, packet sniffing, etc.). I think we *have* to
figure out how to co-exist. 

Guess my "hey, that look interesting, think I'll leisurely read up on it" just got
a little less leisurely. I'm out most of the day tomorrow but I'll catch up on email
over the weekend.


Clark

-- 
The United States Coast Guard
Ruining Natural Selection since 1790

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  2:49         ` Clark Williams
@ 2019-10-18  4:57           ` David Miller
  2019-10-18  5:54             ` Alexei Starovoitov
  2019-10-18  8:38             ` Thomas Gleixner
  2019-10-18  8:46           ` Thomas Gleixner
  1 sibling, 2 replies; 26+ messages in thread
From: David Miller @ 2019-10-18  4:57 UTC (permalink / raw)
  To: williams
  Cc: tglx, bigeasy, daniel, bpf, ast, kafai, songliubraving, yhs,
	peterz, acme

From: Clark Williams <williams@redhat.com>
Date: Thu, 17 Oct 2019 21:49:17 -0500

> BPF programs cannot loop and are limited to 4096 instructions.

The limit was increased to 1 million not too long ago.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  0:22           ` Thomas Gleixner
@ 2019-10-18  5:52             ` Alexei Starovoitov
  2019-10-18 11:28               ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Alexei Starovoitov @ 2019-10-18  5:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Fri, Oct 18, 2019 at 02:22:40AM +0200, Thomas Gleixner wrote:
> 
> But that also means any code which explcitely disables preemption or
> interrupts without taking a spin/rw lock can trigger the following issues:
> 
>   - Calling into code which requires to be preemtible/sleepable on RT
>     results in a might sleep splat.
> 
>   - Has in RT terms potentially unbound or undesired runtime length without
>     any chance for the scheduler to control it.

Much appreciate the explanation. Few more questions:
There is a ton of kernel code that does preempt_disable()
and proceeds to do per-cpu things. How is it handled in RT?
Are you saying that every preempt_disable has to be paired with some lock?
I don't think it's a practical requirement for fulfill, so I probably
misunderstood something.

In BPF we disable preemption because of per-cpu maps and per-cpu data structures
that are shared between bpf program execution and kernel execution.

BPF doesn't call into code that might sleep.
BPF also doesn't have unbound runtime.
So two above issues are actually non-issues.

May be we should go back to concerns that prompted this patch.
Do you have any numbers from production that show that BPF is causing
unbounded latency for RT workloads? If it's all purely theoretical than
we should share the knowledge how different systems behave
instead of building walls. It feels to me that there are no
actual issues. Only misunderstandings.

All that aside I'm working on new BPF program categories that
will be fully preemptable and sleepable. That requirement came
from tracing long ago. The verifier infrastructure wasn't ready
at that time. Now we can do it.
BPF programs will be able to do copy_from_user() and take faults.
preempt_disable and rcu_read_lock regions will be controlled by
the verifier. We will have to support all existing semantics though.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  4:57           ` David Miller
@ 2019-10-18  5:54             ` Alexei Starovoitov
  2019-10-18  8:38             ` Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Alexei Starovoitov @ 2019-10-18  5:54 UTC (permalink / raw)
  To: David Miller
  Cc: williams, tglx, bigeasy, daniel, bpf, ast, kafai, songliubraving,
	yhs, peterz, acme

On Thu, Oct 17, 2019 at 09:57:39PM -0700, David Miller wrote:
> From: Clark Williams <williams@redhat.com>
> Date: Thu, 17 Oct 2019 21:49:17 -0500
> 
> > BPF programs cannot loop and are limited to 4096 instructions.
> 
> The limit was increased to 1 million not too long ago.

Right.
And it's easy to limit to 100 instructions per program for PREEMPT_RT.
I doubt it's necessary.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  4:57           ` David Miller
  2019-10-18  5:54             ` Alexei Starovoitov
@ 2019-10-18  8:38             ` Thomas Gleixner
  2019-10-18 12:49               ` Clark Williams
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-18  8:38 UTC (permalink / raw)
  To: David Miller
  Cc: Clark Williams, Sebastian Sewior, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

On Thu, 17 Oct 2019, David Miller wrote:

> From: Clark Williams <williams@redhat.com>
> Date: Thu, 17 Oct 2019 21:49:17 -0500
> 
> > BPF programs cannot loop and are limited to 4096 instructions.
> 
> The limit was increased to 1 million not too long ago.

Assuming a instruction/cycle ratio of 1.0 and a CPU frequency of 2GHz,
that's 500us of preempt disabled time. Out of bounds by at least one order
of magntiude for a lot of RT scenarios.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  2:49         ` Clark Williams
  2019-10-18  4:57           ` David Miller
@ 2019-10-18  8:46           ` Thomas Gleixner
  2019-10-18 12:43             ` Sebastian Sewior
  2019-10-18 12:58             ` Clark Williams
  1 sibling, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-18  8:46 UTC (permalink / raw)
  To: Clark Williams
  Cc: David Miller, Sebastian Sewior, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

Clark,

On Thu, 17 Oct 2019, Clark Williams wrote:
> On Thu, 17 Oct 2019 23:54:07 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> >   #2) BPF does allocations in atomic contexts, which is a dubious decision
> >       even for non RT. That's related to #1
> 
> I guess my question here is, are the allocations done on behalf of an about-to-run
> BPF program, or as a result of executing BPF code?  Is it something we might be able
> to satisfy from a pre-allocated pool rather than kmalloc()? Ok, I need to go dive
> into BPF a bit deeper.

Sebastion?
 
> >   #3) BPF uses the up_read_non_owner() hackery which was only invented to
> >       deal with already existing horrors and not meant to be proliferated.
> > 
> >       Yes, I know it's a existing facility ....
> 
> I'm sure I'll regret asking this, but why is up_read_non_owner() a horror? I mean,
> I get the fundamental wrongness of having someone that's not the owner of a semaphore
> performing an 'up' on it, but is there an RT-specific reason that it's bad? Is it
> totally a blocker for using BPF with RT or is it something we should fix over time?

RT has strict locker == unlocker semantics simply because the owner
(locker) is subject to priority inheritance and a non-owner unlock cannot
undo PI on behalf of the locker sanely. Also exposing the locker to PI if
the locker is not involved in unlocking is obviously a pointless exercise
and potentially a source of unbound priority inversion.

> I do think that we (RT) are going to have to co-exist with BPF, if only due to the
> increased use of XDP. I also think that other sub-systems will start to
> employ BPF for production purposes (as opposed to debug/analysis which is
> how we generally look at tracing, packet sniffing, etc.). I think we *have* to
> figure out how to co-exist.

I'm not saying that RT does not want BPF, quite the contrary, but for the
initial merge BPF is not a hard requirement, so disabling it was the
straight forward path.

I'm all ears to get pointers how to solve that right now.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  5:52             ` Alexei Starovoitov
@ 2019-10-18 11:28               ` Thomas Gleixner
  2019-10-18 12:48                 ` Sebastian Sewior
  2019-10-18 23:05                 ` Alexei Starovoitov
  0 siblings, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-18 11:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

Alexei,

On Thu, 17 Oct 2019, Alexei Starovoitov wrote:
> On Fri, Oct 18, 2019 at 02:22:40AM +0200, Thomas Gleixner wrote:
> > 
> > But that also means any code which explcitely disables preemption or
> > interrupts without taking a spin/rw lock can trigger the following issues:
> > 
> >   - Calling into code which requires to be preemtible/sleepable on RT
> >     results in a might sleep splat.
> > 
> >   - Has in RT terms potentially unbound or undesired runtime length without
> >     any chance for the scheduler to control it.
> 
> Much appreciate the explanation. Few more questions:
> There is a ton of kernel code that does preempt_disable()
> and proceeds to do per-cpu things. How is it handled in RT?

There is not really tons of it, at least not tons which actually hurt. Most
of those sections are extremly small or actually required even on RT
(e.g. scheduler, lock internals ...)

> Are you saying that every preempt_disable has to be paired with some lock?
> I don't think it's a practical requirement for fulfill, so I probably
> misunderstood something.

See above. The ones RT cares about are:

    - Long and potentially unbound preempt/interrupt disabled sections

    - Preempt disabled sections which call into code which might sleep
      under RT due to the magic 'sleeping' spin/rw_locks which we use
      as substitution.

> In BPF we disable preemption because of per-cpu maps and per-cpu data structures
> that are shared between bpf program execution and kernel execution.
> 
> BPF doesn't call into code that might sleep.

Sure, not if you look at it from the mainline perspective. RT changes the
picture there because due to forced interrupt/soft interrupt threading and
the lock substitution 'innocent' code becomes sleepable. That's especially
true for the memory allocators, which are required to be called with
preemption enabled on RT. But again, most GFP_ATOMIC allocations happen
from within spin/rwlock held sections, which are already made preemptible
by RT magically. The ones which were inside of contexts which are atomic
even on RT have been moved out of the atomic sections already (except for
the BPF ones).

The problem with standalone preempt_disable() and local_irq_disable() is
that the protection scope is not defined. These two are basically per CPU
big kernel locks. We all know how well the BKL semantics worked :)

One of the mechanisms RT uses to substitute standalone preempt_disable()
and local_irq_disable() which are not related to a lock operation with so
called local_locks. We haven't submitted the local_lock code yet, but that
might be a way out. The way it works is simple:

DEFINE_LOCAL_LOCK(this_scope);

in the code:

-	preempt_disable();
+	local_lock(this_scope);

and all kind of variants local_lock_bh/irq/irqsave(). You get the idea.

On a non RT enabled build these primitives just resolve to
preempt_disable(), local_bh_disable(), local_irq_disable() and
local_irq_save().

On RT the local lock is actually a per CPU lock which allows nesting. i.e.

      preempt_disable();
      ...
      local_irq_disable();

becomes

	local_lock(this_scope);
	...
	local_lock_irq(this_scope);

The local lock is a 'sleeping' spinlock on RT (PI support) and as any other
RT substituted lock it also ensures that the task cannot be migrated when
it is held, which makes per cpu assumptions work - the kernel has lots of
them. :)

That works as long as the scope is well defined and clear. It does not work
when preempt_disable() or any of the other scopeless protections is used to
protect random (unidentifiable) code against each other, which means the
protection has the dreaded per CPU BKL semantics, i.e. undefined.

One nice thing about local_lock even aside of RT is that it annotates the
code in terms of protection scope which actually gives you also lockdep
coverage. We found already a few bugs that way in the past, where data was
protected with preempt_disable() when the code was introduced and later
access from interrupt code was added without anyone noticing for years....

> BPF also doesn't have unbound runtime.
> So two above issues are actually non-issues.

That'd be nice :)

Anyway, we'll have a look whether this can be solved with local locks which
would be nice, but that still does not solve the issue with the non_owner
release of the rwsem.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  8:46           ` Thomas Gleixner
@ 2019-10-18 12:43             ` Sebastian Sewior
  2019-10-18 12:58             ` Clark Williams
  1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Sewior @ 2019-10-18 12:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Clark Williams, David Miller, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

On 2019-10-18 10:46:01 [+0200], Thomas Gleixner wrote:
> Clark,
> 
> On Thu, 17 Oct 2019, Clark Williams wrote:
> > On Thu, 17 Oct 2019 23:54:07 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > >   #2) BPF does allocations in atomic contexts, which is a dubious decision
> > >       even for non RT. That's related to #1
> > 
> > I guess my question here is, are the allocations done on behalf of an about-to-run
> > BPF program, or as a result of executing BPF code?  Is it something we might be able
> > to satisfy from a pre-allocated pool rather than kmalloc()? Ok, I need to go dive
> > into BPF a bit deeper.
> 
> Sebastion?

The data structures use raw_spinlock_t as protection. This is where the
atomic context is from.
lpm_trie with trie_update_elem() allocates a new element while holding
the lock. I tried to make it a spinlock_t which wouldn't have the
problem but then
   https://lore.kernel.org/bpf/4150a0db-18f9-aa93-cdb4-8cf047093740@iogearbox.net/

pointed out that it has been made raw_spinlock_t due to kprobe on -RT.
Commit ac00881f92210 ("bpf: convert hashtab lock to raw lock") was
"okay" back then (according to Steven Rostedt) but it got wrong with the
memory allocation which came in later.

In order to tackle one thing at a time, I would say that kprobe isn't
our biggest concern…
 
> 	tglx

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18 11:28               ` Thomas Gleixner
@ 2019-10-18 12:48                 ` Sebastian Sewior
  2019-10-18 23:05                 ` Alexei Starovoitov
  1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Sewior @ 2019-10-18 12:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexei Starovoitov, David Miller, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On 2019-10-18 13:28:21 [+0200], Thomas Gleixner wrote:
> The local lock is a 'sleeping' spinlock on RT (PI support) and as any other

it is a "per-CPU 'sleeping' spinlock on RT". Which means that it can be
acquired on multiple CPUs simultaneously (same like
preempt_disable(),…).

> RT substituted lock it also ensures that the task cannot be migrated when
> it is held, which makes per cpu assumptions work - the kernel has lots of
> them. :)
> 	tglx

Sebastian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  8:38             ` Thomas Gleixner
@ 2019-10-18 12:49               ` Clark Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Clark Williams @ 2019-10-18 12:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, 18 Oct 2019 10:38:06 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Thu, 17 Oct 2019, David Miller wrote:
> 
> > From: Clark Williams <williams@redhat.com>
> > Date: Thu, 17 Oct 2019 21:49:17 -0500
> >   
> > > BPF programs cannot loop and are limited to 4096 instructions.  
> > 
> > The limit was increased to 1 million not too long ago.  
> 
> Assuming a instruction/cycle ratio of 1.0 and a CPU frequency of 2GHz,
> that's 500us of preempt disabled time. Out of bounds by at least one order
> of magntiude for a lot of RT scenarios.
> 

So if I do my arithmetic right, 4096 instructions would be around 2us. In
many cases that would just be noise, but there are some customer cases on
the horizon where 2us would be a significant fraction of their max latency.

Hmmm. A quick grep through the Kconfigs didn't show me a BPF config that
was set to a large numeric value (e.g. 1000000). Is the instruction limit
hard coded into the verifier/VM logic and if so, could we lift it out and
set a smaller limit for PREEMPT_RT? Not a silver bullet, but might be a
start, since it sounds like BPF code presumes it has the cpu for the duration
of a program run. 

Clark

-- 
The United States Coast Guard
Ruining Natural Selection since 1790

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18  8:46           ` Thomas Gleixner
  2019-10-18 12:43             ` Sebastian Sewior
@ 2019-10-18 12:58             ` Clark Williams
  1 sibling, 0 replies; 26+ messages in thread
From: Clark Williams @ 2019-10-18 12:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, daniel, bpf, ast, kafai,
	songliubraving, yhs, Peter Zijlstra, Arnaldo Carvalho de Melo

On Fri, 18 Oct 2019 10:46:01 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> > >   #3) BPF uses the up_read_non_owner() hackery which was only invented to
> > >       deal with already existing horrors and not meant to be proliferated.
> > > 
> > >       Yes, I know it's a existing facility ....  
> > 
> > I'm sure I'll regret asking this, but why is up_read_non_owner() a horror? I mean,
> > I get the fundamental wrongness of having someone that's not the owner of a semaphore
> > performing an 'up' on it, but is there an RT-specific reason that it's bad? Is it
> > totally a blocker for using BPF with RT or is it something we should fix over time?  
> 
> RT has strict locker == unlocker semantics simply because the owner
> (locker) is subject to priority inheritance and a non-owner unlock cannot
> undo PI on behalf of the locker sanely. Also exposing the locker to PI if
> the locker is not involved in unlocking is obviously a pointless exercise
> and potentially a source of unbound priority inversion.

Ok, I forgot about the PI consequences. Thanks teacher :)

> 
> > I do think that we (RT) are going to have to co-exist with BPF, if only due to the
> > increased use of XDP. I also think that other sub-systems will start to
> > employ BPF for production purposes (as opposed to debug/analysis which is
> > how we generally look at tracing, packet sniffing, etc.). I think we *have* to
> > figure out how to co-exist.  
> 
> I'm not saying that RT does not want BPF, quite the contrary, but for the
> initial merge BPF is not a hard requirement, so disabling it was the
> straight forward path.
> 
> I'm all ears to get pointers how to solve that right now.
>  

Yeah, put it down to lack of sleep. After waking up and rereading, I first realized
that we're not immediately affected since RHEL8 doesn't use BPF. But we're going to 
have to deal with it next year, since we'll be looking at a 5.6+ kernel which will 
have PREEMPT_RT and the latest BPF bells and whistles. So might as well start the 
conversations now. 

Arnaldo and I have been having lots of conversations regarding BPF, so we'll extend
that and dig in on the preemption issue for now. We also need to understand the 
memory allocation behavior so we can hopefully move it out of atomic regions. I'm not
sure how we should address the up_read_non_owner() issue at the moment.

Clark

-- 
The United States Coast Guard
Ruining Natural Selection since 1790

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18 11:28               ` Thomas Gleixner
  2019-10-18 12:48                 ` Sebastian Sewior
@ 2019-10-18 23:05                 ` Alexei Starovoitov
  2019-10-20  9:06                   ` Thomas Gleixner
  1 sibling, 1 reply; 26+ messages in thread
From: Alexei Starovoitov @ 2019-10-18 23:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Fri, Oct 18, 2019 at 01:28:21PM +0200, Thomas Gleixner wrote:
> 
> On a non RT enabled build these primitives just resolve to
> preempt_disable(), local_bh_disable(), local_irq_disable() and
> local_irq_save().
> 
> On RT the local lock is actually a per CPU lock which allows nesting. i.e.
> 
>       preempt_disable();
>       ...
>       local_irq_disable();
> 
> becomes
> 
> 	local_lock(this_scope);
> 	...
> 	local_lock_irq(this_scope);
> 
> The local lock is a 'sleeping' spinlock on RT (PI support) and as any other
> RT substituted lock it also ensures that the task cannot be migrated when
> it is held, which makes per cpu assumptions work - the kernel has lots of
> them. :)
> 
> That works as long as the scope is well defined and clear. It does not work
> when preempt_disable() or any of the other scopeless protections is used to
> protect random (unidentifiable) code against each other, which means the
> protection has the dreaded per CPU BKL semantics, i.e. undefined.
> 
> One nice thing about local_lock even aside of RT is that it annotates the
> code in terms of protection scope which actually gives you also lockdep
> coverage. We found already a few bugs that way in the past, where data was
> protected with preempt_disable() when the code was introduced and later
> access from interrupt code was added without anyone noticing for years....

The concept on local_lock() makes sense to me.
The magic macro you're proposing that will convert it to old school
preempt_disable() on !RT should hopefully make the changes across
net and bpf land mostly mechanical.
One thing to clarify:
when networking core interacts with bpf we know that bh doesn't migrate,
so per-cpu datastructres that bpf side populates are accessed later
by networking core when program finishes.
Similar thing happens between tracing bits and bpf progs.
It feels to me that local_lock() approach should work here as well.
napi processing bits will grab it. Later bpf will grab potentially
the same lock again.
The weird bit that such lock will have numbe_of_lockers >= 1
for long periods of time. At least until napi runners won't see
any more incoming packets. I'm not sure yet where such local_lock
will be placed in the napi code (may be in drivers too for xdp).
Does this make sense from RT perspective?

> > BPF also doesn't have unbound runtime.
> > So two above issues are actually non-issues.
> 
> That'd be nice :)
> 
> Anyway, we'll have a look whether this can be solved with local locks which
> would be nice, but that still does not solve the issue with the non_owner
> release of the rwsem.

Sure. We can discuss this separately.
up_read_non_owner is used only by build_id mode of stack collectors.
We can disable it for RT for long time. We're using stack with build_id heavily
and have no plans to use RT. I believe datacenters, in general, are not going
to use RT for foreseeable future, so a trade off between stack with build_id
vs RT, I think, is acceptable.

But reading your other replies the gradual approach we're discussing here
doesn't sound acceptable ? And you guys insist on disabling bpf under RT
just to merge some out of tree code ? I find this rude and not acceptable.

If RT wants to get merged it should be disabled when BPF is on
and not the other way around.
Something like this:
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index deff97217496..b3dbc5f9a6de 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,7 +57,7 @@ config PREEMPT

 config PREEMPT_RT
        bool "Fully Preemptible Kernel (Real-Time)"
-       depends on EXPERT && ARCH_SUPPORTS_RT
+       depends on EXPERT && ARCH_SUPPORTS_RT && !BPF_SYSCALL
        select PREEMPTION
        help
          This option turns the kernel into a real-time kernel by replacing


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-18 23:05                 ` Alexei Starovoitov
@ 2019-10-20  9:06                   ` Thomas Gleixner
  2019-10-22  1:43                     ` Alexei Starovoitov
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-10-20  9:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Fri, 18 Oct 2019, Alexei Starovoitov wrote:
> On Fri, Oct 18, 2019 at 01:28:21PM +0200, Thomas Gleixner wrote:
> The concept on local_lock() makes sense to me.
> The magic macro you're proposing that will convert it to old school
> preempt_disable() on !RT should hopefully make the changes across
> net and bpf land mostly mechanical.
> One thing to clarify:
> when networking core interacts with bpf we know that bh doesn't migrate,
> so per-cpu datastructres that bpf side populates are accessed later
> by networking core when program finishes.
> Similar thing happens between tracing bits and bpf progs.
> It feels to me that local_lock() approach should work here as well.
> napi processing bits will grab it. Later bpf will grab potentially
> the same lock again.
> The weird bit that such lock will have numbe_of_lockers >= 1
> for long periods of time. At least until napi runners won't see
> any more incoming packets. I'm not sure yet where such local_lock
> will be placed in the napi code (may be in drivers too for xdp).
> Does this make sense from RT perspective?

I don't see why the lock would have more than one locker. The code in BPF
does

	preempt_disable();
	some operation
	preempt_enable();

So how should that gain more than one context per CPU locking it?

> > > BPF also doesn't have unbound runtime.
> > > So two above issues are actually non-issues.
> > 
> > That'd be nice :)
> > 
> > Anyway, we'll have a look whether this can be solved with local locks which
> > would be nice, but that still does not solve the issue with the non_owner
> > release of the rwsem.
> 
> Sure. We can discuss this separately.
> up_read_non_owner is used only by build_id mode of stack collectors.
> We can disable it for RT for long time. We're using stack with build_id heavily
> and have no plans to use RT. I believe datacenters, in general, are not going
> to use RT for foreseeable future, so a trade off between stack with build_id
> vs RT, I think, is acceptable.
> 
> But reading your other replies the gradual approach we're discussing here
> doesn't sound acceptable ?

I don't know how you read anything like that out of my replies. I clearly
said that we are interested in supporting BPF and you are replying to a
sentence where I clearly said:

  Anyway, we'll have a look whether this can be solved with local locks...

> And you guys insist on disabling bpf under RT just to merge some out of
> tree code ?

Where did I insist on that?

Also you might have to accept that there is a world outside of BPF and that
the 'some out of tree code' which we are talking about is the last part of
a 15+ years effort which significantly helped to bring the Linux kernel
into the shape it is today.

> I find this rude and not acceptable.

I call it a pragmatic approach to solve a real world problem.

> If RT wants to get merged it should be disabled when BPF is on
> and not the other way around.

Of course I could have done that right away without even talking to
you. That'd have been dishonest and sneaky.

It's sad that the discussion between the two of us which started perfectly
fine on a purely technical level had to degrade this way.

If your attitude of wilfully misinterpreting my mails and intentions
continues, I'm surely going this route.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] BPF: Disable on PREEMPT_RT
  2019-10-20  9:06                   ` Thomas Gleixner
@ 2019-10-22  1:43                     ` Alexei Starovoitov
  0 siblings, 0 replies; 26+ messages in thread
From: Alexei Starovoitov @ 2019-10-22  1:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	Peter Zijlstra, Clark Williams

On Sun, Oct 20, 2019 at 11:06:13AM +0200, Thomas Gleixner wrote:
> On Fri, 18 Oct 2019, Alexei Starovoitov wrote:
> > On Fri, Oct 18, 2019 at 01:28:21PM +0200, Thomas Gleixner wrote:
> > The concept on local_lock() makes sense to me.
> > The magic macro you're proposing that will convert it to old school
> > preempt_disable() on !RT should hopefully make the changes across
> > net and bpf land mostly mechanical.
> > One thing to clarify:
> > when networking core interacts with bpf we know that bh doesn't migrate,
> > so per-cpu datastructres that bpf side populates are accessed later
> > by networking core when program finishes.
> > Similar thing happens between tracing bits and bpf progs.
> > It feels to me that local_lock() approach should work here as well.
> > napi processing bits will grab it. Later bpf will grab potentially
> > the same lock again.
> > The weird bit that such lock will have numbe_of_lockers >= 1
> > for long periods of time. At least until napi runners won't see
> > any more incoming packets. I'm not sure yet where such local_lock
> > will be placed in the napi code (may be in drivers too for xdp).
> > Does this make sense from RT perspective?
> 
> I don't see why the lock would have more than one locker. The code in BPF
> does
> 
> 	preempt_disable();
> 	some operation
> 	preempt_enable();
> 
> So how should that gain more than one context per CPU locking it?

napi is doing preempt_disable() then calls into driver's poll function
which further calls into bpf and when its over it looks at per-cpu
data structures that bpf side shares with drivers.

Even without bpf the networking processing takes long time.
What is the plan for RT there?
Are you planning to replace napi's preempt_disable with local_lock() too?

> Also you might have to accept that there is a world outside of BPF and that
> the 'some out of tree code' which we are talking about is the last part of
> a 15+ years effort which significantly helped to bring the Linux kernel
> into the shape it is today.

No doubt about good stuff that came out of that effort and will continue to come.
I'm arguing that 'pragmatic approach' to disable existing kernel features
to get the rest of RT upstream will backfire on RT in the first place.
ftrace disables preemption too. Lots of things do.
imo it's better to get key RT bits (like local_locks) in without
sending contentious patches like the one that sparked this thread.
When everyone can see what this local_lock is we can figure out
how and when to use it.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-10-22  1:43 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-17  9:05 [PATCH] BPF: Disable on PREEMPT_RT Sebastian Andrzej Siewior
2019-10-17 14:53 ` Daniel Borkmann
2019-10-17 15:40   ` Sebastian Andrzej Siewior
2019-10-17 17:25     ` David Miller
2019-10-17 21:54       ` Thomas Gleixner
2019-10-17 22:13         ` David Miller
2019-10-17 23:50           ` Thomas Gleixner
2019-10-17 23:27         ` Alexei Starovoitov
2019-10-18  0:22           ` Thomas Gleixner
2019-10-18  5:52             ` Alexei Starovoitov
2019-10-18 11:28               ` Thomas Gleixner
2019-10-18 12:48                 ` Sebastian Sewior
2019-10-18 23:05                 ` Alexei Starovoitov
2019-10-20  9:06                   ` Thomas Gleixner
2019-10-22  1:43                     ` Alexei Starovoitov
2019-10-18  2:49         ` Clark Williams
2019-10-18  4:57           ` David Miller
2019-10-18  5:54             ` Alexei Starovoitov
2019-10-18  8:38             ` Thomas Gleixner
2019-10-18 12:49               ` Clark Williams
2019-10-18  8:46           ` Thomas Gleixner
2019-10-18 12:43             ` Sebastian Sewior
2019-10-18 12:58             ` Clark Williams
2019-10-17 22:11       ` Thomas Gleixner
2019-10-17 22:23         ` David Miller
2019-10-17 17:26   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).