bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 0/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
@ 2019-10-10  6:19 Song Liu
  2019-10-10  6:19 ` [PATCH bpf-next 1/2] sched: introduce this_rq_is_locked() Song Liu
  2019-10-10  6:19 ` [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
  0 siblings, 2 replies; 9+ messages in thread
From: Song Liu @ 2019-10-10  6:19 UTC
  To: linux-kernel, bpf, netdev; +Cc: kernel-team, Song Liu

bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
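
Reading the trace bottom-up: __schedule() takes this CPU's rq lock, then
fires a software event from which the bpf program calls bpf_get_stack();
the up_read() there wakes a waiter, and try_to_wake_up() may need the very
same rq lock (likely when the workload is pinned to few CPUs, hence the
taskset step in 2/2). A sketch of the cycle, annotated for clarity, with
lock and field names as of v5.3:

  __schedule()                            /* takes this CPU's rq->lock */
    ___perf_sw_event()                    /* context-switch sw event fires */
      bpf_overflow_handler()
        bpf_get_stack()                   /* build-id lookup path */
          up_read(&current->mm->mmap_sem)
            rwsem_wake()
              wake_up_q()
                try_to_wake_up()          /* wants rq->lock again: A-A */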

For more details on how to reproduce this error, please refer to patch 2/2.

Fix this issue with a new helper, this_rq_is_locked(). If this CPU's rq
lock is already held, postpone the up_read() to irq_work, just like in
the in_nmi() case.
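
For context, here is an abbreviated sketch of the deferral mechanism being
reused (paraphrased from stack_map_get_build_id_offset(); not the complete
function):

	if (in_nmi() || this_rq_is_locked()) {
		work = this_cpu_ptr(&up_read_work);
		if (work->irq_work.flags & IRQ_WORK_BUSY)
			/* cannot queue more up_read, fall back to ip-only entries */
			irq_work_busy = true;
	}

	/* ... trylock mmap_sem, walk vmas, fill in build-id + offset ... */

	if (!work) {
		up_read(&current->mm->mmap_sem);
	} else {
		/* deferred: do_up_read() releases the semaphore from irq_work */
		work->sem = &current->mm->mmap_sem;
		irq_work_queue(&work->irq_work);
	}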

Song Liu (2):
  sched: introduce this_rq_is_locked()
  bpf/stackmap: fix A-A deadlock in bpf_get_stack()

 include/linux/sched.h | 1 +
 kernel/bpf/stackmap.c | 2 +-
 kernel/sched/core.c   | 8 ++++++++
 3 files changed, 10 insertions(+), 1 deletion(-)

--
2.17.1


* [PATCH bpf-next 1/2] sched: introduce this_rq_is_locked()
  2019-10-10  6:19 [PATCH bpf-next 0/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
@ 2019-10-10  6:19 ` Song Liu
  2019-10-10  6:19 ` [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
  1 sibling, 0 replies; 9+ messages in thread
From: Song Liu @ 2019-10-10  6:19 UTC
  To: linux-kernel, bpf, netdev
  Cc: kernel-team, Song Liu, stable, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Tejun Heo

this_rq_is_locked() is introduced to check whether the current CPU is
holding its rq lock. This will be used in kernel/bpf/stackmap.c to decide
whether it is safe to call up_read(), which may end up taking the same
CPU's rq lock (via rwsem_wake() -> try_to_wake_up()).

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Cc: stable@vger.kernel.org # v4.17+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2e56bd8913..fb0fcbd1b6f6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,4 +1995,5 @@ int sched_trace_rq_cpu(struct rq *rq);

 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+bool this_rq_is_locked(void);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..577cbe7c05fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,14 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }

+bool this_rq_is_locked(void)
+{
+	struct rq *rq;
+
+	rq = this_rq();
+	return raw_spin_is_locked(&rq->lock);
+}
+
 /*
  * RQ-clock updating methods:
  */
--
2.17.1


* [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10  6:19 [PATCH bpf-next 0/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
  2019-10-10  6:19 ` [PATCH bpf-next 1/2] sched: introduce this_rq_is_locked() Song Liu
@ 2019-10-10  6:19 ` Song Liu
  2019-10-10  7:36   ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Song Liu @ 2019-10-10  6:19 UTC
  To: linux-kernel, bpf, netdev
  Cc: kernel-team, Song Liu, stable, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann

bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This can be reproduced by:
1. Start a multi-threaded program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach a bpf program to trace_sched_switch and gather a stackmap with
   build-id, e.g. with trace.py from bcc tools:
   trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch

A sample reproducer is attached at the end.

Fix this by checking whether this CPU's rq lock is already held. If so,
postpone the up_read() to irq_work, just like in the in_nmi() case.

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Cc: stable@vger.kernel.org # v4.17+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>

Reproducer:
============================ 8< ============================

/*
 * Note: the includes and THREAD_COUNT below were implicit in the original
 * posting; the thread count is an arbitrary value that oversubscribes the
 * 2 CPUs from the taskset step.
 * Build with: gcc -o repro repro.c -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define THREAD_COUNT 40

char *filename;

void *worker(void *p)
{
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
}

int main(int argc, char *argv[])
{
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
}
============================ 8< ============================
---
 kernel/bpf/stackmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 052580c33d26..3b278f6b0c3e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;

-	if (in_nmi()) {
+	if (in_nmi() || this_rq_is_locked()) {
 		work = this_cpu_ptr(&up_read_work);
 		if (work->irq_work.flags & IRQ_WORK_BUSY)
 			/* cannot queue more up_read, fallback */
--
2.17.1


* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10  6:19 ` [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
@ 2019-10-10  7:36   ` Peter Zijlstra
  2019-10-10 17:19     ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2019-10-10  7:36 UTC
  To: Song Liu
  Cc: linux-kernel, bpf, netdev, kernel-team, stable,
	Alexei Starovoitov, Daniel Borkmann

On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> deadlock on rq_lock():
> 
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [...]
> Call Trace:
>  try_to_wake_up+0x1ad/0x590
>  wake_up_q+0x54/0x80
>  rwsem_wake+0x8a/0xb0
>  bpf_get_stack+0x13c/0x150
>  bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>  bpf_overflow_handler+0x60/0x100
>  __perf_event_overflow+0x4f/0xf0
>  perf_swevent_overflow+0x99/0xc0
>  ___perf_sw_event+0xe7/0x120
>  __schedule+0x47d/0x620
>  schedule+0x29/0x90
>  futex_wait_queue_me+0xb9/0x110
>  futex_wait+0x139/0x230
>  do_futex+0x2ac/0xa50
>  __x64_sys_futex+0x13c/0x180
>  do_syscall_64+0x42/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 

> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 052580c33d26..3b278f6b0c3e 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>  	bool irq_work_busy = false;
>  	struct stack_map_irq_work *work = NULL;
> 
> -	if (in_nmi()) {
> +	if (in_nmi() || this_rq_is_locked()) {
>  		work = this_cpu_ptr(&up_read_work);
>  		if (work->irq_work.flags & IRQ_WORK_BUSY)
>  			/* cannot queue more up_read, fallback */

This is horrific crap. Just say no to that get_build_id_offset()
trainwreck.


* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10  7:36   ` Peter Zijlstra
@ 2019-10-10 17:19     ` Alexei Starovoitov
  2019-10-10 17:46       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2019-10-10 17:19 UTC
  To: Peter Zijlstra, Song Liu
  Cc: linux-kernel, bpf, netdev, Kernel Team, stable,
	Alexei Starovoitov, Daniel Borkmann

On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>> deadlock on rq_lock():
>>
>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [...]
>> Call Trace:
>>   try_to_wake_up+0x1ad/0x590
>>   wake_up_q+0x54/0x80
>>   rwsem_wake+0x8a/0xb0
>>   bpf_get_stack+0x13c/0x150
>>   bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>>   bpf_overflow_handler+0x60/0x100
>>   __perf_event_overflow+0x4f/0xf0
>>   perf_swevent_overflow+0x99/0xc0
>>   ___perf_sw_event+0xe7/0x120
>>   __schedule+0x47d/0x620
>>   schedule+0x29/0x90
>>   futex_wait_queue_me+0xb9/0x110
>>   futex_wait+0x139/0x230
>>   do_futex+0x2ac/0xa50
>>   __x64_sys_futex+0x13c/0x180
>>   do_syscall_64+0x42/0x100
>>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
> 
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 052580c33d26..3b278f6b0c3e 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>   	bool irq_work_busy = false;
>>   	struct stack_map_irq_work *work = NULL;
>>
>> -	if (in_nmi()) {
>> +	if (in_nmi() || this_rq_is_locked()) {
>>   		work = this_cpu_ptr(&up_read_work);
>>   		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>   			/* cannot queue more up_read, fallback */
> 
> This is horrific crap. Just say no to that get_build_id_offset()
> trainwreck.

this is not a helpful comment.
What issues do you see with this approach?





* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10 17:19     ` Alexei Starovoitov
@ 2019-10-10 17:46       ` Peter Zijlstra
  2019-10-10 18:06         ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2019-10-10 17:46 UTC
  To: Alexei Starovoitov
  Cc: Song Liu, linux-kernel, bpf, netdev, Kernel Team, stable,
	Alexei Starovoitov, Daniel Borkmann

On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
> > On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
> >> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
> >> deadlock on rq_lock():
> >>
> >> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> >> [...]
> >> Call Trace:
> >>   try_to_wake_up+0x1ad/0x590
> >>   wake_up_q+0x54/0x80
> >>   rwsem_wake+0x8a/0xb0
> >>   bpf_get_stack+0x13c/0x150
> >>   bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
> >>   bpf_overflow_handler+0x60/0x100
> >>   __perf_event_overflow+0x4f/0xf0
> >>   perf_swevent_overflow+0x99/0xc0
> >>   ___perf_sw_event+0xe7/0x120
> >>   __schedule+0x47d/0x620
> >>   schedule+0x29/0x90
> >>   futex_wait_queue_me+0xb9/0x110
> >>   futex_wait+0x139/0x230
> >>   do_futex+0x2ac/0xa50
> >>   __x64_sys_futex+0x13c/0x180
> >>   do_syscall_64+0x42/0x100
> >>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>
> > 
> >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> >> index 052580c33d26..3b278f6b0c3e 100644
> >> --- a/kernel/bpf/stackmap.c
> >> +++ b/kernel/bpf/stackmap.c
> >> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> >>   	bool irq_work_busy = false;
> >>   	struct stack_map_irq_work *work = NULL;
> >>
> >> -	if (in_nmi()) {
> >> +	if (in_nmi() || this_rq_is_locked()) {
> >>   		work = this_cpu_ptr(&up_read_work);
> >>   		if (work->irq_work.flags & IRQ_WORK_BUSY)
> >>   			/* cannot queue more up_read, fallback */
> > 
> > This is horrific crap. Just say no to that get_build_id_offset()
> > trainwreck.
> 
> this is not a helpful comment.
> What issues do you see with this approach?

It will still generate deadlocks if I place a tracepoint inside a lock
that nests inside rq->lock, and it won't ever be able to detect that.
Say you do the very same thing on trace_hrtimer_start(), which is under
cpu_base->lock, which nests inside rq->lock. That should give you an
AB-BA.
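
Spelled out, that inversion is (a sketch of the scenario above, with the
established order cpu_base->lock nesting inside rq->lock):

  CPU 0 (e.g. hrtick / CFS bandwidth):    CPU 1 (tracepoint path):

    rq->lock                                cpu_base->lock
      hrtimer_start()                         trace_hrtimer_start()
        cpu_base->lock   <- waits               bpf_get_stack()
                                                  up_read()
                                                    try_to_wake_up()
                                                      rq->lock   <- waits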

tracepoints / perf-overflow should _never_ take locks.

All of stack_map_get_build_id_offset() is just disgusting games; I did
tell you guys how to do lockless vma lookups a few years ago -- and yes,
that is invasive core mm surgery. But these are just disgusting hacks for
not wanting to do it right.

Basically the only semi-sane thing to do with that trainwreck is
s/in_nmi()/true/ and pray.

On top of that I just hate buildids in general.



* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10 17:46       ` Peter Zijlstra
@ 2019-10-10 18:06         ` Alexei Starovoitov
  2019-10-14  9:09           ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2019-10-10 18:06 UTC
  To: Peter Zijlstra
  Cc: Song Liu, linux-kernel, bpf, netdev, Kernel Team, stable,
	Alexei Starovoitov, Daniel Borkmann

On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2019 at 05:19:01PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 12:36 AM, Peter Zijlstra wrote:
>>> On Wed, Oct 09, 2019 at 11:19:16PM -0700, Song Liu wrote:
>>>> bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
>>>> deadlock on rq_lock():
>>>>
>>>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>>> [...]
>>>> Call Trace:
>>>>    try_to_wake_up+0x1ad/0x590
>>>>    wake_up_q+0x54/0x80
>>>>    rwsem_wake+0x8a/0xb0
>>>>    bpf_get_stack+0x13c/0x150
>>>>    bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
>>>>    bpf_overflow_handler+0x60/0x100
>>>>    __perf_event_overflow+0x4f/0xf0
>>>>    perf_swevent_overflow+0x99/0xc0
>>>>    ___perf_sw_event+0xe7/0x120
>>>>    __schedule+0x47d/0x620
>>>>    schedule+0x29/0x90
>>>>    futex_wait_queue_me+0xb9/0x110
>>>>    futex_wait+0x139/0x230
>>>>    do_futex+0x2ac/0xa50
>>>>    __x64_sys_futex+0x13c/0x180
>>>>    do_syscall_64+0x42/0x100
>>>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>
>>>
>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>> index 052580c33d26..3b278f6b0c3e 100644
>>>> --- a/kernel/bpf/stackmap.c
>>>> +++ b/kernel/bpf/stackmap.c
>>>> @@ -287,7 +287,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>>    	bool irq_work_busy = false;
>>>>    	struct stack_map_irq_work *work = NULL;
>>>>
>>>> -	if (in_nmi()) {
>>>> +	if (in_nmi() || this_rq_is_locked()) {
>>>>    		work = this_cpu_ptr(&up_read_work);
>>>>    		if (work->irq_work.flags & IRQ_WORK_BUSY)
>>>>    			/* cannot queue more up_read, fallback */
>>>
>>> This is horrific crap. Just say no to that get_build_id_offset()
>>> trainwreck.
>>
>> this is not a helpful comment.
>> What issues do you see with this approach?
> 
> It will still generate deadlocks if I place a tracepoint inside a lock
> that nests inside rq->lock, and it won't ever be able to detect that.
> Say you do the very same thing on trace_hrtimer_start(), which is under
> cpu_base->lock, which nests inside rq->lock. That should give you an
> AB-BA.
> 
> tracepoints / perf-overflow should _never_ take locks.
> 
> All of stack_map_get_build_id_offset() is just disgusting games; I did
> tell you guys how to do lockless vma lookups a few years ago -- and yes,
> that is invasive core mm surgery. But these are just disgusting hacks for
> not wanting to do it right.

You mean the speculative page fault stuff?
That was my hope as well, and I offered Laurent all the help to land it.
Yet a year after we talked, the patches are not any closer to landing.
Any other 'invasive mm surgery' you have in mind?

> Basically the only semi-sane thing to do with that trainwreck is
> s/in_nmi()/true/ and pray.
> 
> On top of that I just hate buildids in general.

Emotions aside... build_id is useful and used in production.
It's used widely because it solves real problems.
This deadlock is from real servers and not from some sanitizer wannabe.
Hence we need to fix it as cleanly as possible and quickly.
s/in_nmi/true/ is certainly an option.
I'm worried about the overhead of doing irq_work_queue() all the time.
But I'm not familiar enough with the mechanism to justify the concern.
Would it make sense to do s/in_nmi/irqs_disabled/ instead?



* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-10 18:06         ` Alexei Starovoitov
@ 2019-10-14  9:09           ` Peter Zijlstra
  2019-10-14 20:12             ` Song Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2019-10-14  9:09 UTC
  To: Alexei Starovoitov
  Cc: Song Liu, linux-kernel, bpf, netdev, Kernel Team, stable,
	Alexei Starovoitov, Daniel Borkmann

On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
> On 10/10/19 10:46 AM, Peter Zijlstra wrote:

> > All of stack_map_get_build_id_offset() is just disgusting games; I did
> > tell you guys how to do lockless vma lookups a few years ago -- and yes,
> > that is invasive core mm surgery. But these are just disgusting hacks for
> > not wanting to do it right.
> 
> You mean the speculative page fault stuff?
> That was my hope as well, and I offered Laurent all the help to land it.
> Yet a year after we talked, the patches are not any closer to landing.
> Any other 'invasive mm surgery' you have in mind?

Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
which is exactly what you need here.

> > Basically the only semi-sane thing to do with that trainwreck is
> > s/in_nmi()/true/ and pray.
> > 
> > On top of that I just hate buildids in general.
> 
> Emotions aside... build_id is useful and used in production.
> It's used widely because it solves real problems.

AFAIU it solves the problem of you not knowing what version of the
binary runs where; which I was hoping your cloud infrastructure thing
would actually know already.

Anyway, I know what it does, I just don't necessarily agree it is the
right way around that particular problem (also, the way I'm personally
affected is that perf-record is dead slow by default due to build-id
post-processing).

And it obviously leads to horrible hacks like the code currently under
discussion :/

> This deadlock is from real servers and not from some sanitizer wannabe.

If you enable CFS bandwidth control and run this function on the
trace_hrtimer_start() tracepoint, you should be able to trigger a real
AB-BA lockup.

> Hence we need to fix it as cleanly as possible and quickly.
> s/in_nmi/true/ is certainly an option.

That is the best option, because tracepoints / perf-overflow handlers
really should not be taking any locks.

> I'm worried about the overhead of doing irq_work_queue() all the time.
> But I'm not familiar enough with the mechanism to justify the concern.
> Would it make sense to do s/in_nmi/irqs_disabled/ instead?

irqs_disabled() should work in this particular case because rq->lock
(and therefore all its nested locks) are IRQ-safe.



* Re: [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack()
  2019-10-14  9:09           ` Peter Zijlstra
@ 2019-10-14 20:12             ` Song Liu
  0 siblings, 0 replies; 9+ messages in thread
From: Song Liu @ 2019-10-14 20:12 UTC
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, linux-kernel, bpf, netdev, Kernel Team,
	stable, Alexei Starovoitov, Daniel Borkmann

Thanks Peter!

> On Oct 14, 2019, at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Oct 10, 2019 at 06:06:14PM +0000, Alexei Starovoitov wrote:
>> On 10/10/19 10:46 AM, Peter Zijlstra wrote:
> 
>>> All of stack_map_get_build_id_offset() is just disgusting games; I did
>>> tell you guys how to do lockless vma lookups a few years ago -- and yes,
>>> that is invasive core mm surgery. But these are just disgusting hacks for
>>> not wanting to do it right.
>> 
>> You mean the speculative page fault stuff?
>> That was my hope as well, and I offered Laurent all the help to land it.
>> Yet a year after we talked, the patches are not any closer to landing.
>> Any other 'invasive mm surgery' you have in mind?
> 
> Indeed that series. It had RCU managed VMAs and lockless VMA lookups,
> which is exactly what you need here.

Lockless VMA lookups would be really useful. They would resolve all the
pain we are having here.

I remember Google folks also mentioned at LPC that they would like a
better mechanism to confirm build-ids in perf.

> 
>>> Basically the only semi-sane thing to do with that trainwreck is
>>> s/in_nmi()/true/ and pray.
>>> 
>>> On top of that I just hate buildids in general.
>> 
>> Emotions aside... build_id is useful and used in production.
>> It's used widely because it solves real problems.
> 
> AFAIU it solves the problem of you not knowing what version of the
> binary runs where; which I was hoping your cloud infrastructure thing
> would actually know already.
> 
> Anyway, I know what it does, I just don't necessarily agree it is the
> right way around that particular problem (also, the way I'm personally
> affected is that perf-record is dead slow by default due to build-id
> post-processing).
> 
> And it obviously leads to horrible hacks like the code currently under
> discussion :/
> 
>> This deadlock is from real servers and not from some sanitizer wannabe.
> 
> If you enable CFS bandwidth control and run this function on the
> trace_hrtimer_start() tracepoint, you should be able to trigger a real
> AB-BA lockup.
> 
>> Hence we need to fix it as cleanly as possible and quickly.
>> s/in_nmi/true/ is certainly an option.
> 
> That is the best option, because tracepoints / perf-overflow handlers
> really should not be taking any locks.
> 
>> I'm worried about the overhead of doing irq_work_queue() all the time.
>> But I'm not familiar enough with the mechanism to justify the concern.
>> Would it make sense to do s/in_nmi/irqs_disabled/ instead?
> 
> irqs_disabled() should work in this particular case because rq->lock
> (and therefore all its nested locks) are IRQ-safe.

We are worried about the overhead of irq_work for every single stackmap
lookup, so we would like to go with the irqs_disabled() check. I just
sent v2 of the patch.
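
For reference, the irqs_disabled() variant is a one-line change on top of
615755a77b24; a sketch (see v2 for the actual patch):

-	if (in_nmi()) {
+	if (irqs_disabled()) {	/* NMI context runs with IRQs disabled */
 		work = this_cpu_ptr(&up_read_work);

Since rq->lock is IRQ-safe, interrupts are disabled whenever it (or any
lock nesting inside it) is held, so the check covers the rq lock case
without a new sched helper, and without queueing irq_work unconditionally.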

Thanks again,
Song



end of thread, other threads:[~2019-10-14 20:13 UTC | newest]

Thread overview: 9+ messages
2019-10-10  6:19 [PATCH bpf-next 0/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
2019-10-10  6:19 ` [PATCH bpf-next 1/2] sched: introduce this_rq_is_locked() Song Liu
2019-10-10  6:19 ` [PATCH bpf-next 2/2] bpf/stackmap: fix A-A deadlock in bpf_get_stack() Song Liu
2019-10-10  7:36   ` Peter Zijlstra
2019-10-10 17:19     ` Alexei Starovoitov
2019-10-10 17:46       ` Peter Zijlstra
2019-10-10 18:06         ` Alexei Starovoitov
2019-10-14  9:09           ` Peter Zijlstra
2019-10-14 20:12             ` Song Liu
