bpf.vger.kernel.org archive mirror
* [PATCH bpf v3 0/2] bpf: Fix deadlock with rq_lock in bpf_send_signal()
@ 2020-03-04 19:11 Yonghong Song
  2020-03-04 19:11 ` [PATCH bpf v3 1/2] " Yonghong Song
  2020-03-04 19:11 ` [PATCH bpf v3 2/2] selftests/bpf: add send_signal_sched_switch test Yonghong Song
  0 siblings, 2 replies; 5+ messages in thread
From: Yonghong Song @ 2020-03-04 19:11 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

Commit 8b401f9ed244 ("bpf: implement bpf_send_signal() helper")
introduced the bpf_send_signal() helper and Commit 8482941f0906
("bpf: Add bpf_send_signal_thread() helper") added the bpf_send_signal_thread()
helper. Both helpers send a signal to the current process or thread.

When a bpf program is called with the scheduler rq_lock held, a deadlock
can happen since bpf_send_signal() and bpf_send_signal_thread()
call group_send_sig_info(), which may ultimately want to acquire
rq_lock again. This happens on 5.2 and 4.16 based kernels in our
production environment with a perf_sw_event.

In a different scenario, the following deadlock is also possible on the latest kernel:
  cpu 1:
     do_task_stat <- holding sighand->siglock
     ...
     task_sched_runtime <- trying to grab rq_lock

  cpu 2:
     __schedule <- holding rq_lock
     ...
     do_send_sig_info <- trying to grab sighand->siglock

Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") fixed a similar rq_lock
deadlock. This patch set addresses the issue in the same way:
Patch #1 provides the kernel fix and Patch #2 adds a selftest.

Changelogs:
  v2 -> v3:
    . simplify selftest send_signal_sched_switch().
      The previous version had mmap/munmap calls inherited
      from Song's reproducer; they are not necessary
      in this context.
  v1 -> v2:
    . the previous fix using task_work in nmi() was incorrect
      since there is no nmi() involvement here. Using task_work
      in all cases might be a solution, but decided to
      use a fix similar to Commit eac9153f2b58.

Yonghong Song (2):
  bpf: Fix deadlock with rq_lock in bpf_send_signal()
  selftests/bpf: add send_signal_sched_switch test

 kernel/trace/bpf_trace.c                      |  5 +-
 .../bpf/prog_tests/send_signal_sched_switch.c | 60 +++++++++++++++++++
 .../bpf/progs/test_send_signal_kern.c         |  6 ++
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c

-- 
2.17.1



* [PATCH bpf v3 1/2] bpf: Fix deadlock with rq_lock in bpf_send_signal()
  2020-03-04 19:11 [PATCH bpf v3 0/2] bpf: Fix deadlock with rq_lock in bpf_send_signal() Yonghong Song
@ 2020-03-04 19:11 ` Yonghong Song
  2020-03-05 22:14   ` Alexei Starovoitov
  2020-03-04 19:11 ` [PATCH bpf v3 2/2] selftests/bpf: add send_signal_sched_switch test Yonghong Song
  1 sibling, 1 reply; 5+ messages in thread
From: Yonghong Song @ 2020-03-04 19:11 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Song Liu

When experimenting with the bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to the issue
reported in Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
the bpf_send_signal() helper instead of the bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153f2b58, the almost identical reproducer below
uses the sched/sched_switch tracepoint so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following commands:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch
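
For reference, the reproducer compiles with a plain pthread link; the exact
build command below is an assumption, it was not part of the original steps:
     gcc -o stress_test stress_test.c -lpthread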

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix here is to delay sending the signal if
irqs are disabled, to avoid deadlocks involving rq_lock.
With this change, the above stress test in our production system
no longer causes a deadlock.

I also implemented a scaled-down version of the reproducer in the
selftest (a subsequent commit). With the latest bpf-next,
lockdep complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock,
  which is held by CPU1 while CPU1 is trying to grab &rq->lock
  and cannot get it.

  This is not exactly the call stack in our production environment,
  but the symptom is similar: both locks are acquired with
  spin_lock_irqsave(), and both scenarios involve rq_lock. The fix to
  delay sending the signal when irqs are disabled also fixes this issue.
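
For reference, the deferral mechanism this change hooks into (added
originally by Commit 8b401f9ed244) looks roughly like the sketch below.
Struct and field names are approximations of kernel/trace/bpf_trace.c,
not an exact copy, and the irq_work initialization is omitted. With irqs
disabled, bpf_send_signal_common() fills the per-cpu work item and calls
irq_work_queue() instead of calling group_send_sig_info() directly:

  struct send_signal_irq_work {
          struct irq_work irq_work;
          struct task_struct *task;
          u32 sig;
          enum pid_type type;
  };

  static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);

  /* irq_work callback: runs after the caller has released rq_lock and
   * re-enabled irqs, so group_send_sig_info() can safely take
   * sighand->siglock and the scheduler locks.
   */
  static void do_bpf_send_signal(struct irq_work *entry)
  {
          struct send_signal_irq_work *work;

          work = container_of(entry, struct send_signal_irq_work, irq_work);
          group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, work->type);
  }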

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/trace/bpf_trace.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 19e793aa441a..55a69b53054e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -732,7 +732,10 @@ static int bpf_send_signal_common(u32 sig, enum pid_type type)
 	if (unlikely(!nmi_uaccess_okay()))
 		return -EPERM;
 
-	if (in_nmi()) {
+	/* Delay sending signal if irq is disabled. Otherwise,
+	 * we risk deadlock with rq_lock.
+	 */
+	if (irqs_disabled()) {
 		/* Do an early check on signal validity. Otherwise,
 		 * the error is lost in deferred irq_work.
 		 */
-- 
2.17.1



* [PATCH bpf v3 2/2] selftests/bpf: add send_signal_sched_switch test
  2020-03-04 19:11 [PATCH bpf v3 0/2] bpf: Fix deadlock with rq_lock in bpf_send_signal() Yonghong Song
  2020-03-04 19:11 ` [PATCH bpf v3 1/2] " Yonghong Song
@ 2020-03-04 19:11 ` Yonghong Song
  2020-03-04 20:21   ` Andrii Nakryiko
  1 sibling, 1 reply; 5+ messages in thread
From: Yonghong Song @ 2020-03-04 19:11 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Song Liu

Added one test, send_signal_sched_switch, to test the bpf_send_signal()
helper triggered by the sched/sched_switch tracepoint. This test can be used
to verify the kernel deadlock fixed by the previous commit. The test itself
is heavily borrowed from Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock
with rq_lock in bpf_get_stack()").

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpf/prog_tests/send_signal_sched_switch.c | 60 +++++++++++++++++++
 .../bpf/progs/test_send_signal_kern.c         |  6 ++
 2 files changed, 66 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c

diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c b/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c
new file mode 100644
index 000000000000..189a34a7addb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "test_send_signal_kern.skel.h"
+
+static void sigusr1_handler(int signum)
+{
+}
+
+#define THREAD_COUNT 100
+
+static void *worker(void *p)
+{
+	int i;
+
+	for ( i = 0; i < 1000; i++)
+		usleep(1);
+
+	return NULL;
+}
+
+void test_send_signal_sched_switch(void)
+{
+	struct test_send_signal_kern *skel;
+	pthread_t threads[THREAD_COUNT];
+	u32 duration = 0;
+	int i, err;
+
+	signal(SIGUSR1, sigusr1_handler);
+
+	skel = test_send_signal_kern__open_and_load();
+	if (CHECK(!skel, "skel_open_and_load", "skeleton open_and_load failed\n"))
+		return;
+
+	skel->bss->pid = getpid();
+	skel->bss->sig = SIGUSR1;
+
+	err = test_send_signal_kern__attach(skel);
+	if (CHECK(err, "skel_attach", "skeleton attach failed\n"))
+		goto destroy_skel;
+
+	for (i = 0; i < THREAD_COUNT; i++) {
+		err = pthread_create(threads + i, NULL, worker, NULL);
+		if (CHECK(err, "pthread_create", "Error creating thread, %s\n",
+			  strerror(errno)))
+			goto destroy_skel;
+	}
+
+	for (i = 0; i < THREAD_COUNT; i++)
+		pthread_join(threads[i], NULL);
+
+destroy_skel:
+	test_send_signal_kern__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
index 1acc91e87bfc..b4233d3efac2 100644
--- a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
@@ -31,6 +31,12 @@ int send_signal_tp(void *ctx)
 	return bpf_send_signal_test(ctx);
 }
 
+SEC("tracepoint/sched/sched_switch")
+int send_signal_tp_sched(void *ctx)
+{
+	return bpf_send_signal_test(ctx);
+}
+
 SEC("perf_event")
 int send_signal_perf(void *ctx)
 {
-- 
2.17.1



* Re: [PATCH bpf v3 2/2] selftests/bpf: add send_signal_sched_switch test
  2020-03-04 19:11 ` [PATCH bpf v3 2/2] selftests/bpf: add send_signal_sched_switch test Yonghong Song
@ 2020-03-04 20:21   ` Andrii Nakryiko
  0 siblings, 0 replies; 5+ messages in thread
From: Andrii Nakryiko @ 2020-03-04 20:21 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team, Song Liu

On Wed, Mar 4, 2020 at 11:11 AM Yonghong Song <yhs@fb.com> wrote:
>
> Added one test, send_signal_sched_switch, to test bpf_send_signal()
> helper triggered by sched/sched_switch tracepoint. This test can be used
> to verify kernel deadlocks fixed by the previous commit. The test itself
> is heavily borrowed from Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock
> with rq_lock in bpf_get_stack()").
>
> Cc: Song Liu <songliubraving@fb.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

LGTM.

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  .../bpf/prog_tests/send_signal_sched_switch.c | 60 +++++++++++++++++++
>  .../bpf/progs/test_send_signal_kern.c         |  6 ++
>  2 files changed, 66 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c b/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c
> new file mode 100644
> index 000000000000..189a34a7addb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal_sched_switch.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/mman.h>
> +#include <pthread.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include "test_send_signal_kern.skel.h"
> +
> +static void sigusr1_handler(int signum)
> +{
> +}
> +
> +#define THREAD_COUNT 100
> +
> +static void *worker(void *p)
> +{
> +       int i;
> +
> +       for ( i = 0; i < 1000; i++)
> +               usleep(1);
> +
> +       return NULL;
> +}
> +
> +void test_send_signal_sched_switch(void)
> +{
> +       struct test_send_signal_kern *skel;
> +       pthread_t threads[THREAD_COUNT];
> +       u32 duration = 0;
> +       int i, err;
> +
> +       signal(SIGUSR1, sigusr1_handler);
> +
> +       skel = test_send_signal_kern__open_and_load();
> +       if (CHECK(!skel, "skel_open_and_load", "skeleton open_and_load failed\n"))
> +               return;
> +
> +       skel->bss->pid = getpid();
> +       skel->bss->sig = SIGUSR1;
> +
> +       err = test_send_signal_kern__attach(skel);
> +       if (CHECK(err, "skel_attach", "skeleton attach failed\n"))
> +               goto destroy_skel;
> +
> +       for (i = 0; i < THREAD_COUNT; i++) {
> +               err = pthread_create(threads + i, NULL, worker, NULL);
> +               if (CHECK(err, "pthread_create", "Error creating thread, %s\n",
> +                         strerror(errno)))
> +                       goto destroy_skel;
> +       }
> +
> +       for (i = 0; i < THREAD_COUNT; i++)
> +               pthread_join(threads[i], NULL);
> +
> +destroy_skel:
> +       test_send_signal_kern__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> index 1acc91e87bfc..b4233d3efac2 100644
> --- a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> @@ -31,6 +31,12 @@ int send_signal_tp(void *ctx)
>         return bpf_send_signal_test(ctx);
>  }
>
> +SEC("tracepoint/sched/sched_switch")
> +int send_signal_tp_sched(void *ctx)
> +{
> +       return bpf_send_signal_test(ctx);
> +}
> +
>  SEC("perf_event")
>  int send_signal_perf(void *ctx)
>  {
> --
> 2.17.1
>


* Re: [PATCH bpf v3 1/2] bpf: Fix deadlock with rq_lock in bpf_send_signal()
  2020-03-04 19:11 ` [PATCH bpf v3 1/2] " Yonghong Song
@ 2020-03-05 22:14   ` Alexei Starovoitov
  0 siblings, 0 replies; 5+ messages in thread
From: Alexei Starovoitov @ 2020-03-05 22:14 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team, Song Liu

On Wed, Mar 04, 2020 at 11:11:04AM -0800, Yonghong Song wrote:
>  
> -	if (in_nmi()) {
> +	/* Delay sending signal if irq is disabled. Otherwise,
> +	 * we risk deadlock with rq_lock.
> +	 */

This comment, read in isolation, is confusing. It's not clear how irq
state has anything to do with rq_lock and why a deadlock is possible.
But the commit log has a very nice summary,
so I've just deleted that comment and applied to bpf tree.
Patch 2 did reproduce the lockdep splat for me, which was great.
Running test_progs a few times before and after, I noticed that the older
send_signal tests are flaky and fail from time to time.
It's unrelated to this patch, but please take a look.

