[PATCH 0/2] Fix SCHED_DEADLINE nested priority inheritance

* [PATCH 0/2] Fix SCHED_DEADLINE nested priority inheritance
@ 2019-11-12  7:50 Juri Lelli
  2019-11-12  7:50 ` [PATCH 1/2] sched/deadline: Fix nested priority inheritace at enqueue time Juri Lelli
  2019-11-12  7:50 ` [PATCH 2/2] sched/deadline: Temporary copy static parameters to boosted non-DEADLINE entities Juri Lelli
  0 siblings, 2 replies; 8+ messages in thread
From: Juri Lelli @ 2019-11-12  7:50 UTC (permalink / raw)
  To: peterz, mingo, glenn
  Cc: linux-kernel, rostedt, vincent.guittot, dietmar.eggemann, tglx,
	luca.abeni, c.scordino, tommaso.cucinotta, bristot, juri.lelli

Hi,

Glenn reported a bug concerning SCHED_DEADLINE nested (cross-class)
priority inheritance. With his words:

"""
Hello maintainers,

My application produces a BUG in deadline.c when a SCHED_DEADLINE task
contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes.  I
believe the bug is triggered when a CFS task that was boosted by a
SCHED_DEADLINE task boosts another CFS task (nested priority
inheritance).

I am able to reproduce the bug on 4.15, 4.19-rt, and 5.3.0-19-generic
(an Ubuntu kernel) kernels.

I have a small test program that can reproduce the issue.  Please find
the source code for this test at the end of this email.  I have tested
on an Intel Xeon 8276 CPU, as well as within a x86_64 VM.  I can try
this test with a 4.19-rt kernel on aarch64, if desired.

Here is the BUG output on a 4.19-rt kernel:

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: P           O      4.19.72-rt25-appaloosa-v1.5 #1
 Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 Code: ...
 RSP: 002b:00007fa58ac54ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000023
 RAX: ffffffffffffffda RBX: ffffffffffffff98 RCX: 00007fa58b52330d
 RDX: 00007fa58b81d780 RSI: 00007fa58ac54f00 RDI: 00007fa58ac54f00
 RBP: 0000000000000000 R08: 00007fa58ac55700 R09: 0000000000000000
 R10: 0000000000000735 R11: 0000000000000293 R12: 0000000000000000
 R13: 00007ffc807fd43f R14: 00007fa58ac559c0 R15: 0000000000000000
 Modules linked in: ...
 ---[ end trace 0000000000000002 ]—
"""

He also provided a reproducer (appended to this cover) with which I was
able to confirm that tip/sched/core is affect as well.

Proposed fix is composed of 2 patches

 1 - Trigger priority inheritance (boost) for "indirect" cases as well;
     where I (improperly?) consider "indirect" a non-SCHED_DEADLINE waiter
     currently boosted from a different lock chain containing a
     SCHED_DEADLINE task
 2 - Temporarily copy static parameters to non-SCHED_DEADLINE boosted
     tasks, so that they can be used in "indirect" cases

Now. This is pretty far from ideal, I know. Consider it a stop gap
solution (hack, if it makes any sense) to proper proxy execution (on
which I'm still working :-/). Better ideas are more than welcome!

Best,

Juri

Juri Lelli (2):
  sched/deadline: Fix nested priority inheritace at enqueue time
  sched/deadline: Temporary copy static parameters to boosted
    non-DEADLINE entities

 kernel/sched/core.c     |  6 ++++--
 kernel/sched/deadline.c | 19 ++++++++++++++++++-
 kernel/sched/sched.h    |  1 +
 3 files changed, 23 insertions(+), 3 deletions(-)

-- 
2.17.2

--->8---
— test case —

/* gcc main.c -lpthread */

/*
 * This program reproduces a kernel bug at line
 * https://elixir.bootlin.com/linux/v4.15/source/kernel/sched/deadline.c#L1405, but not limited
 * to the version v4.15.
 *
 * This is bug is triggered when a non-deadline task that was boosted by a deadline task boosts
 * another non-deadline task.
 *
 * So the execution order of locking steps are the following
 * (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled
 * with priority inheritance.)
 *
 * Time moves forward as this timeline goes down:
 *
 * N1              N2               D1
 * |               |                |
 * |               |                |
 * Lock(M1)        |                |
 * |               |                |
 * |             Lock(M2)           |
 * |               |                |
 * |               |              Lock(M2)
 * |               |                |
 * |             Lock(M1)           |
 * |             (!!bug triggered!) |
 *
 */

#define _GNU_SOURCE
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/unistd.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define gettid() syscall(__NR_gettid)

#define SCHED_DEADLINE 6

#ifdef __x86_64__
#define __NR_sched_setattr 314
#define __NR_sched_getattr 315
#endif

#ifdef __i386__
#define __NR_sched_setattr 351
#define __NR_sched_getattr 352
#endif

#ifdef __arm__
#define __NR_sched_setattr 380
#define __NR_sched_getattr 381
#endif

static volatile int step;
pthread_mutex_t m1;
pthread_mutex_t m2;

struct sched_attr {
    __u32 size;

    __u32 sched_policy;
    __u64 sched_flags;

    /* SCHED_NORMAL, SCHED_BATCH */
    __s32 sched_nice;

    /* SCHED_FIFO, SCHED_RR */
    __u32 sched_priority;

    /* SCHED_DEADLINE (nsec) */
    __u64 sched_runtime;
    __u64 sched_deadline;
    __u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr,
                  unsigned int flags) {
    return syscall(__NR_sched_setattr, pid, attr, flags);
}

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size,
                  unsigned int flags) {
    return syscall(__NR_sched_getattr, pid, attr, size, flags);
}

void *run_normal_1(void *data) {
    int x = 0;

    printf("normal thread started [%ld]\n", gettid());

    // N1 locks M1
    printf("N1 is locking M1\n");
    pthread_mutex_lock(&m1);
    printf("N1 locked M1\n");

    // Notify N2
    step = 1;
    // Wait to be boosted by N2 (on M1)
    sleep(10);

    // Won't be able to reach here because of the rt_mutex + sched_deadline bug.
    printf("N1 is unlocking M1\n");
    pthread_mutex_unlock(&m1);
    printf("N1 unlocked M1\n");

    printf("normal thread dies [%ld]\n", gettid());
    return NULL;
}

void *run_normal_2(void *data) {
    int x = 0;

    printf("normal thread started [%ld]\n", gettid());

    // N2 locks M2
    printf("N2 is locking M2\n");
    pthread_mutex_lock(&m2);
    printf("N2 locked M2\n");

    // Wait until N1 locked M1
    while (step < 1) {
        x++;
    }

    // Notify D1
    step = 2;
    // Wait to be boosted by D1 (on M2)
    sleep(5);

    printf("N2 is locking M1\n");
    // This will boost N1 and trigger the bug.
    pthread_mutex_lock(&m1);
    // Won't reach here because of the bug
    printf("N2 locked M1\n");

    printf("N2 is unlocking M1\n");
    pthread_mutex_unlock(&m1);
    printf("N2 unlocked M1\n");

    printf("N2 is unlocking M2\n");
    pthread_mutex_unlock(&m2);
    printf("N2 unlocked M2\n");

    printf("normal thread dies [%ld]\n", gettid());
    return NULL;
}

void *run_deadline(void *data) {
    struct sched_attr attr;
    int x = 0;
    int ret = 0;
    unsigned int flags = 0;

    printf("deadline thread started [%ld]\n", gettid());

    attr.size = sizeof(attr);
    attr.sched_flags = 0;
    attr.sched_nice = 0;
    attr.sched_priority = 0;

    /* This creates a 10ms/30ms reservation */
    attr.sched_policy = SCHED_DEADLINE;
    attr.sched_runtime = 10 * 1000 * 1000;
    attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;

    ret = sched_setattr(0, &attr, flags);
    if (ret < 0) {
        step = 0;
        perror("sched_setattr");
        exit(-1);
    }

    // Wait until N2 locked M2
    while (step < 2) {
        x++;
    }

    printf("D1 is locking M2\n");
    // This will boost N2
    pthread_mutex_lock(&m2);
    printf("D1 locked M2\n");

    sleep(10);
    // Won't reach here because of the bug.

    printf("D1 is unlocking M2\n");
    pthread_mutex_unlock(&m2);
    printf("D1 unlocked M2\n");

    printf("deadline thread dies [%ld]\n", gettid());
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t thread[3];

    printf("main thread [%ld]\n", gettid());

    int rtn;
    pthread_mutexattr_t mutexattr;
    if ((rtn = pthread_mutexattr_init(&mutexattr) != 0)) {
        fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn));
        exit(1);
    }
    if ((rtn = pthread_mutexattr_setprotocol(&mutexattr,
                                             PTHREAD_PRIO_INHERIT)) != 0) {
        fprintf(stderr, "pthread_mutexattr_setprotocol: %s", strerror(rtn));
        exit(1);
    }

    if ((rtn = pthread_mutex_init(&m1, &mutexattr)) != 0) {
        fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn));
        exit(1);
    }

    if ((rtn = pthread_mutex_init(&m2, &mutexattr)) != 0) {
        fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn));
        exit(1);
    }

    // use this volatile variable to coordinate execution order between threads
    step = 0;
    pthread_create(thread, NULL, run_normal_1, NULL);
    pthread_create(thread + 1, NULL, run_normal_2, NULL);
    pthread_create(thread + 2, NULL, run_deadline, NULL);

    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    pthread_join(thread[2], NULL);

    printf("main dies [%ld]\n", gettid());
    return 0;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread