* WARNING in timer_wait_running
@ 2020-08-31 17:07 syzbot
  2023-04-05 21:07 ` Marco Elver
       [not found] ` <20230407134620.1034-1-hdanton@sina.com>
  0 siblings, 2 replies; 20+ messages in thread
From: syzbot @ 2020-08-31 17:07 UTC (permalink / raw)
  To: baolu.lu, jroedel, jun.miao, koba.ko, linux-kernel, syzkaller-bugs, tglx

Hello,

syzbot found the following issue on:

HEAD commit:    1127b219 Merge tag 'fallthrough-fixes-5.9-rc3' of git://gi..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1501768e900000
kernel config:  https://syzkaller.appspot.com/x/.config?x=978db74cb30aa994
dashboard link: https://syzkaller.appspot.com/bug?extid=3b14b2ed9b3d06dcaa07
compiler:       gcc (GCC) 10.1.0-syz 20200507
userspace arch: i386
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1446598e900000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11ac97d5900000

The issue was bisected to:

commit b1012ca8dc4f9b1a1fe8e2cb1590dd6d43ea3849
Author: Lu Baolu <baolu.lu@linux.intel.com>
Date:   Thu Jul 23 01:34:37 2020 +0000

    iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=177a7499900000
final oops:     https://syzkaller.appspot.com/x/report.txt?x=14fa7499900000
console output: https://syzkaller.appspot.com/x/log.txt?x=10fa7499900000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3b14b2ed9b3d06dcaa07@syzkaller.appspotmail.com
Fixes: b1012ca8dc4f ("iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu")

------------[ cut here ]------------
WARNING: CPU: 0 PID: 10581 at kernel/time/posix-timers.c:849 timer_wait_running+0x218/0x250 kernel/time/posix-timers.c:849
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 10581 Comm: syz-executor958 Not tainted 5.9.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:231
 __warn.cold+0x20/0x4a kernel/panic.c:600
 report_bug+0x1bd/0x210 lib/bug.c:198
 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:timer_wait_running+0x218/0x250 kernel/time/posix-timers.c:849
Code: 00 48 c7 c2 a0 97 4d 88 be 7b 02 00 00 48 c7 c7 00 98 4d 88 c6 05 8b eb 46 09 01 e8 87 85 f4 ff e9 b1 fe ff ff e8 f8 04 0e 00 <0f> 0b e9 08 ff ff ff e8 6c 1d 4e 00 e9 3b fe ff ff 4c 89 e7 e8 6f
RSP: 0018:ffffc9000ab87d40 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8880a0b46100 RSI: ffffffff81663a18 RDI: ffffffff884da458
RBP: ffff888093acbb48 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000ab87da0
R13: 0000000000000000 R14: ffffffff884da3e0 R15: ffffc9000ab87da0
 do_timer_settime.part.0+0x119/0x1d0 kernel/time/posix-timers.c:929
 do_timer_settime kernel/time/posix-timers.c:961 [inline]
 __do_sys_timer_settime32 kernel/time/posix-timers.c:974 [inline]
 __se_sys_timer_settime32 kernel/time/posix-timers.c:961 [inline]
 __ia32_sys_timer_settime32+0x219/0x310 kernel/time/posix-timers.c:961
 do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline]
 __do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126
 do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149
 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
RIP: 0023:0xf7fa7549
Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000f7f8114c EFLAGS: 00000296 ORIG_RAX: 0000000000000104
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000020000300 RSI: 0000000000000000 RDI: 0000000000000002
RBP: 0000000020000180 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: WARNING in timer_wait_running
  2020-08-31 17:07 WARNING in timer_wait_running syzbot
@ 2023-04-05 21:07 ` Marco Elver
  2023-04-05 22:19   ` Frederic Weisbecker
       [not found] ` <20230407134620.1034-1-hdanton@sina.com>
  1 sibling, 1 reply; 20+ messages in thread
From: Marco Elver @ 2023-04-05 21:07 UTC (permalink / raw)
  To: syzbot
  Cc: linux-kernel, syzkaller-bugs, Thomas Gleixner,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Frederic Weisbecker, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 4100 bytes --]

[-Bcc wrongly added folks]
[+Cc kernel/time maintainers]

On Mon, Aug 31, 2020 at 10:07AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    1127b219 Merge tag 'fallthrough-fixes-5.9-rc3' of git://gi..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1501768e900000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=978db74cb30aa994
> dashboard link: https://syzkaller.appspot.com/bug?extid=3b14b2ed9b3d06dcaa07

Dashboard has recent reports (also below) and reproducer (also attached).

> compiler:       gcc (GCC) 10.1.0-syz 20200507
> userspace arch: i386

Not just i386.

> The issue was bisected to:
> 
> commit b1012ca8dc4f9b1a1fe8e2cb1590dd6d43ea3849
> Author: Lu Baolu <baolu.lu@linux.intel.com>
> Date:   Thu Jul 23 01:34:37 2020 +0000
> 
>     iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu

This bisection is wrong.

> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3b14b2ed9b3d06dcaa07@syzkaller.appspotmail.com

[...]
> WARNING: CPU: 0 PID: 10581 at kernel/time/posix-timers.c:849 timer_wait_running+0x218/0x250 kernel/time/posix-timers.c:849
[ .. snip ancient warning ..]

Up-to-date warning:

 | WARNING: CPU: 1 PID: 6695 at kernel/time/posix-timers.c:849 timer_wait_running+0x255/0x290 kernel/time/posix-timers.c:849
 | Modules linked in:
 | CPU: 1 PID: 6695 Comm: syz-executor.3 Not tainted 6.3.0-rc3-syzkaller-00338-gda8e7da11e4b #0
 | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
 | RIP: 0010:timer_wait_running+0x255/0x290 kernel/time/posix-timers.c:849
 | Code: 00 48 c7 c2 80 fe 4e 8a be 06 03 00 00 48 c7 c7 e0 fe 4e 8a c6 05 63 cb ed 0c 01 e8 85 77 ef ff e9 5b fe ff ff e8 2b 7d 0e 00 <0f> 0b e9 b2 fe ff ff e8 0f 8a 5f 00 e9 fe fd ff ff e8 25 8a 5f 00
 | RSP: 0018:ffffc90003ecfd50 EFLAGS: 00010293
 | RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 | RDX: ffff888020e4ba80 RSI: ffffffff81746785 RDI: ffffffff8a4f0ad8
 | RBP: ffff88807c696d38 R08: 0000000000000001 R09: 0000000000000001
 | R10: fffffbfff1cef98a R11: 0000000000000021 R12: ffffc90003ecfdb0
 | R13: 0000000000000000 R14: ffffffff8a4f0a60 R15: ffffc90003ecfdb0
 | FS:  00007fae387fe700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
 | CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 | CR2: 00007f0d105821b8 CR3: 000000002a283000 CR4: 00000000003526e0
 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 | Call Trace:
 |  <TASK>
 |  do_timer_settime.part.0+0x119/0x1d0 kernel/time/posix-timers.c:929
 |  do_timer_settime kernel/time/posix-timers.c:938 [inline]
 |  __do_sys_timer_settime kernel/time/posix-timers.c:952 [inline]
 |  __se_sys_timer_settime kernel/time/posix-timers.c:938 [inline]
 |  __x64_sys_timer_settime+0x21d/0x310 kernel/time/posix-timers.c:938
 |  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 |  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
 |  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 | RIP: 0033:0x7fae3948c0f9
 | Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
 | RSP: 002b:00007fae387fe168 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
 | RAX: ffffffffffffffda RBX: 00007fae395ac050 RCX: 00007fae3948c0f9
 | RDX: 0000000020000080 RSI: 0000000000000000 RDI: 0000000000000000
 | RBP: 00007fae394e7b39 R08: 0000000000000000 R09: 0000000000000000
 | R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 | R13: 00007fae396cfb1f R14: 00007fae387fe300 R15: 0000000000022000
 |  </TASK>

I've seen this warning in the wild recently, and it's been around since
2020 according to syzbot.

The warning was added in ec8f954a40da8 ("posix-timers: Use a callback
for cancel synchronization on PREEMPT_RT").

Why is it wrong for timer_wait_running to be NULL?

[ attached C reproducer ]

Thanks,
-- Marco

[-- Attachment #2: timer-warn.c --]
[-- Type: text/x-csrc, Size: 6573 bytes --]

// https://syzkaller.appspot.com/bug?id=e2edd7e2c1186400e65f86f07324a8e062308159
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/futex.h>
#include <pthread.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void sleep_ms(uint64_t ms) { usleep(ms * 1000); }

static uint64_t current_time_ms(void) {
  struct timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts)) exit(1);
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void thread_start(void* (*fn)(void*), void* arg) {
  pthread_t th;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 128 << 10);
  int i = 0;
  for (; i < 100; i++) {
    if (pthread_create(&th, &attr, fn, arg) == 0) {
      pthread_attr_destroy(&attr);
      return;
    }
    if (errno == EAGAIN) {
      usleep(50);
      continue;
    }
    break;
  }
  exit(1);
}

typedef struct {
  int state;
} event_t;

static void event_init(event_t* ev) { ev->state = 0; }

static void event_reset(event_t* ev) { ev->state = 0; }

static void event_set(event_t* ev) {
  if (ev->state) exit(1);
  __atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
  syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1000000);
}

static void event_wait(event_t* ev) {
  while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
}

static int event_isset(event_t* ev) {
  return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
}

static int event_timedwait(event_t* ev, uint64_t timeout) {
  uint64_t start = current_time_ms();
  uint64_t now = start;
  for (;;) {
    uint64_t remain = timeout - (now - start);
    struct timespec ts;
    ts.tv_sec = remain / 1000;
    ts.tv_nsec = (remain % 1000) * 1000 * 1000;
    syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
    if (__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE)) return 1;
    now = current_time_ms();
    if (now - start > timeout) return 0;
  }
}

static bool write_file(const char* file, const char* what, ...) {
  char buf[1024];
  va_list args;
  va_start(args, what);
  vsnprintf(buf, sizeof(buf), what, args);
  va_end(args);
  buf[sizeof(buf) - 1] = 0;
  int len = strlen(buf);
  int fd = open(file, O_WRONLY | O_CLOEXEC);
  if (fd == -1) return false;
  if (write(fd, buf, len) != len) {
    int err = errno;
    close(fd);
    errno = err;
    return false;
  }
  close(fd);
  return true;
}

static void kill_and_wait(int pid, int* status) {
  kill(-pid, SIGKILL);
  kill(pid, SIGKILL);
  for (int i = 0; i < 100; i++) {
    if (waitpid(-1, status, WNOHANG | __WALL) == pid) return;
    usleep(1000);
  }
  DIR* dir = opendir("/sys/fs/fuse/connections");
  if (dir) {
    for (;;) {
      struct dirent* ent = readdir(dir);
      if (!ent) break;
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      char abort[300];
      snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
               ent->d_name);
      int fd = open(abort, O_WRONLY);
      if (fd == -1) {
        continue;
      }
      if (write(fd, abort, 1) < 0) {
      }
      close(fd);
    }
    closedir(dir);
  } else {
  }
  while (waitpid(-1, status, __WALL) != pid) {
  }
}

static void setup_test() {
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  write_file("/proc/self/oom_score_adj", "1000");
}

struct thread_t {
  int created, call;
  event_t ready, done;
};

static struct thread_t threads[16];
static void execute_call(int call);
static int running;

static void* thr(void* arg) {
  struct thread_t* th = (struct thread_t*)arg;
  for (;;) {
    event_wait(&th->ready);
    event_reset(&th->ready);
    execute_call(th->call);
    __atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
    event_set(&th->done);
  }
  return 0;
}

static void execute_one(void) {
  int i, call, thread;
  for (call = 0; call < 4; call++) {
    for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
         thread++) {
      struct thread_t* th = &threads[thread];
      if (!th->created) {
        th->created = 1;
        event_init(&th->ready);
        event_init(&th->done);
        event_set(&th->done);
        thread_start(thr, th);
      }
      if (!event_isset(&th->done)) continue;
      event_reset(&th->done);
      th->call = call;
      __atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
      event_set(&th->ready);
      event_timedwait(&th->done, 50);
      break;
    }
  }
  for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
    sleep_ms(1);
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void) {
  int iter = 0;
  for (;; iter++) {
    int pid = fork();
    if (pid < 0) exit(1);
    if (pid == 0) {
      setup_test();
      execute_one();
      exit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid) break;
      sleep_ms(1);
      if (current_time_ms() - start < 5000) continue;
      kill_and_wait(pid, &status);
      break;
    }
  }
}

uint64_t r[1] = {0x0};

void execute_call(int call) {
  intptr_t res = 0;
  switch (call) {
    case 0:
      res = syscall(__NR_timer_create, 2ul, 0ul, 0x20002180ul);
      if (res != -1) r[0] = *(uint32_t*)0x20002180;
      break;
    case 1:
      *(uint64_t*)0x2006b000 = 0;
      *(uint64_t*)0x2006b008 = 8;
      *(uint64_t*)0x2006b010 = 0;
      *(uint64_t*)0x2006b018 = 9;
      syscall(__NR_timer_settime, 0, 0ul, 0x2006b000ul, 0ul);
      break;
    case 2:
      syscall(__NR_mmap, 0x20000000ul, 0xb36000ul, 0xb635773f05ebbeeful,
              0x8031ul, -1, 0ul);
      break;
    case 3:
      *(uint64_t*)0x20002100 = 0x77359400;
      *(uint64_t*)0x20002108 = 0;
      *(uint64_t*)0x20002110 = 0;
      *(uint64_t*)0x20002118 = 0;
      syscall(__NR_timer_settime, r[0], 1ul, 0x20002100ul, 0ul);
      break;
  }
}
int main(void) {
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  loop();
  return 0;
}


* Re: WARNING in timer_wait_running
  2023-04-05 21:07 ` Marco Elver
@ 2023-04-05 22:19   ` Frederic Weisbecker
  2023-04-06 19:37     ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-05 22:19 UTC (permalink / raw)
  To: Marco Elver
  Cc: syzbot, linux-kernel, syzkaller-bugs, Thomas Gleixner,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Wed, Apr 05, 2023 at 11:07:24PM +0200, Marco Elver wrote:
> Up-to-date warning:
> 
>  | WARNING: CPU: 1 PID: 6695 at kernel/time/posix-timers.c:849 timer_wait_running+0x255/0x290 kernel/time/posix-timers.c:849
>  | Modules linked in:
>  | CPU: 1 PID: 6695 Comm: syz-executor.3 Not tainted 6.3.0-rc3-syzkaller-00338-gda8e7da11e4b #0
>  | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
>  | RIP: 0010:timer_wait_running+0x255/0x290 kernel/time/posix-timers.c:849
>  | Code: 00 48 c7 c2 80 fe 4e 8a be 06 03 00 00 48 c7 c7 e0 fe 4e 8a c6 05 63 cb ed 0c 01 e8 85 77 ef ff e9 5b fe ff ff e8 2b 7d 0e 00 <0f> 0b e9 b2 fe ff ff e8 0f 8a 5f 00 e9 fe fd ff ff e8 25 8a 5f 00
>  | RSP: 0018:ffffc90003ecfd50 EFLAGS: 00010293
>  | RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>  | RDX: ffff888020e4ba80 RSI: ffffffff81746785 RDI: ffffffff8a4f0ad8
>  | RBP: ffff88807c696d38 R08: 0000000000000001 R09: 0000000000000001
>  | R10: fffffbfff1cef98a R11: 0000000000000021 R12: ffffc90003ecfdb0
>  | R13: 0000000000000000 R14: ffffffff8a4f0a60 R15: ffffc90003ecfdb0
>  | FS:  00007fae387fe700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
>  | CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  | CR2: 00007f0d105821b8 CR3: 000000002a283000 CR4: 00000000003526e0
>  | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  | Call Trace:
>  |  <TASK>
>  |  do_timer_settime.part.0+0x119/0x1d0 kernel/time/posix-timers.c:929
>  |  do_timer_settime kernel/time/posix-timers.c:938 [inline]
>  |  __do_sys_timer_settime kernel/time/posix-timers.c:952 [inline]
>  |  __se_sys_timer_settime kernel/time/posix-timers.c:938 [inline]
>  |  __x64_sys_timer_settime+0x21d/0x310 kernel/time/posix-timers.c:938
>  |  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  |  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  |  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>  | RIP: 0033:0x7fae3948c0f9
>  | Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
>  | RSP: 002b:00007fae387fe168 EFLAGS: 00000246 ORIG_RAX: 00000000000000df
>  | RAX: ffffffffffffffda RBX: 00007fae395ac050 RCX: 00007fae3948c0f9
>  | RDX: 0000000020000080 RSI: 0000000000000000 RDI: 0000000000000000
>  | RBP: 00007fae394e7b39 R08: 0000000000000000 R09: 0000000000000000
>  | R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>  | R13: 00007fae396cfb1f R14: 00007fae387fe300 R15: 0000000000022000
>  |  </TASK>
> 
> I've seen this warning in the wild recently, and it's been around since
> 2020 according to syzbot.
> 
> The warning was added in ec8f954a40da8 ("posix-timers: Use a callback
> for cancel synchronization on PREEMPT_RT").
> 
> Why is it wrong for timer_wait_running to be NULL?

It appears to concern clock_posix_cpu which indeed doesn't implement
->timer_wait_running even though posix_cpu_timer_set() might return
TIMER_RETRY if the timer is about to fire.

Then on RT, if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, timer_wait_running()
must busy-loop waiting for the task to complete the task work.

We could arrange for doing the same thing as hrtimer_cancel_wait_running()
but for posix cpu timers, with taking a similar lock within
handle_posix_cpu_timers() that timer_wait_running() could sleep on and
inject its PI into.

Thomas, Anna-Maria, would that make sense?

Thanks.


* Re: WARNING in timer_wait_running
  2023-04-05 22:19   ` Frederic Weisbecker
@ 2023-04-06 19:37     ` Thomas Gleixner
  2023-04-07  8:44       ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-06 19:37 UTC (permalink / raw)
  To: Frederic Weisbecker, Marco Elver
  Cc: syzbot, linux-kernel, syzkaller-bugs, Anna-Maria Behnsen,
	Jacob Keller, Paul E. McKenney, Peter Zijlstra

On Thu, Apr 06 2023 at 00:19, Frederic Weisbecker wrote:
> On Wed, Apr 05, 2023 at 11:07:24PM +0200, Marco Elver wrote:
>> Up-to-date warning:
>> 
>>  | WARNING: CPU: 1 PID: 6695 at kernel/time/posix-timers.c:849 timer_wait_running+0x255/0x290 kernel/time/posix-timers.c:849
>> 
>> Why is it wrong for timer_wait_running to be NULL?
>
> It appears to concern clock_posix_cpu which indeed doesn't implement
> ->timer_wait_running even though posix_cpu_timer_set() might return
> TIMER_RETRY if the timer is about to fire.
>
> Then on RT, if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, timer_wait_running()
> must busy-loop waiting for the task to complete the task work.
>
> We could arrange for doing the same thing as hrtimer_cancel_wait_running()
> but for posix cpu timers, with taking a similar lock within
> handle_posix_cpu_timers() that timer_wait_running() could sleep on and
> inject its PI into.

I have a faint memory that we discussed something like that, but there
was an issue which completely escaped my memory.

But yes, something like this could work.

Though we should quickly shut this warning up for the !RT case by
providing a callback which does

  WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT));

and let the RT folks deal with it.

Thanks,

        tglx


* Re: WARNING in timer_wait_running
  2023-04-06 19:37     ` Thomas Gleixner
@ 2023-04-07  8:44       ` Thomas Gleixner
  2023-04-07 11:50         ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-07  8:44 UTC (permalink / raw)
  To: Frederic Weisbecker, Marco Elver
  Cc: syzbot, linux-kernel, syzkaller-bugs, Anna-Maria Behnsen,
	Jacob Keller, Paul E. McKenney, Peter Zijlstra

On Thu, Apr 06 2023 at 21:37, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 00:19, Frederic Weisbecker wrote:
>> We could arrange for doing the same thing as hrtimer_cancel_wait_running()
>> but for posix cpu timers, with taking a similar lock within
>> handle_posix_cpu_timers() that timer_wait_running() could sleep on and
>> inject its PI into.
>
> I have a faint memory that we discussed something like that, but there
> was an issue which completely escaped my memory.

Now memory came back. The problem with posix CPU timers is that it is
not really known to the other side which task is actually doing the
expiry. For process wide timers this could be any task in the process.

For hrtimers this works because the expiring context is known.

> But yes, something like this could work.

Needs some more thought, but it can still be made to work.

> Though we should quickly shut this warning up for the !RT case by
> providing a callback which does
>
>   WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT));
>
> and let the RT folks deal with it.

OTOH, this is not only an RT issue.

On preemptible kernels the task which collected the expired timers onto
a local list and set the firing bit, can be preempted after dropping
sighand lock. So the other side still can busy wait for quite a while.
Same is obviously true for guests independent of preemption when the
vCPU gets scheduled out.

Thanks,

        tglx



* Re: WARNING in timer_wait_running
  2023-04-07  8:44       ` Thomas Gleixner
@ 2023-04-07 11:50         ` Frederic Weisbecker
  2023-04-07 17:47           ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-07 11:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 21:37, Thomas Gleixner wrote:
> > On Thu, Apr 06 2023 at 00:19, Frederic Weisbecker wrote:
> >> We could arrange for doing the same thing as hrtimer_cancel_wait_running()
> >> but for posix cpu timers, with taking a similar lock within
> >> handle_posix_cpu_timers() that timer_wait_running() could sleep on and
> >> inject its PI into.
> >
> > I have a faint memory that we discussed something like that, but there
> > was an issue which completely escaped my memory.
> 
> Now memory came back. The problem with posix CPU timers is that it is
> not really known to the other side which task is actually doing the
> expiry. For process wide timers this could be any task in the process.
> 
> For hrtimers this works because the expiring context is known.

So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
delay put_pid() with RCU, we could retrieve that information without
holding the timer lock (with appropriate RCU accesses all around).

> 
> > Though we should quickly shut this warning up for the !RT case by
> > providing a callback which does
> >
> >   WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT));
> >
> > and let the RT folks deal with it.
> 
> OTOH, this is not only an RT issue.
> 
> On preemptible kernels the task which collected the expired timers onto
> a local list and set the firing bit, can be preempted after dropping
> sighand lock. So the other side still can busy wait for quite a while.
> Same is obviously true for guests independent of preemption when the
> vCPU gets scheduled out.

Ok, fortunately task work is a sleepable context so using a mutex would
work for everyone, at the cost of a new mutex in task_struct though.
Lemme try something.

> 
> Thanks,
> 
>         tglx
> 


* Re: WARNING in timer_wait_running
       [not found] ` <20230407134620.1034-1-hdanton@sina.com>
@ 2023-04-07 14:49   ` Frederic Weisbecker
  0 siblings, 0 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-07 14:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs, Thomas Gleixner

On Fri, Apr 07, 2023 at 09:46:20PM +0800, Hillf Danton wrote:
> On 5 Apr 2023 23:07:24 +0200 Marco Elver <elver@google.com>
> > On Mon, Aug 31, 2020 at 10:07AM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:    1127b219 Merge tag 'fallthrough-fixes-5.9-rc3' of git://gi..
> > > git tree:       upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1501768e900000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=978db74cb30aa994
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=3b14b2ed9b3d06dcaa07
> > 
> > Dashboard has recent reports (also below) and reproducer (also attached).
> 
> My 2c, with PREEMPT_RT enabled it simply waits by taking timer->it_lock.
> 
> --- upstream/kernel/time/posix-cpu-timers.c
> +++ y/kernel/time/posix-cpu-timers.c
> @@ -1613,6 +1613,21 @@ static int thread_cpu_timer_create(struc
>  	return posix_cpu_timer_create(timer);
>  }
>  
> +static void posix_cpu_timer_wait_running(struct k_itimer *timer)
> +{
> +#ifdef CONFIG_PREEMPT_RT
> +	int stop = 0;
> +
> +	while (!stop) {
> +		spin_lock(&timer->it_lock);
> +		stop = timer->it.cpu.firing == 0;
> +		spin_unlock(&timer->it_lock);
> +	}

No, because there is a whole lot of preemptible area with timer->it_lock
not held between the time ctmr->firing is set to 1, and the actual handling
of that timer that holds the lock.

So no priority inheritance in that case.

There has to be a lock between the time it is set to 1 and the handling
of the timer.


* Re: WARNING in timer_wait_running
  2023-04-07 11:50         ` Frederic Weisbecker
@ 2023-04-07 17:47           ` Thomas Gleixner
  2023-04-07 18:36             ` Frederic Weisbecker
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-07 17:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, Apr 07 2023 at 13:50, Frederic Weisbecker wrote:
> On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
>> Now memory came back. The problem with posix CPU timers is that it is
>> not really known to the other side which task is actually doing the
>> expiry. For process wide timers this could be any task in the process.
>> 
>> For hrtimers this works because the expiring context is known.
>
> So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
> delay put_pid() with RCU, we could retrieve that information without
> holding the timer lock (with appropriate RCU accesses all around).

No, you can't. This only gives you the process, but the expiry might run
on any task of that process. To make that work you need a mutex in sighand.

Thanks,

        tglx


* Re: WARNING in timer_wait_running
  2023-04-07 17:47           ` Thomas Gleixner
@ 2023-04-07 18:36             ` Frederic Weisbecker
  2023-04-07 19:27               ` Thomas Gleixner
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-07 18:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, Apr 07, 2023 at 07:47:40PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 07 2023 at 13:50, Frederic Weisbecker wrote:
> > On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
> >> Now memory came back. The problem with posix CPU timers is that it is
> >> not really known to the other side which task is actually doing the
> >> expiry. For process wide timers this could be any task in the process.
> >> 
> >> For hrtimers this works because the expiring context is known.
> >
> > So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
> > delay put_pid() with RCU, we could retrieve that information without
> > holding the timer lock (with appropriate RCU accesses all around).
> 
> No, you can't. This only gives you the process, but the expiry might run
> on any task of that. To make that work you need a mutex in sighand.

Duh right missed that. Ok will try.


* Re: WARNING in timer_wait_running
  2023-04-07 18:36             ` Frederic Weisbecker
@ 2023-04-07 19:27               ` Thomas Gleixner
  2023-04-07 21:00                 ` Frederic Weisbecker
  2023-04-11 14:31                 ` Marco Elver
  0 siblings, 2 replies; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-07 19:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, Apr 07 2023 at 20:36, Frederic Weisbecker wrote:

> On Fri, Apr 07, 2023 at 07:47:40PM +0200, Thomas Gleixner wrote:
>> On Fri, Apr 07 2023 at 13:50, Frederic Weisbecker wrote:
>> > On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
>> >> Now memory came back. The problem with posix CPU timers is that it is
>> >> not really known to the other side which task is actually doing the
>> >> expiry. For process wide timers this could be any task in the process.
>> >> 
>> >> For hrtimers this works because the expiring context is known.
>> >
>> > So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
>> > delay put_pid() with RCU, we could retrieve that information without
>> > holding the timer lock (with appropriate RCU accesses all around).
>> 
>> No, you can't. This only gives you the process, but the expiry might run
>> on any task of that. To make that work you need a mutex in sighand.
>
> Duh right missed that. Ok will try.

But that's nasty as well, as this can race against exec/exit, and you can't
hold sighand lock while acquiring the mutex.

The mutex needs to be per task and held by the task which runs the
expiry task work.

Something like the completely untested below. You get the idea though.

Thanks,

        tglx
---
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/alarmtimer.h>
 #include <linux/timerqueue.h>
 
@@ -62,9 +63,10 @@ static inline int clockid_to_fd(const cl
  * cpu_timer - Posix CPU timer representation for k_itimer
  * @node:	timerqueue node to queue in the task/sig
  * @head:	timerqueue head on which this timer is queued
- * @task:	Pointer to target task
+ * @pid:	Pointer to target task PID
  * @elist:	List head for the expiry list
  * @firing:	Timer is currently firing
+ * @handling:	Pointer to the task which handles expiry
  */
 struct cpu_timer {
 	struct timerqueue_node	node;
@@ -72,6 +74,7 @@ struct cpu_timer {
 	struct pid		*pid;
 	struct list_head	elist;
 	int			firing;
+	struct task_struct	*handling;
 };
 
 static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
@@ -135,10 +138,12 @@ struct posix_cputimers {
 /**
  * posix_cputimers_work - Container for task work based posix CPU timer expiry
  * @work:	The task work to be scheduled
+ * @mutex:	Mutex held around expiry in context of this task work
  * @scheduled:  @work has been scheduled already, no further processing
  */
 struct posix_cputimers_work {
 	struct callback_head	work;
+	struct mutex		mutex;
 	unsigned int		scheduled;
 };
 
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct tim
 			return expires;
 
 		ctmr->firing = 1;
+		/* See posix_cpu_timer_wait_running() */
+		WRITE_ONCE(ctmr->handling, current);
 		cpu_timer_dequeue(ctmr);
 		list_add_tail(&ctmr->elist, firing);
 	}
@@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(stru
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
 static void posix_cpu_timers_work(struct callback_head *work)
 {
+	struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
+
+	mutex_lock(&cw->mutex);
 	handle_posix_cpu_timers(current);
+	mutex_unlock(&cw->mutex);
+}
+
+/*
+ * Invoked from the posix-timer core when a cancel operation failed because
+ * the timer is marked firing. The caller holds rcu_read_lock(), which
+ * protects the timer and the task which is expiring it from being freed.
+ */
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	struct task_struct *tsk = READ_ONCE(timr->it.cpu.handling);
+
+	/* Has the handling task completed expiry already? */
+	if (!tsk)
+		return;
+
+	/* Ensure that the task cannot go away */
+	get_task_struct(tsk);
+	/* Now drop the RCU protection so the mutex can be locked */
+	rcu_read_unlock();
+	/* Wait on the expiry mutex */
+	mutex_lock(&tsk->posix_cputimers_work.mutex);
+	/* Release it immediately again. */
+	mutex_unlock(&tsk->posix_cputimers_work.mutex);
+	/* Drop the task reference. */
+	put_task_struct(tsk);
+	/* Relock RCU so the callsite is balanced */
+	rcu_read_lock();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	/* Ensure that timr->it.cpu.handling task cannot go away */
+	rcu_read_lock();
+	spin_unlock_irq(&timr->it_lock);
+	posix_cpu_timer_wait_running(timr);
+	rcu_read_unlock();
+	/* @timr is on stack and is valid */
+	spin_lock_irq(&timr->it_lock);
 }
 
 /*
@@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct t
 	       sizeof(p->posix_cputimers_work.work));
 	init_task_work(&p->posix_cputimers_work.work,
 		       posix_cpu_timers_work);
+	mutex_init(&p->posix_cputimers_work.mutex);
 	p->posix_cputimers_work.scheduled = false;
 }
 
@@ -1255,6 +1300,15 @@ static inline void __run_posix_cpu_timer
 	lockdep_posixtimer_exit();
 }
 
+static inline void posix_cpu_timer_wait_running(struct k_itimer *timr) { }
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+	cpu_relax();
+	spin_lock_irq(&timr->it_lock);
+}
+
 static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
 {
 	return false;
@@ -1354,6 +1408,8 @@ static void handle_posix_cpu_timers(stru
 		 */
 		spin_lock(&timer->it_lock);
 		list_del_init(&timer->it.cpu.elist);
+		/* See posix_cpu_timer_wait_running() */
+		WRITE_ONCE(timer->it.cpu.handling, NULL);
 		cpu_firing = timer->it.cpu.firing;
 		timer->it.cpu.firing = 0;
 		/*
@@ -1497,23 +1553,16 @@ static int do_cpu_nanosleep(const clocki
 		expires = cpu_timer_getexpires(&timer.it.cpu);
 		error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
 		if (!error) {
-			/*
-			 * Timer is now unarmed, deletion can not fail.
-			 */
+			/* Timer is now unarmed, deletion can not fail. */
 			posix_cpu_timer_del(&timer);
+		} else {
+			while (error == TIMER_RETRY) {
+				posix_cpu_timer_wait_running_nsleep(&timer);
+				error = posix_cpu_timer_del(&timer);
+			}
 		}
-		spin_unlock_irq(&timer.it_lock);
 
-		while (error == TIMER_RETRY) {
-			/*
-			 * We need to handle case when timer was or is in the
-			 * middle of firing. In other cases we already freed
-			 * resources.
-			 */
-			spin_lock_irq(&timer.it_lock);
-			error = posix_cpu_timer_del(&timer);
-			spin_unlock_irq(&timer.it_lock);
-		}
+		spin_unlock_irq(&timer.it_lock);
 
 		if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
 			/*
@@ -1623,6 +1672,7 @@ const struct k_clock clock_posix_cpu = {
 	.timer_del		= posix_cpu_timer_del,
 	.timer_get		= posix_cpu_timer_get,
 	.timer_rearm		= posix_cpu_timer_rearm,
+	.timer_wait_running	= posix_cpu_timer_wait_running,
 };
 
 const struct k_clock clock_process = {
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_runni
 	rcu_read_lock();
 	unlock_timer(timer, *flags);
 
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
 	if (!WARN_ON_ONCE(!kc->timer_wait_running))
 		kc->timer_wait_running(timer);
 


* Re: WARNING in timer_wait_running
  2023-04-07 19:27               ` Thomas Gleixner
@ 2023-04-07 21:00                 ` Frederic Weisbecker
  2023-04-11 14:31                 ` Marco Elver
  1 sibling, 0 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-07 21:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, Apr 07, 2023 at 09:27:40PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 07 2023 at 20:36, Frederic Weisbecker wrote:
> 
> > On Fri, Apr 07, 2023 at 07:47:40PM +0200, Thomas Gleixner wrote:
> >> On Fri, Apr 07 2023 at 13:50, Frederic Weisbecker wrote:
> >> > On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
> >> >> Now memory came back. The problem with posix CPU timers is that it is
> >> >> not really known to the other side which task is actually doing the
> >> >> expiry. For process wide timers this could be any task in the process.
> >> >> 
> >> >> For hrtimers this works because the expiring context is known.
> >> >
> >> > So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
> >> > delay put_pid() with RCU, we could retrieve that information without
> >> > holding the timer lock (with appropriate RCU accesses all around).
> >> 
> >> No, you can't. This only gives you the process, but the expiry might run
> >> on any task of that. To make that work you need a mutex in sighand.
> >
> > Duh right missed that. Ok will try.
> 
> But that's nasty as well, as this can race against exec/exit, and you can't
> hold sighand lock while acquiring the mutex.
> 
> The mutex needs to be per task and held by the task which runs the
> expiry task work.
> 
> Something like the completely untested below. You get the idea though.

For something untested, it looks quite right!

Thanks.


* Re: WARNING in timer_wait_running
  2023-04-07 19:27               ` Thomas Gleixner
  2023-04-07 21:00                 ` Frederic Weisbecker
@ 2023-04-11 14:31                 ` Marco Elver
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
  1 sibling, 1 reply; 20+ messages in thread
From: Marco Elver @ 2023-04-11 14:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra

On Fri, 7 Apr 2023 at 21:27, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Apr 07 2023 at 20:36, Frederic Weisbecker wrote:
>
> > On Fri, Apr 07, 2023 at 07:47:40PM +0200, Thomas Gleixner wrote:
> >> On Fri, Apr 07 2023 at 13:50, Frederic Weisbecker wrote:
> >> > On Fri, Apr 07, 2023 at 10:44:22AM +0200, Thomas Gleixner wrote:
> >> >> Now memory came back. The problem with posix CPU timers is that it is
> >> >> not really known to the other side which task is actually doing the
> >> >> expiry. For process wide timers this could be any task in the process.
> >> >>
> >> >> For hrtimers this works because the expiring context is known.
> >> >
> >> > So if posix_cpu_timer_del() were to clear ctmr->pid to NULL and then
> >> > delay put_pid() with RCU, we could retrieve that information without
> >> > holding the timer lock (with appropriate RCU accesses all around).
> >>
> >> No, you can't. This only gives you the process, but the expiry might run
> >> on any task of that. To make that work you need a mutex in sighand.
> >
> > Duh right missed that. Ok will try.
>
> But that's nasty as well, as this can race against exec/exit, and you can't
> hold sighand lock while acquiring the mutex.
>
> The mutex needs to be per task and held by the task which runs the
> expiry task work.
>
> Something like the completely untested below. You get the idea though.

I threw the existing reproducer at this, and it seems happy enough -
no warnings nor lockdep splats either.

Thanks,
-- Marco


* [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-11 14:31                 ` Marco Elver
@ 2023-04-17 13:37                   ` Thomas Gleixner
  2023-04-17 15:34                     ` Sebastian Andrzej Siewior
                                       ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-17 13:37 UTC (permalink / raw)
  To: Marco Elver
  Cc: Frederic Weisbecker, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra, Sebastian Andrzej Siewior

For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco reported recently that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.

Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:

1) If not enabled, the expiry happens in hard interrupt context so
   spin waiting on the remote CPU is reasonably time bound.

   Implement an empty stub function for that case.

2) If enabled, the expiry happens in task work before returning to user
   space or guest mode. The expired timers are marked as firing and moved
   from the timer queue to a local list head with sighand lock held. Once
   the timers are moved, sighand lock is dropped and the expiry happens in
   fully preemptible context. That means the expiring task can be scheduled
   out, migrated, interrupted etc. So spin waiting on it is more than
   suboptimal.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code, and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for posix CPU timers as there is no
   timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry lock
   can be used too in a slightly different way:

    - Add a mutex to struct posix_cputimers_work. This struct is per task
      and used to schedule the expiry task work from the timer interrupt.

    - Add a task_struct pointer to struct cpu_timer which is used to store
      the task which runs the expiry. That's filled in when the task
      moves the expired timers to the local expiry list. That's not
      affecting the size of the k_itimer union as there are bigger union
      members already.

    - Let the task take the expiry mutex around the expiry function

    - Let the waiter acquire a task reference with rcu_read_lock() held and
      block on the expiry mutex

   This avoids spin-waiting on a task which might not even be on a CPU and
   works nicely for RT too.

Reported-by: Marco Elver <elver@google.com>
Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/posix-timers.h   |    7 +++
 kernel/time/posix-cpu-timers.c |   81 +++++++++++++++++++++++++++++++++--------
 kernel/time/posix-timers.c     |    4 ++
 3 files changed, 77 insertions(+), 15 deletions(-)

--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/alarmtimer.h>
 #include <linux/timerqueue.h>
 
@@ -62,9 +63,10 @@ static inline int clockid_to_fd(const cl
  * cpu_timer - Posix CPU timer representation for k_itimer
  * @node:	timerqueue node to queue in the task/sig
  * @head:	timerqueue head on which this timer is queued
- * @task:	Pointer to target task
+ * @pid:	Pointer to target task PID
  * @elist:	List head for the expiry list
  * @firing:	Timer is currently firing
+ * @handling:	Pointer to the task which handles expiry
  */
 struct cpu_timer {
 	struct timerqueue_node	node;
@@ -72,6 +74,7 @@ struct cpu_timer {
 	struct pid		*pid;
 	struct list_head	elist;
 	int			firing;
+	struct task_struct	*handling;
 };
 
 static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
@@ -135,10 +138,12 @@ struct posix_cputimers {
 /**
  * posix_cputimers_work - Container for task work based posix CPU timer expiry
  * @work:	The task work to be scheduled
+ * @mutex:	Mutex held around expiry in context of this task work
  * @scheduled:  @work has been scheduled already, no further processing
  */
 struct posix_cputimers_work {
 	struct callback_head	work;
+	struct mutex		mutex;
 	unsigned int		scheduled;
 };
 
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct tim
 			return expires;
 
 		ctmr->firing = 1;
+		/* See posix_cpu_timer_wait_running() */
+		WRITE_ONCE(ctmr->handling, current);
 		cpu_timer_dequeue(ctmr);
 		list_add_tail(&ctmr->elist, firing);
 	}
@@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(stru
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
 static void posix_cpu_timers_work(struct callback_head *work)
 {
+	struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
+
+	mutex_lock(&cw->mutex);
 	handle_posix_cpu_timers(current);
+	mutex_unlock(&cw->mutex);
+}
+
+/*
+ * Invoked from the posix-timer core when a cancel operation failed because
+ * the timer is marked firing. The caller holds rcu_read_lock(), which
+ * protects the timer and the task which is expiring it from being freed.
+ */
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	struct task_struct *tsk = READ_ONCE(timr->it.cpu.handling);
+
+	/* Has the handling task completed expiry already? */
+	if (!tsk)
+		return;
+
+	/* Ensure that the task cannot go away */
+	get_task_struct(tsk);
+	/* Now drop the RCU protection so the mutex can be locked */
+	rcu_read_unlock();
+	/* Wait on the expiry mutex */
+	mutex_lock(&tsk->posix_cputimers_work.mutex);
+	/* Release it immediately again. */
+	mutex_unlock(&tsk->posix_cputimers_work.mutex);
+	/* Drop the task reference. */
+	put_task_struct(tsk);
+	/* Relock RCU so the callsite is balanced */
+	rcu_read_lock();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	/* Ensure that timr->it.cpu.handling task cannot go away */
+	rcu_read_lock();
+	spin_unlock_irq(&timr->it_lock);
+	posix_cpu_timer_wait_running(timr);
+	rcu_read_unlock();
+	/* @timr is on stack and is valid */
+	spin_lock_irq(&timr->it_lock);
 }
 
 /*
@@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct t
 	       sizeof(p->posix_cputimers_work.work));
 	init_task_work(&p->posix_cputimers_work.work,
 		       posix_cpu_timers_work);
+	mutex_init(&p->posix_cputimers_work.mutex);
 	p->posix_cputimers_work.scheduled = false;
 }
 
@@ -1255,6 +1300,18 @@ static inline void __run_posix_cpu_timer
 	lockdep_posixtimer_exit();
 }
 
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	cpu_relax();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+	cpu_relax();
+	spin_lock_irq(&timr->it_lock);
+}
+
 static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
 {
 	return false;
@@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(stru
 		 */
 		if (likely(cpu_firing >= 0))
 			cpu_timer_fire(timer);
+		/* See posix_cpu_timer_wait_running() */
+		WRITE_ONCE(timer->it.cpu.handling, NULL);
 		spin_unlock(&timer->it_lock);
 	}
 }
@@ -1497,23 +1556,16 @@ static int do_cpu_nanosleep(const clocki
 		expires = cpu_timer_getexpires(&timer.it.cpu);
 		error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
 		if (!error) {
-			/*
-			 * Timer is now unarmed, deletion can not fail.
-			 */
+			/* Timer is now unarmed, deletion can not fail. */
 			posix_cpu_timer_del(&timer);
+		} else {
+			while (error == TIMER_RETRY) {
+				posix_cpu_timer_wait_running_nsleep(&timer);
+				error = posix_cpu_timer_del(&timer);
+			}
 		}
-		spin_unlock_irq(&timer.it_lock);
 
-		while (error == TIMER_RETRY) {
-			/*
-			 * We need to handle case when timer was or is in the
-			 * middle of firing. In other cases we already freed
-			 * resources.
-			 */
-			spin_lock_irq(&timer.it_lock);
-			error = posix_cpu_timer_del(&timer);
-			spin_unlock_irq(&timer.it_lock);
-		}
+		spin_unlock_irq(&timer.it_lock);
 
 		if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
 			/*
@@ -1623,6 +1675,7 @@ const struct k_clock clock_posix_cpu = {
 	.timer_del		= posix_cpu_timer_del,
 	.timer_get		= posix_cpu_timer_get,
 	.timer_rearm		= posix_cpu_timer_rearm,
+	.timer_wait_running	= posix_cpu_timer_wait_running,
 };
 
 const struct k_clock clock_process = {
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_runni
 	rcu_read_lock();
 	unlock_timer(timer, *flags);
 
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
 	if (!WARN_ON_ONCE(!kc->timer_wait_running))
 		kc->timer_wait_running(timer);
 


* Re: [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
@ 2023-04-17 15:34                     ` Sebastian Andrzej Siewior
  2023-04-18  6:00                     ` Marco Elver
                                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-04-17 15:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marco Elver, Frederic Weisbecker, syzbot, linux-kernel,
	syzkaller-bugs, Anna-Maria Behnsen, Jacob Keller,
	Paul E. McKenney, Peter Zijlstra

On 2023-04-17 15:37:55 [+0200], Thomas Gleixner wrote:
…
> 
> Reported-by: Marco Elver <elver@google.com>
> Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I didn't manage to trigger the warning via timer-warn.c; the patch does not
cause any problems as far as I can see.

Sebastian


* Re: [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
  2023-04-17 15:34                     ` Sebastian Andrzej Siewior
@ 2023-04-18  6:00                     ` Marco Elver
  2023-04-18 16:44                     ` Frederic Weisbecker
                                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Marco Elver @ 2023-04-18  6:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra, Sebastian Andrzej Siewior

On Mon, 17 Apr 2023 at 15:37, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> For some unknown reason the introduction of the timer_wait_running callback
> failed to fix up posix CPU timers, which went unnoticed for almost four years.
> Marco reported recently that the WARN_ON() in timer_wait_running()
> triggers with a posix CPU timer test case.
>
> Posix CPU timers have two execution models for expiring timers depending on
> CONFIG_POSIX_CPU_TIMERS_TASK_WORK:
>
> 1) If not enabled, the expiry happens in hard interrupt context so
>    spin waiting on the remote CPU is reasonably time bound.
>
>    Implement an empty stub function for that case.
>
> 2) If enabled, the expiry happens in task work before returning to user
>    space or guest mode. The expired timers are marked as firing and moved
>    from the timer queue to a local list head with sighand lock held. Once
>    the timers are moved, sighand lock is dropped and the expiry happens in
>    fully preemptible context. That means the expiring task can be scheduled
>    out, migrated, interrupted etc. So spin waiting on it is more than
>    suboptimal.
>
>    The timer wheel has a timer_wait_running() mechanism for RT, which uses
>    a per CPU timer-base expiry lock which is held by the expiry code and the
>    task waiting for the timer function to complete blocks on that lock.
>
>    This does not work in the same way for posix CPU timers as there is no
>    timer base and expiry for process wide timers can run on any task
>    belonging to that process, but the concept of waiting on an expiry lock
>    can be used too in a slightly different way:
>
>     - Add a mutex to struct posix_cputimers_work. This struct is per task
>       and used to schedule the expiry task work from the timer interrupt.
>
>     - Add a task_struct pointer to struct cpu_timer which is used to store
>       the task which runs the expiry. That's filled in when the task
>       moves the expired timers to the local expiry list. That's not
>       affecting the size of the k_itimer union as there are bigger union
>       members already.
>
>     - Let the task take the expiry mutex around the expiry function
>
>     - Let the waiter acquire a task reference with rcu_read_lock() held and
>       block on the expiry mutex
>
>    This avoids spin-waiting on a task which might not even be on a CPU and
>    works nicely for RT too.
>
> Reported-by: Marco Elver <elver@google.com>

Tested-by: Marco Elver <elver@google.com>

Thanks!

> Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/posix-timers.h   |    7 +++
>  kernel/time/posix-cpu-timers.c |   81 +++++++++++++++++++++++++++++++++--------
>  kernel/time/posix-timers.c     |    4 ++
>  3 files changed, 77 insertions(+), 15 deletions(-)
>
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -4,6 +4,7 @@
>
>  #include <linux/spinlock.h>
>  #include <linux/list.h>
> +#include <linux/mutex.h>
>  #include <linux/alarmtimer.h>
>  #include <linux/timerqueue.h>
>
> @@ -62,9 +63,10 @@ static inline int clockid_to_fd(const cl
>   * cpu_timer - Posix CPU timer representation for k_itimer
>   * @node:      timerqueue node to queue in the task/sig
>   * @head:      timerqueue head on which this timer is queued
> - * @task:      Pointer to target task
> + * @pid:       Pointer to target task PID
>   * @elist:     List head for the expiry list
>   * @firing:    Timer is currently firing
> + * @handling:  Pointer to the task which handles expiry
>   */
>  struct cpu_timer {
>         struct timerqueue_node  node;
> @@ -72,6 +74,7 @@ struct cpu_timer {
>         struct pid              *pid;
>         struct list_head        elist;
>         int                     firing;
> +       struct task_struct      *handling;
>  };
>
>  static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
> @@ -135,10 +138,12 @@ struct posix_cputimers {
>  /**
>   * posix_cputimers_work - Container for task work based posix CPU timer expiry
>   * @work:      The task work to be scheduled
> + * @mutex:     Mutex held around expiry in context of this task work
>   * @scheduled:  @work has been scheduled already, no further processing
>   */
>  struct posix_cputimers_work {
>         struct callback_head    work;
> +       struct mutex            mutex;
>         unsigned int            scheduled;
>  };
>
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct tim
>                         return expires;
>
>                 ctmr->firing = 1;
> +               /* See posix_cpu_timer_wait_running() */
> +               WRITE_ONCE(ctmr->handling, current);
>                 cpu_timer_dequeue(ctmr);
>                 list_add_tail(&ctmr->elist, firing);
>         }
> @@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(stru
>  #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
>  static void posix_cpu_timers_work(struct callback_head *work)
>  {
> +       struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
> +
> +       mutex_lock(&cw->mutex);
>         handle_posix_cpu_timers(current);
> +       mutex_unlock(&cw->mutex);
> +}
> +
> +/*
> + * Invoked from the posix-timer core when a cancel operation failed because
> + * the timer is marked firing. The caller holds rcu_read_lock(), which
> + * protects the timer and the task which is expiring it from being freed.
> + */
> +static void posix_cpu_timer_wait_running(struct k_itimer *timr)
> +{
> +       struct task_struct *tsk = READ_ONCE(timr->it.cpu.handling);
> +
> +       /* Has the handling task completed expiry already? */
> +       if (!tsk)
> +               return;
> +
> +       /* Ensure that the task cannot go away */
> +       get_task_struct(tsk);
> +       /* Now drop the RCU protection so the mutex can be locked */
> +       rcu_read_unlock();
> +       /* Wait on the expiry mutex */
> +       mutex_lock(&tsk->posix_cputimers_work.mutex);
> +       /* Release it immediately again. */
> +       mutex_unlock(&tsk->posix_cputimers_work.mutex);
> +       /* Drop the task reference. */
> +       put_task_struct(tsk);
> +       /* Relock RCU so the callsite is balanced */
> +       rcu_read_lock();
> +}
> +
> +static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
> +{
> +       /* Ensure that timr->it.cpu.handling task cannot go away */
> +       rcu_read_lock();
> +       spin_unlock_irq(&timr->it_lock);
> +       posix_cpu_timer_wait_running(timr);
> +       rcu_read_unlock();
> +       /* @timr is on stack and is valid */
> +       spin_lock_irq(&timr->it_lock);
>  }
>
>  /*
> @@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct t
>                sizeof(p->posix_cputimers_work.work));
>         init_task_work(&p->posix_cputimers_work.work,
>                        posix_cpu_timers_work);
> +       mutex_init(&p->posix_cputimers_work.mutex);
>         p->posix_cputimers_work.scheduled = false;
>  }
>
> @@ -1255,6 +1300,18 @@ static inline void __run_posix_cpu_timer
>         lockdep_posixtimer_exit();
>  }
>
> +static void posix_cpu_timer_wait_running(struct k_itimer *timr)
> +{
> +       cpu_relax();
> +}
> +
> +static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
> +{
> +       spin_unlock_irq(&timr->it_lock);
> +       cpu_relax();
> +       spin_lock_irq(&timr->it_lock);
> +}
> +
>  static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
>  {
>         return false;
> @@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(stru
>                  */
>                 if (likely(cpu_firing >= 0))
>                         cpu_timer_fire(timer);
> +               /* See posix_cpu_timer_wait_running() */
> +               WRITE_ONCE(timer->it.cpu.handling, NULL);
>                 spin_unlock(&timer->it_lock);
>         }
>  }
> @@ -1497,23 +1556,16 @@ static int do_cpu_nanosleep(const clocki
>                 expires = cpu_timer_getexpires(&timer.it.cpu);
>                 error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
>                 if (!error) {
> -                       /*
> -                        * Timer is now unarmed, deletion can not fail.
> -                        */
> +                       /* Timer is now unarmed, deletion can not fail. */
>                         posix_cpu_timer_del(&timer);
> +               } else {
> +                       while (error == TIMER_RETRY) {
> +                               posix_cpu_timer_wait_running_nsleep(&timer);
> +                               error = posix_cpu_timer_del(&timer);
> +                       }
>                 }
> -               spin_unlock_irq(&timer.it_lock);
>
> -               while (error == TIMER_RETRY) {
> -                       /*
> -                        * We need to handle case when timer was or is in the
> -                        * middle of firing. In other cases we already freed
> -                        * resources.
> -                        */
> -                       spin_lock_irq(&timer.it_lock);
> -                       error = posix_cpu_timer_del(&timer);
> -                       spin_unlock_irq(&timer.it_lock);
> -               }
> +               spin_unlock_irq(&timer.it_lock);
>
>                 if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
>                         /*
> @@ -1623,6 +1675,7 @@ const struct k_clock clock_posix_cpu = {
>         .timer_del              = posix_cpu_timer_del,
>         .timer_get              = posix_cpu_timer_get,
>         .timer_rearm            = posix_cpu_timer_rearm,
> +       .timer_wait_running     = posix_cpu_timer_wait_running,
>  };
>
>  const struct k_clock clock_process = {
> --- a/kernel/time/posix-timers.c
> +++ b/kernel/time/posix-timers.c
> @@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_runni
>         rcu_read_lock();
>         unlock_timer(timer, *flags);
>
> +       /*
> +        * kc->timer_wait_running() might drop RCU lock. So @timer
> +        * cannot be touched anymore after the function returns!
> +        */
>         if (!WARN_ON_ONCE(!kc->timer_wait_running))
>                 kc->timer_wait_running(timer);
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
  2023-04-17 15:34                     ` Sebastian Andrzej Siewior
  2023-04-18  6:00                     ` Marco Elver
@ 2023-04-18 16:44                     ` Frederic Weisbecker
  2023-04-19  7:33                       ` Thomas Gleixner
  2023-04-19  8:33                     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
                                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2023-04-18 16:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra, Sebastian Andrzej Siewior

On Mon, Apr 17, 2023 at 03:37:55PM +0200, Thomas Gleixner wrote:
>  struct cpu_timer {
>  	struct timerqueue_node	node;
> @@ -72,6 +74,7 @@ struct cpu_timer {
>  	struct pid		*pid;
>  	struct list_head	elist;
>  	int			firing;
> +	struct task_struct	*handling;

I guess it can be made __rcu

>  };
>  
>  static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
> @@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct tim
>  			return expires;
>  
>  		ctmr->firing = 1;
> +		/* See posix_cpu_timer_wait_running() */
> +		WRITE_ONCE(ctmr->handling, current);

That can be rcu_assign_pointer()

>  		cpu_timer_dequeue(ctmr);
>  		list_add_tail(&ctmr->elist, firing);
>  	}
> @@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(stru
> +static void posix_cpu_timer_wait_running(struct k_itimer *timr)
> +{
> +	struct task_struct *tsk = READ_ONCE(timr->it.cpu.handling);

And rcu_dereference()

> +
> +	/* Has the handling task completed expiry already? */
> +	if (!tsk)
> +		return;
> +
> +	/* Ensure that the task cannot go away */
> +	get_task_struct(tsk);
> +	/* Now drop the RCU protection so the mutex can be locked */
> +	rcu_read_unlock();
> +	/* Wait on the expiry mutex */
> +	mutex_lock(&tsk->posix_cputimers_work.mutex);
> +	/* Release it immediately again. */
> +	mutex_unlock(&tsk->posix_cputimers_work.mutex);
> +	/* Drop the task reference. */
> +	put_task_struct(tsk);
> +	/* Relock RCU so the callsite is balanced */
> +	rcu_read_lock();
> +}
> @@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(stru
>  		 */
>  		if (likely(cpu_firing >= 0))
>  			cpu_timer_fire(timer);
> +		/* See posix_cpu_timer_wait_running() */
> +		WRITE_ONCE(timer->it.cpu.handling, NULL);

And rcu_assign_pointer()

Aside from the boring details:

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


* Re: [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-18 16:44                     ` Frederic Weisbecker
@ 2023-04-19  7:33                       ` Thomas Gleixner
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Gleixner @ 2023-04-19  7:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marco Elver, syzbot, linux-kernel, syzkaller-bugs,
	Anna-Maria Behnsen, Jacob Keller, Paul E. McKenney,
	Peter Zijlstra, Sebastian Andrzej Siewior

On Tue, Apr 18 2023 at 18:44, Frederic Weisbecker wrote:
> On Mon, Apr 17, 2023 at 03:37:55PM +0200, Thomas Gleixner wrote:
>>  struct cpu_timer {
>>  	struct timerqueue_node	node;
>> @@ -72,6 +74,7 @@ struct cpu_timer {
>>  	struct pid		*pid;
>>  	struct list_head	elist;
>>  	int			firing;
>> +	struct task_struct	*handling;
>
> I guess it can be made __rcu

Indeed.

>>  		if (likely(cpu_firing >= 0))
>>  			cpu_timer_fire(timer);
>> +		/* See posix_cpu_timer_wait_running() */
>> +		WRITE_ONCE(timer->it.cpu.handling, NULL);
>
> And rcu_assign_pointer()

I fix that up on the fly.

> Aside from the boring details:
>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thanks for going through this!


* [tip: timers/core] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
                                       ` (2 preceding siblings ...)
  2023-04-18 16:44                     ` Frederic Weisbecker
@ 2023-04-19  8:33                     ` tip-bot2 for Thomas Gleixner
  2023-04-20  7:51                     ` tip-bot2 for Thomas Gleixner
  2023-04-21 13:43                     ` tip-bot2 for Thomas Gleixner
  5 siblings, 0 replies; 20+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-04-19  8:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Marco Elver, Thomas Gleixner, Sebastian Andrzej Siewior,
	Frederic Weisbecker, stable, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     1bb5b68fd3aabb6b9d6b9e9bb092bb8f3c2ade62
Gitweb:        https://git.kernel.org/tip/1bb5b68fd3aabb6b9d6b9e9bb092bb8f3c2ade62
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 17 Apr 2023 15:37:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Wed, 19 Apr 2023 10:29:00 +02:00

posix-cpu-timers: Implement the missing timer_wait_running callback

For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco recently reported that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.

Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:

1) If not enabled, the expiry happens in hard interrupt context so
   spin waiting on the remote CPU is reasonably time bound.

   Implement an empty stub function for that case.

2) If enabled, the expiry happens in task work before returning to user
   space or guest mode. The expired timers are marked as firing and moved
   from the timer queue to a local list head with sighand lock held. Once
   the timers are moved, sighand lock is dropped and the expiry happens in
   fully preemptible context. That means the expiring task can be scheduled
   out, migrated, interrupted etc. So spin waiting on it is more than
   suboptimal.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for posix CPU timers as there is no
   timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry lock
   can be used too in a slightly different way:

    - Add a mutex to struct posix_cputimers_work. This struct is per task
      and used to schedule the expiry task work from the timer interrupt.

    - Add a task_struct pointer to struct cpu_timer which is used to store
      the task which runs the expiry. That's filled in when the task
      moves the expired timers to the local expiry list. That's not
      affecting the size of the k_itimer union as there are bigger union
      members already.

    - Let the task take the expiry mutex around the expiry function

    - Let the waiter acquire a task reference with rcu_read_lock() held and
      block on the expiry mutex

   This avoids spin-waiting on a task which might not even be on a CPU and
   works nicely for RT too.

Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
---
 include/linux/posix-timers.h   | 17 ++++---
 kernel/time/posix-cpu-timers.c | 81 +++++++++++++++++++++++++++------
 kernel/time/posix-timers.c     |  4 ++-
 3 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 2c6e99c..d607f51 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/alarmtimer.h>
 #include <linux/timerqueue.h>
 
@@ -62,16 +63,18 @@ static inline int clockid_to_fd(const clockid_t clk)
  * cpu_timer - Posix CPU timer representation for k_itimer
  * @node:	timerqueue node to queue in the task/sig
  * @head:	timerqueue head on which this timer is queued
- * @task:	Pointer to target task
+ * @pid:	Pointer to target task PID
  * @elist:	List head for the expiry list
  * @firing:	Timer is currently firing
+ * @handling:	Pointer to the task which handles expiry
  */
 struct cpu_timer {
-	struct timerqueue_node	node;
-	struct timerqueue_head	*head;
-	struct pid		*pid;
-	struct list_head	elist;
-	int			firing;
+	struct timerqueue_node		node;
+	struct timerqueue_head		*head;
+	struct pid			*pid;
+	struct list_head		elist;
+	int				firing;
+	struct task_struct __rcu	*handling;
 };
 
 static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
@@ -135,10 +138,12 @@ struct posix_cputimers {
 /**
  * posix_cputimers_work - Container for task work based posix CPU timer expiry
  * @work:	The task work to be scheduled
+ * @mutex:	Mutex held around expiry in context of this task work
  * @scheduled:  @work has been scheduled already, no further processing
  */
 struct posix_cputimers_work {
 	struct callback_head	work;
+	struct mutex		mutex;
 	unsigned int		scheduled;
 };
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2f5e9b3..fb56e02 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct timerqueue_head *head,
 			return expires;
 
 		ctmr->firing = 1;
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(ctmr->handling, current);
 		cpu_timer_dequeue(ctmr);
 		list_add_tail(&ctmr->elist, firing);
 	}
@@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(struct task_struct *tsk);
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
 static void posix_cpu_timers_work(struct callback_head *work)
 {
+	struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
+
+	mutex_lock(&cw->mutex);
 	handle_posix_cpu_timers(current);
+	mutex_unlock(&cw->mutex);
+}
+
+/*
+ * Invoked from the posix-timer core when a cancel operation failed because
+ * the timer is marked firing. The caller holds rcu_read_lock(), which
+ * protects the timer and the task which is expiring it from being freed.
+ */
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	struct task_struct *tsk = rcu_dereference(timr->it.cpu.handling);
+
+	/* Has the handling task completed expiry already? */
+	if (!tsk)
+		return;
+
+	/* Ensure that the task cannot go away */
+	get_task_struct(tsk);
+	/* Now drop the RCU protection so the mutex can be locked */
+	rcu_read_unlock();
+	/* Wait on the expiry mutex */
+	mutex_lock(&tsk->posix_cputimers_work.mutex);
+	/* Release it immediately again. */
+	mutex_unlock(&tsk->posix_cputimers_work.mutex);
+	/* Drop the task reference. */
+	put_task_struct(tsk);
+	/* Relock RCU so the callsite is balanced */
+	rcu_read_lock();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	/* Ensure that timr->it.cpu.handling task cannot go away */
+	rcu_read_lock();
+	spin_unlock_irq(&timr->it_lock);
+	posix_cpu_timer_wait_running(timr);
+	rcu_read_unlock();
+	/* @timr is on stack and is valid */
+	spin_lock_irq(&timr->it_lock);
 }
 
 /*
@@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct task_struct *p)
 	       sizeof(p->posix_cputimers_work.work));
 	init_task_work(&p->posix_cputimers_work.work,
 		       posix_cpu_timers_work);
+	mutex_init(&p->posix_cputimers_work.mutex);
 	p->posix_cputimers_work.scheduled = false;
 }
 
@@ -1255,6 +1300,18 @@ static inline void __run_posix_cpu_timers(struct task_struct *tsk)
 	lockdep_posixtimer_exit();
 }
 
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	cpu_relax();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+	cpu_relax();
+	spin_lock_irq(&timr->it_lock);
+}
+
 static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
 {
 	return false;
@@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
 		 */
 		if (likely(cpu_firing >= 0))
 			cpu_timer_fire(timer);
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(timer->it.cpu.handling, NULL);
 		spin_unlock(&timer->it_lock);
 	}
 }
@@ -1497,23 +1556,16 @@ static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
 		expires = cpu_timer_getexpires(&timer.it.cpu);
 		error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
 		if (!error) {
-			/*
-			 * Timer is now unarmed, deletion can not fail.
-			 */
+			/* Timer is now unarmed, deletion can not fail. */
 			posix_cpu_timer_del(&timer);
+		} else {
+			while (error == TIMER_RETRY) {
+				posix_cpu_timer_wait_running_nsleep(&timer);
+				error = posix_cpu_timer_del(&timer);
+			}
 		}
-		spin_unlock_irq(&timer.it_lock);
 
-		while (error == TIMER_RETRY) {
-			/*
-			 * We need to handle case when timer was or is in the
-			 * middle of firing. In other cases we already freed
-			 * resources.
-			 */
-			spin_lock_irq(&timer.it_lock);
-			error = posix_cpu_timer_del(&timer);
-			spin_unlock_irq(&timer.it_lock);
-		}
+		spin_unlock_irq(&timer.it_lock);
 
 		if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
 			/*
@@ -1623,6 +1675,7 @@ const struct k_clock clock_posix_cpu = {
 	.timer_del		= posix_cpu_timer_del,
 	.timer_get		= posix_cpu_timer_get,
 	.timer_rearm		= posix_cpu_timer_rearm,
+	.timer_wait_running	= posix_cpu_timer_wait_running,
 };
 
 const struct k_clock clock_process = {
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0c8a87a..808a247 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	rcu_read_lock();
 	unlock_timer(timer, *flags);
 
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
 	if (!WARN_ON_ONCE(!kc->timer_wait_running))
 		kc->timer_wait_running(timer);
 


* [tip: timers/core] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
                                       ` (3 preceding siblings ...)
  2023-04-19  8:33                     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
@ 2023-04-20  7:51                     ` tip-bot2 for Thomas Gleixner
  2023-04-21 13:43                     ` tip-bot2 for Thomas Gleixner
  5 siblings, 0 replies; 20+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-04-20  7:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Marco Elver, Thomas Gleixner, Sebastian Andrzej Siewior,
	Frederic Weisbecker, stable, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     2aaae4bf41b101f7e58e8b06778b1cd9a1dddf94
Gitweb:        https://git.kernel.org/tip/2aaae4bf41b101f7e58e8b06778b1cd9a1dddf94
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 17 Apr 2023 15:37:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 20 Apr 2023 09:47:26 +02:00

posix-cpu-timers: Implement the missing timer_wait_running callback

For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco recently reported that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.

Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:

1) If not enabled, the expiry happens in hard interrupt context so
   spin waiting on the remote CPU is reasonably time bound.

   Implement an empty stub function for that case.

2) If enabled, the expiry happens in task work before returning to user
   space or guest mode. The expired timers are marked as firing and moved
   from the timer queue to a local list head with sighand lock held. Once
   the timers are moved, sighand lock is dropped and the expiry happens in
   fully preemptible context. That means the expiring task can be scheduled
   out, migrated, interrupted etc. So spin waiting on it is more than
   suboptimal.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for posix CPU timers as there is no
   timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry lock
   can be used too in a slightly different way:

    - Add a mutex to struct posix_cputimers_work. This struct is per task
      and used to schedule the expiry task work from the timer interrupt.

    - Add a task_struct pointer to struct cpu_timer which is used to store
      the task which runs the expiry. That's filled in when the task
      moves the expired timers to the local expiry list. That's not
      affecting the size of the k_itimer union as there are bigger union
      members already.

    - Let the task take the expiry mutex around the expiry function

    - Let the waiter acquire a task reference with rcu_read_lock() held and
      block on the expiry mutex

   This avoids spin-waiting on a task which might not even be on a CPU and
   works nicely for RT too.

Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
---
 include/linux/posix-timers.h   | 17 ++++---
 kernel/time/posix-cpu-timers.c | 81 +++++++++++++++++++++++++++------
 kernel/time/posix-timers.c     |  4 ++-
 3 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 2c6e99c..d607f51 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/alarmtimer.h>
 #include <linux/timerqueue.h>
 
@@ -62,16 +63,18 @@ static inline int clockid_to_fd(const clockid_t clk)
  * cpu_timer - Posix CPU timer representation for k_itimer
  * @node:	timerqueue node to queue in the task/sig
  * @head:	timerqueue head on which this timer is queued
- * @task:	Pointer to target task
+ * @pid:	Pointer to target task PID
  * @elist:	List head for the expiry list
  * @firing:	Timer is currently firing
+ * @handling:	Pointer to the task which handles expiry
  */
 struct cpu_timer {
-	struct timerqueue_node	node;
-	struct timerqueue_head	*head;
-	struct pid		*pid;
-	struct list_head	elist;
-	int			firing;
+	struct timerqueue_node		node;
+	struct timerqueue_head		*head;
+	struct pid			*pid;
+	struct list_head		elist;
+	int				firing;
+	struct task_struct __rcu	*handling;
 };
 
 static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
@@ -135,10 +138,12 @@ struct posix_cputimers {
 /**
  * posix_cputimers_work - Container for task work based posix CPU timer expiry
  * @work:	The task work to be scheduled
+ * @mutex:	Mutex held around expiry in context of this task work
  * @scheduled:  @work has been scheduled already, no further processing
  */
 struct posix_cputimers_work {
 	struct callback_head	work;
+	struct mutex		mutex;
 	unsigned int		scheduled;
 };
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2f5e9b3..93c5a19 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct timerqueue_head *head,
 			return expires;
 
 		ctmr->firing = 1;
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(ctmr->handling, current);
 		cpu_timer_dequeue(ctmr);
 		list_add_tail(&ctmr->elist, firing);
 	}
@@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(struct task_struct *tsk);
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
 static void posix_cpu_timers_work(struct callback_head *work)
 {
+	struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
+
+	mutex_lock(&cw->mutex);
 	handle_posix_cpu_timers(current);
+	mutex_unlock(&cw->mutex);
+}
+
+/*
+ * Invoked from the posix-timer core when a cancel operation failed because
+ * the timer is marked firing. The caller holds rcu_read_lock(), which
+ * protects the timer and the task which is expiring it from being freed.
+ */
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	struct task_struct *tsk = rcu_dereference(timr->it.cpu.handling);
+
+	/* Has the handling task completed expiry already? */
+	if (!tsk)
+		return;
+
+	/* Ensure that the task cannot go away */
+	get_task_struct(tsk);
+	/* Now drop the RCU protection so the mutex can be locked */
+	rcu_read_unlock();
+	/* Wait on the expiry mutex */
+	mutex_lock(&tsk->posix_cputimers_work.mutex);
+	/* Release it immediately again. */
+	mutex_unlock(&tsk->posix_cputimers_work.mutex);
+	/* Drop the task reference. */
+	put_task_struct(tsk);
+	/* Relock RCU so the callsite is balanced */
+	rcu_read_lock();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	/* Ensure that timr->it.cpu.handling task cannot go away */
+	rcu_read_lock();
+	spin_unlock_irq(&timr->it_lock);
+	posix_cpu_timer_wait_running(timr);
+	rcu_read_unlock();
+	/* @timr is on stack and is valid */
+	spin_lock_irq(&timr->it_lock);
 }
 
 /*
@@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct task_struct *p)
 	       sizeof(p->posix_cputimers_work.work));
 	init_task_work(&p->posix_cputimers_work.work,
 		       posix_cpu_timers_work);
+	mutex_init(&p->posix_cputimers_work.mutex);
 	p->posix_cputimers_work.scheduled = false;
 }
 
@@ -1255,6 +1300,18 @@ static inline void __run_posix_cpu_timers(struct task_struct *tsk)
 	lockdep_posixtimer_exit();
 }
 
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	cpu_relax();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+	cpu_relax();
+	spin_lock_irq(&timr->it_lock);
+}
+
 static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
 {
 	return false;
@@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
 		 */
 		if (likely(cpu_firing >= 0))
 			cpu_timer_fire(timer);
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(timer->it.cpu.handling, NULL);
 		spin_unlock(&timer->it_lock);
 	}
 }
@@ -1497,23 +1556,16 @@ static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
 		expires = cpu_timer_getexpires(&timer.it.cpu);
 		error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
 		if (!error) {
-			/*
-			 * Timer is now unarmed, deletion can not fail.
-			 */
+			/* Timer is now unarmed, deletion can not fail. */
 			posix_cpu_timer_del(&timer);
+		} else {
+			while (error == TIMER_RETRY) {
+				posix_cpu_timer_wait_running_nsleep(&timer);
+				error = posix_cpu_timer_del(&timer);
+			}
 		}
-		spin_unlock_irq(&timer.it_lock);
 
-		while (error == TIMER_RETRY) {
-			/*
-			 * We need to handle case when timer was or is in the
-			 * middle of firing. In other cases we already freed
-			 * resources.
-			 */
-			spin_lock_irq(&timer.it_lock);
-			error = posix_cpu_timer_del(&timer);
-			spin_unlock_irq(&timer.it_lock);
-		}
+		spin_unlock_irq(&timer.it_lock);
 
 		if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
 			/*
@@ -1623,6 +1675,7 @@ const struct k_clock clock_posix_cpu = {
 	.timer_del		= posix_cpu_timer_del,
 	.timer_get		= posix_cpu_timer_get,
 	.timer_rearm		= posix_cpu_timer_rearm,
+	.timer_wait_running	= posix_cpu_timer_wait_running,
 };
 
 const struct k_clock clock_process = {
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0c8a87a..808a247 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	rcu_read_lock();
 	unlock_timer(timer, *flags);
 
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
 	if (!WARN_ON_ONCE(!kc->timer_wait_running))
 		kc->timer_wait_running(timer);
 


* [tip: timers/core] posix-cpu-timers: Implement the missing timer_wait_running callback
  2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
                                       ` (4 preceding siblings ...)
  2023-04-20  7:51                     ` tip-bot2 for Thomas Gleixner
@ 2023-04-21 13:43                     ` tip-bot2 for Thomas Gleixner
  5 siblings, 0 replies; 20+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-04-21 13:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Marco Elver, Thomas Gleixner, Sebastian Andrzej Siewior,
	Frederic Weisbecker, stable, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     f7abf14f0001a5a47539d9f60bbdca649e43536b
Gitweb:        https://git.kernel.org/tip/f7abf14f0001a5a47539d9f60bbdca649e43536b
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 17 Apr 2023 15:37:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 21 Apr 2023 15:34:33 +02:00

posix-cpu-timers: Implement the missing timer_wait_running callback

For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco recently reported that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.

Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:

1) If not enabled, the expiry happens in hard interrupt context so
   spin waiting on the remote CPU is reasonably time bound.

   Implement an empty stub function for that case.

2) If enabled, the expiry happens in task work before returning to user
   space or guest mode. The expired timers are marked as firing and moved
   from the timer queue to a local list head with sighand lock held. Once
   the timers are moved, sighand lock is dropped and the expiry happens in
   fully preemptible context. That means the expiring task can be scheduled
   out, migrated, interrupted etc. So spin waiting on it is more than
   suboptimal.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for posix CPU timers as there is no
   timer base and expiry for process wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry lock
   can be used too in a slightly different way:

    - Add a mutex to struct posix_cputimers_work. This struct is per task
      and used to schedule the expiry task work from the timer interrupt.

    - Add a task_struct pointer to struct cpu_timer which is used to store
      the task which runs the expiry. That's filled in when the task
      moves the expired timers to the local expiry list. That's not
      affecting the size of the k_itimer union as there are bigger union
      members already.

    - Let the task take the expiry mutex around the expiry function

    - Let the waiter acquire a task reference with rcu_read_lock() held and
      block on the expiry mutex

   This avoids spin-waiting on a task which might not even be on a CPU and
   works nicely for RT too.

Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
---
 include/linux/posix-timers.h   | 17 ++++---
 kernel/time/posix-cpu-timers.c | 81 +++++++++++++++++++++++++++------
 kernel/time/posix-timers.c     |  4 ++-
 3 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 2c6e99c..d607f51 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/alarmtimer.h>
 #include <linux/timerqueue.h>
 
@@ -62,16 +63,18 @@ static inline int clockid_to_fd(const clockid_t clk)
  * cpu_timer - Posix CPU timer representation for k_itimer
  * @node:	timerqueue node to queue in the task/sig
  * @head:	timerqueue head on which this timer is queued
- * @task:	Pointer to target task
+ * @pid:	Pointer to target task PID
  * @elist:	List head for the expiry list
  * @firing:	Timer is currently firing
+ * @handling:	Pointer to the task which handles expiry
  */
 struct cpu_timer {
-	struct timerqueue_node	node;
-	struct timerqueue_head	*head;
-	struct pid		*pid;
-	struct list_head	elist;
-	int			firing;
+	struct timerqueue_node		node;
+	struct timerqueue_head		*head;
+	struct pid			*pid;
+	struct list_head		elist;
+	int				firing;
+	struct task_struct __rcu	*handling;
 };
 
 static inline bool cpu_timer_enqueue(struct timerqueue_head *head,
@@ -135,10 +138,12 @@ struct posix_cputimers {
 /**
  * posix_cputimers_work - Container for task work based posix CPU timer expiry
  * @work:	The task work to be scheduled
+ * @mutex:	Mutex held around expiry in context of this task work
  * @scheduled:  @work has been scheduled already, no further processing
  */
 struct posix_cputimers_work {
 	struct callback_head	work;
+	struct mutex		mutex;
 	unsigned int		scheduled;
 };
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2f5e9b3..e9c6f9d 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -846,6 +846,8 @@ static u64 collect_timerqueue(struct timerqueue_head *head,
 			return expires;
 
 		ctmr->firing = 1;
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(ctmr->handling, current);
 		cpu_timer_dequeue(ctmr);
 		list_add_tail(&ctmr->elist, firing);
 	}
@@ -1161,7 +1163,49 @@ static void handle_posix_cpu_timers(struct task_struct *tsk);
 #ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
 static void posix_cpu_timers_work(struct callback_head *work)
 {
+	struct posix_cputimers_work *cw = container_of(work, typeof(*cw), work);
+
+	mutex_lock(&cw->mutex);
 	handle_posix_cpu_timers(current);
+	mutex_unlock(&cw->mutex);
+}
+
+/*
+ * Invoked from the posix-timer core when a cancel operation failed because
+ * the timer is marked firing. The caller holds rcu_read_lock(), which
+ * protects the timer and the task which is expiring it from being freed.
+ */
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	struct task_struct *tsk = rcu_dereference(timr->it.cpu.handling);
+
+	/* Has the handling task completed expiry already? */
+	if (!tsk)
+		return;
+
+	/* Ensure that the task cannot go away */
+	get_task_struct(tsk);
+	/* Now drop the RCU protection so the mutex can be locked */
+	rcu_read_unlock();
+	/* Wait on the expiry mutex */
+	mutex_lock(&tsk->posix_cputimers_work.mutex);
+	/* Release it immediately again. */
+	mutex_unlock(&tsk->posix_cputimers_work.mutex);
+	/* Drop the task reference. */
+	put_task_struct(tsk);
+	/* Relock RCU so the callsite is balanced */
+	rcu_read_lock();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	/* Ensure that timr->it.cpu.handling task cannot go away */
+	rcu_read_lock();
+	spin_unlock_irq(&timr->it_lock);
+	posix_cpu_timer_wait_running(timr);
+	rcu_read_unlock();
+	/* @timr is on stack and is valid */
+	spin_lock_irq(&timr->it_lock);
 }
 
 /*
@@ -1177,6 +1221,7 @@ void clear_posix_cputimers_work(struct task_struct *p)
 	       sizeof(p->posix_cputimers_work.work));
 	init_task_work(&p->posix_cputimers_work.work,
 		       posix_cpu_timers_work);
+	mutex_init(&p->posix_cputimers_work.mutex);
 	p->posix_cputimers_work.scheduled = false;
 }
 
@@ -1255,6 +1300,18 @@ static inline void __run_posix_cpu_timers(struct task_struct *tsk)
 	lockdep_posixtimer_exit();
 }
 
+static void posix_cpu_timer_wait_running(struct k_itimer *timr)
+{
+	cpu_relax();
+}
+
+static void posix_cpu_timer_wait_running_nsleep(struct k_itimer *timr)
+{
+	spin_unlock_irq(&timr->it_lock);
+	cpu_relax();
+	spin_lock_irq(&timr->it_lock);
+}
+
 static inline bool posix_cpu_timers_work_scheduled(struct task_struct *tsk)
 {
 	return false;
@@ -1363,6 +1420,8 @@ static void handle_posix_cpu_timers(struct task_struct *tsk)
 		 */
 		if (likely(cpu_firing >= 0))
 			cpu_timer_fire(timer);
+		/* See posix_cpu_timer_wait_running() */
+		rcu_assign_pointer(timer->it.cpu.handling, NULL);
 		spin_unlock(&timer->it_lock);
 	}
 }
@@ -1497,23 +1556,16 @@ static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
 		expires = cpu_timer_getexpires(&timer.it.cpu);
 		error = posix_cpu_timer_set(&timer, 0, &zero_it, &it);
 		if (!error) {
-			/*
-			 * Timer is now unarmed, deletion can not fail.
-			 */
+			/* Timer is now unarmed, deletion can not fail. */
 			posix_cpu_timer_del(&timer);
+		} else {
+			while (error == TIMER_RETRY) {
+				posix_cpu_timer_wait_running_nsleep(&timer);
+				error = posix_cpu_timer_del(&timer);
+			}
 		}
-		spin_unlock_irq(&timer.it_lock);
 
-		while (error == TIMER_RETRY) {
-			/*
-			 * We need to handle case when timer was or is in the
-			 * middle of firing. In other cases we already freed
-			 * resources.
-			 */
-			spin_lock_irq(&timer.it_lock);
-			error = posix_cpu_timer_del(&timer);
-			spin_unlock_irq(&timer.it_lock);
-		}
+		spin_unlock_irq(&timer.it_lock);
 
 		if ((it.it_value.tv_sec | it.it_value.tv_nsec) == 0) {
 			/*
@@ -1623,6 +1675,7 @@ const struct k_clock clock_posix_cpu = {
 	.timer_del		= posix_cpu_timer_del,
 	.timer_get		= posix_cpu_timer_get,
 	.timer_rearm		= posix_cpu_timer_rearm,
+	.timer_wait_running	= posix_cpu_timer_wait_running,
 };
 
 const struct k_clock clock_process = {
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0c8a87a..808a247 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -846,6 +846,10 @@ static struct k_itimer *timer_wait_running(struct k_itimer *timer,
 	rcu_read_lock();
 	unlock_timer(timer, *flags);
 
+	/*
+	 * kc->timer_wait_running() might drop RCU lock. So @timer
+	 * cannot be touched anymore after the function returns!
+	 */
 	if (!WARN_ON_ONCE(!kc->timer_wait_running))
 		kc->timer_wait_running(timer);
 


Thread overview: 20+ messages
2020-08-31 17:07 WARNING in timer_wait_running syzbot
2023-04-05 21:07 ` Marco Elver
2023-04-05 22:19   ` Frederic Weisbecker
2023-04-06 19:37     ` Thomas Gleixner
2023-04-07  8:44       ` Thomas Gleixner
2023-04-07 11:50         ` Frederic Weisbecker
2023-04-07 17:47           ` Thomas Gleixner
2023-04-07 18:36             ` Frederic Weisbecker
2023-04-07 19:27               ` Thomas Gleixner
2023-04-07 21:00                 ` Frederic Weisbecker
2023-04-11 14:31                 ` Marco Elver
2023-04-17 13:37                   ` [PATCH] posix-cpu-timers: Implement the missing timer_wait_running callback Thomas Gleixner
2023-04-17 15:34                     ` Sebastian Andrzej Siewior
2023-04-18  6:00                     ` Marco Elver
2023-04-18 16:44                     ` Frederic Weisbecker
2023-04-19  7:33                       ` Thomas Gleixner
2023-04-19  8:33                     ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-04-20  7:51                     ` tip-bot2 for Thomas Gleixner
2023-04-21 13:43                     ` tip-bot2 for Thomas Gleixner
     [not found] ` <20230407134620.1034-1-hdanton@sina.com>
2023-04-07 14:49   ` WARNING in timer_wait_running Frederic Weisbecker
