io_uring: worker thread NULL dereference during openat op

* io_uring: worker thread NULL dereference during openat op
@ 2024-04-15 23:26 Dan Clash
  2024-04-16  3:29 ` Paul Moore
  0 siblings, 1 reply; 4+ messages in thread
From: Dan Clash @ 2024-04-15 23:26 UTC (permalink / raw)
  To: io-uring; +Cc: audit, Jens Axboe, Paul Moore

Below is a test program that causes multiple io_uring worker threads to 
hit a NULL dereference while executing openat ops.

The test program hangs forever in a D state.  The test program can be 
run again after the NULL dereferences.  However, there are long delays 
at reboot time because the io_uring_cancel() during do_exit() attempts 
to wake the worker threads.

The test program is single threaded but it queues multiple openat and 
close ops with IOSQE_ASYNC set before waiting for completions.

I collected trace with /sys/kernel/tracing/events/io_uring/enable 
enabled if that is helpful.

The test program reproduces similar problems in the following releases.

mainline v6.9-rc3
stable 6.8.5
Ubuntu 6.5.0-1018-azure

The test program does not reproduce the problem in Ubuntu 
5.15.0-1052-azure, which does not have the io_uring audit changes.

The following is the first io_uring worker thread backtrace in the repro 
against v6.9-rc3.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, 
BIOS Hyper-V UEFI Release v4.1 11/28/2023
RIP: 0010:strlen (lib/string.c:402)
Call Trace:
  <TASK>
? show_regs (arch/x86/kernel/dumpstack.c:479)
? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
? page_fault_oops (arch/x86/mm/fault.c:713)
? do_user_addr_fault (arch/x86/mm/fault.c:1261)
? exc_page_fault (./arch/x86/include/asm/irqflags.h:37 
./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 
arch/x86/mm/fault.c:1563)
? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
? __pfx_strlen (lib/string.c:402)
? parent_len (kernel/auditfilter.c:1284)
__audit_inode (kernel/auditsc.c:2381 (discriminator 4))
? link_path_walk.part.0.constprop.0 (fs/namei.c:2324)
path_openat (fs/namei.c:3550 fs/namei.c:3796)
do_filp_open (fs/namei.c:3826)
? alloc_fd (./arch/x86/include/asm/paravirt.h:589 (discriminator 10) 
./arch/x86/include/asm/qspinlock.h:57 (discriminator 10) 
./include/linux/spinlock.h:204 (discriminator 10) 
./include/linux/spinlock_api_smp.h:142 (discriminator 10) 
./include/linux/spinlock.h:391 (discriminator 10) fs/file.c:553 
(discriminator 10))
io_openat2 (io_uring/openclose.c:140)
io_openat (io_uring/openclose.c:178)
io_issue_sqe (io_uring/io_uring.c:1897)
io_wq_submit_work (io_uring/io_uring.c:2006)
io_worker_handle_work (io_uring/io-wq.c:540 io_uring/io-wq.c:597)
io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
? raw_spin_rq_unlock (./arch/x86/include/asm/paravirt.h:589 
./arch/x86/include/asm/qspinlock.h:57 ./include/linux/spinlock.h:204 
./include/linux/spinlock_api_smp.h:142 kernel/sched/core.c:603)
? finish_task_switch.isra.0 (./arch/x86/include/asm/irqflags.h:42 
./arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1397 
kernel/sched/core.c:5163 kernel/sched/core.c:5281)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork (arch/x86/kernel/process.c:156)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
RIP: 0033:0x0

Test program:

#include <cassert>
#include <fcntl.h>
#include <filesystem>
#include <getopt.h>
#include <iostream>
#include <liburing.h>

// This single threaded test program triggers a kernel BUG in an
// io_uring worker thread.
//
// The --count parameter sets the number of times that the following
// workload is executed:
//
// 1. io_uring_prep_openat(same_filename_each_time)
// 2. io_uring_submit
// 3. io_uring_wait_cqe
// 4. io_uring_prep_close(fd_from_openat_in_step_1)
// 5. io_uring_submit
// 6. io_uring_wait_cqe
//
// The SQEs in steps 1 and 4 have the IOSQE_ASYNC flag set.
//
// The --inflight parameter sets the number of the workloads that are
// executing at the same time.
// Steps 1 and 2 are executed, before step 3, until the inflight limit
// is reached.
//
// The sqe user_data is set to the io_uring op to keep the event_loop()
// simple.
// io_uring_sqe_set_data64() was not available during development.
//
// Warning: This test program may cause system instability and reboot
// may be affected.

// Usage:
//
// sudo apt install -y g++
// sudo apt install -y liburing-dev
// g++ ./io_uring_open_close_inflight.cc -o io_uring_open_close_inflight 
-luring
// sudo dmesg --clear
// ./io_uring_open_close_inflight --directory /tmp/deleteme --count 1 
--inflight 1
// sudo dmesg
//
// A good starting point to reproduce the problem is:
// --count 1000000 --inflight 100

void
submit1(struct io_uring& ring, const std::string& fileName)
{
     int ret;
     struct io_uring_sqe* sqe {};
     uintptr_t user_data {IORING_OP_OPENAT};

     int flags {O_RDWR | O_CREAT};
     mode_t mode {0666};

     sqe = io_uring_get_sqe(&ring);
     assert(sqe != nullptr);

     io_uring_prep_openat(sqe, AT_FDCWD, fileName.data(), flags, mode);
     io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
     io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

     ret = io_uring_submit(&ring);
     assert(ret == 1);
}

void
submit2(struct io_uring& ring, const int fd)
{
     int ret;
     struct io_uring_sqe* sqe {};
     uintptr_t user_data {IORING_OP_CLOSE};

     sqe = io_uring_get_sqe(&ring);
     assert(sqe != nullptr);

     io_uring_prep_close(sqe, fd);
     io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
     io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

     ret = io_uring_submit(&ring);
     assert(ret == 1);
}

void
event_loop(struct io_uring& ring,
            const std::string& filePath,
            const uint32_t count_max,
            const uint32_t inflight_max)
{
     uint32_t inflight {0};
     uint32_t count_phase1 {0};
     uint32_t count_phase2 {0};

     int i {0};
     std::string fileName(filePath + "/file" + std::to_string(i));

     while (count_phase2 < count_max) {
         int ret {0};
         int res {0};
         uintptr_t user_data {};

         while ((count_phase1 < count_max) and (inflight < inflight_max)) {
             if (0 == (count_phase1 % 100)) {
                 std::cout << "count=" << std::to_string(count_phase1)
			  << std::endl;
             }
             count_phase1 += 1;
             inflight += 1;
             submit1(ring, fileName);
         }

         {
             struct io_uring_cqe* cqe {};

             ret = io_uring_wait_cqe(&ring, &cqe);
             assert(ret == 0);

             res = cqe->res;
             user_data = 
reinterpret_cast<uintptr_t>(io_uring_cqe_get_data(cqe));

             io_uring_cqe_seen(&ring, cqe);
         }

         switch (user_data) {
         case IORING_OP_OPENAT:
             // res is a file descriptor or -errno.
             assert(res >= 0);
             submit2(ring, res);
             break;
         case IORING_OP_CLOSE:
             // assert that the close succeeded.
             assert(res == 0);
             count_phase2 += 1;
             inflight -= 1;
             break;
         default:
             assert(0);
         }
     }

     assert(count_phase1 == count_max);
     assert(count_phase2 == count_max);
     assert(inflight == 0);
}

void
workload(const std::string& filePath, const uint32_t count_max, const 
uint32_t inflight_max)
{
     int ret {0};
     struct io_uring ring;

     ret = io_uring_queue_init(2 * inflight_max, &ring, 0);
     assert(0 == ret);

     std::filesystem::create_directory(filePath);

     event_loop(ring, filePath, count_max, inflight_max);

     std::filesystem::remove_all(filePath);

     io_uring_queue_exit(&ring);
}

int
main(int argc, char** argv)
{
     std::string executableName {argv[0]};
     std::string filePath {};
     int count {};
     int inflight {};

     struct option options[]
     {
         {"help", no_argument, 0, 'h'},
         {"directory", required_argument, 0, 'd'},
         {"count", required_argument, 0, 'c'},
         {"inflight", required_argument, 0, 'i'},
         { 0, 0, 0, 0 }
     };
     bool printUsage {false};
     int val {};

     while ((val = getopt_long_only(argc, argv, "", options, nullptr)) 
!= -1) {
         if (val == 'h') {
             printUsage = true;
         } else if (val == 'd') {
             filePath = optarg;
             if (std::filesystem::exists(filePath)) {
                 printUsage = true;
                 std::cerr << "directory must not exist" << std::endl;
             }
         } else if (val == 'c') {
             count = atoi(optarg);
             if (0 == count) {
                 printUsage = true;
             }
         } else if (val == 'i') {
             inflight = atoi(optarg);
             if (0 == inflight) {
                 printUsage = true;
             }
         } else {
             printUsage = true;
         }
     }

     if ((0 == count) || (0 == inflight) || (filePath.empty())) {
         printUsage = true;
     }

     if (printUsage || (optind < argc)) {
         std::cerr << executableName
             << " --directory DIR --count COUNT --inflight INFLIGHT"
             << std::endl;
         exit(1);
     }

     workload(filePath, count, inflight);
     return 0;
}

Details:

The test program has no io_uring worker threads associated after the 
NULL dereferences.

root@openat-hang-3:~# date
Mon Apr 15 06:03:22 PM UTC 2024

root@openat-hang-3:~# ps -o pid,state,command -C 
io_uring_open_close_inflight
     PID S COMMAND
    4605 D ./io_uring_open_close_inflight --directory /tmp/deleteme 
--count 1000000 --inflight 100

root@openat-hang-3:~# ps -L -p 4605
     PID     LWP TTY          TIME CMD
    4605    4605 pts/3    00:00:07 io_uring_open_c

root@openat-hang-3:/# ps -e -L | grep -Ee 'iou[-]wrk[-]' --count
0

10 io_uring worker threads are present in dmesg.

root@openat-hang-3:~# nproc --all
8

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 
'CPU.*iou-wrk-' --count
10

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 
'CPU.*iou-wrk-'
[Mon Apr 15 18:02:55 2024] CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not 
tainted 6.9.0-rc3 #2
[Mon Apr 15 18:02:55 2024] CPU: 1 PID: 4635 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:56 2024] CPU: 2 PID: 4636 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:57 2024] CPU: 3 PID: 4632 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:57 2024] CPU: 2 PID: 4634 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 5 PID: 4626 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 7 PID: 4630 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 4 PID: 4707 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 1 PID: 4705 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 0 PID: 4711 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2

The alloc_fd lines in the next section were changed to the following.

  ? alloc_fd (... fs/file.c:553 (discriminator 10))

The 10 backtraces are very similar.

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-1.txt
16a17
 >  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-2.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-3.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-4.txt
14a15
 >  ? update_cfs_group (kernel/sched/fair.c:3983)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-5.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-6.txt
16a17
 >  ? __io_req_complete_post (io_uring/io_uring.c:1016)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-7.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-8.txt
16a17
 >  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-9.txt
14a15
 >  ? update_cfs_group (kernel/sched/fair.c:3983)

^ permalink raw reply	[flat|nested] 4+ messages in thread