io-uring.vger.kernel.org archive mirror
* io_uring: worker thread NULL dereference during openat op
@ 2024-04-15 23:26 Dan Clash
  2024-04-16  3:29 ` Paul Moore
  0 siblings, 1 reply; 4+ messages in thread
From: Dan Clash @ 2024-04-15 23:26 UTC (permalink / raw)
  To: io-uring; +Cc: audit, Jens Axboe, Paul Moore

Below is a test program that causes multiple io_uring worker threads to 
hit a NULL dereference while executing openat ops.

The test program hangs forever in a D state.  The test program can be 
run again after the NULL dereferences.  However, there are long delays 
at reboot time because the io_uring_cancel() during do_exit() attempts 
to wake the worker threads.

The test program is single threaded but it queues multiple openat and 
close ops with IOSQE_ASYNC set before waiting for completions.

I collected a trace with /sys/kernel/tracing/events/io_uring/enable
enabled, in case that is helpful.

The test program reproduces similar problems in the following releases.

mainline v6.9-rc3
stable 6.8.5
Ubuntu 6.5.0-1018-azure

The test program does not reproduce the problem in Ubuntu 
5.15.0-1052-azure, which does not have the io_uring audit changes.

The following is the first io_uring worker thread backtrace in the repro 
against v6.9-rc3.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, 
BIOS Hyper-V UEFI Release v4.1 11/28/2023
RIP: 0010:strlen (lib/string.c:402)
Call Trace:
  <TASK>
? show_regs (arch/x86/kernel/dumpstack.c:479)
? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
? page_fault_oops (arch/x86/mm/fault.c:713)
? do_user_addr_fault (arch/x86/mm/fault.c:1261)
? exc_page_fault (./arch/x86/include/asm/irqflags.h:37 
./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 
arch/x86/mm/fault.c:1563)
? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
? __pfx_strlen (lib/string.c:402)
? parent_len (kernel/auditfilter.c:1284)
__audit_inode (kernel/auditsc.c:2381 (discriminator 4))
? link_path_walk.part.0.constprop.0 (fs/namei.c:2324)
path_openat (fs/namei.c:3550 fs/namei.c:3796)
do_filp_open (fs/namei.c:3826)
? alloc_fd (./arch/x86/include/asm/paravirt.h:589 (discriminator 10) 
./arch/x86/include/asm/qspinlock.h:57 (discriminator 10) 
./include/linux/spinlock.h:204 (discriminator 10) 
./include/linux/spinlock_api_smp.h:142 (discriminator 10) 
./include/linux/spinlock.h:391 (discriminator 10) fs/file.c:553 
(discriminator 10))
io_openat2 (io_uring/openclose.c:140)
io_openat (io_uring/openclose.c:178)
io_issue_sqe (io_uring/io_uring.c:1897)
io_wq_submit_work (io_uring/io_uring.c:2006)
io_worker_handle_work (io_uring/io-wq.c:540 io_uring/io-wq.c:597)
io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
? raw_spin_rq_unlock (./arch/x86/include/asm/paravirt.h:589 
./arch/x86/include/asm/qspinlock.h:57 ./include/linux/spinlock.h:204 
./include/linux/spinlock_api_smp.h:142 kernel/sched/core.c:603)
? finish_task_switch.isra.0 (./arch/x86/include/asm/irqflags.h:42 
./arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1397 
kernel/sched/core.c:5163 kernel/sched/core.c:5281)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork (arch/x86/kernel/process.c:156)
? __pfx_io_wq_worker (io_uring/io-wq.c:627)
ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
RIP: 0033:0x0

Test program:

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <fcntl.h>
#include <filesystem>
#include <getopt.h>
#include <iostream>
#include <liburing.h>
#include <string>

// This single threaded test program triggers a kernel BUG in an
// io_uring worker thread.
//
// The --count parameter sets the number of times that the following
// workload is executed:
//
// 1. io_uring_prep_openat(same_filename_each_time)
// 2. io_uring_submit
// 3. io_uring_wait_cqe
// 4. io_uring_prep_close(fd_from_openat_in_step_1)
// 5. io_uring_submit
// 6. io_uring_wait_cqe
//
// The SQEs in steps 1 and 4 have the IOSQE_ASYNC flag set.
//
// The --inflight parameter sets the number of workloads that execute
// at the same time.
// Steps 1 and 2 are repeated, without waiting in step 3, until the
// inflight limit is reached.
//
// The sqe user_data is set to the io_uring op to keep the event_loop()
// simple.
// io_uring_sqe_set_data64() was not available during development.
//
// Warning: This test program may cause system instability and reboot
// may be affected.

// Usage:
//
// sudo apt install -y g++
// sudo apt install -y liburing-dev
// g++ ./io_uring_open_close_inflight.cc -o io_uring_open_close_inflight -luring
// sudo dmesg --clear
// ./io_uring_open_close_inflight --directory /tmp/deleteme --count 1 --inflight 1
// sudo dmesg
//
// A good starting point to reproduce the problem is:
// --count 1000000 --inflight 100

void
submit1(struct io_uring& ring, const std::string& fileName)
{
     int ret;
     struct io_uring_sqe* sqe {};
     uintptr_t user_data {IORING_OP_OPENAT};

     int flags {O_RDWR | O_CREAT};
     mode_t mode {0666};

     sqe = io_uring_get_sqe(&ring);
     assert(sqe != nullptr);

     io_uring_prep_openat(sqe, AT_FDCWD, fileName.data(), flags, mode);
     io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
     io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

     ret = io_uring_submit(&ring);
     assert(ret == 1);
}

void
submit2(struct io_uring& ring, const int fd)
{
     int ret;
     struct io_uring_sqe* sqe {};
     uintptr_t user_data {IORING_OP_CLOSE};

     sqe = io_uring_get_sqe(&ring);
     assert(sqe != nullptr);

     io_uring_prep_close(sqe, fd);
     io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(user_data));
     io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

     ret = io_uring_submit(&ring);
     assert(ret == 1);
}

void
event_loop(struct io_uring& ring,
            const std::string& filePath,
            const uint32_t count_max,
            const uint32_t inflight_max)
{
     uint32_t inflight {0};
     uint32_t count_phase1 {0};
     uint32_t count_phase2 {0};

     int i {0};
     std::string fileName(filePath + "/file" + std::to_string(i));

     while (count_phase2 < count_max) {
         int ret {0};
         int res {0};
         uintptr_t user_data {};


         while ((count_phase1 < count_max) and (inflight < inflight_max)) {
             if (0 == (count_phase1 % 100)) {
                 std::cout << "count=" << std::to_string(count_phase1)
			  << std::endl;
             }
             count_phase1 += 1;
             inflight += 1;
             submit1(ring, fileName);
         }

         {
             struct io_uring_cqe* cqe {};

             ret = io_uring_wait_cqe(&ring, &cqe);
             assert(ret == 0);

             res = cqe->res;
             user_data =
                 reinterpret_cast<uintptr_t>(io_uring_cqe_get_data(cqe));

             io_uring_cqe_seen(&ring, cqe);
         }

         switch (user_data) {
         case IORING_OP_OPENAT:
             // res is a file descriptor or -errno.
             assert(res >= 0);
             submit2(ring, res);
             break;
         case IORING_OP_CLOSE:
             // assert that the close succeeded.
             assert(res == 0);
             count_phase2 += 1;
             inflight -= 1;
             break;
         default:
             assert(0);
         }
     }

     assert(count_phase1 == count_max);
     assert(count_phase2 == count_max);
     assert(inflight == 0);
}

void
workload(const std::string& filePath,
          const uint32_t count_max,
          const uint32_t inflight_max)
{
     int ret {0};
     struct io_uring ring;

     ret = io_uring_queue_init(2 * inflight_max, &ring, 0);
     assert(0 == ret);

     std::filesystem::create_directory(filePath);

     event_loop(ring, filePath, count_max, inflight_max);

     std::filesystem::remove_all(filePath);

     io_uring_queue_exit(&ring);
}

int
main(int argc, char** argv)
{
     std::string executableName {argv[0]};
     std::string filePath {};
     int count {};
     int inflight {};

     struct option options[]
     {
         {"help", no_argument, 0, 'h'},
         {"directory", required_argument, 0, 'd'},
         {"count", required_argument, 0, 'c'},
         {"inflight", required_argument, 0, 'i'},
         { 0, 0, 0, 0 }
     };
     bool printUsage {false};
     int val {};

     while ((val = getopt_long_only(argc, argv, "", options, nullptr)) != -1) {
         if (val == 'h') {
             printUsage = true;
         } else if (val == 'd') {
             filePath = optarg;
             if (std::filesystem::exists(filePath)) {
                 printUsage = true;
                 std::cerr << "directory must not exist" << std::endl;
             }
         } else if (val == 'c') {
             count = atoi(optarg);
             if (0 == count) {
                 printUsage = true;
             }
         } else if (val == 'i') {
             inflight = atoi(optarg);
             if (0 == inflight) {
                 printUsage = true;
             }
         } else {
             printUsage = true;
         }
     }

     if ((0 == count) || (0 == inflight) || (filePath.empty())) {
         printUsage = true;
     }

     if (printUsage || (optind < argc)) {
         std::cerr << executableName
             << " --directory DIR --count COUNT --inflight INFLIGHT"
             << std::endl;
         exit(1);
     }

     workload(filePath, count, inflight);
     return 0;
}


Details:

After the NULL dereferences, no io_uring worker threads remain 
associated with the test program.

root@openat-hang-3:~# date
Mon Apr 15 06:03:22 PM UTC 2024

root@openat-hang-3:~# ps -o pid,state,command -C 
io_uring_open_close_inflight
     PID S COMMAND
    4605 D ./io_uring_open_close_inflight --directory /tmp/deleteme 
--count 1000000 --inflight 100

root@openat-hang-3:~# ps -L -p 4605
     PID     LWP TTY          TIME CMD
    4605    4605 pts/3    00:00:07 io_uring_open_c

root@openat-hang-3:/# ps -e -L | grep -Ee 'iou[-]wrk[-]' --count
0

Oops reports from 10 io_uring worker threads are present in dmesg.

root@openat-hang-3:~# nproc --all
8

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 
'CPU.*iou-wrk-' --count
10

root@openat-hang-3:/# grep /tmp/io_uring_dmesg_decode.txt -Ee 
'CPU.*iou-wrk-'
[Mon Apr 15 18:02:55 2024] CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not 
tainted 6.9.0-rc3 #2
[Mon Apr 15 18:02:55 2024] CPU: 1 PID: 4635 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:56 2024] CPU: 2 PID: 4636 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:57 2024] CPU: 3 PID: 4632 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:57 2024] CPU: 2 PID: 4634 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 5 PID: 4626 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 7 PID: 4630 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 4 PID: 4707 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 1 PID: 4705 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2
[Mon Apr 15 18:02:58 2024] CPU: 0 PID: 4711 Comm: iou-wrk-4605 Tainted: 
G      D            6.9.0-rc3 #2

For the diffs below, the alloc_fd lines in the saved call traces were 
shortened to the following.

  ? alloc_fd (... fs/file.c:553 (discriminator 10))

The 10 backtraces are very similar.

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-1.txt
16a17
 >  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-2.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-3.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-4.txt
14a15
 >  ? update_cfs_group (kernel/sched/fair.c:3983)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-5.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-6.txt
16a17
 >  ? __io_req_complete_post (io_uring/io_uring.c:1016)

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-7.txt
root@openat-hang-3:/tmp/call-traces#

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-8.txt
16a17
 >  ? __send_ipi_one (arch/x86/hyperv/hv_apic.c:252 (discriminator 3))

root@openat-hang-3:/tmp/call-traces# diff call-trace-0.txt call-trace-9.txt
14a15
 >  ? update_cfs_group (kernel/sched/fair.c:3983)


* Re: io_uring: worker thread NULL dereference during openat op
  2024-04-15 23:26 io_uring: worker thread NULL dereference during openat op Dan Clash
@ 2024-04-16  3:29 ` Paul Moore
  2024-04-16 13:45   ` Pavel Begunkov
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Moore @ 2024-04-16  3:29 UTC (permalink / raw)
  To: Dan Clash; +Cc: io-uring, audit, Jens Axboe

On Mon, Apr 15, 2024 at 7:26 PM Dan Clash <daclash@linux.microsoft.com> wrote:
>
> Below is a test program that causes multiple io_uring worker threads to
> hit a NULL dereference while executing openat ops.
>
> The test program hangs forever in a D state.  The test program can be
> run again after the NULL dereferences.  However, there are long delays
> at reboot time because the io_uring_cancel() during do_exit() attempts
> to wake the worker threads.
>
> The test program is single threaded but it queues multiple openat and
> close ops with IOSQE_ASYNC set before waiting for completions.
>
> I collected a trace with /sys/kernel/tracing/events/io_uring/enable
> enabled, in case that is helpful.
>
> The test program reproduces similar problems in the following releases.
>
> mainline v6.9-rc3
> stable 6.8.5
> Ubuntu 6.5.0-1018-azure
>
> The test program does not reproduce the problem in Ubuntu
> 5.15.0-1052-azure, which does not have the io_uring audit changes.
>
> The following is the first io_uring worker thread backtrace in the repro
> against v6.9-rc3.
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2
> Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine,
> BIOS Hyper-V UEFI Release v4.1 11/28/2023
> RIP: 0010:strlen (lib/string.c:402)
> Call Trace:
>   <TASK>
> ? show_regs (arch/x86/kernel/dumpstack.c:479)
> ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
> ? page_fault_oops (arch/x86/mm/fault.c:713)
> ? do_user_addr_fault (arch/x86/mm/fault.c:1261)
> ? exc_page_fault (./arch/x86/include/asm/irqflags.h:37
> ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513
> arch/x86/mm/fault.c:1563)
> ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
> ? __pfx_strlen (lib/string.c:402)
> ? parent_len (kernel/auditfilter.c:1284)
> __audit_inode (kernel/auditsc.c:2381 (discriminator 4))

Thanks for the well documented bug report!

That's interesting, it looks like audit_inode() is potentially being
passed a filename struct with a NULL name field (filename::name ==
NULL).  Given the IOSQE_ASYNC and what looks like io_uring calling
getname() from within the __io_openat_prep() function, I suspect the
issue is that we aren't associating the filename information we
collect in getname() with the proper audit_context().  In other words,
we do the getname() in one context, and then the actual open operation
in another, and the audit filename info is lost in the switch.

I think this is related to another issue that Jens and I have been
discussing relating to connect() and sockaddrs.  We had been hoping
that the issue we were seeing with sockaddrs was just a special case
with connect, but it doesn't look like that is the case.

I'm going to be a bit busy this week with conferences, but given the
previous discussions with Jens as well as this new issue, I suspect
that we are going to need to do some work to support creation of a
thin, or lazily setup, audit_context that we can initialize in the
io_uring prep routines for use by getname(), move_addr_to_kernel(),
etc., store in the io_kiocb struct, and then fully setup in
audit_uring_entry().
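
Hand-waving all of the details, I'm picturing something like the
sketch below.  To be clear, everything in it other than getname(),
IS_ERR()/PTR_ERR(), and audit_uring_entry() is a made-up placeholder,
not existing kernel code; it is only meant to show where the pieces
would live.

#include <linux/audit.h>        /* audit_uring_entry(), struct audit_context */
#include <linux/err.h>          /* IS_ERR(), PTR_ERR() */
#include <linux/errno.h>
#include <linux/fs.h>           /* struct filename, getname() */
#include <linux/types.h>        /* u8 */

/* hypothetical per-request state, stored in struct io_kiocb */
struct io_audit_state {
        struct audit_context *ctx;      /* thin, lazily allocated */
        struct filename *name;          /* captured at prep time */
};

/* prep time, still running in the submitting task */
static int example_openat_prep(struct io_audit_state *st,
                               const char __user *uname)
{
        st->ctx = audit_alloc_thin();           /* hypothetical allocator */
        if (!st->ctx)
                return -ENOMEM;

        st->name = getname(uname);              /* real helper */
        if (IS_ERR(st->name))
                return PTR_ERR(st->name);

        /* hypothetical: record the name against the request's context
         * instead of only current->audit_context */
        audit_attach_name(st->ctx, st->name);
        return 0;
}

/* issue time, running in the io-wq worker */
static void example_issue_enter(struct io_audit_state *st, u8 opcode)
{
        /* hypothetical: finish setting up st->ctx and make it the
         * context used for the duration of the operation */
        audit_adopt_context(st->ctx);
        audit_uring_entry(opcode);              /* existing hook */
}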

> ? link_path_walk.part.0.constprop.0 (fs/namei.c:2324)
> path_openat (fs/namei.c:3550 fs/namei.c:3796)
> do_filp_open (fs/namei.c:3826)
> ? alloc_fd (./arch/x86/include/asm/paravirt.h:589 (discriminator 10)
> ./arch/x86/include/asm/qspinlock.h:57 (discriminator 10)
> ./include/linux/spinlock.h:204 (discriminator 10)
> ./include/linux/spinlock_api_smp.h:142 (discriminator 10)
> ./include/linux/spinlock.h:391 (discriminator 10) fs/file.c:553
> (discriminator 10))
> io_openat2 (io_uring/openclose.c:140)
> io_openat (io_uring/openclose.c:178)
> io_issue_sqe (io_uring/io_uring.c:1897)
> io_wq_submit_work (io_uring/io_uring.c:2006)
> io_worker_handle_work (io_uring/io-wq.c:540 io_uring/io-wq.c:597)
> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
> ? raw_spin_rq_unlock (./arch/x86/include/asm/paravirt.h:589
> ./arch/x86/include/asm/qspinlock.h:57 ./include/linux/spinlock.h:204
> ./include/linux/spinlock_api_smp.h:142 kernel/sched/core.c:603)
> ? finish_task_switch.isra.0 (./arch/x86/include/asm/irqflags.h:42
> ./arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1397
> kernel/sched/core.c:5163 kernel/sched/core.c:5281)
> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
> ret_from_fork (arch/x86/kernel/process.c:156)
> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:256)

-- 
paul-moore.com


* Re: io_uring: worker thread NULL dereference during openat op
  2024-04-16  3:29 ` Paul Moore
@ 2024-04-16 13:45   ` Pavel Begunkov
  2024-04-16 17:15     ` Paul Moore
  0 siblings, 1 reply; 4+ messages in thread
From: Pavel Begunkov @ 2024-04-16 13:45 UTC (permalink / raw)
  To: Paul Moore, Dan Clash; +Cc: io-uring, audit, Jens Axboe



On 4/16/24 04:29, Paul Moore wrote:
> On Mon, Apr 15, 2024 at 7:26 PM Dan Clash <daclash@linux.microsoft.com> wrote:
>>
>> Below is a test program that causes multiple io_uring worker threads to
>> hit a NULL dereference while executing openat ops.
>>
>> The test program hangs forever in a D state.  The test program can be
>> run again after the NULL dereferences.  However, there are long delays
>> at reboot time because the io_uring_cancel() during do_exit() attempts
>> to wake the worker threads.
>>
>> The test program is single threaded but it queues multiple openat and
>> close ops with IOSQE_ASYNC set before waiting for completions.
>>
>> I collected a trace with /sys/kernel/tracing/events/io_uring/enable
>> enabled, in case that is helpful.
>>
>> The test program reproduces similar problems in the following releases.
>>
>> mainline v6.9-rc3
>> stable 6.8.5
>> Ubuntu 6.5.0-1018-azure
>>
>> The test program does not reproduce the problem in Ubuntu
>> 5.15.0-1052-azure, which does not have the io_uring audit changes.
>>
>> The following is the first io_uring worker thread backtrace in the repro
>> against v6.9-rc3.
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 0 P4D 0
>> Oops: 0000 [#1] SMP PTI
>> CPU: 0 PID: 4628 Comm: iou-wrk-4605 Not tainted 6.9.0-rc3 #2
>> Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine,
>> BIOS Hyper-V UEFI Release v4.1 11/28/2023
>> RIP: 0010:strlen (lib/string.c:402)
>> Call Trace:
>>    <TASK>
>> ? show_regs (arch/x86/kernel/dumpstack.c:479)
>> ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
>> ? page_fault_oops (arch/x86/mm/fault.c:713)
>> ? do_user_addr_fault (arch/x86/mm/fault.c:1261)
>> ? exc_page_fault (./arch/x86/include/asm/irqflags.h:37
>> ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513
>> arch/x86/mm/fault.c:1563)
>> ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
>> ? __pfx_strlen (lib/string.c:402)
>> ? parent_len (kernel/auditfilter.c:1284)
>> __audit_inode (kernel/auditsc.c:2381 (discriminator 4))
> 
> Thanks for the well documented bug report!
> 
> That's interesting, it looks like audit_inode() is potentially being
> passed a filename struct with a NULL name field (filename::name ==
> NULL).  Given the IOSQE_ASYNC and what looks like io_uring calling
> getname() from within the __io_openat_prep() function, I suspect the
> issue is that we aren't associating the filename information we
> collect in getname() with the proper audit_context().  In other words,
> we do the getname() in one context, and then the actual open operation
> in another, and the audit filename info is lost in the switch.
> 
> I think this is related to another issue that Jens and I have been
> discussing relating to connect() and sockaddrs.  We had been hoping
> that the issue we were seeing with sockaddrs was just a special case
> with connect, but it doesn't look like that is the case.
> 
> I'm going to be a bit busy this week with conferences, but given the
> previous discussions with Jens as well as this new issue, I suspect
> that we are going to need to do some work to support creation of a
> thin, or lazily setup, audit_context that we can initialize in the
> io_uring prep routines for use by getname(), move_addr_to_kernel(),
> etc., store in the io_kiocb struct, and then fully setup in
> audit_uring_entry().

I'd prefer not to leak that much into io_uring's hot path. I
don't understand the specifics of the problem, but let me throw
out some ideas:

1) Each io_uring request has a reference to the task it was
submitted from, i.e. req->task. Can you use the context from
the submitting task? E.g.

audit_ctx = req->task->audit_context

io_uring explicitly lists all tasks using it, and you can easily
hook in there and initialise audit so that req->ctx->audit_context
is always set.
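
Very roughly, for 1) I mean something like below. audit_adopt_ctx()
is a made-up name just to show the idea; req->task, req->opcode and
audit_uring_entry() should be real as of 6.9, afaik.

/* sketch only, not a real patch */
static void io_wq_audit_begin(struct io_kiocb *req)
{
        /* the request keeps a reference to the submitting task */
        struct audit_context *actx = req->task->audit_context;

        /* hypothetical: use the submitter's context for this work item */
        audit_adopt_ctx(actx);
        audit_uring_entry(req->opcode);         /* existing hook */
}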

2) Can we initialise audit for each io-wq task when we create it?
We could also try to share the audit ctx between io-wq tasks and
the task they were created for.

  
>> ? link_path_walk.part.0.constprop.0 (fs/namei.c:2324)
>> path_openat (fs/namei.c:3550 fs/namei.c:3796)
>> do_filp_open (fs/namei.c:3826)
>> ? alloc_fd (./arch/x86/include/asm/paravirt.h:589 (discriminator 10)
>> ./arch/x86/include/asm/qspinlock.h:57 (discriminator 10)
>> ./include/linux/spinlock.h:204 (discriminator 10)
>> ./include/linux/spinlock_api_smp.h:142 (discriminator 10)
>> ./include/linux/spinlock.h:391 (discriminator 10) fs/file.c:553
>> (discriminator 10))
>> io_openat2 (io_uring/openclose.c:140)
>> io_openat (io_uring/openclose.c:178)
>> io_issue_sqe (io_uring/io_uring.c:1897)
>> io_wq_submit_work (io_uring/io_uring.c:2006)
>> io_worker_handle_work (io_uring/io-wq.c:540 io_uring/io-wq.c:597)
>> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
>> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
>> ? raw_spin_rq_unlock (./arch/x86/include/asm/paravirt.h:589
>> ./arch/x86/include/asm/qspinlock.h:57 ./include/linux/spinlock.h:204
>> ./include/linux/spinlock_api_smp.h:142 kernel/sched/core.c:603)
>> ? finish_task_switch.isra.0 (./arch/x86/include/asm/irqflags.h:42
>> ./arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1397
>> kernel/sched/core.c:5163 kernel/sched/core.c:5281)
>> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
>> ret_from_fork (arch/x86/kernel/process.c:156)
>> ? __pfx_io_wq_worker (io_uring/io-wq.c:627)
>> ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
> 

-- 
Pavel Begunkov


* Re: io_uring: worker thread NULL dereference during openat op
  2024-04-16 13:45   ` Pavel Begunkov
@ 2024-04-16 17:15     ` Paul Moore
  0 siblings, 0 replies; 4+ messages in thread
From: Paul Moore @ 2024-04-16 17:15 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Dan Clash, io-uring, audit, Jens Axboe

On Tue, Apr 16, 2024 at 9:45 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 4/16/24 04:29, Paul Moore wrote:
> > On Mon, Apr 15, 2024 at 7:26 PM Dan Clash <daclash@linux.microsoft.com> wrote:
> >>
> >> Below is a test program that causes multiple io_uring worker threads to
> >> hit a NULL dereference while executing openat ops.

...

> > Thanks for the well documented bug report!
> >
> > That's interesting, it looks like audit_inode() is potentially being
> > passed a filename struct with a NULL name field (filename::name ==
> > NULL).  Given the IOSQE_ASYNC and what looks like io_uring calling
> > getname() from within the __io_openat_prep() function, I suspect the
> > issue is that we aren't associating the filename information we
> > collect in getname() with the proper audit_context().  In other words,
> > we do the getname() in one context, and then the actual open operation
> > in another, and the audit filename info is lost in the switch.
> >
> > I think this is related to another issue that Jens and I have been
> > discussing relating to connect() and sockaddrs.  We had been hoping
> > that the issue we were seeing with sockaddrs was just a special case
> > with connect, but it doesn't look like that is the case.
> >
> > I'm going to be a bit busy this week with conferences, but given the
> > previous discussions with Jens as well as this new issue, I suspect
> > that we are going to need to do some work to support creation of a
> > thin, or lazily setup, audit_context that we can initialize in the
> > io_uring prep routines for use by getname(), move_addr_to_kernel(),
> > etc., store in the io_kiocb struct, and then fully setup in
> > audit_uring_entry().
>
> I'd prefer not to leak that much into io_uring's hot path. I
> don't understand the specifics of the problem, but let me throw
> out some ideas:
>
> 1) Each io_uring request has a reference to the task it was
> submitted from, i.e. req->task. Can you use the context from
> the submitting task? E.g.
>
> audit_ctx = req->task->audit_context
>
> io_uring explicitly lists all tasks using it, and you can easily
> hook in there and initialise audit so that req->ctx->audit_context
> is always set.

In a few cases (see my previous comments for examples), the work done
in the io_uring prep functions needs to be associated with the work
done in the actual operation, e.g. openat, connect, etc.  The reason
for this is that we need to carry over some bits of state from the
prep portion to the operation itself so that it can be included in the
audit_uring_entry()/_exit() region.  Unfortunately there are a number
of reasons why we can't leverage the submitter's audit_context here,
but most of the reasons essentially come down to the disconnected and
async nature of io_uring operations and the separation between the
prep work and actual operation.

> 2) Can we initialise audit for each io-wq task when we create it?
> We could also try to share the audit ctx between io-wq tasks and
> the task they were created for.

There is still the issue of connecting each individual prep work state
with the associated operation within the audit_uring_entry()/_exit()
region.

-- 
paul-moore.com

