* Canceled read requests never completed
@ 2022-01-18 15:13 Florian Fischer
  2022-01-18 19:07 ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Fischer @ 2022-01-18 15:13 UTC (permalink / raw)
  To: io-uring; +Cc: flow

[-- Attachment #1: Type: text/plain, Size: 4957 bytes --]

Hello,

during our research on entangling io_uring and parallel runtime systems, one of our test
cases results in situations where an `IORING_OP_ASYNC_CANCEL` request cannot find (-ENOENT)
or cannot cancel (-EALREADY) a previously submitted read of an event file descriptor.
However, the previously submitted read also never generates a CQE.
We now wonder whether this is a bug in the kernel or, at least in the -EALREADY case, whether it works as intended.
Our current architecture expects that a request eventually produces a CQE when it is canceled.


# Reproducer pseudo-code:

create N eventfds
create N threads
  
thread_function:
  create thread-private io_uring queue pair
  for (i = 0, i < ITERATIONS, i++)
    submit read from eventfd n
    submit read from eventfd (n + 1) % N
    submit write to eventfd (n + 2) % N
    await completions until the write completion was reaped
    submit cancel requests for the two read requests
    await all outstanding requests (minus a possible already completed read request)
  
Note that:
- Each eventfd is read twice but only written once.
- The read requests are canceled independently of their state. 
- There are five io_uring requests per loop iteration.


# Expectation

Each of the five submitted requests should complete:
* The write is always successful because writing to an eventfd only blocks
  if the addition would push the counter beyond 0xfffffffffffffffe, and we add only 1 in each iteration.
  Furthermore, a read from the file descriptor resets the counter to 0
  (see the small sketch after this list).
* The cancel requests always complete, with different return values
  depending on the state of the read request being canceled.
* The read requests should always complete, either because some data is available
  to read or because they were canceled.
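
For reference, a minimal sketch of the eventfd(2) semantics relied on above (plain
read(2)/write(2), no io_uring; EFD_NONBLOCK is only used so the example terminates
instead of blocking on the final read):

  #include <assert.h>
  #include <errno.h>
  #include <stdint.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  int main(void) {
      int efd = eventfd(0, EFD_NONBLOCK);                    /* counter starts at 0 */
      uint64_t val = 1;
      assert(write(efd, &val, sizeof(val)) == sizeof(val));  /* add 1 to the counter */
      assert(read(efd, &val, sizeof(val)) == sizeof(val));   /* returns 1, resets counter to 0 */
      assert(val == 1);
      assert(read(efd, &val, sizeof(val)) == -1 && errno == EAGAIN); /* counter is 0: would block */
      close(efd);
      return 0;
  }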


# Observation:

Sometimes threads block in io_uring_enter forever because one read request
is never completed and the cancel for that read returned -ENOENT or -EALREADY.

A C program to reproduce this situation is attached.
It contains the essence of the previously mentioned test case along with instructions
on how to compile and execute it.

The following log excerpt was generated using a version of the reproducer
in which each write adds 0 to the eventfd counter and thus never completes any read request.
This means all read requests should be canceled, and all cancel requests should either
return 0 (the request was found and canceled) or -EALREADY (the read is already
executing and should be interrupted).

  0 Prepared read request (evfd: 0, tag: 1)
  0 Submitted 1 requests -> 1 inflight
  0 Prepared read request (evfd: 1, tag: 2)
  0 Submitted 1 requests -> 2 inflight
  0 Prepared write request (evfd: 2)
  0 Submitted 1 requests -> 3 inflight
  0 Collect write completion: 8
  0 Prepared cancel request for 1
  0 Prepared cancel request for 2
  0 Submitted 2 requests -> 4 inflight
  0 Collect read 1 completion: -125 - Operation canceled
  0 Collect cancel read 1 completion: 0
  0 Collect cancel read 2 completion: -2 - No such file or directory
  
Thread 0 blocks forever because the second read could not be
canceled (-ENOENT in the last line) but no completion is ever created for it.

The far more common situation with the reproducer, when adding 1 to the eventfds in each loop,
is that a request is not canceled and the cancel attempt returns -EALREADY.
There is no further progress because the writer has already finished its loop and the cancel
apparently does not actually cancel the request.

  1 Starting iteration 996
  1 Prepared read request (evfd: 1, tag: 1)
  1 Submitted 1 requests -> 1 inflight
  1 Prepared read request (evfd: 2, tag: 2)
  1 Submitted 1 requests -> 2 inflight
  1 Prepared write request (evfd: 0)
  1 Submitted 1 requests -> 3 inflight
  1 Collect write completion: 8
  1 Prepared cancel request for read 1
  1 Prepared cancel request for read 2
  1 Submitted 2 requests -> 4 inflight
  1 Collect read 1 completion: -125 - Operation canceled
  1 Collect cancel read 1 completion: 0
  1 Collect cancel read 2 completion: -114 - Operation already in progress

According to the io_uring_enter(2) man page, an IORING_OP_ASYNC_CANCEL returning -EALREADY apparently
may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
In this case, the request may or may not terminate."
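
For context, our abstraction essentially performs the following accounting after
submitting a cancel (a hypothetical helper shown only for illustration, not part of
the attached reproducer; it needs just <liburing.h>). It only terminates if the
canceled request itself still produces a CQE:

  /* After a cancel SQE targeting 'tag' has been submitted, wait for two CQEs:
   * the cancel's own CQE and the target's CQE (usually res == -ECANCELED).
   * If the target's CQE never arrives, this loop blocks forever, which is
   * exactly the hang described above. */
  static void await_cancel_and_target(struct io_uring* ring, void* cancel_tag, void* tag)
  {
      int outstanding = 2;
      while (outstanding) {
          struct io_uring_cqe* cqe;
          if (io_uring_wait_cqe(ring, &cqe) < 0)
              break; /* error handling elided */
          void* data = io_uring_cqe_get_data(cqe);
          if (data == cancel_tag || data == tag)
              --outstanding;
          io_uring_cqe_seen(ring, cqe);
      }
  }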

I could reliably reproduce the behavior on different hardware, with Linux versions
from 5.9 to 5.16, and with liburing versions 0.7 and 2.1.

With Linux 5.6 I was not able to reproduce this cancel miss.

So is the situation we see intended behavior of the API, or is it a race in the
io_uring cancel code?
If it is intended, then it becomes really hard to build reliable abstractions
on top of io_uring's cancellation.
We would really like to have the invariant that a canceled io_uring operation eventually
generates a CQE, either completed or canceled/interrupted.

---
Florian Fischer      &  Florian Schmaus
f.fischer@cs.fau.de     flow@cs.fau.de

[-- Attachment #2: threaded_cancel_repro.c --]
[-- Type: text/plain, Size: 12668 bytes --]

/** 
 * io_uring cancel miss reproducer
 * Copyright 2022 Florian Fischer <f.fischer@cs.fau.de>
 *
 * io_urings's IORING_OP_ASYNC_CANCEL sometimes does not cancel inflight requests.
 *
 * Tested systems:
 * - Ryzen 7 arch linux
 *  - liburing 2.1
 *  - reproducible on: 5.15.12, 5.16
 *
 * - Xeon E7-4830 Debian 11
 *   - liburing 0.7
 *   - reproducible on: 5.9, 5.15, 5.16
 *   - not reproducible on: 5.6
 * 
 * Building the reproducer with logging and debug symbols:
 *
 * gcc -Werror -Wall -g -O3 threaded_cancel_repro.c -o threaded_cancel_repro -pthread -luring
 *
 * Steps to reproduce on my system:
 * 1. Compile the reproducer
 * 2. Open reproducer with gdb using a memory mapped log file
 *    $ LOG_FILE=cancel.log gdb ./threaded_cancel_repro
 * 3. Disable pagination
 *    (gdb) set pagination off
 * 4. Run the reproducer
 *    (gdb) r
 * 5. Repeat the reproducer until it hangs (takes less than 1s on my system)
 *    (gdb) b exit
 *    (gdb) command
 *    > r
 *    > end
 *    (gdb) r
 * 6. Stop the execution with Ctrl + c
 * 7. Walk up the stack until we are in threaded_cancel_repro.c code
 *    (gdb) up
 * 8. Disable scheduling (just to be safe)
 *    (gdb) set scheduler-locking on
 * 9. Trim the log file
 *    (gdb) p lbuf_trim()
 * 10. Open the log file in an editor and search for 'res -2'
 *
 * Explanation:
 * The reproducer opens n (3) event file descriptors and starts n (3) threads.
 * Each thread creates an io_uring and enters a loop.
 * In each iteration the thread submits a read request from evfds[thread_id].
 * Submits a second read request from evfds[(thread_id + 1) % n].
 * And issues a write request for evfds[(thread_id + 2) % n].
 * The write request is awaited and possible read completions remembered.
 * Then two cancel requests for the reads are submitted and the outstanding
 * requests are awaited.
 *
 * The reproducer gets stuck either because a cancel request returns with -2 (ENOENT),
 * meaning the request to cancel was not found, while the actual request is
 * never completed,
 * or because a cancellation that returned -EALREADY did not properly cancel
 * the read, no more writes to the eventfd will happen, and therefore the read
 * request is never completed.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <liburing.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>
#include <unistd.h>

#define UNUSED __attribute__((unused))

#ifdef NDEBUG
#define POSSIBLY_UNUSED UNUSED
#else
#define POSSIBLY_UNUSED
#endif

/**
 * Forward declarations for the log functionality
 */
struct log_buf;
static void lbuf_init(const char* log_file);
static void lbuf_log(const char* msg);
static void lbuf_destroy();
static void log_to_lbuf(const char* msg) { lbuf_log(msg); }

static void log_with_printf(const char* msg) { fputs(msg, stdout); }


POSSIBLY_UNUSED static const char* req_to_str(void* data);
POSSIBLY_UNUSED static char* prepare_log_msg(const char* fmt, ...);
static void(*log_func)(const char* msg) = log_with_printf;

#ifdef NDEBUG
#define LOG_COMPLETION(tag, res)
#else
#define LOG_COMPLETION(tag, res)                                             \
do {                                                                         \
	const char* req = req_to_str(tag);                                       \
	if ((res) >= 0) {                                                        \
		LOG("Collect %s completion: %d", req, (res));                        \
	} else {                                                                 \
		LOG("Collect %s completion: %d - %s", req, (res), strerror(-(res))); \
	}                                                                        \
} while(0);
#endif

#ifdef NDEBUG
#define LOG(...)
#else
#define LOG(fmt, ...)                                                     \
do {                                                                      \
	char* msg = prepare_log_msg("%3d " fmt "\n", thread_id, __VA_ARGS__); \
	log_func(msg);                                                        \
	free(msg);                                                            \
} while(0);
#endif

#define IORING_ENTRIES 16

#define CANCEL_MASK (1L << 63)
#define WRITE_TAG ((void*)0x42)
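/* The reads are tagged 1 and 2 (see req_to_str() below); the write uses WRITE_TAG;
 * a cancel request reuses the tag of the read it targets with the CANCEL_MASK bit
 * set, e.g. 0x8000000000000001 is the cancel of read 1. */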

#define ITERATIONS 1000

_Thread_local size_t thread_id;

static pthread_t* threads;
static int* evfds;
static int nthreads;

static struct io_uring_sqe* checked_get_sqe(struct io_uring* ring) {
	struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
	assert(sqe);
	return sqe;
}

static void prep_evfd_read(struct io_uring* ring, int evfd, uint64_t* buf, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_read(sqe, evfds[evfd], buf, sizeof(*buf), 0);
	io_uring_sqe_set_data(sqe, (void*)tag);
	LOG("Prepared read request (evfd: %d, tag: %ld)", evfd, tag);
}

static void prep_evfd_write(struct io_uring* ring, int evfd, uint64_t* buf) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_write(sqe, evfds[evfd], buf, sizeof(*buf), 0);
	io_uring_sqe_set_data(sqe, WRITE_TAG);
	LOG("Prepared write request (evfd: %d)", evfd);
}

static void prep_cancel(struct io_uring* ring, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_cancel(sqe, (void*)tag, 0);
	void* cancel_tag = (void*)(tag | CANCEL_MASK);
	io_uring_sqe_set_data(sqe, cancel_tag);
	LOG("Prepared cancel request for read %ld", tag);
}

static void checked_submit_n(struct io_uring* ring, int* inflight, int n) {
	int res = io_uring_submit(ring);
	if (res < 0) { errno = -res; err(EXIT_FAILURE, "io_uring_submit failed"); }
	assert(res == n);
	*inflight = *inflight + n;
	LOG("Submitted %d requests -> %d inflight", n, *inflight);
}

static const char* req_to_str(void* data) {
	uintptr_t tag = (uintptr_t)data;
	switch(tag) {
	case (uintptr_t)WRITE_TAG:
		return "write";
	case 1:
		return "read 1";
	case 2:
		return "read 2";
	case 0x8000000000000001:
		return "cancel read 1";
	case 0x8000000000000002:
		return "cancel read 2";
	default:
		err(EXIT_FAILURE, "unknown tag encountered");
	}
}

static void checked_submit_one(struct io_uring* ring, int* inflight) {
	checked_submit_n(ring, inflight, 1);
}

static void collect_n_completions(struct io_uring* ring, int n) {
	struct io_uring_cqe* cqe;
	for (int i = 0; i < n; ++i) {
		int res = io_uring_wait_cqe(ring, &cqe);
		if (res) err(EXIT_FAILURE, "io_uring_wait_cqe failed");
		LOG_COMPLETION(io_uring_cqe_get_data(cqe), cqe->res);
		io_uring_cqe_seen(ring, cqe);
	}
}

void* thread_func(void* arg) {
	thread_id = (uintptr_t)(arg);

	struct io_uring ring;
	int res = io_uring_queue_init(IORING_ENTRIES, &ring, 0);
	if (res) err(EXIT_FAILURE, "io_uring_queue_init failed");

	for (unsigned i = 0; i < ITERATIONS; ++i) {
		uint64_t rbuf1 = 0;
		uint64_t rbuf2 = 0;
		uint64_t wbuf = 1;
		/* uint64_t wbuf = 0; */

		int inflight = 0;

		LOG("Starting iteration %u", i);
		prep_evfd_read(&ring, thread_id, &rbuf1, 1);
		checked_submit_one(&ring, &inflight);

		prep_evfd_read(&ring, (thread_id + 1) % nthreads, &rbuf2, 2);
		checked_submit_one(&ring, &inflight);

		prep_evfd_write(&ring, (thread_id + 2) % nthreads, &wbuf);
		checked_submit_one(&ring, &inflight);

		struct io_uring_cqe* cqe;
		void* data = NULL;
		// Await the write completion
		while (data != WRITE_TAG) {
			res = io_uring_wait_cqe(&ring, &cqe);
			if (res) { errno = -res; err(EXIT_FAILURE, "io_uring_wait_cqe failed"); }

			data = io_uring_cqe_get_data(cqe);
			if (cqe->res < 0)
				errx(EXIT_FAILURE, "request with tag %p failed: %s", data,
				     strerror(-cqe->res));

			LOG_COMPLETION(data, cqe->res);
			io_uring_cqe_seen(&ring, cqe);
			--inflight;
		}

		prep_cancel(&ring, 1);
		prep_cancel(&ring, 2);
		checked_submit_n(&ring, &inflight, 2);
		collect_n_completions(&ring, inflight);
	}

	return NULL;
}

int main() {
	nthreads = get_nprocs();
	/* nthreads = 3; */
	evfds = malloc(sizeof(int) * nthreads);
	if (!evfds) err(EXIT_FAILURE, "malloc failed");

	char *log_file = getenv("LOG_FILE");
	if (log_file) {
		lbuf_init(log_file);
		log_func = log_to_lbuf;
	}

	for (unsigned i = 0; i < nthreads; ++i) {
		evfds[i] = eventfd(0, 0);
		if (evfds[i] == -1) err(EXIT_FAILURE, "eventfd failed");
	}

	threads = malloc(sizeof(pthread_t) * nthreads);
	if (!threads) err(EXIT_FAILURE, "malloc failed");

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_create(&threads[i], NULL, thread_func, (void*)(uintptr_t)i);
		if (errno) err(EXIT_FAILURE, "pthread_create failed");
	}

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_join(threads[i], NULL);
		if (errno) err(EXIT_FAILURE, "pthread_join failed");
	}

	if (log_file)
		lbuf_destroy();

	exit(EXIT_SUCCESS);
}

/**
 * Logging code not relevant for the reproducer.
 * It is used to log as fast as possible to a memory mapped file.
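 *
 * Sketch of the scheme: the log file is covered by BUFFER_COUNT (3) mmap'ed windows
 * of BUFFER_SIZE (1 GiB) each. A writer atomically bumps the global position and
 * memcpy()s its message into the window(s) the position falls into. The writer whose
 * message starts exactly at a window boundary grows the file and remaps an already
 * filled window to the next file offset, so logging never waits for file I/O
 * (see the NOTE in lbuf_log() about why this is not fully sound).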
 */

static char* prepare_log_msg(const char* fmt, ...) {
	va_list args;
	va_start(args, fmt);

	char* msg;
	if (vasprintf(&msg, fmt, args) == -1)
		err(EXIT_FAILURE, "vasprintf failed"); 

	va_end(args);
	return msg;
}

#define BUFFER_COUNT 3L
#define BUFFER_SIZE (1L << 30)

static struct log_buf {
	char* _Atomic bufs[BUFFER_COUNT];
	size_t _Atomic pos;
	int fd;
} lbuf;

static char* lbuf_get_buf(size_t pos) {
	return lbuf.bufs[(pos / BUFFER_SIZE) % BUFFER_COUNT];
}

void lbuf_log(const char* msg) {
	const size_t len = strlen(msg);
	const size_t start = atomic_fetch_add_explicit(&lbuf.pos, len, memory_order_relaxed);
	const size_t end = start + len - 1;
	char* buf = lbuf_get_buf(start);
	const size_t offset = start % BUFFER_SIZE;
	char* end_buf = lbuf_get_buf(end);

	// The message fits completely into the current buffer
	if (buf == end_buf) {
		memcpy(&buf[offset], msg, len);

		 // Are we still in the active buffer?
		if (offset > 0)
			return;

	// The message spans the active and the next buffer
	} else {
		const size_t left_in_active = BUFFER_SIZE - (start % BUFFER_SIZE);
		memcpy(&buf[offset], msg, left_in_active);
		memcpy(&end_buf[0], &msg[left_in_active], len - left_in_active);
	}

	// This is the first overflow; there is no old buffer to remap yet
	if (start < BUFFER_SIZE * 2)
		return;

	// We are the first to write into the next buffer, therefore we are responsible
	// for remapping the old one.

	// NOTE: This is NOT sound; it is possible that other threads are still using the old
	// buffer. But with a big enough BUFFER_SIZE it is likely that all threads are using
	// the currently active buffer.

	// Active buffer -> old buffer
	// Fresh buffer -> active buffer
	// Remap old buffer -> next fresh buffer

	// Remap the old buffer
	const size_t old_buf_idx = ((start / BUFFER_SIZE) - 1) % BUFFER_COUNT;
	char* old_buf = lbuf.bufs[old_buf_idx];

	// The buffer [0, BUFFER_SIZE) is the first active buffer
	// Therefore at bufPos = BUFFER_SIZE + 1 we are in the second active buffer
	const size_t nth_active = (end / BUFFER_SIZE) + 1;

	// Our log file has the size of all ever active buffers plus the new fresh one
	const off_t log_file_size = (off_t)((nth_active + 1) * BUFFER_SIZE);

	// Grow the log file
	if (ftruncate(lbuf.fd, log_file_size))
		err(EXIT_FAILURE, "growing log file failed");

	const off_t next_offset = (off_t)(nth_active * BUFFER_SIZE);
	void* mem = mmap(NULL, BUFFER_SIZE, PROT_WRITE, MAP_FILE | MAP_SHARED, lbuf.fd, next_offset);
	if (mem == MAP_FAILED)
		err(EXIT_FAILURE, "mmap of fresh buffer failed");

	lbuf.bufs[old_buf_idx] = (char*)(mem);

	if (munmap(old_buf, BUFFER_SIZE))
		err(EXIT_FAILURE, "munmap of replaced buffer failed");
}

void lbuf_init(const char* log_file) {
	lbuf.pos = 0;
	lbuf.fd = open(log_file,
	                O_RDWR | O_TRUNC | O_CREAT,
	                S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

	if (lbuf.fd == -1)
		err(EXIT_FAILURE, "opening log file faield");

	for (size_t i = 0; i < BUFFER_COUNT; ++i) {
		const off_t off = (off_t)(i * BUFFER_SIZE);
		void* mem = mmap(NULL, BUFFER_SIZE, PROT_WRITE, MAP_FILE | MAP_SHARED, lbuf.fd, off);
		if (mem == MAP_FAILED)
			err(EXIT_FAILURE, "map ping log buffer failed");
		lbuf.bufs[i] = (char*)mem;
	}

	if (ftruncate(lbuf.fd, BUFFER_COUNT * BUFFER_SIZE))
		err(EXIT_FAILURE, "initial log file truncation failed");
}

void lbuf_trim() {
	const off_t final_pos = (off_t)(lbuf.pos);
	if (ftruncate(lbuf.fd, final_pos))
		err(EXIT_FAILURE, "truncating log buffer failed");
}

void lbuf_destroy() {
	lbuf_trim();

	for (unsigned i = 0; i < BUFFER_COUNT; ++i) {
		char* buf = lbuf.bufs[i];
		int ret = munmap(buf, BUFFER_SIZE);
		if (ret)
			err(EXIT_FAILURE, "unmapping log buffer failed");
	}

	close(lbuf.fd);
}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Canceled read requests never completed
  2022-01-18 15:13 Canceled read requests never completed Florian Fischer
@ 2022-01-18 19:07 ` Jens Axboe
  2022-01-18 20:05   ` Florian Fischer
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2022-01-18 19:07 UTC (permalink / raw)
  To: io-uring, flow

On 1/18/22 8:13 AM, Florian Fischer wrote:
> Hello,
> 
> during our research on entangling io_uring and parallel runtime systems one of our test
> cases results in situations where an `IORING_OP_ASYNC_CANCEL` request can not find (-ENOENT)
> or not cancel (EALREADY) a previously submitted read of an event file descriptor. 
> However, the previously submitted read also never generates a CQE.
> We now wonder if this is a bug in the kernel, or, at least in the case of EALRADY, works as intended.
> Our current architecture expects that a request eventually creates a CQE when canceled.
> 
> 
> # Reproducer pseudo-code:
> 
> create N eventfds
> create N threads
>   
> thread_function:
>   create thread-private io_uring queue pair
>   for (i = 0, i < ITERATIONS, i++)
>     submit read from eventfd n
>     submit read from eventfd (n + 1) % N
>     submit write to eventfd (n + 2) % N
>     await completions until the write completion was reaped
>     submit cancel requests for the two read requests
>     await all outstanding requests (minus a possible already completed read request)
>   
> Note that:
> - Each eventfd is read twice but only written once.
> - The read requests are canceled independently of their state. 
> - There are five io_uring requests per loop iteration
> 
> 
> # Expectation
> 
> Each of the five submitted request should be completed:
> * Write is always successful because writing to an eventfd only blocks
>   if the counter reaches 0xfffffffffffffffe and we add only 1 in each iteration.
>   Furthermore the read from the file descriptor resets the counter to 0.
> * The cancel requests are always completed with different return values
>   dependent on the state of the read request to cancel.
> * The read requests should always be completed either because some data is available
>   to read or because they are canceled.
> 
> 
> # Observation:
> 
> Sometimes threads block in io_uring_enter forever because one read request
> is never completed and the cancel of such read returned with -ENOENT or -EALREADY.
> 
> A C program to reproduce this situation is attached.
> It contains the essence of the previously mentioned test case with instructions
> how to compile and execute it.
> 
> The following log excerpt was generated using a version of the reproducer
> where each write adds 0 to the eventfd count and thus not completing read requests.
> This means all read request should be canceled and all cancel requests should either
> return with 0 (the request was found and canceled) or -EALREADY the read is already
> in execution and should be interrupted.
> 
>   0 Prepared read request (evfd: 0, tag: 1)
>   0 Submitted 1 requests -> 1 inflight
>   0 Prepared read request (evfd: 1, tag: 2)
>   0 Submitted 1 requests -> 2 inflight
>   0 Prepared write request (evfd: 2)
>   0 Submitted 1 requests -> 3 inflight
>   0 Collect write completion: 8
>   0 Prepared cancel request for 1
>   0 Prepared cancel request for 2
>   0 Submitted 2 requests -> 4 inflight
>   0 Collect read 1 completion: -125 - Operation canceled
>   0 Collect cancel read 1 completion: 0
>   0 Collect cancel read 2 completion: -2 - No such file or directory
>   
> Thread 0 blocks forever because the second read could not be
> canceled (-ENOENT in the last line) but no completion is ever created for it.
> 
> The far more common situation with the reproducer and adding 1 to the eventfds in each loop
> is that a request is not canceled and the cancel attempt returned with -EALREADY.
> There is no progress because the writer has already finished its loop and the cancel
> apparently does not really cancel the request.
> 
>   1 Starting iteration 996
>   1 Prepared read request (evfd: 1, tag: 1)
>   1 Submitted 1 requests -> 1 inflight
>   1 Prepared read request (evfd: 2, tag: 2)
>   1 Submitted 1 requests -> 2 inflight
>   1 Prepared write request (evfd: 0)
>   1 Submitted 1 requests -> 3 inflight
>   1 Collect write completion: 8
>   1 Prepared cancel request for read 1
>   1 Prepared cancel request for read 2
>   1 Submitted 2 requests -> 4 inflight
>   1 Collect read 1 completion: -125 - Operation canceled
>   1 Collect cancel read 1 completion: 0
>   1 Collect cancel read 2 completion: -114 - Operation already in progress
> 
> After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
> In this case, the request may or may not terminate."
> 
> I could reliably reproduce the behavior on different hardware, linux versions
> from 5.9 to 5.16 as well as liburing versions 0.7 and 2.1.
> 
> With linux 5.6 I was not able to reproduce this cancel miss.
> 
> So is the situation we see intended behavior of the API or is it a faulty race in the
> io_uring cancel code?
> If it is intended then it becomes really hard to build reliable abstractions
> using io_uring's cancellation.
> We really like to have the invariant that a canceled io_uring operation eventually
> generates a cqe, either completed or canceled/interrupted.

I took a look at this, and my theory is that the request cancelation
ends up happening right in between when the work item is moved from
the work list to the worker itself. The way the async queue works,
the work item is sitting in a list until it gets assigned by a worker.
When that assignment happens, it's removed from the general work list
and then assigned to the worker itself. There's a small gap there where
the work cannot be found in the general list, and isn't yet findable in
the worker itself either.
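
In simplified form (illustration only, not the literal io-wq code), the window
looks roughly like this:

  /* worker side, simplified */
  raw_spin_lock(&wqe->lock);
  work = io_get_next_work(acct, worker);  /* now gone from the general work list */
  raw_spin_unlock(&wqe->lock);
  /* <-- a cancel running here finds the work neither in the general list
   *     nor as worker->cur_work */
  spin_lock(&worker->lock);
  worker->cur_work = work;                /* findable again as the active work */
  spin_unlock(&worker->lock);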

Do you always see -ENOENT from the cancel when you get the hang
condition?

I'll play with this a bit and see if we can't close this hole so the
work is always reliably discoverable (and hence can get canceled).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Canceled read requests never completed
  2022-01-18 19:07 ` Jens Axboe
@ 2022-01-18 20:05   ` Florian Fischer
  2022-01-18 23:36     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Fischer @ 2022-01-18 20:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, flow

> > After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
> > may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
> > In this case, the request may or may not terminate."
> 
> I took a look at this, and my theory is that the request cancelation
> ends up happening right in between when the work item is moved between
> the work list and to the worker itself. The way the async queue works,
> the work item is sitting in a list until it gets assigned by a worker.
> When that assignment happens, it's removed from the general work list
> and then assigned to the worker itself. There's a small gap there where
> the work cannot be found in the general list, and isn't yet findable in
> the worker itself either.
> 
> Do you always see -ENOENT from the cancel when you get the hang
> condition?

No, we also, and actually more commonly, observe the cancel returning -EALREADY while
the canceled read request never gets completed.

As shown in the log snippet I included below.

> > The far more common situation with the reproducer and adding 1 to the eventfds in each loop
> > is that a request is not canceled and the cancel attempt returned with -EALREADY.
> > There is no progress because the writer has already finished its loop and the cancel
> > apparently does not really cancel the request.
> > 
> >   1 Starting iteration 996
> >   1 Prepared read request (evfd: 1, tag: 1)
> >   1 Submitted 1 requests -> 1 inflight
> >   1 Prepared read request (evfd: 2, tag: 2)
> >   1 Submitted 1 requests -> 2 inflight
> >   1 Prepared write request (evfd: 0)
> >   1 Submitted 1 requests -> 3 inflight
> >   1 Collect write completion: 8
> >   1 Prepared cancel request for read 1
> >   1 Prepared cancel request for read 2
> >   1 Submitted 2 requests -> 4 inflight
> >   1 Collect read 1 completion: -125 - Operation canceled
> >   1 Collect cancel read 1 completion: 0
> >   1 Collect cancel read 2 completion: -114 - Operation already in progress

^- the cancel returned with -EALREADY but the canceled read (the second
prepared read request) is never completed.

> 
> I'll play with this a bit and see if we can't close this hole so the
> work is always reliably discoverable (and hence can get canceled).

Thanks for your effort!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Canceled read requests never completed
  2022-01-18 20:05   ` Florian Fischer
@ 2022-01-18 23:36     ` Jens Axboe
  2022-01-19  2:32       ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2022-01-18 23:36 UTC (permalink / raw)
  To: io-uring, flow

On 1/18/22 1:05 PM, Florian Fischer wrote:
>>> After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
>>> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
>>> In this case, the request may or may not terminate."
>>
>> I took a look at this, and my theory is that the request cancelation
>> ends up happening right in between when the work item is moved between
>> the work list and to the worker itself. The way the async queue works,
>> the work item is sitting in a list until it gets assigned by a worker.
>> When that assignment happens, it's removed from the general work list
>> and then assigned to the worker itself. There's a small gap there where
>> the work cannot be found in the general list, and isn't yet findable in
>> the worker itself either.
>>
>> Do you always see -ENOENT from the cancel when you get the hang
>> condition?
> 
> No we also and actually more commonly observe cancel returning
> -EALREADY and the canceled read request never gets completed.
> 
> As shown in the log snippet I included below.

I think there are a couple of different cases here. Can you try the
below patch? It's against current -git.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index 5c4f582d6549..e8f8a5ee93c1 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -48,7 +48,8 @@ struct io_worker {
 	struct io_wqe *wqe;
 
 	struct io_wq_work *cur_work;
-	spinlock_t lock;
+	struct io_wq_work *next_work;
+	raw_spinlock_t lock;
 
 	struct completion ref_done;
 
@@ -529,9 +530,11 @@ static void io_assign_current_work(struct io_worker *worker,
 		cond_resched();
 	}
 
-	spin_lock(&worker->lock);
+	WARN_ON_ONCE(work != worker->next_work);
+	raw_spin_lock(&worker->lock);
 	worker->cur_work = work;
-	spin_unlock(&worker->lock);
+	worker->next_work = NULL;
+	raw_spin_unlock(&worker->lock);
 }
 
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work);
@@ -555,9 +558,20 @@ static void io_worker_handle_work(struct io_worker *worker)
 		 * clear the stalled flag.
 		 */
 		work = io_get_next_work(acct, worker);
-		if (work)
+		if (work) {
 			__io_worker_busy(wqe, worker, work);
 
+			/*
+			 * Make sure cancelation can find this, even before
+			 * it becomes the active work. That avoids a window
+			 * where the work has been removed from our general
+			 * work list, but isn't yet discoverable as the
+			 * current work item for this worker.
+			 */
+			raw_spin_lock(&worker->lock);
+			worker->next_work = work;
+			raw_spin_unlock(&worker->lock);
+		}
 		raw_spin_unlock(&wqe->lock);
 		if (!work)
 			break;
@@ -815,7 +829,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 
 	refcount_set(&worker->ref, 1);
 	worker->wqe = wqe;
-	spin_lock_init(&worker->lock);
+	raw_spin_lock_init(&worker->lock);
 	init_completion(&worker->ref_done);
 
 	if (index == IO_WQ_ACCT_BOUND)
@@ -973,6 +987,19 @@ void io_wq_hash_work(struct io_wq_work *work, void *val)
 	work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT));
 }
 
+static bool __io_wq_worker_cancel(struct io_worker *worker,
+				  struct io_cb_cancel_data *match,
+				  struct io_wq_work *work)
+{
+	if (work && match->fn(work, match->data)) {
+		work->flags |= IO_WQ_WORK_CANCEL;
+		set_notify_signal(worker->task);
+		return true;
+	}
+
+	return false;
+}
+
 static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
 {
 	struct io_cb_cancel_data *match = data;
@@ -981,13 +1008,11 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
 	 * Hold the lock to avoid ->cur_work going out of scope, caller
 	 * may dereference the passed in work.
 	 */
-	spin_lock(&worker->lock);
-	if (worker->cur_work &&
-	    match->fn(worker->cur_work, match->data)) {
-		set_notify_signal(worker->task);
+	raw_spin_lock(&worker->lock);
+	if (__io_wq_worker_cancel(worker, match, worker->cur_work) ||
+	    __io_wq_worker_cancel(worker, match, worker->next_work))
 		match->nr_running++;
-	}
-	spin_unlock(&worker->lock);
+	raw_spin_unlock(&worker->lock);
 
 	return match->nr_running && !match->cancel_all;
 }
@@ -1039,17 +1064,16 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 {
 	int i;
 retry:
-	raw_spin_lock(&wqe->lock);
 	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
 		struct io_wqe_acct *acct = io_get_acct(wqe, i == 0);
 
 		if (io_acct_cancel_pending_work(wqe, acct, match)) {
+			raw_spin_lock(&wqe->lock);
 			if (match->cancel_all)
 				goto retry;
-			return;
+			break;
 		}
 	}
-	raw_spin_unlock(&wqe->lock);
 }
 
 static void io_wqe_cancel_running_work(struct io_wqe *wqe,
@@ -1078,7 +1102,9 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 	for_each_node(node) {
 		struct io_wqe *wqe = wq->wqes[node];
 
+		raw_spin_lock(&wqe->lock);
 		io_wqe_cancel_pending_work(wqe, &match);
+		raw_spin_unlock(&wqe->lock);
 		if (match.nr_pending && !match.cancel_all)
 			return IO_WQ_CANCEL_OK;
 	}
@@ -1092,7 +1118,15 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 	for_each_node(node) {
 		struct io_wqe *wqe = wq->wqes[node];
 
+		raw_spin_lock(&wqe->lock);
+		io_wqe_cancel_pending_work(wqe, &match);
+		if (match.nr_pending && !match.cancel_all) {
+			raw_spin_unlock(&wqe->lock);
+			return IO_WQ_CANCEL_OK;
+		}
+
 		io_wqe_cancel_running_work(wqe, &match);
+		raw_spin_unlock(&wqe->lock);
 		if (match.nr_running && !match.cancel_all)
 			return IO_WQ_CANCEL_RUNNING;
 	}
@@ -1263,7 +1297,9 @@ static void io_wq_destroy(struct io_wq *wq)
 			.fn		= io_wq_work_match_all,
 			.cancel_all	= true,
 		};
+		raw_spin_lock(&wqe->lock);
 		io_wqe_cancel_pending_work(wqe, &match);
+		raw_spin_unlock(&wqe->lock);
 		free_cpumask_var(wqe->cpu_mask);
 		kfree(wqe);
 	}
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 422d6de48688..49f115a2dec4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6386,7 +6386,8 @@ static int io_try_cancel_userdata(struct io_kiocb *req, u64 sqe_addr)
 	WARN_ON_ONCE(!io_wq_current_is_worker() && req->task != current);
 
 	ret = io_async_cancel_one(req->task->io_uring, sqe_addr, ctx);
-	if (ret != -ENOENT)
+	//if (ret != -ENOENT)
+	if (ret == 0)
 		return ret;
 
 	spin_lock(&ctx->completion_lock);
@@ -6892,7 +6893,8 @@ static void io_wq_submit_work(struct io_wq_work *work)
 		io_queue_linked_timeout(timeout);
 
 	/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
-	if (work->flags & IO_WQ_WORK_CANCEL) {
+	if (work->flags & IO_WQ_WORK_CANCEL ||
+	    test_tsk_thread_flag(current, TIF_NOTIFY_SIGNAL)) {
 		io_req_task_queue_fail(req, -ECANCELED);
 		return;
 	}


-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Canceled read requests never completed
  2022-01-18 23:36     ` Jens Axboe
@ 2022-01-19  2:32       ` Jens Axboe
  2022-01-19 13:42         ` Florian Fischer
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2022-01-19  2:32 UTC (permalink / raw)
  To: io-uring, flow

On 1/18/22 4:36 PM, Jens Axboe wrote:
> On 1/18/22 1:05 PM, Florian Fischer wrote:
>>>> After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
>>>> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
>>>> In this case, the request may or may not terminate."
>>>
>>> I took a look at this, and my theory is that the request cancelation
>>> ends up happening right in between when the work item is moved between
>>> the work list and to the worker itself. The way the async queue works,
>>> the work item is sitting in a list until it gets assigned by a worker.
>>> When that assignment happens, it's removed from the general work list
>>> and then assigned to the worker itself. There's a small gap there where
>>> the work cannot be found in the general list, and isn't yet findable in
>>> the worker itself either.
>>>
>>> Do you always see -ENOENT from the cancel when you get the hang
>>> condition?
>>
>> No we also and actually more commonly observe cancel returning
>> -EALREADY and the canceled read request never gets completed.
>>
>> As shown in the log snippet I included below.
> 
> I think there are a couple of different cases here. Can you try the
> below patch? It's against current -git.

Cleaned it up and split it into functional bits, end result is here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.17

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Canceled read requests never completed
  2022-01-19  2:32       ` Jens Axboe
@ 2022-01-19 13:42         ` Florian Fischer
  0 siblings, 0 replies; 6+ messages in thread
From: Florian Fischer @ 2022-01-19 13:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Florian Schmaus, io-uring

[-- Attachment #1: Type: text/plain, Size: 2822 bytes --]

On 18.01.2022 19:32, Jens Axboe wrote:
> On 1/18/22 4:36 PM, Jens Axboe wrote:
> > On 1/18/22 1:05 PM, Florian Fischer wrote:
> >>>> After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
> >>>> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
> >>>> In this case, the request may or may not terminate."
> >>>
> >>> I took a look at this, and my theory is that the request cancelation
> >>> ends up happening right in between when the work item is moved between
> >>> the work list and to the worker itself. The way the async queue works,
> >>> the work item is sitting in a list until it gets assigned by a worker.
> >>> When that assignment happens, it's removed from the general work list
> >>> and then assigned to the worker itself. There's a small gap there where
> >>> the work cannot be found in the general list, and isn't yet findable in
> >>> the worker itself either.
> >>>
> >>> Do you always see -ENOENT from the cancel when you get the hang
> >>> condition?
> >>
> >> No we also and actually more commonly observe cancel returning
> >> -EALREADY and the canceled read request never gets completed.
> >>
> >> As shown in the log snippet I included below.
> > 
> > I think there are a couple of different cases here. Can you try the
> > below patch? It's against current -git.
> 
> Cleaned it up and split it into functional bits, end result is here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.17

Thanks. I have built and tested your patches.

The most common error we observed (read -> read -> write -> 2x cancel)
is no longer reproducible, and our original test case works flawlessly :)

Nor could I reproduce any hangs with cancel returning -ENOENT.

But I can still reliably reproduce stuck threads when not incrementing the eventfd
count, so that reads are never completed by data becoming available.
(read -> read -> write (do not increment evfd count) -> 2x cancel)

I further reduced the attached C program to reproduce the above problem.
The code is also available now at our gitlab [1].

The following log output was created with a less 'minimal' version still including
log functionality:

  75 Collect write completion: 8
  75 Collect cancel read 1 completion: 0
  75 Collect cancel read 2 completion: -114
  75 Collect read 1 completion: -125
  75 Collect read 2 completion: -4

  75 Collect write completion: 8
  75 Collect cancel read 1 completion: 0
  75 Collect cancel read 2 completion: -114
  75 Collect read 1 completion: -125
  thread 75 stuck here

The scenario seems extremely artificial, but nonetheless I think it should
work regardless of its usefulness.

Flo Fischer

[1]: https://gitlab.cs.fau.de/aj46ezos/io_uring-cancel/-/tree/minimal-write0

[-- Attachment #2: io_uring-cancel.c --]
[-- Type: text/plain, Size: 4862 bytes --]

/**
 * io_uring cancel miss reproducer
 * Copyright 2022 Florian Fischer <f.fischer@cs.fau.de>
 *
 * io_urings's IORING_OP_ASYNC_CANCEL sometimes does not cancel inflight requests.
 *
 * Tested systems:
 * - Xeon E7-4830 Debian 11
 *   - liburing 2.1
 *   - reproducible on: linux-block/io_uring-5.17
 *
 * Building the reproducer:
 *   $ gcc -Werror -Wall -g -O3 io_uring-cancel.c -o io_uring-cancel -pthread -luring
 *
 * Steps to reproduce on the above system:
 * 1. Compile the reproducer
 * 2. Run the reproducer in a loop
 *   $ for i in $(seq 100); do echo $i; ./io_uring-cancel; done
 *
 *   (I need < ~30 runs to reproduce a stuck thread with 96 threads)
 *
 * Explanation:
 * The reproducer opens #CPUs event file descriptors and starts #CPUs threads.
 * Each thread creates an io_uring and enters a loop.
 * In each iteration the thread submits a read request from evfds[thread_id].
 * Submits a second read request from evfds[(thread_id + 1) % #CPUs].
 * And issues a write request for evfds[(thread_id + 2) % #CPUs] but the write
 * request does not increase the eventfds count thus it will not complete any read.
 * The write request is awaited.
 * Then two cancel requests for the reads are submitted and the outstanding
 * requests are awaited.
 *
 * The reproducer gets stuck because a cancel request returning -EALREADY
 * did not properly cancel the read, and the read request is never completed.
 *
 * When passing a non-zero value to the reproducer as the write increment
 *   $ ./io_uring-cancel 1
 * I could no longer reproduce stuck threads.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <err.h>
#include <errno.h>
#include <liburing.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/sysinfo.h>
#include <unistd.h>

#define IORING_ENTRIES 2
#define ITERATIONS 1000

_Thread_local size_t thread_id;

static pthread_t* threads;
static int* evfds;
static unsigned nthreads;
static unsigned write_increment = 0;

static struct io_uring_sqe* checked_get_sqe(struct io_uring* ring) {
	struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
	assert(sqe);
	return sqe;
}

static void prep_evfd_read(struct io_uring* ring, int evfd, uint64_t* buf, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_read(sqe, evfds[evfd], buf, sizeof(*buf), 0);
	io_uring_sqe_set_data(sqe, (void*)tag);
}

static void prep_evfd_write(struct io_uring* ring, int evfd, uint64_t* buf) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_write(sqe, evfds[evfd], buf, sizeof(*buf), 0);
}

static void prep_cancel(struct io_uring* ring, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_cancel(sqe, (void*)tag, 0);
}

static void checked_submit_n(struct io_uring* ring, int n) {
	int res = io_uring_submit(ring);
	if (res < 0) { errno = -res; err(EXIT_FAILURE, "io_uring_submit failed"); }
	if (res != n) errx(EXIT_FAILURE, "io_uring_submit submitted fewer sqes than prepared");
}

static void checked_submit_one(struct io_uring* ring) {
	checked_submit_n(ring, 1);
}

static void collect_n_completions(struct io_uring* ring, int n) {
	struct io_uring_cqe* cqe;
	for (int i = 0; i < n; ++i) {
		int res = io_uring_wait_cqe(ring, &cqe);
		if (res) err(EXIT_FAILURE, "io_uring_wait_cqe failed");
		io_uring_cqe_seen(ring, cqe);
	}
}

void* thread_func(void* arg) {
	thread_id = (uintptr_t)(arg);

	struct io_uring ring;
	int res = io_uring_queue_init(IORING_ENTRIES, &ring, 0);
	if (res) err(EXIT_FAILURE, "io_uring_queue_init failed");

	for (unsigned i = 0; i < ITERATIONS; ++i) {
		uint64_t rbuf = 0;
		uint64_t wbuf = write_increment;

		prep_evfd_read(&ring, thread_id, &rbuf, 1);
		checked_submit_one(&ring);

		prep_evfd_read(&ring, (thread_id + 1) % nthreads, &rbuf, 2);
		checked_submit_one(&ring);

		prep_evfd_write(&ring, (thread_id + 2) % nthreads, &wbuf);
		checked_submit_one(&ring);
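		// Reap one completion: with write_increment == 0 this is always the write,
		// because the eventfd counters stay at 0 and neither read can complete.
		// With a non-zero increment it may be a read instead.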
		collect_n_completions(&ring, 1);


		prep_cancel(&ring, 1);
		prep_cancel(&ring, 2);
		checked_submit_n(&ring, 2);
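		// Five SQEs were submitted this iteration and one CQE was reaped above, so
		// four CQEs (normally the two reads and the two cancels) are still expected.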
		collect_n_completions(&ring, 4);
	}

	return NULL;
}

int main(int argc, char* argv[]) {
	if (argc > 1)
		write_increment = atoi(argv[1]);

	nthreads = get_nprocs();
	// nthreads = 10;
	evfds = malloc(sizeof(int) * nthreads);
	if (!evfds) err(EXIT_FAILURE, "malloc failed");

	for (unsigned i = 0; i < nthreads; ++i) {
		evfds[i] = eventfd(0, 0);
		if (evfds[i] == -1) err(EXIT_FAILURE, "eventfd failed");
	}

	threads = malloc(sizeof(pthread_t) * nthreads);
	if (!threads) err(EXIT_FAILURE, "malloc failed");

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_create(&threads[i], NULL, thread_func, (void*)(uintptr_t)i);
		if (errno) err(EXIT_FAILURE, "pthread_create failed");
	}

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_join(threads[i], NULL);
		if (errno) err(EXIT_FAILURE, "pthread_join failed");
	}

	exit(EXIT_SUCCESS);
}

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-01-19 13:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18 15:13 Canceled read requests never completed Florian Fischer
2022-01-18 19:07 ` Jens Axboe
2022-01-18 20:05   ` Florian Fischer
2022-01-18 23:36     ` Jens Axboe
2022-01-19  2:32       ` Jens Axboe
2022-01-19 13:42         ` Florian Fischer
