IO-Uring Archive on lore.kernel.org
* [ISSUE] The time cost of IOSQE_IO_LINK
@ 2020-02-12 16:31 Carter Li 李通洲
  2020-02-12 17:11 ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-12 16:31 UTC (permalink / raw)
  To: io-uring

Hi everyone,

IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.

Test code attached below. The program completes after getting 100000000 cqes.

$ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
$ time ./test0
USE_LINK: 0, count: 100000000, submit_count: 1562500
0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
0inputs+0outputs (0major+72minor)pagefaults 0swaps

$ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
$ time ./test1
USE_LINK: 1, count: 100000110, submit_count: 799584
0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
0inputs+0outputs (0major+72minor)pagefaults 0swaps

As you can see, the `-DUSE_LINK=1` version issues only about half as many
io_uring_submit calls as the other version, but takes twice as long. That
makes IOSQE_IO_LINK almost useless; please take a look.

Thanks,
Carter Li

---

#include <liburing.h>
#include <stdio.h>

#define ENTRY_SIZE 128
#define MAX_COUNT 100000000

#ifndef USE_LINK
#	define USE_LINK 0
#endif

int main(void) {
	struct io_uring ring;
	unsigned i, head;
	unsigned count = 0, cqe_count = 0, submit_count = 0;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	if (io_uring_queue_init(ENTRY_SIZE, &ring, 0) < 0) {
		return -1;
	}

	for (i = 0; i < ENTRY_SIZE / 2; ++i) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
#if USE_LINK
		io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
		sqe->user_data = 1;
#endif
	}

	while (count < MAX_COUNT) {
		io_uring_submit_and_wait(&ring, 1);
		++submit_count;

		io_uring_for_each_cqe(&ring, head, cqe) {
			++cqe_count;
			++count;
#if USE_LINK
			if (cqe->user_data) {
				sqe = io_uring_get_sqe(&ring);
				io_uring_prep_nop(sqe);
				io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
				sqe = io_uring_get_sqe(&ring);
				io_uring_prep_nop(sqe);
				sqe->user_data = 1;
			}
#else
			sqe = io_uring_get_sqe(&ring);
			io_uring_prep_nop(sqe);
			sqe->user_data = !cqe->user_data;
#endif
		}
		io_uring_cq_advance(&ring, cqe_count);
		cqe_count = 0;
	}

	printf("USE_LINK: %d, count: %u, submit_count: %u\n\n",
		USE_LINK, count, submit_count);
}




* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 16:31 [ISSUE] The time cost of IOSQE_IO_LINK Carter Li 李通洲
@ 2020-02-12 17:11 ` Jens Axboe
  2020-02-12 17:22   ` Jens Axboe
  2020-02-13  0:33   ` Carter Li 李通洲
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-12 17:11 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
> Hi everyone,
> 
> IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.
> 
> Test code attached below. The program completes after getting 100000000 cqes.
> 
> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
> $ time ./test0
> USE_LINK: 0, count: 100000000, submit_count: 1562500
> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
> 
> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
> $ time ./test1
> USE_LINK: 1, count: 100000110, submit_count: 799584
> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
> 
> As you can see, the `-DUSE_LINK=1` version issues only about half as many
> io_uring_submit calls as the other version, but takes twice as long. That
> makes IOSQE_IO_LINK almost useless; please take a look.

The nop isn't really a good test case, as it doesn't contain any smarts
in terms of executing a link fast. So it doesn't say a whole lot outside
of "we could make nop links faster", which is also kind of pointless.

"Normal" commands will work better. Where the link is really a win is if
the first request needs to go async to complete. For that case, the
next link can execute directly from that context. This saves an async
punt for the common case.
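
A minimal sketch of such a link with liburing, for illustration (the file
name, buffer, and queue depth below are arbitrary; the prep helpers and
flags are the standard liburing ones): a write linked to an fsync, where
the write is the request that may need to go async.

#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[] = "hello";
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) - 1 };
	int fd, i;

	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || io_uring_queue_init(4, &ring, 0) < 0)
		return 1;

	/* head of the chain: the write, flagged so the next sqe is linked */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, fd, &iov, 1, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	/* linked request: only started once the write has completed */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fsync(sqe, fd, 0);

	io_uring_submit(&ring);

	/* reap both completions */
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}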

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:11 ` Jens Axboe
@ 2020-02-12 17:22   ` Jens Axboe
  2020-02-12 17:29     ` Jens Axboe
  2020-02-13  0:33   ` Carter Li 李通洲
  1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-12 17:22 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 10:11 AM, Jens Axboe wrote:
> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>> Hi everyone,
>>
>> IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.
>>
>> Test code attached below. The program completes after getting 100000000 cqes.
>>
>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>> $ time ./test0
>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>
>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>> $ time ./test1
>> USE_LINK: 1, count: 100000110, submit_count: 799584
>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>
>> As you can see, the `-DUSE_LINK=1` version issues only about half as many
>> io_uring_submit calls as the other version, but takes twice as long. That
>> makes IOSQE_IO_LINK almost useless; please take a look.
> 
> The nop isn't really a good test case, as it doesn't contain any smarts
> in terms of executing a link fast. So it doesn't say a whole lot outside
> of "we could make nop links faster", which is also kind of pointless.
> 
> "Normal" commands will work better. Where the link is really a win is if
> the first request needs to go async to complete. For that case, the
> next link can execute directly from that context. This saves an async
> punt for the common case.

Case in point, if I just add the below patch, we're a lot closer:

[root@archlinux liburing]# time test/nop-link 0
Using link: 0
count: 100000000, submit_count: 1562500


real	0m7.934s
user	0m0.740s
sys	0m7.157s
[root@archlinux liburing]# time test/nop-link 1
Using link: 1
count: 100000000, submit_count: 781250


real	0m9.009s
user	0m0.710s
sys	0m8.264s

The links are still a bit slower, which is to be expected, as the
nop basically just completes; it doesn't do anything at all and
never needs to go async.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9f00f30e1790..d0f645be91cb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2399,7 +2399,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 /*
  * IORING_OP_NOP just posts a completion event, nothing else.
  */
-static int io_nop(struct io_kiocb *req)
+static int io_nop(struct io_kiocb *req, struct io_kiocb **nxt)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
@@ -2407,7 +2407,7 @@ static int io_nop(struct io_kiocb *req)
 		return -EINVAL;
 
 	io_cqring_add_event(req, 0);
-	io_put_req(req);
+	io_put_req_find_next(req, nxt);
 	return 0;
 }
 
@@ -4355,7 +4355,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	switch (req->opcode) {
 	case IORING_OP_NOP:
-		ret = io_nop(req);
+		ret = io_nop(req, nxt);
 		break;
 	case IORING_OP_READV:
 	case IORING_OP_READ_FIXED:

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:22   ` Jens Axboe
@ 2020-02-12 17:29     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-12 17:29 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 10:22 AM, Jens Axboe wrote:
> On 2/12/20 10:11 AM, Jens Axboe wrote:
>> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>>> Hi everyone,
>>>
>>> IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.
>>>
>>> Test code attached below. The program completes after getting 100000000 cqes.
>>>
>>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>>> $ time ./test0
>>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>>> $ time ./test1
>>> USE_LINK: 1, count: 100000110, submit_count: 799584
>>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> As you can see, the `-DUSE_LINK=1` version issues only about half as many
>>> io_uring_submit calls as the other version, but takes twice as long. That
>>> makes IOSQE_IO_LINK almost useless; please take a look.
>>
>> The nop isn't really a good test case, as it doesn't contain any smarts
>> in terms of executing a link fast. So it doesn't say a whole lot outside
>> of "we could make nop links faster", which is also kind of pointless.
>>
>> "Normal" commands will work better. Where the link is really a win is if
>> the first request needs to go async to complete. For that case, the
>> next link can execute directly from that context. This saves an async
>> punt for the common case.
> 
> Case in point, if I just add the below patch, we're a lot closer:
> 
> [root@archlinux liburing]# time test/nop-link 0
> Using link: 0
> count: 100000000, submit_count: 1562500
> 
> 
> real	0m7.934s
> user	0m0.740s
> sys	0m7.157s
> [root@archlinux liburing]# time test/nop-link 1
> Using link: 1
> count: 100000000, submit_count: 781250
> 
> 
> real	0m9.009s
> user	0m0.710s
> sys	0m8.264s
> 
> The links are still a bit slower, which is to be expected, as the
> nop basically just completes; it doesn't do anything at all and
> never needs to go async.

Pinning the test to a CPU for more reliable results, we're basically even.

[root@archlinux liburing]# time taskset -c 0 test/nop-link 1
Using link: 1
count: 100000000, submit_count: 781250


real	0m8.251s
user	0m0.680s
sys	0m7.536s

[root@archlinux liburing]# time taskset -c 0 test/nop-link 0
Using link: 0
count: 100000000, submit_count: 1562500


real	0m7.986s
user	0m0.610s
sys	0m7.340s

For the intended case (outlined above), it'll definitely be a
win.

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:11 ` Jens Axboe
  2020-02-12 17:22   ` Jens Axboe
@ 2020-02-13  0:33   ` Carter Li 李通洲
  2020-02-13 15:08     ` Pavel Begunkov
  1 sibling, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-13  0:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Thanks for your reply.

You are right the nop isn't really a good test case. But I actually
found this issue when benchmarking my echo server, which didn't use
NOP of course.

Test case attached below. Use rust_echo_bench for benchmarking.
https://github.com/haraldh/rust_echo_bench


$ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=0
$ ./link_recv 12345
$ cargo run --release # On another console
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Speed: 168264 request/sec, 168264 response/sec
Requests: 10095846
Responses: 10095844

$ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=1
$ ./link_recv 12345
$ cargo run --release # On another console
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Speed: 112666 request/sec, 112666 response/sec
Requests: 6760009
Responses: 6759975


I think `POLL_ADD(POLLIN)-RECV` and `POLL_ADD(POLLOUT)-SEND` are common use cases for networking (for some reason a short read for SEND is not considered an error, so `RECV-SEND` cannot be used in a link chain). RECV/SEND won't block after being polled. I expected better performance from the reduced number of io_uring_enter syscalls. Could you please take a look?
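
Roughly, the `POLL_ADD(POLLIN)-RECV` pair is built like the sketch below
(this is what the add_poll()/add_socket_read() helpers in the program
further down boil down to, with the user_data handling omitted; conn_fd
is assumed to be an already-accepted socket and buf a per-connection
buffer):

struct io_uring_sqe *sqe;

/* head of the chain: wait until the socket is readable */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, conn_fd, POLLIN);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

/* linked request: only issued after the poll completes */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, conn_fd, buf, sizeof(buf), MSG_NOSIGNAL);

io_uring_submit(&ring);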

I have posted another, more complex test case, `POLL_ADD-READ_FIXED-WRITE_FIXED`, on GitHub; it currently results in a freeze.

https://github.com/axboe/liburing/issues/71

Carter

---

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/poll.h>
#include <netinet/in.h>

#include <liburing.h>

#define BACKLOG 128
#define MAX_MESSAGE_LEN 1024
#define MAX_CONNECTIONS 1024
#ifndef USE_LINK
#   define USE_LINK 0
#endif

enum { ACCEPT, POLL, READ, WRITE };

struct conn_info {
    __u32 fd;
    __u32 type;
};

typedef char buf_type[MAX_CONNECTIONS][MAX_MESSAGE_LEN];

static struct io_uring ring;
static unsigned cqe_count = 0;

int init_socket(int portno) {
    int sock_listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (sock_listen_fd < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in server_addr = {
        .sin_family = AF_INET,
        .sin_port = htons(portno),
        .sin_addr = {
            .s_addr = INADDR_ANY,
        },
    };

    if (bind(sock_listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
        perror("bind");
        return -1;
    }

    if (listen(sock_listen_fd, BACKLOG) < 0) {
        perror("listen");
        return -1;
    }

    return sock_listen_fd;
}

static struct io_uring_sqe* get_sqe_safe() {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (__builtin_expect(!!sqe, 1)) {
        return sqe;
    } else {
        io_uring_cq_advance(&ring, cqe_count);
        cqe_count = 0;
        io_uring_submit(&ring);
        return io_uring_get_sqe(&ring);
    }
}

static void add_accept(int fd, struct sockaddr *client_addr, socklen_t *client_len) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = ACCEPT,
    };

    io_uring_prep_accept(sqe, fd, client_addr, client_len, 0);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_poll(int fd, int poll_mask, unsigned flags) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = POLL,
    };

    io_uring_prep_poll_add(sqe, fd, poll_mask);
    io_uring_sqe_set_flags(sqe, flags);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_socket_read(int fd, size_t size, buf_type *bufs) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = READ,
    };

    io_uring_prep_recv(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_socket_write(int fd, size_t size, buf_type *bufs, unsigned flags) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = WRITE,
    };

    io_uring_prep_send(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
    io_uring_sqe_set_flags(sqe, flags);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Please give a port number: %s [port]\n", argv[0]);
        return 1;
    }

    int portno = strtol(argv[1], NULL, 10);
    int sock_listen_fd = init_socket(portno);
    if (sock_listen_fd < 0) return -1;
    printf("io_uring echo server listening for connections on port: %d\n", portno);


    int ret = io_uring_queue_init(BACKLOG, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return -1;
    }

    buf_type *bufs = (buf_type *)malloc(sizeof(*bufs));

    struct sockaddr_in client_addr;
    socklen_t client_len = sizeof(client_addr);
    add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);

    while (1) {
        io_uring_submit_and_wait(&ring, 1);

        struct io_uring_cqe *cqe;
        unsigned head;

        io_uring_for_each_cqe(&ring, head, cqe) {
            ++cqe_count;

            struct conn_info conn_i;
            memcpy(&conn_i, &cqe->user_data, sizeof(conn_i));
            int result = cqe->res;

            switch (conn_i.type) {
            case ACCEPT:
#if USE_LINK
                add_poll(result, POLLIN, IOSQE_IO_LINK);
                add_socket_read(result, MAX_MESSAGE_LEN, bufs);
#else
                add_poll(result, POLLIN, 0);
#endif
                add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
                break;

#if !USE_LINK
            case POLL:
                add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
                break;
#endif

            case READ:
                if (__builtin_expect(result <= 0, 0)) {
                    shutdown(conn_i.fd, SHUT_RDWR);
                } else {
                    add_socket_write(conn_i.fd, result, bufs, 0);
                }
                break;

            case WRITE:
#if USE_LINK
                add_poll(conn_i.fd, POLLIN, IOSQE_IO_LINK);
                add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
#else
                add_poll(conn_i.fd, POLLIN, 0);
#endif
                break;
            }
        }

        io_uring_cq_advance(&ring, cqe_count);
        cqe_count = 0;
    }


    close(sock_listen_fd);
    free(bufs);
}



> On Feb 13, 2020, at 1:11 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>> Hi everyone,
>> 
>> IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.
>> 
>> Test code attached below. The program completes after getting 100000000 cqes.
>> 
>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>> $ time ./test0
>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>> 
>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>> $ time ./test1
>> USE_LINK: 1, count: 100000110, submit_count: 799584
>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>> 
>> As you can see, the `-DUSE_LINK=1` version issues only about half as many
>> io_uring_submit calls as the other version, but takes twice as long. That
>> makes IOSQE_IO_LINK almost useless; please take a look.
> 
> The nop isn't really a good test case, as it doesn't contain any smarts
> in terms of executing a link fast. So it doesn't say a whole lot outside
> of "we could make nop links faster", which is also kind of pointless.
> 
> "Normal" commands will work better. Where the link is really a win is if
> the first request needs to go async to complete. For that case, the
> next link can execute directly from that context. This saves an async
> punt for the common case.
> 
> -- 
> Jens Axboe
> 



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13  0:33   ` Carter Li 李通洲
@ 2020-02-13 15:08     ` Pavel Begunkov
  2020-02-13 15:14       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2020-02-13 15:08 UTC (permalink / raw)
  To: Carter Li 李通洲, Jens Axboe; +Cc: io-uring

On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
> Thanks for your reply.
> 
> You are right the nop isn't really a good test case. But I actually
> found this issue when benchmarking my echo server, which didn't use
> NOP of course.

If there are no hidden subtle issues in io_uring, your benchmark or the
used pattern itself, it's probably because of overhead on async punting
(copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
io-wq itself).

I was going to tune async/punting stuff anyway, so I'll look into this.
And of course, there is always a good chance Jens has some bright insights.

BTW, what's the benefit of doing poll(fd)->read(fd), rather than directly read()?

> Test case attached below. Use rust_echo_bench for benchmarking.
> https://github.com/haraldh/rust_echo_bench
> 
> 
> $ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=0
> $ ./link_recv 12345
> $ cargo run --release # On another console
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Speed: 168264 request/sec, 168264 response/sec
> Requests: 10095846
> Responses: 10095844
> 
> $ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=1
> $ ./link_recv 12345
> $ cargo run --release # On another console
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Speed: 112666 request/sec, 112666 response/sec
> Requests: 6760009
> Responses: 6759975
> 
> 
> I think `POLL_ADD(POLLIN)-RECV` and `POLL_ADD(POLLOUT)-SEND` are common use cases for networking (for some reason a short read for SEND is not considered an error, so `RECV-SEND` cannot be used in a link chain). RECV/SEND won't block after being polled. I expected better performance from the reduced number of io_uring_enter syscalls. Could you please take a look?
> 
> I have posted another, more complex test case, `POLL_ADD-READ_FIXED-WRITE_FIXED`, on GitHub; it currently results in a freeze.
> 
> https://github.com/axboe/liburing/issues/71
> 
> Carter
> 
> ---
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #include <sys/socket.h>
> #include <sys/poll.h>
> #include <netinet/in.h>
> 
> #include <liburing.h>
> 
> #define BACKLOG 128
> #define MAX_MESSAGE_LEN 1024
> #define MAX_CONNECTIONS 1024
> #ifndef USE_LINK
> #   define USE_LINK 0
> #endif
> 
> enum { ACCEPT, POLL, READ, WRITE };
> 
> struct conn_info {
>     __u32 fd;
>     __u32 type;
> };
> 
> typedef char buf_type[MAX_CONNECTIONS][MAX_MESSAGE_LEN];
> 
> static struct io_uring ring;
> static unsigned cqe_count = 0;
> 
> int init_socket(int portno) {
>     int sock_listen_fd = socket(AF_INET, SOCK_STREAM, 0);
>     if (sock_listen_fd < 0) {
>         perror("socket");
>         return -1;
>     }
> 
>     struct sockaddr_in server_addr = {
>         .sin_family = AF_INET,
>         .sin_port = htons(portno),
>         .sin_addr = {
>             .s_addr = INADDR_ANY,
>         },
>     };
> 
>     if (bind(sock_listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
>         perror("bind");
>         return -1;
>     }
> 
>     if (listen(sock_listen_fd, BACKLOG) < 0) {
>         perror("listen");
>         return -1;
>     }
> 
>     return sock_listen_fd;
> }
> 
> static struct io_uring_sqe* get_sqe_safe() {
>     struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
>     if (__builtin_expect(!!sqe, 1)) {
>         return sqe;
>     } else {
>         io_uring_cq_advance(&ring, cqe_count);
>         cqe_count = 0;
>         io_uring_submit(&ring);
>         return io_uring_get_sqe(&ring);
>     }
> }
> 
> static void add_accept(int fd, struct sockaddr *client_addr, socklen_t *client_len) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = ACCEPT,
>     };
> 
>     io_uring_prep_accept(sqe, fd, client_addr, client_len, 0);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_poll(int fd, int poll_mask, unsigned flags) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = POLL,
>     };
> 
>     io_uring_prep_poll_add(sqe, fd, poll_mask);
>     io_uring_sqe_set_flags(sqe, flags);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_socket_read(int fd, size_t size, buf_type *bufs) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = READ,
>     };
> 
>     io_uring_prep_recv(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_socket_write(int fd, size_t size, buf_type *bufs, unsigned flags) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = WRITE,
>     };
> 
>     io_uring_prep_send(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
>     io_uring_sqe_set_flags(sqe, flags);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> int main(int argc, char *argv[]) {
>     if (argc < 2) {
>         fprintf(stderr, "Please give a port number: %s [port]\n", argv[0]);
>         return 1;
>     }
> 
>     int portno = strtol(argv[1], NULL, 10);
>     int sock_listen_fd = init_socket(portno);
>     if (sock_listen_fd < 0) return -1;
>     printf("io_uring echo server listening for connections on port: %d\n", portno);
> 
> 
>     int ret = io_uring_queue_init(BACKLOG, &ring, 0);
>     if (ret < 0) {
>         fprintf(stderr, "queue_init: %s\n", strerror(-ret));
>         return -1;
>     }
> 
>     buf_type *bufs = (buf_type *)malloc(sizeof(*bufs));
> 
>     struct sockaddr_in client_addr;
>     socklen_t client_len = sizeof(client_addr);
>     add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
> 
>     while (1) {
>         io_uring_submit_and_wait(&ring, 1);
> 
>         struct io_uring_cqe *cqe;
>         unsigned head;
> 
>         io_uring_for_each_cqe(&ring, head, cqe) {
>             ++cqe_count;
> 
>             struct conn_info conn_i;
>             memcpy(&conn_i, &cqe->user_data, sizeof(conn_i));
>             int result = cqe->res;
> 
>             switch (conn_i.type) {
>             case ACCEPT:
> #if USE_LINK
>                 add_poll(result, POLLIN, IOSQE_IO_LINK);
>                 add_socket_read(result, MAX_MESSAGE_LEN, bufs);
> #else
>                 add_poll(result, POLLIN, 0);
> #endif
>                 add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
>                 break;
> 
> #if !USE_LINK
>             case POLL:
>                 add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
>                 break;
> #endif
> 
>             case READ:
>                 if (__builtin_expect(result <= 0, 0)) {
>                     shutdown(conn_i.fd, SHUT_RDWR);
>                 } else {
>                     add_socket_write(conn_i.fd, result, bufs, 0);
>                 }
>                 break;
> 
>             case WRITE:
> #if USE_LINK
>                 add_poll(conn_i.fd, POLLIN, IOSQE_IO_LINK);
>                 add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
> #else
>                 add_poll(conn_i.fd, POLLIN, 0);
> #endif
>                 break;
>             }
>         }
> 
>         io_uring_cq_advance(&ring, cqe_count);
>         cqe_count = 0;
>     }
> 
> 
>     close(sock_listen_fd);
>     free(bufs);
> }
> 
> 
> 
>> On Feb 13, 2020, at 1:11 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>>> Hi everyone,
>>>
>>> IOSQE_IO_LINK seems to have a very high cost, even greater than the io_uring_enter syscall.
>>>
>>> Test code attached below. The program completes after getting 100000000 cqes.
>>>
>>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>>> $ time ./test0
>>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>>> $ time ./test1
>>> USE_LINK: 1, count: 100000110, submit_count: 799584
>>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> As you can see, the `-DUSE_LINK=1` version issues only about half as many
>>> io_uring_submit calls as the other version, but takes twice as long. That
>>> makes IOSQE_IO_LINK almost useless; please take a look.
>>
>> The nop isn't really a good test case, as it doesn't contain any smarts
>> in terms of executing a link fast. So it doesn't say a whole lot outside
>> of "we could make nop links faster", which is also kind of pointless.
>>
>> "Normal" commands will work better. Where the link is really a win is if
>> the first request needs to go async to complete. For that case, the
>> next link can execute directly from that context. This saves an async
>> punt for the common case.
>>
>> -- 
>> Jens Axboe
>>
> 

-- 
Pavel Begunkov


* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:08     ` Pavel Begunkov
@ 2020-02-13 15:14       ` Jens Axboe
  2020-02-13 15:51         ` Carter Li 李通洲
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-13 15:14 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲; +Cc: io-uring

On 2/13/20 8:08 AM, Pavel Begunkov wrote:
> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>> Thanks for your reply.
>>
>> You are right the nop isn't really a good test case. But I actually
>> found this issue when benchmarking my echo server, which didn't use
>> NOP of course.
> 
> If there are no hidden subtle issues in io_uring, your benchmark or the
> used pattern itself, it's probably because of overhead on async punting
> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
> io-wq itself).
> 
> I was going to tune async/punting stuff anyway, so I'll look into this.
> And of course, there is always a good chance Jens has some bright insights.

The main issue here is that if you do the poll->recv, then it'll be
an automatic punt of the recv to async context when the poll completes.
That's regardless of whether or not we can complete the poll inline,
we never attempt to recv inline from that completion. This is in contrast
to doing a separate poll, getting the notification, then doing another
sqe and io_uring_enter to perform the recv. For this case, we end up
doing everything inline, just with the cost of an additional system call
to submit the new recv.

It'd be really cool if we could improve on this situation, as recv (or
read) preceded by a poll is indeed a common use case. Or ditto for the
write side.

> BTW, what's the benefit of doing poll(fd)->read(fd), rather than directly read()?

If there's no data to begin with, then the read will go async. Hence
it'll be a switch to a worker thread. The above should avoid it, but
it doesn't.

For Carter's sake, it's worth noting that the poll command is special
and normal requests would be more efficient with links. We just need
to work on making the poll linked with read/write perform much better.

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:14       ` Jens Axboe
@ 2020-02-13 15:51         ` Carter Li 李通洲
  2020-02-14  1:25           ` Carter Li 李通洲
  0 siblings, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-13 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring



> On Feb 13, 2020, at 11:14 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/13/20 8:08 AM, Pavel Begunkov wrote:
>> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>>> Thanks for your reply.
>>> 
>>> You are right the nop isn't really a good test case. But I actually
>>> found this issue when benchmarking my echo server, which didn't use
>>> NOP of course.
>> 
>> If there are no hidden subtle issues in io_uring, your benchmark or the
>> used pattern itself, it's probably because of overhead on async punting
>> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
>> io-wq itself).
>> 
>> I was going to tune async/punting stuff anyway, so I'll look into this.
>> And of course, there is always a good chance Jens has some bright insights.
> 
> The main issue here is that if you do the poll->recv, then it'll be
> an automatic punt of the recv to async context when the poll completes.
> That's regardless of whether or not we can complete the poll inline,
> we never attempt to recv inline from that completion. This is in contrast
> to doing a separate poll, getting the notification, then doing another
> sqe and io_uring_enter to perform the recv. For this case, we end up
> doing everything inline, just with the cost of an additional system call
> to submit the new recv.
> 
> It'd be really cool if we could improve on this situation, as recv (or
> read) preceded by a poll is indeed a common use case. Or ditto for the
> write side.
> 
>> BTW, what's the benefit of doing poll(fd)->read(fd), rather than directly read()?
> 
> If there's no data to begin with, then the read will go async. Hence
> it'll be a switch to a worker thread. The above should avoid it, but
> it doesn't.

Yes. I actually tested `directly read()` first, and found it was about 30%
slower than poll(fd)->read(fd).

https://github.com/axboe/liburing/issues/69

So it turns out that async punting has high overhead. A (silly) question:
could we implement read/write operations that would block as poll->read/write?


> 
> For Carter's sake, it's worth noting that the poll command is special
> and normal requests would be more efficient with links. We just need
> to work on making the poll linked with read/write perform much better.

Thanks

> 
> -- 
> Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:51         ` Carter Li 李通洲
@ 2020-02-14  1:25           ` Carter Li 李通洲
  2020-02-14  2:45             ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-14  1:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring

Another suggestion: we should always try completing operations inline
unless IOSQE_ASYNC is specified, no matter if the operations are preceded
by a poll.

Carter

> On Feb 13, 2020, at 11:51 PM, Carter Li 李通洲 <carter.li@eoitek.com> wrote:
> 
> 
> 
>> On Feb 13, 2020, at 11:14 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> 
>> On 2/13/20 8:08 AM, Pavel Begunkov wrote:
>>> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>>>> Thanks for your reply.
>>>> 
>>>> You are right the nop isn't really a good test case. But I actually
>>>> found this issue when benchmarking my echo server, which didn't use
>>>> NOP of course.
>>> 
>>> If there are no hidden subtle issues in io_uring, your benchmark or the
>>> used pattern itself, it's probably because of overhead on async punting
>>> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
>>> io-wq itself).
>>> 
>>> I was going to tune async/punting stuff anyway, so I'll look into this.
>>> And of course, there is always a good chance Jens has some bright insights.
>> 
>> The main issue here is that if you do the poll->recv, then it'll be
>> an automatic punt of the recv to async context when the poll completes.
>> That's regardless of whether or not we can complete the poll inline,
>> we never attempt to recv inline from that completion. This is in contrast
>> to doing a separate poll, getting the notification, then doing another
>> sqe and io_uring_enter to perform the recv. For this case, we end up
>> doing everything inline, just with the cost of an additional system call
>> to submit the new recv.
>> 
>> It'd be really cool if we could improve on this situation, as recv (or
>> read) preceded by a poll is indeed a common use case. Or ditto for the
>> write side.
>> 
>>> BTW, what's the benefit of doing poll(fd)->read(fd), rather than directly read()?
>> 
>> If there's no data to begin with, then the read will go async. Hence
>> it'll be a switch to a worker thread. The above should avoid it, but
>> it doesn't.
> 
> Yes. I actually tested `directly read()` first, and found it was about 30%
> slower than poll(fd)->read(fd).
> 
> https://github.com/axboe/liburing/issues/69
> 
> So it turns out that async punting has high overhead. A (silly) question:
> could we implement read/write operations that would block as poll->read/write?
> 
> 
>> 
>> For Carter's sake, it's worth noting that the poll command is special
>> and normal requests would be more efficient with links. We just need
>> to work on making the poll linked with read/write perform much better.
> 
> Thanks
> 
>> 
>> -- 
>> Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  1:25           ` Carter Li 李通洲
@ 2020-02-14  2:45             ` Jens Axboe
  2020-02-14  5:03               ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14  2:45 UTC (permalink / raw)
  To: Carter Li 李通洲; +Cc: Pavel Begunkov, io-uring

On 2/13/20 6:25 PM, Carter Li 李通洲 wrote:
> Another suggestion: we should always try completing operations inline
> unless IOSQE_ASYNC is specified, no matter if the operations are preceded
> by a poll.

Yes, that's a given; the problem is that the poll completion is run
inside the waitqueue handler, with that locked and interrupts
disabled. So it's not quite that easy, unfortunately.

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  2:45             ` Jens Axboe
@ 2020-02-14  5:03               ` Jens Axboe
  2020-02-14 15:32                 ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14  5:03 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Pavel Begunkov, io-uring, Peter Zijlstra

On 2/13/20 7:45 PM, Jens Axboe wrote:
> On 2/13/20 6:25 PM, Carter Li 李通洲 wrote:
>> Another suggestion: we should always try completing operations inline
>> unless IOSQE_ASYNC is specified, no matter if the operations are preceded
>> by a poll.
> 
> Yes, that's a given; the problem is that the poll completion is run
> inside the waitqueue handler, with that locked and interrupts
> disabled. So it's not quite that easy, unfortunately.

Super hack alert... While we can't do it from within the wakeup handler,
we can ensure that the task runs the work when it is scheduled. This
patch attempts to do that - when we run the poll wakeup handler inline
and we have a linked request, instead of just sending that async, we
queue it up in our task uring work log. When the task is scheduled in,
the backlog is run. This achieves the desired result of having the next
work item done inline, instead of having a worker thread do it.

CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
currently do something like this? Only thing I'm currently aware of is
the preempt in/out notifiers, but they don't quite provide what I need,
since I need to pass some data (a request) as well.

The full detail on what I'm trying here is:

io_uring can have linked requests. One obvious use case for that is to
queue a POLLIN on a socket, and then link a read/recv to that. When the
poll completes, we want to run the read/recv. io_uring hooks into the
waitqueue wakeup handler to finish the poll request, and since we're
deep in waitqueue wakeup code, it queues the linked read/recv for
execution via an async thread. This is not optimal, obviously, as it
relies on a switch to a new thread to perform this read. This hack
queues a backlog to the task itself, and runs it when it's scheduled in.
Probably want to do the same for sched out as well, currently I just
hack that in the io_uring wait part...

This nets me about a 20% boost running the echo example that Carter
supplied.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 56799be66b49..e100406f4842 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -574,6 +574,7 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
 	struct io_wq_work	work;
 };
 
@@ -931,6 +932,8 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
 	}
 	if (!req->work.task_pid)
 		req->work.task_pid = task_pid_vnr(current);
+	if (!req->task)
+		req->task = get_task_struct(current);
 }
 
 static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -953,6 +956,8 @@ static inline void io_req_work_drop_env(struct io_kiocb *req)
 		if (fs)
 			free_fs_struct(fs);
 	}
+	if (req->task)
+		put_task_struct(req->task);
 }
 
 static inline bool io_prep_async_work(struct io_kiocb *req,
@@ -1239,6 +1244,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	refcount_set(&req->refs, 2);
 	req->result = 0;
 	INIT_IO_WORK(&req->work, io_wq_submit_work);
+	req->task = NULL;
 	return req;
 fallback:
 	req = io_get_fallback_req(ctx);
@@ -3670,8 +3676,24 @@ static void __io_poll_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 			trigger_ev = false;
 			req->work.func = io_poll_trigger_evfd;
 		} else {
+			struct io_kiocb *nxt = NULL;
+
 			req->flags |= REQ_F_COMP_LOCKED;
+#if 1
+			io_put_req_find_next(req, &nxt);
+			if (nxt) {
+				struct task_struct *tsk = nxt->task;
+
+				raw_spin_lock(&tsk->uring_lock);
+				list_add_tail(&nxt->list, &tsk->uring_work);
+				raw_spin_unlock(&tsk->uring_lock);
+
+				/* do we need this? */
+				wake_up_process(tsk);
+			}
+#else
 			io_put_req(req);
+#endif
 			req = NULL;
 		}
 		spin_unlock_irqrestore(&ctx->completion_lock, flags);
@@ -5316,6 +5338,40 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return autoremove_wake_function(curr, mode, wake_flags, key);
 }
 
+static void __io_uring_task_handler(struct list_head *list)
+{
+	struct io_kiocb *req;
+
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct io_kiocb, list);
+		list_del(&req->list);
+
+		__io_queue_sqe(req, NULL);
+	}
+}
+
+void io_uring_task_handler(struct task_struct *tsk)
+{
+	LIST_HEAD(list);
+
+	raw_spin_lock_irq(&tsk->uring_lock);
+	if (!list_empty(&tsk->uring_work))
+		list_splice_init(&tsk->uring_work, &list);
+	raw_spin_unlock_irq(&tsk->uring_lock);
+
+	__io_uring_task_handler(&list);
+}
+
+static bool io_uring_handle_tasklog(void)
+{
+	if (list_empty(&current->uring_work))
+		return false;
+
+	__set_current_state(TASK_RUNNING);
+	io_uring_task_handler(current);
+	return true;
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -5358,6 +5414,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 						TASK_INTERRUPTIBLE);
 		if (io_should_wake(&iowq, false))
 			break;
+		/* should be a sched-out handler */
+		if (io_uring_handle_tasklog()) {
+			if (io_should_wake(&iowq, false))
+				break;
+			continue;
+		}
 		schedule();
 		if (signal_pending(current)) {
 			ret = -EINTR;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..447b06c6bed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_IO_URING
+	struct list_head		uring_work;
+	raw_spinlock_t			uring_lock;
+#endif
+
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc007604..b60f081cac17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
+#ifdef CONFIG_IO_URING
+	INIT_LIST_HEAD(&p->uring_work);
+	raw_spin_lock_init(&p->uring_lock);
+#endif
+
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
@@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
 
 #endif /* CONFIG_PREEMPT_NOTIFIERS */
 
+#ifdef CONFIG_IO_URING
+extern void io_uring_task_handler(struct task_struct *tsk);
+
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+	if (!list_empty(&tsk->uring_work))
+		io_uring_task_handler(tsk);
+}
+#else /* !CONFIG_IO_URING */
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+}
+#endif
+
 static inline void prepare_task(struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	balance_callback(rq);
 	preempt_enable();
 
+	io_uring_handler(current);
+
 	if (current->set_child_tid)
 		put_user(task_pid_vnr(current), current->set_child_tid);
 

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  5:03               ` Jens Axboe
@ 2020-02-14 15:32                 ` Peter Zijlstra
  2020-02-14 15:47                   ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2020-02-14 15:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On Thu, Feb 13, 2020 at 10:03:54PM -0700, Jens Axboe wrote:

> CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
> currently do something like this? Only thing I'm currently aware of is
> the preempt in/out notifiers, but they don't quite provide what I need,
> since I need to pass some data (a request) as well.

Whee, nothing quite like this around I think.

> The full detail on what I'm trying here is:
> 
> io_uring can have linked requests. One obvious use case for that is to
> queue a POLLIN on a socket, and then link a read/recv to that. When the
> poll completes, we want to run the read/recv. io_uring hooks into the
> waitqueue wakeup handler to finish the poll request, and since we're
> deep in waitqueue wakeup code, it queues the linked read/recv for
> execution via an async thread. This is not optimal, obviously, as it
> relies on a switch to a new thread to perform this read. This hack
> queues a backlog to the task itself, and runs it when it's scheduled in.
> Probably want to do the same for sched out as well, currently I just
> hack that in the io_uring wait part...

I'll definitely need to think more about this, but a few comments on the
below.

> +static void __io_uring_task_handler(struct list_head *list)
> +{
> +	struct io_kiocb *req;
> +
> +	while (!list_empty(list)) {
> +		req = list_first_entry(list, struct io_kiocb, list);
> +		list_del(&req->list);
> +
> +		__io_queue_sqe(req, NULL);
> +	}
> +}
> +
> +void io_uring_task_handler(struct task_struct *tsk)
> +{
> +	LIST_HEAD(list);
> +
> +	raw_spin_lock_irq(&tsk->uring_lock);
> +	if (!list_empty(&tsk->uring_work))
> +		list_splice_init(&tsk->uring_work, &list);
> +	raw_spin_unlock_irq(&tsk->uring_lock);
> +
> +	__io_uring_task_handler(&list);
> +}

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fc1dfc007604..b60f081cac17 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
>  
> +#ifdef CONFIG_IO_URING
> +	INIT_LIST_HEAD(&p->uring_work);
> +	raw_spin_lock_init(&p->uring_lock);
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  	p->capture_control = NULL;
>  #endif
> @@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
>  
>  #endif /* CONFIG_PREEMPT_NOTIFIERS */
>  
> +#ifdef CONFIG_IO_URING
> +extern void io_uring_task_handler(struct task_struct *tsk);
> +
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +	if (!list_empty(&tsk->uring_work))
> +		io_uring_task_handler(tsk);
> +}
> +#else /* !CONFIG_IO_URING */
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  static inline void prepare_task(struct task_struct *next)
>  {
>  #ifdef CONFIG_SMP
> @@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  	balance_callback(rq);
>  	preempt_enable();
>  
> +	io_uring_handler(current);
> +
>  	if (current->set_child_tid)
>  		put_user(task_pid_vnr(current), current->set_child_tid);
>  

I suspect you meant to put that in finish_task_switch() which is the
tail end of every schedule(), schedule_tail() is the tail end of
clone().

Or maybe you meant to put it in (and rename) sched_update_worker() which
is after every schedule() but in a preemptible context -- much saner
since you don't want to go add an unbounded amount of work in a
non-preemptible context.

At which point you already have your callback: io_wq_worker_running(),
or is this for any random task?




* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 15:32                 ` Peter Zijlstra
@ 2020-02-14 15:47                   ` Jens Axboe
  2020-02-14 16:18                     ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 8:32 AM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 10:03:54PM -0700, Jens Axboe wrote:
> 
>> CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
>> currently do something like this? Only thing I'm currently aware of is
>> the preempt in/out notifiers, but they don't quite provide what I need,
>> since I need to pass some data (a request) as well.
> 
> Whee, nothing quite like this around I think.

Probably not ;-)

>> The full detail on what I'm trying here is:
>>
>> io_uring can have linked requests. One obvious use case for that is to
>> queue a POLLIN on a socket, and then link a read/recv to that. When the
>> poll completes, we want to run the read/recv. io_uring hooks into the
>> waitqueue wakeup handler to finish the poll request, and since we're
>> deep in waitqueue wakeup code, it queues the linked read/recv for
>> execution via an async thread. This is not optimal, obviously, as it
>> relies on a switch to a new thread to perform this read. This hack
>> queues a backlog to the task itself, and runs it when it's scheduled in.
>> Probably want to do the same for sched out as well, currently I just
>> hack that in the io_uring wait part...
> 
> I'll definitely need to think more about this, but a few comments on the
> below.
> 
>> +static void __io_uring_task_handler(struct list_head *list)
>> +{
>> +	struct io_kiocb *req;
>> +
>> +	while (!list_empty(list)) {
>> +		req = list_first_entry(list, struct io_kiocb, list);
>> +		list_del(&req->list);
>> +
>> +		__io_queue_sqe(req, NULL);
>> +	}
>> +}
>> +
>> +void io_uring_task_handler(struct task_struct *tsk)
>> +{
>> +	LIST_HEAD(list);
>> +
>> +	raw_spin_lock_irq(&tsk->uring_lock);
>> +	if (!list_empty(&tsk->uring_work))
>> +		list_splice_init(&tsk->uring_work, &list);
>> +	raw_spin_unlock_irq(&tsk->uring_lock);
>> +
>> +	__io_uring_task_handler(&list);
>> +}
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fc1dfc007604..b60f081cac17 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>>  #endif
>>  
>> +#ifdef CONFIG_IO_URING
>> +	INIT_LIST_HEAD(&p->uring_work);
>> +	raw_spin_lock_init(&p->uring_lock);
>> +#endif
>> +
>>  #ifdef CONFIG_COMPACTION
>>  	p->capture_control = NULL;
>>  #endif
>> @@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
>>  
>>  #endif /* CONFIG_PREEMPT_NOTIFIERS */
>>  
>> +#ifdef CONFIG_IO_URING
>> +extern void io_uring_task_handler(struct task_struct *tsk);
>> +
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +	if (!list_empty(&tsk->uring_work))
>> +		io_uring_task_handler(tsk);
>> +}
>> +#else /* !CONFIG_IO_URING */
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +}
>> +#endif
>> +
>>  static inline void prepare_task(struct task_struct *next)
>>  {
>>  #ifdef CONFIG_SMP
>> @@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>>  	balance_callback(rq);
>>  	preempt_enable();
>>  
>> +	io_uring_handler(current);
>> +
>>  	if (current->set_child_tid)
>>  		put_user(task_pid_vnr(current), current->set_child_tid);
>>  
> 
> I suspect you meant to put that in finish_task_switch() which is the
> tail end of every schedule(), schedule_tail() is the tail end of
> clone().
> 
> Or maybe you meant to put it in (and rename) sched_update_worker() which
> is after every schedule() but in a preemptible context -- much saner
> since you don't want to go add an unbounded amount of work in a
> non-preemptible context.
> 
> At which point you already have your callback: io_wq_worker_running(),
> or is this for any random task?

Let me try and clarify - this isn't for the worker tasks, this is for
any task that is using io_uring. In fact, it's particularly not for the
worker threads, just the task itself.

I basically want the handler to be called when:

1) The task is scheduled in. The poll will complete and stuff some items
   on that task list, and I want the task to process them as it wakes up.

2) The task is going to sleep, don't want to leave entries around while
   the task is sleeping.

3) I need it to be called from "normal" context, with ints enabled,
   preempt enabled, etc.

sched_update_worker() (with a rename) looks ideal for #1, and the
context is sane for me. Just need a good spot to put the hook call for
schedule out. I think this:

	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
		preempt_disable();
		if (tsk->flags & PF_WQ_WORKER)
			wq_worker_sleeping(tsk);
		else
			io_wq_worker_sleeping(tsk);
		preempt_enable_no_resched();
	}

just needs to go into another helper, and then I can call it there
outside of the preempt.
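
A rough sketch of what that split might look like (the names here are
purely illustrative; io_uring_handler() is the hook from the earlier
patch, called once we're back outside the preempt-disabled section):

static void sched_out_update(struct task_struct *tsk)
{
	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
		preempt_disable();
		if (tsk->flags & PF_WQ_WORKER)
			wq_worker_sleeping(tsk);
		else
			io_wq_worker_sleeping(tsk);
		preempt_enable_no_resched();
	}
	/* flush any queued io_uring work with preempt and ints enabled */
	io_uring_handler(tsk);
}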

I'm sure there are daemons lurking here, but I'll test and see how it
goes...

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 15:47                   ` Jens Axboe
@ 2020-02-14 16:18                     ` Jens Axboe
  2020-02-14 17:52                       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 8:47 AM, Jens Axboe wrote:
>> I suspect you meant to put that in finish_task_switch() which is the
>> tail end of every schedule(), schedule_tail() is the tail end of
>> clone().
>>
>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>> is after every schedule() but in a preemptible context -- much saner
>> since you don't want to go add an unbounded amount of work in a
>> non-preemptible context.
>>
>> At which point you already have your callback: io_wq_worker_running(),
>> or is this for any random task?
> 
> Let me try and clarify - this isn't for the worker tasks, this is for
> any task that is using io_uring. In fact, it's particularly not for the
> worker threads, just the task itself.
> 
> I basically want the handler to be called when:
> 
> 1) The task is scheduled in. The poll will complete and stuff some items
>    on that task list, and I want the task to process them as it wakes up.
> 
> 2) The task is going to sleep, don't want to leave entries around while
>    the task is sleeping.
> 
> 3) I need it to be called from "normal" context, with ints enabled,
>    preempt enabled, etc.
> 
> sched_update_worker() (with a rename) looks ideal for #1, and the
> context is sane for me. Just need a good spot to put the hook call for
> schedule out. I think this:
> 
> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> 		preempt_disable();
> 		if (tsk->flags & PF_WQ_WORKER)
> 			wq_worker_sleeping(tsk);
> 		else
> 			io_wq_worker_sleeping(tsk);
> 		preempt_enable_no_resched();
> 	}
> 
> just needs to go into another helper, and then I can call it there
> outside of the preempt.
> 
> I'm sure there are daemons lurking here, but I'll test and see how it
> goes...

Here's a stab at cleaning it up:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll

top two patches. First one simply cleans up the sched_update_worker(),
so we now have sched_in_update() and sched_out_update(). No changes in
this patch, just moves the worker sched-out handling into a helper.

2nd patch then utilizes this to flush the per-task requests that may
have been queued up.

-- 
Jens Axboe



* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 16:18                     ` Jens Axboe
@ 2020-02-14 17:52                       ` Jens Axboe
  2020-02-14 20:44                         ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 9:18 AM, Jens Axboe wrote:
> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>> I suspect you meant to put that in finish_task_switch() which is the
>>> tail end of every schedule(), schedule_tail() is the tail end of
>>> clone().
>>>
>>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>>> is after every schedule() but in a preemptible context -- much saner
>>> since you don't want to go add an unbounded amount of work in a
>>> non-preemptible context.
>>>
>>> At which point you already have your callback: io_wq_worker_running(),
>>> or is this for any random task?
>>
>> Let me try and clarify - this isn't for the worker tasks, this is for
>> any task that is using io_uring. In fact, it's particularly not for the
>> worker threads, just the task itself.
>>
>> I basically want the handler to be called when:
>>
>> 1) The task is scheduled in. The poll will complete and stuff some items
>>    on that task list, and I want the task to process them as it wakes up.
>>
>> 2) The task is going to sleep, don't want to leave entries around while
>>    the task is sleeping.
>>
>> 3) I need it to be called from "normal" context, with ints enabled,
>>    preempt enabled, etc.
>>
>> sched_update_worker() (with a rename) looks ideal for #1, and the
>> context is sane for me. Just need a good spot to put the hook call for
>> schedule out. I think this:
>>
>> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>> 		preempt_disable();
>> 		if (tsk->flags & PF_WQ_WORKER)
>> 			wq_worker_sleeping(tsk);
>> 		else
>> 			io_wq_worker_sleeping(tsk);
>> 		preempt_enable_no_resched();
>> 	}
>>
>> just needs to go into another helper, and then I can call it there
>> outside of the preempt.
>>
>> I'm sure there are daemons lurking here, but I'll test and see how it
>> goes...
> 
> Here's a stab at cleaning it up:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
> 
> top two patches. First one simply cleans up the sched_update_worker(),
> so we now have sched_in_update() and sched_out_update(). No changes in
> this patch, just moves the worker sched-out handling into a helper.
> 
> 2nd patch then utilizes this to flush the per-task requests that may
> have been queued up.

In fact, we can go even further. If we have this task handler, then we:

1) Never need to go async for poll completion, and we can remove a bunch
   of code that handles that
2) Don't need to worry about nested eventfd notification, that code goes
   away too
3) Don't need the poll llist for batching flushes, that goes away

In terms of performance, for the single client case we did about 48K
requests per second on my kvm on the laptop, now we're doing 148K.
So it's definitely worthwhile... On top of that, diffstat:

 fs/io_uring.c | 166 +++++++-------------------------------------------
 1 file changed, 22 insertions(+), 144 deletions(-)


diff --git a/fs/io_uring.c b/fs/io_uring.c
index a683d7a08003..1e436b732f2a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
+	struct hlist_node	hash_node;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -835,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -932,8 +924,6 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
 	}
 	if (!req->work.task_pid)
 		req->work.task_pid = task_pid_vnr(current);
-	if (!req->task)
-		req->task = get_task_struct(current);
 }
 
 static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -3545,92 +3535,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
-
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3638,8 +3554,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
+	unsigned long flags;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3647,56 +3564,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				struct io_kiocb *nxt = NULL;
-
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req_find_next(req, &nxt);
-				if (nxt) {
-					struct task_struct *tsk = nxt->task;
-
-					raw_spin_lock(&tsk->uring_lock);
-					list_add_tail(&nxt->list, &tsk->uring_work);
-					raw_spin_unlock(&tsk->uring_lock);
-					/* do we need to wake tsk here??? */
-				}
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->list, &tsk->uring_work);
+	raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3755,7 +3628,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
@@ -4863,6 +4735,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return false;
 	}
 
+	req->task = get_task_struct(current);
+
 	/*
 	 * If we already have a head request, queue this one for async
 	 * submittal once the head completes. If we don't have a head but
@@ -5270,10 +5144,14 @@ void io_uring_task_handler(struct task_struct *tsk)
 	raw_spin_unlock_irq(&tsk->uring_lock);
 
 	while (!list_empty(&local_list)) {
+		struct io_kiocb *nxt = NULL;
+
 		req = list_first_entry(&local_list, struct io_kiocb, list);
 		list_del(&req->list);
 
-		__io_queue_sqe(req, NULL);
+		io_poll_task_handler(req, &nxt);
+		if (nxt)
+			__io_queue_sqe(nxt, NULL);
 	}
 }
 

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 17:52                       ` Jens Axboe
@ 2020-02-14 20:44                         ` Jens Axboe
  2020-02-15  0:16                           ` Carter Li 李通洲
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-14 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 10:52 AM, Jens Axboe wrote:
> On 2/14/20 9:18 AM, Jens Axboe wrote:
>> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>>> I suspect you meant to put that in finish_task_switch() which is the
>>>> tail end of every schedule(), schedule_tail() is the tail end of
>>>> clone().
>>>>
>>>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>>>> is after every schedule() but in a preemptible context -- much saner
>>>> since you don't want to go add an unbounded amount of work in a
>>>> non-preemptible context.
>>>>
>>>> At which point you already have your callback: io_wq_worker_running(),
>>>> or is this for any random task?
>>>
>>> Let me try and clarify - this isn't for the worker tasks, this is for
>>> any task that is using io_uring. In fact, it's particularly not for the
>>> worker threads, just the task itself.
>>>
>>> I basically want the handler to be called when:
>>>
>>> 1) The task is scheduled in. The poll will complete and stuff some items
>>>    on that task list, and I want the task to process them as it wakes up.
>>>
>>> 2) The task is going to sleep, don't want to leave entries around while
>>>    the task is sleeping.
>>>
>>> 3) I need it to be called from "normal" context, with ints enabled,
>>>    preempt enabled, etc.
>>>
>>> sched_update_worker() (with a rename) looks ideal for #1, and the
>>> context is sane for me. Just need a good spot to put the hook call for
>>> schedule out. I think this:
>>>
>>> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>>> 		preempt_disable();
>>> 		if (tsk->flags & PF_WQ_WORKER)
>>> 			wq_worker_sleeping(tsk);
>>> 		else
>>> 			io_wq_worker_sleeping(tsk);
>>> 		preempt_enable_no_resched();
>>> 	}
>>>
>>> just needs to go into another helper, and then I can call it there
>>> outside of the preempt.
>>>
>>> I'm sure there are daemons lurking here, but I'll test and see how it
>>> goes...
>>
>> Here's a stab at cleaning it up:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
>>
>> top two patches. First one simply cleans up the sched_update_worker(),
>> so we now have sched_in_update() and sched_out_update(). No changes in
>> this patch, just moves the worker sched-out handling into a helper.
>>
>> 2nd patch then utilizes this to flush the per-task requests that may
>> have been queued up.
> 
> In fact, we can go even further. If we have this task handler, then we:
> 
> 1) Never need to go async for poll completion, and we can remove a bunch
>    of code that handles that
> 2) Don't need to worry about nested eventfd notification, that code goes
>    away too
> 3) Don't need the poll llist for batching flushes, that goes away
> 
> In terms of performance, for the single client case we did about 48K
> requests per second on my kvm on the laptop, now we're doing 148K.
> So it's definitely worthwhile... On top of that, diffstat:
> 
>  fs/io_uring.c | 166 +++++++-------------------------------------------
>  1 file changed, 22 insertions(+), 144 deletions(-)

It's now up to 3.5x the original performance for the single client case.
Here's the updated patch, folded with the original that only went half
the way there.


commit b9c04ea10d10cf80f2d2f3b96e1668e523602072
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 14 09:15:29 2020 -0700

    io_uring: add per-task callback handler
    
    For poll requests, it's not uncommon to link a read (or write) after
    the poll to execute immediately after the file is marked as ready.
    Since the poll completion is called inside the waitqueue wake up handler,
    we have to punt that linked request to async context. This slows down
    the processing, and actually means it's faster to not use a link for this
    use case.
    
    We also run into problems if the completion_lock is contended, as we're
    doing a different lock ordering than the issue side is. Hence we have
    to do trylock for completion, and if that fails, go async. Poll removal
    needs to go async as well, for the same reason.
    
    eventfd notification needs special case as well, to avoid stack blowing
    recursion or deadlocks.
    
    These are all deficiencies that were inherited from the aio poll
    implementation, but I think we can do better. When a poll completes,
    simply queue it up in the task poll list. When the task completes the
    list, we can run dependent links inline as well. This means we never
    have to go async, and we can remove a bunch of code associated with
    that, and optimizations to try and make that run faster. The diffstat
    speaks for itself.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a826017ebb8..2756654e2955 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
+	struct hlist_node	hash_node;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -574,6 +567,7 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
 	struct io_wq_work	work;
 };
 
@@ -834,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -1056,24 +1049,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
 		return false;
 	if (!ctx->eventfd_async)
 		return true;
-	return io_wq_current_is_worker() || in_interrupt();
+	return io_wq_current_is_worker();
 }
 
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev)
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (trigger_ev)
+	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
-{
-	__io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx));
-}
-
 /* Returns true if there are no backlogged entries after the flush */
 static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
@@ -1238,6 +1226,8 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	/* one is dropped after submission, the other at completion */
 	refcount_set(&req->refs, 2);
 	req->result = 0;
+	/* task will wait for requests on exit, don't need a ref */
+	req->task = current;
 	INIT_IO_WORK(&req->work, io_wq_submit_work);
 	return req;
 fallback:
@@ -3448,15 +3438,22 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
 static void io_poll_remove_one(struct io_kiocb *req)
 {
 	struct io_poll_iocb *poll = &req->poll;
+	bool do_complete = false;
 
 	spin_lock(&poll->head->lock);
 	WRITE_ONCE(poll->canceled, true);
 	if (!list_empty(&poll->wait.entry)) {
 		list_del_init(&poll->wait.entry);
-		io_queue_async_work(req);
+		do_complete = true;
 	}
 	spin_unlock(&poll->head->lock);
 	hash_del(&req->hash_node);
+	if (do_complete) {
+		io_cqring_fill_event(req, -ECANCELED);
+		io_commit_cqring(req->ctx);
+		req->flags |= REQ_F_COMP_LOCKED;
+		io_put_req(req);
+	}
 }
 
 static void io_poll_remove_all(struct io_ring_ctx *ctx)
@@ -3474,6 +3471,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 			io_poll_remove_one(req);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
 }
 
 static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
@@ -3539,92 +3538,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
 
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3632,8 +3557,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
+	unsigned long flags;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3641,46 +3567,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req(req);
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->list, &tsk->uring_work);
+	raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3739,7 +3631,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
@@ -5243,6 +5134,28 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return autoremove_wake_function(curr, mode, wake_flags, key);
 }
 
+void io_uring_task_handler(struct task_struct *tsk)
+{
+	LIST_HEAD(local_list);
+	struct io_kiocb *req;
+
+	raw_spin_lock_irq(&tsk->uring_lock);
+	if (!list_empty(&tsk->uring_work))
+		list_splice_init(&tsk->uring_work, &local_list);
+	raw_spin_unlock_irq(&tsk->uring_lock);
+
+	while (!list_empty(&local_list)) {
+		struct io_kiocb *nxt = NULL;
+
+		req = list_first_entry(&local_list, struct io_kiocb, list);
+		list_del(&req->list);
+
+		io_poll_task_handler(req, &nxt);
+		if (nxt)
+			__io_queue_sqe(nxt, NULL);
+	}
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..447b06c6bed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_IO_URING
+	struct list_head		uring_work;
+	raw_spinlock_t			uring_lock;
+#endif
+
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51ca491d99ed..170fefa1caf8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
+#ifdef CONFIG_IO_URING
+	INIT_LIST_HEAD(&p->uring_work);
+	raw_spin_lock_init(&p->uring_lock);
+#endif
+
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
@@ -4104,6 +4109,20 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
+#ifdef CONFIG_IO_URING
+extern void io_uring_task_handler(struct task_struct *tsk);
+
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+	if (!list_empty(&tsk->uring_work))
+		io_uring_task_handler(tsk);
+}
+#else /* !CONFIG_IO_URING */
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+}
+#endif
+
 static void sched_out_update(struct task_struct *tsk)
 {
 	/*
@@ -4121,6 +4140,7 @@ static void sched_out_update(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
+	io_uring_handler(tsk);
 }
 
 static void sched_in_update(struct task_struct *tsk)
@@ -4131,6 +4151,7 @@ static void sched_in_update(struct task_struct *tsk)
 		else
 			io_wq_worker_running(tsk);
 	}
+	io_uring_handler(tsk);
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 20:44                         ` Jens Axboe
@ 2020-02-15  0:16                           ` Carter Li 李通洲
  2020-02-15  1:10                             ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  0:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

Hello Jens,


> It's now up to 3.5x the original performance for the single client case.
> Here's the updated patch, folded with the original that only went half
> the way there.


I’m looking forward to it.

And a question again: since POLL->READ/RECV is much faster than READ/RECV async,
could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
networking, but also for all pollable fds.
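
To be clear, the POLL->READ/RECV pattern I'm referring to is a poll sqe linked
in front of the read/recv, roughly like this with liburing (ring/fd/buf are
just placeholders):

struct io_uring_sqe *sqe;

/* poll first, the recv only runs once fd is readable */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, fd, POLLIN);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, fd, buf, sizeof(buf), 0);
sqe->user_data = 1;

io_uring_submit(&ring);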

Carter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  0:16                           ` Carter Li 李通洲
@ 2020-02-15  1:10                             ` Jens Axboe
  2020-02-15  1:25                               ` Carter Li 李通洲
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-15  1:10 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 5:16 PM, Carter Li 李通洲 wrote:
> Hello Jens,
> 
> 
>> It's now up to 3.5x the original performance for the single client case.
>> Here's the updated patch, folded with the original that only went half
>> the way there.
> 
> 
> I’m looking forward to it.
> 
> And a question again: since POLL->READ/RECV is much faster than READ/RECV async,
> could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
> networking, but also for all pollable fds.

That's exactly the next step. With this, we have a very efficient way of
doing async IO for anything that can be driven by poll. Then we can do it
by default, instead of doing an async punt. Much faster and much more
efficient.

I'll try and work on that next week, I think this could be a real game
changer.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:10                             ` Jens Axboe
@ 2020-02-15  1:25                               ` Carter Li 李通洲
  2020-02-15  1:27                                 ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  1:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

There are at least 2 benefits over POLL->READ

1. Reduce a little complexity of user code, and save lots of sqes.
2. Better performance. Users can’t tell if an operation will block without
issuing an extra O_NONBLOCK syscall, which ends up with always using the
POLL->READ link. If it’s handled by the kernel, we may only poll when
we know it’s needed.
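
Put differently: with kernel-side poll retry, the same thing collapses to a
single sqe (same placeholder names as the sketch earlier in the thread):

struct io_uring_sqe *sqe;

/* no explicit poll link needed: if the recv would block, the kernel arms
 * the poll internally and retries once fd becomes readable */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, fd, buf, sizeof(buf), 0);
sqe->user_data = 1;

io_uring_submit(&ring);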


> On Feb 15, 2020, at 9:10 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/14/20 5:16 PM, Carter Li 李通洲 wrote:
>> Hello Jens,
>> 
>> 
>>> It's now up to 3.5x the original performance for the single client case.
>>> Here's the updated patch, folded with the original that only went half
>>> the way there.
>> 
>> 
>> I’m looking forward to it.
>> 
>> And a question again: since POLL->READ/RECV is much faster than READ/RECV async,
>> could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
>> networking, but also for all pollable fds.
> 
> That's exactly the next step. With this, we have a very efficient way of
> doing async IO for anything that can be driven by poll. Then we can do it
> by default, instead of doing an async punt. Much faster and much more
> efficient.
> 
> I'll try and work on that next week, I think this could be a real game
> changer.
> 
> -- 
> Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:25                               ` Carter Li 李通洲
@ 2020-02-15  1:27                                 ` Jens Axboe
  2020-02-15  6:01                                   ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-15  1:27 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
> There are at least 2 benefits over POLL->READ
> 
> 1. Reduce a little complexity of user code, and save lots of sqes.
> 2. Better performance. Users can’t tell if an operation will block without
> issuing an extra O_NONBLOCK syscall, which ends up with always using the
> POLL->READ link. If it’s handled by the kernel, we may only poll when
> we know it’s needed.

Exactly, it'll enable the app to do read/recv or write/send without
having to worry about anything, and it'll be as efficient as having
it linked to a poll command.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:27                                 ` Jens Axboe
@ 2020-02-15  6:01                                   ` Jens Axboe
  2020-02-15  6:32                                     ` Carter Li 李通洲
  2020-02-16 19:06                                     ` Pavel Begunkov
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-15  6:01 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 6:27 PM, Jens Axboe wrote:
> On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
>> There are at least 2 benefits over POLL->READ
>>
>> 1. Reduce a little complexity of user code, and save lots of sqes.
>> 2. Better performance. Users can’t tell if an operation will block without
>> issuing an extra O_NONBLOCK syscall, which ends up with always using the
>> POLL->READ link. If it’s handled by the kernel, we may only poll when
>> we know it’s needed.
> 
> Exactly, it'll enable the app to do read/recv or write/send without
> having to worry about anything, and it'll be as efficient as having
> it linked to a poll command.

Couldn't help myself... The below is the general direction, but not
done by any stretch of the imagination. There are a few hacks in there.
But, in short, it does allow eg send/recv to behave in an async manner
without needing the thread offload. Tested it with the test/send_recv
and test/send_recvmsg test apps, and it works there. There seems to be
some weird issue with eg test/socket-rw, not sure what that is yet.

Just wanted to throw it out there. It's essentially the same as the
linked poll, except it's just done internally.


commit d1fc97c5f132eaf39c06783925bd11dca9fa3ecd
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 14 22:23:12 2020 -0700

    io_uring: use poll driven retry for files that support it
    
    Currently io_uring tries any request in a non-blocking manner, if it can,
    and then retries from a worker thread if we got -EAGAIN. Now that we have
    a new and fancy poll based retry backend, use that to retry requests if
    the file supports it.
    
    This means that, for example, an IORING_OP_RECVMSG on a socket no longer
    requires an async thread to complete the IO. If we get -EAGAIN reading
    from the socket in a non-blocking manner, we arm a poll handler for
    notification on when the socket becomes readable. When it does, the
    pending read is executed directly by the task again, through the io_uring
    scheduler handlers.
    
    Note that this is very much a work-in-progress, and it doesn't (yet) pass
    the full test suite. Notable missing features:
    
    - With the work queued up async, we have a common method for locating it
      for cancelation purposes. I need to add cancel tracking for poll
      managed requests.
    
    - Need to double check req->apoll life time.
    
    - Probably a lot I don't quite recall right now...
    
    It does work for the basic read/write, send/recv, etc testing I've
    tried.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fb94b8bac638..530dcd91fa53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -577,6 +577,7 @@ struct io_kiocb {
 		struct {
 			struct task_struct	*task;
 			struct list_head	task_list;
+			struct io_poll_iocb	*apoll;
 		};
 		struct io_wq_work	work;
 	};
@@ -623,6 +624,9 @@ struct io_op_def {
 	unsigned		file_table : 1;
 	/* needs ->fs */
 	unsigned		needs_fs : 1;
+	/* set if opcode supports polled "wait" */
+	unsigned		pollin : 1;
+	unsigned		pollout : 1;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -632,6 +636,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITEV] = {
 		.async_ctx		= 1,
@@ -639,6 +644,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FSYNC] = {
 		.needs_file		= 1,
@@ -646,11 +652,13 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE_FIXED] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_POLL_ADD] = {
 		.needs_file		= 1,
@@ -666,6 +674,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECVMSG] = {
 		.async_ctx		= 1,
@@ -673,6 +682,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.async_ctx		= 1,
@@ -684,6 +694,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.file_table		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_ASYNC_CANCEL] = {},
 	[IORING_OP_LINK_TIMEOUT] = {
@@ -695,6 +706,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_FALLOCATE] = {
 		.needs_file		= 1,
@@ -723,11 +735,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FADVISE] = {
 		.needs_file		= 1,
@@ -739,11 +753,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECV] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_OPENAT2] = {
 		.needs_file		= 1,
@@ -2228,6 +2244,139 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
+			    struct wait_queue_head *head)
+{
+	if (unlikely(poll->head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	poll->head = head;
+	add_wait_queue(head, &poll->wait);
+}
+
+static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	__io_queue_proc(pt->req->apoll, pt, head);
+}
+
+static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
+			   __poll_t mask)
+{
+	struct task_struct *tsk;
+	unsigned long flags;
+
+	/* for instances that support it check for an event match first: */
+	if (mask && !(mask & poll->events))
+		return 0;
+
+	list_del_init(&poll->wait.entry);
+
+	tsk = req->task;
+	req->result = mask;
+	spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->task_list, &tsk->uring_work);
+	spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
+	return 1;
+}
+
+static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_kiocb *req = wait->private;
+	struct io_poll_iocb *poll = req->apoll;
+
+	trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data, key_to_poll(key));
+	return __io_async_wake(req, poll, key_to_poll(key));
+}
+
+static bool io_arm_poll_handler(struct io_kiocb *req, int *retry_count)
+{
+	const struct io_op_def *def = &io_op_defs[req->opcode];
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_iocb *poll;
+	struct io_poll_table ipt;
+	bool cancel = false;
+	__poll_t mask;
+
+	if (!file_can_poll(req->file))
+		return false;
+	if (req->flags & REQ_F_MUST_PUNT)
+		return false;
+	if (!def->pollin && !def->pollout)
+		return false;
+
+	poll = kmalloc(sizeof(*poll), GFP_ATOMIC);
+	if (unlikely(!poll))
+		return false;
+
+	req->task = current;
+	INIT_LIST_HEAD(&req->task_list);
+	req->apoll = poll;
+
+	if (def->pollin)
+		mask = POLLIN;
+	if (def->pollout)
+		mask |= POLLOUT;
+
+	poll->file = req->file;
+	poll->head = NULL;
+	poll->done = poll->canceled = false;
+	poll->events = mask;
+	ipt.pt._qproc = io_async_queue_proc;
+	ipt.pt._key = mask;
+	ipt.req = req;
+	ipt.error = -EINVAL;
+
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_async_wake);
+	poll->wait.private = req;
+
+	mask = vfs_poll(req->file, &ipt.pt) & poll->events;
+
+	spin_lock_irq(&ctx->completion_lock);
+	if (likely(poll->head)) {
+		spin_lock(&poll->head->lock);
+		if (unlikely(list_empty(&poll->wait.entry))) {
+			if (ipt.error)
+				cancel = true;
+			ipt.error = 0;
+			mask = 0;
+		}
+		if (mask || ipt.error)
+			list_del_init(&poll->wait.entry);
+		else if (cancel)
+			WRITE_ONCE(poll->canceled, true);
+#if 0
+		Needs tracking for cancelation
+		else if (!poll->done) /* actually waiting for an event */
+			io_poll_req_insert(req);
+#endif
+		spin_unlock(&poll->head->lock);
+	}
+	if (mask) {
+		ipt.error = 0;
+		poll->done = true;
+		(*retry_count)++;
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+	trace_io_uring_poll_arm(ctx, req->opcode, req->user_data, mask,
+					poll->events);
+	return true;
+}
+
 static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 		   bool force_nonblock)
 {
@@ -3569,44 +3718,16 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	__poll_t mask = key_to_poll(key);
-	struct task_struct *tsk;
-	unsigned long flags;
-
-	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & poll->events))
-		return 0;
-
-	list_del_init(&poll->wait.entry);
 
-	tsk = req->task;
-	req->result = mask;
-	spin_lock_irqsave(&tsk->uring_lock, flags);
-	list_add_tail(&req->task_list, &tsk->uring_work);
-	spin_unlock_irqrestore(&tsk->uring_lock, flags);
-	wake_up_process(tsk);
-	return 1;
+	return __io_async_wake(req, poll, key_to_poll(key));
 }
 
-struct io_poll_table {
-	struct poll_table_struct pt;
-	struct io_kiocb *req;
-	int error;
-};
-
 static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
 			       struct poll_table_struct *p)
 {
 	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
 
-	if (unlikely(pt->req->poll.head)) {
-		pt->error = -EINVAL;
-		return;
-	}
-
-	pt->error = 0;
-	pt->req->poll.head = head;
-	add_wait_queue(head, &pt->req->poll.wait);
+	__io_queue_proc(&pt->req->poll, pt, head);
 }
 
 static void io_poll_req_insert(struct io_kiocb *req)
@@ -4617,11 +4738,13 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_kiocb *linked_timeout;
 	struct io_kiocb *nxt = NULL;
+	int retry_count = 0;
 	int ret;
 
 again:
 	linked_timeout = io_prep_linked_timeout(req);
 
+issue:
 	ret = io_issue_sqe(req, sqe, &nxt, true);
 
 	/*
@@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	 */
 	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
 	    (req->flags & REQ_F_MUST_PUNT))) {
+
+		if (io_arm_poll_handler(req, &retry_count)) {
+			if (retry_count == 1)
+				goto issue;
+			else if (!retry_count)
+				goto done_req;
+			INIT_IO_WORK(&req->work, io_wq_submit_work);
+		}
 punt:
 		if (io_op_defs[req->opcode].file_table) {
 			ret = io_grab_files(req);
@@ -5154,26 +5285,40 @@ void io_uring_task_handler(struct task_struct *tsk)
 {
 	LIST_HEAD(local_list);
 	struct io_kiocb *req;
+	long state;
 
 	spin_lock_irq(&tsk->uring_lock);
 	if (!list_empty(&tsk->uring_work))
 		list_splice_init(&tsk->uring_work, &local_list);
 	spin_unlock_irq(&tsk->uring_lock);
 
+	state = current->state;
+	__set_current_state(TASK_RUNNING);
 	while (!list_empty(&local_list)) {
 		struct io_kiocb *nxt = NULL;
+		void *to_free = NULL;
 
 		req = list_first_entry(&local_list, struct io_kiocb, task_list);
 		list_del(&req->task_list);
 
-		io_poll_task_handler(req, &nxt);
+		if (req->opcode == IORING_OP_POLL_ADD) {
+			io_poll_task_handler(req, &nxt);
+		} else {
+			nxt = req;
+			to_free = req->apoll;
+			WARN_ON_ONCE(!list_empty(&req->apoll->wait.entry));
+		}
 		if (nxt)
 			__io_queue_sqe(nxt, NULL);
 
+		kfree(to_free);
+
 		/* finish next time, if we're out of time slice */
-		if (need_resched())
+		if (need_resched() && !(current->flags & PF_EXITING))
 			break;
 	}
+	WARN_ON_ONCE(current->state != TASK_RUNNING);
+	__set_current_state(state);
 
 	if (!list_empty(&local_list)) {
 		spin_lock_irq(&tsk->uring_lock);
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 27bd9e4f927b..133a16472423 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -357,6 +357,60 @@ TRACE_EVENT(io_uring_submit_sqe,
 			  __entry->force_nonblock, __entry->sq_thread)
 );
 
+TRACE_EVENT(io_uring_poll_arm,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask, int events),
+
+	TP_ARGS(ctx, opcode, user_data, mask, events),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+		__field(  int,		events		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+		__entry->events		= events;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x, events 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask, __entry->events)
+);
+
+TRACE_EVENT(io_uring_poll_wake,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask),
+
+	TP_ARGS(ctx, opcode, user_data, mask),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
diff --git a/kernel/exit.c b/kernel/exit.c
index 2833ffb0c211..988799763f34 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -716,6 +716,7 @@ void __noreturn do_exit(long code)
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
 
+	WARN_ON(!list_empty(&tsk->uring_work));
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:01                                   ` Jens Axboe
@ 2020-02-15  6:32                                     ` Carter Li 李通洲
  2020-02-15 15:11                                       ` Jens Axboe
  2020-02-16 19:06                                     ` Pavel Begunkov
  1 sibling, 1 reply; 27+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  6:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

Really appreciate it! Looking forward to getting it merged into mainline!

By the way, what time is it now in your city? ;-)

Carter

> On Feb 15, 2020, at 2:01 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/14/20 6:27 PM, Jens Axboe wrote:
>> On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
>>> There are at least 2 benefits over POLL->READ
>>> 
>>> 1. Reduce a little complexity of user code, and save lots of sqes.
>>> 2. Better performance. Users can’t tell if an operation will block without
>>> issuing an extra O_NONBLOCK syscall, which ends up with always using the
>>> POLL->READ link. If it’s handled by the kernel, we may only poll when
>>> we know it’s needed.
>> 
>> Exactly, it'll enable the app to do read/recv or write/send without
>> having to worry about anything, and it'll be as efficient as having
>> it linked to a poll command.
> 
> Couldn't help myself... The below is the general direction, but not
> done by any stretch of the imagination. There are a few hacks in there.
> But, in short, it does allow eg send/recv to behave in an async manner
> without needing the thread offload. Tested it with the test/send_recv
> and test/send_recvmsg test apps, and it works there. There seems to be
> some weird issue with eg test/socket-rw, not sure what that is yet.
> 
> Just wanted to throw it out there. It's essentially the same as the
> linked poll, except it's just done internally.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:32                                     ` Carter Li 李通洲
@ 2020-02-15 15:11                                       ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-15 15:11 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 11:32 PM, Carter Li 李通洲 wrote:
> Really appreciate it! Looking forward to getting it merged into mainline!

I'm very excited about it too, I'll keep hammering on it to flesh it
out. I'll only be sporadically available next week though, but we'll
see how it turns out.

> By the way, what time is it now in your city? ;-)

It was 11PM when that went out, could be worse :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:01                                   ` Jens Axboe
  2020-02-15  6:32                                     ` Carter Li 李通洲
@ 2020-02-16 19:06                                     ` Pavel Begunkov
  2020-02-16 22:23                                       ` Jens Axboe
  2020-02-16 23:06                                       ` Jens Axboe
  1 sibling, 2 replies; 27+ messages in thread
From: Pavel Begunkov @ 2020-02-16 19:06 UTC (permalink / raw)
  To: Jens Axboe, Carter Li 李通洲; +Cc: Peter Zijlstra, io-uring

On 15/02/2020 09:01, Jens Axboe wrote:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index fb94b8bac638..530dcd91fa53 100644
> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	 */
>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>  	    (req->flags & REQ_F_MUST_PUNT))) {
> +
> +		if (io_arm_poll_handler(req, &retry_count)) {
> +			if (retry_count == 1)
> +				goto issue;

Better to sqe=NULL before retrying, so it won't re-read sqe and try to init the
req twice.
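
Something like this, just to illustrate (untested):

		if (io_arm_poll_handler(req, &retry_count)) {
			if (retry_count == 1) {
				/* req is already initialised, don't re-read the sqe */
				sqe = NULL;
				goto issue;
			} else if (!retry_count) {
				goto done_req;
			}
			INIT_IO_WORK(&req->work, io_wq_submit_work);
		}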

Also, the second sync issue may return -EAGAIN again, and as I remember,
read/write/etc will try to copy the iovec into req->io. But the iovec is
already in req->io, so it would memcpy() onto itself. Not a good thing.

> +			else if (!retry_count)
> +				goto done_req;
> +			INIT_IO_WORK(&req->work, io_wq_submit_work);

It's not nice to reset it like this:
- prep() could set some work.flags
- a custom work.func is more performant (the reset adds an extra switch)
- some requests may rely on the specified work.func being called, e.g. close(),
even though it doesn't participate in the scheme

> +		}
>  punt:
>  		if (io_op_defs[req->opcode].file_table) {
>  			ret = io_grab_files(req);
> @@ -5154,26 +5285,40 @@ void io_uring_task_handler(struct task_struct *tsk)
>  {
>  	LIST_HEAD(local_list);
>  	struct io_kiocb *req;
> +	long state;
>  
>  	spin_lock_irq(&tsk->uring_lock);
>  	if (!list_empty(&tsk->uring_work))

-- 
Pavel Begunkov



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 19:06                                     ` Pavel Begunkov
@ 2020-02-16 22:23                                       ` Jens Axboe
  2020-02-16 23:06                                       ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-16 22:23 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 12:06 PM, Pavel Begunkov wrote:
> On 15/02/2020 09:01, Jens Axboe wrote:
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index fb94b8bac638..530dcd91fa53 100644
>> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  	 */
>>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>>  	    (req->flags & REQ_F_MUST_PUNT))) {
>> +
>> +		if (io_arm_poll_handler(req, &retry_count)) {
>> +			if (retry_count == 1)
>> +				goto issue;
> 
> Better to sqe=NULL before retrying, so it won't re-read sqe and try to
> init the req twice.

Good point, that should get cleared after issue.

> Also, the second sync issue may return -EAGAIN again, and as I remember,
> read/write/etc will try to copy the iovec into req->io. But the iovec is
> already in req->io, so it would memcpy() onto itself. Not a good thing.

I'll look into those details, that has indeed reared its head before.

>> +			else if (!retry_count)
>> +				goto done_req;
>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
> 
> It's not nice to reset it like this:
> - prep() could set some work.flags
> - a custom work.func is more performant (the reset adds an extra switch)
> - some requests may rely on the specified work.func being called, e.g. close(),
> even though it doesn't participate in the scheme

It's totally a hack as-is for the "can't do it, go async". I did clean
this up a bit (if you check the git version, it's changed quite a bit),
but it's still a mess in terms of that and ->work vs union ownership.
The commit message also has a note about that.

So more work needed in that area for sure.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 19:06                                     ` Pavel Begunkov
  2020-02-16 22:23                                       ` Jens Axboe
@ 2020-02-16 23:06                                       ` Jens Axboe
  2020-02-16 23:07                                         ` Jens Axboe
  1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2020-02-16 23:06 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 12:06 PM, Pavel Begunkov wrote:
> Also, the second sync issue may return -EAGAIN again, and as I remember,
> read/write/etc will try to copy the iovec into req->io. But the iovec is
> already in req->io, so it would memcpy() onto itself. Not a good thing.

That got fixed, it's no longer doing that anymore.

>> +			else if (!retry_count)
>> +				goto done_req;
>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
> 
> It's not nice to reset it like this:
> - prep() could set some work.flags
> - a custom work.func is more performant (the reset adds an extra switch)
> - some requests may rely on the specified work.func being called, e.g. close(),
> even though it doesn't participate in the scheme

For now I just retain a copy of ->work, seems to be the easiest solution
vs trying to track this state.
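
Roughly the shape of it (naming here is just illustrative, the actual version
is in the git branch):

	/* stash the prepared ->work before the poll side reuses the union */
	struct io_wq_work tmp = req->work;

	if (!io_arm_poll_handler(req, &retry_count)) {
		/* restore whatever prep() set up before punting to io-wq */
		req->work = tmp;
		goto punt;
	}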

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 23:06                                       ` Jens Axboe
@ 2020-02-16 23:07                                         ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2020-02-16 23:07 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 4:06 PM, Jens Axboe wrote:
>>> +			else if (!retry_count)
>>> +				goto done_req;
>>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
>>
>> It's not nice to reset it like this:
>> - prep() could set some work.flags
>> - a custom work.func is more performant (the reset adds an extra switch)
>> - some requests may rely on the specified work.func being called, e.g. close(),
>> even though it doesn't participate in the scheme
> 
> For now I just retain a copy of ->work, seems to be the easiest solution
> vs trying to track this state.

Should mention this isn't quite enough, we really need to drop anything
we have in ->work as well if it was already prepared/grabbed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

Thread overview: 27+ messages
2020-02-12 16:31 [ISSUE] The time cost of IOSQE_IO_LINK Carter Li 李通洲
2020-02-12 17:11 ` Jens Axboe
2020-02-12 17:22   ` Jens Axboe
2020-02-12 17:29     ` Jens Axboe
2020-02-13  0:33   ` Carter Li 李通洲
2020-02-13 15:08     ` Pavel Begunkov
2020-02-13 15:14       ` Jens Axboe
2020-02-13 15:51         ` Carter Li 李通洲
2020-02-14  1:25           ` Carter Li 李通洲
2020-02-14  2:45             ` Jens Axboe
2020-02-14  5:03               ` Jens Axboe
2020-02-14 15:32                 ` Peter Zijlstra
2020-02-14 15:47                   ` Jens Axboe
2020-02-14 16:18                     ` Jens Axboe
2020-02-14 17:52                       ` Jens Axboe
2020-02-14 20:44                         ` Jens Axboe
2020-02-15  0:16                           ` Carter Li 李通洲
2020-02-15  1:10                             ` Jens Axboe
2020-02-15  1:25                               ` Carter Li 李通洲
2020-02-15  1:27                                 ` Jens Axboe
2020-02-15  6:01                                   ` Jens Axboe
2020-02-15  6:32                                     ` Carter Li 李通洲
2020-02-15 15:11                                       ` Jens Axboe
2020-02-16 19:06                                     ` Pavel Begunkov
2020-02-16 22:23                                       ` Jens Axboe
2020-02-16 23:06                                       ` Jens Axboe
2020-02-16 23:07                                         ` Jens Axboe
