IO-Uring Archive on lore.kernel.org
 help / color / Atom feed
* [ISSUE] The time cost of IOSQE_IO_LINK
@ 2020-02-12 16:31 Carter Li 李通洲
  2020-02-12 17:11 ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-12 16:31 UTC (permalink / raw)
  To: io-uring

Hi everyone,

IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.

Test code attached below. The program completes after getting 100000000 cqes.

$ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
$ time ./test0
USE_LINK: 0, count: 100000000, submit_count: 1562500
0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
0inputs+0outputs (0major+72minor)pagefaults 0swaps

$ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
$ time ./test1
USE_LINK: 1, count: 100000110, submit_count: 799584
0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
0inputs+0outputs (0major+72minor)pagefaults 0swaps

As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
please have a check.

Thanks,
Carter Li

---

#include <liburing.h>
#include <stdio.h>

#define ENTRY_SIZE 128
#define MAX_COUNT 100000000

#ifndef USE_LINK
#	define USE_LINK 0
#endif

int main(void) {
	struct io_uring ring;
	unsigned i, head;
	unsigned count = 0, cqe_count = 0, submit_count = 0;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	if (io_uring_queue_init(ENTRY_SIZE, &ring, 0) < 0) {
		return -1;
	}

	for (i = 0; i < ENTRY_SIZE / 2; ++i) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
#if USE_LINK
		io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
		sqe->user_data = 1;
#endif
	}

	while (count < MAX_COUNT) {
		io_uring_submit_and_wait(&ring, 1);
		++submit_count;

		io_uring_for_each_cqe(&ring, head, cqe) {
			++cqe_count;
			++count;
#if USE_LINK
			if (cqe->user_data) {
				sqe = io_uring_get_sqe(&ring);
				io_uring_prep_nop(sqe);
				io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
				sqe = io_uring_get_sqe(&ring);
				io_uring_prep_nop(sqe);
				sqe->user_data = 1;
			}
#else
			sqe = io_uring_get_sqe(&ring);
			io_uring_prep_nop(sqe);
			sqe->user_data = !cqe->user_data;
#endif
		}
		io_uring_cq_advance(&ring, cqe_count);
		cqe_count = 0;
	}

	printf("USE_LINK: %d, count: %u, submit_count: %u\n\n",
		USE_LINK, count, submit_count);
}



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 16:31 [ISSUE] The time cost of IOSQE_IO_LINK Carter Li 李通洲
@ 2020-02-12 17:11 ` Jens Axboe
  2020-02-12 17:22   ` Jens Axboe
  2020-02-13  0:33   ` Carter Li 李通洲
  0 siblings, 2 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-12 17:11 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
> Hi everyone,
> 
> IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.
> 
> Test code attached below. The program completes after getting 100000000 cqes.
> 
> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
> $ time ./test0
> USE_LINK: 0, count: 100000000, submit_count: 1562500
> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
> 
> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
> $ time ./test1
> USE_LINK: 1, count: 100000110, submit_count: 799584
> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
> 
> As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
> of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
> please have a check.

The nop isn't really a good test case, as it doesn't contain any smarts
in terms of executing a link fast. So it doesn't say a whole lot outside
of "we could make nop links faster", which is also kind of pointless.

"Normal" commands will work better. Where the link is really a win is if
the first request needs to go async to complete. For that case, the
next link can execute directly from that context. This saves an async
punt for the common case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:11 ` Jens Axboe
@ 2020-02-12 17:22   ` Jens Axboe
  2020-02-12 17:29     ` Jens Axboe
  2020-02-13  0:33   ` Carter Li 李通洲
  1 sibling, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-12 17:22 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 10:11 AM, Jens Axboe wrote:
> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>> Hi everyone,
>>
>> IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.
>>
>> Test code attached below. The program completes after getting 100000000 cqes.
>>
>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>> $ time ./test0
>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>
>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>> $ time ./test1
>> USE_LINK: 1, count: 100000110, submit_count: 799584
>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>
>> As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
>> of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
>> please have a check.
> 
> The nop isn't really a good test case, as it doesn't contain any smarts
> in terms of executing a link fast. So it doesn't say a whole lot outside
> of "we could make nop links faster", which is also kind of pointless.
> 
> "Normal" commands will work better. Where the link is really a win is if
> the first request needs to go async to complete. For that case, the
> next link can execute directly from that context. This saves an async
> punt for the common case.

Case in point, if I just add the below patch, we're a lot closer:

[root@archlinux liburing]# time test/nop-link 0
Using link: 0
count: 100000000, submit_count: 1562500


real	0m7.934s
user	0m0.740s
sys	0m7.157s
[root@archlinux liburing]# time test/nop-link 1
Using link: 1
count: 100000000, submit_count: 781250


real	0m9.009s
user	0m0.710s
sys	0m8.264s

The links are still a bit slower, which is to be expected as the
nop basically just completes, it doesn't do anything at all and
it never needs to go async.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9f00f30e1790..d0f645be91cb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2399,7 +2399,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
 /*
  * IORING_OP_NOP just posts a completion event, nothing else.
  */
-static int io_nop(struct io_kiocb *req)
+static int io_nop(struct io_kiocb *req, struct io_kiocb **nxt)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
@@ -2407,7 +2407,7 @@ static int io_nop(struct io_kiocb *req)
 		return -EINVAL;
 
 	io_cqring_add_event(req, 0);
-	io_put_req(req);
+	io_put_req_find_next(req, nxt);
 	return 0;
 }
 
@@ -4355,7 +4355,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	switch (req->opcode) {
 	case IORING_OP_NOP:
-		ret = io_nop(req);
+		ret = io_nop(req, nxt);
 		break;
 	case IORING_OP_READV:
 	case IORING_OP_READ_FIXED:

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:22   ` Jens Axboe
@ 2020-02-12 17:29     ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-12 17:29 UTC (permalink / raw)
  To: Carter Li 李通洲, io-uring

On 2/12/20 10:22 AM, Jens Axboe wrote:
> On 2/12/20 10:11 AM, Jens Axboe wrote:
>> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>>> Hi everyone,
>>>
>>> IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.
>>>
>>> Test code attached below. The program completes after getting 100000000 cqes.
>>>
>>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>>> $ time ./test0
>>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>>> $ time ./test1
>>> USE_LINK: 1, count: 100000110, submit_count: 799584
>>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
>>> of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
>>> please have a check.
>>
>> The nop isn't really a good test case, as it doesn't contain any smarts
>> in terms of executing a link fast. So it doesn't say a whole lot outside
>> of "we could make nop links faster", which is also kind of pointless.
>>
>> "Normal" commands will work better. Where the link is really a win is if
>> the first request needs to go async to complete. For that case, the
>> next link can execute directly from that context. This saves an async
>> punt for the common case.
> 
> Case in point, if I just add the below patch, we're a lot closer:
> 
> [root@archlinux liburing]# time test/nop-link 0
> Using link: 0
> count: 100000000, submit_count: 1562500
> 
> 
> real	0m7.934s
> user	0m0.740s
> sys	0m7.157s
> [root@archlinux liburing]# time test/nop-link 1
> Using link: 1
> count: 100000000, submit_count: 781250
> 
> 
> real	0m9.009s
> user	0m0.710s
> sys	0m8.264s
> 
> The links are still a bit slower, which is to be expected as the
> nop basically just completes, it doesn't do anything at all and
> it never needs to go async.

Pinning the test for more reliable results and we're basically even.

[root@archlinux liburing]# time taskset -c 0 test/nop-link 1
Using link: 1
count: 100000000, submit_count: 781250


real	0m8.251s
user	0m0.680s
sys	0m7.536s

[root@archlinux liburing]# time taskset -c 0 test/nop-link 0
Using link: 0
count: 100000000, submit_count: 1562500


real	0m7.986s
user	0m0.610s
sys	0m7.340s

For the intended case (outlined above), it'll definitely be a
win.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-12 17:11 ` Jens Axboe
  2020-02-12 17:22   ` Jens Axboe
@ 2020-02-13  0:33   ` Carter Li 李通洲
  2020-02-13 15:08     ` Pavel Begunkov
  1 sibling, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-13  0:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Thanks for your reply.

You are right the nop isn't really a good test case. But I actually
found this issue when benchmarking my echo server, which didn't use
NOP of course.

Test case attached below. Use rust_echo_bench for benchmarking.
https://github.com/haraldh/rust_echo_bench


$ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=0
$ ./link_recv 12345
$ cargo run --release # On another console
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Speed: 168264 request/sec, 168264 response/sec
Requests: 10095846
Responses: 10095844

$ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=1
$ ./link_recv 12345
$ cargo run --release # On another console
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Speed: 112666 request/sec, 112666 response/sec
Requests: 6760009
Responses: 6759975


I think `POLL_ADD(POLLIN)-RECV` and `POLL_ADD(POLLOUT)-SEND` are common use cases for networking ( for some reason a short read for SEND is not considered an error, `RECV-SEND` cannot be used in a link chain ). RECV/SEND won't block after polled. I expect better performance for fewer io_uring_enter syscalls. Could you please have a check with it?

Another more complex test case `POLL_ADD-READ_FIXED-WRITE_FIXED` I have posted on Github, which currently results in freeze.

https://github.com/axboe/liburing/issues/71

Carter

---

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/socket.h>
#include <sys/poll.h>
#include <netinet/in.h>

#include <liburing.h>

#define BACKLOG 128
#define MAX_MESSAGE_LEN 1024
#define MAX_CONNECTIONS 1024
#ifndef USE_LINK
#   define USE_LINK 0
#endif

enum { ACCEPT, POLL, READ, WRITE };

struct conn_info {
    __u32 fd;
    __u32 type;
};

typedef char buf_type[MAX_CONNECTIONS][MAX_MESSAGE_LEN];

static struct io_uring ring;
static unsigned cqe_count = 0;

int init_socket(int portno) {
    int sock_listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (sock_listen_fd < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in server_addr = {
        .sin_family = AF_INET,
        .sin_port = htons(portno),
        .sin_addr = {
            .s_addr = INADDR_ANY,
        },
    };

    if (bind(sock_listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
        perror("bind");
        return -1;
    }

    if (listen(sock_listen_fd, BACKLOG) < 0) {
        perror("listen");
        return -1;
    }

    return sock_listen_fd;
}

static struct io_uring_sqe* get_sqe_safe() {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (__builtin_expect(!!sqe, 1)) {
        return sqe;
    } else {
        io_uring_cq_advance(&ring, cqe_count);
        cqe_count = 0;
        io_uring_submit(&ring);
        return io_uring_get_sqe(&ring);
    }
}

static void add_accept(int fd, struct sockaddr *client_addr, socklen_t *client_len) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = ACCEPT,
    };

    io_uring_prep_accept(sqe, fd, client_addr, client_len, 0);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_poll(int fd, int poll_mask, unsigned flags) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = POLL,
    };

    io_uring_prep_poll_add(sqe, fd, poll_mask);
    io_uring_sqe_set_flags(sqe, flags);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_socket_read(int fd, size_t size, buf_type *bufs) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = READ,
    };

    io_uring_prep_recv(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

static void add_socket_write(int fd, size_t size, buf_type *bufs, unsigned flags) {
    struct io_uring_sqe *sqe = get_sqe_safe();
    struct conn_info conn_i = {
        .fd = fd,
        .type = WRITE,
    };

    io_uring_prep_send(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
    io_uring_sqe_set_flags(sqe, flags);
    memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Please give a port number: %s [port]\n", argv[0]);
        return 1;
    }

    int portno = strtol(argv[1], NULL, 10);
    int sock_listen_fd = init_socket(portno);
    if (sock_listen_fd < 0) return -1;
    printf("io_uring echo server listening for connections on port: %d\n", portno);


    int ret = io_uring_queue_init(BACKLOG, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return -1;
    }

    buf_type *bufs = (buf_type *)malloc(sizeof(*bufs));

    struct sockaddr_in client_addr;
    socklen_t client_len = sizeof(client_addr);
    add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);

    while (1) {
        io_uring_submit_and_wait(&ring, 1);

        struct io_uring_cqe *cqe;
        unsigned head;

        io_uring_for_each_cqe(&ring, head, cqe) {
            ++cqe_count;

            struct conn_info conn_i;
            memcpy(&conn_i, &cqe->user_data, sizeof(conn_i));
            int result = cqe->res;

            switch (conn_i.type) {
            case ACCEPT:
#if USE_LINK
                add_poll(result, POLLIN, IOSQE_IO_LINK);
                add_socket_read(result, MAX_MESSAGE_LEN, bufs);
#else
                add_poll(result, POLLIN, 0);
#endif
                add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
                break;

#if !USE_LINK
            case POLL:
                add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
                break;
#endif

            case READ:
                if (__builtin_expect(result <= 0, 0)) {
                    shutdown(conn_i.fd, SHUT_RDWR);
                } else {
                    add_socket_write(conn_i.fd, result, bufs, 0);
                }
                break;

            case WRITE:
#if USE_LINK
                add_poll(conn_i.fd, POLLIN, IOSQE_IO_LINK);
                add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
#else
                add_poll(conn_i.fd, POLLIN, 0);
#endif
                break;
            }
        }

        io_uring_cq_advance(&ring, cqe_count);
        cqe_count = 0;
    }


    close(sock_listen_fd);
    free(bufs);
}



> 2020年2月13日 上午1:11,Jens Axboe <axboe@kernel.dk> 写道:
> 
> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>> Hi everyone,
>> 
>> IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.
>> 
>> Test code attached below. The program completes after getting 100000000 cqes.
>> 
>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>> $ time ./test0
>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>> 
>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>> $ time ./test1
>> USE_LINK: 1, count: 100000110, submit_count: 799584
>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>> 
>> As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
>> of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
>> please have a check.
> 
> The nop isn't really a good test case, as it doesn't contain any smarts
> in terms of executing a link fast. So it doesn't say a whole lot outside
> of "we could make nop links faster", which is also kind of pointless.
> 
> "Normal" commands will work better. Where the link is really a win is if
> the first request needs to go async to complete. For that case, the
> next link can execute directly from that context. This saves an async
> punt for the common case.
> 
> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13  0:33   ` Carter Li 李通洲
@ 2020-02-13 15:08     ` Pavel Begunkov
  2020-02-13 15:14       ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Pavel Begunkov @ 2020-02-13 15:08 UTC (permalink / raw)
  To: Carter Li 李通洲, Jens Axboe; +Cc: io-uring

On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
> Thanks for your reply.
> 
> You are right the nop isn't really a good test case. But I actually
> found this issue when benchmarking my echo server, which didn't use
> NOP of course.

If there are no hidden subtle issues in io_uring, your benchmark or the
used pattern itself, it's probably because of overhead on async punting
(copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
io-wq itself).

I was going to tune async/punting stuff anyway, so I'll look into this.
And of course, there is always a good chance Jens have some bright insights

BTW, what's benefit of doing poll(fd)->read(fd), but not directly read()?

> Test case attached below. Use rust_echo_bench for benchmarking.
> https://github.com/haraldh/rust_echo_bench
> 
> 
> $ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=0
> $ ./link_recv 12345
> $ cargo run --release # On another console
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Speed: 168264 request/sec, 168264 response/sec
> Requests: 10095846
> Responses: 10095844
> 
> $ gcc link_recv.c -o link_recv -luring -O3 -DUSE_LINK=1
> $ ./link_recv 12345
> $ cargo run --release # On another console
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Speed: 112666 request/sec, 112666 response/sec
> Requests: 6760009
> Responses: 6759975
> 
> 
> I think `POLL_ADD(POLLIN)-RECV` and `POLL_ADD(POLLOUT)-SEND` are common use cases for networking ( for some reason a short read for SEND is not considered an error, `RECV-SEND` cannot be used in a link chain ). RECV/SEND won't block after polled. I expect better performance for fewer io_uring_enter syscalls. Could you please have a check with it?
> 
> Another more complex test case `POLL_ADD-READ_FIXED-WRITE_FIXED` I have posted on Github, which currently results in freeze.
> 
> https://github.com/axboe/liburing/issues/71
> 
> Carter
> 
> ---
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #include <sys/socket.h>
> #include <sys/poll.h>
> #include <netinet/in.h>
> 
> #include <liburing.h>
> 
> #define BACKLOG 128
> #define MAX_MESSAGE_LEN 1024
> #define MAX_CONNECTIONS 1024
> #ifndef USE_LINK
> #   define USE_LINK 0
> #endif
> 
> enum { ACCEPT, POLL, READ, WRITE };
> 
> struct conn_info {
>     __u32 fd;
>     __u32 type;
> };
> 
> typedef char buf_type[MAX_CONNECTIONS][MAX_MESSAGE_LEN];
> 
> static struct io_uring ring;
> static unsigned cqe_count = 0;
> 
> int init_socket(int portno) {
>     int sock_listen_fd = socket(AF_INET, SOCK_STREAM, 0);
>     if (sock_listen_fd < 0) {
>         perror("socket");
>         return -1;
>     }
> 
>     struct sockaddr_in server_addr = {
>         .sin_family = AF_INET,
>         .sin_port = htons(portno),
>         .sin_addr = {
>             .s_addr = INADDR_ANY,
>         },
>     };
> 
>     if (bind(sock_listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
>         perror("bind");
>         return -1;
>     }
> 
>     if (listen(sock_listen_fd, BACKLOG) < 0) {
>         perror("listen");
>         return -1;
>     }
> 
>     return sock_listen_fd;
> }
> 
> static struct io_uring_sqe* get_sqe_safe() {
>     struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
>     if (__builtin_expect(!!sqe, 1)) {
>         return sqe;
>     } else {
>         io_uring_cq_advance(&ring, cqe_count);
>         cqe_count = 0;
>         io_uring_submit(&ring);
>         return io_uring_get_sqe(&ring);
>     }
> }
> 
> static void add_accept(int fd, struct sockaddr *client_addr, socklen_t *client_len) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = ACCEPT,
>     };
> 
>     io_uring_prep_accept(sqe, fd, client_addr, client_len, 0);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_poll(int fd, int poll_mask, unsigned flags) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = POLL,
>     };
> 
>     io_uring_prep_poll_add(sqe, fd, poll_mask);
>     io_uring_sqe_set_flags(sqe, flags);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_socket_read(int fd, size_t size, buf_type *bufs) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = READ,
>     };
> 
>     io_uring_prep_recv(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> static void add_socket_write(int fd, size_t size, buf_type *bufs, unsigned flags) {
>     struct io_uring_sqe *sqe = get_sqe_safe();
>     struct conn_info conn_i = {
>         .fd = fd,
>         .type = WRITE,
>     };
> 
>     io_uring_prep_send(sqe, fd, (*bufs)[fd], size, MSG_NOSIGNAL);
>     io_uring_sqe_set_flags(sqe, flags);
>     memcpy(&sqe->user_data, &conn_i, sizeof(conn_i));
> }
> 
> int main(int argc, char *argv[]) {
>     if (argc < 2) {
>         fprintf(stderr, "Please give a port number: %s [port]\n", argv[0]);
>         return 1;
>     }
> 
>     int portno = strtol(argv[1], NULL, 10);
>     int sock_listen_fd = init_socket(portno);
>     if (sock_listen_fd < 0) return -1;
>     printf("io_uring echo server listening for connections on port: %d\n", portno);
> 
> 
>     int ret = io_uring_queue_init(BACKLOG, &ring, 0);
>     if (ret < 0) {
>         fprintf(stderr, "queue_init: %s\n", strerror(-ret));
>         return -1;
>     }
> 
>     buf_type *bufs = (buf_type *)malloc(sizeof(*bufs));
> 
>     struct sockaddr_in client_addr;
>     socklen_t client_len = sizeof(client_addr);
>     add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
> 
>     while (1) {
>         io_uring_submit_and_wait(&ring, 1);
> 
>         struct io_uring_cqe *cqe;
>         unsigned head;
> 
>         io_uring_for_each_cqe(&ring, head, cqe) {
>             ++cqe_count;
> 
>             struct conn_info conn_i;
>             memcpy(&conn_i, &cqe->user_data, sizeof(conn_i));
>             int result = cqe->res;
> 
>             switch (conn_i.type) {
>             case ACCEPT:
> #if USE_LINK
>                 add_poll(result, POLLIN, IOSQE_IO_LINK);
>                 add_socket_read(result, MAX_MESSAGE_LEN, bufs);
> #else
>                 add_poll(result, POLLIN, 0);
> #endif
>                 add_accept(sock_listen_fd, (struct sockaddr *)&client_addr, &client_len);
>                 break;
> 
> #if !USE_LINK
>             case POLL:
>                 add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
>                 break;
> #endif
> 
>             case READ:
>                 if (__builtin_expect(result <= 0, 0)) {
>                     shutdown(conn_i.fd, SHUT_RDWR);
>                 } else {
>                     add_socket_write(conn_i.fd, result, bufs, 0);
>                 }
>                 break;
> 
>             case WRITE:
> #if USE_LINK
>                 add_poll(conn_i.fd, POLLIN, IOSQE_IO_LINK);
>                 add_socket_read(conn_i.fd, MAX_MESSAGE_LEN, bufs);
> #else
>                 add_poll(conn_i.fd, POLLIN, 0);
> #endif
>                 break;
>             }
>         }
> 
>         io_uring_cq_advance(&ring, cqe_count);
>         cqe_count = 0;
>     }
> 
> 
>     close(sock_listen_fd);
>     free(bufs);
> }
> 
> 
> 
>> 2020年2月13日 上午1:11,Jens Axboe <axboe@kernel.dk> 写道:
>>
>> On 2/12/20 9:31 AM, Carter Li 李通洲 wrote:
>>> Hi everyone,
>>>
>>> IOSQE_IO_LINK seems to have very high cost, even greater then io_uring_enter syscall.
>>>
>>> Test code attached below. The program completes after getting 100000000 cqes.
>>>
>>> $ gcc test.c -luring -o test0 -g -O3 -DUSE_LINK=0
>>> $ time ./test0
>>> USE_LINK: 0, count: 100000000, submit_count: 1562500
>>> 0.99user 9.99system 0:11.02elapsed 99%CPU (0avgtext+0avgdata 1608maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> $ gcc test.c -luring -o test1 -g -O3 -DUSE_LINK=1
>>> $ time ./test1
>>> USE_LINK: 1, count: 100000110, submit_count: 799584
>>> 0.83user 19.21system 0:20.90elapsed 95%CPU (0avgtext+0avgdata 1632maxresident)k
>>> 0inputs+0outputs (0major+72minor)pagefaults 0swaps
>>>
>>> As you can see, the `-DUSE_LINK=1` version emits only about half io_uring_submit calls
>>> of the other version, but takes twice as long. That makes IOSQE_IO_LINK almost useless,
>>> please have a check.
>>
>> The nop isn't really a good test case, as it doesn't contain any smarts
>> in terms of executing a link fast. So it doesn't say a whole lot outside
>> of "we could make nop links faster", which is also kind of pointless.
>>
>> "Normal" commands will work better. Where the link is really a win is if
>> the first request needs to go async to complete. For that case, the
>> next link can execute directly from that context. This saves an async
>> punt for the common case.
>>
>> -- 
>> Jens Axboe
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:08     ` Pavel Begunkov
@ 2020-02-13 15:14       ` Jens Axboe
  2020-02-13 15:51         ` Carter Li 李通洲
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-13 15:14 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲; +Cc: io-uring

On 2/13/20 8:08 AM, Pavel Begunkov wrote:
> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>> Thanks for your reply.
>>
>> You are right the nop isn't really a good test case. But I actually
>> found this issue when benchmarking my echo server, which didn't use
>> NOP of course.
> 
> If there are no hidden subtle issues in io_uring, your benchmark or the
> used pattern itself, it's probably because of overhead on async punting
> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
> io-wq itself).
> 
> I was going to tune async/punting stuff anyway, so I'll look into this.
> And of course, there is always a good chance Jens have some bright insights

The main issue here is that if you do the poll->recv, then it'll be
an automatic punt of the recv to async context when the poll completes.
That's regardless of whether or not we can complete the poll inline,
we never attempt to recv inline from that completion. This is in contrast
to doing a separate poll, getting the notification, then doing another
sqe and io_uring_enter to perform the recv. For this case, we end up
doing everything inline, just with the cost of an additional system call
to submit the new recv.

It'd be really cool if we could improve on this situation, as recv (or
read) preceded by a poll is indeed a common use case. Or ditto for the
write side.

> BTW, what's benefit of doing poll(fd)->read(fd), but not directly read()?

If there's no data to begin with, then the read will go async. Hence
it'll be a switch to a worker thread. The above should avoid it, but
it doesn't.

For carter's sake, it's worth nothing that the poll command is special
and normal requests would be more efficient with links. We just need
to work on making the poll linked with read/write perform much better.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:14       ` Jens Axboe
@ 2020-02-13 15:51         ` Carter Li 李通洲
  2020-02-14  1:25           ` Carter Li 李通洲
  0 siblings, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-13 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring



> 2020年2月13日 下午11:14,Jens Axboe <axboe@kernel.dk> 写道:
> 
> On 2/13/20 8:08 AM, Pavel Begunkov wrote:
>> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>>> Thanks for your reply.
>>> 
>>> You are right the nop isn't really a good test case. But I actually
>>> found this issue when benchmarking my echo server, which didn't use
>>> NOP of course.
>> 
>> If there are no hidden subtle issues in io_uring, your benchmark or the
>> used pattern itself, it's probably because of overhead on async punting
>> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
>> io-wq itself).
>> 
>> I was going to tune async/punting stuff anyway, so I'll look into this.
>> And of course, there is always a good chance Jens have some bright insights
> 
> The main issue here is that if you do the poll->recv, then it'll be
> an automatic punt of the recv to async context when the poll completes.
> That's regardless of whether or not we can complete the poll inline,
> we never attempt to recv inline from that completion. This is in contrast
> to doing a separate poll, getting the notification, then doing another
> sqe and io_uring_enter to perform the recv. For this case, we end up
> doing everything inline, just with the cost of an additional system call
> to submit the new recv.
> 
> It'd be really cool if we could improve on this situation, as recv (or
> read) preceded by a poll is indeed a common use case. Or ditto for the
> write side.
> 
>> BTW, what's benefit of doing poll(fd)->read(fd), but not directly read()?
> 
> If there's no data to begin with, then the read will go async. Hence
> it'll be a switch to a worker thread. The above should avoid it, but
> it doesn't.

Yes. I actually tested `directly read()` first, and found it was about 30%
slower then poll(fd)->read(fd).

https://github.com/axboe/liburing/issues/69

So it turns out that async punting has high overhead. A (silly) question:
could we implement read/write operations that would block as poll->read/write?


> 
> For carter's sake, it's worth nothing that the poll command is special
> and normal requests would be more efficient with links. We just need
> to work on making the poll linked with read/write perform much better.

Thanks

> 
> -- 
> Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-13 15:51         ` Carter Li 李通洲
@ 2020-02-14  1:25           ` Carter Li 李通洲
  2020-02-14  2:45             ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-14  1:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring

Another suggestion: we should always try completing operations inline
unless IOSQE_ASYNC is specified, no matter if the operations are preceded
by a poll.

Carter

> 2020年2月13日 下午11:51,Carter Li 李通洲 <carter.li@eoitek.com> 写道:
> 
> 
> 
>> 2020年2月13日 下午11:14,Jens Axboe <axboe@kernel.dk> 写道:
>> 
>> On 2/13/20 8:08 AM, Pavel Begunkov wrote:
>>> On 2/13/2020 3:33 AM, Carter Li 李通洲 wrote:
>>>> Thanks for your reply.
>>>> 
>>>> You are right the nop isn't really a good test case. But I actually
>>>> found this issue when benchmarking my echo server, which didn't use
>>>> NOP of course.
>>> 
>>> If there are no hidden subtle issues in io_uring, your benchmark or the
>>> used pattern itself, it's probably because of overhead on async punting
>>> (copying iovecs, several extra switches, refcounts, grabbing mm/fs/etc,
>>> io-wq itself).
>>> 
>>> I was going to tune async/punting stuff anyway, so I'll look into this.
>>> And of course, there is always a good chance Jens have some bright insights
>> 
>> The main issue here is that if you do the poll->recv, then it'll be
>> an automatic punt of the recv to async context when the poll completes.
>> That's regardless of whether or not we can complete the poll inline,
>> we never attempt to recv inline from that completion. This is in contrast
>> to doing a separate poll, getting the notification, then doing another
>> sqe and io_uring_enter to perform the recv. For this case, we end up
>> doing everything inline, just with the cost of an additional system call
>> to submit the new recv.
>> 
>> It'd be really cool if we could improve on this situation, as recv (or
>> read) preceded by a poll is indeed a common use case. Or ditto for the
>> write side.
>> 
>>> BTW, what's benefit of doing poll(fd)->read(fd), but not directly read()?
>> 
>> If there's no data to begin with, then the read will go async. Hence
>> it'll be a switch to a worker thread. The above should avoid it, but
>> it doesn't.
> 
> Yes. I actually tested `directly read()` first, and found it was about 30%
> slower then poll(fd)->read(fd).
> 
> https://github.com/axboe/liburing/issues/69
> 
> So it turns out that async punting has high overhead. A (silly) question:
> could we implement read/write operations that would block as poll->read/write?
> 
> 
>> 
>> For carter's sake, it's worth nothing that the poll command is special
>> and normal requests would be more efficient with links. We just need
>> to work on making the poll linked with read/write perform much better.
> 
> Thanks
> 
>> 
>> -- 
>> Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  1:25           ` Carter Li 李通洲
@ 2020-02-14  2:45             ` Jens Axboe
  2020-02-14  5:03               ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-14  2:45 UTC (permalink / raw)
  To: Carter Li 李通洲; +Cc: Pavel Begunkov, io-uring

On 2/13/20 6:25 PM, Carter Li 李通洲 wrote:
> Another suggestion: we should always try completing operations inline
> unless IOSQE_ASYNC is specified, no matter if the operations are preceded
> by a poll.

Yes that's a given, the problem is that the poll completion is run
inside the waitqueue handler, and with that locked and interrupts
disabled. So it's not quite that easy, unfortunately.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  2:45             ` Jens Axboe
@ 2020-02-14  5:03               ` Jens Axboe
  2020-02-14 15:32                 ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-14  5:03 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Pavel Begunkov, io-uring, Peter Zijlstra

On 2/13/20 7:45 PM, Jens Axboe wrote:
> On 2/13/20 6:25 PM, Carter Li 李通洲 wrote:
>> Another suggestion: we should always try completing operations inline
>> unless IOSQE_ASYNC is specified, no matter if the operations are preceded
>> by a poll.
> 
> Yes that's a given, the problem is that the poll completion is run
> inside the waitqueue handler, and with that locked and interrupts
> disabled. So it's not quite that easy, unfortunately.

Super hack alert... While we can't do it from within the wakeup handler,
we can ensure that the task runs the work when it is scheduled. This
patch attempts to do that - when we run the poll wakeup handler inline
and we have a linked request, instead of just sending that async, we
queue it up in our task uring work log. When the task is scheduled in,
the backlog is run. This achieves the desired result of having the next
work item done inline, instead of having a worker thread do it.

CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
currently do something like this? Only thing I'm currently aware of is
the preempt in/out notifiers, but they don't quite provide what I need,
since I need to pass some data (a request) as well.

The full detail on what I'm trying here is:

io_uring can have linked requests. One obvious use case for that is to
queue a POLLIN on a socket, and then link a read/recv to that. When the
poll completes, we want to run the read/recv. io_uring hooks into the
waitqueue wakeup handler to finish the poll request, and since we're
deep in waitqueue wakeup code, it queues the linked read/recv for
execution via an async thread. This is not optimal, obviously, as it
relies on a switch to a new thread to perform this read. This hack
queues a backlog to the task itself, and runs it when it's scheduled in.
Probably want to do the same for sched out as well, currently I just
hack that in the io_uring wait part...

This nets me about a 20% boost running the echo example that Carter
supplied.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 56799be66b49..e100406f4842 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -574,6 +574,7 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
 	struct io_wq_work	work;
 };
 
@@ -931,6 +932,8 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
 	}
 	if (!req->work.task_pid)
 		req->work.task_pid = task_pid_vnr(current);
+	if (!req->task)
+		req->task = get_task_struct(current);
 }
 
 static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -953,6 +956,8 @@ static inline void io_req_work_drop_env(struct io_kiocb *req)
 		if (fs)
 			free_fs_struct(fs);
 	}
+	if (req->task)
+		put_task_struct(req->task);
 }
 
 static inline bool io_prep_async_work(struct io_kiocb *req,
@@ -1239,6 +1244,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	refcount_set(&req->refs, 2);
 	req->result = 0;
 	INIT_IO_WORK(&req->work, io_wq_submit_work);
+	req->task = NULL;
 	return req;
 fallback:
 	req = io_get_fallback_req(ctx);
@@ -3670,8 +3676,24 @@ static void __io_poll_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 			trigger_ev = false;
 			req->work.func = io_poll_trigger_evfd;
 		} else {
+			struct io_kiocb *nxt = NULL;
+
 			req->flags |= REQ_F_COMP_LOCKED;
+#if 1
+			io_put_req_find_next(req, &nxt);
+			if (nxt) {
+				struct task_struct *tsk = nxt->task;
+
+				raw_spin_lock(&tsk->uring_lock);
+				list_add_tail(&nxt->list, &tsk->uring_work);
+				raw_spin_unlock(&tsk->uring_lock);
+
+				/* do we need this? */
+				wake_up_process(tsk);
+			}
+#else
 			io_put_req(req);
+#endif
 			req = NULL;
 		}
 		spin_unlock_irqrestore(&ctx->completion_lock, flags);
@@ -5316,6 +5338,40 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return autoremove_wake_function(curr, mode, wake_flags, key);
 }
 
+static void __io_uring_task_handler(struct list_head *list)
+{
+	struct io_kiocb *req;
+
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct io_kiocb, list);
+		list_del(&req->list);
+
+		__io_queue_sqe(req, NULL);
+	}
+}
+
+void io_uring_task_handler(struct task_struct *tsk)
+{
+	LIST_HEAD(list);
+
+	raw_spin_lock_irq(&tsk->uring_lock);
+	if (!list_empty(&tsk->uring_work))
+		list_splice_init(&tsk->uring_work, &list);
+	raw_spin_unlock_irq(&tsk->uring_lock);
+
+	__io_uring_task_handler(&list);
+}
+
+static bool io_uring_handle_tasklog(void)
+{
+	if (list_empty(&current->uring_work))
+		return false;
+
+	__set_current_state(TASK_RUNNING);
+	io_uring_task_handler(current);
+	return true;
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -5358,6 +5414,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 						TASK_INTERRUPTIBLE);
 		if (io_should_wake(&iowq, false))
 			break;
+		/* should be a sched-out handler */
+		if (io_uring_handle_tasklog()) {
+			if (io_should_wake(&iowq, false))
+				break;
+			continue;
+		}
 		schedule();
 		if (signal_pending(current)) {
 			ret = -EINTR;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..447b06c6bed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_IO_URING
+	struct list_head		uring_work;
+	raw_spinlock_t			uring_lock;
+#endif
+
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc007604..b60f081cac17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
+#ifdef CONFIG_IO_URING
+	INIT_LIST_HEAD(&p->uring_work);
+	raw_spin_lock_init(&p->uring_lock);
+#endif
+
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
@@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
 
 #endif /* CONFIG_PREEMPT_NOTIFIERS */
 
+#ifdef CONFIG_IO_URING
+extern void io_uring_task_handler(struct task_struct *tsk);
+
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+	if (!list_empty(&tsk->uring_work))
+		io_uring_task_handler(tsk);
+}
+#else /* !CONFIG_IO_URING */
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+}
+#endif
+
 static inline void prepare_task(struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	balance_callback(rq);
 	preempt_enable();
 
+	io_uring_handler(current);
+
 	if (current->set_child_tid)
 		put_user(task_pid_vnr(current), current->set_child_tid);
 

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14  5:03               ` Jens Axboe
@ 2020-02-14 15:32                 ` Peter Zijlstra
  2020-02-14 15:47                   ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-14 15:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On Thu, Feb 13, 2020 at 10:03:54PM -0700, Jens Axboe wrote:

> CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
> currently do something like this? Only thing I'm currently aware of is
> the preempt in/out notifiers, but they don't quite provide what I need,
> since I need to pass some data (a request) as well.

Whee, nothing quite like this around I think.

> The full detail on what I'm trying here is:
> 
> io_uring can have linked requests. One obvious use case for that is to
> queue a POLLIN on a socket, and then link a read/recv to that. When the
> poll completes, we want to run the read/recv. io_uring hooks into the
> waitqueue wakeup handler to finish the poll request, and since we're
> deep in waitqueue wakeup code, it queues the linked read/recv for
> execution via an async thread. This is not optimal, obviously, as it
> relies on a switch to a new thread to perform this read. This hack
> queues a backlog to the task itself, and runs it when it's scheduled in.
> Probably want to do the same for sched out as well, currently I just
> hack that in the io_uring wait part...

I'll definitely need to think more about this, but a few comments on the
below.

> +static void __io_uring_task_handler(struct list_head *list)
> +{
> +	struct io_kiocb *req;
> +
> +	while (!list_empty(list)) {
> +		req = list_first_entry(list, struct io_kiocb, list);
> +		list_del(&req->list);
> +
> +		__io_queue_sqe(req, NULL);
> +	}
> +}
> +
> +void io_uring_task_handler(struct task_struct *tsk)
> +{
> +	LIST_HEAD(list);
> +
> +	raw_spin_lock_irq(&tsk->uring_lock);
> +	if (!list_empty(&tsk->uring_work))
> +		list_splice_init(&tsk->uring_work, &list);
> +	raw_spin_unlock_irq(&tsk->uring_lock);
> +
> +	__io_uring_task_handler(&list);
> +}

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fc1dfc007604..b60f081cac17 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
>  
> +#ifdef CONFIG_IO_URING
> +	INIT_LIST_HEAD(&p->uring_work);
> +	raw_spin_lock_init(&p->uring_lock);
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  	p->capture_control = NULL;
>  #endif
> @@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
>  
>  #endif /* CONFIG_PREEMPT_NOTIFIERS */
>  
> +#ifdef CONFIG_IO_URING
> +extern void io_uring_task_handler(struct task_struct *tsk);
> +
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +	if (!list_empty(&tsk->uring_work))
> +		io_uring_task_handler(tsk);
> +}
> +#else /* !CONFIG_IO_URING */
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  static inline void prepare_task(struct task_struct *next)
>  {
>  #ifdef CONFIG_SMP
> @@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  	balance_callback(rq);
>  	preempt_enable();
>  
> +	io_uring_handler(current);
> +
>  	if (current->set_child_tid)
>  		put_user(task_pid_vnr(current), current->set_child_tid);
>  

I suspect you meant to put that in finish_task_switch() which is the
tail end of every schedule(), schedule_tail() is the tail end of
clone().

Or maybe you meant to put it in (and rename) sched_update_worker() which
is after every schedule() but in a preemptible context -- much saner
since you don't want to go add an unbounded amount of work in a
non-preemptible context.

At which point you already have your callback: io_wq_worker_running(),
or is this for any random task?



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 15:32                 ` Peter Zijlstra
@ 2020-02-14 15:47                   ` Jens Axboe
  2020-02-14 16:18                     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-14 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 8:32 AM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 10:03:54PM -0700, Jens Axboe wrote:
> 
>> CC'ing peterz for some cluebat knowledge. Peter, is there a nice way to
>> currently do something like this? Only thing I'm currently aware of is
>> the preempt in/out notifiers, but they don't quite provide what I need,
>> since I need to pass some data (a request) as well.
> 
> Whee, nothing quite like this around I think.

Probably not ;-)

>> The full detail on what I'm trying here is:
>>
>> io_uring can have linked requests. One obvious use case for that is to
>> queue a POLLIN on a socket, and then link a read/recv to that. When the
>> poll completes, we want to run the read/recv. io_uring hooks into the
>> waitqueue wakeup handler to finish the poll request, and since we're
>> deep in waitqueue wakeup code, it queues the linked read/recv for
>> execution via an async thread. This is not optimal, obviously, as it
>> relies on a switch to a new thread to perform this read. This hack
>> queues a backlog to the task itself, and runs it when it's scheduled in.
>> Probably want to do the same for sched out as well, currently I just
>> hack that in the io_uring wait part...
> 
> I'll definitely need to think more about this, but a few comments on the
> below.
> 
>> +static void __io_uring_task_handler(struct list_head *list)
>> +{
>> +	struct io_kiocb *req;
>> +
>> +	while (!list_empty(list)) {
>> +		req = list_first_entry(list, struct io_kiocb, list);
>> +		list_del(&req->list);
>> +
>> +		__io_queue_sqe(req, NULL);
>> +	}
>> +}
>> +
>> +void io_uring_task_handler(struct task_struct *tsk)
>> +{
>> +	LIST_HEAD(list);
>> +
>> +	raw_spin_lock_irq(&tsk->uring_lock);
>> +	if (!list_empty(&tsk->uring_work))
>> +		list_splice_init(&tsk->uring_work, &list);
>> +	raw_spin_unlock_irq(&tsk->uring_lock);
>> +
>> +	__io_uring_task_handler(&list);
>> +}
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fc1dfc007604..b60f081cac17 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>>  #endif
>>  
>> +#ifdef CONFIG_IO_URING
>> +	INIT_LIST_HEAD(&p->uring_work);
>> +	raw_spin_lock_init(&p->uring_lock);
>> +#endif
>> +
>>  #ifdef CONFIG_COMPACTION
>>  	p->capture_control = NULL;
>>  #endif
>> @@ -3069,6 +3074,20 @@ fire_sched_out_preempt_notifiers(struct task_struct *curr,
>>  
>>  #endif /* CONFIG_PREEMPT_NOTIFIERS */
>>  
>> +#ifdef CONFIG_IO_URING
>> +extern void io_uring_task_handler(struct task_struct *tsk);
>> +
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +	if (!list_empty(&tsk->uring_work))
>> +		io_uring_task_handler(tsk);
>> +}
>> +#else /* !CONFIG_IO_URING */
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +}
>> +#endif
>> +
>>  static inline void prepare_task(struct task_struct *next)
>>  {
>>  #ifdef CONFIG_SMP
>> @@ -3322,6 +3341,8 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>>  	balance_callback(rq);
>>  	preempt_enable();
>>  
>> +	io_uring_handler(current);
>> +
>>  	if (current->set_child_tid)
>>  		put_user(task_pid_vnr(current), current->set_child_tid);
>>  
> 
> I suspect you meant to put that in finish_task_switch() which is the
> tail end of every schedule(), schedule_tail() is the tail end of
> clone().
> 
> Or maybe you meant to put it in (and rename) sched_update_worker() which
> is after every schedule() but in a preemptible context -- much saner
> since you don't want to go add an unbounded amount of work in a
> non-preemptible context.
> 
> At which point you already have your callback: io_wq_worker_running(),
> or is this for any random task?

Let me try and clarify - this isn't for the worker tasks, this is for
any task that is using io_uring. In fact, it's particularly not for the
worker threads, just the task itself.

I basically want the handler to be called when:

1) The task is scheduled in. The poll will complete and stuff some items
   on that task list, and I want to task to process them as it wakes up.

2) The task is going to sleep, don't want to leave entries around while
   the task is sleeping.

3) I need it to be called from "normal" context, with ints enabled,
   preempt enabled, etc.

sched_update_worker() (with a rename) looks ideal for #1, and the
context is sane for me. Just need a good spot to put the hook call for
schedule out. I think this:

	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
		preempt_disable();
		if (tsk->flags & PF_WQ_WORKER)
			wq_worker_sleeping(tsk);
		else
			io_wq_worker_sleeping(tsk);
		preempt_enable_no_resched();
	}

just needs to go into another helper, and then I can call it there
outside of the preempt.

I'm sure there are daemons lurking here, but I'll test and see how it
goes...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 15:47                   ` Jens Axboe
@ 2020-02-14 16:18                     ` Jens Axboe
  2020-02-14 17:52                       ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-14 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 8:47 AM, Jens Axboe wrote:
>> I suspect you meant to put that in finish_task_switch() which is the
>> tail end of every schedule(), schedule_tail() is the tail end of
>> clone().
>>
>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>> is after every schedule() but in a preemptible context -- much saner
>> since you don't want to go add an unbounded amount of work in a
>> non-preemptible context.
>>
>> At which point you already have your callback: io_wq_worker_running(),
>> or is this for any random task?
> 
> Let me try and clarify - this isn't for the worker tasks, this is for
> any task that is using io_uring. In fact, it's particularly not for the
> worker threads, just the task itself.
> 
> I basically want the handler to be called when:
> 
> 1) The task is scheduled in. The poll will complete and stuff some items
>    on that task list, and I want to task to process them as it wakes up.
> 
> 2) The task is going to sleep, don't want to leave entries around while
>    the task is sleeping.
> 
> 3) I need it to be called from "normal" context, with ints enabled,
>    preempt enabled, etc.
> 
> sched_update_worker() (with a rename) looks ideal for #1, and the
> context is sane for me. Just need a good spot to put the hook call for
> schedule out. I think this:
> 
> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> 		preempt_disable();
> 		if (tsk->flags & PF_WQ_WORKER)
> 			wq_worker_sleeping(tsk);
> 		else
> 			io_wq_worker_sleeping(tsk);
> 		preempt_enable_no_resched();
> 	}
> 
> just needs to go into another helper, and then I can call it there
> outside of the preempt.
> 
> I'm sure there are daemons lurking here, but I'll test and see how it
> goes...

Here's a stab at cleaning it up:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll

top two patches. First one simply cleans up the sched_update_worker(),
so we now have sched_in_update() and sched_out_update(). No changes in
this patch, just moves the worker sched-out handling into a helper.

2nd patch then utilizes this to flush the per-task requests that may
have been queued up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 16:18                     ` Jens Axboe
@ 2020-02-14 17:52                       ` Jens Axboe
  2020-02-14 20:44                         ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-14 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 9:18 AM, Jens Axboe wrote:
> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>> I suspect you meant to put that in finish_task_switch() which is the
>>> tail end of every schedule(), schedule_tail() is the tail end of
>>> clone().
>>>
>>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>>> is after every schedule() but in a preemptible context -- much saner
>>> since you don't want to go add an unbounded amount of work in a
>>> non-preemptible context.
>>>
>>> At which point you already have your callback: io_wq_worker_running(),
>>> or is this for any random task?
>>
>> Let me try and clarify - this isn't for the worker tasks, this is for
>> any task that is using io_uring. In fact, it's particularly not for the
>> worker threads, just the task itself.
>>
>> I basically want the handler to be called when:
>>
>> 1) The task is scheduled in. The poll will complete and stuff some items
>>    on that task list, and I want to task to process them as it wakes up.
>>
>> 2) The task is going to sleep, don't want to leave entries around while
>>    the task is sleeping.
>>
>> 3) I need it to be called from "normal" context, with ints enabled,
>>    preempt enabled, etc.
>>
>> sched_update_worker() (with a rename) looks ideal for #1, and the
>> context is sane for me. Just need a good spot to put the hook call for
>> schedule out. I think this:
>>
>> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>> 		preempt_disable();
>> 		if (tsk->flags & PF_WQ_WORKER)
>> 			wq_worker_sleeping(tsk);
>> 		else
>> 			io_wq_worker_sleeping(tsk);
>> 		preempt_enable_no_resched();
>> 	}
>>
>> just needs to go into another helper, and then I can call it there
>> outside of the preempt.
>>
>> I'm sure there are daemons lurking here, but I'll test and see how it
>> goes...
> 
> Here's a stab at cleaning it up:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
> 
> top two patches. First one simply cleans up the sched_update_worker(),
> so we now have sched_in_update() and sched_out_update(). No changes in
> this patch, just moves the worker sched-out handling into a helper.
> 
> 2nd patch then utilizes this to flush the per-task requests that may
> have been queued up.

In fact, we can go even further. If we have this task handler, then we:

1) Never need to go async for poll completion, and we can remove a bunch
   of code that handles that
2) Don't need to worry about nested eventfd notification, that code goes
   away too
3) Don't need the poll llist for batching flushes, that goes away

In terms of performance, for the single client case we did about 48K
requests per second on my kvm on the laptop, now we're doing 148K.
So it's definitely worthwhile... On top of that, diffstat:

 fs/io_uring.c | 166 +++++++-------------------------------------------
 1 file changed, 22 insertions(+), 144 deletions(-)


diff --git a/fs/io_uring.c b/fs/io_uring.c
index a683d7a08003..1e436b732f2a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
+	struct hlist_node	hash_node;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -835,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -932,8 +924,6 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
 	}
 	if (!req->work.task_pid)
 		req->work.task_pid = task_pid_vnr(current);
-	if (!req->task)
-		req->task = get_task_struct(current);
 }
 
 static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -3545,92 +3535,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
-
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3638,8 +3554,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
+	unsigned long flags;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3647,56 +3564,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				struct io_kiocb *nxt = NULL;
-
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req_find_next(req, &nxt);
-				if (nxt) {
-					struct task_struct *tsk = nxt->task;
-
-					raw_spin_lock(&tsk->uring_lock);
-					list_add_tail(&nxt->list, &tsk->uring_work);
-					raw_spin_unlock(&tsk->uring_lock);
-					/* do we need to wake tsk here??? */
-				}
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->list, &tsk->uring_work);
+	raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3755,7 +3628,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
@@ -4863,6 +4735,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return false;
 	}
 
+	req->task = get_task_struct(current);
+
 	/*
 	 * If we already have a head request, queue this one for async
 	 * submittal once the head completes. If we don't have a head but
@@ -5270,10 +5144,14 @@ void io_uring_task_handler(struct task_struct *tsk)
 	raw_spin_unlock_irq(&tsk->uring_lock);
 
 	while (!list_empty(&local_list)) {
+		struct io_kiocb *nxt = NULL;
+
 		req = list_first_entry(&local_list, struct io_kiocb, list);
 		list_del(&req->list);
 
-		__io_queue_sqe(req, NULL);
+		io_poll_task_handler(req, &nxt);
+		if (nxt)
+			__io_queue_sqe(req, NULL);
 	}
 }
 

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 17:52                       ` Jens Axboe
@ 2020-02-14 20:44                         ` Jens Axboe
  2020-02-15  0:16                           ` Carter Li 李通洲
  2020-02-17 12:09                           ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-14 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/14/20 10:52 AM, Jens Axboe wrote:
> On 2/14/20 9:18 AM, Jens Axboe wrote:
>> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>>> I suspect you meant to put that in finish_task_switch() which is the
>>>> tail end of every schedule(), schedule_tail() is the tail end of
>>>> clone().
>>>>
>>>> Or maybe you meant to put it in (and rename) sched_update_worker() which
>>>> is after every schedule() but in a preemptible context -- much saner
>>>> since you don't want to go add an unbounded amount of work in a
>>>> non-preemptible context.
>>>>
>>>> At which point you already have your callback: io_wq_worker_running(),
>>>> or is this for any random task?
>>>
>>> Let me try and clarify - this isn't for the worker tasks, this is for
>>> any task that is using io_uring. In fact, it's particularly not for the
>>> worker threads, just the task itself.
>>>
>>> I basically want the handler to be called when:
>>>
>>> 1) The task is scheduled in. The poll will complete and stuff some items
>>>    on that task list, and I want to task to process them as it wakes up.
>>>
>>> 2) The task is going to sleep, don't want to leave entries around while
>>>    the task is sleeping.
>>>
>>> 3) I need it to be called from "normal" context, with ints enabled,
>>>    preempt enabled, etc.
>>>
>>> sched_update_worker() (with a rename) looks ideal for #1, and the
>>> context is sane for me. Just need a good spot to put the hook call for
>>> schedule out. I think this:
>>>
>>> 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>>> 		preempt_disable();
>>> 		if (tsk->flags & PF_WQ_WORKER)
>>> 			wq_worker_sleeping(tsk);
>>> 		else
>>> 			io_wq_worker_sleeping(tsk);
>>> 		preempt_enable_no_resched();
>>> 	}
>>>
>>> just needs to go into another helper, and then I can call it there
>>> outside of the preempt.
>>>
>>> I'm sure there are daemons lurking here, but I'll test and see how it
>>> goes...
>>
>> Here's a stab at cleaning it up:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
>>
>> top two patches. First one simply cleans up the sched_update_worker(),
>> so we now have sched_in_update() and sched_out_update(). No changes in
>> this patch, just moves the worker sched-out handling into a helper.
>>
>> 2nd patch then utilizes this to flush the per-task requests that may
>> have been queued up.
> 
> In fact, we can go even further. If we have this task handler, then we:
> 
> 1) Never need to go async for poll completion, and we can remove a bunch
>    of code that handles that
> 2) Don't need to worry about nested eventfd notification, that code goes
>    away too
> 3) Don't need the poll llist for batching flushes, that goes away
> 
> In terms of performance, for the single client case we did about 48K
> requests per second on my kvm on the laptop, now we're doing 148K.
> So it's definitely worthwhile... On top of that, diffstat:
> 
>  fs/io_uring.c | 166 +++++++-------------------------------------------
>  1 file changed, 22 insertions(+), 144 deletions(-)

It's now up to 3.5x the original performance for the single client case.
Here's the updated patch, folded with the original that only went half
the way there.


commit b9c04ea10d10cf80f2d2f3b96e1668e523602072
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 14 09:15:29 2020 -0700

    io_uring: add per-task callback handler
    
    For poll requests, it's not uncommon to link a read (or write) after
    the poll to execute immediately after the file is marked as ready.
    Since the poll completion is called inside the waitqueue wake up handler,
    we have to punt that linked request to async context. This slows down
    the processing, and actually means it's faster to not use a link for this
    use case.
    
    We also run into problems if the completion_lock is contended, as we're
    doing a different lock ordering than the issue side is. Hence we have
    to do trylock for completion, and if that fails, go async. Poll removal
    needs to go async as well, for the same reason.
    
    eventfd notification needs special case as well, to avoid stack blowing
    recursion or deadlocks.
    
    These are all deficiencies that were inherited from the aio poll
    implementation, but I think we can do better. When a poll completes,
    simply queue it up in the task poll list. When the task completes the
    list, we can run dependent links inline as well. This means we never
    have to go async, and we can remove a bunch of code associated with
    that, and optimizations to try and make that run faster. The diffstat
    speaks for itself.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a826017ebb8..2756654e2955 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
-		struct llist_head	poll_llist;
 
 		/*
 		 * ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
 	};
 
 	struct io_async_ctx		*io;
-	/*
-	 * llist_node is only used for poll deferred completions
-	 */
-	struct llist_node		llist_node;
 	bool				in_async;
 	bool				needs_fixed_file;
 	u8				opcode;
 
 	struct io_ring_ctx	*ctx;
-	union {
-		struct list_head	list;
-		struct hlist_node	hash_node;
-	};
+	struct list_head	list;
+	struct hlist_node	hash_node;
 	struct list_head	link_list;
 	unsigned int		flags;
 	refcount_t		refs;
@@ -574,6 +567,7 @@ struct io_kiocb {
 
 	struct list_head	inflight_entry;
 
+	struct task_struct	*task;
 	struct io_wq_work	work;
 };
 
@@ -834,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
-	init_llist_head(&ctx->poll_llist);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
@@ -1056,24 +1049,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
 		return false;
 	if (!ctx->eventfd_async)
 		return true;
-	return io_wq_current_is_worker() || in_interrupt();
+	return io_wq_current_is_worker();
 }
 
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev)
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (trigger_ev)
+	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
-{
-	__io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx));
-}
-
 /* Returns true if there are no backlogged entries after the flush */
 static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
@@ -1238,6 +1226,8 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	/* one is dropped after submission, the other at completion */
 	refcount_set(&req->refs, 2);
 	req->result = 0;
+	/* task will wait for requests on exit, don't need a ref */
+	req->task = current;
 	INIT_IO_WORK(&req->work, io_wq_submit_work);
 	return req;
 fallback:
@@ -3448,15 +3438,22 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt,
 static void io_poll_remove_one(struct io_kiocb *req)
 {
 	struct io_poll_iocb *poll = &req->poll;
+	bool do_complete = false;
 
 	spin_lock(&poll->head->lock);
 	WRITE_ONCE(poll->canceled, true);
 	if (!list_empty(&poll->wait.entry)) {
 		list_del_init(&poll->wait.entry);
-		io_queue_async_work(req);
+		do_complete = true;
 	}
 	spin_unlock(&poll->head->lock);
 	hash_del(&req->hash_node);
+	if (do_complete) {
+		io_cqring_fill_event(req, -ECANCELED);
+		io_commit_cqring(req->ctx);
+		req->flags |= REQ_F_COMP_LOCKED;
+		io_put_req(req);
+	}
 }
 
 static void io_poll_remove_all(struct io_ring_ctx *ctx)
@@ -3474,6 +3471,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 			io_poll_remove_one(req);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
 }
 
 static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
@@ -3539,92 +3538,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 	io_commit_cqring(ctx);
 }
 
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-	struct io_wq_work *work = *workptr;
-	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct io_poll_iocb *poll = &req->poll;
-	struct poll_table_struct pt = { ._key = poll->events };
 	struct io_ring_ctx *ctx = req->ctx;
-	struct io_kiocb *nxt = NULL;
-	__poll_t mask = 0;
-	int ret = 0;
 
-	if (work->flags & IO_WQ_WORK_CANCEL) {
-		WRITE_ONCE(poll->canceled, true);
-		ret = -ECANCELED;
-	} else if (READ_ONCE(poll->canceled)) {
-		ret = -ECANCELED;
-	}
-
-	if (ret != -ECANCELED)
-		mask = vfs_poll(poll->file, &pt) & poll->events;
-
-	/*
-	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
-	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
-	 * synchronize with them.  In the cancellation case the list_del_init
-	 * itself is not actually needed, but harmless so we keep it in to
-	 * avoid further branches in the fast path.
-	 */
 	spin_lock_irq(&ctx->completion_lock);
-	if (!mask && ret != -ECANCELED) {
-		add_wait_queue(poll->head, &poll->wait);
-		spin_unlock_irq(&ctx->completion_lock);
-		return;
-	}
 	hash_del(&req->hash_node);
-	io_poll_complete(req, mask, ret);
-	spin_unlock_irq(&ctx->completion_lock);
-
-	io_cqring_ev_posted(ctx);
-
-	if (ret < 0)
-		req_set_fail_links(req);
-	io_put_req_find_next(req, &nxt);
-	if (nxt)
-		io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
-	struct io_kiocb *req, *tmp;
-	struct req_batch rb;
-
-	rb.to_free = rb.need_iter = 0;
-	spin_lock_irq(&ctx->completion_lock);
-	llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
-		hash_del(&req->hash_node);
-		io_poll_complete(req, req->result, 0);
-
-		if (refcount_dec_and_test(&req->refs) &&
-		    !io_req_multi_free(&rb, req)) {
-			req->flags |= REQ_F_COMP_LOCKED;
-			io_free_req(req);
-		}
-	}
+	io_poll_complete(req, req->result, 0);
+	req->flags |= REQ_F_COMP_LOCKED;
+	io_put_req_find_next(req, nxt);
 	spin_unlock_irq(&ctx->completion_lock);
 
 	io_cqring_ev_posted(ctx);
-	io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-	struct llist_node *nodes;
-
-	nodes = llist_del_all(&req->ctx->poll_llist);
-	if (nodes)
-		__io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
-	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
-	eventfd_signal(req->ctx->cq_ev_fd, 1);
-	io_put_req(req);
 }
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3632,8 +3557,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	struct io_ring_ctx *ctx = req->ctx;
 	__poll_t mask = key_to_poll(key);
+	struct task_struct *tsk;
+	unsigned long flags;
 
 	/* for instances that support it check for an event match first: */
 	if (mask && !(mask & poll->events))
@@ -3641,46 +3567,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 
 	list_del_init(&poll->wait.entry);
 
-	/*
-	 * Run completion inline if we can. We're using trylock here because
-	 * we are violating the completion_lock -> poll wq lock ordering.
-	 * If we have a link timeout we're going to need the completion_lock
-	 * for finalizing the request, mark us as having grabbed that already.
-	 */
-	if (mask) {
-		unsigned long flags;
-
-		if (llist_empty(&ctx->poll_llist) &&
-		    spin_trylock_irqsave(&ctx->completion_lock, flags)) {
-			bool trigger_ev;
-
-			hash_del(&req->hash_node);
-			io_poll_complete(req, mask, 0);
-
-			trigger_ev = io_should_trigger_evfd(ctx);
-			if (trigger_ev && eventfd_signal_count()) {
-				trigger_ev = false;
-				req->work.func = io_poll_trigger_evfd;
-			} else {
-				req->flags |= REQ_F_COMP_LOCKED;
-				io_put_req(req);
-				req = NULL;
-			}
-			spin_unlock_irqrestore(&ctx->completion_lock, flags);
-			__io_cqring_ev_posted(ctx, trigger_ev);
-		} else {
-			req->result = mask;
-			req->llist_node.next = NULL;
-			/* if the list wasn't empty, we're done */
-			if (!llist_add(&req->llist_node, &ctx->poll_llist))
-				req = NULL;
-			else
-				req->work.func = io_poll_flush;
-		}
-	}
-	if (req)
-		io_queue_async_work(req);
-
+	tsk = req->task;
+	req->result = mask;
+	raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->list, &tsk->uring_work);
+	raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
 	return 1;
 }
 
@@ -3739,7 +3631,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	bool cancel = false;
 	__poll_t mask;
 
-	INIT_IO_WORK(&req->work, io_poll_complete_work);
 	INIT_HLIST_NODE(&req->hash_node);
 
 	poll->head = NULL;
@@ -5243,6 +5134,28 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return autoremove_wake_function(curr, mode, wake_flags, key);
 }
 
+void io_uring_task_handler(struct task_struct *tsk)
+{
+	LIST_HEAD(local_list);
+	struct io_kiocb *req;
+
+	raw_spin_lock_irq(&tsk->uring_lock);
+	if (!list_empty(&tsk->uring_work))
+		list_splice_init(&tsk->uring_work, &local_list);
+	raw_spin_unlock_irq(&tsk->uring_lock);
+
+	while (!list_empty(&local_list)) {
+		struct io_kiocb *nxt = NULL;
+
+		req = list_first_entry(&local_list, struct io_kiocb, list);
+		list_del(&req->list);
+
+		io_poll_task_handler(req, &nxt);
+		if (nxt)
+			__io_queue_sqe(nxt, NULL);
+	}
+}
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..447b06c6bed0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,6 +685,11 @@ struct task_struct {
 #endif
 	struct sched_dl_entity		dl;
 
+#ifdef CONFIG_IO_URING
+	struct list_head		uring_work;
+	raw_spinlock_t			uring_lock;
+#endif
+
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51ca491d99ed..170fefa1caf8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
 
+#ifdef CONFIG_IO_URING
+	INIT_LIST_HEAD(&p->uring_work);
+	raw_spin_lock_init(&p->uring_lock);
+#endif
+
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
@@ -4104,6 +4109,20 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
+#ifdef CONFIG_IO_URING
+extern void io_uring_task_handler(struct task_struct *tsk);
+
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+	if (!list_empty(&tsk->uring_work))
+		io_uring_task_handler(tsk);
+}
+#else /* !CONFIG_IO_URING */
+static inline void io_uring_handler(struct task_struct *tsk)
+{
+}
+#endif
+
 static void sched_out_update(struct task_struct *tsk)
 {
 	/*
@@ -4121,6 +4140,7 @@ static void sched_out_update(struct task_struct *tsk)
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
+	io_uring_handler(tsk);
 }
 
 static void sched_in_update(struct task_struct *tsk)
@@ -4131,6 +4151,7 @@ static void sched_in_update(struct task_struct *tsk)
 		else
 			io_wq_worker_running(tsk);
 	}
+	io_uring_handler(tsk);
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 20:44                         ` Jens Axboe
@ 2020-02-15  0:16                           ` Carter Li 李通洲
  2020-02-15  1:10                             ` Jens Axboe
  2020-02-17 12:09                           ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  0:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

Hello Jens,


> It's now up to 3.5x the original performance for the single client case.
> Here's the updated patch, folded with the original that only went half
> the way there.


I’m looking forward to it.

And question again: since POLL->READ/RECV is much faster then READ/RECV async,
could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
networking, but also for all pollable fds.

Carter

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  0:16                           ` Carter Li 李通洲
@ 2020-02-15  1:10                             ` Jens Axboe
  2020-02-15  1:25                               ` Carter Li 李通洲
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-15  1:10 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 5:16 PM, Carter Li 李通洲 wrote:
> Hello Jens,
> 
> 
>> It's now up to 3.5x the original performance for the single client case.
>> Here's the updated patch, folded with the original that only went half
>> the way there.
> 
> 
> I’m looking forward to it.
> 
> And question again: since POLL->READ/RECV is much faster then READ/RECV async,
> could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
> networking, but also for all pollable fds.

That's exactly the next step. With this, we have a very efficient way of
doing async IO for anything that can be driven by poll. Then we can do it
by default, instead of doing an async punt. Much faster and much more
efficient.

I'll try and work on that next week, I think this could be a real game
changer.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:10                             ` Jens Axboe
@ 2020-02-15  1:25                               ` Carter Li 李通洲
  2020-02-15  1:27                                 ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  1:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

There are at least 2 benefits over POLL->READ

1. Reduce a little complexity of user code, and save lots of sqes.
2. Better performance. Users can’t if an operation will block without
issuing an extra O_NONBLOCK syscall, which ends up with always using
POLL->READ link. If it’s handled by kernel, we may only poll when
we know it’s needed.


> 2020年2月15日 上午9:10,Jens Axboe <axboe@kernel.dk> 写道:
> 
> On 2/14/20 5:16 PM, Carter Li 李通洲 wrote:
>> Hello Jens,
>> 
>> 
>>> It's now up to 3.5x the original performance for the single client case.
>>> Here's the updated patch, folded with the original that only went half
>>> the way there.
>> 
>> 
>> I’m looking forward to it.
>> 
>> And question again: since POLL->READ/RECV is much faster then READ/RECV async,
>> could we implement READ/RECV that would block as POLL->READ/RECV? Not only for
>> networking, but also for all pollable fds.
> 
> That's exactly the next step. With this, we have a very efficient way of
> doing async IO for anything that can be driven by poll. Then we can do it
> by default, instead of doing an async punt. Much faster and much more
> efficient.
> 
> I'll try and work on that next week, I think this could be a real game
> changer.
> 
> -- 
> Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:25                               ` Carter Li 李通洲
@ 2020-02-15  1:27                                 ` Jens Axboe
  2020-02-15  6:01                                   ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-15  1:27 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
> There are at least 2 benefits over POLL->READ
> 
> 1. Reduce a little complexity of user code, and save lots of sqes.
> 2. Better performance. Users can’t if an operation will block without
> issuing an extra O_NONBLOCK syscall, which ends up with always using
> POLL->READ link. If it’s handled by kernel, we may only poll when
> we know it’s needed.

Exactly, it'll enable the app to do read/recv or write/send without
having to worry about anything, and it'll be as efficient as having
it linked to a poll command.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  1:27                                 ` Jens Axboe
@ 2020-02-15  6:01                                   ` Jens Axboe
  2020-02-15  6:32                                     ` Carter Li 李通洲
  2020-02-16 19:06                                     ` Pavel Begunkov
  0 siblings, 2 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-15  6:01 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 6:27 PM, Jens Axboe wrote:
> On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
>> There are at least 2 benefits over POLL->READ
>>
>> 1. Reduce a little complexity of user code, and save lots of sqes.
>> 2. Better performance. Users can’t if an operation will block without
>> issuing an extra O_NONBLOCK syscall, which ends up with always using
>> POLL->READ link. If it’s handled by kernel, we may only poll when
>> we know it’s needed.
> 
> Exactly, it'll enable the app to do read/recv or write/send without
> having to worry about anything, and it'll be as efficient as having
> it linked to a poll command.

Couldn't help myself... The below is the general direction, but not
done by any stretch of the imagination. There are a few hacks in there.
But, in short, it does allow eg send/recv to behave in an async manner
without needing the thread offload. Tested it with the test/send_recv
and test/send_recvmsg test apps, and it works there. There seems to be
some weird issue with eg test/socket-rw, not sure what that is yet.

Just wanted to throw it out there. It's essentially the same as the
linked poll, except it's just done internally.


commit d1fc97c5f132eaf39c06783925bd11dca9fa3ecd
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 14 22:23:12 2020 -0700

    io_uring: use poll driven retry for files that support it
    
    Currently io_uring tries any request in a non-blocking manner, if it can,
    and then retries from a worker thread if we got -EAGAIN. Now that we have
    a new and fancy poll based retry backend, use that to retry requests if
    the file supports it.
    
    This means that, for example, an IORING_OP_RECVMSG on a socket no longer
    requires an async thread to complete the IO. If we get -EAGAIN reading
    from the socket in a non-blocking manner, we arm a poll handler for
    notification on when the socket becomes readable. When it does, the
    pending read is executed directly by the task again, through the io_uring
    scheduler handlers.
    
    Note that this is very much a work-in-progress, and it doesn't (yet) pass
    the full test suite. Notable missing features:
    
    - With the work queued up async, we have a common method for locating it
      for cancelation purposes. I need to add cancel tracking for poll
      managed requests.
    
    - Need to double check req->apoll life time.
    
    - Probably a lot I don't quite recall right now...
    
    It does work for the basic read/write, send/recv, etc testing I've
    tried.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fb94b8bac638..530dcd91fa53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -577,6 +577,7 @@ struct io_kiocb {
 		struct {
 			struct task_struct	*task;
 			struct list_head	task_list;
+			struct io_poll_iocb	*apoll;
 		};
 		struct io_wq_work	work;
 	};
@@ -623,6 +624,9 @@ struct io_op_def {
 	unsigned		file_table : 1;
 	/* needs ->fs */
 	unsigned		needs_fs : 1;
+	/* set if opcode supports polled "wait" */
+	unsigned		pollin : 1;
+	unsigned		pollout : 1;
 };
 
 static const struct io_op_def io_op_defs[] = {
@@ -632,6 +636,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITEV] = {
 		.async_ctx		= 1,
@@ -639,6 +644,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FSYNC] = {
 		.needs_file		= 1,
@@ -646,11 +652,13 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE_FIXED] = {
 		.needs_file		= 1,
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_POLL_ADD] = {
 		.needs_file		= 1,
@@ -666,6 +674,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECVMSG] = {
 		.async_ctx		= 1,
@@ -673,6 +682,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.needs_fs		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.async_ctx		= 1,
@@ -684,6 +694,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.file_table		= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_ASYNC_CANCEL] = {},
 	[IORING_OP_LINK_TIMEOUT] = {
@@ -695,6 +706,7 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_FALLOCATE] = {
 		.needs_file		= 1,
@@ -723,11 +735,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_FADVISE] = {
 		.needs_file		= 1,
@@ -739,11 +753,13 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
 	},
 	[IORING_OP_RECV] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
 	},
 	[IORING_OP_OPENAT2] = {
 		.needs_file		= 1,
@@ -2228,6 +2244,139 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
+			    struct wait_queue_head *head)
+{
+	if (unlikely(poll->head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	poll->head = head;
+	add_wait_queue(head, &poll->wait);
+}
+
+static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	__io_queue_proc(pt->req->apoll, pt, head);
+}
+
+static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
+			   __poll_t mask)
+{
+	struct task_struct *tsk;
+	unsigned long flags;
+
+	/* for instances that support it check for an event match first: */
+	if (mask && !(mask & poll->events))
+		return 0;
+
+	list_del_init(&poll->wait.entry);
+
+	tsk = req->task;
+	req->result = mask;
+	spin_lock_irqsave(&tsk->uring_lock, flags);
+	list_add_tail(&req->task_list, &tsk->uring_work);
+	spin_unlock_irqrestore(&tsk->uring_lock, flags);
+	wake_up_process(tsk);
+	return 1;
+}
+
+static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_kiocb *req = wait->private;
+	struct io_poll_iocb *poll = req->apoll;
+
+	trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data, key_to_poll(key));
+	return __io_async_wake(req, poll, key_to_poll(key));
+}
+
+static bool io_arm_poll_handler(struct io_kiocb *req, int *retry_count)
+{
+	const struct io_op_def *def = &io_op_defs[req->opcode];
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_iocb *poll;
+	struct io_poll_table ipt;
+	bool cancel = false;
+	__poll_t mask;
+
+	if (!file_can_poll(req->file))
+		return false;
+	if (req->flags & REQ_F_MUST_PUNT)
+		return false;
+	if (!def->pollin && !def->pollout)
+		return false;
+
+	poll = kmalloc(sizeof(*poll), GFP_ATOMIC);
+	if (unlikely(!poll))
+		return false;
+
+	req->task = current;
+	INIT_LIST_HEAD(&req->task_list);
+	req->apoll = poll;
+
+	if (def->pollin)
+		mask = POLLIN;
+	if (def->pollout)
+		mask |= POLLOUT;
+
+	poll->file = req->file;
+	poll->head = NULL;
+	poll->done = poll->canceled = false;
+	poll->events = mask;
+	ipt.pt._qproc = io_async_queue_proc;
+	ipt.pt._key = mask;
+	ipt.req = req;
+	ipt.error = -EINVAL;
+
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_async_wake);
+	poll->wait.private = req;
+
+	mask = vfs_poll(req->file, &ipt.pt) & poll->events;
+
+	spin_lock_irq(&ctx->completion_lock);
+	if (likely(poll->head)) {
+		spin_lock(&poll->head->lock);
+		if (unlikely(list_empty(&poll->wait.entry))) {
+			if (ipt.error)
+				cancel = true;
+			ipt.error = 0;
+			mask = 0;
+		}
+		if (mask || ipt.error)
+			list_del_init(&poll->wait.entry);
+		else if (cancel)
+			WRITE_ONCE(poll->canceled, true);
+#if 0
+		Needs tracking for cancelation
+		else if (!poll->done) /* actually waiting for an event */
+			io_poll_req_insert(req);
+#endif
+		spin_unlock(&poll->head->lock);
+	}
+	if (mask) {
+		ipt.error = 0;
+		poll->done = true;
+		(*retry_count)++;
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+	trace_io_uring_poll_arm(ctx, req->opcode, req->user_data, mask,
+					poll->events);
+	return true;
+}
+
 static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
 		   bool force_nonblock)
 {
@@ -3569,44 +3718,16 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 {
 	struct io_poll_iocb *poll = wait->private;
 	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
-	__poll_t mask = key_to_poll(key);
-	struct task_struct *tsk;
-	unsigned long flags;
-
-	/* for instances that support it check for an event match first: */
-	if (mask && !(mask & poll->events))
-		return 0;
-
-	list_del_init(&poll->wait.entry);
 
-	tsk = req->task;
-	req->result = mask;
-	spin_lock_irqsave(&tsk->uring_lock, flags);
-	list_add_tail(&req->task_list, &tsk->uring_work);
-	spin_unlock_irqrestore(&tsk->uring_lock, flags);
-	wake_up_process(tsk);
-	return 1;
+	return __io_async_wake(req, poll, key_to_poll(key));
 }
 
-struct io_poll_table {
-	struct poll_table_struct pt;
-	struct io_kiocb *req;
-	int error;
-};
-
 static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
 			       struct poll_table_struct *p)
 {
 	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
 
-	if (unlikely(pt->req->poll.head)) {
-		pt->error = -EINVAL;
-		return;
-	}
-
-	pt->error = 0;
-	pt->req->poll.head = head;
-	add_wait_queue(head, &pt->req->poll.wait);
+	__io_queue_proc(&pt->req->poll, pt, head);
 }
 
 static void io_poll_req_insert(struct io_kiocb *req)
@@ -4617,11 +4738,13 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_kiocb *linked_timeout;
 	struct io_kiocb *nxt = NULL;
+	int retry_count = 0;
 	int ret;
 
 again:
 	linked_timeout = io_prep_linked_timeout(req);
 
+issue:
 	ret = io_issue_sqe(req, sqe, &nxt, true);
 
 	/*
@@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	 */
 	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
 	    (req->flags & REQ_F_MUST_PUNT))) {
+
+		if (io_arm_poll_handler(req, &retry_count)) {
+			if (retry_count == 1)
+				goto issue;
+			else if (!retry_count)
+				goto done_req;
+			INIT_IO_WORK(&req->work, io_wq_submit_work);
+		}
 punt:
 		if (io_op_defs[req->opcode].file_table) {
 			ret = io_grab_files(req);
@@ -5154,26 +5285,40 @@ void io_uring_task_handler(struct task_struct *tsk)
 {
 	LIST_HEAD(local_list);
 	struct io_kiocb *req;
+	long state;
 
 	spin_lock_irq(&tsk->uring_lock);
 	if (!list_empty(&tsk->uring_work))
 		list_splice_init(&tsk->uring_work, &local_list);
 	spin_unlock_irq(&tsk->uring_lock);
 
+	state = current->state;
+	__set_current_state(TASK_RUNNING);
 	while (!list_empty(&local_list)) {
 		struct io_kiocb *nxt = NULL;
+		void *to_free = NULL;
 
 		req = list_first_entry(&local_list, struct io_kiocb, task_list);
 		list_del(&req->task_list);
 
-		io_poll_task_handler(req, &nxt);
+		if (req->opcode == IORING_OP_POLL_ADD) {
+			io_poll_task_handler(req, &nxt);
+		} else {
+			nxt = req;
+			to_free = req->apoll;
+			WARN_ON_ONCE(!list_empty(&req->apoll->wait.entry));
+		}
 		if (nxt)
 			__io_queue_sqe(nxt, NULL);
 
+		kfree(to_free);
+
 		/* finish next time, if we're out of time slice */
-		if (need_resched())
+		if (need_resched() && !(current->flags & PF_EXITING))
 			break;
 	}
+	WARN_ON_ONCE(current->state != TASK_RUNNING);
+	__set_current_state(state);
 
 	if (!list_empty(&local_list)) {
 		spin_lock_irq(&tsk->uring_lock);
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 27bd9e4f927b..133a16472423 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -357,6 +357,60 @@ TRACE_EVENT(io_uring_submit_sqe,
 			  __entry->force_nonblock, __entry->sq_thread)
 );
 
+TRACE_EVENT(io_uring_poll_arm,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask, int events),
+
+	TP_ARGS(ctx, opcode, user_data, mask, events),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+		__field(  int,		events		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+		__entry->events		= events;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x, events 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask, __entry->events)
+);
+
+TRACE_EVENT(io_uring_poll_wake,
+
+	TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask),
+
+	TP_ARGS(ctx, opcode, user_data, mask),
+
+	TP_STRUCT__entry (
+		__field(  void *,	ctx		)
+		__field(  u8,		opcode		)
+		__field(  u64,		user_data	)
+		__field(  int,		mask		)
+	),
+
+	TP_fast_assign(
+		__entry->ctx		= ctx;
+		__entry->opcode		= opcode;
+		__entry->user_data	= user_data;
+		__entry->mask		= mask;
+	),
+
+	TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x",
+			  __entry->ctx, __entry->opcode,
+			  (unsigned long long) __entry->user_data,
+			  __entry->mask)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
diff --git a/kernel/exit.c b/kernel/exit.c
index 2833ffb0c211..988799763f34 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -716,6 +716,7 @@ void __noreturn do_exit(long code)
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
 
+	WARN_ON(!list_empty(&tsk->uring_work));
 	WARN_ON(blk_needs_flush_plug(tsk));
 
 	if (unlikely(in_interrupt()))

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:01                                   ` Jens Axboe
@ 2020-02-15  6:32                                     ` Carter Li 李通洲
  2020-02-15 15:11                                       ` Jens Axboe
  2020-02-16 19:06                                     ` Pavel Begunkov
  1 sibling, 1 reply; 59+ messages in thread
From: Carter Li 李通洲 @ 2020-02-15  6:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Pavel Begunkov, io-uring

Really appreciate it! Looking forward to getting it merged into mainline!

By the way, what time is it now in your city? ;-)

Carter

> 2020年2月15日 下午2:01,Jens Axboe <axboe@kernel.dk> 写道:
> 
> On 2/14/20 6:27 PM, Jens Axboe wrote:
>> On 2/14/20 6:25 PM, Carter Li 李通洲 wrote:
>>> There are at least 2 benefits over POLL->READ
>>> 
>>> 1. Reduce a little complexity of user code, and save lots of sqes.
>>> 2. Better performance. Users can’t if an operation will block without
>>> issuing an extra O_NONBLOCK syscall, which ends up with always using
>>> POLL->READ link. If it’s handled by kernel, we may only poll when
>>> we know it’s needed.
>> 
>> Exactly, it'll enable the app to do read/recv or write/send without
>> having to worry about anything, and it'll be as efficient as having
>> it linked to a poll command.
> 
> Couldn't help myself... The below is the general direction, but not
> done by any stretch of the imagination. There are a few hacks in there.
> But, in short, it does allow eg send/recv to behave in an async manner
> without needing the thread offload. Tested it with the test/send_recv
> and test/send_recvmsg test apps, and it works there. There seems to be
> some weird issue with eg test/socket-rw, not sure what that is yet.
> 
> Just wanted to throw it out there. It's essentially the same as the
> linked poll, except it's just done internally.
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:32                                     ` Carter Li 李通洲
@ 2020-02-15 15:11                                       ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-15 15:11 UTC (permalink / raw)
  To: Carter Li 李通洲
  Cc: Peter Zijlstra, Pavel Begunkov, io-uring

On 2/14/20 11:32 PM, Carter Li 李通洲 wrote:
> Really appreciate it! Looking forward to getting it merged into mainline!

I'm very excited about it too, I'll keep hammering on it to flesh it
out. I'll only be sporadically available next week though, but we'll
see how it turns out.

> By the way, what time is it now in your city? ;-)

It was 11PM when that went out, could be worse :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-15  6:01                                   ` Jens Axboe
  2020-02-15  6:32                                     ` Carter Li 李通洲
@ 2020-02-16 19:06                                     ` Pavel Begunkov
  2020-02-16 22:23                                       ` Jens Axboe
  2020-02-16 23:06                                       ` Jens Axboe
  1 sibling, 2 replies; 59+ messages in thread
From: Pavel Begunkov @ 2020-02-16 19:06 UTC (permalink / raw)
  To: Jens Axboe, Carter Li 李通洲; +Cc: Peter Zijlstra, io-uring

[-- Attachment #1.1: Type: text/plain, Size: 1474 bytes --]

On 15/02/2020 09:01, Jens Axboe wrote:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index fb94b8bac638..530dcd91fa53 100644
> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	 */
>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>  	    (req->flags & REQ_F_MUST_PUNT))) {
> +
> +		if (io_arm_poll_handler(req, &retry_count)) {
> +			if (retry_count == 1)
> +				goto issue;

Better to sqe=NULL before retrying, so it won't re-read sqe and try to init the
req twice.

Also, the second sync-issue may -EAGAIN again, and as I remember, read/write/etc
will try to copy iovec into req->io. But iovec is already in req->io, so it will
self memcpy(). Not a good thing.

> +			else if (!retry_count)
> +				goto done_req;
> +			INIT_IO_WORK(&req->work, io_wq_submit_work);

It's not nice to reset it as this:
- prep() could set some work.flags
- custom work.func is more performant (adds extra switch)
- some may rely on specified work.func to be called. e.g. close(), even though
it doesn't participate in the scheme

> +		}
>  punt:
>  		if (io_op_defs[req->opcode].file_table) {
>  			ret = io_grab_files(req);
> @@ -5154,26 +5285,40 @@ void io_uring_task_handler(struct task_struct *tsk)
>  {
>  	LIST_HEAD(local_list);
>  	struct io_kiocb *req;
> +	long state;
>  
>  	spin_lock_irq(&tsk->uring_lock);
>  	if (!list_empty(&tsk->uring_work))

-- 
Pavel Begunkov


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 19:06                                     ` Pavel Begunkov
@ 2020-02-16 22:23                                       ` Jens Axboe
  2020-02-17 10:30                                         ` Pavel Begunkov
  2020-02-16 23:06                                       ` Jens Axboe
  1 sibling, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-16 22:23 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 12:06 PM, Pavel Begunkov wrote:
> On 15/02/2020 09:01, Jens Axboe wrote:
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index fb94b8bac638..530dcd91fa53 100644
>> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>  	 */
>>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>>  	    (req->flags & REQ_F_MUST_PUNT))) {
>> +
>> +		if (io_arm_poll_handler(req, &retry_count)) {
>> +			if (retry_count == 1)
>> +				goto issue;
> 
> Better to sqe=NULL before retrying, so it won't re-read sqe and try to
> init the req twice.

Good point, that should get cleared after issue.

> Also, the second sync-issue may -EAGAIN again, and as I remember,
> read/write/etc will try to copy iovec into req->io. But iovec is
> already in req->io, so it will self memcpy(). Not a good thing.

I'll look into those details, that has indeed reared its head before.

>> +			else if (!retry_count)
>> +				goto done_req;
>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
> 
> It's not nice to reset it as this:
> - prep() could set some work.flags
> - custom work.func is more performant (adds extra switch)
> - some may rely on specified work.func to be called. e.g. close(), even though
> it doesn't participate in the scheme

It's totally a hack as-is for the "can't do it, go async". I did clean
this up a bit (if you check the git version, it's changed quite a bit),
but it's still a mess in terms of that and ->work vs union ownership.
The commit message also has a note about that.

So more work needed in that area for sure.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 19:06                                     ` Pavel Begunkov
  2020-02-16 22:23                                       ` Jens Axboe
@ 2020-02-16 23:06                                       ` Jens Axboe
  2020-02-16 23:07                                         ` Jens Axboe
  1 sibling, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-16 23:06 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 12:06 PM, Pavel Begunkov wrote:
> Also, the second sync-issue may -EAGAIN again, and as I remember,
> read/write/etc will try to copy iovec into req->io. But iovec is
> already in req->io, so it will self memcpy(). Not a good thing.

That got fixed, it's no longer doing that anymore.

>> +			else if (!retry_count)
>> +				goto done_req;
>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
> 
> It's not nice to reset it as this:
> - prep() could set some work.flags
> - custom work.func is more performant (adds extra switch)
> - some may rely on specified work.func to be called. e.g. close(), even though
> it doesn't participate in the scheme

For now I just retain a copy of ->work, seems to be the easiest solution
vs trying to track this state.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 23:06                                       ` Jens Axboe
@ 2020-02-16 23:07                                         ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-16 23:07 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/16/20 4:06 PM, Jens Axboe wrote:
>>> +			else if (!retry_count)
>>> +				goto done_req;
>>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
>>
>> It's not nice to reset it as this:
>> - prep() could set some work.flags
>> - custom work.func is more performant (adds extra switch)
>> - some may rely on specified work.func to be called. e.g. close(), even though
>> it doesn't participate in the scheme
> 
> For now I just retain a copy of ->work, seems to be the easiest solution
> vs trying to track this state.

Should mention this isn't quite enough, we really need to drop anything
we have in ->work as well if it was already prepared/grabbed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-16 22:23                                       ` Jens Axboe
@ 2020-02-17 10:30                                         ` Pavel Begunkov
  2020-02-17 19:30                                           ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Pavel Begunkov @ 2020-02-17 10:30 UTC (permalink / raw)
  To: Jens Axboe, Carter Li 李通洲; +Cc: Peter Zijlstra, io-uring

On 2/17/2020 1:23 AM, Jens Axboe wrote:
> On 2/16/20 12:06 PM, Pavel Begunkov wrote:
>> On 15/02/2020 09:01, Jens Axboe wrote:
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index fb94b8bac638..530dcd91fa53 100644
>>> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>  	 */
>>>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>>>  	    (req->flags & REQ_F_MUST_PUNT))) {
>>> +
>>> +		if (io_arm_poll_handler(req, &retry_count)) {
>>> +			if (retry_count == 1)
>>> +				goto issue;
>>
>> Better to sqe=NULL before retrying, so it won't re-read sqe and try to
>> init the req twice.
> 
> Good point, that should get cleared after issue.
> 
>> Also, the second sync-issue may -EAGAIN again, and as I remember,
>> read/write/etc will try to copy iovec into req->io. But iovec is
>> already in req->io, so it will self memcpy(). Not a good thing.
> 
> I'll look into those details, that has indeed reared its head before.
> 
>>> +			else if (!retry_count)
>>> +				goto done_req;
>>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
>>
>> It's not nice to reset it as this:
>> - prep() could set some work.flags
>> - custom work.func is more performant (adds extra switch)
>> - some may rely on specified work.func to be called. e.g. close(), even though
>> it doesn't participate in the scheme
> 
> It's totally a hack as-is for the "can't do it, go async". I did clean

And I don't understand lifetimes yet... probably would need a couple of
questions later.

> this up a bit (if you check the git version, it's changed quite a bit),

That's what I've been looking at

> but it's still a mess in terms of that and ->work vs union ownership.
> The commit message also has a note about that.
> 
> So more work needed in that area for sure.

Right, I just checked a couple of things for you

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-14 20:44                         ` Jens Axboe
  2020-02-15  0:16                           ` Carter Li 李通洲
@ 2020-02-17 12:09                           ` Peter Zijlstra
  2020-02-17 16:12                             ` Jens Axboe
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-17 12:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On Fri, Feb 14, 2020 at 01:44:32PM -0700, Jens Axboe wrote:

I've not looked at git trees yet, but the below doesn't apply to
anything I have at hand.

Anyway, I think I can still make sense of it -- just a rename or two
seems to be missing.

A few notes on the below...

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..447b06c6bed0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -685,6 +685,11 @@ struct task_struct {
>  #endif
>  	struct sched_dl_entity		dl;
>  
> +#ifdef CONFIG_IO_URING
> +	struct list_head		uring_work;
> +	raw_spinlock_t			uring_lock;
> +#endif
> +

Could we pretty please use struct callback_head for this, just like
task_work() and RCU ? Look at task_work_add() for inspiration.

And maybe remove the uring naming form this.

>  #ifdef CONFIG_UCLAMP_TASK
>  	/* Clamp values requested for a scheduling entity */
>  	struct uclamp_se		uclamp_req[UCLAMP_CNT];
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51ca491d99ed..170fefa1caf8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
>  
> +#ifdef CONFIG_IO_URING
> +	INIT_LIST_HEAD(&p->uring_work);
> +	raw_spin_lock_init(&p->uring_lock);
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  	p->capture_control = NULL;
>  #endif
> @@ -4104,6 +4109,20 @@ void __noreturn do_task_dead(void)
>  		cpu_relax();
>  }
>  
> +#ifdef CONFIG_IO_URING
> +extern void io_uring_task_handler(struct task_struct *tsk);
> +
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +	if (!list_empty(&tsk->uring_work))
> +		io_uring_task_handler(tsk);
> +}
> +#else /* !CONFIG_IO_URING */
> +static inline void io_uring_handler(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  static void sched_out_update(struct task_struct *tsk)
>  {
>  	/*
> @@ -4121,6 +4140,7 @@ static void sched_out_update(struct task_struct *tsk)
>  			io_wq_worker_sleeping(tsk);
>  		preempt_enable_no_resched();
>  	}
> +	io_uring_handler(tsk);
>  }
>  
>  static void sched_in_update(struct task_struct *tsk)
> @@ -4131,6 +4151,7 @@ static void sched_in_update(struct task_struct *tsk)
>  		else
>  			io_wq_worker_running(tsk);
>  	}
> +	io_uring_handler(tsk);
>  }

The problem I have here is that we have an unconditional load of the
cacheline that has ->uring_work in.

/me curses about how nobody seems interested in building useful
cacheline analyis tools :/

Lemme see if I can find a spot for this... perhaps something like so?



diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0918904c939d..4fba93293fa1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -649,6 +649,7 @@ struct task_struct {
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
 	unsigned int			ptrace;
+	int				on_rq;
 
 #ifdef CONFIG_SMP
 	struct llist_node		wake_entry;
@@ -671,14 +672,16 @@ struct task_struct {
 	int				recent_used_cpu;
 	int				wake_cpu;
 #endif
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
 
+	struct callbach_head		*sched_work;
+
 	const struct sched_class	*sched_class;
+
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
 #ifdef CONFIG_CGROUP_SCHED



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 12:09                           ` Peter Zijlstra
@ 2020-02-17 16:12                             ` Jens Axboe
  2020-02-17 17:16                               ` Jens Axboe
  2020-02-18 13:13                               ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-17 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/17/20 5:09 AM, Peter Zijlstra wrote:
> On Fri, Feb 14, 2020 at 01:44:32PM -0700, Jens Axboe wrote:
> 
> I've not looked at git trees yet, but the below doesn't apply to
> anything I have at hand.
> 
> Anyway, I think I can still make sense of it -- just a rename or two
> seems to be missing.
> 
> A few notes on the below...

Thanks for continuing to look at it, while we both try and make sense of
it :-)

>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 04278493bf15..447b06c6bed0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -685,6 +685,11 @@ struct task_struct {
>>  #endif
>>  	struct sched_dl_entity		dl;
>>  
>> +#ifdef CONFIG_IO_URING
>> +	struct list_head		uring_work;
>> +	raw_spinlock_t			uring_lock;
>> +#endif
>> +
> 
> Could we pretty please use struct callback_head for this, just like
> task_work() and RCU ? Look at task_work_add() for inspiration.

Sure, so add a new one, sched_work, and have it get this sched-in or
sched-out behavior.

Only potential hitch I see there is related to ordering, which is more
of a fairness thab correctness issue. I'm going to ignore that for now,
and we can always revisit later.

> And maybe remove the uring naming form this.

No problem

>>  #ifdef CONFIG_UCLAMP_TASK
>>  	/* Clamp values requested for a scheduling entity */
>>  	struct uclamp_se		uclamp_req[UCLAMP_CNT];
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 51ca491d99ed..170fefa1caf8 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2717,6 +2717,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>>  #endif
>>  
>> +#ifdef CONFIG_IO_URING
>> +	INIT_LIST_HEAD(&p->uring_work);
>> +	raw_spin_lock_init(&p->uring_lock);
>> +#endif
>> +
>>  #ifdef CONFIG_COMPACTION
>>  	p->capture_control = NULL;
>>  #endif
>> @@ -4104,6 +4109,20 @@ void __noreturn do_task_dead(void)
>>  		cpu_relax();
>>  }
>>  
>> +#ifdef CONFIG_IO_URING
>> +extern void io_uring_task_handler(struct task_struct *tsk);
>> +
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +	if (!list_empty(&tsk->uring_work))
>> +		io_uring_task_handler(tsk);
>> +}
>> +#else /* !CONFIG_IO_URING */
>> +static inline void io_uring_handler(struct task_struct *tsk)
>> +{
>> +}
>> +#endif
>> +
>>  static void sched_out_update(struct task_struct *tsk)
>>  {
>>  	/*
>> @@ -4121,6 +4140,7 @@ static void sched_out_update(struct task_struct *tsk)
>>  			io_wq_worker_sleeping(tsk);
>>  		preempt_enable_no_resched();
>>  	}
>> +	io_uring_handler(tsk);
>>  }
>>  
>>  static void sched_in_update(struct task_struct *tsk)
>> @@ -4131,6 +4151,7 @@ static void sched_in_update(struct task_struct *tsk)
>>  		else
>>  			io_wq_worker_running(tsk);
>>  	}
>> +	io_uring_handler(tsk);
>>  }
> 
> The problem I have here is that we have an unconditional load of the
> cacheline that has ->uring_work in.
> 
> /me curses about how nobody seems interested in building useful
> cacheline analyis tools :/
> 
> Lemme see if I can find a spot for this... perhaps something like so?
> 
> 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0918904c939d..4fba93293fa1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -649,6 +649,7 @@ struct task_struct {
>  	/* Per task flags (PF_*), defined further below: */
>  	unsigned int			flags;
>  	unsigned int			ptrace;
> +	int				on_rq;
>  
>  #ifdef CONFIG_SMP
>  	struct llist_node		wake_entry;
> @@ -671,14 +672,16 @@ struct task_struct {
>  	int				recent_used_cpu;
>  	int				wake_cpu;
>  #endif
> -	int				on_rq;
>  
>  	int				prio;
>  	int				static_prio;
>  	int				normal_prio;
>  	unsigned int			rt_priority;
>  
> +	struct callbach_head		*sched_work;
> +
>  	const struct sched_class	*sched_class;
> +
>  	struct sched_entity		se;
>  	struct sched_rt_entity		rt;
>  #ifdef CONFIG_CGROUP_SCHED

Thanks, I'll kick off the series with doing it based on this instead.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 16:12                             ` Jens Axboe
@ 2020-02-17 17:16                               ` Jens Axboe
  2020-02-17 17:46                                 ` Peter Zijlstra
  2020-02-18 13:13                               ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-17 17:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/17/20 9:12 AM, Jens Axboe wrote:
> On 2/17/20 5:09 AM, Peter Zijlstra wrote:
>> On Fri, Feb 14, 2020 at 01:44:32PM -0700, Jens Axboe wrote:
>>
>> I've not looked at git trees yet, but the below doesn't apply to
>> anything I have at hand.
>>
>> Anyway, I think I can still make sense of it -- just a rename or two
>> seems to be missing.
>>
>> A few notes on the below...
> 
> Thanks for continuing to look at it, while we both try and make sense of
> it :-)
> 
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 04278493bf15..447b06c6bed0 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -685,6 +685,11 @@ struct task_struct {
>>>  #endif
>>>  	struct sched_dl_entity		dl;
>>>  
>>> +#ifdef CONFIG_IO_URING
>>> +	struct list_head		uring_work;
>>> +	raw_spinlock_t			uring_lock;
>>> +#endif
>>> +
>>
>> Could we pretty please use struct callback_head for this, just like
>> task_work() and RCU ? Look at task_work_add() for inspiration.
> 
> Sure, so add a new one, sched_work, and have it get this sched-in or
> sched-out behavior.
> 
> Only potential hitch I see there is related to ordering, which is more
> of a fairness thab correctness issue. I'm going to ignore that for now,
> and we can always revisit later.

OK, did the conversion, and it turned out pretty trivial, and reduces my
lines as well since I don't have to manage the list side. See here:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll

Three small prep patches:

sched: move io-wq/workqueue worker sched in/out into helpers
kernel: abstract out task work helpers
sched: add a sched_work list

and then the two main patches before on top of that, one which uses this
just for the poll command, then the last one which enables any request
to use this path if it's pollable.

Let me know what you think of the direction, and I'll send out a
"proper" series for perusing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 17:16                               ` Jens Axboe
@ 2020-02-17 17:46                                 ` Peter Zijlstra
  2020-02-17 18:16                                   ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-17 17:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On Mon, Feb 17, 2020 at 09:16:34AM -0800, Jens Axboe wrote:
> OK, did the conversion, and it turned out pretty trivial, and reduces my
> lines as well since I don't have to manage the list side. See here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
> 
> Three small prep patches:
> 
> sched: move io-wq/workqueue worker sched in/out into helpers
> kernel: abstract out task work helpers
> sched: add a sched_work list

The __task_work_add() thing should loose the set_notify_resume() thing,
that is very much task_work specific. Task_work, works off of
TIF_NOTIFY_RESUME on return-to-user. We really don't want that set for
the sched_work stuff.

I've not looked too hard at the rest, I need to run to the Dojo, should
have some time laster tonight, otherwise tomorrow ;-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 17:46                                 ` Peter Zijlstra
@ 2020-02-17 18:16                                   ` Jens Axboe
  2020-02-18 13:13                                     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-17 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/17/20 10:46 AM, Peter Zijlstra wrote:
> On Mon, Feb 17, 2020 at 09:16:34AM -0800, Jens Axboe wrote:
>> OK, did the conversion, and it turned out pretty trivial, and reduces my
>> lines as well since I don't have to manage the list side. See here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
>>
>> Three small prep patches:
>>
>> sched: move io-wq/workqueue worker sched in/out into helpers
>> kernel: abstract out task work helpers
>> sched: add a sched_work list
> 
> The __task_work_add() thing should loose the set_notify_resume() thing,
> that is very much task_work specific. Task_work, works off of
> TIF_NOTIFY_RESUME on return-to-user. We really don't want that set for
> the sched_work stuff.

Done, killed it from the generic helper.

I also made a tweak to work_exited, as I'll need that for the sched work.

> I've not looked too hard at the rest, I need to run to the Dojo, should
> have some time laster tonight, otherwise tomorrow ;-)

Enjoy! Thanks again for taking a look. I've pushed out the update:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 10:30                                         ` Pavel Begunkov
@ 2020-02-17 19:30                                           ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-17 19:30 UTC (permalink / raw)
  To: Pavel Begunkov, Carter Li 李通洲
  Cc: Peter Zijlstra, io-uring

On 2/17/20 3:30 AM, Pavel Begunkov wrote:
> On 2/17/2020 1:23 AM, Jens Axboe wrote:
>> On 2/16/20 12:06 PM, Pavel Begunkov wrote:
>>> On 15/02/2020 09:01, Jens Axboe wrote:
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index fb94b8bac638..530dcd91fa53 100644
>>>> @@ -4630,6 +4753,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>  	 */
>>>>  	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
>>>>  	    (req->flags & REQ_F_MUST_PUNT))) {
>>>> +
>>>> +		if (io_arm_poll_handler(req, &retry_count)) {
>>>> +			if (retry_count == 1)
>>>> +				goto issue;
>>>
>>> Better to sqe=NULL before retrying, so it won't re-read sqe and try to
>>> init the req twice.
>>
>> Good point, that should get cleared after issue.
>>
>>> Also, the second sync-issue may -EAGAIN again, and as I remember,
>>> read/write/etc will try to copy iovec into req->io. But iovec is
>>> already in req->io, so it will self memcpy(). Not a good thing.
>>
>> I'll look into those details, that has indeed reared its head before.
>>
>>>> +			else if (!retry_count)
>>>> +				goto done_req;
>>>> +			INIT_IO_WORK(&req->work, io_wq_submit_work);
>>>
>>> It's not nice to reset it as this:
>>> - prep() could set some work.flags
>>> - custom work.func is more performant (adds extra switch)
>>> - some may rely on specified work.func to be called. e.g. close(), even though
>>> it doesn't participate in the scheme
>>
>> It's totally a hack as-is for the "can't do it, go async". I did clean
> 
> And I don't understand lifetimes yet... probably would need a couple of
> questions later.
> 
>> this up a bit (if you check the git version, it's changed quite a bit),
> 
> That's what I've been looking at
> 
>> but it's still a mess in terms of that and ->work vs union ownership.
>> The commit message also has a note about that.
>>
>> So more work needed in that area for sure.
> 
> Right, I just checked a couple of things for you

Appreciate it! I'll send out the series for easier review and
commenting. It's getting better, but still a bit rough around the edges.
It passes the test suite now, except for:

- socket-rw and connect - both of these don't get a poll wakeup from the
  networking side when the connection is made (or data is sent), which
  is very odd. So we just keep waiting for that, and nothing happens.
  Need to figure out why there's never a wakeup on it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 18:16                                   ` Jens Axboe
@ 2020-02-18 13:13                                     ` Peter Zijlstra
  2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
                                                         ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 13:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov

On Mon, Feb 17, 2020 at 10:16:20AM -0800, Jens Axboe wrote:

> Enjoy! Thanks again for taking a look. I've pushed out the update:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll

Oleg, we're looking at adding sched-out/sched-in work, based on
task_work. Either see the above git tree, or the folded patch at the
very end of this email.

But this has me wondering about task_work_run(), as it is it will
unconditionally take pi_lock, would not something like this make sense?

(with the caveat that try_cmpxchg() doesn't seem available on !x86 -- I
should go fix that)

---
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0fef395662a6..bf5a1f37c3c0 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -93,16 +93,20 @@ void task_work_run(void)
 	struct callback_head *work, *head, *next;
 
 	for (;;) {
+		work = READ_ONCE(task->task_work);
+		if (!work)
+			break
+
 		/*
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
 		raw_spin_lock_irq(&task->pi_lock);
 		do {
-			work = READ_ONCE(task->task_works);
-			head = !work && (task->flags & PF_EXITING) ?
-				&work_exited : NULL;
-		} while (cmpxchg(&task->task_works, work, head) != work);
+			head = NULL;
+			if (unlikely(!work && (task->flags & PF_EXITING)))
+				head = &work_exited;
+		} while (!try_cmpxchg(&task->task_works, &work, head));
 		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)



---

Jens, I think you want something like this on top of what you have,
mostly it is adding sched_work_run() to exit_task_work().

---

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index c688c56ce01d..e0c56f461df6 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -17,13 +17,14 @@ int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
 
+int sched_work_add(struct task_struct *task, struct callback_head *work);
+struct callback_head *sched_work_cancel(struct task_struct *, task_work_func_t);
+void sched_work_run(void);
+
 static inline void exit_task_work(struct task_struct *task)
 {
 	task_work_run();
+	sched_work_run();
 }
 
-int sched_work_add(struct task_struct *task, struct callback_head *work);
-void sched_work_run(struct task_struct *task);
-struct callback_head *sched_work_cancel(struct task_struct *, task_work_func_t);
-
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6fda35072c2d..26d6ecf0e0cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4106,7 +4106,7 @@ void __noreturn do_task_dead(void)
 static bool sched_out_update(struct task_struct *tsk)
 {
 	if (unlikely(tsk->sched_work)) {
-		sched_work_run(tsk);
+		sched_work_run();
 		return true;
 	}
 
@@ -4138,7 +4138,7 @@ static void sched_in_update(struct task_struct *tsk)
 			io_wq_worker_running(tsk);
 	}
 	if (unlikely(tsk->sched_work))
-		sched_work_run(tsk);
+		sched_work_run();
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index d73d1cd0c4dd..679550c8c7a4 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -117,10 +117,10 @@ sched_work_cancel(struct task_struct *task, task_work_func_t func)
 	return __task_work_cancel(task, &task->sched_work, func);
 }
 
-static void __task_work_run(struct task_struct *task,
-			    struct callback_head **headptr)
+static void __task_work_run(struct callback_head **headptr)
 {
 	struct callback_head *work, *head, *next;
+	struct task_struct *task = current;
 
 	for (;;) {
 		/*
@@ -157,10 +157,10 @@ static void __task_work_run(struct task_struct *task,
  */
 void task_work_run(void)
 {
-	__task_work_run(current, &current->task_works);
+	__task_work_run(&current->task_works);
 }
 
-void sched_work_run(struct task_struct *task)
+void sched_work_run()
 {
-	__task_work_run(task, &task->sched_work);
+	__task_work_run(&task->sched_work);
 }


----

folded patches from Jens

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0918904c939d..8d3fbf6e815b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -649,6 +649,7 @@ struct task_struct {
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
 	unsigned int			ptrace;
+	int				on_rq;
 
 #ifdef CONFIG_SMP
 	struct llist_node		wake_entry;
@@ -671,13 +672,14 @@ struct task_struct {
 	int				recent_used_cpu;
 	int				wake_cpu;
 #endif
-	int				on_rq;
 
 	int				prio;
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
 
+	struct callback_head		*sched_work;
+
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index bd9a6a91c097..c688c56ce01d 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -22,4 +22,8 @@ static inline void exit_task_work(struct task_struct *task)
 	task_work_run();
 }
 
+int sched_work_add(struct task_struct *task, struct callback_head *work);
+void sched_work_run(struct task_struct *task);
+struct callback_head *sched_work_cancel(struct task_struct *, task_work_func_t);
+
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e94819d573be..6fda35072c2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2678,6 +2678,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
 static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	p->on_rq			= 0;
+	p->sched_work			= NULL;
 
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
@@ -4102,10 +4103,12 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static inline void sched_submit_work(struct task_struct *tsk)
+static bool sched_out_update(struct task_struct *tsk)
 {
-	if (!tsk->state)
-		return;
+	if (unlikely(tsk->sched_work)) {
+		sched_work_run(tsk);
+		return true;
+	}
 
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
@@ -4123,18 +4126,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
 		preempt_enable_no_resched();
 	}
 
-	if (tsk_is_pi_blocked(tsk))
-		return;
-
-	/*
-	 * If we are going to sleep and we have plugged IO queued,
-	 * make sure to submit it to avoid deadlocks.
-	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	return false;
 }
 
-static void sched_update_worker(struct task_struct *tsk)
+static void sched_in_update(struct task_struct *tsk)
 {
 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
@@ -4142,6 +4137,28 @@ static void sched_update_worker(struct task_struct *tsk)
 		else
 			io_wq_worker_running(tsk);
 	}
+	if (unlikely(tsk->sched_work))
+		sched_work_run(tsk);
+}
+
+static inline void sched_submit_work(struct task_struct *tsk)
+{
+	if (!tsk->state)
+		return;
+
+	/* if we processed work, we could be runnable again. check. */
+	if (sched_out_update(tsk) && !tsk->state)
+		return;
+
+	if (tsk_is_pi_blocked(tsk))
+		return;
+
+	/*
+	 * If we are going to sleep and we have plugged IO queued,
+	 * make sure to submit it to avoid deadlocks.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -4154,7 +4171,7 @@ asmlinkage __visible void __sched schedule(void)
 		__schedule(false);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
-	sched_update_worker(tsk);
+	sched_in_update(tsk);
 }
 EXPORT_SYMBOL(schedule);
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0fef395662a6..d73d1cd0c4dd 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,7 +3,30 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
-static struct callback_head work_exited; /* all we need is ->next == NULL */
+static void task_exit_func(struct callback_head *head)
+{
+}
+
+static struct callback_head work_exited = {
+	.next	= NULL,
+	.func	= task_exit_func,
+};
+
+static int __task_work_add(struct task_struct *task,
+			   struct callback_head **headptr,
+			   struct callback_head *work)
+{
+	struct callback_head *head;
+
+	do {
+		head = READ_ONCE(*headptr);
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
+		work->next = head;
+	} while (cmpxchg(headptr, head, work) != head);
+
+	return 0;
+}
 
 /**
  * task_work_add - ask the @task to execute @work->func()
@@ -27,39 +50,31 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *head;
+	int ret;
 
-	do {
-		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
-			return -ESRCH;
-		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
+	ret = __task_work_add(task, &task->task_works, work);
 
 	if (notify)
 		set_notify_resume(task);
-	return 0;
+
+	return ret;
 }
 
-/**
- * task_work_cancel - cancel a pending work added by task_work_add()
- * @task: the task which should execute the work
- * @func: identifies the work to remove
- *
- * Find the last queued pending work with ->func == @func and remove
- * it from queue.
- *
- * RETURNS:
- * The found work or NULL if not found.
- */
-struct callback_head *
-task_work_cancel(struct task_struct *task, task_work_func_t func)
+int
+sched_work_add(struct task_struct *task, struct callback_head *work)
+{
+	return __task_work_add(task, &task->sched_work, work);
+}
+
+static struct callback_head *__task_work_cancel(struct task_struct *task,
+						struct callback_head **headptr,
+						task_work_func_t func)
 {
-	struct callback_head **pprev = &task->task_works;
+	struct callback_head **pprev = headptr;
 	struct callback_head *work;
 	unsigned long flags;
 
-	if (likely(!task->task_works))
+	if (likely(!(*headptr)))
 		return NULL;
 	/*
 	 * If cmpxchg() fails we continue without updating pprev.
@@ -80,16 +95,31 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 }
 
 /**
- * task_work_run - execute the works added by task_work_add()
+ * task_work_cancel - cancel a pending work added by task_work_add()
+ * @task: the task which should execute the work
+ * @func: identifies the work to remove
  *
- * Flush the pending works. Should be used by the core kernel code.
- * Called before the task returns to the user-mode or stops, or when
- * it exits. In the latter case task_work_add() can no longer add the
- * new work after task_work_run() returns.
+ * Find the last queued pending work with ->func == @func and remove
+ * it from queue.
+ *
+ * RETURNS:
+ * The found work or NULL if not found.
  */
-void task_work_run(void)
+struct callback_head *
+task_work_cancel(struct task_struct *task, task_work_func_t func)
+{
+	return __task_work_cancel(task, &task->task_works, func);
+}
+
+struct callback_head *
+sched_work_cancel(struct task_struct *task, task_work_func_t func)
+{
+	return __task_work_cancel(task, &task->sched_work, func);
+}
+
+static void __task_work_run(struct task_struct *task,
+			    struct callback_head **headptr)
 {
-	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
 	for (;;) {
@@ -99,10 +129,10 @@ void task_work_run(void)
 		 */
 		raw_spin_lock_irq(&task->pi_lock);
 		do {
-			work = READ_ONCE(task->task_works);
+			work = READ_ONCE(*headptr);
 			head = !work && (task->flags & PF_EXITING) ?
 				&work_exited : NULL;
-		} while (cmpxchg(&task->task_works, work, head) != work);
+		} while (cmpxchg(headptr, work, head) != work);
 		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)
@@ -116,3 +146,21 @@ void task_work_run(void)
 		} while (work);
 	}
 }
+
+/**
+ * task_work_run - execute the works added by task_work_add()
+ *
+ * Flush the pending works. Should be used by the core kernel code.
+ * Called before the task returns to the user-mode or stops, or when
+ * it exits. In the latter case task_work_add() can no longer add the
+ * new work after task_work_run() returns.
+ */
+void task_work_run(void)
+{
+	__task_work_run(current, &current->task_works);
+}
+
+void sched_work_run(struct task_struct *task)
+{
+	__task_work_run(task, &task->sched_work);
+}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-17 16:12                             ` Jens Axboe
  2020-02-17 17:16                               ` Jens Axboe
@ 2020-02-18 13:13                               ` Peter Zijlstra
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 13:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On Mon, Feb 17, 2020 at 08:12:30AM -0800, Jens Axboe wrote:
> Only potential hitch I see there is related to ordering, which is more
> of a fairness thab correctness issue. I'm going to ignore that for now,
> and we can always revisit later.

If that ever becomes a thing, see:

  c82199061009 ("task_work: remove fifo ordering guarantee")

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-02-18 13:13                                     ` Peter Zijlstra
@ 2020-02-18 14:27                                       ` Peter Zijlstra
  2020-02-18 14:40                                         ` Peter Zijlstra
  2020-02-20 10:30                                         ` Will Deacon
  2020-02-18 14:56                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Oleg Nesterov
  2020-02-18 16:46                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Jens Axboe
  2 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 14:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov, Mark Rutland,
	linux-kernel, Will Deacon

On Tue, Feb 18, 2020 at 02:13:10PM +0100, Peter Zijlstra wrote:
> (with the caveat that try_cmpxchg() doesn't seem available on !x86 -- I
> should go fix that)

Completely untested (lemme go do that shortly), but something like so I
suppose.

---
Subject: asm-generic/atomic: Add try_cmpxchg() fallbacks

Only x86 provides try_cmpxchg() outside of the atomic_t interfaces,
provide generic fallbacks to create this interface from the widely
available cmpxchg() function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
diff --git a/include/linux/atomic-fallback.h b/include/linux/atomic-fallback.h
index 656b5489b673..243f61d6c35f 100644
--- a/include/linux/atomic-fallback.h
+++ b/include/linux/atomic-fallback.h
@@ -77,6 +77,50 @@
 
 #endif /* cmpxchg64_relaxed */
 
+#ifndef try_cmpxchg
+#define try_cmpxchg(_ptr, _oldp, _new) \
+({ \
+	typeof(*ptr) ___r, ___o = *(_oldp); \
+	___r = cmpxchg((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*(_old) = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg */
+
+#ifndef try_cmpxchg_acquire
+#define try_cmpxchg_acquire(_ptr, _oldp, _new) \
+({ \
+	typeof(*ptr) ___r, ___o = *(_oldp); \
+	___r = cmpxchg_acquire((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*(_old) = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_acquire */
+
+#ifndef try_cmpxchg_release
+#define try_cmpxchg_release(_ptr, _oldp, _new) \
+({ \
+	typeof(*ptr) ___r, ___o = *(_oldp); \
+	___r = cmpxchg_release((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*(_old) = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_release */
+
+#ifndef try_cmpxchg_relaxed
+#define try_cmpxchg_relaxed(_ptr, _oldp, _new) \
+({ \
+	typeof(*ptr) ___r, ___o = *(_oldp); \
+	___r = cmpxchg_relaxed((_ptr), ___o, (_new)); \
+	if (unlikely(___r != ___o)) \
+		*(_old) = ___r; \
+	likely(___r == ___o); \
+})
+#endif /* try_cmpxchg_relaxed */
+
 #ifndef atomic_read_acquire
 static __always_inline int
 atomic_read_acquire(const atomic_t *v)
@@ -2294,4 +2338,4 @@ atomic64_dec_if_positive(atomic64_t *v)
 #define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
 
 #endif /* _LINUX_ATOMIC_FALLBACK_H */
-// baaf45f4c24ed88ceae58baca39d7fd80bb8101b
+// 2dfbc767ce308d2edd67a49bd7b764dd07f62f6c
diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
index b6c6f5d306a7..3c9be8d550e0 100755
--- a/scripts/atomic/gen-atomic-fallback.sh
+++ b/scripts/atomic/gen-atomic-fallback.sh
@@ -140,6 +140,32 @@ cat <<EOF
 EOF
 }
 
+gen_try_cmpxchg_fallback()
+{
+	local order="$1"; shift;
+
+cat <<EOF
+#ifndef try_cmpxchg${order}
+#define try_cmpxchg${order}(_ptr, _oldp, _new) \\
+({ \\
+	typeof(*ptr) ___r, ___o = *(_oldp); \\
+	___r = cmpxchg${order}((_ptr), ___o, (_new)); \\
+	if (unlikely(___r != ___o)) \\
+		*(_old) = ___r; \\
+	likely(___r == ___o); \\
+})
+#endif /* try_cmpxchg${order} */
+
+EOF
+}
+
+gen_try_cmpxchg_fallbacks()
+{
+	for order in "" "_acquire" "_release" "_relaxed"; do
+		gen_try_cmpxchg_fallback "${order}"
+	done
+}
+
 cat << EOF
 // SPDX-License-Identifier: GPL-2.0
 
@@ -157,6 +183,8 @@ for xchg in "xchg" "cmpxchg" "cmpxchg64"; do
 	gen_xchg_fallbacks "${xchg}"
 done
 
+gen_try_cmpxchg_fallbacks
+
 grep '^[a-z]' "$1" | while read name meta args; do
 	gen_proto "${meta}" "${name}" "atomic" "int" ${args}
 done

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
@ 2020-02-18 14:40                                         ` Peter Zijlstra
  2020-02-20 10:30                                         ` Will Deacon
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 14:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov, Mark Rutland,
	linux-kernel, Will Deacon

On Tue, Feb 18, 2020 at 03:27:00PM +0100, Peter Zijlstra wrote:
> diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
> index b6c6f5d306a7..3c9be8d550e0 100755
> --- a/scripts/atomic/gen-atomic-fallback.sh
> +++ b/scripts/atomic/gen-atomic-fallback.sh
> @@ -140,6 +140,32 @@ cat <<EOF
>  EOF
>  }
>  
> +gen_try_cmpxchg_fallback()
> +{
> +	local order="$1"; shift;
> +
> +cat <<EOF
> +#ifndef try_cmpxchg${order}
> +#define try_cmpxchg${order}(_ptr, _oldp, _new) \\
> +({ \\
> +	typeof(*ptr) ___r, ___o = *(_oldp); \\
               *(_ptr)
> +	___r = cmpxchg${order}((_ptr), ___o, (_new)); \\
> +	if (unlikely(___r != ___o)) \\
> +		*(_old) = ___r; \\
		*(_oldp)
> +	likely(___r == ___o); \\
> +})
> +#endif /* try_cmpxchg${order} */

And it actually builds, as tested on an ARM build.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 13:13                                     ` Peter Zijlstra
  2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
@ 2020-02-18 14:56                                       ` Oleg Nesterov
  2020-02-18 15:07                                         ` Oleg Nesterov
  2020-02-18 15:07                                         ` Peter Zijlstra
  2020-02-18 16:46                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Jens Axboe
  2 siblings, 2 replies; 59+ messages in thread
From: Oleg Nesterov @ 2020-02-18 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On 02/18, Peter Zijlstra wrote:
>
> But this has me wondering about task_work_run(), as it is it will
> unconditionally take pi_lock,

because spin_unlock_wait() was removed ;) task_work_run() doesn't
really need to take pi_lock at all. 

> would not something like this make sense?

I think yes, but see below.

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -93,16 +93,20 @@ void task_work_run(void)
>  	struct callback_head *work, *head, *next;
>
>  	for (;;) {
> +		work = READ_ONCE(task->task_work);
> +		if (!work)
> +			break

This is wrong if PF_EXITING is set, in this case we must set
task->task_works = work_exited.

> +
>  		/*
>  		 * work->func() can do task_work_add(), do not set
>  		 * work_exited unless the list is empty.
>  		 */
>  		raw_spin_lock_irq(&task->pi_lock);
>  		do {
> -			work = READ_ONCE(task->task_works);
> -			head = !work && (task->flags & PF_EXITING) ?
> -				&work_exited : NULL;
> -		} while (cmpxchg(&task->task_works, work, head) != work);
> +			head = NULL;
> +			if (unlikely(!work && (task->flags & PF_EXITING)))
> +				head = &work_exited;
> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>  		raw_spin_unlock_irq(&task->pi_lock);
>
>  		if (!work)

otherwise I think this is correct, but how about the patch below?
Then this code can be changed to use try_cmpxchg().

Oleg.

--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -97,17 +97,24 @@ void task_work_run(void)
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
-		raw_spin_lock_irq(&task->pi_lock);
 		do {
+			head = NULL;
 			work = READ_ONCE(task->task_works);
-			head = !work && (task->flags & PF_EXITING) ?
-				&work_exited : NULL;
+			if (!work) {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
 		} while (cmpxchg(&task->task_works, work, head) != work);
-		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)
 			break;
 
+		// Synchronize with task_work_cancel()
+		raw_spin_lock_irq(&task->pi_lock);
+		raw_spin_unlock_irq(&task->pi_lock);
+
 		do {
 			next = work->next;
 			work->func(work);


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 14:56                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Oleg Nesterov
@ 2020-02-18 15:07                                         ` Oleg Nesterov
  2020-02-18 15:38                                           ` Peter Zijlstra
  2020-02-18 15:07                                         ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2020-02-18 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On 02/18, Oleg Nesterov wrote:
>
> otherwise I think this is correct, but how about the patch below?
> Then this code can be changed to use try_cmpxchg().

You have already sent the patch which adds the generic try_cmpxchg,
so the patch below can be trivially adapted.

But I'd prefer another change, I think both task_work_add() and
task_work_cancel() can use try_cmpxchg() too.

> 
> Oleg.
> 
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -97,17 +97,24 @@ void task_work_run(void)
>  		 * work->func() can do task_work_add(), do not set
>  		 * work_exited unless the list is empty.
>  		 */
> -		raw_spin_lock_irq(&task->pi_lock);
>  		do {
> +			head = NULL;
>  			work = READ_ONCE(task->task_works);
> -			head = !work && (task->flags & PF_EXITING) ?
> -				&work_exited : NULL;
> +			if (!work) {
> +				if (task->flags & PF_EXITING)
> +					head = &work_exited;
> +				else
> +					break;
> +			}
>  		} while (cmpxchg(&task->task_works, work, head) != work);
> -		raw_spin_unlock_irq(&task->pi_lock);
>  
>  		if (!work)
>  			break;
>  
> +		// Synchronize with task_work_cancel()
> +		raw_spin_lock_irq(&task->pi_lock);
> +		raw_spin_unlock_irq(&task->pi_lock);
> +
>  		do {
>  			next = work->next;
>  			work->func(work);


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 14:56                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Oleg Nesterov
  2020-02-18 15:07                                         ` Oleg Nesterov
@ 2020-02-18 15:07                                         ` Peter Zijlstra
  2020-02-18 15:50                                           ` [PATCH] task_work_run: don't take ->pi_lock unconditionally Oleg Nesterov
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 15:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On Tue, Feb 18, 2020 at 03:56:46PM +0100, Oleg Nesterov wrote:
> On 02/18, Peter Zijlstra wrote:
> >
> > But this has me wondering about task_work_run(), as it is it will
> > unconditionally take pi_lock,
> 
> because spin_unlock_wait() was removed ;) task_work_run() doesn't
> really need to take pi_lock at all. 

Right.

> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -93,16 +93,20 @@ void task_work_run(void)
> >  	struct callback_head *work, *head, *next;
> >
> >  	for (;;) {
> > +		work = READ_ONCE(task->task_work);
> > +		if (!work)
> > +			break
> 
> This is wrong if PF_EXITING is set, in this case we must set
> task->task_works = work_exited.

Indeed!

> > +
> >  		/*
> >  		 * work->func() can do task_work_add(), do not set
> >  		 * work_exited unless the list is empty.
> >  		 */
> >  		raw_spin_lock_irq(&task->pi_lock);
> >  		do {
> > -			work = READ_ONCE(task->task_works);
> > -			head = !work && (task->flags & PF_EXITING) ?
> > -				&work_exited : NULL;
> > -		} while (cmpxchg(&task->task_works, work, head) != work);
> > +			head = NULL;
> > +			if (unlikely(!work && (task->flags & PF_EXITING)))
> > +				head = &work_exited;
> > +		} while (!try_cmpxchg(&task->task_works, &work, head));
> >  		raw_spin_unlock_irq(&task->pi_lock);
> >
> >  		if (!work)
> 
> otherwise I think this is correct, but how about the patch below?
> Then this code can be changed to use try_cmpxchg().

Works for me. Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 15:07                                         ` Oleg Nesterov
@ 2020-02-18 15:38                                           ` Peter Zijlstra
  2020-02-18 16:33                                             ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-18 15:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On Tue, Feb 18, 2020 at 04:07:45PM +0100, Oleg Nesterov wrote:
> On 02/18, Oleg Nesterov wrote:
> >
> > otherwise I think this is correct, but how about the patch below?
> > Then this code can be changed to use try_cmpxchg().
> 
> You have already sent the patch which adds the generic try_cmpxchg,
> so the patch below can be trivially adapted.
> 
> But I'd prefer another change, I think both task_work_add() and
> task_work_cancel() can use try_cmpxchg() too.

Yeah, I'll go change the lot, but maybe after Jens' patches, otherwise
we're just creating merge conflicts.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-18 15:07                                         ` Peter Zijlstra
@ 2020-02-18 15:50                                           ` Oleg Nesterov
  2020-02-20 16:39                                             ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2020-02-18 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
if task->task_works == NULL && !PF_EXITING.

And in fact the only reason why task_work_run() needs ->pi_lock is
the possible race with task_work_cancel(), we can optimize this code
and make the locking more clear.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/task_work.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 0fef395..825f282 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -97,16 +97,26 @@ void task_work_run(void)
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
-		raw_spin_lock_irq(&task->pi_lock);
 		do {
+			head = NULL;
 			work = READ_ONCE(task->task_works);
-			head = !work && (task->flags & PF_EXITING) ?
-				&work_exited : NULL;
+			if (!work) {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
 		} while (cmpxchg(&task->task_works, work, head) != work);
-		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (!work)
 			break;
+		/*
+		 * Synchronize with task_work_cancel(). It can not remove
+		 * the first entry == work, cmpxchg(task_works) must fail.
+		 * But it can remove another entry from the ->next list.
+		 */
+		raw_spin_lock_irq(&task->pi_lock);
+		raw_spin_unlock_irq(&task->pi_lock);
 
 		do {
 			next = work->next;
-- 
2.5.0



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 15:38                                           ` Peter Zijlstra
@ 2020-02-18 16:33                                             ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-18 16:33 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/18/20 8:38 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2020 at 04:07:45PM +0100, Oleg Nesterov wrote:
>> On 02/18, Oleg Nesterov wrote:
>>>
>>> otherwise I think this is correct, but how about the patch below?
>>> Then this code can be changed to use try_cmpxchg().
>>
>> You have already sent the patch which adds the generic try_cmpxchg,
>> so the patch below can be trivially adapted.
>>
>> But I'd prefer another change, I think both task_work_add() and
>> task_work_cancel() can use try_cmpxchg() too.
> 
> Yeah, I'll go change the lot, but maybe after Jens' patches, otherwise
> we're just creating merge conflicts.

Just caught up with this thread, great stuff! Don't worry about me, I'll
just rebase on top of the fixes and cleanups from you and Oleg. Or you
can apply my two first if you wish, doesn't really matter to me, as I'll
likely just pull in that branch anyway for the rest to sit on top of.
Just let me know.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 13:13                                     ` Peter Zijlstra
  2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
  2020-02-18 14:56                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Oleg Nesterov
@ 2020-02-18 16:46                                       ` Jens Axboe
  2020-02-18 16:52                                         ` Jens Axboe
  2 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-18 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov

On 2/18/20 6:13 AM, Peter Zijlstra wrote:
> Jens, I think you want something like this on top of what you have,
> mostly it is adding sched_work_run() to exit_task_work().

It also makes it a bit cleaner, I don't like the implied task == current
we have in a bunch of spots. Folded this in (thanks!) with minor edit:

> @@ -157,10 +157,10 @@ static void __task_work_run(struct task_struct *task,
>   */
>  void task_work_run(void)
>  {
> -	__task_work_run(current, &current->task_works);
> +	__task_work_run(&current->task_works);
>  }
>  
> -void sched_work_run(struct task_struct *task)
> +void sched_work_run()
>  {
> -	__task_work_run(task, &task->sched_work);
> +	__task_work_run(&task->sched_work);
>  }

s/task/current for this last one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [ISSUE] The time cost of IOSQE_IO_LINK
  2020-02-18 16:46                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Jens Axboe
@ 2020-02-18 16:52                                         ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-18 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov

On 2/18/20 9:46 AM, Jens Axboe wrote:
> On 2/18/20 6:13 AM, Peter Zijlstra wrote:
>> Jens, I think you want something like this on top of what you have,
>> mostly it is adding sched_work_run() to exit_task_work().
> 
> It also makes it a bit cleaner, I don't like the implied task == current
> we have in a bunch of spots. Folded this in (thanks!) with minor edit:
> 
>> @@ -157,10 +157,10 @@ static void __task_work_run(struct task_struct *task,
>>   */
>>  void task_work_run(void)
>>  {
>> -	__task_work_run(current, &current->task_works);
>> +	__task_work_run(&current->task_works);
>>  }
>>  
>> -void sched_work_run(struct task_struct *task)
>> +void sched_work_run()
>>  {
>> -	__task_work_run(task, &task->sched_work);
>> +	__task_work_run(&task->sched_work);
>>  }
> 
> s/task/current for this last one.

Current series, same spot:

https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll

I added Oleg's patch first in the series, then rebased on top and folded
in the incremental from you. Will run some testing now, otherwise only
sporadically available today and tomorrow. Unless something major comes
up, I'll send this out for review tomorrow afternoon local time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
  2020-02-18 14:40                                         ` Peter Zijlstra
@ 2020-02-20 10:30                                         ` Will Deacon
  2020-02-20 10:37                                           ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Will Deacon @ 2020-02-20 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov, Mark Rutland,
	linux-kernel

On Tue, Feb 18, 2020 at 03:27:00PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2020 at 02:13:10PM +0100, Peter Zijlstra wrote:
> > (with the caveat that try_cmpxchg() doesn't seem available on !x86 -- I
> > should go fix that)
> 
> Completely untested (lemme go do that shortly), but something like so I
> suppose.
> 
> ---
> Subject: asm-generic/atomic: Add try_cmpxchg() fallbacks
> 
> Only x86 provides try_cmpxchg() outside of the atomic_t interfaces,
> provide generic fallbacks to create this interface from the widely
> available cmpxchg() function.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> diff --git a/include/linux/atomic-fallback.h b/include/linux/atomic-fallback.h
> index 656b5489b673..243f61d6c35f 100644
> --- a/include/linux/atomic-fallback.h
> +++ b/include/linux/atomic-fallback.h
> @@ -77,6 +77,50 @@
>  
>  #endif /* cmpxchg64_relaxed */
>  
> +#ifndef try_cmpxchg
> +#define try_cmpxchg(_ptr, _oldp, _new) \
> +({ \
> +	typeof(*ptr) ___r, ___o = *(_oldp); \

Probably worth pointing out that this will have the nasty behaviour
for volatile pointers that we're tackling for READ_ONCE. Obviously no
need to hold this up, but just mentioning it here in the hope that one
of us remembers to fix it later on.

> diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
> index b6c6f5d306a7..3c9be8d550e0 100755
> --- a/scripts/atomic/gen-atomic-fallback.sh
> +++ b/scripts/atomic/gen-atomic-fallback.sh
> @@ -140,6 +140,32 @@ cat <<EOF
>  EOF
>  }
>  
> +gen_try_cmpxchg_fallback()
> +{
> +	local order="$1"; shift;
> +
> +cat <<EOF
> +#ifndef try_cmpxchg${order}
> +#define try_cmpxchg${order}(_ptr, _oldp, _new) \\
> +({ \\
> +	typeof(*ptr) ___r, ___o = *(_oldp); \\
> +	___r = cmpxchg${order}((_ptr), ___o, (_new)); \\
> +	if (unlikely(___r != ___o)) \\
> +		*(_old) = ___r; \\

This doesn't compile because _old isn't declared. Probably best to avoid
evaluating _oldp twice though.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-02-20 10:30                                         ` Will Deacon
@ 2020-02-20 10:37                                           ` Peter Zijlstra
  2020-02-20 10:39                                             ` Will Deacon
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-20 10:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jens Axboe, Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov, Mark Rutland,
	linux-kernel

On Thu, Feb 20, 2020 at 10:30:45AM +0000, Will Deacon wrote:
> On Tue, Feb 18, 2020 at 03:27:00PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 18, 2020 at 02:13:10PM +0100, Peter Zijlstra wrote:
> > > (with the caveat that try_cmpxchg() doesn't seem available on !x86 -- I
> > > should go fix that)
> > 
> > Completely untested (lemme go do that shortly), but something like so I
> > suppose.
> > 
> > ---
> > Subject: asm-generic/atomic: Add try_cmpxchg() fallbacks
> > 
> > Only x86 provides try_cmpxchg() outside of the atomic_t interfaces,
> > provide generic fallbacks to create this interface from the widely
> > available cmpxchg() function.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > diff --git a/include/linux/atomic-fallback.h b/include/linux/atomic-fallback.h
> > index 656b5489b673..243f61d6c35f 100644
> > --- a/include/linux/atomic-fallback.h
> > +++ b/include/linux/atomic-fallback.h
> > @@ -77,6 +77,50 @@
> >  
> >  #endif /* cmpxchg64_relaxed */
> >  
> > +#ifndef try_cmpxchg
> > +#define try_cmpxchg(_ptr, _oldp, _new) \
> > +({ \
> > +	typeof(*ptr) ___r, ___o = *(_oldp); \
> 
> Probably worth pointing out that this will have the nasty behaviour
> for volatile pointers that we're tackling for READ_ONCE. Obviously no
> need to hold this up, but just mentioning it here in the hope that one
> of us remembers to fix it later on.

Right :/

> > diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
> > index b6c6f5d306a7..3c9be8d550e0 100755
> > --- a/scripts/atomic/gen-atomic-fallback.sh
> > +++ b/scripts/atomic/gen-atomic-fallback.sh
> > @@ -140,6 +140,32 @@ cat <<EOF
> >  EOF
> >  }
> >  
> > +gen_try_cmpxchg_fallback()
> > +{
> > +	local order="$1"; shift;
> > +
> > +cat <<EOF
> > +#ifndef try_cmpxchg${order}
> > +#define try_cmpxchg${order}(_ptr, _oldp, _new) \\
> > +({ \\
> > +	typeof(*ptr) ___r, ___o = *(_oldp); \\
> > +	___r = cmpxchg${order}((_ptr), ___o, (_new)); \\
> > +	if (unlikely(___r != ___o)) \\
> > +		*(_old) = ___r; \\
> 
> This doesn't compile because _old isn't declared. Probably best to avoid
> evaluating _oldp twice though.

Compiler had already spotted that, I'll make it something like:

	typeof(*ptr) *___op = (_oldp), ___o = *___op;

	...

		*___op = ___r;

Which avoids the double eval.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks
  2020-02-20 10:37                                           ` Peter Zijlstra
@ 2020-02-20 10:39                                             ` Will Deacon
  0 siblings, 0 replies; 59+ messages in thread
From: Will Deacon @ 2020-02-20 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲,
	Pavel Begunkov, io-uring, Oleg Nesterov, Mark Rutland,
	linux-kernel

On Thu, Feb 20, 2020 at 11:37:27AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 20, 2020 at 10:30:45AM +0000, Will Deacon wrote:
> > On Tue, Feb 18, 2020 at 03:27:00PM +0100, Peter Zijlstra wrote:
> > > diff --git a/scripts/atomic/gen-atomic-fallback.sh b/scripts/atomic/gen-atomic-fallback.sh
> > > index b6c6f5d306a7..3c9be8d550e0 100755
> > > --- a/scripts/atomic/gen-atomic-fallback.sh
> > > +++ b/scripts/atomic/gen-atomic-fallback.sh
> > > @@ -140,6 +140,32 @@ cat <<EOF
> > >  EOF
> > >  }
> > >  
> > > +gen_try_cmpxchg_fallback()
> > > +{
> > > +	local order="$1"; shift;
> > > +
> > > +cat <<EOF
> > > +#ifndef try_cmpxchg${order}
> > > +#define try_cmpxchg${order}(_ptr, _oldp, _new) \\
> > > +({ \\
> > > +	typeof(*ptr) ___r, ___o = *(_oldp); \\
> > > +	___r = cmpxchg${order}((_ptr), ___o, (_new)); \\
> > > +	if (unlikely(___r != ___o)) \\
> > > +		*(_old) = ___r; \\
> > 
> > This doesn't compile because _old isn't declared. Probably best to avoid
> > evaluating _oldp twice though.
> 
> Compiler had already spotted that, I'll make it something like:
> 
> 	typeof(*ptr) *___op = (_oldp), ___o = *___op;
> 
> 	...
> 
> 		*___op = ___r;
> 
> Which avoids the double eval.

Cool, you can stick my Ack on the patch with that change.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-18 15:50                                           ` [PATCH] task_work_run: don't take ->pi_lock unconditionally Oleg Nesterov
@ 2020-02-20 16:39                                             ` Peter Zijlstra
  2020-02-20 17:22                                               ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-20 16:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On Tue, Feb 18, 2020 at 04:50:18PM +0100, Oleg Nesterov wrote:
> As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
> if task->task_works == NULL && !PF_EXITING.
> 
> And in fact the only reason why task_work_run() needs ->pi_lock is
> the possible race with task_work_cancel(), we can optimize this code
> and make the locking more clear.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---

Still playing with my try_cmpxchg() patches, how does the below look on
top?

---
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -27,14 +27,13 @@ static struct callback_head work_exited;
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *head;
+	struct callback_head *head = READ_ONCE(tsk->task_works);
 
 	do {
-		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
 		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
+	} while (!try_cmpxchg(&task->task_works, &head, work))
 
 	if (notify)
 		set_notify_resume(task);
@@ -90,26 +89,24 @@ task_work_cancel(struct task_struct *tas
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *work, *head, *next;
+	struct callback_head *work, *next;
 
 	for (;;) {
-		/*
-		 * work->func() can do task_work_add(), do not set
-		 * work_exited unless the list is empty.
-		 */
-		do {
-			head = NULL;
-			work = READ_ONCE(task->task_works);
-			if (!work) {
-				if (task->flags & PF_EXITING)
-					head = &work_exited;
-				else
-					break;
-			}
-		} while (cmpxchg(&task->task_works, work, head) != work);
+		work = READ_ONCE(task->task_works);
+		if (!work) {
+			if (!(task->flags & PF_EXITING))
+				return;
+
+			/*
+			 * work->func() can do task_work_add(), do not set
+			 * work_exited unless the list is empty.
+			 */
+			if (try_cmpxchg(&task->task_works, &work, &work_exited))
+				return;
+		}
+
+		work = xchg(&task->task_works, NULL);
 
-		if (!work)
-			break;
 		/*
 		 * Synchronize with task_work_cancel(). It can not remove
 		 * the first entry == work, cmpxchg(task_works) must fail.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-20 16:39                                             ` Peter Zijlstra
@ 2020-02-20 17:22                                               ` Oleg Nesterov
  2020-02-20 17:49                                                 ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2020-02-20 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On 02/20, Peter Zijlstra wrote:
>
> Still playing with my try_cmpxchg() patches, how does the below look on
> top?

I too made a simple patch right after you sent "[PATCH] asm-generic/atomic:
Add try_cmpxchg() fallbacks", see below.

But I am fine with your version, with one exception

>  void task_work_run(void)
>  {
>  	struct task_struct *task = current;
> -	struct callback_head *work, *head, *next;
> +	struct callback_head *work, *next;
>  
>  	for (;;) {
> -		/*
> -		 * work->func() can do task_work_add(), do not set
> -		 * work_exited unless the list is empty.
> -		 */
> -		do {
> -			head = NULL;
> -			work = READ_ONCE(task->task_works);
> -			if (!work) {
> -				if (task->flags & PF_EXITING)
> -					head = &work_exited;
> -				else
> -					break;
> -			}
> -		} while (cmpxchg(&task->task_works, work, head) != work);
> +		work = READ_ONCE(task->task_works);
> +		if (!work) {
> +			if (!(task->flags & PF_EXITING))
> +				return;
> +
> +			/*
> +			 * work->func() can do task_work_add(), do not set
> +			 * work_exited unless the list is empty.
> +			 */
> +			if (try_cmpxchg(&task->task_works, &work, &work_exited))
> +				return;
> +		}
> +
> +		work = xchg(&task->task_works, NULL);
>  
> -		if (!work)
> -			break;

You can't remove the "if (!work)" check, cancel_task_work() can remove
a single callback between READ_ONCE() and xchg().

Oleg.

--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -27,14 +27,11 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *head;
-
+	work->next = READ_ONCE(task->task_works);
 	do {
-		head = READ_ONCE(task->task_works);
-		if (unlikely(head == &work_exited))
+		if (unlikely(work->next == &work_exited))
 			return -ESRCH;
-		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
+	} while (!try_cmpxchg(&task->task_works, &work->next, work));
 
 	if (notify)
 		set_notify_resume(task);
@@ -68,10 +65,10 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	 * we raced with task_work_run(), *pprev == NULL/exited.
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	while ((work = READ_ONCE(*pprev))) {
+	for (work = READ_ONCE(*pprev); work; ) {
 		if (work->func != func)
 			pprev = &work->next;
-		else if (cmpxchg(pprev, work, work->next) == work)
+		else if (try_cmpxchg(pprev, &work, work->next))
 			break;
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -97,16 +94,16 @@ void task_work_run(void)
 		 * work->func() can do task_work_add(), do not set
 		 * work_exited unless the list is empty.
 		 */
+		work = READ_ONCE(task->task_works);
 		do {
 			head = NULL;
-			work = READ_ONCE(task->task_works);
 			if (!work) {
 				if (task->flags & PF_EXITING)
 					head = &work_exited;
 				else
 					break;
 			}
-		} while (cmpxchg(&task->task_works, work, head) != work);
+		} while (!try_cmpxchg(&task->task_works, &work, head));
 
 		if (!work)
 			break;


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-20 17:22                                               ` Oleg Nesterov
@ 2020-02-20 17:49                                                 ` Peter Zijlstra
  2020-02-21 14:52                                                   ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-20 17:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On Thu, Feb 20, 2020 at 06:22:02PM +0100, Oleg Nesterov wrote:
> On 02/20, Peter Zijlstra wrote:

> > +		work = READ_ONCE(task->task_works);
> > +		if (!work) {
> > +			if (!(task->flags & PF_EXITING))
> > +				return;
> > +
> > +			/*
> > +			 * work->func() can do task_work_add(), do not set
> > +			 * work_exited unless the list is empty.
> > +			 */
> > +			if (try_cmpxchg(&task->task_works, &work, &work_exited))
> > +				return;
> > +		}
> > +
> > +		work = xchg(&task->task_works, NULL);
> >  
> > -		if (!work)
> > -			break;
> 
> You can't remove the "if (!work)" check, cancel_task_work() can remove
> a single callback between READ_ONCE() and xchg().

Argh!

> Oleg.
> 
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -27,14 +27,11 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
>  int
>  task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
>  {
> -	struct callback_head *head;
> -
> +	work->next = READ_ONCE(task->task_works);
>  	do {
> -		head = READ_ONCE(task->task_works);
> -		if (unlikely(head == &work_exited))
> +		if (unlikely(work->next == &work_exited))
>  			return -ESRCH;
> -		work->next = head;
> -	} while (cmpxchg(&task->task_works, head, work) != head);
> +	} while (!try_cmpxchg(&task->task_works, &work->next, work));
>  
>  	if (notify)
>  		set_notify_resume(task);

Cute, that should work.

> @@ -68,10 +65,10 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
>  	 * we raced with task_work_run(), *pprev == NULL/exited.
>  	 */
>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
> +	for (work = READ_ONCE(*pprev); work; ) {
>  		if (work->func != func)
>  			pprev = &work->next;

But didn't you loose the READ_ONCE() of *pprev in this branch?

> +		else if (try_cmpxchg(pprev, &work, work->next))
>  			break;
>  	}
>  	raw_spin_unlock_irqrestore(&task->pi_lock, flags);


> @@ -97,16 +94,16 @@ void task_work_run(void)
>  		 * work->func() can do task_work_add(), do not set
>  		 * work_exited unless the list is empty.
>  		 */
> +		work = READ_ONCE(task->task_works);
>  		do {
>  			head = NULL;
>  			if (!work) {
>  				if (task->flags & PF_EXITING)
>  					head = &work_exited;
>  				else
>  					break;
>  			}
> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>  
>  		if (!work)
>  			break;

But given that, as you say, cancel() could have gone and stole our head,
should we not try and install &work_exiting when PF_EXITING in that
case?

That is; should we not do continue in that case, instead of break.

---

--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -27,14 +27,13 @@ static struct callback_head work_exited;
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *head;
+	struct callback_head *head = READ_ONCE(tsk->task_works);
 
 	do {
-		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
 		work->next = head;
-	} while (cmpxchg(&task->task_works, head, work) != head);
+	} while (!try_cmpxchg(&task->task_works, &head, work))
 
 	if (notify)
 		set_notify_resume(task);
@@ -69,9 +68,12 @@ task_work_cancel(struct task_struct *tas
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 	while ((work = READ_ONCE(*pprev))) {
-		if (work->func != func)
+		if (work->func != func) {
 			pprev = &work->next;
-		else if (cmpxchg(pprev, work, work->next) == work)
+			continue;
+		}
+
+		if (try_cmpxchg(pprev, &work, work->next))
 			break;
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -90,26 +92,26 @@ task_work_cancel(struct task_struct *tas
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *work, *head, *next;
+	struct callback_head *work, *next;
 
 	for (;;) {
-		/*
-		 * work->func() can do task_work_add(), do not set
-		 * work_exited unless the list is empty.
-		 */
-		do {
-			head = NULL;
-			work = READ_ONCE(task->task_works);
-			if (!work) {
-				if (task->flags & PF_EXITING)
-					head = &work_exited;
-				else
-					break;
-			}
-		} while (cmpxchg(&task->task_works, work, head) != work);
+		work = READ_ONCE(task->task_works);
+		if (!work) {
+			if (!(task->flags & PF_EXITING))
+				return;
+
+			/*
+			 * work->func() can do task_work_add(), do not set
+			 * work_exited unless the list is empty.
+			 */
+			if (try_cmpxchg(&task->task_works, &work, &work_exited))
+				return;
+		}
+
+		work = xchg(&task->task_works, NULL);
+		if (!work)
+			continue;
 
-		if (!work)
-			break;
 		/*
 		 * Synchronize with task_work_cancel(). It can not remove
 		 * the first entry == work, cmpxchg(task_works) must fail.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-20 17:49                                                 ` Peter Zijlstra
@ 2020-02-21 14:52                                                   ` Oleg Nesterov
  2020-02-24 18:47                                                     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2020-02-21 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Carter Li 李通洲, Pavel Begunkov, io-uring

On 02/20, Peter Zijlstra wrote:
>
> On Thu, Feb 20, 2020 at 06:22:02PM +0100, Oleg Nesterov wrote:
> > @@ -68,10 +65,10 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  	 * we raced with task_work_run(), *pprev == NULL/exited.
> >  	 */
> >  	raw_spin_lock_irqsave(&task->pi_lock, flags);
> > +	for (work = READ_ONCE(*pprev); work; ) {
> >  		if (work->func != func)
> >  			pprev = &work->next;
>
> But didn't you loose the READ_ONCE() of *pprev in this branch?

Argh, yes ;)

> > @@ -97,16 +94,16 @@ void task_work_run(void)
> >  		 * work->func() can do task_work_add(), do not set
> >  		 * work_exited unless the list is empty.
> >  		 */
> > +		work = READ_ONCE(task->task_works);
> >  		do {
> >  			head = NULL;
> >  			if (!work) {
> >  				if (task->flags & PF_EXITING)
> >  					head = &work_exited;
> >  				else
> >  					break;
> >  			}
> > +		} while (!try_cmpxchg(&task->task_works, &work, head));
> >
> >  		if (!work)
> >  			break;
>
> But given that, as you say, cancel() could have gone and stole our head,
> should we not try and install &work_exiting when PF_EXITING in that
> case?

cancel() can't do this, as long as we use cmpxchg/try_cmpxchg, not xchg().
This is what the comment before lock/unlock below tries to explain.

> That is; should we not do continue in that case, instead of break.

This is what we should do if we use xchg() like your previous version did.
Or I am totally confused. Hmm, and when I re-read my words it looks as if
I am trying to confuse you.

So lets "simplify" this code assuming that PF_EXITING is set:

		work = READ_ONCE(task->task_works);
		do {
			head = NULL;
			if (!work)
				head = &work_exited;
		} while (!try_cmpxchg(&task->task_works, &work, head));

		if (!work)
			break;

If work == NULL after try_cmpxchg() _succeeds_, then the new "head" must
be work_exited and we have nothing to do.

If it was nullified by try_cmpxchg(&work) because we raced with cancel_(),
then this try_cmpxchg() should have been failed.

Right?

> @@ -69,9 +68,12 @@ task_work_cancel(struct task_struct *tas
>  	 */
>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>  	while ((work = READ_ONCE(*pprev))) {
> -		if (work->func != func)
> +		if (work->func != func) {
>  			pprev = &work->next;
> -		else if (cmpxchg(pprev, work, work->next) == work)
> +			continue;
> +		}
> +
> +		if (try_cmpxchg(pprev, &work, work->next))
>  			break;

perhaps I misread this code, but it looks a bit strange to me... it doesn't
differ from

	while ((work = READ_ONCE(*pprev))) {
		if (work->func != func)
			pprev = &work->next;
		else if (try_cmpxchg(pprev, &work, work->next))
			break;
	}

either way it is correct, the only problem is that we do not need (want)
another READ_ONCE() if try_cmpxchg() fails.

>  void task_work_run(void)
>  {
>  	struct task_struct *task = current;
> -	struct callback_head *work, *head, *next;
> +	struct callback_head *work, *next;
>  
>  	for (;;) {
> -		/*
> -		 * work->func() can do task_work_add(), do not set
> -		 * work_exited unless the list is empty.
> -		 */
> -		do {
> -			head = NULL;
> -			work = READ_ONCE(task->task_works);
> -			if (!work) {
> -				if (task->flags & PF_EXITING)
> -					head = &work_exited;
> -				else
> -					break;
> -			}
> -		} while (cmpxchg(&task->task_works, work, head) != work);
> +		work = READ_ONCE(task->task_works);
> +		if (!work) {
> +			if (!(task->flags & PF_EXITING))
> +				return;
> +
> +			/*
> +			 * work->func() can do task_work_add(), do not set
> +			 * work_exited unless the list is empty.
> +			 */
> +			if (try_cmpxchg(&task->task_works, &work, &work_exited))
> +				return;
> +		}
> +
> +		work = xchg(&task->task_works, NULL);
> +		if (!work)
> +			continue;

looks correct...

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-21 14:52                                                   ` Oleg Nesterov
@ 2020-02-24 18:47                                                     ` Jens Axboe
  2020-02-28 19:17                                                       ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-24 18:47 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/21/20 7:52 AM, Oleg Nesterov wrote:
> On 02/20, Peter Zijlstra wrote:
>>
>> On Thu, Feb 20, 2020 at 06:22:02PM +0100, Oleg Nesterov wrote:
>>> @@ -68,10 +65,10 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
>>>  	 * we raced with task_work_run(), *pprev == NULL/exited.
>>>  	 */
>>>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>>> +	for (work = READ_ONCE(*pprev); work; ) {
>>>  		if (work->func != func)
>>>  			pprev = &work->next;
>>
>> But didn't you loose the READ_ONCE() of *pprev in this branch?
> 
> Argh, yes ;)
> 
>>> @@ -97,16 +94,16 @@ void task_work_run(void)
>>>  		 * work->func() can do task_work_add(), do not set
>>>  		 * work_exited unless the list is empty.
>>>  		 */
>>> +		work = READ_ONCE(task->task_works);
>>>  		do {
>>>  			head = NULL;
>>>  			if (!work) {
>>>  				if (task->flags & PF_EXITING)
>>>  					head = &work_exited;
>>>  				else
>>>  					break;
>>>  			}
>>> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>>>
>>>  		if (!work)
>>>  			break;
>>
>> But given that, as you say, cancel() could have gone and stole our head,
>> should we not try and install &work_exiting when PF_EXITING in that
>> case?
> 
> cancel() can't do this, as long as we use cmpxchg/try_cmpxchg, not xchg().
> This is what the comment before lock/unlock below tries to explain.
> 
>> That is; should we not do continue in that case, instead of break.
> 
> This is what we should do if we use xchg() like your previous version did.
> Or I am totally confused. Hmm, and when I re-read my words it looks as if
> I am trying to confuse you.
> 
> So lets "simplify" this code assuming that PF_EXITING is set:
> 
> 		work = READ_ONCE(task->task_works);
> 		do {
> 			head = NULL;
> 			if (!work)
> 				head = &work_exited;
> 		} while (!try_cmpxchg(&task->task_works, &work, head));
> 
> 		if (!work)
> 			break;
> 
> If work == NULL after try_cmpxchg() _succeeds_, then the new "head" must
> be work_exited and we have nothing to do.
> 
> If it was nullified by try_cmpxchg(&work) because we raced with cancel_(),
> then this try_cmpxchg() should have been failed.
> 
> Right?
> 
>> @@ -69,9 +68,12 @@ task_work_cancel(struct task_struct *tas
>>  	 */
>>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>>  	while ((work = READ_ONCE(*pprev))) {
>> -		if (work->func != func)
>> +		if (work->func != func) {
>>  			pprev = &work->next;
>> -		else if (cmpxchg(pprev, work, work->next) == work)
>> +			continue;
>> +		}
>> +
>> +		if (try_cmpxchg(pprev, &work, work->next))
>>  			break;
> 
> perhaps I misread this code, but it looks a bit strange to me... it doesn't
> differ from
> 
> 	while ((work = READ_ONCE(*pprev))) {
> 		if (work->func != func)
> 			pprev = &work->next;
> 		else if (try_cmpxchg(pprev, &work, work->next))
> 			break;
> 	}
> 
> either way it is correct, the only problem is that we do not need (want)
> another READ_ONCE() if try_cmpxchg() fails.
> 
>>  void task_work_run(void)
>>  {
>>  	struct task_struct *task = current;
>> -	struct callback_head *work, *head, *next;
>> +	struct callback_head *work, *next;
>>  
>>  	for (;;) {
>> -		/*
>> -		 * work->func() can do task_work_add(), do not set
>> -		 * work_exited unless the list is empty.
>> -		 */
>> -		do {
>> -			head = NULL;
>> -			work = READ_ONCE(task->task_works);
>> -			if (!work) {
>> -				if (task->flags & PF_EXITING)
>> -					head = &work_exited;
>> -				else
>> -					break;
>> -			}
>> -		} while (cmpxchg(&task->task_works, work, head) != work);
>> +		work = READ_ONCE(task->task_works);
>> +		if (!work) {
>> +			if (!(task->flags & PF_EXITING))
>> +				return;
>> +
>> +			/*
>> +			 * work->func() can do task_work_add(), do not set
>> +			 * work_exited unless the list is empty.
>> +			 */
>> +			if (try_cmpxchg(&task->task_works, &work, &work_exited))
>> +				return;
>> +		}
>> +
>> +		work = xchg(&task->task_works, NULL);
>> +		if (!work)
>> +			continue;
> 
> looks correct...

Peter/Oleg, as you've probably noticed, I'm still hauling Oleg's
original patch around. Is the above going to turn into a separate patch
on top?  If so, feel free to shove it my way as well for some extra
testing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-24 18:47                                                     ` Jens Axboe
@ 2020-02-28 19:17                                                       ` Jens Axboe
  2020-02-28 19:25                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-28 19:17 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Carter Li 李通洲, Pavel Begunkov, io-uring

On 2/24/20 11:47 AM, Jens Axboe wrote:
> On 2/21/20 7:52 AM, Oleg Nesterov wrote:
>> On 02/20, Peter Zijlstra wrote:
>>>
>>> On Thu, Feb 20, 2020 at 06:22:02PM +0100, Oleg Nesterov wrote:
>>>> @@ -68,10 +65,10 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
>>>>  	 * we raced with task_work_run(), *pprev == NULL/exited.
>>>>  	 */
>>>>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>>>> +	for (work = READ_ONCE(*pprev); work; ) {
>>>>  		if (work->func != func)
>>>>  			pprev = &work->next;
>>>
>>> But didn't you loose the READ_ONCE() of *pprev in this branch?
>>
>> Argh, yes ;)
>>
>>>> @@ -97,16 +94,16 @@ void task_work_run(void)
>>>>  		 * work->func() can do task_work_add(), do not set
>>>>  		 * work_exited unless the list is empty.
>>>>  		 */
>>>> +		work = READ_ONCE(task->task_works);
>>>>  		do {
>>>>  			head = NULL;
>>>>  			if (!work) {
>>>>  				if (task->flags & PF_EXITING)
>>>>  					head = &work_exited;
>>>>  				else
>>>>  					break;
>>>>  			}
>>>> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>>>>
>>>>  		if (!work)
>>>>  			break;
>>>
>>> But given that, as you say, cancel() could have gone and stole our head,
>>> should we not try and install &work_exiting when PF_EXITING in that
>>> case?
>>
>> cancel() can't do this, as long as we use cmpxchg/try_cmpxchg, not xchg().
>> This is what the comment before lock/unlock below tries to explain.
>>
>>> That is; should we not do continue in that case, instead of break.
>>
>> This is what we should do if we use xchg() like your previous version did.
>> Or I am totally confused. Hmm, and when I re-read my words it looks as if
>> I am trying to confuse you.
>>
>> So lets "simplify" this code assuming that PF_EXITING is set:
>>
>> 		work = READ_ONCE(task->task_works);
>> 		do {
>> 			head = NULL;
>> 			if (!work)
>> 				head = &work_exited;
>> 		} while (!try_cmpxchg(&task->task_works, &work, head));
>>
>> 		if (!work)
>> 			break;
>>
>> If work == NULL after try_cmpxchg() _succeeds_, then the new "head" must
>> be work_exited and we have nothing to do.
>>
>> If it was nullified by try_cmpxchg(&work) because we raced with cancel_(),
>> then this try_cmpxchg() should have been failed.
>>
>> Right?
>>
>>> @@ -69,9 +68,12 @@ task_work_cancel(struct task_struct *tas
>>>  	 */
>>>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>>>  	while ((work = READ_ONCE(*pprev))) {
>>> -		if (work->func != func)
>>> +		if (work->func != func) {
>>>  			pprev = &work->next;
>>> -		else if (cmpxchg(pprev, work, work->next) == work)
>>> +			continue;
>>> +		}
>>> +
>>> +		if (try_cmpxchg(pprev, &work, work->next))
>>>  			break;
>>
>> perhaps I misread this code, but it looks a bit strange to me... it doesn't
>> differ from
>>
>> 	while ((work = READ_ONCE(*pprev))) {
>> 		if (work->func != func)
>> 			pprev = &work->next;
>> 		else if (try_cmpxchg(pprev, &work, work->next))
>> 			break;
>> 	}
>>
>> either way it is correct, the only problem is that we do not need (want)
>> another READ_ONCE() if try_cmpxchg() fails.
>>
>>>  void task_work_run(void)
>>>  {
>>>  	struct task_struct *task = current;
>>> -	struct callback_head *work, *head, *next;
>>> +	struct callback_head *work, *next;
>>>  
>>>  	for (;;) {
>>> -		/*
>>> -		 * work->func() can do task_work_add(), do not set
>>> -		 * work_exited unless the list is empty.
>>> -		 */
>>> -		do {
>>> -			head = NULL;
>>> -			work = READ_ONCE(task->task_works);
>>> -			if (!work) {
>>> -				if (task->flags & PF_EXITING)
>>> -					head = &work_exited;
>>> -				else
>>> -					break;
>>> -			}
>>> -		} while (cmpxchg(&task->task_works, work, head) != work);
>>> +		work = READ_ONCE(task->task_works);
>>> +		if (!work) {
>>> +			if (!(task->flags & PF_EXITING))
>>> +				return;
>>> +
>>> +			/*
>>> +			 * work->func() can do task_work_add(), do not set
>>> +			 * work_exited unless the list is empty.
>>> +			 */
>>> +			if (try_cmpxchg(&task->task_works, &work, &work_exited))
>>> +				return;
>>> +		}
>>> +
>>> +		work = xchg(&task->task_works, NULL);
>>> +		if (!work)
>>> +			continue;
>>
>> looks correct...
> 
> Peter/Oleg, as you've probably noticed, I'm still hauling Oleg's
> original patch around. Is the above going to turn into a separate patch
> on top?  If so, feel free to shove it my way as well for some extra
> testing.

Peter/Oleg, gentle ping on this query. I'd like to queue up the task poll
rework on the io_uring side, but I still have this one at the start of
the series:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-task-poll&id=3b668ecf75f94f40c1faf9688ba3f32fb5e9f5d0


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-28 19:17                                                       ` Jens Axboe
@ 2020-02-28 19:25                                                         ` Peter Zijlstra
  2020-02-28 19:28                                                           ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-28 19:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Oleg Nesterov, Carter Li 李通洲,
	Pavel Begunkov, io-uring

On Fri, Feb 28, 2020 at 12:17:58PM -0700, Jens Axboe wrote:
> > 
> > Peter/Oleg, as you've probably noticed, I'm still hauling Oleg's
> > original patch around. Is the above going to turn into a separate patch
> > on top?  If so, feel free to shove it my way as well for some extra
> > testing.
> 
> Peter/Oleg, gentle ping on this query. I'd like to queue up the task poll
> rework on the io_uring side, but I still have this one at the start of
> the series:

the generic try_cmpxchg crud I posted earlier is 'wrong' and the sane
version I cooked up depends on a whole bunch of patches that are in
limbo because of kcsan (I need to get that resolved one way or the
other).

So if you don't mind, I'll shelf this for a while until I got all that
sorted.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-28 19:25                                                         ` Peter Zijlstra
@ 2020-02-28 19:28                                                           ` Jens Axboe
  2020-02-28 20:06                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2020-02-28 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Carter Li 李通洲,
	Pavel Begunkov, io-uring

On 2/28/20 12:25 PM, Peter Zijlstra wrote:
> On Fri, Feb 28, 2020 at 12:17:58PM -0700, Jens Axboe wrote:
>>>
>>> Peter/Oleg, as you've probably noticed, I'm still hauling Oleg's
>>> original patch around. Is the above going to turn into a separate patch
>>> on top?  If so, feel free to shove it my way as well for some extra
>>> testing.
>>
>> Peter/Oleg, gentle ping on this query. I'd like to queue up the task poll
>> rework on the io_uring side, but I still have this one at the start of
>> the series:
> 
> the generic try_cmpxchg crud I posted earlier is 'wrong' and the sane
> version I cooked up depends on a whole bunch of patches that are in
> limbo because of kcsan (I need to get that resolved one way or the
> other).
> 
> So if you don't mind, I'll shelf this for a while until I got all that
> sorted.

Shelf the one I have queued up, or the suggested changes that you had
for on top of it?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-28 19:28                                                           ` Jens Axboe
@ 2020-02-28 20:06                                                             ` Peter Zijlstra
  2020-02-28 20:15                                                               ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2020-02-28 20:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Oleg Nesterov, Carter Li 李通洲,
	Pavel Begunkov, io-uring

On Fri, Feb 28, 2020 at 12:28:54PM -0700, Jens Axboe wrote:

> Shelf the one I have queued up, or the suggested changes that you had
> for on top of it?

The stuff on top.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] task_work_run: don't take ->pi_lock unconditionally
  2020-02-28 20:06                                                             ` Peter Zijlstra
@ 2020-02-28 20:15                                                               ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2020-02-28 20:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Carter Li 李通洲,
	Pavel Begunkov, io-uring

On 2/28/20 1:06 PM, Peter Zijlstra wrote:
> On Fri, Feb 28, 2020 at 12:28:54PM -0700, Jens Axboe wrote:
> 
>> Shelf the one I have queued up, or the suggested changes that you had
>> for on top of it?
> 
> The stuff on top.

OK all good, thanks for confirming.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, back to index

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12 16:31 [ISSUE] The time cost of IOSQE_IO_LINK Carter Li 李通洲
2020-02-12 17:11 ` Jens Axboe
2020-02-12 17:22   ` Jens Axboe
2020-02-12 17:29     ` Jens Axboe
2020-02-13  0:33   ` Carter Li 李通洲
2020-02-13 15:08     ` Pavel Begunkov
2020-02-13 15:14       ` Jens Axboe
2020-02-13 15:51         ` Carter Li 李通洲
2020-02-14  1:25           ` Carter Li 李通洲
2020-02-14  2:45             ` Jens Axboe
2020-02-14  5:03               ` Jens Axboe
2020-02-14 15:32                 ` Peter Zijlstra
2020-02-14 15:47                   ` Jens Axboe
2020-02-14 16:18                     ` Jens Axboe
2020-02-14 17:52                       ` Jens Axboe
2020-02-14 20:44                         ` Jens Axboe
2020-02-15  0:16                           ` Carter Li 李通洲
2020-02-15  1:10                             ` Jens Axboe
2020-02-15  1:25                               ` Carter Li 李通洲
2020-02-15  1:27                                 ` Jens Axboe
2020-02-15  6:01                                   ` Jens Axboe
2020-02-15  6:32                                     ` Carter Li 李通洲
2020-02-15 15:11                                       ` Jens Axboe
2020-02-16 19:06                                     ` Pavel Begunkov
2020-02-16 22:23                                       ` Jens Axboe
2020-02-17 10:30                                         ` Pavel Begunkov
2020-02-17 19:30                                           ` Jens Axboe
2020-02-16 23:06                                       ` Jens Axboe
2020-02-16 23:07                                         ` Jens Axboe
2020-02-17 12:09                           ` Peter Zijlstra
2020-02-17 16:12                             ` Jens Axboe
2020-02-17 17:16                               ` Jens Axboe
2020-02-17 17:46                                 ` Peter Zijlstra
2020-02-17 18:16                                   ` Jens Axboe
2020-02-18 13:13                                     ` Peter Zijlstra
2020-02-18 14:27                                       ` [PATCH] asm-generic/atomic: Add try_cmpxchg() fallbacks Peter Zijlstra
2020-02-18 14:40                                         ` Peter Zijlstra
2020-02-20 10:30                                         ` Will Deacon
2020-02-20 10:37                                           ` Peter Zijlstra
2020-02-20 10:39                                             ` Will Deacon
2020-02-18 14:56                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Oleg Nesterov
2020-02-18 15:07                                         ` Oleg Nesterov
2020-02-18 15:38                                           ` Peter Zijlstra
2020-02-18 16:33                                             ` Jens Axboe
2020-02-18 15:07                                         ` Peter Zijlstra
2020-02-18 15:50                                           ` [PATCH] task_work_run: don't take ->pi_lock unconditionally Oleg Nesterov
2020-02-20 16:39                                             ` Peter Zijlstra
2020-02-20 17:22                                               ` Oleg Nesterov
2020-02-20 17:49                                                 ` Peter Zijlstra
2020-02-21 14:52                                                   ` Oleg Nesterov
2020-02-24 18:47                                                     ` Jens Axboe
2020-02-28 19:17                                                       ` Jens Axboe
2020-02-28 19:25                                                         ` Peter Zijlstra
2020-02-28 19:28                                                           ` Jens Axboe
2020-02-28 20:06                                                             ` Peter Zijlstra
2020-02-28 20:15                                                               ` Jens Axboe
2020-02-18 16:46                                       ` [ISSUE] The time cost of IOSQE_IO_LINK Jens Axboe
2020-02-18 16:52                                         ` Jens Axboe
2020-02-18 13:13                               ` Peter Zijlstra

IO-Uring Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/io-uring/0 io-uring/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 io-uring io-uring/ https://lore.kernel.org/io-uring \
		io-uring@vger.kernel.org
	public-inbox-index io-uring

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.io-uring


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git