Hi,

I found this issue while testing the IORING_FEAT_FAST_POLL feature with the newest upstream code. In my test environment, io_uring's performance improvement over epoll is not obvious; most of the time the two are similar. The test cases basically come from:
https://github.com/frevib/io_uring-echo-server/blob/io-uring-feat-fast-poll/benchmarks/benchmarks.md

At that URL, the author's test results show io_uring getting a big performance improvement over epoll. I'm still looking into why I don't see the same improvement and don't know the reason yet, but I did find an obvious regression.

I wrote a simple tool based on the io_uring nop operation to evaluate the io_uring framework on v5.1 and 5.7.0-rc4+ (Jens's io_uring-5.7 branch), and I see an obvious performance regression:

v5.1 kernel:
    $ sudo taskset -c 60 ./io_uring_nop_stress -r 300 # run 300 seconds
    total ios: 1832524960
    IOPS:      6108416

5.7.0-rc4+:
    $ sudo taskset -c 60 ./io_uring_nop_stress -r 300
    total ios: 1597672304
    IOPS:      5325574

That is a performance regression of about 12.8%.

perf shows many performance bottlenecks; io_submit_sqes() is one of them. I haven't done much analysis yet and have only looked at io_submit_sqes(). There are many assignment operations in io_init_req(), and I'm not sure whether they are all needed when the req does not need to be punted to io-wq. For example:

    INIT_IO_WORK(&req->work, io_wq_submit_work); # a whole struct assignment

According to perf annotate, this is an expensive operation. I think reqs that use the fast poll feature go through the task-work function, so the INIT_IO_WORK may not be necessary for them.

The above is just one issue. What worries me is whether io_uring is gradually becoming more bloated and will no longer be that much better than aio. https://kernel.dk/io_uring.pdf says that io_uring eliminates 104 bytes of copying compared to aio, but looking at the current io_init_req(), io_uring may now copy more and introduce more overhead?
Or do we need to carefully redesign struct io_kiocb to reduce overhead as soon as possible?

Regards,
Xiaoguang Wang