* napi_busy_poll
@ 2022-02-08 14:58 Olivier Langlois
  2022-02-08 17:05 ` napi_busy_poll Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-08 14:58 UTC (permalink / raw)
  To: io-uring

Hi,

I was wondering if integrating the NAPI busy poll for socket fds into
io_uring, the way select/poll/epoll do it, has ever been considered?

It seems to me that it could be an awesome feature when used along with
an io_qpoll thread and something not too difficult to add...

Greetings,
Olivier


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-08 14:58 napi_busy_poll Olivier Langlois
@ 2022-02-08 17:05 ` Jens Axboe
  2022-02-09  3:34   ` napi_busy_poll Hao Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2022-02-08 17:05 UTC (permalink / raw)
  To: Olivier Langlois, io-uring

On 2/8/22 7:58 AM, Olivier Langlois wrote:
> Hi,
> 
> I was wondering if integrating the NAPI busy poll for socket fds into
> io_uring, the way select/poll/epoll do it, has ever been considered?
> 
> It seems to me that it could be an awesome feature when used along with
> an io_qpoll thread and something not too difficult to add...

Should be totally doable and it's been brought up before, just needs
someone to actually do it... Would love to see it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-08 17:05 ` napi_busy_poll Jens Axboe
@ 2022-02-09  3:34   ` Hao Xu
  2022-02-12 19:51     ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-09  3:34 UTC (permalink / raw)
  To: Jens Axboe, Olivier Langlois, io-uring

On 2022/2/9 1:05 AM, Jens Axboe wrote:
> On 2/8/22 7:58 AM, Olivier Langlois wrote:
>> Hi,
>>
>> I was wondering if integrating the NAPI busy poll for socket fds into
>> io_uring, the way select/poll/epoll do it, has ever been considered?
>>
>> It seems to me that it could be an awesome feature when used along with
>> an io_qpoll thread and something not too difficult to add...
> 
> Should be totally doable and it's been brought up before, just needs
> someone to actually do it... Would love to see it.
> 
We've done some investigation on this before and would like to give it a try.

Regards,
Hao


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-09  3:34   ` napi_busy_poll Hao Xu
@ 2022-02-12 19:51     ` Olivier Langlois
  2022-02-13 18:47       ` napi_busy_poll Jens Axboe
  2022-02-14 17:13       ` napi_busy_poll Hao Xu
  0 siblings, 2 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-12 19:51 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-09 at 11:34 +0800, Hao Xu wrote:
> On 2022/2/9 1:05 AM, Jens Axboe wrote:
> > On 2/8/22 7:58 AM, Olivier Langlois wrote:
> > > Hi,
> > > 
> > > I was wondering if integrating the NAPI busy poll for socket fds
> > > into io_uring, the way select/poll/epoll do it, has ever been
> > > considered?
> > > 
> > > It seems to me that it could be an awesome feature when used along
> > > with an io_qpoll thread and something not too difficult to add...
> > 
> > Should be totally doable and it's been brought up before, just
> > needs
> > someone to actually do it... Would love to see it.
> > 
> We've done some investigation on this before and would like to give it a try.
> 
Hao,

Let me know if I can help you with coding or testing. I have done very
preliminary investigation too. It doesn't seem like it would be very
hard to implement but I get confused with small details.

For instance, the epoll implementation, unless there is something that
I don't understand, appears to have a serious limitation. It seems like
it would not work correctly if there are sockets associated with more
than 1 NAPI device in the fd set. As far as I am concerned, that
limitation would be ok since in my setup I only use 1 device but if it
was necessary to be better than the epoll implementation, I am not sure
at all how this could be addressed. I do not have enough kernel dev
experience to find easy solutions to those types of issues...

Worst case scenario, I guess that I could give it a shot creating a
good enough implementation for my needs and show it to the list to get
feedback...

Greetings,
Olivier


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-12 19:51     ` napi_busy_poll Olivier Langlois
@ 2022-02-13 18:47       ` Jens Axboe
  2022-02-14 17:13       ` napi_busy_poll Hao Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2022-02-13 18:47 UTC (permalink / raw)
  To: Olivier Langlois, Hao Xu, io-uring

On 2/12/22 12:51 PM, Olivier Langlois wrote:
> On Wed, 2022-02-09 at 11:34 +0800, Hao Xu wrote:
>> On 2022/2/9 1:05 AM, Jens Axboe wrote:
>>> On 2/8/22 7:58 AM, Olivier Langlois wrote:
>>>> Hi,
>>>>
>>>> I was wondering if integrating the NAPI busy poll for socket fds
>>>> into io_uring, the way select/poll/epoll do it, has ever been
>>>> considered?
>>>>
>>>> It seems to me that it could be an awesome feature when used along
>>>> with an io_qpoll thread and something not too difficult to add...
>>>
>>> Should be totally doable and it's been brought up before, just
>>> needs
>>> someone to actually do it... Would love to see it.
>>>
>> We've done some investigation on this before and would like to give it a try.
>>
> Hao,
> 
> Let me know if I can help you with coding or testing. I have done very
> preliminary investigation too. It doesn't seem like it would be very
> hard to implement but I get confused with small details.
> 
> For instance, the epoll implementation, unless there is something that
> I don't understand, appears to have a serious limitation. It seems
> like it would not work correctly if there are sockets associated with
> more than 1 NAPI device in the fd set. As far as I am concerned, that
> limitation would be ok since in my setup I only use 1 device but if it
> was necessary to be better than the epoll implementation, I am not
> sure at all how this could be addressed. I do not have enough kernel
> dev experience to find easy solutions to those types of issues...
> 
> Worst case scenario, I guess that I could give it a shot creating a
> good enough implementation for my needs and show it to the list to get
> feedback...

That is, in fact, usually the best way to get feedback on a change! I
believe there's even a law named for it, though local to Linux I
attribute it to akpm. Hence I'd encourage you to do that, best and
fastest way to make progress on the solution.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-12 19:51     ` napi_busy_poll Olivier Langlois
  2022-02-13 18:47       ` napi_busy_poll Jens Axboe
@ 2022-02-14 17:13       ` Hao Xu
  2022-02-15  8:37         ` napi_busy_poll Olivier Langlois
  1 sibling, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-14 17:13 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring


On 2/13/22 03:51, Olivier Langlois wrote:
> On Wed, 2022-02-09 at 11:34 +0800, Hao Xu wrote:
>> On 2022/2/9 1:05 AM, Jens Axboe wrote:
>>> On 2/8/22 7:58 AM, Olivier Langlois wrote:
>>>> Hi,
>>>>
>>>> I was wondering if integrating the NAPI busy poll for socket fds
>>>> into io_uring, the way select/poll/epoll do it, has ever been
>>>> considered?
>>>>
>>>> It seems to me that it could be an awesome feature when used along
>>>> with an io_qpoll thread and something not too difficult to add...
>>> Should be totally doable and it's been brought up before, just
>>> needs
>>> someone to actually do it... Would love to see it.
>>>
>> We've done some investigation on this before and would like to give it a try.
>>
> Hao,
>
> Let me know if I can help you with coding or testing. I have done very
> preliminary investigation too. It doesn't seem like it would be very
> hard to implement but I get confused with small details.
>
> For instance, the epoll implementation, unless there is something that
> I don't understand, appears to have a serious limitation. It seems like
> it would not work correctly if there are sockets associated with more
> than 1 NAPI device in the fd set. As far as I am concerned, that

Yes, it seems that epoll_wait only does busy polling for 1 NAPI.

I think it is because the busy polling there is just an optimization
(doing some polling before trapping into sleep), not a must-have, so
it's kind of a trade-off between polling and reacting to other events,
I guess. Not very sure about this either...

The io_uring implementation I have in mind is polling for every NAPI
involved.


Regards,

Hao

> limitation would be ok since in my setup I only use 1 device but if it
> was necessary to be better than the epoll implementation, I am not sure
> at all how this could be addressed. I do not have enough kernel dev
> experience to find easy solutions to those types of issues...
>
> Worst case scenario, I guess that I could give it a shot creating a
> good enough implementation for my needs and show it to the list to get
> feedback...
>
> Greetings,
> Olivier

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-14 17:13       ` napi_busy_poll Hao Xu
@ 2022-02-15  8:37         ` Olivier Langlois
  2022-02-15 18:05           ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-15  8:37 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Tue, 2022-02-15 at 01:13 +0800, Hao Xu wrote:
> 
> Yes, it seems that epoll_wait only does busy polling for 1 NAPI.
> 
> I think it is because the busy polling there is just an optimization
> (doing some polling before trapping into sleep), not a must-have, so
> it's kind of a trade-off between polling and reacting to other events,
> I guess. Not very sure about this either...
> 
> The io_uring implementation I have in mind is polling for every NAPI
> involved.
> 
Hao,

I have found the explanation for the epoll oddity:

In:
https://legacy.netdevconf.info/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Linux-4.12 changes
epoll() support was added by Sridhar Samudrala and Alexander Duyck,
with the assumption that an application using epoll() and busy polling
would first make sure that it would classify sockets based on their
receive queue (NAPI ID), and use at least one epoll fd per receive
queue.

SO_INCOMING_NAPI_ID was added as a new socket option to retrieve this
information, instead of relying on other mechanisms (CPU or NUMA
identifications).
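
For reference, the userspace side of that classification scheme would
look roughly like this (my own untested sketch, not code from the
slides):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* value from <asm-generic/socket.h> */
#endif

/* Ask the kernel which NAPI (receive queue) delivered this socket's
 * packets, so the fd can be dispatched to the polling context
 * dedicated to that queue. Stays 0 until a packet has been received.
 */
static unsigned int get_incoming_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
		       &napi_id, &len) < 0)
		perror("getsockopt(SO_INCOMING_NAPI_ID)");
	return napi_id;
}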

I have created a small toy implementation with some limitations:
1. It assumes a single napi_id per io_uring ctx like what epoll does
2. It does not detect when pending requests using supporting sockets
are all gone.

That being said, I have not been able to make it work yet. For some
unknown reason, no valid napi_id is extracted from the sockets added
to the context, so the net_busy_poll function is never called.

I find that very strange since prior to using io_uring, my code was using
epoll and the busy polling was working fine with my application
sockets. Something is escaping my comprehension. I must be tired and this
will become obvious...

In the meantime, here is what I have created so far. Feel free to play
with it and/or enhance it:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..d3deca9b9ef5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include <net/sock.h>
 #include <net/af_unix.h>
 #include <net/scm.h>
+#include <net/busy_poll.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -395,6 +396,10 @@ struct io_ring_ctx {
 	struct list_head	sqd_list;
 
 	unsigned long		check_cq_overflow;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	/* used to track busy poll napi_id */
+	unsigned int napi_id;
+#endif
 
 	struct {
 		unsigned		cached_cq_tail;
@@ -6976,7 +6981,40 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
 	io_req_set_rsrc_node(req, ctx);
 	return file;
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Set epoll busy poll NAPI ID from sk.
+ */
+static inline void io_set_busy_poll_napi_id(struct io_ring_ctx *ctx, struct file *file)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+
+	if (!net_busy_loop_on())
+		return;
 
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+
+	/* Non-NAPI IDs can be rejected
+	 *	or
+	 * Nothing to do if we already have this ID
+	 */
+	if (napi_id < MIN_NAPI_ID || napi_id == ctx->napi_id)
+		return;
+
+	/* record NAPI ID for use in next busy poll */
+	ctx->napi_id = napi_id;
+}
+#endif
 static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 				       struct io_kiocb *req, int fd)
 {
@@ -6985,8 +7023,14 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 	trace_io_uring_file_get(ctx, fd);
 
 	/* we don't allow fixed io_uring files */
-	if (file && unlikely(file->f_op == &io_uring_fops))
-		io_req_track_inflight(req);
+	if (file) {
+		if (unlikely(file->f_op == &io_uring_fops))
+			io_req_track_inflight(req);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		else
+			io_set_busy_poll_napi_id(ctx, file);
+#endif
+	}
 	return file;
 }
 
@@ -7489,7 +7533,22 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
 		   ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
 	spin_unlock(&ctx->completion_lock);
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Busy poll if globally on and supporting sockets found
+ */
+static inline bool io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id = ctx->napi_id;
 
+	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+		napi_busy_loop(napi_id, NULL, NULL, true,
+			       BUSY_POLL_BUDGET);
+		return true;
+	}
+	return false;
+}
+#endif
 static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 {
 	unsigned int to_submit;
@@ -7518,7 +7577,10 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (io_napi_busy_loop(ctx))
+			++ret;
+#endif
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
@@ -7649,6 +7711,9 @@ struct io_wait_queue {
 	struct io_ring_ctx *ctx;
 	unsigned cq_tail;
 	unsigned nr_timeouts;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned busy_poll_to;
+#endif
 };
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -7709,6 +7774,29 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	return !*timeout ? -ETIME : 1;
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline bool io_busy_loop_timeout(unsigned long start_time,
+					unsigned long bp_usec)
+{
+	if (bp_usec) {
+		unsigned long end_time = start_time + bp_usec;
+		unsigned long now = busy_loop_current_time();
+
+		return time_after(now, end_time);
+	}
+	return true;
+}
+
+static bool io_busy_loop_end(void *p, unsigned long start_time)
+{
+	struct io_wait_queue *iowq = p;
+
+	return io_busy_loop_timeout(start_time, iowq->busy_poll_to) ||
+	       io_run_task_work_sig() ||
+	       io_should_wake(iowq);
+}
+#endif
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -7729,12 +7817,33 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		if (!io_run_task_work())
 			break;
 	} while (1);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	iowq.busy_poll_to = 0;
+#endif
 	if (uts) {
 		struct timespec64 ts;
 
 		if (get_timespec64(&ts, uts))
 			return -EFAULT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!(ctx->flags & IORING_SETUP_SQPOLL) &&
+		    (ctx->napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+			unsigned busy_poll_to =
+				READ_ONCE(sysctl_net_busy_poll);
+			struct timespec64 pollto =
+				ns_to_timespec64(1000*busy_poll_to);
+
+			if (timespec64_compare(&ts, &pollto) > 0) {
+				ts = timespec64_sub(ts, pollto);
+				iowq.busy_poll_to = busy_poll_to;
+			}
+			else {
+				iowq.busy_poll_to = timespec64_to_ns(&ts)/1000;
+				ts.tv_sec = 0;
+				ts.tv_nsec = 0;
+			}
+		}
+#endif
 		timeout = timespec64_to_jiffies(&ts);
 	}
 
@@ -7759,6 +7868,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 
 	trace_io_uring_cqring_wait(ctx, min_events);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (iowq.busy_poll_to)
+		napi_busy_loop(ctx->napi_id, io_busy_loop_end, &iowq, true,
+			       BUSY_POLL_BUDGET);
+#endif
 	do {
 		/* if we can't even flush overflow, don't wait for more */
 		if (!io_cqring_overflow_flush(ctx)) {

> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-15  8:37         ` napi_busy_poll Olivier Langlois
@ 2022-02-15 18:05           ` Olivier Langlois
  2022-02-16  3:12             ` napi_busy_poll Hao Xu
  2022-02-16 12:14             ` napi_busy_poll Hao Xu
  0 siblings, 2 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-15 18:05 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Tue, 2022-02-15 at 03:37 -0500, Olivier Langlois wrote:
> 
> That being said, I have not been able to make it work yet. For some
> unknown reason, no valid napi_id is extracted from the sockets added
> to the context, so the net_busy_poll function is never called.
> 
> I find that very strange since prior to using io_uring, my code was
> using
> epoll and the busy polling was working fine with my application
> sockets. Something is escaping my comprehension. I must be tired and
> this
> will become obvious...
> 
The napi_id values associated with my sockets appear to be in the range
0 < napi_id < MIN_NAPI_ID

from busy_loop.h:
/*		0 - Reserved to indicate value not set
 *     1..NR_CPUS - Reserved for sender_cpu
 *  NR_CPUS+1..~0 - Region available for NAPI IDs
 */
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))

I have found this:
https://lwn.net/Articles/619862/

hinting that busy_poll may be incompatible with RPS
(Documentation/networking/scaling.rst), which I may have discovered
*AFTER* my epoll -> io_uring transition (I don't recall exactly the
sequence of my learning process).

With my current knowledge, I see little reason why busy polling would
not be possible with RPS. Also, what exactly a NAPI device is remains
quite nebulous to me... Looking into the Intel igb driver code, it seems
like 1 NAPI device is created for each interrupt vector/Rx buffer of the
device.

Bottom line, it seems like I have fallen into a new rabbit hole. It may
take me a day or 2 to figure it all... you are welcome to enlighten me if
you know a thing or 2 about those topics... I am kinda lost right
now...


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-15 18:05           ` napi_busy_poll Olivier Langlois
@ 2022-02-16  3:12             ` Hao Xu
  2022-02-16 19:19               ` napi_busy_poll Olivier Langlois
  2022-02-16 12:14             ` napi_busy_poll Hao Xu
  1 sibling, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-16  3:12 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/16 2:05 AM, Olivier Langlois wrote:
> On Tue, 2022-02-15 at 03:37 -0500, Olivier Langlois wrote:
>>
>> That being said, I have not been able to make it work yet. For some
>> unknown reason, no valid napi_id is extracted from the sockets added
>> to the context, so the net_busy_poll function is never called.
>>
>> I find that very strange since prior to using io_uring, my code was
>> using
>> epoll and the busy polling was working fine with my application
>> sockets. Something is escaping my comprehension. I must be tired and
>> this
>> will become obvious...
>>
> The napi_id values associated with my sockets appear to be in the range
> 0 < napi_id < MIN_NAPI_ID
> 
> from busy_loop.h:
> /*		0 - Reserved to indicate value not set
>   *     1..NR_CPUS - Reserved for sender_cpu
>   *  NR_CPUS+1..~0 - Region available for NAPI IDs
>   */
> #define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
> 
> I have found this:
> https://lwn.net/Articles/619862/
> 
> hinting that busy_poll may be incompatible with RPS
> (Documentation/networking/scaling.rst), which I may have discovered
> *AFTER* my epoll -> io_uring transition (I don't recall exactly the
> sequence of my learning process).
> 
I read your code, I guess the thing is that sk->napi_id is set from
skb->napi_id, and the latter is set when the net device receives some
packets.
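
It is roughly this helper in include/net/busy_poll.h (paraphrased from
memory, config guards omitted): the socket inherits the NAPI id of the
queue that delivered its packets, so it stays 0 until then.

static inline void sk_mark_napi_id(struct sock *sk,
				   const struct sk_buff *skb)
{
	WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
}
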
> With my current knowledge, I see little reason why busy polling would
> not be possible with RPS. Also, what exactly a NAPI device is remains
> quite nebulous to me... Looking into the Intel igb driver code, it seems
> like 1 NAPI device is created for each interrupt vector/Rx buffer of the
> device.
AFAIK, yes, each Rx ring has its own NAPI.
> 
> Bottom line, it seems like I have fallen into a new rabbit hole. It may
> take me a day or 2 to figure it all... you are welcome to enlighten me if
> you know a thing or 2 about those topics... I am kinda lost right
> now...
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-15 18:05           ` napi_busy_poll Olivier Langlois
  2022-02-16  3:12             ` napi_busy_poll Hao Xu
@ 2022-02-16 12:14             ` Hao Xu
  2022-02-17 20:28               ` napi_busy_poll Olivier Langlois
                                 ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Hao Xu @ 2022-02-16 12:14 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/16 2:05 AM, Olivier Langlois wrote:
> On Tue, 2022-02-15 at 03:37 -0500, Olivier Langlois wrote:
>>
>> That being said, I have not been able to make it work yet. For some
>> unknown reason, no valid napi_id is extracted from the sockets added
>> to the context, so the net_busy_poll function is never called.
>>
>> I find that very strange since prior to using io_uring, my code was
>> using
>> epoll and the busy polling was working fine with my application
>> sockets. Something is escaping my comprehension. I must be tired and
>> this
>> will become obvious...
>>
> The napi_id values associated with my sockets appear to be in the range
> 0 < napi_id < MIN_NAPI_ID
> 
> from busy_loop.h:
> /*		0 - Reserved to indicate value not set
>   *     1..NR_CPUS - Reserved for sender_cpu
>   *  NR_CPUS+1..~0 - Region available for NAPI IDs
>   */
> #define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
> 
> I have found this:
> https://lwn.net/Articles/619862/
> 
> hinting that busy_poll may be incompatible with RPS
> (Documentation/networking/scaling.rst), which I may have discovered
> *AFTER* my epoll -> io_uring transition (I don't recall exactly the
> sequence of my learning process).
> 
> With my current knowledge, I see little reason why busy polling would
> not be possible with RPS. Also, what exactly a NAPI device is remains
> quite nebulous to me... Looking into the Intel igb driver code, it seems
> like 1 NAPI device is created for each interrupt vector/Rx buffer of the
> device.
> 
> Bottom line, it seems like I have fallen into a new rabbit hole. It may
> take me a day or 2 to figure it all... you are welcome to enlighten me if
> you know a thing or 2 about those topics... I am kinda lost right
> now...
> 
Hi Olivier,
I've written something to express my idea, it would be great if you can
try it.
It's totally untested and only does polling in sqthread, won't be hard
to expand it to cqring_wait. My original idea was to poll all the napi
devices, but it seems that may not be efficient, so for a request, just
do napi polling for one napi.
There is still one problem: when to delete the polled NAPIs.

Regards,
Hao

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 538f90bd0508..2e32d5fe0641 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
  #include <net/sock.h>
  #include <net/af_unix.h>
  #include <net/scm.h>
+#include <net/busy_poll.h>
  #include <linux/anon_inodes.h>
  #include <linux/sched/mm.h>
  #include <linux/uaccess.h>
@@ -443,6 +444,7 @@ struct io_ring_ctx {
                 spinlock_t                      rsrc_ref_lock;
         };

+       struct list_head                napi_list;
         /* Keep this last, we don't need it for the fast path */
         struct {
                 #if defined(CONFIG_UNIX)
@@ -1457,6 +1459,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
         INIT_WQ_LIST(&ctx->locked_free_list);
         INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
         INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+       INIT_LIST_HEAD(&ctx->napi_list);
         return ctx;
  err:
         kfree(ctx->dummy_ubuf);
@@ -5419,6 +5422,70 @@ IO_NETOP_FN(send);
  IO_NETOP_FN(recv);
  #endif /* CONFIG_NET */

+#ifdef CONFIG_NET_RX_BUSY_POLL
+struct napi_entry {
+       struct list_head        list;
+       unsigned int            napi_id;
+};
+
+static void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+       unsigned int napi_id;
+       struct socket *sock;
+       struct sock *sk;
+       struct napi_entry *ne;
+
+       if (!net_busy_loop_on())
+               return;
+
+       sock = sock_from_file(file);
+       if (!sock)
+               return;
+
+       sk = sock->sk;
+       if (!sk)
+               return;
+
+       napi_id = READ_ONCE(sk->sk_napi_id);
+       if (napi_id < MIN_NAPI_ID)
+               return;
+
+       list_for_each_entry(ne, &ctx->napi_list, list) {
+               if (ne->napi_id == napi_id)
+                       return;
+       }
+
+       ne = kmalloc(sizeof(*ne), GFP_KERNEL);
+       if (!ne)
+               return;
+
+       list_add_tail(&ne->list, &ctx->napi_list);
+}
+
+static void io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+       struct napi_entry *ne;
+
+       if (list_empty(&ctx->napi_list) || !net_busy_loop_on())
+               return;
+
+       list_for_each_entry(ne, &ctx->napi_list, list)
+               napi_busy_loop(ne->napi_id, NULL, NULL, false, BUSY_POLL_BUDGET);
+}
+#else
+
+static inline void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+       return;
+}
+
+static inline void io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+       return;
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
+
  struct io_poll_table {
         struct poll_table_struct pt;
         struct io_kiocb *req;
@@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
         struct io_ring_ctx *ctx = req->ctx;
         int ret;

+       io_add_napi(req->file, req->ctx);
         ret = io_poll_check_events(req);
         if (ret > 0)
                 return;
@@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
         struct io_ring_ctx *ctx = req->ctx;
         int ret;

+       io_add_napi(req->file, req->ctx);
         ret = io_poll_check_events(req);
         if (ret > 0)
                 return;
@@ -7544,6 +7613,9 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
                         wake_up(&ctx->sqo_sq_wait);
                 if (creds)
                         revert_creds(creds);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+               io_napi_busy_loop(ctx);
+#endif
         }

         return ret;


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-16  3:12             ` napi_busy_poll Hao Xu
@ 2022-02-16 19:19               ` Olivier Langlois
  0 siblings, 0 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-16 19:19 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 11:12 +0800, Hao Xu wrote:
> 
> > 
> I read your code, I guess the thing is that sk->napi_id is set from
> skb->napi_id, and the latter is set when the net device receives some
> packets.
> > With my current knowledge, I see little reason why busy polling
> > would not be possible with RPS. Also, what exactly a NAPI device
> > is remains quite nebulous to me... Looking into the Intel igb
> > driver code, it seems like 1 NAPI device is created for each
> > interrupt vector/Rx buffer of the device.
> AFAIK, yes, each Rx ring has its own NAPI.
> > 
> > Bottom line, it seems like I have fallen into a new rabbit hole.
> > It may take me a day or 2 to figure it all... you are welcome to
> > enlighten me if you know a thing or 2 about those topics... I am
> > kinda lost right now...
> > 
> 
My dive into the net/core code has been beneficial!

I have found out that the reason why I did not have a napi_id for my
sockets is that I have introduced a local SOCKS proxy into my setup.
By going through the loopback device, it de facto takes NAPI out of
the picture.

After fixing this issue, I started to test my code.

The modified io_cqring_wait() code does not work. With a pending recv()
request, the moment napi_busy_loop() is called, the recv() request
fails with an EFAULT error.

I suspect this might be because io_busy_loop_end() is doing something
that is not allowed while inside napi_busy_loop().

The simpler code change inside __io_sq_thread() might work but I still
have to validate.

I'll update later today or tomorrow with the latest results and
discoveries!

Greetings,


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-16 12:14             ` napi_busy_poll Hao Xu
@ 2022-02-17 20:28               ` Olivier Langlois
  2022-02-18  8:06                 ` napi_busy_poll Hao Xu
  2022-02-17 23:18               ` napi_busy_poll Olivier Langlois
  2022-02-18  5:05               ` napi_busy_poll Olivier Langlois
  2 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-17 20:28 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
> > 
> Hi Olivier,
> I've written something to express my idea, it would be great if you can
> try it.
> It's totally untested and only does polling in sqthread, won't be
> hard
> to expand it to cqring_wait. My original idea was to poll all the napi
> devices, but it seems that may not be efficient, so for a request, just
> do napi polling for one napi.
> There is still one problem: when to delete the polled NAPIs.
> 
> Regards,
> Hao
> 
Hi Hao,

I am going to give your patch a try hopefully later today.

On my side, I have made a small change to my code and it started to
work.

I did remove the call to io_run_task_work() from io_busy_loop_end().
While inside napi_busy_loop(), preemption is disabled, local_bh too and
the function acquires a napi_poll lock. I haven't been able to put my
finger on exactly why but it appears that in that state, the net
subsystem is not reentrant. Therefore, I did replace the call to
io_run_task_work() with a call to signal_pending() just to know if
there are pending task works so that they can be handled outside the
napi_busy_loop.

I am not sure how a socket is assigned to a napi device but I got a few
with my 50 sockets program (8 to be exact):

[2022-02-17 09:59:10] INFO WSBASE/client_established 706
LWS_CALLBACK_CLIENT_ESTABLISHED client 50(17), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 3(63), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 49(16), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 15(51), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 31(35), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 14(52), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 11(55), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 16(50), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 40(26), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 39(27), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 8(58), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 20(46), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 7(59), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 6(60), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 22(44), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 13(53), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 38(28), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 21(45), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 4(62), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 35(31), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 25(41), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 18(48), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 12(54), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 5(61), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 23(43), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 46(20), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 43(23), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 9(57), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 29(37), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 28(38), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 33(33), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 27(39), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 32(34), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 1(65), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 26(40), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 37(29), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 30(36), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 47(19), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 10(56), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 44(22), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 17(49), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 41(25), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 48(18), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 2(64), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 45(21), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 24(42), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 34(32), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 42(24), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 36(30), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 19(47), napi_id: 16
[2022-02-17 09:59:11] INFO WSBASE/client_established 696
LWS_CALLBACK_CLIENT_ESTABLISHED client 38(66), napi_id: 11

The first number is the thread id (706 and 696). I have 2 threads, 1
io_uring context per thread.

Next, you have the client id, and the number in parentheses is the
socket fd.

Based on the result, it appears that having a single napi_id per
context won't do it... I wish I could pick the brains of the people
who have done things the way they are done in the epoll
implementation.

I guess that I could create 8 distinct io_uring contexts with
IORING_SETUP_ATTACH_WQ but it seems like a big burden placed on the
shoulders of users when a simple linked list with ref-counted elements
could do it...

I think that if I merge your patch with what I have done so far, we
could get something really cool!

Concerning the remaining problem of when to remove the napi_id, I
would say that a good place to do it would be when a request is
completed and discarded, provided a refcount is added to your
napi_entry struct.

The only thing that I hate about this idea is that in my scenario, the
sockets are going to be pretty much the same for the whole io_uring
context existence. Therefore, the whole refcounting overhead is
useless and unneeded.
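
To illustrate, something like this is what I have in mind (an untested
sketch of mine on top of your napi_entry idea; locking of
ctx->napi_list is omitted for brevity):

struct napi_entry {
	struct list_head	list;
	unsigned int		napi_id;
	refcount_t		refs;
};

/* Take a reference when a request starts using a NAPI id. */
static struct napi_entry *io_get_napi(struct io_ring_ctx *ctx,
				      unsigned int napi_id)
{
	struct napi_entry *ne;

	list_for_each_entry(ne, &ctx->napi_list, list) {
		if (ne->napi_id == napi_id) {
			refcount_inc(&ne->refs);
			return ne;
		}
	}

	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
	if (!ne)
		return NULL;
	ne->napi_id = napi_id;
	refcount_set(&ne->refs, 1);
	list_add_tail(&ne->list, &ctx->napi_list);
	return ne;
}

/* Drop the reference when the request completes; free on last put. */
static void io_put_napi(struct io_ring_ctx *ctx, struct napi_entry *ne)
{
	if (refcount_dec_and_test(&ne->refs)) {
		list_del(&ne->list);
		kfree(ne);
	}
}

That way an entry disappears with the last request that uses it, at the
cost of exactly the refcounting overhead I was complaining about.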

Here is the latest version of my effort:
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..ea2a3661c16f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include <net/sock.h>
 #include <net/af_unix.h>
 #include <net/scm.h>
+#include <net/busy_poll.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -395,6 +396,10 @@ struct io_ring_ctx {
 	struct list_head	sqd_list;
 
 	unsigned long		check_cq_overflow;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	/* used to track busy poll napi_id */
+	unsigned int napi_id;
+#endif
 
 	struct {
 		unsigned		cached_cq_tail;
@@ -6976,7 +6981,40 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
 	io_req_set_rsrc_node(req, ctx);
 	return file;
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Set epoll busy poll NAPI ID from sk.
+ */
+static inline void io_set_busy_poll_napi_id(struct io_ring_ctx *ctx, struct file *file)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+
+	if (!net_busy_loop_on())
+		return;
 
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+
+	/* Non-NAPI IDs can be rejected
+	 *	or
+	 * Nothing to do if we already have this ID
+	 */
+	if (napi_id < MIN_NAPI_ID || napi_id == ctx->napi_id)
+		return;
+
+	/* record NAPI ID for use in next busy poll */
+	ctx->napi_id = napi_id;
+}
+#endif
 static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 				       struct io_kiocb *req, int fd)
 {
@@ -6985,8 +7023,14 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 	trace_io_uring_file_get(ctx, fd);
 
 	/* we don't allow fixed io_uring files */
-	if (file && unlikely(file->f_op == &io_uring_fops))
-		io_req_track_inflight(req);
+	if (file) {
+		if (unlikely(file->f_op == &io_uring_fops))
+			io_req_track_inflight(req);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		else
+			io_set_busy_poll_napi_id(ctx, file);
+#endif
+	}
 	return file;
 }
 
@@ -7489,7 +7533,22 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
 		   ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
 	spin_unlock(&ctx->completion_lock);
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Busy poll if globally on and supporting sockets found
+ */
+static inline bool io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id = ctx->napi_id;
 
+	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+		napi_busy_loop(napi_id, NULL, NULL, true,
+			       BUSY_POLL_BUDGET);
+		return true;
+	}
+	return false;
+}
+#endif
 static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 {
 	unsigned int to_submit;
@@ -7518,7 +7577,10 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (io_napi_busy_loop(ctx))
+			++ret;
+#endif
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
@@ -7649,6 +7711,9 @@ struct io_wait_queue {
 	struct io_ring_ctx *ctx;
 	unsigned cq_tail;
 	unsigned nr_timeouts;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned busy_poll_to;
+#endif
 };
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -7709,6 +7774,29 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	return !*timeout ? -ETIME : 1;
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline bool io_busy_loop_timeout(unsigned long start_time,
+					unsigned long bp_usec)
+{
+	if (bp_usec) {
+		unsigned long end_time = start_time + bp_usec;
+		unsigned long now = busy_loop_current_time();
+
+		return time_after(now, end_time);
+	}
+	return true;
+}
+
+static bool io_busy_loop_end(void *p, unsigned long start_time)
+{
+	struct io_wait_queue *iowq = p;
+
+	return io_busy_loop_timeout(start_time, iowq->busy_poll_to) ||
+	       signal_pending(current) ||
+	       io_should_wake(iowq);
+}
+#endif
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -7729,12 +7817,33 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		if (!io_run_task_work())
 			break;
 	} while (1);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	iowq.busy_poll_to = 0;
+#endif
 	if (uts) {
 		struct timespec64 ts;
 
 		if (get_timespec64(&ts, uts))
 			return -EFAULT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!(ctx->flags & IORING_SETUP_SQPOLL) &&
+		    (ctx->napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+			unsigned busy_poll_to =
+				READ_ONCE(sysctl_net_busy_poll);
+			struct timespec64 pollto =
+				ns_to_timespec64(1000*busy_poll_to);
+
+			if (timespec64_compare(&ts, &pollto) > 0) {
+				ts = timespec64_sub(ts, pollto);
+				iowq.busy_poll_to = busy_poll_to;
+			}
+			else {
+				iowq.busy_poll_to = timespec64_to_ns(&ts)/1000;
+				ts.tv_sec = 0;
+				ts.tv_nsec = 0;
+			}
+		}
+#endif
 		timeout = timespec64_to_jiffies(&ts);
 	}
 
@@ -7759,6 +7868,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 
 	trace_io_uring_cqring_wait(ctx, min_events);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (iowq.busy_poll_to)
+		napi_busy_loop(ctx->napi_id, io_busy_loop_end, &iowq, true,
+			       BUSY_POLL_BUDGET);
+#endif
 	do {
 		/* if we can't even flush overflow, don't wait for more */
 		if (!io_cqring_overflow_flush(ctx)) {


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-16 12:14             ` napi_busy_poll Hao Xu
  2022-02-17 20:28               ` napi_busy_poll Olivier Langlois
@ 2022-02-17 23:18               ` Olivier Langlois
  2022-02-17 23:25                 ` napi_busy_poll Jens Axboe
  2022-02-18  7:21                 ` napi_busy_poll Hao Xu
  2022-02-18  5:05               ` napi_busy_poll Olivier Langlois
  2 siblings, 2 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-17 23:18 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
> 
> > 
> Hi Olivier,
> I've written something to express my idea, it would be great if you can
> try it.
> It's totally untested and only does polling in sqthread, won't be
> hard
> to expand it to cqring_wait. My original idea was to poll all the napi
> devices, but it seems that may not be efficient, so for a request, just
> do napi polling for one napi.
> There is still one problem: when to delete the polled NAPIs.
> 
I think that I have found an elegant solution to the remaining problem!

Are you ok if I send out a patch to Jens that contains your code, with
your name listed as a co-author?
> 
> 
> +       ne = kmalloc(sizeof(*ne), GFP_KERNEL);
> +       if (!ne)
> +               return;
> +
> +       list_add_tail(&ne->list, &ctx->napi_list);
> +}
ne->napi_id is not initialized before returning from the function!



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-17 23:18               ` napi_busy_poll Olivier Langlois
@ 2022-02-17 23:25                 ` Jens Axboe
  2022-02-18  7:21                 ` napi_busy_poll Hao Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2022-02-17 23:25 UTC (permalink / raw)
  To: Olivier Langlois, Hao Xu, io-uring

On 2/17/22 4:18 PM, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>>
>>>
>> Hi Olivier,
>> I've written something to express my idea, it would be great if you can
>> try it.
>> It's totally untested and only does polling in sqthread, won't be
>> hard
>> to expand it to cqring_wait. My original idea was to poll all the napi
>> devices, but it seems that may not be efficient, so for a request, just
>> do napi polling for one napi.
>> There is still one problem: when to delete the polled NAPIs.
>>
> I think that I have found an elegant solution to the remaining problem!
> 
> Are you ok if I send out a patch to Jens that contains your code, with
> your name listed as a co-author?

FWIW, we tend to use the

Co-developed-by:

tag for this kind of situation. Just throwing it out there :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-16 12:14             ` napi_busy_poll Hao Xu
  2022-02-17 20:28               ` napi_busy_poll Olivier Langlois
  2022-02-17 23:18               ` napi_busy_poll Olivier Langlois
@ 2022-02-18  5:05               ` Olivier Langlois
  2022-02-18  7:41                 ` napi_busy_poll Hao Xu
  2 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-18  5:05 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
> 
> @@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
>          struct io_ring_ctx *ctx = req->ctx;
>          int ret;
> 
> +       io_add_napi(req->file, req->ctx);
>          ret = io_poll_check_events(req);
>          if (ret > 0)
>                  return;
> @@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
>          struct io_ring_ctx *ctx = req->ctx;
>          int ret;
> 
> +       io_add_napi(req->file, req->ctx);
>          ret = io_poll_check_events(req);
>          if (ret > 0)
>                  return;
> 
I have a doubt about these call sites for adding the napi_id into the
list. AFAIK, these are the functions called when the desired events are
ready; therefore, it is too late to poll the device.

OTOH, doing it from io_file_get_normal() was perhaps a poor choice too
because it is premature.

Possibly the best location might be __io_arm_poll_handler()...
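
That is, something as trivial as this placement (untested, reusing your
io_add_napi() helper from the arming path):

/* Hypothetical: called once we know the request is going to wait on
 * the file, e.g. from __io_arm_poll_handler().
 */
static inline void io_arm_poll_add_napi(struct io_kiocb *req)
{
	io_add_napi(req->file, req->ctx);
}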


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-17 23:18               ` napi_busy_poll Olivier Langlois
  2022-02-17 23:25                 ` napi_busy_poll Jens Axboe
@ 2022-02-18  7:21                 ` Hao Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Hao Xu @ 2022-02-18  7:21 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring


On 2/18/22 07:18, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>> Hi Olivier,
>> I've written something to express my idea, it would be great if you can
>> try it.
>> It's totally untested and only does polling in sqthread, won't be
>> hard
>> to expand it to cqring_wait. My original idea was to poll all the napi
>> devices, but it seems that may not be efficient, so for a request, just
>> do napi polling for one napi.
>> There is still one problem: when to delete the polled NAPIs.
>>
> I think that I have found an elegant solution to the remaining problem!
>
> Are you ok if I send out a patch to Jens that contains your code, with
> your name listed as a co-author?
No problem, go ahead.
>>
>> +       ne = kmalloc(sizeof(*ne), GFP_KERNEL);
>> +       if (!ne)
>> +               return;
>> +
>> +       list_add_tail(&ne->list, &ctx->napi_list);
>> +}
> ne->napi_id is not initialized before returning from the function!
Yes, feel free to enhance this code and fix bugs.
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-18  5:05               ` napi_busy_poll Olivier Langlois
@ 2022-02-18  7:41                 ` Hao Xu
  2022-02-19  7:02                   ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-18  7:41 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring


On 2/18/22 13:05, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>> @@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
>>           struct io_ring_ctx *ctx = req->ctx;
>>           int ret;
>>
>> +       io_add_napi(req->file, req->ctx);
>>           ret = io_poll_check_events(req);
>>           if (ret > 0)
>>                   return;
>> @@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
>>           struct io_ring_ctx *ctx = req->ctx;
>>           int ret;
>>
>> +       io_add_napi(req->file, req->ctx);
>>           ret = io_poll_check_events(req);
>>           if (ret > 0)
>>                   return;
>>
> I have a doubt about these call sites for adding the napi_id into the
> list. AFAIK, these are the functions called when the desired events are
> ready; therefore, it is too late to poll the device.
[1]
>
> OTOH, doing it from io_file_get_normal() was perhaps a poor choice too
> because it is premature.
>
> Possibly the best location might be __io_arm_poll_handler()...
Hi Olivier,

Have you tried just issuing one recv/pollin request and observing the
napi_id? From my understanding of the network stack, the napi_id of a
socket won't be valid until it gets some packets, because before that
moment busy_poll doesn't know which hw queue to poll.

In other words, the idea of NAPI polling is: the packets of a socket
can come from any hw queue of a net adapter, but we just poll the
queue which most recently received some data. So to get this piece of
info, there must be some data coming to one queue before doing the
busy_poll. Correct me if I'm wrong, since I'm also a newbie at network
stuff...

I was considering polling all the rx rings, but it seemed not to be
efficient based on some tests by my colleague.

For question [1] you mentioned, I think it's ok, since:

  - not all the data may be ready at that moment
  - the polling is not just for that request; there may be more data
    coming from the rx ring, since we usually use polling mode under
    high workload pressure.

See the implementation of epoll busy_poll; it does the same thing.
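
For reference, the epoll side is roughly this in fs/eventpoll.c
(paraphrased from memory; details vary across kernel versions):

/* One napi_id per eventpoll context, polled before going to sleep. */
static void ep_busy_loop(struct eventpoll *ep, int nonblock)
{
	unsigned int napi_id = READ_ONCE(ep->napi_id);

	if (napi_id >= MIN_NAPI_ID && net_busy_loop_on())
		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end,
			       ep, false, BUSY_POLL_BUDGET);
}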


Regards,
Hao


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-17 20:28               ` napi_busy_poll Olivier Langlois
@ 2022-02-18  8:06                 ` Hao Xu
  2022-02-19  7:14                   ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-18  8:06 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring


On 2/18/22 04:28, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>> Hi Olivier,
>> I've written something to express my idea, it would be great if you can
>> try it.
>> It's totally untested and only does polling in sqthread, won't be
>> hard
>> to expand it to cqring_wait. My original idea was to poll all the napi
>> devices, but it seems that may not be efficient, so for a request, just
>> do napi polling for one napi.
>> There is still one problem: when to delete the polled NAPIs.
>>
>> Regards,
>> Hao
>>
> Hi Hao,
>
> I am going to give your patch a try hopefully later today.
>
> On my side, I have made a small change to my code and it started to
> work.
>
> I did remove the call to io_run_task_work() from io_busy_loop_end().
> While inside napi_busy_loop(), preemption is disabled, local_bh too and
> the function acquires a napi_poll lock. I haven't been able to put my
> finger on exactly why but it appears that in that state, the net
> subsystem is not reentrant. Therefore, I did replace the call to
> io_run_task_work() with a call to signal_pending() just to know if
> there are pending task works so that they can be handled outside the
> napi_busy_loop.
>
> I am not sure how a socket is assigned to a napi device but I got a few
> with my 50 sockets program (8 to be exact):
>
> [2022-02-17 09:59:10] INFO WSBASE/client_established 706
> LWS_CALLBACK_CLIENT_ESTABLISHED client 50(17), napi_id: 10
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 3(63), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 49(16), napi_id: 15
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 15(51), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 31(35), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 14(52), napi_id: 14
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 11(55), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 16(50), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 40(26), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 39(27), napi_id: 14
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 8(58), napi_id: 10
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 20(46), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 7(59), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 6(60), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 22(44), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 13(53), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 38(28), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 21(45), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 4(62), napi_id: 15
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 35(31), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 25(41), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 18(48), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 12(54), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 5(61), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 23(43), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 46(20), napi_id: 11
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 43(23), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 9(57), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 29(37), napi_id: 14
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 28(38), napi_id: 15
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 33(33), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 27(39), napi_id: 10
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 32(34), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 1(65), napi_id: 10
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 26(40), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 37(29), napi_id: 12
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 30(36), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 47(19), napi_id: 14
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 10(56), napi_id: 11
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 44(22), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 17(49), napi_id: 11
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 41(25), napi_id: 13
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 48(18), napi_id: 10
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 2(64), napi_id: 15
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 45(21), napi_id: 14
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 24(42), napi_id: 9
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 34(32), napi_id: 11
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 42(24), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 36(30), napi_id: 16
> [2022-02-17 09:59:10] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 19(47), napi_id: 16
> [2022-02-17 09:59:11] INFO WSBASE/client_established 696
> LWS_CALLBACK_CLIENT_ESTABLISHED client 38(66), napi_id: 11
>
> The first number is the thread id (706 or 696). I have 2 threads, with
> 1 io_uring context per thread.
>
> Next comes the client id, and the number in parentheses is the socket
> fd.
>
> Based on these results, it appears that having a single napi_id per
> context won't do it... I wish I could pick the brains of the people
> who implemented this in epoll about why they did it the way they did.
>
> I guess that I could create 8 distinct io_uring contexts with
> IORING_SETUP_ATTACH_WQ, but that seems like a big burden to place on
> the shoulders of users when a simple linked list with ref-counted
> elements could do it...
>
> I think that if I merge your patch with what I have done so far, we
> could get something really cool!
>
> Concerning the remaining problem of when to remove the napi_id, I
> would say that a good place to do it would be when a request is
> completed and discarded, if a refcount were added to your napi_entry
> struct.
>
> The only thing that I hate about this idea is that in my scenario, the
> sockets are going to be pretty much the same for the whole io_uring
> context's existence. Therefore, the whole refcounting overhead is
> useless and unneeded.

I remember that all completions now happen in the original task (this
should be confirmed again), so it should be OK to use a simple
'unsigned int count' to track the number of users of a napi entry and
delete the entry once the count drops to 0. For your scenario, with
only one napi entry per io_uring context, this won't be much overhead
either.

The only thing is that we may need to optimize the napi lookup process,
but I'm not sure that is necessary.

Regards,
Hao
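
As a rough illustration of the counted-entry idea (a minimal sketch
only; the struct layout and the put helper below are assumptions for
the discussion, not code from an actual patch):

struct napi_entry {
	struct list_head	list;
	unsigned int		napi_id;
	unsigned int		count;	/* requests currently using this id */
};

/* drop one reference; called when a request that recorded this napi_id
 * completes, which always happens in the original task */
static void io_put_napi_entry(struct napi_entry *ne)
{
	if (--ne->count == 0) {
		list_del(&ne->list);
		kfree(ne);
	}
}

Under that single-task completion assumption, a plain counter is enough
and no atomic refcount is needed.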

>
> Here is the latest version of my effort:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 77b9c7e4793b..ea2a3661c16f 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -63,6 +63,7 @@
>   #include <net/sock.h>
>   #include <net/af_unix.h>
>   #include <net/scm.h>
> +#include <net/busy_poll.h>
>   #include <linux/anon_inodes.h>
>   #include <linux/sched/mm.h>
>   #include <linux/uaccess.h>
> @@ -395,6 +396,10 @@ struct io_ring_ctx {
>   	struct list_head	sqd_list;
>   
>   	unsigned long		check_cq_overflow;
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +	/* used to track busy poll napi_id */
> +	unsigned int napi_id;
> +#endif
>   
>   	struct {
>   		unsigned		cached_cq_tail;
> @@ -6976,7 +6981,40 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
>   	io_req_set_rsrc_node(req, ctx);
>   	return file;
>   }
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +/*
> + * Set epoll busy poll NAPI ID from sk.
> + */
> +static inline void io_set_busy_poll_napi_id(struct io_ring_ctx *ctx,
> +					     struct file *file)
> +{
> +	unsigned int napi_id;
> +	struct socket *sock;
> +	struct sock *sk;
> +
> +	if (!net_busy_loop_on())
> +		return;
>   
> +	sock = sock_from_file(file);
> +	if (!sock)
> +		return;
> +
> +	sk = sock->sk;
> +	if (!sk)
> +		return;
> +
> +	napi_id = READ_ONCE(sk->sk_napi_id);
> +
> +	/* Non-NAPI IDs can be rejected
> +	 *	or
> +	 * Nothing to do if we already have this ID
> +	 */
> +	if (napi_id < MIN_NAPI_ID || napi_id == ctx->napi_id)
> +		return;
> +
> +	/* record NAPI ID for use in next busy poll */
> +	ctx->napi_id = napi_id;
> +}
> +#endif
>   static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
>   				       struct io_kiocb *req, int fd)
>   {
> @@ -6985,8 +7023,14 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
>   	trace_io_uring_file_get(ctx, fd);
>   
>   	/* we don't allow fixed io_uring files */
> -	if (file && unlikely(file->f_op == &io_uring_fops))
> -		io_req_track_inflight(req);
> +	if (file) {
> +		if (unlikely(file->f_op == &io_uring_fops))
> +			io_req_track_inflight(req);
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +		else
> +			io_set_busy_poll_napi_id(ctx, file);
> +#endif
> +	}
>   	return file;
>   }
>   
> @@ -7489,7 +7533,22 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
>   		   ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
>   	spin_unlock(&ctx->completion_lock);
>   }
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +/*
> + * Busy poll if globally on and supporting sockets found
> + */
> +static inline bool io_napi_busy_loop(struct io_ring_ctx *ctx)
> +{
> +	unsigned int napi_id = ctx->napi_id;
>   
> +	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
> +		napi_busy_loop(napi_id, NULL, NULL, true,
> +			       BUSY_POLL_BUDGET);
> +		return true;
> +	}
> +	return false;
> +}
> +#endif
>   static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
>   {
>   	unsigned int to_submit;
> @@ -7518,7 +7577,10 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
>   		    !(ctx->flags & IORING_SETUP_R_DISABLED))
>   			ret = io_submit_sqes(ctx, to_submit);
>   		mutex_unlock(&ctx->uring_lock);
> -
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +		if (io_napi_busy_loop(ctx))
> +			++ret;
> +#endif
>   		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
>   			wake_up(&ctx->sqo_sq_wait);
>   		if (creds)
> @@ -7649,6 +7711,9 @@ struct io_wait_queue {
>   	struct io_ring_ctx *ctx;
>   	unsigned cq_tail;
>   	unsigned nr_timeouts;
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +	unsigned busy_poll_to;
> +#endif
>   };
>   
>   static inline bool io_should_wake(struct io_wait_queue *iowq)
> @@ -7709,6 +7774,29 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>   	return !*timeout ? -ETIME : 1;
>   }
>   
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +static inline bool io_busy_loop_timeout(unsigned long start_time,
> +					unsigned long bp_usec)
> +{
> +	if (bp_usec) {
> +		unsigned long end_time = start_time + bp_usec;
> +		unsigned long now = busy_loop_current_time();
> +
> +		return time_after(now, end_time);
> +	}
> +	return true;
> +}
> +
> +static bool io_busy_loop_end(void *p, unsigned long start_time)
> +{
> +	struct io_wait_queue *iowq = p;
> +
> +	return io_busy_loop_timeout(start_time, iowq->busy_poll_to) ||
> +	       signal_pending(current) ||
> +	       io_should_wake(iowq);
> +}
> +#endif
> +
>   /*
>    * Wait until events become available, if we don't already have some.
> The
>    * application must reap them itself, as they reside on the shared cq
> ring.
> @@ -7729,12 +7817,33 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>   		if (!io_run_task_work())
>   			break;
>   	} while (1);
> -
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +	iowq.busy_poll_to = 0;
> +#endif
>   	if (uts) {
>   		struct timespec64 ts;
>   
>   		if (get_timespec64(&ts, uts))
>   			return -EFAULT;
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +		if (!(ctx->flags & IORING_SETUP_SQPOLL) &&
> +		    (ctx->napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
> +			unsigned busy_poll_to =
> +				READ_ONCE(sysctl_net_busy_poll);
> +			struct timespec64 pollto =
> +				ns_to_timespec64(1000*busy_poll_to);
> +
> +			if (timespec64_compare(&ts, &pollto) > 0) {
> +				ts = timespec64_sub(ts, pollto);
> +				iowq.busy_poll_to = busy_poll_to;
> +			}
> +			else {
> +				iowq.busy_poll_to = timespec64_to_ns(&ts)/1000;
> +				ts.tv_sec = 0;
> +				ts.tv_nsec = 0;
> +			}
> +		}
> +#endif
>   		timeout = timespec64_to_jiffies(&ts);
>   	}
>   
> @@ -7759,6 +7868,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>   	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
>   
>   	trace_io_uring_cqring_wait(ctx, min_events);
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +	if (iowq.busy_poll_to)
> +		napi_busy_loop(ctx->napi_id, io_busy_loop_end, &iowq,
> true,
> +			       BUSY_POLL_BUDGET);
> +#endif
>   	do {
>   		/* if we can't even flush overflow, don't wait for more */
>   		if (!io_cqring_overflow_flush(ctx)) {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-18  7:41                 ` napi_busy_poll Hao Xu
@ 2022-02-19  7:02                   ` Olivier Langlois
  2022-02-21  5:03                     ` napi_busy_poll Hao Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-19  7:02 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Fri, 2022-02-18 at 15:41 +0800, Hao Xu wrote:
> 
> Hi Olivier,
>
> Have you tried just issuing one recv/pollin request and observing the
> napi_id?

Hi Hao, not precisely, but you are 100% right about where the
association is done. It is when a packet is received that the
association is made. This happens in a few places, but the most likely
place where it can happen with my NIC (Intel igb) is inside
napi_gro_receive().

I do verify the socket napi_id once a WebSocket session is established.
At that point, a lot of packets have already been exchanged back and
forth:

TCP handshake
TLS handshake
HTTP request requesting a WS upgrade

At that point, the napi_id has been assigned.
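
For reference, userspace can confirm the assignment at that point with
getsockopt(SO_INCOMING_NAPI_ID); a minimal sketch:

#include <stdio.h>
#include <sys/socket.h>

static void print_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	/* napi_id stays 0 until the socket has received packets on a
	 * NAPI-capable device */
	if (!getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
		printf("fd %d napi_id: %u\n", fd, napi_id);
}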

My problem was only that my socket packets were routed over the
loopback interface, which has no NAPI device associated with it.

I removed the local SOCKS proxy from my setup and NAPI ids started to
appear as expected.

>  From my understanding of the network stack, the napi_id of a socket
> won't be valid until it gets some packets, because before that moment
> busy_poll doesn't know which hw queue to poll. In other words, the
> idea of NAPI polling is: the packets of a socket can come from any hw
> queue of a net adapter, but we just poll the queue that most recently
> received some data. So to get this piece of info, there must be some
> data coming to one queue before doing the busy_poll. Correct me if I'm
> wrong, since I'm also a newbie at network stuff...

I am now getting what you mean here. So there are 2 possible
approaches. You either:

1. add the napi id when you are sure that it is available, after it has
been set in the sock layer; but then you are not sure whether it will
be needed again by future requests, as it is too late to be of any use
for the current request (unless it is a MULTISHOT poll). The add is
performed in io_poll_task_func() and io_apoll_task_func().

2. add the napi id when the request poll is armed, where this knowledge
could be leveraged to handle the current request, knowing that we might
fail to get the id if it is the initial recv request. The add would be
performed in __io_arm_poll_handler().
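
Either way, the add helper itself could look roughly like this (a
sketch reusing the hypothetical napi_entry and ctx->napi_list from
earlier in the thread; io_add_napi() is a made-up name):

static void io_add_napi(struct io_ring_ctx *ctx, struct file *file)
{
	struct socket *sock = sock_from_file(file);
	struct napi_entry *ne;
	unsigned int napi_id;

	if (!sock || !sock->sk)
		return;
	napi_id = READ_ONCE(sock->sk->sk_napi_id);
	if (napi_id < MIN_NAPI_ID)
		return;
	/* bump the count if we already track this id, else add it */
	list_for_each_entry(ne, &ctx->napi_list, list) {
		if (ne->napi_id == napi_id) {
			ne->count++;
			return;
		}
	}
	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
	if (!ne)
		return;
	ne->napi_id = napi_id;
	ne->count = 1;
	list_add_tail(&ne->list, &ctx->napi_list);
}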

TBH, I am not sure if there are arguments in favor of one approach over
the other... Maybe option #1 is the only one to make napi busy poll
work correctly with MULTISHOT requests...

I'll let you think about this point... Your first choice might be the
right one...

The other thing to consider when choosing the call site is locking...
when done from __io_arm_poll_handler(), uring_lock is acquired...

I am not sure that this is always the case with
io_poll_task_func/io_apoll_task_func...

I'll post v1 of the patch. My testing shows that it works fine. The
race condition is not an issue when the busy poll is performed by the
sqpoll thread, because the list modifications are exclusively performed
by that thread too.

But I think that there is a possible race condition where the napi_list
could be used from io_cqring_wait() while another thread modifies the
list. This does NOT happen in my testing scenario, but it is definitely
something that could happen somewhere in the real world...

> 
> 
> I was considering to poll all the rx rings, but it seemed to be not
> 
> efficient from some tests by my colleague.

This is definitely the simplest implementation, but I did not go as far
as testing it. There are too many unknown variables for it to be
viable, IMHO. I am not too sure how many napi devices there can be in a
typical server. My test machine has 2 NICs and one of them is simply
unconnected. If we were to loop through all the devices, we would
wastefully poll at least half of the devices on the system. That does
not sound like a very good approach.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-18  8:06                 ` napi_busy_poll Hao Xu
@ 2022-02-19  7:14                   ` Olivier Langlois
  2022-02-21  4:52                     ` napi_busy_poll Hao Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-19  7:14 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Fri, 2022-02-18 at 16:06 +0800, Hao Xu wrote:
> 
> > 
> > Concerning the remaining problem of when to remove the napi_id,
> > I would say that a good place to do it would be when a request is
> > completed and discarded, if a refcount were added to your napi_entry
> > struct.
> > 
> > The only thing that I hate about this idea is that in my scenario,
> > the sockets are going to be pretty much the same for the whole
> > io_uring context's existence. Therefore, the whole refcounting
> > overhead is useless and unneeded.
> 
> I remember that all completions now happen in the original task (this
> should be confirmed again), so it should be OK to use a simple
> 'unsigned int count' to track the number of users of a napi entry and
> delete the entry once the count drops to 0. For your scenario, with
> only one napi entry per io_uring context, this won't be much overhead
> either.
>
> The only thing is that we may need to optimize the napi lookup
> process, but I'm not sure that is necessary.
> 
Hi Hao,

You can forget about the refcount idea. What I hated about it was that
it imposed a cost on all io_uring users, including the vast majority of
them that won't be using napi_busy_poll...

One thing that I know Pavel hates is when hot paths are littered with a
bunch of new code. I have been very careful to avoid that in my design.

I think that I have found a much better approach to tackle the problem
of when to remove the napi_ids... I'll stop teasing about it... The
code is ready, tested... All I need to do is prepare the patch and send
it to the list for review.

net/core/dev.c uses a hash to store the napi structs. This could
probably be replicated easily in io_uring, but for now, imho, this is
only a polishing detail that can be revisited once the proof of concept
has been accepted.
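
If the lookup ever needs optimizing, a hash keyed on the napi_id,
similar to the napi_hash in net/core/dev.c, could look roughly like
this (an illustrative sketch; it assumes a napi_entry carrying an
hlist_node member, which is not in the current patch):

#include <linux/hashtable.h>

#define IO_NAPI_HASH_BITS	4	/* 16 buckets */
static DEFINE_HASHTABLE(io_napi_ht, IO_NAPI_HASH_BITS);

static struct napi_entry *io_napi_find(unsigned int napi_id)
{
	struct napi_entry *ne;

	hash_for_each_possible(io_napi_ht, ne, node, napi_id)
		if (ne->napi_id == napi_id)
			return ne;
	return NULL;
}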

So eager to share the patch... This is the next thing that I do before
going to bed once I am done reading my emails...


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-19  7:14                   ` napi_busy_poll Olivier Langlois
@ 2022-02-21  4:52                     ` Hao Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Hao Xu @ 2022-02-21  4:52 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/19 at 3:14 PM, Olivier Langlois wrote:
> On Fri, 2022-02-18 at 16:06 +0800, Hao Xu wrote:
>>
>>>
>>> Concerning the remaining problem of when to remove the napi_id,
>>> I would say that a good place to do it would be when a request is
>>> completed and discarded, if a refcount were added to your
>>> napi_entry struct.
>>>
>>> The only thing that I hate about this idea is that in my scenario,
>>> the sockets are going to be pretty much the same for the whole
>>> io_uring context's existence. Therefore, the whole refcounting
>>> overhead is useless and unneeded.
>>
>> I remember that all completions now happen in the original task
>> (this should be confirmed again), so it should be OK to use a simple
>> 'unsigned int count' to track the number of users of a napi entry
>> and delete the entry once the count drops to 0. For your scenario,
>> with only one napi entry per io_uring context, this won't be much
>> overhead either.
>>
>> The only thing is that we may need to optimize the napi lookup
>> process, but I'm not sure that is necessary.
>>
> Hi Hao,
> 
> You can forget about the refcount idea. What I hated about it was
> that it imposed a cost on all io_uring users, including the vast
> majority of them that won't be using napi_busy_poll...

We can do the deletion at the end of each op, which we should do
anyway, e.g. in recv and in poll_task_func/apoll_task_func. That won't
affect the main path, I guess.
> 
> One thing that I know Pavel hates is when hot paths are littered with
> a bunch of new code. I have been very careful to avoid that in my
> design.
> 
> I think that I have found a much better approach to tackle the problem
> of when to remove the napi_ids... I'll stop teasing about it... The
> code is ready, tested... All I need to do is prepare the patch and send
> it to the list for review.
> 
> net/core/dev.c uses a hash to store the napi structs. This could
> probably be replicated easily in io_uring, but for now, imho, this is
> only a polishing detail that can be revisited once the proof of
> concept has been accepted.
Exactly, that is not a high priority thing right now.
> 
> So eager to share the patch... This is the next thing that I do before
> going to bed once I am done reading my emails...


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-19  7:02                   ` napi_busy_poll Olivier Langlois
@ 2022-02-21  5:03                     ` Hao Xu
  2022-02-25  4:42                       ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-21  5:03 UTC (permalink / raw)
  To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/19 at 3:02 PM, Olivier Langlois wrote:
> On Fri, 2022-02-18 at 15:41 +0800, Hao Xu wrote:
>>
>> Hi Olivier,
>>
>> Have you tried just issuing one recv/pollin request and observing
>> the napi_id?
> 
> Hi Hao, not precisely, but you are 100% right about where the
> association is done. It is when a packet is received that the
> association is made. This happens in a few places, but the most likely
> place where it can happen with my NIC (Intel igb) is inside
> napi_gro_receive().

Yes: when a packet is received, skb->napi_id is set; when a batch of
them has been received, the skbs are delivered to the protocol layer
and sk->napi_id is set.
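
For reference, the socket-side association is done by a helper in
<net/busy_poll.h> along these lines (paraphrased from memory, so check
your tree for the exact form):

static inline void sk_mark_napi_id(struct sock *sk,
				   const struct sk_buff *skb)
{
	WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
}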
> 
> I do verify the socket napi_id once a WebSocket session is established.
> At that point, a lot of packets have already been exchanged back and
> forth:
> 
> TCP handshake
> TLS handshake
> HTTP request requesting a WS upgrade
> 
> At that point, the napi_id has been assigned.
> 
> My problem was only that my socket packets were routed over the
> loopback interface, which has no NAPI device associated with it.
>
> I removed the local SOCKS proxy from my setup and NAPI ids started to
> appear as expected.
> 
>>  From my understanding of the network stack, the napi_id of a socket
>> won't be valid until it gets some packets, because before that
>> moment busy_poll doesn't know which hw queue to poll. In other
>> words, the idea of NAPI polling is: the packets of a socket can come
>> from any hw queue of a net adapter, but we just poll the queue that
>> most recently received some data. So to get this piece of info,
>> there must be some data coming to one queue before doing the
>> busy_poll. Correct me if I'm wrong, since I'm also a newbie at
>> network stuff...
> 
> I am now getting what you mean here. So there are 2 possible
> approaches. You either:
>
> 1. add the napi id when you are sure that it is available, after it
> has been set in the sock layer; but then you are not sure whether it
> will be needed again by future requests, as it is too late to be of
> any use for the current request (unless it is a MULTISHOT poll). The
> add is performed in io_poll_task_func() and io_apoll_task_func().
>
> 2. add the napi id when the request poll is armed, where this
> knowledge could be leveraged to handle the current request, knowing
> that we might fail to get the id if it is the initial recv request.
> The add would be performed in __io_arm_poll_handler().
I explain this in the patch.
> 
> TBH, I am not sure if there are arguments in favor of one approach over
> the other... Maybe option #1 is the only one to make napi busy poll
> work correctly with MULTISHOT requests...
> 
> I'll let you think about this point... Your first choice might be the
> right one...
> 
> The other thing to consider when choosing the call site is locking...
> when done from __io_arm_poll_handler(), uring_lock is acquired...
> 
> I am not sure that this is always the case with
> io_poll_task_func/io_apoll_task_func...
> 
> I'll post v1 of the patch. My testing shows that it works fine. The
> race condition is not an issue when the busy poll is performed by the
> sqpoll thread, because the list modifications are exclusively
> performed by that thread too.
> 
> But I think that there is a possible race condition where the
> napi_list could be used from io_cqring_wait() while another thread
> modifies the list. This does NOT happen in my testing scenario, but
> it is definitely something that could happen somewhere in the real
> world...

Will there be any issue if we do the access with
list_for_each_entry_safe? I think it is safe enough.

> 
>>
>>
>> I was considering to poll all the rx rings, but it seemed to be not
>>
>> efficient from some tests by my colleague.
> 
> This is definitely the simplest implementation, but I did not go as
> far as testing it. There are too many unknown variables for it to be
> viable, IMHO. I am not too sure how many napi devices there can be in
> a typical server. My test machine has 2 NICs and one of them is
> simply unconnected. If we were to loop through all the devices, we
> would wastefully poll at least half of the devices on the system.
> That does not sound like a very good approach.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: napi_busy_poll
  2022-02-21  5:03                     ` napi_busy_poll Hao Xu
@ 2022-02-25  4:42                       ` Olivier Langlois
  0 siblings, 0 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-25  4:42 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe, io-uring

On Mon, 2022-02-21 at 13:03 +0800, Hao Xu wrote:
> > 
> > But I think that there is a possible race condition where the
> > napi_list could be used from io_cqring_wait() while another thread
> > modifies the list. This does NOT happen in my testing scenario, but
> > it is definitely something that could happen somewhere in the real
> > world...
> 
> Will there be any issue if we do the access with
> list_for_each_entry_safe? I think it is safe enough.

Hi Hao,

If napi_busy_poll is exclusively done from the sqpoll thread, all is
good because all the napi_list manipulations are performed from the
sqpoll thread.

The issue arises if we want to offer napi_busy_poll to a task calling
io_uring_enter(). If the busy_poll is performed from io_cqring_wait(),
as I propose in my patch, the napi_list could be updated by a different
thread calling io_uring_enter() to submit other requests.

This is an issue that v2 addresses, at the cost of making the code
uglier. The strategy is to splice the context napi_list into a local
list in io_cqring_wait() and assume that, in the most likely outcome,
once the busy_poll is over the only thing needed is to move the local
list back into the context. If the context napi_list has been updated
in the meantime, the lists are merged. This appears to be the approach
that minimizes the amount of memory allocation.
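
A minimal sketch of that splice-back strategy (ctx->napi_lock is an
assumed name here; v2 will have the real details):

LIST_HEAD(local_napi_list);

spin_lock(&ctx->napi_lock);
list_splice_init(&ctx->napi_list, &local_napi_list);
spin_unlock(&ctx->napi_lock);

/* busy poll using local_napi_list, no lock held */

spin_lock(&ctx->napi_lock);
/* most of the time ctx->napi_list is still empty and this is a plain
 * move back; if another thread added entries in the meantime, the
 * splice effectively merges the two lists */
list_splice(&local_napi_list, &ctx->napi_list);
spin_unlock(&ctx->napi_lock);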

Creating a benchmark program took more time than I originally expected.
I am going to run it, and if the gains from napi polling in
io_cqring_wait() aren't that good... maybe I'll ditch napi_busy_poll()
support from io_cqring_wait(); that way, taking the lock before adding
napi ids will not be required anymore...

Here is what will be added in v2:
- Evaluate list_empty(&ctx->napi_list) outside io_napi_busy_loop() to
keep __io_sq_thread() execution as fast as possible
- In io_cqring_wait(), move up the sig block to avoid needless
computation if the block exits the function
- In io_cqring_wait(), protect ctx->napi_list from race condition by
splicing it into a local list
- In io_cqring_wait(), allow busy polling when uts is missing
- Fix kernel test robot issues


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-02-25  4:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08 14:58 napi_busy_poll Olivier Langlois
2022-02-08 17:05 ` napi_busy_poll Jens Axboe
2022-02-09  3:34   ` napi_busy_poll Hao Xu
2022-02-12 19:51     ` napi_busy_poll Olivier Langlois
2022-02-13 18:47       ` napi_busy_poll Jens Axboe
2022-02-14 17:13       ` napi_busy_poll Hao Xu
2022-02-15  8:37         ` napi_busy_poll Olivier Langlois
2022-02-15 18:05           ` napi_busy_poll Olivier Langlois
2022-02-16  3:12             ` napi_busy_poll Hao Xu
2022-02-16 19:19               ` napi_busy_poll Olivier Langlois
2022-02-16 12:14             ` napi_busy_poll Hao Xu
2022-02-17 20:28               ` napi_busy_poll Olivier Langlois
2022-02-18  8:06                 ` napi_busy_poll Hao Xu
2022-02-19  7:14                   ` napi_busy_poll Olivier Langlois
2022-02-21  4:52                     ` napi_busy_poll Hao Xu
2022-02-17 23:18               ` napi_busy_poll Olivier Langlois
2022-02-17 23:25                 ` napi_busy_poll Jens Axboe
2022-02-18  7:21                 ` napi_busy_poll Hao Xu
2022-02-18  5:05               ` napi_busy_poll Olivier Langlois
2022-02-18  7:41                 ` napi_busy_poll Hao Xu
2022-02-19  7:02                   ` napi_busy_poll Olivier Langlois
2022-02-21  5:03                     ` napi_busy_poll Hao Xu
2022-02-25  4:42                       ` napi_busy_poll Olivier Langlois
