[net-next,2/2] net: exit busy loop when another process is runnable
diff mbox series

Message ID 1408608310-13579-2-git-send-email-jasowang@redhat.com
State New, archived
Headers show
Series
  • [net-next,1/2] sched: introduce nr_running_this_cpu()
Related show

Commit Message

Jason Wang Aug. 21, 2014, 8:05 a.m. UTC
Rx busy loop does not scale well when several parallel sessions are
active. This is because we keep looping even if another process is
runnable. For example, if that process is about to send a packet,
continuing to busy poll in the current process adds extra delay and
hurts performance.

This patch solves the issue by exiting the busy loop when another
process is runnable on the current cpu. A simple test that pins two
netperf sessions to the same cpu on the receiving side shows an obvious
improvement:

Before:
netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
16384  87380  1        1       10.00    15513.74
16384  87380
16384  87380  1        1       10.00    15092.78
16384  87380

After:
netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
16384  87380  1        1       10.00    23334.53
16384  87380
16384  87380  1        1       10.00    23327.58
16384  87380

The benchmark was done with two 8-core Xeon machines connected back to
back with mlx4 NICs, using the netperf TCP_RR test (busy_read was set to 50):

sessions  bytes  before      after       improvement  busy_read=0
1         1      30062.10    30034.72    +0%          20228.96
16        1      214719.83   307669.01   +43%         268997.71
32        1      231252.81   345845.16   +49%         336157.442
64        512    212467.39   373464.93   +75%         397449.375

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/net/busy_poll.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
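The condition this patch changes is small enough to model outside the
kernel. The sketch below mirrors the names from the diff
(nr_running_this_cpu() is introduced by patch 1/2); it is a userspace
illustration of the exit logic, not kernel code:

```c
/*
 * Userspace model of the loop condition this patch changes in
 * sk_busy_loop().  nr_running_this_cpu() comes from patch 1/2; every
 * name here is an assumption mirrored from the diff, not kernel code.
 */
#include <stdbool.h>

struct poll_state {
	bool nonblock;     /* caller requested non-blocking poll       */
	bool queue_empty;  /* skb_queue_empty(&sk->sk_receive_queue)   */
	bool resched;      /* need_resched()                           */
	bool timed_out;    /* busy_loop_timeout(end_time)              */
	int  nr_running;   /* nr_running_this_cpu()                    */
};

/* True while sk_busy_loop() should keep spinning. */
bool keep_busy_polling(const struct poll_state *s)
{
	return !s->nonblock && s->queue_empty &&
	       !s->resched && !s->timed_out &&
	       s->nr_running < 2;  /* new: stop if another task is runnable */
}
```

Only the last clause is new; the rest reproduces the existing loop
condition, so the patch can only make the loop exit earlier, never later.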

Comments

Michael S. Tsirkin Aug. 21, 2014, 8:11 a.m. UTC | #1
On Thu, Aug 21, 2014 at 04:05:10PM +0800, Jason Wang wrote:
> Rx busy loop does not scale well in the case when several parallel
> sessions is active. This is because we keep looping even if there's
> another process is runnable. For example, if that process is about to
> send packet, keep busy polling in current process will brings extra
> delay and damage the performance.
> 
> This patch solves this issue by exiting the busy loop when there's
> another process is runnable in current cpu. Simple test that pin two
> netperf sessions in the same cpu in receiving side shows obvious
> improvement:
> 
> Before:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    15513.74
> 16384  87380
> 16384  87380  1        1       10.00    15092.78
> 16384  87380
> 
> After:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    23334.53
> 16384  87380
> 16384  87380  1        1       10.00    23327.58
> 16384  87380
> 
> Benchmark was done through two 8 cores Xeon machine back to back connected
> with mlx4 through netperf TCP_RR test (busy_read were set to 50):
> 
> sessions/bytes/before/after/+improvement%/busy_read=0/
> 1/1/30062.10/30034.72/+0%/20228.96/
> 16/1/214719.83/307669.01/+43%/268997.71/
> 32/1/231252.81/345845.16/+49%/336157.442/
> 64/512/212467.39/373464.93/+75%/397449.375/
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  include/net/busy_poll.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 1d67fb6..8a33fb2 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>  		cpu_relax();
>  
>  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> -		 !need_resched() && !busy_loop_timeout(end_time));
> +		 !need_resched() && !busy_loop_timeout(end_time) &&
> +		 nr_running_this_cpu() < 2);

<= 1 would be a bit clearer? We want at most one process here.


>  
>  	rc = !skb_queue_empty(&sk->sk_receive_queue);
>  out:
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Amos Kong Aug. 21, 2014, 7:03 p.m. UTC | #2
On Thu, Aug 21, 2014 at 04:05:10PM +0800, Jason Wang wrote:
> Rx busy loop does not scale well in the case when several parallel
> sessions is active. This is because we keep looping even if there's
> another process is runnable. For example, if that process is about to
> send packet, keep busy polling in current process will brings extra
> delay and damage the performance.
> 
> This patch solves this issue by exiting the busy loop when there's
> another process is runnable in current cpu. Simple test that pin two
> netperf sessions in the same cpu in receiving side shows obvious
> improvement:
> 
> Before:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    15513.74
> 16384  87380
> 16384  87380  1        1       10.00    15092.78
> 16384  87380
> 
> After:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    23334.53
> 16384  87380
> 16384  87380  1        1       10.00    23327.58
> 16384  87380
> 
> Benchmark was done through two 8 cores Xeon machine back to back connected
> with mlx4 through netperf TCP_RR test (busy_read were set to 50):
> 
> sessions/bytes/before/after/+improvement%/busy_read=0/
> 1/1/30062.10/30034.72/+0%/20228.96/
> 16/1/214719.83/307669.01/+43%/268997.71/
> 32/1/231252.81/345845.16/+49%/336157.442/
> 64/512/212467.39/373464.93/+75%/397449.375/
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  include/net/busy_poll.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 1d67fb6..8a33fb2 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>  		cpu_relax();
>  
>  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> -		 !need_resched() && !busy_loop_timeout(end_time));
> +		 !need_resched() && !busy_loop_timeout(end_time) &&
> +		 nr_running_this_cpu() < 2);


Reviewed-by: Amos Kong <akong@redhat.com>
  
>  	rc = !skb_queue_empty(&sk->sk_receive_queue);
>  out:
> -- 
> 1.8.3.1
> 
Jason Wang Aug. 22, 2014, 2:53 a.m. UTC | #3
On 08/21/2014 04:11 PM, Michael S. Tsirkin wrote:
> On Thu, Aug 21, 2014 at 04:05:10PM +0800, Jason Wang wrote:
>> > Rx busy loop does not scale well in the case when several parallel
>> > sessions is active. This is because we keep looping even if there's
>> > another process is runnable. For example, if that process is about to
>> > send packet, keep busy polling in current process will brings extra
>> > delay and damage the performance.
>> > 
>> > This patch solves this issue by exiting the busy loop when there's
>> > another process is runnable in current cpu. Simple test that pin two
>> > netperf sessions in the same cpu in receiving side shows obvious
>> > improvement:
>> > 
>> > Before:
>> > netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
>> > netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
>> > 16384  87380  1        1       10.00    15513.74
>> > 16384  87380
>> > 16384  87380  1        1       10.00    15092.78
>> > 16384  87380
>> > 
>> > After:
>> > netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
>> > netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
>> > 16384  87380  1        1       10.00    23334.53
>> > 16384  87380
>> > 16384  87380  1        1       10.00    23327.58
>> > 16384  87380
>> > 
>> > Benchmark was done through two 8 cores Xeon machine back to back connected
>> > with mlx4 through netperf TCP_RR test (busy_read were set to 50):
>> > 
>> > sessions/bytes/before/after/+improvement%/busy_read=0/
>> > 1/1/30062.10/30034.72/+0%/20228.96/
>> > 16/1/214719.83/307669.01/+43%/268997.71/
>> > 32/1/231252.81/345845.16/+49%/336157.442/
>> > 64/512/212467.39/373464.93/+75%/397449.375/
>> > 
>> > Signed-off-by: Jason Wang <jasowang@redhat.com>
>> > ---
>> >  include/net/busy_poll.h | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
>> > index 1d67fb6..8a33fb2 100644
>> > --- a/include/net/busy_poll.h
>> > +++ b/include/net/busy_poll.h
>> > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>> >  		cpu_relax();
>> >  
>> >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
>> > -		 !need_resched() && !busy_loop_timeout(end_time));
>> > +		 !need_resched() && !busy_loop_timeout(end_time) &&
>> > +		 nr_running_this_cpu() < 2);
> <= 1 would be a bit clearer? We want at most one process here.
>

Ok, will change it in next version.
Mike Galbraith Aug. 22, 2014, 5:01 a.m. UTC | #4
On Thu, 2014-08-21 at 16:05 +0800, Jason Wang wrote: 
> Rx busy loop does not scale well in the case when several parallel
> sessions is active. This is because we keep looping even if there's
> another process is runnable. For example, if that process is about to
> send packet, keep busy polling in current process will brings extra
> delay and damage the performance.
> 
> This patch solves this issue by exiting the busy loop when there's
> another process is runnable in current cpu. Simple test that pin two
> netperf sessions in the same cpu in receiving side shows obvious
> improvement:

That patch says to me it's a bad idea to spin when someone (anyone) else
can get some work done on a CPU, which intuitively makes sense.  But..

(ponders net goop: with silly 1 byte ping-pong load, throughput is bound
by fastpath latency, net plus sched plus fixable nohz and governor crud
if not polling, so you can't get a lot of data moved byte at a time no
matter how sexy the pipe whether polling or not due to bound.  If OTOH
net hardware is a blazing fast large bore packet cannon, net overhead
per unit payload drops, sched+crud is a constant)

Seems the only time it's a good idea to poll is if blasting big packets
on sexy hardware, and if you're doing that, you want to poll regardless
of whether somebody else is waiting, or?

> Before:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    15513.74
> 16384  87380
> 16384  87380  1        1       10.00    15092.78
> 16384  87380
> 
> After:
> netperf -H 192.168.100.2 -T 0,0 -t TCP_RR -P 0 & \
> netperf -H 192.168.100.2 -T 1,0 -t TCP_RR -P 0
> 16384  87380  1        1       10.00    23334.53
> 16384  87380
> 16384  87380  1        1       10.00    23327.58
> 16384  87380
> 
> Benchmark was done through two 8 cores Xeon machine back to back connected
> with mlx4 through netperf TCP_RR test (busy_read were set to 50):
> 
> sessions/bytes/before/after/+improvement%/busy_read=0/
> 1/1/30062.10/30034.72/+0%/20228.96/
> 16/1/214719.83/307669.01/+43%/268997.71/
> 32/1/231252.81/345845.16/+49%/336157.442/
> 64/512/212467.39/373464.93/+75%/397449.375/
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  include/net/busy_poll.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> index 1d67fb6..8a33fb2 100644
> --- a/include/net/busy_poll.h
> +++ b/include/net/busy_poll.h
> @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>  		cpu_relax();
>  
>  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> -		 !need_resched() && !busy_loop_timeout(end_time));
> +		 !need_resched() && !busy_loop_timeout(end_time) &&
> +		 nr_running_this_cpu() < 2);
>  
>  	rc = !skb_queue_empty(&sk->sk_receive_queue);
>  out:


Jason Wang Aug. 22, 2014, 7:29 a.m. UTC | #5
On 08/22/2014 01:01 PM, Mike Galbraith wrote:
> On Thu, 2014-08-21 at 16:05 +0800, Jason Wang wrote: 
>> > Rx busy loop does not scale well in the case when several parallel
>> > sessions is active. This is because we keep looping even if there's
>> > another process is runnable. For example, if that process is about to
>> > send packet, keep busy polling in current process will brings extra
>> > delay and damage the performance.
>> > 
>> > This patch solves this issue by exiting the busy loop when there's
>> > another process is runnable in current cpu. Simple test that pin two
>> > netperf sessions in the same cpu in receiving side shows obvious
>> > improvement:
> That patch says to me it's a bad idea to spin when someone (anyone) else
> can get some work done on a CPU, which intuitively makes sense.  But..
>
> (ponders net goop: with silly 1 byte ping-pong load, throughput is bound
> by fastpath latency, net plus sched plus fixable nohz and governor crud
> if not polling, so you can't get a lot of data moved byte at a time no
> matter how sexy the pipe whether polling or not due to bound.  If OTOH
> net hardware is a blazing fast large bore packet cannon, net overhead
> per unit payload drops, sched+crud is a constant)

Polling can be done either by the rx busy loop in process context or by
NAPI in softirq. The rx busy loop only spins and polls when no packet is
found in the socket receive queue. It spins in the hope that at least
one packet will arrive shortly (in which case the process exits the
loop). In this way, it eliminates the overheads of NAPI, wakeup and
scheduling. This patch just makes the busy polling less aggressive:
since the process finds nothing to receive while spinning in this loop,
there's no need to waste cpu cycles (or even call cpu_relax()) if other
work could be done by the current CPU.

For a stream workload like the one you mention here, if the card is fast
enough, the socket receive queue is not easily drained. The rx busy loop
won't help, or won't even be triggered, in this case.
>
> Seems the only time it's a good idea to poll is if blasting big packets
> on sexy hardware, and if you're doing that, you want to poll regardless
> of whether somebody else is waiting, or?

NAPI works instead of the rx busy loop in this case. It polls and tries
to drain the nic's rx ring in softirq regardless of anybody else.

Btw, the current rx busy loop does not perform well on stream workloads
since it bypasses GRO to reduce latency, but that issue is beyond the
scope of this patch.
Ingo Molnar Aug. 22, 2014, 7:36 a.m. UTC | #6
> > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > index 1d67fb6..8a33fb2 100644
> > --- a/include/net/busy_poll.h
> > +++ b/include/net/busy_poll.h
> > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> >  		cpu_relax();
> >  
> >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > -		 !need_resched() && !busy_loop_timeout(end_time));
> > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > +		 nr_running_this_cpu() < 2);

So it's generally a bad idea to couple to the scheduler through 
such a low level, implementation dependent value like 
'nr_running', causing various problems:

 - It misses important work that might be pending on this CPU,
   like RCU callbacks.

 - It will also over-credit task contexts that might be
   runnable, but which are less important than the currently
   running one: such as a SCHED_IDLE task

 - It will also over-credit even regular SCHED_NORMAL tasks, if
   this current task is more important than them: say
   SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
   with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
   on an otherwise idle system.

So what you want is a more sophisticated query to the 
scheduler, a sched_expected_runtime() method that returns the 
number of nsecs this task is expected to run in the future, 
which returns 0 if you will be scheduled away on the next 
schedule(), and returns infinity for a high prio SCHED_FIFO 
task, or if this SCHED_NORMAL task is on an otherwise idle CPU.

It will return a regular time slice value in other cases, when 
there's some load on the CPU.

The polling logic can then do its decision based on that time 
value.
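A userspace sketch of how busy-poll code might consume such a hint.
sched_expected_runtime() is the proposed interface, not an existing
kernel API; it is stubbed here with a fixed value purely for
illustration:

```c
/*
 * Userspace sketch of the proposed sched_expected_runtime() interface.
 * The function does not exist in the kernel; the stub below stands in
 * for a lockless query to the scheduler.
 */
#include <stdbool.h>

typedef unsigned long long u64;

/* "Infinity": high-prio SCHED_FIFO, or alone on an otherwise idle CPU. */
#define RUNTIME_INFINITE (~0ULL)

/* Stub: the real method would ask the scheduler, locklessly. */
u64 sched_expected_runtime(void)
{
	return 100000;  /* pretend: 100us of runtime left on this CPU */
}

/* Poll only if the expected remaining runtime covers one more interval. */
bool worth_another_poll(u64 expected_ns, u64 poll_interval_ns)
{
	return expected_ns >= poll_interval_ns;
}
```

A return of 0 (scheduled away at the next schedule()) makes the polling
code bail out immediately, which subsumes the need_resched() check.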

All this can be done reasonably fast and lockless in most 
cases, so that it can be called from busy-polling code.

An added advantage would be that this approach consolidates the 
somewhat random need_resched() checks into this method as well.

In any case I don't agree with the nr_running_this_cpu() 
method.

(Please Cc: me and lkml to future iterations of this patchset.)

Thanks,

	Ingo
Ingo Molnar Aug. 22, 2014, 7:42 a.m. UTC | #7
* Jason Wang <jasowang@redhat.com> wrote:

> Polling could be done by either rx busy loop in process 
> context or NAPI in softirq. [...]

Note that this shows another reason why it's a bad idea to 
query nr_running directly: depending on the softirq processing 
method, a softirq might run:

 - directly in process context
 - in an idle thread's context
 - or in a ksoftirqd context. 

'nr_running' will have different values in these cases, causing 
asymmetries in busy-poll handling!

Another class of asymmetry is when there are other softirq bits 
pending, beyond NET_RX (or NET_TX): a nr_running check misses 
them.

The solution I outlined in the previous mail (using a 
sched_expected_runtime() method) would be able to avoid most of 
these artifacts.

Thanks,

	Ingo
Jason Wang Aug. 22, 2014, 9:08 a.m. UTC | #8
On 08/22/2014 03:36 PM, Ingo Molnar wrote:
>>> diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
>>> index 1d67fb6..8a33fb2 100644
>>> --- a/include/net/busy_poll.h
>>> +++ b/include/net/busy_poll.h
>>> @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>>>  		cpu_relax();
>>>  
>>>  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
>>> -		 !need_resched() && !busy_loop_timeout(end_time));
>>> +		 !need_resched() && !busy_loop_timeout(end_time) &&
>>> +		 nr_running_this_cpu() < 2);
> So it's generally a bad idea to couple to the scheduler through 
> such a low level, implementation dependent value like 
> 'nr_running', causing various problems:
>
>  - It misses important work that might be pending on this CPU,
>    like RCU callbacks.
>
>  - It will also over-credit task contexts that might be
>    runnable, but which are less important than the currently
>    running one: such as a SCHED_IDLE task
>
>  - It will also over-credit even regular SCHED_NORMAL tasks, if
>    this current task is more important than them: say
>    SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
>    with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
>    on an otherwise idle system.

I see.
> So what you want is a more sophisticated query to the 
> scheduler, a sched_expected_runtime() method that returns the 
> number of nsecs this task is expected to run in the future, 
> which returns 0 if you will be scheduled away on the next 
> schedule(), and returns infinity for a high prio SCHED_FIFO 
> task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
>
> It will return a regular time slice value in other cases, when 
> there's some load on the CPU.
>
> The polling logic can then do its decision based on that time 
> value.

But this is just for the current process. We want to determine whether
or not it is worth busy looping in the current process by checking
whether there are any other runnable processes or callbacks. What we
need here is just a simple, lockless hint for exiting the busy loop; it
may be inaccurate, but it can't cause wrong behaviour, because the net
code does not depend on this hint for scheduling or yielding.

How about just introducing a boolean helper like current_can_busy_loop()
that returns true in one of the following conditions:

- the current task is SCHED_FIFO
- the current task is neither SCHED_FIFO nor SCHED_IDLE, and there are
  no other runnable processes or pending RCU callbacks on the current
  cpu

and adding warnings to make sure it can only be called in process
context.
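The decision logic of such a helper can be modeled in userspace. The
policy names mirror the kernel's scheduling classes, but this encodes
only the two conditions stated above; the real helper would query the
scheduler and RCU state directly:

```c
/*
 * Userspace model of the proposed current_can_busy_loop() helper.
 * Everything here is hypothetical: the enum stands in for the task's
 * scheduling policy, and the counters stand in for lockless per-cpu
 * queries that the real kernel helper would perform.
 */
#include <stdbool.h>

enum sched_policy { POLICY_NORMAL, POLICY_FIFO, POLICY_IDLE };

bool current_can_busy_loop(enum sched_policy policy,
			   int other_runnable, int pending_rcu_cbs)
{
	if (policy == POLICY_FIFO)
		return true;   /* RT task: always allowed to poll */
	if (policy == POLICY_IDLE)
		return false;  /* SCHED_IDLE: never poll */
	/* SCHED_NORMAL: poll only if nothing else wants this CPU */
	return other_runnable == 0 && pending_rcu_cbs == 0;
}
```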

Thanks
>
> All this can be done reasonably fast and lockless in most 
> cases, so that it can be called from busy-polling code.
>
> An added advantage would be that this approach consolidates the 
> somewhat random need_resched() checks into this method as well.
>
> In any case I don't agree with the nr_running_this_cpu() 
> method.
>
> (Please Cc: me and lkml to future iterations of this patchset.)
>
> Thanks,
>
> 	Ingo
Eric Dumazet Aug. 22, 2014, 2:16 p.m. UTC | #9
On Fri, 2014-08-22 at 17:08 +0800, Jason Wang wrote:

> But this is just for current process. We want to determine whether or
> not it was worth to loop busily in current process by checking if
> there's any another runnable processes or callbacks. And what we need
> here is just a simple and lockless hint which can't be wrong but may be
> inaccurate to exit the busy loop. The net code does not depends on this
> hint to do scheduling or yielding.
> 
> How about just introducing a boolean helper like current_can_busy_loop()
> and return true in one of the following conditions:
> 
> - Current task is SCHED_FIFO
> - Current task is neither SCHED_FIFO nor SCHED_IDLE and no other
> runnable processes or pending RCU callbacks in current cpu
> 
> And add warns to make sure it can only be called in process context.


1) Any reasons Eliezer Tamir is not included in the CC list ?

   He is the busypoll author after all, and did nothing wrong to be
banned from these patches ;)

2) It looks like sk_busy_loop() should not be inlined, it is already too
big.




Jason Wang Aug. 25, 2014, 2:54 a.m. UTC | #10
On 08/22/2014 10:16 PM, Eric Dumazet wrote:
> On Fri, 2014-08-22 at 17:08 +0800, Jason Wang wrote:
>
>> > But this is just for current process. We want to determine whether or
>> > not it was worth to loop busily in current process by checking if
>> > there's any another runnable processes or callbacks. And what we need
>> > here is just a simple and lockless hint which can't be wrong but may be
>> > inaccurate to exit the busy loop. The net code does not depends on this
>> > hint to do scheduling or yielding.
>> > 
>> > How about just introducing a boolean helper like current_can_busy_loop()
>> > and return true in one of the following conditions:
>> > 
>> > - Current task is SCHED_FIFO
>> > - Current task is neither SCHED_FIFO nor SCHED_IDLE and no other
>> > runnable processes or pending RCU callbacks in current cpu
>> > 
>> > And add warns to make sure it can only be called in process context.
> 1) Any reasons Eliezer Tamir is not included in the CC list ?
>
>    He is the busypoll author after all, and did nothing wrong to be
> banned from these patches ;)
CC Eliezer Tamir


Not intentional, I just generated the CC list through get_maintainer.pl :(
>
> 2) It looks like sk_buy_loop() should not be inlined, its is already too
> big.

Eliezer Tamir Aug. 25, 2014, 1:16 p.m. UTC | #11
On 22/08/2014 17:16, Eric Dumazet wrote:
> On Fri, 2014-08-22 at 17:08 +0800, Jason Wang wrote:
> 
>> But this is just for current process. We want to determine whether or
>> not it was worth to loop busily in current process by checking if
>> there's any another runnable processes or callbacks. And what we need
>> here is just a simple and lockless hint which can't be wrong but may be
>> inaccurate to exit the busy loop. The net code does not depends on this
>> hint to do scheduling or yielding.
>>
>> How about just introducing a boolean helper like current_can_busy_loop()
>> and return true in one of the following conditions:
>>
>> - Current task is SCHED_FIFO
>> - Current task is neither SCHED_FIFO nor SCHED_IDLE and no other
>> runnable processes or pending RCU callbacks in current cpu
>>
>> And add warns to make sure it can only be called in process context.
> 
> 
> 1) Any reasons Eliezer Tamir is not included in the CC list ?

Thanks for remembering me, Eric ;)

Here are my 2 cents:
I think Ingo's suggestion of only yielding to tasks with same or higher
priority makes sense.

IF you change the current behavior, please update the documentation.
You are going to make people scratch their heads and ask "what changed?";
you owe them a clue.

I also would like to have some way to keep track of when/if/how much
this yield happens.

> 2) It looks like sk_buy_loop() should not be inlined, its is already too
> big.

You made this comment in the past; my response was that it's inlined so
it can be optimized when nonblock is known at compile time to be true
(e.g. when called from sock_poll).
IFF you think that's less important, then I will defer to your opinion.

-Eliezer
Jason Wang Aug. 26, 2014, 7:16 a.m. UTC | #12
On 08/25/2014 09:16 PM, Eliezer Tamir wrote:
> On 22/08/2014 17:16, Eric Dumazet wrote:
>> > On Fri, 2014-08-22 at 17:08 +0800, Jason Wang wrote:
>> > 
>>> >> But this is just for current process. We want to determine whether or
>>> >> not it was worth to loop busily in current process by checking if
>>> >> there's any another runnable processes or callbacks. And what we need
>>> >> here is just a simple and lockless hint which can't be wrong but may be
>>> >> inaccurate to exit the busy loop. The net code does not depends on this
>>> >> hint to do scheduling or yielding.
>>> >>
>>> >> How about just introducing a boolean helper like current_can_busy_loop()
>>> >> and return true in one of the following conditions:
>>> >>
>>> >> - Current task is SCHED_FIFO
>>> >> - Current task is neither SCHED_FIFO nor SCHED_IDLE and no other
>>> >> runnable processes or pending RCU callbacks in current cpu
>>> >>
>>> >> And add warns to make sure it can only be called in process context.
>> > 
>> > 
>> > 1) Any reasons Eliezer Tamir is not included in the CC list ?
> Thanks for remembering me, Eric ;)
>
> Here are my 2 cents:
> I think Ingo's suggestion of only yielding to tasks with same or higher
> priority makes sense.

I'm not sure I get your meaning. Do you mean calling yield_to() directly
in sk_busy_loop?

schedule(), which will be called later, should handle all cases such as
priority and rt processes. This patch just wants schedule() to make that
decision earlier, by exiting the busy loop earlier. This will improve
latency under both heavy and light load.

Checking the number of nsecs this task is expected to run in the future
sounds like work that sk_busy_loop_end_time() should consider. It is not
the issue this patch wants to address.
>
> IF you change the current behavior, please update the documentation.
> You are going to make people scratch their head and ask "what changed?"
> you owe them a clue.

Thanks for the reminder. But this patch itself does not change any
user-noticeable behaviour.
> I also would like to have some way to keep track of when/if/how much
> this yield happens.
>

Ok, not very hard to add, maybe just another statistics counter.

Jason Wang Aug. 29, 2014, 3:08 a.m. UTC | #13
On 08/22/2014 03:42 PM, Ingo Molnar wrote:
> * Jason Wang <jasowang@redhat.com> wrote:
>
>> Polling could be done by either rx busy loop in process 
>> context or NAPI in softirq. [...]
> Note that this shows another reason why it's a bad idea to 
> query nr_running directly: depending on the softirq processing 
> method, a softirq might run:
>
>  - directly in process context
>  - in an idle thread's context
>  - or in a ksoftirqd context. 
>
> 'nr_running' will have different values in these cases, causing 
> assymetries in busy-poll handling!

The current rx busy polling code only works in process context; we can
add a BUG or warnings to make sure the helper is only called in process
context.
>
> Another class of assymetry is when there are other softirq bits 
> pending, beyond NET_RX (or NET_TX): a nr_running check misses 
> them.

Yes, but rx busy polling only works in process context and does not
disable bh, so it may not be an issue.
> The solution I outlined in the previous mail (using a 
> sched_expected_runtime() method) would be able to avoid most of 
> these artifacts.

It only takes care of the remaining runnable time of the current
process. That is a good hint for sk_busy_loop_end_time(), which returns
the maximum time a process can busy poll, but it is not what this patch
needs. We need to check the state of other runnable processes on the
current cpu to avoid damaging their performance.
>
> Thanks,
>
> 	Ingo
Eliezer Tamir Sept. 1, 2014, 6:39 a.m. UTC | #14
On 29/08/2014 06:08, Jason Wang wrote:
> Yes, but rx busy polling only works in process context and does not
> disable bh, so it may be not an issue.

sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.

-Eliezer
Eliezer Tamir Sept. 1, 2014, 6:55 a.m. UTC | #15
On 26/08/2014 10:16, Jason Wang wrote:
> On 08/25/2014 09:16 PM, Eliezer Tamir wrote:
>> Here are my 2 cents:
>> I think Ingo's suggestion of only yielding to tasks with same or higher
>> priority makes sense.
> 
> I'm not sure I get your meaning. Do you mean calling yield_to() directly
> in sk_busy_loop?

Think about the case where two processes are busy polling on the
same CPU and the same device queue. Since busy polling processes
incoming packets on the queue from any process, this scenario works
well currently, and will not work at all when polling yields to other
processes that are of the same priority that are running on the same
CPU.

As a side note, there is a lot of room for improvement when two
processes on the same CPU want to busy poll on different device
queues.
The RFC code I published for epoll support showed one possible
way of solving this, but I'm sure that there are other possibilities.

Maybe the networking subsystem should maintain a list of device
queues that need busypolling and have a thread that would poll
all of them when there's nothing better to do.

I'm aware of similar work on busy polling on NVMe devices, so
maybe there should be a global busypoll thread for all devices
that support it.

BTW, I have someone inside Intel that wants to test future patches. Feel
free to send me patches for testing, even if they are not ready for
publishing yet.

Cheers,
Eliezer
Peter Zijlstra Sept. 1, 2014, 9:31 a.m. UTC | #16
On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
> > +++ b/include/net/busy_poll.h
> > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> >  		cpu_relax();
> >  
> >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > -		 !need_resched() && !busy_loop_timeout(end_time));
> > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > +		 nr_running_this_cpu() < 2);
> >  

So as has been said by now; this is horrible.

We should not export nr_running like this ever. Your usage of < 2
implies this can be hit with nr_running == 0, and therefore you can also
hit it with nr_running == 1 where the one is not network related and you
get random delays.

Worse still, you have BH (and thereby preemption) disabled, you should
not _ever_ have undefined and indefinite waits like that.

You also destroy any hope of dropping into lower power states; even when
there's never going to be a packet ever again, also bad.

All in all, a complete trainwreck.

NAK.
Michael S. Tsirkin Sept. 1, 2014, 9:52 a.m. UTC | #17
On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
> > > +++ b/include/net/busy_poll.h
> > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > >  		cpu_relax();
> > >  
> > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > +		 nr_running_this_cpu() < 2);
> > >  
> 
> So as has been said by now; this is horrible.
> 
> We should not export nr_running like this ever. Your usage of < 2
> implies this can be hit with nr_running == 0, and therefore you can also
> hit it with nr_running == 1 where the one is not network related and you
> get random delays.
> 
> Worse still, you have BH (and thereby preemption) disabled, you should
> not _ever_ have undefined and indefinite waits like that.
> 
> You also destroy any hope of dropping into lower power states; even when
> there's never going to be a packet ever again, also bad.

Hmm this patch sometimes makes us exit from the busy loop *earlier*.
How can this interfere with dropping into lower power states?

> All in all, a complete trainwreck.
> 
> NAK.
Peter Zijlstra Sept. 1, 2014, 10:04 a.m. UTC | #18
On Mon, Sep 01, 2014 at 12:52:19PM +0300, Michael S. Tsirkin wrote:
> On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
> > > > +++ b/include/net/busy_poll.h
> > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > > >  		cpu_relax();
> > > >  
> > > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > > +		 nr_running_this_cpu() < 2);
> > > >  
> > 
> > So as has been said by now; this is horrible.
> > 
> > We should not export nr_running like this ever. Your usage of < 2
> > implies this can be hit with nr_running == 0, and therefore you can also
> > hit it with nr_running == 1 where the one is not network related and you
> > get random delays.
> > 
> > Worse still, you have BH (and thereby preemption) disabled, you should
> > not _ever_ have undefined and indefinite waits like that.
> > 
> > You also destroy any hope of dropping into lower power states; even when
> > there's never going to be a packet ever again, also bad.
> 
> Hmm this patch sometimes makes us exit from the busy loop *earlier*.
> How can this interfere with dropping into lower power states?

Ah.. jetlag.. :/ I read it like it would indefinitely spin if there was
only the 'one' task, not avoid the spin unless there was the one task.

The nr_running thing is still horrible, but let me reread this patch
description to see if it explains why that is a good thing.
Peter Zijlstra Sept. 1, 2014, 10:19 a.m. UTC | #19
On Mon, Sep 01, 2014 at 12:04:34PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 01, 2014 at 12:52:19PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
> > > > > +++ b/include/net/busy_poll.h
> > > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > > > >  		cpu_relax();
> > > > >  
> > > > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > > > +		 nr_running_this_cpu() < 2);
> > > > >  
> > > 
> > > So as has been said by now; this is horrible.
> > > 
> > > We should not export nr_running like this ever. Your usage of < 2
> > > implies this can be hit with nr_running == 0, and therefore you can also
> > > hit it with nr_running == 1 where the one is not network related and you
> > > get random delays.
> > > 
> > > Worse still, you have BH (and thereby preemption) disabled, you should
> > > not _ever_ have undefined and indefinite waits like that.
> > > 
> > > You also destroy any hope of dropping into lower power states; even when
> > > there's never going to be a packet ever again, also bad.
> > 
> > Hmm this patch sometimes makes us exit from the busy loop *earlier*.
> > How can this interfere with dropping into lower power states?
> 
> Ah.. jetlag.. :/ I read it like it owuld indefinitely spin if there was
> only the 'one' task, not avoid the spin unless there was the one task.
> 
> The nr_running thing is still horrible, but let me reread this patch
> description to see if it explains why that is a good thing.

OK I suppose that more or less makes sense, the contextual behaviour is
of course tedious in that it makes behaviour less predictable. The
'other' tasks might not want to generate data and you then destroy
throughput by not spinning.

I'm not entirely sure I see how it's all supposed to work though; the
various poll functions call sk_busy_poll(), and do_select() also loops.

The patch only kills the sk_busy_poll() loop, but then do_select() will
still loop and not sleep, so how is this helping?


Michael S. Tsirkin Sept. 1, 2014, 10:22 a.m. UTC | #20
On Mon, Sep 01, 2014 at 12:04:34PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 01, 2014 at 12:52:19PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
> > > > > +++ b/include/net/busy_poll.h
> > > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > > > >  		cpu_relax();
> > > > >  
> > > > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > > > +		 nr_running_this_cpu() < 2);
> > > > >  
> > > 
> > > So as has been said by now; this is horrible.
> > > 
> > > We should not export nr_running like this ever. Your usage of < 2
> > > implies this can be hit with nr_running == 0, and therefore you can also
> > > hit it with nr_running == 1 where the one is not network related and you
> > > get random delays.
> > > 
> > > Worse still, you have BH (and thereby preemption) disabled, you should
> > > not _ever_ have undefined and indefinite waits like that.
> > > 
> > > You also destroy any hope of dropping into lower power states; even when
> > > there's never going to be a packet ever again, also bad.
> > 
> > Hmm this patch sometimes makes us exit from the busy loop *earlier*.
> > How can this interfere with dropping into lower power states?
> 
> Ah.. jetlag.. :/ I read it like it owuld indefinitely spin if there was
> only the 'one' task, not avoid the spin unless there was the one task.
> 
> The nr_running thing is still horrible,

Yea, it's a kludge, but busy waiting is a heuristic thing anyway,
so it boils down to whether it's mostly effective.
I agree it would be better to make it more robust/consistent if we can
do it without a lot of complexity.

> but let me reread this patch
> description to see if it explains why that is a good thing.
Jason Wang Sept. 2, 2014, 3:29 a.m. UTC | #21
On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
> On 29/08/2014 06:08, Jason Wang wrote:
>> > Yes, but rx busy polling only works in process context and does not
>> > disable bh, so it may be not an issue.
> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
>
> -Eliezer

True, so we probably also need to exit the loop when there are pending bhs.
Jason Wang Sept. 2, 2014, 3:35 a.m. UTC | #22
On 09/01/2014 02:55 PM, Eliezer Tamir wrote:
> On 26/08/2014 10:16, Jason Wang wrote:
>> On 08/25/2014 09:16 PM, Eliezer Tamir wrote:
>>> Here are my 2 cents:
>>> I think Ingo's suggestion of only yielding to tasks with same or higher
>>> priority makes sense.
>> I'm not sure I get your meaning. Do you mean calling yield_to() directly
>> in sk_busy_loop?
> Think about the case where two processes are busy polling on the
> same CPU and the same device queue. Since busy polling processes
> incoming packets on the queue from any process, this scenario works
> well currently,

I see, but it looks like we can simply do this by exiting the busy loop
when ndo_busy_poll() finds something that is not for the current socket?
>  and will not work at all when polling yields to other
> processes that are of the same priority that are running on the same
> CPU.

So yielding has its limitations; we need to let the scheduler make the
choice instead.
>
> As a side note, there is a lot of room for improvement when two
> processes on the same CPU want to busy poll on different device
> queues.
> The RFC code I published for epoll support showed one possible
> way of solving this, but I'm sure that there are other possibilities.
>
> Maybe the networking subsystem should maintain a list of device
> queues that need busypolling and have a thread that would poll
> all of them when there's nothing better to do.

I'm not sure whether this method will scale with thousands of sockets
and processes.
>
> I'm aware of similar work on busy polling on NVMe devices, so
> maybe there should be a global busypoll thread for all devices
> that support it.
>
> BTW, I have someone inside Intel that wants to test future patches. Feel
> free to send me patches for testing, even if they are not ready for
> publishing yet.
>
> Cheers,
> Eliezer

Ok, will do it, thanks a lot.
Jason Wang Sept. 2, 2014, 3:38 a.m. UTC | #23
On 09/01/2014 06:04 PM, Peter Zijlstra wrote:
> On Mon, Sep 01, 2014 at 12:52:19PM +0300, Michael S. Tsirkin wrote:
>> On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
>>> On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
>>>>> +++ b/include/net/busy_poll.h
>>>>> @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>>>>>  		cpu_relax();
>>>>>  
>>>>>  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
>>>>> -		 !need_resched() && !busy_loop_timeout(end_time));
>>>>> +		 !need_resched() && !busy_loop_timeout(end_time) &&
>>>>> +		 nr_running_this_cpu() < 2);
>>>>>  
>>> So as has been said by now; this is horrible.
>>>
>>> We should not export nr_running like this ever. Your usage of < 2
>>> implies this can be hit with nr_running == 0, and therefore you can also
>>> hit it with nr_running == 1 where the one is not network related and you
>>> get random delays.
>>>
>>> Worse still, you have BH (and thereby preemption) disabled, you should
>>> not _ever_ have undefined and indefinite waits like that.
>>>
>>> You also destroy any hope of dropping into lower power states; even when
>>> there's never going to be a packet ever again, also bad.
>> Hmm this patch sometimes makes us exit from the busy loop *earlier*.
>> How can this interfere with dropping into lower power states?
> Ah.. jetlag.. :/ I read it like it owuld indefinitely spin if there was
> only the 'one' task, not avoid the spin unless there was the one task.
>
> The nr_running thing is still horrible, but let me reread this patch
> description to see if it explains why that is a good thing.

I see. How about just exporting a boolean helper like
current_can_busy_loop() and taking care of all of the conditions (pending
bhs, rcu callbacks, runnable processes) in the scheduler code itself?

Jason Wang Sept. 2, 2014, 4:03 a.m. UTC | #24
On 09/01/2014 06:19 PM, Peter Zijlstra wrote:
> On Mon, Sep 01, 2014 at 12:04:34PM +0200, Peter Zijlstra wrote:
>> > On Mon, Sep 01, 2014 at 12:52:19PM +0300, Michael S. Tsirkin wrote:
>>> > > On Mon, Sep 01, 2014 at 11:31:59AM +0200, Peter Zijlstra wrote:
>>>> > > > On Fri, Aug 22, 2014 at 07:01:05AM +0200, Mike Galbraith wrote:
>>>>>> > > > > > +++ b/include/net/busy_poll.h
>>>>>> > > > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
>>>>>> > > > > >  		cpu_relax();
>>>>>> > > > > >  
>>>>>> > > > > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
>>>>>> > > > > > -		 !need_resched() && !busy_loop_timeout(end_time));
>>>>>> > > > > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
>>>>>> > > > > > +		 nr_running_this_cpu() < 2);
>>>>>> > > > > >  
>>>> > > > 
>>>> > > > So as has been said by now; this is horrible.
>>>> > > > 
>>>> > > > We should not export nr_running like this ever. Your usage of < 2
>>>> > > > implies this can be hit with nr_running == 0, and therefore you can also
>>>> > > > hit it with nr_running == 1 where the one is not network related and you
>>>> > > > get random delays.
>>>> > > > 
>>>> > > > Worse still, you have BH (and thereby preemption) disabled, you should
>>>> > > > not _ever_ have undefined and indefinite waits like that.
>>>> > > > 
>>>> > > > You also destroy any hope of dropping into lower power states; even when
>>>> > > > there's never going to be a packet ever again, also bad.
>>> > > 
>>> > > Hmm this patch sometimes makes us exit from the busy loop *earlier*.
>>> > > How can this interfere with dropping into lower power states?
>> > 
>> > Ah.. jetlag.. :/ I read it like it owuld indefinitely spin if there was
>> > only the 'one' task, not avoid the spin unless there was the one task.
>> > 
>> > The nr_running thing is still horrible, but let me reread this patch
>> > description to see if it explains why that is a good thing.
> OK I suppose that more or less makes sense, the contextual behaviour is
> of course tedious in that it makes behaviour less predictable. The
> 'other' tasks might not want to generate data and you then destroy
> throughput by not spinning.

The patch tries to make sure that:
- the performance of busy read is never worse than with it disabled, and
- the performance improvement of a single socket is not achieved by
sacrificing the total performance (of all other processes) of the system.

If the 'other' tasks are also CPU or I/O intensive jobs, we switch to
them, so the total performance is kept or even increased, and the
performance of the current process is guaranteed to be no worse than with
busy read disabled (or even better, since it may still busy read
sometimes when it is the only runnable process). If the 'other' tasks are
not intensive, they just do a little work and sleep soon, so busy read
can still run most of the time during future reads and we may still get
obvious improvements.
> I'm not entirely sure I see how its all supposed to work though; the
> various poll functions call sk_busy_poll() and do_select() also loops.
>
> The patch only kills the sk_busy_poll() loop, but then do_select() will
> still loop and not sleep, so how is this helping?

Yes, the patch only helps processes that do blocking reads (busy read).
For select(), maybe we can do the same thing, but that needs more testing
and thought.
Eliezer Tamir Sept. 2, 2014, 6:03 a.m. UTC | #25
On 02/09/2014 06:35, Jason Wang wrote:
> On 09/01/2014 02:55 PM, Eliezer Tamir wrote:
>> On 26/08/2014 10:16, Jason Wang wrote:
>>> On 08/25/2014 09:16 PM, Eliezer Tamir wrote:

>> Think about the case where two processes are busy polling on the
>> same CPU and the same device queue. Since busy polling processes
>> incoming packets on the queue from any process, this scenario works
>> well currently,
> 
> I see, but looks like we can simply do this by exiting the busy loop
> when ndo_busy_poll() finds something but not for current socket?

I don't think there is a need for that.

When ndo_busy_poll() finds something it feeds it to the stack, which
will process the packet, just as if it came from NAPI polling.
So, if this is data that someone is blocked waiting on, the stack will
wake them up, and then you presumably can decide which app should get
the cpu.

Note that there is no easy way to know, when looking at the
incoming traffic, whether it is important, or even whether you are seeing
a full message. (Maybe you only have 9 packets out of 10?)
The only place this knowledge might exist is in the layers of the
stack closer to the user.

>>  and will not work at all when polling yields to other
>> processes that are of the same priority that are running on the same
>> CPU.
> 
>>
>> Maybe the networking subsystem should maintain a list of device
>> queues that need busypolling and have a thread that would poll
>> all of them when there's nothing better to do.
> 
> Not sure whether this method will scale considering thousands of sockets
> and processes.

There may be millions of sockets, but in most cases there are only a
handful of device queues per CPU to busy poll on. I have tested the epoll
RFC code with hundreds of thousands of sockets and one or two device
queues, and it scales pretty well.

The part I don't like in that code is the cumbersome mechanism I used
to track the socket -> queue relationship. I think that if I had more
time to work on it, I would instead look into extending the epoll
interface so that libevent can tell the kernel what it wants, instead
of having the busypoll code try and learn it.

Cheers,
Eliezer
Peter Zijlstra Sept. 2, 2014, 6:12 a.m. UTC | #26
On Tue, Sep 02, 2014 at 11:38:40AM +0800, Jason Wang wrote:
> I see, how about just exporting a boolean helper like
> current_can_busy_loop() and take care all of the conditions (pending bhs
> and rcu callbacks, runnable processes) in scheduler code itself?

How is that going to help the cases that are hurt by not spinning for a
packet?
Eliezer Tamir Sept. 2, 2014, 6:15 a.m. UTC | #27
On 02/09/2014 06:29, Jason Wang wrote:
> On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
>> On 29/08/2014 06:08, Jason Wang wrote:
>>>> Yes, but rx busy polling only works in process context and does not
>>>> disable bh, so it may be not an issue.
>> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
> 
> True, so we need probably also exit the loop when there are pending bhs.

I'm not so sure, in the typical busy poll scenario, the incoming
traffic is the most time-critical thing in the system.
It's so important that you are willing to trade lots of CPU power
for better latency. The user has decided that he wants to dedicate
this CPU mostly for that. This is not something that plays nice with
other apps, but this is what the user wants.

So, you definitely don't want to starve any bh, and you should
regularly re-enable bh's, but you also don't want to stop everything
at any time a bh is scheduled.

You also want network processing on the queues that are busy polled
to come through busy polling and not through NAPI, which is run in bh
context.

-Eliezer
Jason Wang Sept. 2, 2014, 6:31 a.m. UTC | #28
On 09/02/2014 02:03 PM, Eliezer Tamir wrote:
> On 02/09/2014 06:35, Jason Wang wrote:
>> On 09/01/2014 02:55 PM, Eliezer Tamir wrote:
>>> On 26/08/2014 10:16, Jason Wang wrote:
>>>> On 08/25/2014 09:16 PM, Eliezer Tamir wrote:
>>> Think about the case where two processes are busy polling on the
>>> same CPU and the same device queue. Since busy polling processes
>>> incoming packets on the queue from any process, this scenario works
>>> well currently,
>> I see, but looks like we can simply do this by exiting the busy loop
>> when ndo_busy_poll() finds something but not for current socket?
> I don't think there is a need for that.
>
> When ndo_busy_poll() finds something it feeds it to the stack, which
> will process the packet, just as if it came from NAPI polling.
> So, if this is data that someone is blocked waiting on, the stack will
> wake them up, and then you presumably can decide which app should get
> the cpu.

Yes, but the current code cannot do this. In most cases, the newly woken
process has no chance to run if another process is busy looping on the
same cpu.
>
> Note, that there is no easy way to know, when looking at the
> incoming traffic, whether it is important, or even if you are seeing
> a full message. (Maybe you only have 9 packets out of 10?)
> The only place this knowledge might exist is in the layers of the
> stack closer to the user.
>
>>>  and will not work at all when polling yields to other
>>> processes that are of the same priority that are running on the same
>>> CPU.
>>> Maybe the networking subsystem should maintain a list of device
>>> queues that need busypolling and have a thread that would poll
>>> all of them when there's nothing better to do.
>> Not sure whether this method will scale considering thousands of sockets
>> and processes.
> There may be millions of sockets, but in most cases only a handful of
> device queues per CPU to busy poll on. I have tested the epoll rfc
> code with hundreds of thousands of sockets and one or two device
> queues and is scales pretty well.
>
> The part I don't like in that code is the cumbersome mechanism I used
> to track the socket -> queue relationship. I think that if I had more
> time to work on it, I would instead look into extending the epoll
> interface so that libevent can tell the kernel what it wants, instead
> of having the busypoll code try and learn it.

I'd like to have a look at this RFC. Could you please give me a pointer?
I did a quick search on the kernel mailing list but didn't find it.

Thanks
>
> Cheers,
> Eliezer

Jason Wang Sept. 2, 2014, 7:19 a.m. UTC | #29
On 09/02/2014 02:12 PM, Peter Zijlstra wrote:
> On Tue, Sep 02, 2014 at 11:38:40AM +0800, Jason Wang wrote:
>> > I see, how about just exporting a boolean helper like
>> > current_can_busy_loop() and take care all of the conditions (pending bhs
>> > and rcu callbacks, runnable processes) in scheduler code itself?
> How is that going to help the cases that are hurt by not spinning for a
> packet?

The patch does not help for this case. Spinning in that case only helps a
single process but hurts all the others. Those other processes may not
use busy polling, or may not even be network related. Spinning also
effectively gives a higher priority to the process that uses busy polling
or reading, which is unfair. And how much a single process can gain by
continuing to spin while several other tasks could be getting work done
is questionable: it's quite possible that we get only one packet, or
none, after wasting lots of cpu cycles when there are thousands or more
sockets.
Jason Wang Sept. 2, 2014, 7:37 a.m. UTC | #30
On 09/02/2014 02:15 PM, Eliezer Tamir wrote:
> On 02/09/2014 06:29, Jason Wang wrote:
>> On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
>>> On 29/08/2014 06:08, Jason Wang wrote:
>>>>> Yes, but rx busy polling only works in process context and does not
>>>>> disable bh, so it may be not an issue.
>>> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
>> True, so we need probably also exit the loop when there are pending bhs.
> I'm not so sure, in the typical busy poll scenario, the incoming
> traffic is the most time-critical thing in the system.
> It's so important that you are willing to trade lots of CPU power
> for better latency. The user has decided that he wants to dedicate
> this CPU mostly for that. 

But the user should increase the process priority or use cgroups in this case.
> This is not something that plays nice with
> other apps, but this is what the user wants.

So busy polling effectively gives the process a higher priority than
other processes.
> So, you definitely don't want to starve any bh, and you should
> regularly re-enable bh's, but you also don't want to stop everything
> at any time a bh is scheduled.

If I get your meaning, you may want to call rcu_read_lock_bh() and get
the socket's napi id inside the do {} loop? That seems to prevent bhs
from being starved and can also handle the case where the packets come
from different NAPIs.
>
> You also want network processing on the queues that are busy polled
> to come through busy polling and not through NAPI, which is run in bh
> context.
>
> -Eliezer

Michael S. Tsirkin Sept. 2, 2014, 8:31 a.m. UTC | #31
On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:
> On 02/09/2014 06:29, Jason Wang wrote:
> > On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
> >> On 29/08/2014 06:08, Jason Wang wrote:
> >>>> Yes, but rx busy polling only works in process context and does not
> >>>> disable bh, so it may be not an issue.
> >> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
> > 
> > True, so we need probably also exit the loop when there are pending bhs.
> 
> I'm not so sure, in the typical busy poll scenario, the incoming
> traffic is the most time-critical thing in the system.
> It's so important that you are willing to trade lots of CPU power
> for better latency. The user has decided that he wants to dedicate
> this CPU mostly for that. This is not something that plays nice with
> other apps, but this is what the user wants.

I think most applications wouldn't interpret this flag as "burn up CPU, I
don't care about the result"; what apps want is more "maximise throughput
and minimise latency even if the throughput/CPU ratio goes down".
Jason posted benchmarks that show throughput going up because other
processes get more of a chance to run, so this seems consistent
with that goal.


> So, you definitely don't want to starve any bh, and you should
> regularly re-enable bh's, but you also don't want to stop everything
> at any time a bh is scheduled.
> 
> You also want network processing on the queues that are busy polled
> to come through busy polling and not through NAPI, which is run in bh
> context.
> 
> -Eliezer
Peter Zijlstra Sept. 2, 2014, 10:24 a.m. UTC | #32
On Tue, Sep 02, 2014 at 12:03:42PM +0800, Jason Wang wrote:
> On 09/01/2014 06:19 PM, Peter Zijlstra wrote:
> > OK I suppose that more or less makes sense, the contextual behaviour is
> > of course tedious in that it makes behaviour less predictable. The
> > 'other' tasks might not want to generate data and you then destroy
> > throughput by not spinning.
> 
> The patch try to make sure:
> > - the performance of busy read was not worse than it was disabled in
> any cases.
> - the performance improvement of a single socket was not achieved by
> sacrificing the total performance (all other processes) of the system
>  
> If 'other' tasks are also CPU or I/O intensive jobs, we switch to do
> them so the total performance were kept or even increased, and the
> performance of current process were guaranteed not worse than when busy
> read was disabled (or even better since it may still do busy read
> sometimes when it was the only runnable process). If 'other' task are
> not intensive, they just do little work and sleep soon, then the busy
> read can still work in most of the time during the future reads, we may
> still get obvious improvements.

Not entirely true; the select/poll whatever will now block, which means
we need a wakeup, which increases the latency immensely.

> > I'm not entirely sure I see how its all supposed to work though; the
> > various poll functions call sk_busy_poll() and do_select() also loops.
> >
> > The patch only kills the sk_busy_poll() loop, but then do_select() will
> > still loop and not sleep, so how is this helping?
> 
> Yes, the patch only help for processes who did a blocking reads (busy
> read). For select(), maybe we can do the same thing but need more test
> and thoughts.

What's the blocking read callgraph; how do we end up in sk_busy_poll() there?

But that's another reason the patch is wrong.
Eliezer Tamir Sept. 3, 2014, 6:21 a.m. UTC | #33
On 02/09/2014 09:31, Jason Wang wrote:
> On 09/02/2014 02:03 PM, Eliezer Tamir wrote:
>> On 02/09/2014 06:35, Jason Wang wrote:
>>> Not sure whether this method will scale considering thousands of sockets
>>> and processes.
>> There may be millions of sockets, but in most cases only a handful of
>> device queues per CPU to busy poll on. I have tested the epoll rfc
>> code with hundreds of thousands of sockets and one or two device
> queues and it scales pretty well.
>>
>> The part I don't like in that code is the cumbersome mechanism I used
>> to track the socket -> queue relationship. I think that if I had more
>> time to work on it, I would instead look into extending the epoll
>> interface so that libevent can tell the kernel what it wants, instead
>> of having the busypoll code try and learn it.
> 
> I'd like to have a look at this rfc. Could you please give me a pointer?
> I've done a quick search on kernel mailing list but didn't find it.

https://lkml.org/lkml/2013/8/21/192

Cheers,
Eliezer
Eliezer Tamir Sept. 3, 2014, 6:49 a.m. UTC | #34
On 02/09/2014 11:31, Michael S. Tsirkin wrote:
> On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:
>> On 02/09/2014 06:29, Jason Wang wrote:
>>> On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
>>>> On 29/08/2014 06:08, Jason Wang wrote:
>>>>>> Yes, but rx busy polling only works in process context and does not
>>>>>> disable bh, so it may be not an issue.
>>>> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
>>>
>>> True, so we need probably also exit the loop when there are pending bhs.
>>
>> I'm not so sure, in the typical busy poll scenario, the incoming
>> traffic is the most time-critical thing in the system.
>> It's so important that you are willing to trade lots of CPU power
>> for better latency. The user has decided that he wants to dedicate
>> this CPU mostly for that. This is not something that plays nice with
>> other apps, but this is what the user wants.
> 
> I think most applications wouldn't interpret this flag as "burn up CPU I don't
> care what is the result", what apps want is more of "maximise throughput
> and minimise latency even if throughput/CPU ratio goes down".
> Jason posted benchmarks that show throughput going up because other
> processes get more of a chance to run, so this seems consistent
> with that goal.

Busy polling is not a general purpose feature, it's not something you
can casually turn on and will "just work". Most applications should not
be using busy polling. Currently it is used by multiserver applications
that you spend days tuning to specific platforms.

What the user wants is to lower both avg and maximum latencies, at the
expense of everything else including power efficiency and sometimes
even throughput. The only exception is making the system crash ;)

While letting other things take precedence over busy polling might not
hurt the avg latency much, it will kill your maximum latency.

-Eliezer
Jason Wang Sept. 3, 2014, 6:58 a.m. UTC | #35
On 09/02/2014 06:24 PM, Peter Zijlstra wrote:
> On Tue, Sep 02, 2014 at 12:03:42PM +0800, Jason Wang wrote:
>> > On 09/01/2014 06:19 PM, Peter Zijlstra wrote:
>>> > > OK I suppose that more or less makes sense, the contextual behaviour is
>>> > > of course tedious in that it makes behaviour less predictable. The
>>> > > 'other' tasks might not want to generate data and you then destroy
>>> > > throughput by not spinning.
>> > 
>> > The patch try to make sure:
>> > - the performance of busy read was not worse than it was disabled in
>> > any cases.
>> > - the performance improvement of a single socket was not achieved by
>> > sacrificing the total performance (all other processes) of the system
>> >  
>> > If 'other' tasks are also CPU or I/O intensive jobs, we switch to do
>> > them so the total performance were kept or even increased, and the
>> > performance of current process were guaranteed not worse than when busy
>> > read was disabled (or even better since it may still do busy read
>> > sometimes when it was the only runnable process). If 'other' task are
>> > not intensive, they just do little work and sleep soon, then the busy
>> > read can still work in most of the time during the future reads, we may
>> > still get obvious improvements
> Not entirely true; the select/poll whatever will now block, which means
> we need a wakeup, which increases the latency immensely.

Not sure I get your meaning. This patch does not change the logic or
dynamics of select/poll, since sock_poll() always calls sk_busy_loop() with
nonblock set to true. This means sk_busy_loop() will only try ndo_busy_poll()
once, regardless of the result of the other checks. The busy polling is
in fact done by its caller.
>>> > > I'm not entirely sure I see how its all supposed to work though; the
>>> > > various poll functions call sk_busy_poll() and do_select() also loops.
>>> > >
>>> > > The patch only kills the sk_busy_poll() loop, but then do_select() will
>>> > > still loop and not sleep, so how is this helping?
>> > 
>> > Yes, the patch only help for processes who did a blocking reads (busy
>> > read). For select(), maybe we can do the same thing but need more test
>> > and thoughts.
> What's the blocking read callgraph; how do we end up in sk_busy_poll() there?
>
> But that's another reason the patch is wrong.

The patch only tries to improve the performance of busy read (and test
results show impressive changes). It does not change anything for busy
poll. Consider two processes on one cpu, one doing busy read and one
doing busy polling. This patch may in fact help the busy polling
performance in this case.

It's good to discuss the ideas of busy poll together, but that is out of
the scope of this patch. We can try to do optimizations on top.
Jason Wang Sept. 3, 2014, 6:59 a.m. UTC | #36
On 09/03/2014 02:21 PM, Eliezer Tamir wrote:
> On 02/09/2014 09:31, Jason Wang wrote:
>> On 09/02/2014 02:03 PM, Eliezer Tamir wrote:
>>> On 02/09/2014 06:35, Jason Wang wrote:
>>>> Not sure whether this method will scale considering thousands of sockets
>>>> and processes.
>>> There may be millions of sockets, but in most cases only a handful of
>>> device queues per CPU to busy poll on. I have tested the epoll rfc
>>> code with hundreds of thousands of sockets and one or two device
>>> queues and it scales pretty well.
>>>
>>> The part I don't like in that code is the cumbersome mechanism I used
>>> to track the socket -> queue relationship. I think that if I had more
>>> time to work on it, I would instead look into extending the epoll
>>> interface so that libevent can tell the kernel what it wants, instead
>>> of having the busypoll code try and learn it.
>> I'd like to have a look at this rfc. Could you please give me a pointer?
>> I've done a quick search on kernel mailing list but didn't find it.
> https://lkml.org/lkml/2013/8/21/192
>
> Cheers,
> Eliezer
> --

Thanks. I will have a look at this series.
Jason Wang Sept. 3, 2014, 7:33 a.m. UTC | #37
On 09/03/2014 02:49 PM, Eliezer Tamir wrote:
> On 02/09/2014 11:31, Michael S. Tsirkin wrote:
>> On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:
>>> On 02/09/2014 06:29, Jason Wang wrote:
>>>> On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
>>>>> On 29/08/2014 06:08, Jason Wang wrote:
>>>>>>> Yes, but rx busy polling only works in process context and does not
>>>>>>> disable bh, so it may be not an issue.
>>>>> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
>>>> True, so we need probably also exit the loop when there are pending bhs.
>>> I'm not so sure, in the typical busy poll scenario, the incoming
>>> traffic is the most time-critical thing in the system.
>>> It's so important that you are willing to trade lots of CPU power
>>> for better latency. The user has decided that he wants to dedicate
>>> this CPU mostly for that. This is not something that plays nice with
>>> other apps, but this is what the user wants.
>> I think most applications wouldn't interpret this flag as "burn up CPU I don't
>> care what is the result", what apps want is more of "maximise throughput
>> and minimise latency even if throughput/CPU ratio goes down".
>> Jason posted benchmarks that show throughput going up because other
>> processes get more of a chance to run, so this seems consistent
>> with that goal.
> Busy polling is not a general purpose feature, it's not something you
> can casually turn on and will "just work". Most applications should not
> be using busy polling.

How about busy read? It is enabled through a global sysctl alone, so
applications can make use of it without being rewritten.
>  Currently it is used by multiserver applications
> that you spend days tuning to specific platforms.

I agree the polling code needs more thought, but the patch only changes
busy read.

A new issue arises for virt users. I implemented busy polling for virtio-net,
but we don't want one vcpu to monopolize the cpu if there are tasks on
other vcpus. We may need some hint from host to guest to let it exit the
loop when needed.
>
> What the user wants is to lower both avg and maximum latencies, at the
> expense of everything else including power efficiency and sometimes
> even throughput. The only exception is making the system crash ;)
>
> While letting other things take precedence over busy polling might not
> hurt the avg latency much, it will kill your maximum latency.
>
> -Eliezer


Michael S. Tsirkin Sept. 3, 2014, 7:51 a.m. UTC | #38
On Wed, Sep 03, 2014 at 09:49:10AM +0300, Eliezer Tamir wrote:
> On 02/09/2014 11:31, Michael S. Tsirkin wrote:
> > On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:
> >> On 02/09/2014 06:29, Jason Wang wrote:
> >>> On 09/01/2014 02:39 PM, Eliezer Tamir wrote:
> >>>> On 29/08/2014 06:08, Jason Wang wrote:
> >>>>>> Yes, but rx busy polling only works in process context and does not
> >>>>>> disable bh, so it may be not an issue.
> >>>> sk_busy_loop() uses rcu_read_lock_bh(), so it does run with bh disabled.
> >>>
> >>> True, so we need probably also exit the loop when there are pending bhs.
> >>
> >> I'm not so sure, in the typical busy poll scenario, the incoming
> >> traffic is the most time-critical thing in the system.
> >> It's so important that you are willing to trade lots of CPU power
> >> for better latency. The user has decided that he wants to dedicate
> >> this CPU mostly for that. This is not something that plays nice with
> >> other apps, but this is what the user wants.
> > 
> > I think most applications wouldn't interpret this flag as "burn up CPU I don't
> > care what is the result", what apps want is more of "maximise throughput
> > and minimise latency even if throughput/CPU ratio goes down".
> > Jason posted benchmarks that show throughput going up because other
> > processes get more of a chance to run, so this seems consistent
> > with that goal.
> 
> Busy polling is not a general purpose feature, it's not something you
> can casually turn on and will "just work". Most applications should not
> be using busy polling. Currently it is used by multiserver applications
> that you spend days tuning to specific platforms.
> 
> What the user wants is to lower both avg and maximum latencies, at the
> expense of everything else including power efficiency and sometimes
> even throughput. The only exception is making the system crash ;)
> 
> While letting other things take precedence over busy polling might not
> hurt the avg latency much, it will kill your maximum latency.
> 
> -Eliezer

If the scheduler happens to run both server and client on the
same CPU, polling will hurt maximum latency even more.
So I guess different users want different things.

How about applications giving us a hint what they prefer?
For example, a new flag that says "I don't have anything useful to do so
let's do busy polling but my server is on the local system, so please
only poll if CPU is otherwise idle".
Michael S. Tsirkin Sept. 3, 2014, 8:09 a.m. UTC | #39
On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote:
> 
> > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > index 1d67fb6..8a33fb2 100644
> > > --- a/include/net/busy_poll.h
> > > +++ b/include/net/busy_poll.h
> > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > >  		cpu_relax();
> > >  
> > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > +		 nr_running_this_cpu() < 2);
> 
> So it's generally a bad idea to couple to the scheduler through 
> such a low level, implementation dependent value like 
> 'nr_running', causing various problems:
> 
>  - It misses important work that might be pending on this CPU,
>    like RCU callbacks.
> 
>  - It will also over-credit task contexts that might be
>    runnable, but which are less important than the currently
>    running one: such as a SCHED_IDLE task
> 
>  - It will also over-credit even regular SCHED_NORMAL tasks, if
>    this current task is more important than them: say
>    SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
>    with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
>    on an otherwise idle system.
> 
> So what you want is a more sophisticated query to the 
> scheduler, a sched_expected_runtime() method that returns the 
> number of nsecs this task is expected to run in the future, 
> which returns 0 if you will be scheduled away on the next 
> schedule(), and returns infinity for a high prio SCHED_FIFO 
> task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
> 
> It will return a regular time slice value in other cases, when 
> there's some load on the CPU.
> 
> The polling logic can then do its decision based on that time 
> value.
> 
> All this can be done reasonably fast and lockless in most 
> cases, so that it can be called from busy-polling code.
> 
> An added advantage would be that this approach consolidates the 
> somewhat random need_resched() checks into this method as well.
> 
> In any case I don't agree with the nr_running_this_cpu() 
> method.
> 
> (Please Cc: me and lkml to future iterations of this patchset.)
> 
> Thanks,
> 
> 	Ingo

This sounds very nice.

We could then have applications say "I want to poll
only if no one else has urgent work" where
urgent is defined as "has to run within next N nseconds".

Really just a bit more flexibility added to busy polling.

Peter, Ingo, does this sound good?
Peter Zijlstra Sept. 3, 2014, 9:30 a.m. UTC | #40
On Wed, Sep 03, 2014 at 02:58:33PM +0800, Jason Wang wrote:
> On 09/02/2014 06:24 PM, Peter Zijlstra wrote:
> The patch only tries to improve the performance of busy read (and test
> results show impressive changes). It does not change anything for busy
> poll. Consider two processes on one cpu, one doing busy read and one
> doing busy polling. This patch may in fact help the busy polling
> performance in this case.
> 
> It's good to discuss the ideas of busy poll together, but that is out of
> the scope of this patch. We can try to do optimizations on top.

No, that's just wrong; blocking read and blocking select should behave the
same.
Peter Zijlstra Sept. 3, 2014, 9:36 a.m. UTC | #41
On Wed, Sep 03, 2014 at 03:33:23PM +0800, Jason Wang wrote:
> A new issue is for virt users. I implement busy polling for virtio-net
> but we don't want one vcpu monopoly the cpu if there's some tasks on
> other vcpus. We may need some hint from host to guest to let it exit the
> loop if needed.

Aw god.. virt nonsense is wrecking things again.
Michael S. Tsirkin Sept. 3, 2014, 9:59 a.m. UTC | #42
On Wed, Sep 03, 2014 at 11:36:05AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 03, 2014 at 03:33:23PM +0800, Jason Wang wrote:
> > A new issue is for virt users. I implement busy polling for virtio-net
> > but we don't want one vcpu monopoly the cpu if there's some tasks on
> > other vcpus. We may need some hint from host to guest to let it exit the
> > loop if needed.
> 
> Aw god.. virt nonsense is wrecking things again.

I don't see a reason to bring virt up here.  Original patch shows
improvement with a simple loopback netperf test.

Some people like the linux kernel so much, they want to run many
instances of it on the same CPU. Understandable.
So of course everything is then magnified x2 as you
are running two schedulers and two networking stacks,
but let's focus on the important thing which is
better interaction between busy polling and the linux scheduler.
Eliezer Tamir Sept. 4, 2014, 6:51 a.m. UTC | #43
On 03/09/2014 10:51, Michael S. Tsirkin wrote:
> On Wed, Sep 03, 2014 at 09:49:10AM +0300, Eliezer Tamir wrote:
>> On 02/09/2014 11:31, Michael S. Tsirkin wrote:
>>> On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:

>> Busy polling is not a general purpose feature, it's not something you
>> can casually turn on and will "just work". Most applications should not
>> be using busy polling. Currently it is used by multiserver applications
>> that you spend days tuning to specific platforms.
>>
>> What the user wants is to lower both avg and maximum latencies, at the
>> expense of everything else including power efficiency and sometimes
>> even throughput. The only exception is making the system crash ;)
>>
>> While letting other things take precedence over busy polling might not
>> hurt the avg latency much, it will kill your maximum latency.
> 
> If scheduler happens to run both server and client on the
> same CPU, polling will hurt maximum latency even more.
> So I guess different users want different things.
> 
> How about applications giving us a hint what they prefer?
> For example, a new flag that says "I don't have anything useful to do so
> let's do busy polling but my server is on the local system, so please
> only poll if CPU is otherwise idle".

I'm sorry for being ambiguous, when I said multi-server application, I
meant an app that runs on more than one server machine.

The loopback test is as far as I know not interesting.

Of course if busypoll becomes interesting for virtualization over
loopback, I have no problem with that, provided that there is a way to
get the old behavior and that it is well documented.

-Eliezer
Michael S. Tsirkin Sept. 4, 2014, 8:25 a.m. UTC | #44
On Thu, Sep 04, 2014 at 09:51:33AM +0300, Eliezer Tamir wrote:
> On 03/09/2014 10:51, Michael S. Tsirkin wrote:
> > On Wed, Sep 03, 2014 at 09:49:10AM +0300, Eliezer Tamir wrote:
> >> On 02/09/2014 11:31, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 02, 2014 at 09:15:18AM +0300, Eliezer Tamir wrote:
> 
> >> Busy polling is not a general purpose feature, it's not something you
> >> can casually turn on and will "just work". Most applications should not
> >> be using busy polling. Currently it is used by multiserver applications
> >> that you spend days tuning to specific platforms.
> >>
> >> What the user wants is to lower both avg and maximum latencies, at the
> >> expense of everything else including power efficiency and sometimes
> >> even throughput. The only exception is making the system crash ;)
> >>
> >> While letting other things take precedence over busy polling might not
> >> hurt the avg latency much, it will kill your maximum latency.
> > 
> > If scheduler happens to run both server and client on the
> > same CPU, polling will hurt maximum latency even more.
> > So I guess different users want different things.
> > 
> > How about applications giving us a hint what they prefer?
> > For example, a new flag that says "I don't have anything useful to do so
> > let's do busy polling but my server is on the local system, so please
> > only poll if CPU is otherwise idle".
> 
> I'm sorry for being ambiguous, when I said multi-server application, I
> meant an app that runs on more than one server machine.
> 
> The loopback test is as far as I know not interesting.

It's not interesting for ethernet NIC vendors :).

But people do deploy servers and clients on the same box, and
improving latency for them would be beneficial.

Loopback is widely used by desktops, containers, vpns, etc etc

Getting more theoretical, even with external networking, any
application processing traffic in more than one thread behaves the same.
For example, it's not uncommon to have a thread get messages from a
socket, and pass them out to other threads for processing.  At the
moment the polling thread would sometimes delay the processing if it
lands on the same CPU.

> Of course if busypoll becomes interesting for virtualization over
> loopback,

IMO bringing up virtualization just muddies the waters, it's a general
problem that arises whenever you run a lot of stuff on your CPU.

> I have no problem with that, provided that there is a way to
> get the old behavior and that it is well documented.
> 
> -Eliezer

I agree here, since some application might be relying on the old behavior.
Michael S. Tsirkin April 11, 2016, 4:31 p.m. UTC | #45
On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote:
> 
> > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > index 1d67fb6..8a33fb2 100644
> > > --- a/include/net/busy_poll.h
> > > +++ b/include/net/busy_poll.h
> > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > >  		cpu_relax();
> > >  
> > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > +		 nr_running_this_cpu() < 2);
> 
> So it's generally a bad idea to couple to the scheduler through 
> such a low level, implementation dependent value like 
> 'nr_running', causing various problems:
> 
>  - It misses important work that might be pending on this CPU,
>    like RCU callbacks.
> 
>  - It will also over-credit task contexts that might be
>    runnable, but which are less important than the currently
>    running one: such as a SCHED_IDLE task
> 
>  - It will also over-credit even regular SCHED_NORMAL tasks, if
>    this current task is more important than them: say
>    SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
>    with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
>    on an otherwise idle system.
> 
> So what you want is a more sophisticated query to the 
> scheduler, a sched_expected_runtime() method that returns the 
> number of nsecs this task is expected to run in the future, 
> which returns 0 if you will be scheduled away on the next 
> schedule(), and returns infinity for a high prio SCHED_FIFO 
> task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
> 
> It will return a regular time slice value in other cases, when 
> there's some load on the CPU.
> 
> The polling logic can then do its decision based on that time 
> value.
> 
> All this can be done reasonably fast and lockless in most 
> cases, so that it can be called from busy-polling code.
>
> An added advantage would be that this approach consolidates the 
> somewhat random need_resched() checks into this method as well.
> 
> In any case I don't agree with the nr_running_this_cpu() 
> method.
> 
> (Please Cc: me and lkml to future iterations of this patchset.)
> 
> Thanks,
> 
> 	Ingo

I tried to look into this: it might be even nicer to add
sched_expected_to_run(time) which tells us whether we expect the current
task to keep running for the next XX nsecs.

For the fair scheduler, it seems that it could be as simple as

+static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
+{
+       struct sched_entity *left;
+       struct sched_entity *curr = cfs_rq->curr;
+
+       if (!curr || !curr->on_rq)
+               return false;
+
+       left = __pick_first_entity(cfs_rq);
+       if (!left)
+               return true;
+
+       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
+                    left->vruntime) < 0;
+}

The reason it seems easier is because that way we can reuse
calc_delta_fair and don't have to do the reverse translation
from vruntime to nsec.

And I guess if we do this with interrupts disabled, and only poke
at the current CPU's rq, we know first entity
won't go away so we don't need locks? 

Is this close to what you had in mind?

Thanks,
Ingo Molnar April 13, 2016, 7:20 a.m. UTC | #46
* Michael S. Tsirkin <mst@redhat.com> wrote:

> On Fri, Aug 22, 2014 at 09:36:53AM +0200, Ingo Molnar wrote:
> > 
> > > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > > index 1d67fb6..8a33fb2 100644
> > > > --- a/include/net/busy_poll.h
> > > > +++ b/include/net/busy_poll.h
> > > > @@ -109,7 +109,8 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > > >  		cpu_relax();
> > > >  
> > > >  	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
> > > > -		 !need_resched() && !busy_loop_timeout(end_time));
> > > > +		 !need_resched() && !busy_loop_timeout(end_time) &&
> > > > +		 nr_running_this_cpu() < 2);
> > 
> > So it's generally a bad idea to couple to the scheduler through 
> > such a low level, implementation dependent value like 
> > 'nr_running', causing various problems:
> > 
> >  - It misses important work that might be pending on this CPU,
> >    like RCU callbacks.
> > 
> >  - It will also over-credit task contexts that might be
> >    runnable, but which are less important than the currently
> >    running one: such as a SCHED_IDLE task
> > 
> >  - It will also over-credit even regular SCHED_NORMAL tasks, if
> >    this current task is more important than them: say
> >    SCHED_FIFO. A SCHED_FIFO workload should run just as fast 
> >    with SCHED_NORMAL tasks around, as a SCHED_NORMAL workload 
> >    on an otherwise idle system.
> > 
> > So what you want is a more sophisticated query to the 
> > scheduler, a sched_expected_runtime() method that returns the 
> > number of nsecs this task is expected to run in the future, 
> > which returns 0 if you will be scheduled away on the next 
> > schedule(), and returns infinity for a high prio SCHED_FIFO 
> > task, or if this SCHED_NORMAL task is on an otherwise idle CPU.
> > 
> > It will return a regular time slice value in other cases, when 
> > there's some load on the CPU.
> > 
> > The polling logic can then do its decision based on that time 
> > value.
> > 
> > All this can be done reasonably fast and lockless in most 
> > cases, so that it can be called from busy-polling code.
> >
> > An added advantage would be that this approach consolidates the 
> > somewhat random need_resched() checks into this method as well.
> > 
> > In any case I don't agree with the nr_running_this_cpu() 
> > method.
> > 
> > (Please Cc: me and lkml to future iterations of this patchset.)
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> I tried to look into this: it might be even nicer to add
> sched_expected_to_run(time) which tells us whether we expect the current
> task to keep running for the next XX nsecs.
> 
> For the fair scheduler, it seems that it could be as simple as
> 
> +static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
> +{
> +       struct sched_entity *left;
> +       struct sched_entity *curr = cfs_rq->curr;
> +
> +       if (!curr || !curr->on_rq)
> +               return false;
> +
> +       left = __pick_first_entity(cfs_rq);
> +       if (!left)
> +               return true;
> +
> +       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
> +                    left->vruntime) < 0;
> +}
> 
> The reason it seems easier is because that way we can reuse
> calc_delta_fair and don't have to do the reverse translation
> from vruntime to nsec.
> 
> And I guess if we do this with interrupts disabled, and only poke
> at the current CPU's rq, we know first entity
> won't go away so we don't need locks? 
> 
> Is this close to what you had in mind?

Yeah, fair enough ;-)

I'm not 100% convinced about the interface, but the model looks good to me.
Let's try it - I don't have fundamental objections anymore.

I also agree that it could be done lockless - although I'd suggest two steps: 
first do the dumb thing with the proper scheduler lock(s) held, then another patch 
which removes the locks for a bit more performance. That will make any subtle 
crashes/races bisectable.

Thanks,

	Ingo
Peter Zijlstra April 13, 2016, 1:28 p.m. UTC | #47
On Mon, Apr 11, 2016 at 07:31:57PM +0300, Michael S. Tsirkin wrote:
> +static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
> +{
> +       struct sched_entity *left;
> +       struct sched_entity *curr = cfs_rq->curr;
> +
> +       if (!curr || !curr->on_rq)
> +               return false;
> +
> +       left = __pick_first_entity(cfs_rq);
> +       if (!left)
> +               return true;
> +
> +       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
> +                    left->vruntime) < 0;
> +}
> 
> The reason it seems easier is because that way we can reuse
> calc_delta_fair and don't have to do the reverse translation
> from vruntime to nsec.
> 
> And I guess if we do this with interrupts disabled, and only poke
> at the current CPU's rq, we know first entity
> won't go away so we don't need locks? 

Nope, not true. Current isn't actually in the tree, and any other task
is subject to being moved at any time.

Even if current was in the tree, there is no guarantee it is
->rb_leftmost; imagine a task being migrated in that has a smaller
vruntime.

So this really cannot work without locks :/

I've not thought about the actual problem you're trying to solve; but I
figured I'd let you know this before you continue down this path.
Michael S. Tsirkin April 13, 2016, 1:51 p.m. UTC | #48
On Wed, Apr 13, 2016 at 03:28:03PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 11, 2016 at 07:31:57PM +0300, Michael S. Tsirkin wrote:
> > +static bool expected_to_run_fair(struct cfs_rq *cfs_rq, s64 t)
> > +{
> > +       struct sched_entity *left;
> > +       struct sched_entity *curr = cfs_rq->curr;
> > +
> > +       if (!curr || !curr->on_rq)
> > +               return false;
> > +
> > +       left = __pick_first_entity(cfs_rq);
> > +       if (!left)
> > +               return true;
> > +
> > +       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
> > +                    left->vruntime) < 0;
> > +}
> > 
> > The reason it seems easier is because that way we can reuse
> > calc_delta_fair and don't have to do the reverse translation
> > from vruntime to nsec.
> > 
> > And I guess if we do this with interrupts disabled, and only poke
> > at the current CPU's rq, we know first entity
> > won't go away so we don't need locks? 
> 
> Nope, not true. Current isn't actually in the tree, and any other task
> is subject to being moved at any time.
> Even if current was in the tree, there is no guarantee it is
> ->rb_leftmost; imagine a task being migrated in that has a smaller
> vruntime.
> 
> So this really cannot work without locks :/
> 
> I've not thought about the actual problem you're trying to solve; but I
> figured I'd let you know this before you continue down this path.

Hmm. This is in fact called in the context of a given task,
so maybe I should just change the API and use
	&task->se
instead of
	cfs_rq->curr:


static bool expected_to_run_fair(struct cfs_rq *cfs_rq, struct task_struct *task, s64 t)
{
       struct sched_entity *left;
       struct sched_entity *curr = &task->se;
       if (!curr || !curr->on_rq)
              return false;

       left = __pick_first_entity(cfs_rq);
       if (!left)
               return true;

       return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
                    left->vruntime) < 0;
}

This way it is the caller's responsibility to make sure the task
is not going away.
Left here is on the tree, so it's not going away, right?
Peter Zijlstra April 14, 2016, 12:55 a.m. UTC | #49
On Tue, Sep 02, 2014 at 02:31:53PM +0800, Jason Wang wrote:
> Yes, but the current code cannot do this. In most cases, the newly
> woken process has no chance to run if another process is busy-looping
> on the same CPU.

That sounds weird; a process that slept will have an effective vruntime
deficit vs one that has been running for a while.
Peter Zijlstra April 14, 2016, 12:58 a.m. UTC | #50
On Wed, Apr 13, 2016 at 04:51:36PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 13, 2016 at 03:28:03PM +0200, Peter Zijlstra wrote:

> > Nope, not true. Current isn't actually in the tree, and any other task
> > is subject to being moved at any time.
> > Even if current was in the tree, there is no guarantee it is
> > ->rb_leftmost; imagine a task being migrated in that has a smaller
> > vruntime.
> > 
> > So this really cannot work without locks :/
> > 
> > I've not thought about the actual problem you're trying to solve; but I
> > figured I'd let you know this before you continue down this path.

> static bool expected_to_run_fair(struct cfs_rq *cfs_rq, struct task_struct *task, s64 t)
> {
>        struct sched_entity *left;
>        struct sched_entity *curr = &task->se;
>        if (!curr || !curr->on_rq)
>               return false;
> 
>        left = __pick_first_entity(cfs_rq);
>        if (!left)
>                return true;
> 
>        return (s64)(curr->vruntime + calc_delta_fair(t, curr) -
>                     left->vruntime) < 0;
> }
> 
> Left here is on tree so it's not going away, right?

No, left is the leftmost entry on the tree, it can go away at any time.
You cannot touch the tree without locks.

Someone can come and migrate the task to another CPU, start executing
it, and it can die between you finding the left pointer and
dereferencing it.

Patch
diff mbox series

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 1d67fb6..8a33fb2 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -109,7 +109,8 @@  static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 		cpu_relax();
 
 	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue) &&
-		 !need_resched() && !busy_loop_timeout(end_time));
+		 !need_resched() && !busy_loop_timeout(end_time) &&
+		 nr_running_this_cpu() < 2);
 
 	rc = !skb_queue_empty(&sk->sk_receive_queue);
 out: