* rps: question
@ 2010-02-07 18:42 jamal
  2010-02-08  5:58 ` Tom Herbert
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-02-07 18:42 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, netdev, robert, David Miller

Hi Tom,

First off: Kudos on the numbers you are seeing; they are
impressive. Do you have any numbers on a forwarding path test?

My first impression when i saw the numbers was one of surprise.
Back in the days when we tried to split stack processing the way
you did (it was one of the experiments on early NAPI), IPIs were
_damn_ expensive. What changed in current architectures that makes
this more palatable? IPIs are still synchronous AFAIK (and the more
IPI receivers there are, the worse the ACK latency). Did you test this
across other archs, or say on 3-4 year old machines?

cheers,
jamal 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps: question
  2010-02-07 18:42 rps: question jamal
@ 2010-02-08  5:58 ` Tom Herbert
  2010-02-08 15:09   ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Tom Herbert @ 2010-02-08  5:58 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller

On Sun, Feb 7, 2010 at 10:42 AM, jamal <hadi@cyberus.ca> wrote:
>
> Hi Tom,
>
> First off: Kudos on the numbers you are seeing; they are
> impressive. Do you have any numbers on a forwarding path test?
>
I don't have specific numbers, although we are using this on an
application doing forwarding, and the numbers seem in line with what
we see for an end host.

> My first impression when i saw the numbers was one of suprise.
> Back in the days when we tried to split stack processing the way
> you did(it was one of the experiments on early NAPI), IPIs were
> _damn_ expensive. What changed in current architecture that makes
> this more palatable? IPIs are still synchronous AFAIK (and the more
> IPI receiver there are, the worse the ACK latency). Did you test this
> across other archs or say 3-4 year old machines?
>

No, the cost of the IPIs hasn't been an issue for us performance-wise.
We are using them extensively -- up to one per core per device
interrupt.

We're calling __smp_call_function_single, which is asynchronous in that
the caller provides the call structure and there is no waiting for
the IPI to complete.  A flag in each call structure is set while the
IPI is in progress; this prevents simultaneous use of a call
structure.

I haven't seen any architecture-specific issues with the IPIs; I
believe they are completing in < 2 usecs on the platforms we're running
(some Opteron systems that are over 3 yrs old).

Tom

> cheers,
> jamal
>


* Re: rps: question
  2010-02-08  5:58 ` Tom Herbert
@ 2010-02-08 15:09   ` jamal
  2010-04-14 11:53     ` rps perfomance WAS(Re: " jamal
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-02-08 15:09 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, netdev, robert, David Miller

On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:

> I don't have specific numbers, although we are using this on
> application doing forwarding and numbers seem in line with what we see
> for an end host.
> 

When i get the chance i will give it a run. I have access to an i7
somewhere. It seems like i need some specific nics?

> No, the cost of the IPIs hasn't been an issue for us performance-wise.
>  We are using them extensively-- up to one per core per device
> interrupt.

Ok, so you are not going across cores then? I wonder if there's
some new optimization to reduce IPI latency  when both sender/receiver
reside on the same core? 

> We're calling __smp_call_function_single which is asynchronous in that
> the caller provides the call structure and there is not waiting for
> the IPI to complete.  A flag is used with each call structure that is
> set when the IPI is in progress, this prevents simultaneous use of a
> call structure.

It is possible that this is just an abstraction hiding the details..
AFAIK, IPIs are synchronous: the remote cpu has to ack with another IPI
while the issuing cpu waits for the ack IPI and then returns.

> I haven't seen any architectural specific issues with the IPIs, I
> believe they are completing in < 2 usecs on platforms we're running
> (some opteron systems that are over 3yrs old).

2 usecs aint bad (at 10G you only accumulate a few packets while
stalled); I think we saw much higher values.
I was asking about different architectures because I tried something
equivalent as recently as 2 years back on a MIPS multicore and the
forwarding results were horrible.
IPIs flush the processor pipeline so they aint cheap - but that may
vary depending on the architecture. Someone more knowledgeable should
be able to give better insights.
My suspicion is that at a low transaction rate (with appropriate
traffic patterns) you will see much higher latency, since you will
be sending an IPI for almost every packet..

cheers,
jamal



* rps perfomance WAS(Re: rps: question
  2010-02-08 15:09   ` jamal
@ 2010-04-14 11:53     ` jamal
  2010-04-14 17:31       ` Tom Herbert
                         ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: jamal @ 2010-04-14 11:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao, Andi Kleen

Following up like promised:

On Mon, 2010-02-08 at 10:09 -0500, jamal wrote:
> On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:
> 
> > I don't have specific numbers, although we are using this on
> > application doing forwarding and numbers seem in line with what we see
> > for an end host.
> > 
> 
> When i get the chance i will give it a run. I have access to an i7
> somewhere. It seems like i need some specific nics?

I did step #0 last night on an i7 (single Nehalem). I think more than
anything i was impressed by the Nehalem's excellent caching system.
Robert, I am almost tempted to say skb recycling performance will be
excellent on this  machine given the cost of a cache miss is much lower
than previous generation hardware.

My test was simple: irq affinity on cpu0 (core 0) and rps redirection to
cpu1 (core 1); i also tried redirecting to different SMT threads (aka
CPUs) on different cores, with similar results. I baselined against both
no rps being used and a kernel without any RPS config at all.
[BTW, I had to hand-edit the .config since i couldnt do it from
menuconfig (is there any reason for it to be so?)]

Traffic was sent from another machine into the i7 via an el-cheapo sky2
(dont know how shitty this NIC is, but it seems to know how to do MSI
so it is probably capable of multiqueueing); the test was several sets
of a plain ping first and then a ping -f (I will get more sophisticated
in my next test, likely this weekend).

Results:
CPU utilization was about 20-30% higher in the case of rps. On cpu0 the
cpu was being chewed heavily by sky2_poll, and on the redirected-to core
it was always smp_call_function_single.
Latency was (consistently) about 5 microseconds higher on average:
if i sent 1M ping -f packets, the round trips took on average
176 seconds without RPS and 181 seconds with RPS.
Throughput didnt change, but this could be attributed to the low amount
of data i was sending.
I observed that we were generating, on average, an IPI per packet even
with ping -f (i added an extra stat to record when we sent an IPI and
counted it against the number of packets sent).
In my opinion it is these IPIs that contribute most to the added
latency, and i think it happens that the Nehalem is just highly
improved in this area. I wish i had a more commonly used machine to
test rps on. I expect that rps will perform worse on cheaper/older
hardware for the traffic characteristics i tested.

On IPIs:
Is anyone familiar with what is going on in Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware, like
xeon-based systems, or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous, i.e. an IPI issuer has to wait for an ACK
(which is in the form of an IPI).
- the data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss? Andi?

So my question to Tom, Eric and Changli or anyone else who has been
running RPS:
What hardware did you use? Is there anyone using older hardware than
say AMD Opteron or Intel Nehalem?

My impressions of rps so far:
I think i may end up being impressed when i generate a lot more
traffic, since the cost of the IPIs will be amortized.
At this point multiqueue seems a much more impressive alternative, and
it seems to me multiqueue hardware is a lot more commodity
(price-point wise) than a Nehalem.

Plan:
I plan to still attack the app space (write a basic udp app that
binds to one or more rps cpus and try blasting a lot of UDP traffic to
see what happens); my step after that is to move to forwarding tests..
 
cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 11:53     ` rps perfomance WAS(Re: " jamal
@ 2010-04-14 17:31       ` Tom Herbert
  2010-04-14 18:04         ` Eric Dumazet
  2010-04-14 18:53       ` Stephen Hemminger
  2010-04-15  8:42       ` David Miller
  2 siblings, 1 reply; 86+ messages in thread
From: Tom Herbert @ 2010-04-14 17:31 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao, Andi Kleen

The point of RPS is to increase parallelism, but the cost of that is
more overhead per packet.  If you are running a single flow, then
you'll see latency increase for that flow.  With more concurrent flows
the benefits of parallelism kick in and latency gets better -- we've
seen the break-even point around ten connections in our tests.  Also,
I don't think we've made the claim that RPS should generally perform
better than multi-queue; the primary motivation for RPS is to make
single-queue NICs give reasonable performance.


On Wed, Apr 14, 2010 at 4:53 AM, jamal <hadi@cyberus.ca> wrote:
> Following up like promised:
>
> On Mon, 2010-02-08 at 10:09 -0500, jamal wrote:
>> On Sun, 2010-02-07 at 21:58 -0800, Tom Herbert wrote:
>>
>> > I don't have specific numbers, although we are using this on
>> > application doing forwarding and numbers seem in line with what we see
>> > for an end host.
>> >
>>
>> When i get the chance i will give it a run. I have access to an i7
>> somewhere. It seems like i need some specific nics?
>
> I did step #0 last night on an i7 (single Nehalem). I think more than
> anything i was impressed by the Nehalem's excellent caching system.
> Robert, I am almost tempted to say skb recycling performance will be
> excellent on this  machine given the cost of a cache miss is much lower
> than previous generation hardware.
>
> My test was simple: irq affinity on cpu0(core0) and rps redirection to
> cpu1(core 1); tried also to redirect to different SMT threads (aka CPUs)
> on different cores with similar results. I base tested against no rps
> being used and a kernel which didnt have any RPS config on.
> [BTW, I had to hand-edit the .config since i couldnt do it from
> menuconfig (Is there any reason for it to be so?)]
>
> Traffic was sent from another machine into the i7 via an el-cheapo sky2
> (dont know how shitty this NIC is, but it seems to know how to do MSI so
> probably capable of multiqueueing); the test was several sets of
> a ping first and then a ping -f (I will get more sophisticated in my
> next test likely this weekend).
>
> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single.
> Latency was (consistently) on average 5 microseconds.
> So if i sent 1M ping -f packets, without RPS it took on average
> 176 seconds and with RPS it took 181 seconds to do a round-trip.
> Throughput didnt change but this could be attributed to the low amounts
> of data i was sending.
> I observed that we were generating, on average, an IPI per packet even
> with ping -f. (added an extra stat to record when we sent an IPI and
> counted against the number of packets sent).
> In my opinion it is these IPIs that contribute the most to the latency
> and i think it happens that the Nehalem is just highly improved in this
> area. I wish i had a more commonly used machine to test rps on.
> I expect that rps will perform worse on currently cheaper/older hardware
> for the traffic characteristic i tested.
>
> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.
> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).
> - data cache has to be synced to main memory
> - the instruction pipeline is flushed
> - what else did i miss? Andi?
>
> So my question to Tom, Eric and Changli or anyone else who has been
> running RPS:
> What hardware did you use? Is there anyone using older hardware than
> say AMD Opteron or Intel Nehalem?
>
> My impressions of rps so far:
> I think i may end up being impressed when i generate a lot more traffic
> since the cost of IPI will be amortized.
> At this point multiqueue seems a lot more impressive alternative and it
> seems to me multiqueu hardware is a lot more commodity (price-point)
> than a Nehalem.
>
> Plan:
> I plan to still attack the app space (and write a basic udp app that
> binds to one or more rps cpus and try blasting a lot of UDP traffic to
> see what happens) my step after that is to move to forwarding tests..
>
> cheers,
> jamal
>
>


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 17:31       ` Tom Herbert
@ 2010-04-14 18:04         ` Eric Dumazet
  2010-04-14 18:53           ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-14 18:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: hadi, netdev, robert, David Miller, Changli Gao, Andi Kleen

Le mercredi 14 avril 2010 à 10:31 -0700, Tom Herbert a écrit :
> The point of RPS is to increase parallelism, but the cost of that is
> more overhead per packet.  If you are running a single flow, then
> you'll see latency increase for that flow.  With more concurrent flows
> the benefits of parallelism kick in and latency gets better.-- we've
> seen the break even point around ten connections in our tests.  Also,
> I don't think we've made the claim that RPS should generally perform
> better than multi-queue, the primary motivation for RPS is make single
> queue NICs give reasonable performance.
> 

Yes, multiqueue is far better of course, but on hardware lacking
multiqueue, RPS can help many workloads where the application has
_some_ work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...





* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 18:04         ` Eric Dumazet
@ 2010-04-14 18:53           ` jamal
  2010-04-14 19:44             ` Stephen Hemminger
                               ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: jamal @ 2010-04-14 18:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, netdev, robert, David Miller, Changli Gao, Andi Kleen

On Wed, 2010-04-14 at 20:04 +0200, Eric Dumazet wrote:

> Yes, multiqueue is far better of course, but in case of hardware lacking
> multiqueue, RPS can help many workloads, where application has _some_
> work to do, not only counting frames or so...

Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have a single-queue nic
c) at sub-threshold traffic you dont care about a little extra latency
d) you have a specific cache hierarchy
e) the app is working hard to process incoming messages

> RPS overhead (IPI, cache misses, ...) must be amortized by
> parallelization or we lose.

Indeed. 
How well they can be amortized seems very cpu or board specific.

I think the main challenge for my pedantic mind is the missing
details. Is there a paper on rps? For example, for #d above, the commit
log mentions that rps benefits if you have certain types of "cache
hierarchy". Probably some arch with a large shared L2/L3 (maybe
inclusive) cache will benefit; for example, it does well on Nehalem and
probably on opterons (as long as you dont start stacking these things
on some interconnect like QPI or HT).
But what happens when you have FSB sharing across cores (still a very
common setup)? etc etc

Can I ask what hardware you run this on?

> A ping test is not an ideal candidate for RPS, since everything is done
> at softirq level, and should be faster without RPS...

ping wont do justice to the possible potential of rps, mostly because
it generates very little traffic, i.e. point #c above. But it helps me
at least boot a machine with a proper setup - and it is not totally
useless, because i think the cost of an IPI can be deduced from the
results. I am going to put together some udp app with variable
think-time to see what happens. Would that be a reasonable thing to
test on?

It would be valuable to have something like Documentation/networking/rps
to detail things a little more. 

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 11:53     ` rps perfomance WAS(Re: " jamal
  2010-04-14 17:31       ` Tom Herbert
@ 2010-04-14 18:53       ` Stephen Hemminger
  2010-04-15  8:42       ` David Miller
  2 siblings, 0 replies; 86+ messages in thread
From: Stephen Hemminger @ 2010-04-14 18:53 UTC (permalink / raw)
  To: hadi
  Cc: Tom Herbert, Eric Dumazet, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

On Wed, 14 Apr 2010 07:53:06 -0400
jamal <hadi@cyberus.ca> wrote:

> Results:
> CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
> cpu was being chewed highly by sky2_poll and on the redirected-to-core
> it was always smp_call_function_single

I posted a patch to use the sky2 hardware hash (RSS), which should
lower the cost per packet.

-- 


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 18:53           ` jamal
@ 2010-04-14 19:44             ` Stephen Hemminger
  2010-04-14 19:58               ` Eric Dumazet
                                 ` (3 more replies)
  2010-04-15  8:48             ` David Miller
  2010-04-16 15:57             ` Tom Herbert
  2 siblings, 4 replies; 86+ messages in thread
From: Stephen Hemminger @ 2010-04-14 19:44 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

On Wed, 14 Apr 2010 14:53:42 -0400
jamal <hadi@cyberus.ca> wrote:

> Agreed. So to enumerate, the benefits come in if:
> a) you have many processors
> b) you have single-queue nic
> c) at sub-threshold traffic you dont care about a little latency

There probably needs to be better autotuning for this; there is no
reason for RPS to be steering packets unless the queue is getting
backed up. Some kind of high/low watermark mechanism is needed.

RPS might also interact with the core turbo boost functionality on
Intel chips. Newer chips will make a single core faster if the other
cores can be kept idle.


-- 


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 19:44             ` Stephen Hemminger
@ 2010-04-14 19:58               ` Eric Dumazet
  2010-04-15  8:51                 ` David Miller
  2010-04-14 20:22               ` jamal
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-14 19:58 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, Tom Herbert, netdev, robert, David Miller, Changli Gao, Andi Kleen

Le mercredi 14 avril 2010 à 12:44 -0700, Stephen Hemminger a écrit :
> On Wed, 14 Apr 2010 14:53:42 -0400
> jamal <hadi@cyberus.ca> wrote:
> 
> > Agreed. So to enumerate, the benefits come in if:
> > a) you have many processors
> > b) you have single-queue nic
> > c) at sub-threshold traffic you dont care about a little latency
> 
> There probably needs to be better autotuning for this, there is no reason
> that RPS to be steering packets unless the queue is getting backed up.
> Some kind of high / low water mark mechanism is needed.
> 
> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.
> 
> 

This was discussed a while ago, and Out Of Order packet delivery was
the thing that frightened us a bit.

Every time we toggle RPS on or off, we might get some extra noise.
Maybe we already have this problem with irqbalance?





* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 19:44             ` Stephen Hemminger
  2010-04-14 19:58               ` Eric Dumazet
@ 2010-04-14 20:22               ` jamal
  2010-04-14 20:27                 ` Eric Dumazet
  2010-04-15  8:51                 ` David Miller
  2010-04-14 20:34               ` Andi Kleen
  2010-04-15  8:50               ` David Miller
  3 siblings, 2 replies; 86+ messages in thread
From: jamal @ 2010-04-14 20:22 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:

> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.

how well does it work with Linux? Sounds like all i need to do is turn
on some BIOS feature.
One of the negatives with multiqueue nics is that because the core
selection is static, you could end up overloading one core while others
stay idle. This seems to steal cycle capacity from the idle cores and
give it to the busy cpus. nice. So i see it as a boost to multiqueue.

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:22               ` jamal
@ 2010-04-14 20:27                 ` Eric Dumazet
  2010-04-14 20:38                   ` jamal
  2010-04-14 20:45                   ` Tom Herbert
  2010-04-15  8:51                 ` David Miller
  1 sibling, 2 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-14 20:27 UTC (permalink / raw)
  To: hadi
  Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

Le mercredi 14 avril 2010 à 16:22 -0400, jamal a écrit :
> On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:
> 
> > RPS might also interact with the core turbo boost functionality on Intel chips.
> > Newer chips will make a single core faster if other core can be kept idle.
> 
> how well does it work with Linux? Sounds like all i need to do is turn
> on some BIOS feature. 
> One of the negatives with multiqueue nics is because the core selection
> is static, you could end up overloading one core while others stay idle.
> This seems to steal cycle capacity from the idle cores and gives it to
> the busy cpus. nice. So i see it as a boost to multiqueue.

Only if more than one flow is involved.

And if you have many flows, chances are they will spread across several
queues...





* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 19:44             ` Stephen Hemminger
  2010-04-14 19:58               ` Eric Dumazet
  2010-04-14 20:22               ` jamal
@ 2010-04-14 20:34               ` Andi Kleen
  2010-04-15  8:50               ` David Miller
  3 siblings, 0 replies; 86+ messages in thread
From: Andi Kleen @ 2010-04-14 20:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: hadi, Eric Dumazet, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

> RPS might also interact with the core turbo boost functionality on Intel chips.
> Newer chips will make a single core faster if other core can be kept idle.

In addition to Turbo using less cores can also help to save power.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:27                 ` Eric Dumazet
@ 2010-04-14 20:38                   ` jamal
  2010-04-14 20:45                   ` Tom Herbert
  1 sibling, 0 replies; 86+ messages in thread
From: jamal @ 2010-04-14 20:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Tom Herbert, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

On Wed, 2010-04-14 at 22:27 +0200, Eric Dumazet wrote:

> Only if more than one flow is involved.
> 
> And if you have many flows, chance they will spread several queues...

Over a long measurement period, true; but even with > 1 flows, it is
possible that one flow is more active/intense than the others (rtp vs
some bulk file transfer) or more processor-intensive than the others
(eg ipsec vs clear text), etc.

BTW: i was just poking at the intel doc on turbo boost, and it seems
the max a core can steal from the others is 400MHz; so a core can go
from 2.8GHz to 3.2GHz. I am sure theres a lot of interesting dynamics
from this ;->
I think i will try turning this thing on in my tests since i have an i7.

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:27                 ` Eric Dumazet
  2010-04-14 20:38                   ` jamal
@ 2010-04-14 20:45                   ` Tom Herbert
  2010-04-14 20:57                     ` Eric Dumazet
  1 sibling, 1 reply; 86+ messages in thread
From: Tom Herbert @ 2010-04-14 20:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

> Only if more than one flow is involved.
>
> And if you have many flows, chance they will spread several queues...
>

But use too many queues and the efficiency of NAPI drops and the cost
of device interrupts becomes dominant, so the overhead from additional
hard interrupts can surpass the overhead of doing RPS and the IPIs.  I
believe we are seeing this in some of our results, which show that a
combination of multi-queue and RPS can be better than just multi-queue
(see the rps changelog).  Again, I'm not claiming that this is
generally true, but there are a lot of factors to consider.


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:45                   ` Tom Herbert
@ 2010-04-14 20:57                     ` Eric Dumazet
  2010-04-14 22:51                       ` Changli Gao
                                         ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-14 20:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

Le mercredi 14 avril 2010 à 13:45 -0700, Tom Herbert a écrit :
> > Only if more than one flow is involved.
> >
> > And if you have many flows, chance they will spread several queues...
> >
> 
> But use too many queues and the efficiency of NAPI drops and cost of
> device interrupts becomes dominant, so that the overhead from
> additional hard interrupts can surpass the overhead of doing RPS and
> the IPIs.  I believe we are seeing this is in some of our results
> which shows that a combination of multi-queue and RPS can be better
> than just multi-queue (see rps changelog).  Again, I'm not claiming
> that is generally true, but there are a lot of factors to consider.
> --

RPS can be tuned (Changli wants finer tuning...); it would be
interesting to tune multiqueue devices too. I dont know if that's
possible right now.

On my Nehalem machine (16 logical cpus), the NetXtreme II BCM57711E
10 Gigabit NIC has 16 queues. It might be good to use fewer queues
according to your results on some workloads, and eventually use RPS
as a second layer.






* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:57                     ` Eric Dumazet
@ 2010-04-14 22:51                       ` Changli Gao
  2010-04-14 23:02                         ` Stephen Hemminger
  2010-04-15  8:57                       ` David Miller
  2010-04-15 12:10                       ` jamal
  2 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-14 22:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, hadi, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen

On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> RPS can be tuned (Changli wants a finer tuning...), it would be
> intereting to tune multiqueue devices too. I dont know if its possible
> right now.
>
> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.

My idea is to run a daemon in userland to monitor the softnet
statistics, and tune the RPS settings if necessary. It seems that the
current softnet statistics data isn't correct.

A long time ago, I did a test, and the conclusion was that a
call_function_single IPI was more expensive than a resched IPI, so I
moved from softirq to a kernel thread for packet processing. I'll redo
the test later.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 22:51                       ` Changli Gao
@ 2010-04-14 23:02                         ` Stephen Hemminger
  2010-04-15  2:40                           ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Stephen Hemminger @ 2010-04-14 23:02 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Tom Herbert, hadi, netdev, robert, David Miller,
	Andi Kleen

On Thu, 15 Apr 2010 06:51:29 +0800
Changli Gao <xiaosuo@gmail.com> wrote:

> On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > RPS can be tuned (Changli wants a finer tuning...), it would be
> > intereting to tune multiqueue devices too. I dont know if its possible
> > right now.
> >
> > On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> > 10Gigabit has 16 queues. It might be good to use less queues according
> > to your results on some workloads, and eventually use RPS on a second
> > layering.
> 
> My idear is: run a daemon in userland to monitor the softnet
> statistics, and tun the RPS setting if necessary. It seems that the
> current softnet statistics data isn't correct.
> 
> Long time ago, I did a test, and the conclution was
> call_function_single IPI was more expensive than resched IPI, so I
> moved to kernel thread from softirq for packet processing. I'll redo
> the test later.
> 

The big thing is data, data, data... Performance can only be examined
with real hard data on multiple different kinds of hardware.  Also,
check for regressions in lmbench and TPC benchmarks. Yes this is hard,
but papers on this would allow for rational rather than speculative
choices.

Adding more tuning knobs is not the answer unless you can show when
the tuning helps.

-- 


* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 23:02                         ` Stephen Hemminger
@ 2010-04-15  2:40                           ` Eric Dumazet
  2010-04-15  2:50                             ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-15  2:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Changli Gao, Tom Herbert, hadi, netdev, robert, David Miller, Andi Kleen

On Wednesday 14 April 2010 at 16:02 -0700, Stephen Hemminger wrote:
> On Thu, 15 Apr 2010 06:51:29 +0800
> Changli Gao <xiaosuo@gmail.com> wrote:
> 
> > On Thu, Apr 15, 2010 at 4:57 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > RPS can be tuned (Changli wants finer tuning...); it would be
> > > interesting to tune multiqueue devices too. I don't know if it's possible
> > > right now.
> > >
> > > On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> > > 10Gigabit has 16 queues. It might be good to use less queues according
> > > to your results on some workloads, and eventually use RPS on a second
> > > layering.
> > 
> > My idea is: run a daemon in userland to monitor the softnet
> > statistics, and tune the RPS settings if necessary. It seems that the
> > current softnet statistics data isn't correct.
> > 
> > A long time ago, I did a test, and the conclusion was that the
> > call_function_single IPI was more expensive than the resched IPI, so I
> > moved packet processing from softirq to a kernel thread. I'll redo
> > the test later.
> > 
> 
> The big thing is data, data, data... Performance can only be evaluated
> with real hard data on multiple different kinds of hardware.  Also, check for
> regressions in lmbench and TPC benchmarks. Yes, this is hard, but papers
> on this would allow for rational rather than speculative choices.
> 
> Adding more tuning knobs is not the answer unless you can show when
> the tuning helps.
> 

Agree 100%, and irqbalance is the existing daemon. It should be used and
changed if necessary.

Changli, my strongest argument about your patches is that our scheduler
and memory affinity API (numactl driven) is bitmask oriented, giving the
same weight to each individual CPU or memory node.




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15  2:40                           ` Eric Dumazet
@ 2010-04-15  2:50                             ` Changli Gao
  0 siblings, 0 replies; 86+ messages in thread
From: Changli Gao @ 2010-04-15  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Tom Herbert, hadi, netdev, robert,
	David Miller, Andi Kleen

On Thu, Apr 15, 2010 at 10:40 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Agree 100%, and irqbalance is the existing daemon. It should be used and
> changed if necessary.
>
> Changli, my strongest argument about your patches is that our scheduler
> and memory affinity API (numactl driven) is bitmask oriented, giving the
> same weight to each individual CPU or memory node.
>

It works under the assumption that the work done in non-schedulable
context is less than the rest. If most of the work is done in
non-schedulable (softirq) context, the scheduler can't keep the load balanced.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 11:53     ` rps perfomance WAS(Re: " jamal
  2010-04-14 17:31       ` Tom Herbert
  2010-04-14 18:53       ` Stephen Hemminger
@ 2010-04-15  8:42       ` David Miller
  2 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-15  8:42 UTC (permalink / raw)
  To: hadi; +Cc: therbert, eric.dumazet, netdev, robert, xiaosuo, andi

From: jamal <hadi@cyberus.ca>
Date: Wed, 14 Apr 2010 07:53:06 -0400

> I base-tested against no RPS being used and a kernel which didn't
> have any RPS config on.  [BTW, I had to hand-edit the .config since
> I couldn't do it from menuconfig (is there any reason for it to be
> so?)]

The RPS config is merely an indirect dependency on SMP as we have it
coded up in the Kconfig files; it's not meant to be user-selectable
and is intended to be unconditionally on for SMP builds.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 18:53           ` jamal
  2010-04-14 19:44             ` Stephen Hemminger
@ 2010-04-15  8:48             ` David Miller
  2010-04-15 11:55               ` jamal
  2010-04-16 15:57             ` Tom Herbert
  2 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2010-04-15  8:48 UTC (permalink / raw)
  To: hadi; +Cc: eric.dumazet, therbert, netdev, robert, xiaosuo, andi

From: jamal <hadi@cyberus.ca>
Date: Wed, 14 Apr 2010 14:53:42 -0400

> On Wed, 2010-04-14 at 20:04 +0200, Eric Dumazet wrote:
> 
>> Yes, multiqueue is far better of course, but in case of hardware lacking
>> multiqueue, RPS can help many workloads, where application has _some_
>> work to do, not only counting frames or so...
> 
> Agreed. So to enumerate, the benefits come in if:
> a) you have many processors
> b) you have a single-queue NIC
> c) at sub-threshold traffic you don't care about a little latency
> d) you have a specific cache hierarchy
> e) the app is working hard to process incoming messages

A single-queue NIC is actually not a requirement; RPS also helps in
cases where you have N application threads and N is less than the
number of CPUs your multi-queue NIC is distributing traffic to.

Moving the bulk of the input packet processing to the CPUs where
the applications actually sit had a non-trivial benefit.  RFS takes
this aspect to yet another level.

> I think the main challenge for my pedantic mind is the missing details. Is
> there a paper on RPS? Take #d above, for example: the commit log mentions that
> RPS benefits if you have certain types of "cache hierarchy". Probably
> some arch with a large shared L2/L3 (maybe inclusive) cache will benefit.
> For example, it does well on Nehalem, and probably on Opterons, as long as you
> don't start stacking these things on some interconnect like QPI or HT.
> But what happens when you have FSB sharing across cores (still a very
> common setup)? etc etc

I think for the case where application locality is important,
RPS/RFS can help regardless of cache details.
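At its core the steering decision is just a flow hash mapped onto the set of enabled CPUs, which is what makes it largely independent of cache details. A simplified model of the kernel's get_rps_cpu() logic (assumption: the real code hashes the skb with jhash against a per-rx-queue rps_cpus map; CRC32 here is only a stand-in for illustration):

```python
import zlib

def rps_select_cpu(saddr, daddr, sport, dport, rps_cpus):
    """Pick a target CPU for a flow, RPS-style: hash the 4-tuple,
    then scale the 32-bit hash onto the eligible-CPU list, so the
    same flow always lands on the same CPU."""
    key = ("%s:%d-%s:%d" % (saddr, sport, daddr, dport)).encode()
    h = zlib.crc32(key) & 0xffffffff
    # (h * n) >> 32 maps a 32-bit hash uniformly onto n buckets
    return rps_cpus[(h * len(rps_cpus)) >> 32]
```

Keeping the mapping a pure function of the flow tuple is what preserves per-flow packet ordering while still spreading distinct flows across CPUs.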

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 19:44             ` Stephen Hemminger
                                 ` (2 preceding siblings ...)
  2010-04-14 20:34               ` Andi Kleen
@ 2010-04-15  8:50               ` David Miller
  3 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-15  8:50 UTC (permalink / raw)
  To: shemminger; +Cc: hadi, eric.dumazet, therbert, netdev, robert, xiaosuo, andi

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 14 Apr 2010 12:44:26 -0700

> On Wed, 14 Apr 2010 14:53:42 -0400
> jamal <hadi@cyberus.ca> wrote:
> 
>> Agreed. So to enumerate, the benefits come in if:
>> a) you have many processors
>> b) you have a single-queue NIC
>> c) at sub-threshold traffic you don't care about a little latency
> 
> There probably needs to be better autotuning for this; there is no reason
> for RPS to be steering packets unless the queue is getting backed up.

I disagree: if the goal is to migrate the bulk of packet processing
to where the app will actually sink and process the data, then it should
forward to the RPS-marked CPUs regardless of local queue levels.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 19:58               ` Eric Dumazet
@ 2010-04-15  8:51                 ` David Miller
  0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-15  8:51 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, hadi, therbert, netdev, robert, xiaosuo, andi

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 21:58:50 +0200

> Every time we switch RPS on or off, we might get some extra
> noise. Maybe we already have this problem with irqbalance?

irqbalance should never move network device interrupts around
under normal circumstances.  Arjan assured me that there is
specific logic in the irqbalance daemon to not move NIC
interrupts around once a target has been chosen.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:22               ` jamal
  2010-04-14 20:27                 ` Eric Dumazet
@ 2010-04-15  8:51                 ` David Miller
  1 sibling, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-15  8:51 UTC (permalink / raw)
  To: hadi; +Cc: shemminger, eric.dumazet, therbert, netdev, robert, xiaosuo, andi

From: jamal <hadi@cyberus.ca>
Date: Wed, 14 Apr 2010 16:22:48 -0400

> On Wed, 2010-04-14 at 12:44 -0700, Stephen Hemminger wrote:
> 
>> RPS might also interact with the core turbo boost functionality on Intel chips.
>> Newer chips will make a single core faster if the other cores can be kept idle.
> 
> how well does it work with Linux?

It's completely transparent and should just happen without any
BIOS tweaks.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:57                     ` Eric Dumazet
  2010-04-14 22:51                       ` Changli Gao
@ 2010-04-15  8:57                       ` David Miller
  2010-04-15 12:10                       ` jamal
  2 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-15  8:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, hadi, shemminger, netdev, robert, xiaosuo, andi

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 22:57:41 +0200

> RPS can be tuned (Changli wants a finer tuning...), it would be
> intereting to tune multiqueue devices too. I dont know if its possible
> right now.

Only NIU allows really detailed control over queue selection and
stuff like that, because the hardware has a real TCAM for
packet matching, and packets which match TCAM entries can be
steered to different collections of queues.

We have ethtool interfaces for this (ETHTOOL_GRXCLS*), so you can
change it.

For most other chips we only have interfaces for modifying the
RX hashing algorithm or what the RX hash covers, stuff like
that.

See also ETHTOOL_GRXFH, ETHTOOL_SRXFH, ETHTOOL_SRXNTUPLE, and
ETHTOOL_GRXNTUPLE, the latter two of which were added for Intel
NICs.
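For reference, those ioctls are reachable from the ethtool command line; a hedged sketch (the device name, queue number, and NIC support are assumptions — most chips implement only a subset of these):

```shell
# ETHTOOL_GRXFH / ETHTOOL_SRXFH: which header fields feed the RX hash
ethtool -n eth0 rx-flow-hash tcp4        # show fields hashed for TCP/IPv4
ethtool -N eth0 rx-flow-hash tcp4 sdfn   # hash on src/dst IP and src/dst port

# ETHTOOL_SRXNTUPLE (Intel NICs): steer a matching flow to a given RX queue
ethtool -K eth0 ntuple on
ethtool -U eth0 flow-type tcp4 dst-port 80 action 3
```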

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15  8:48             ` David Miller
@ 2010-04-15 11:55               ` jamal
  2010-04-15 16:41                 ` Rick Jones
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-15 11:55 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev, robert, xiaosuo, andi

On Thu, 2010-04-15 at 01:48 -0700, David Miller wrote:

> A single-queue NIC is actually not a requirement;
> RPS also helps in cases where you have N application threads
> and N is less than the number of CPUs your multi-queue NIC is
> distributing traffic to.

sure..

> Moving the bulk of the input packet processing to the CPUs where
> the applications actually sit had a non-trivial benefit.  

This is true regardless of RPS, though.

> RFS takes this aspect to yet another level.

RFS looks quite interesting ;-> I think with some twist it could be
used with multiqueue NICs as well.

> I think for the case where application locality is important,
> RPS/RFS can help regardless of cache details.

Generally true, as long as there's not much shared data across the CPUs,
or the cost of a cache miss is reasonably tolerable. The socket layer
just happens not to share much with the ingress packet path, and
on a single-processor Nehalem the caching system works so well that
the cost of cache misses is not as important a variable. Everything
is on the same die, including the memory controller etc.
I am speculating (I didn't get any answer to the question I asked) that
people running RPS use such hardware ;->

I speculate again that it may be too costly to run RPS on something like
a Tigerton or Intel Clovertown, where you have cores sharing/contending
for an FSB. If I can get answers to the question "What hardware are
people running?" I could be proven wrong.
[Note: I am not against RPS - I think it has its place; so I hope my
desire to find out when to use RPS doesn't come across as hostility
towards it.]

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 20:57                     ` Eric Dumazet
  2010-04-14 22:51                       ` Changli Gao
  2010-04-15  8:57                       ` David Miller
@ 2010-04-15 12:10                       ` jamal
  2010-04-15 12:32                         ` Changli Gao
  2 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-15 12:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen

On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:

> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.

Ok Eric, you seem to be running a system with two Nehalems
interconnected by QPI.
Is there any difference, performance-wise, between redirecting from
coreX to coreY when they are on the same Nehalem vs when you
are going across QPI?

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 12:10                       ` jamal
@ 2010-04-15 12:32                         ` Changli Gao
  2010-04-15 12:50                           ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-15 12:32 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen

On Thu, Apr 15, 2010 at 8:10 PM, jamal <hadi@cyberus.ca> wrote:
> On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:
>
>> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
>> 10Gigabit has 16 queues. It might be good to use less queues according
>> to your results on some workloads, and eventually use RPS on a second
>> layering.
>

For historical reasons, we use Linux 2.6.18. Our company has several
products with Xeon, P4, or i7 CPUs. Some of them are SMP, multi-core and
multi-threaded. We use a mechanism similar to dynamic weighted
RPS. The total throughput increases nearly linearly with the number
of worker threads (one worker thread per CPU).
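Changli's implementation isn't shown here, so the following is only an illustrative sketch of what "weighted" flow steering might look like: the flow-to-CPU mapping is biased by a per-CPU weight instead of being split evenly (the weights and the hash choice are assumptions, not his code):

```python
import bisect
import zlib

def weighted_cpu(flow_key, cpu_weights):
    """Map a flow hash onto CPUs in proportion to per-CPU weights,
    e.g. to hand fewer flows to a CPU that also services NIC IRQs.
    cpu_weights is a list of (cpu, weight) pairs."""
    cpus, weights = zip(*cpu_weights)
    cumulative, total = [], 0
    for w in weights:
        total += w
        cumulative.append(total)
    slot = zlib.crc32(flow_key) % total      # deterministic per flow
    return cpus[bisect.bisect_right(cumulative, slot)]
```

A "dynamic" variant would periodically recompute the weights from observed per-CPU load, which is where the monitoring daemon idea comes in.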

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 12:32                         ` Changli Gao
@ 2010-04-15 12:50                           ` jamal
  2010-04-15 23:51                             ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-15 12:50 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Tom Herbert, netdev

On Thu, 2010-04-15 at 20:32 +0800, Changli Gao wrote:

> For historical reasons, we use Linux 2.6.18. Our company has several
> products with Xeon, P4, or i7 CPUs. Some of them are SMP, multi-core and
> multi-threaded.

Thanks for sharing. How much more can you say? ;-> Do you have a paper
or description of some sort somewhere?

> We use a mechanism similar to dynamic weighted
> RPS. The total throughput increases nearly linearly with the number
> of worker threads (one worker thread per CPU).

Other than the i7 - have you tried to run RPS on the P4?

cheers,
jamal



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 11:55               ` jamal
@ 2010-04-15 16:41                 ` Rick Jones
  2010-04-15 20:16                   ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Rick Jones @ 2010-04-15 16:41 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, eric.dumazet, therbert, netdev, robert, xiaosuo, andi

> 
> I speculate again that it may be too costly to run RPS on something like
> a Tigerton or Intel Clovertown, where you have cores sharing/contending
> for an FSB. If I can get answers to the question "What hardware are
> people running?" I could be proven wrong.
> [Note: I am not against RPS - I think it has its place; so I hope my
> desire to find out when to use RPS doesn't come across as hostility
> towards it.]

IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
implemented in a Streams-based stack tracing its history to a common ancestor 
with Solaris (Mentat).

rick jones

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 16:41                 ` Rick Jones
@ 2010-04-15 20:16                   ` jamal
  2010-04-15 20:25                     ` Rick Jones
  2010-04-15 23:56                     ` Changli Gao
  0 siblings, 2 replies; 86+ messages in thread
From: jamal @ 2010-04-15 20:16 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, eric.dumazet, therbert, netdev, robert, xiaosuo, andi

On Thu, 2010-04-15 at 09:41 -0700, Rick Jones wrote:

> IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
> networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
> shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
> implemented in a Streams-based stack tracing its history to a common ancestor 
> with Solaris (Mentat).

Sounds interesting.
Wikipedia information overload. Any arch description of the HP9000? 
Did your scheme use IPIs to message the other CPUs?

cheers,
jamal 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 20:16                   ` jamal
@ 2010-04-15 20:25                     ` Rick Jones
  2010-04-15 23:56                     ` Changli Gao
  1 sibling, 0 replies; 86+ messages in thread
From: Rick Jones @ 2010-04-15 20:25 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, eric.dumazet, therbert, netdev, robert, xiaosuo, andi

jamal wrote:
> On Thu, 2010-04-15 at 09:41 -0700, Rick Jones wrote:
> 
> 
>>IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
>>networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
>>shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
>>implemented in a Streams-based stack tracing its history to a common ancestor 
>>with Solaris (Mentat).
> 
> 
> Sounds interesting.
> Wikipedia information overload. Any arch description of the HP9000? 

I should have been more specific - HP 9000 Model 800's :) PA-RISC based business 
computers running HP-UX.  In the case of IPS, HP-UX 10.20 ca 1995 or so.

> Did your scheme use IPIs to message the other CPUs?

Netisrs were kernel processes, one per CPU (back then a core, a processor and a 
CPU were one and the same :), and while we didn't call them IPIs, yes, it was a 
"soft interrupt" directed at the given processor to launch the netisr if it 
wasn't already running.

TOPS was similar, but it was within Streams, and while that did/does have some 
kernel processes, not everything would happen as a kernel process.

rick jones

HP 3000 Model 900's - by and large the same PA-RISC hardware but running MPE/XL 
(later called MPE/iX)
HP 9000 Model 700's - PA-RISC based workstations
HP 9000 Model 300's - Moto 68K-based workstations (replaced by the 700s)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 12:50                           ` jamal
@ 2010-04-15 23:51                             ` Changli Gao
  0 siblings, 0 replies; 86+ messages in thread
From: Changli Gao @ 2010-04-15 23:51 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Tom Herbert, netdev

On Thu, Apr 15, 2010 at 8:50 PM, jamal <hadi@cyberus.ca> wrote:
> On Thu, 2010-04-15 at 20:32 +0800, Changli Gao wrote:
>
>> For historical reasons, we use Linux 2.6.18. Our company has several
>> products with Xeon, P4, or i7 CPUs. Some of them are SMP, multi-core and
>> multi-threaded.
>
> Thanks for sharing. How much more can you say? ;-> Do you have a paper
> or description of some sort somewhere?

On a dual 4-core Xeon, we use one core for the NIC on the internal side,
one core for the NIC on the external side, one for inbound QoS, and one
for outbound QoS; the CPU cycles left are used by DPI (DFA). The total
throughput is about 3 Gbps in a Polygraph test.

>
>> We use a mechanism similar to dynamic weighted
>> RPS. The total throughput increases nearly linearly with the number
>> of worker threads (one worker thread per CPU).
>
> Other than the i7 - have you tried to run RPS on the P4?
>

No.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 20:16                   ` jamal
  2010-04-15 20:25                     ` Rick Jones
@ 2010-04-15 23:56                     ` Changli Gao
  2010-04-16  5:18                       ` Eric Dumazet
  1 sibling, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-15 23:56 UTC (permalink / raw)
  To: hadi
  Cc: Rick Jones, David Miller, eric.dumazet, therbert, netdev, robert, andi

On Fri, Apr 16, 2010 at 4:16 AM, jamal <hadi@cyberus.ca> wrote:
>
> Sounds interesting.
> Wikipedia information overload. Any arch description of the HP9000?
> Did your scheme use IPIs to message the other CPUs?
>

If you doubt the cost of smp_call_function_single(), how about trying
another patch of mine, which implements something similar to RPS but
uses kernel threads instead, so there is no explicit IPI:

http://patchwork.ozlabs.org/patch/38319/


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-15 23:56                     ` Changli Gao
@ 2010-04-16  5:18                       ` Eric Dumazet
  2010-04-16  6:02                         ` Changli Gao
  2010-04-16 13:21                         ` jamal
  0 siblings, 2 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-16  5:18 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi

On Friday 16 April 2010 at 07:56 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 4:16 AM, jamal <hadi@cyberus.ca> wrote:
> >
> > Sounds interesting.
> > Wikipedia information overload. Any arch description of the HP9000?
> > Did your scheme use IPIs to message the other CPUs?
> >
> 
>> If you doubt the cost of smp_call_function_single(), how about trying
>> another patch of mine, which implements something similar to RPS but
>> uses kernel threads instead, so there is no explicit IPI:
> 
> http://patchwork.ozlabs.org/patch/38319/
> 
> 

Come on Changli.

How do you wake up a thread on a remote CPU?

To answer Jamal's question, we need exactly what he asked for: the
timing cost of IPIs.

A kernel module could measure this, and it could be integrated into perf
bench so that we can run regression tests on upcoming kernels.
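Short of a kernel module, a crude userspace proxy is to pin two threads to different CPUs and time a wake-up round trip: every wake-up forces the scheduler to kick the sleeping thread's CPU, which on SMP involves a resched IPI. The measured figure also includes futex, context-switch, and (in CPython) GIL overhead, so treat it strictly as an upper bound; a sketch:

```python
import os
import threading
import time

def pingpong_latency(cpu_a, cpu_b, iters=2000):
    """Average one-way wake-up latency (seconds) between two threads
    pinned to cpu_a and cpu_b, measured over a ping-pong of Events."""
    ev_a, ev_b = threading.Event(), threading.Event()

    def responder():
        os.sched_setaffinity(0, {cpu_b})  # pins the calling thread only
        for _ in range(iters):
            ev_a.wait()
            ev_a.clear()
            ev_b.set()

    t = threading.Thread(target=responder)
    t.start()
    os.sched_setaffinity(0, {cpu_a})
    start = time.perf_counter()
    for _ in range(iters):
        ev_a.set()
        ev_b.wait()
        ev_b.clear()
    t.join()
    return (time.perf_counter() - start) / (2 * iters)

if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))
    a, b = cpus[0], cpus[-1]
    print("avg wake-up latency %.1f us (cpu%d <-> cpu%d)"
          % (pingpong_latency(a, b) * 1e6, a, b))
```

Comparing the same-core and cross-socket numbers gives a rough feel for what crossing QPI costs, which is exactly the breakdown jamal is asking for.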




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  5:18                       ` Eric Dumazet
@ 2010-04-16  6:02                         ` Changli Gao
  2010-04-16  6:28                           ` Tom Herbert
                                             ` (2 more replies)
  2010-04-16 13:21                         ` jamal
  1 sibling, 3 replies; 86+ messages in thread
From: Changli Gao @ 2010-04-16  6:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, Apr 16, 2010 at 1:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Friday 16 April 2010 at 07:56 +0800, Changli Gao wrote:
>> On Fri, Apr 16, 2010 at 4:16 AM, jamal <hadi@cyberus.ca> wrote:
>> >
>> > Sounds interesting.
>> > Wikipedia information overload. Any arch description of the HP9000?
>> > Did your scheme use IPIs to message the other CPUs?
>> >
>>
>> If you doubt the cost of smp_call_function_single(), how about trying
>> another patch of mine, which implements something similar to RPS but
>> uses kernel threads instead, so there is no explicit IPI:
>>
>> http://patchwork.ozlabs.org/patch/38319/
>>
>>
>
> Come on Changli.
>
> How do you wake up a thread on a remote cpu ?
>

A resched IPI, apparently. But it is entirely asynchronous, and its IRQ
handler is lighter.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  6:02                         ` Changli Gao
@ 2010-04-16  6:28                           ` Tom Herbert
  2010-04-16  6:32                           ` Eric Dumazet
  2010-04-16  7:15                           ` Andi Kleen
  2 siblings, 0 replies; 86+ messages in thread
From: Tom Herbert @ 2010-04-16  6:28 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, hadi, Rick Jones, David Miller, netdev, robert, andi

>> How do you wake up a thread on a remote cpu ?
>>
>
> A resched IPI, apparently. But it is entirely asynchronous, and its IRQ
> handler is lighter.
>
The IPI used in RPS is done asynchronously.

> --
> Regards,
> Changli Gao(xiaosuo@gmail.com)
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  6:02                         ` Changli Gao
  2010-04-16  6:28                           ` Tom Herbert
@ 2010-04-16  6:32                           ` Eric Dumazet
  2010-04-16 13:42                             ` jamal
  2010-04-16  7:15                           ` Andi Kleen
  2 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-16  6:32 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi

On Friday 16 April 2010 at 14:02 +0800, Changli Gao wrote:

> A resched IPI, apparently. But it is entirely asynchronous, and its IRQ
> handler is lighter.
> 

You still haven't answered the question, and your claims are grounded
not in hard facts but in your interpretation of the code.



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  6:02                         ` Changli Gao
  2010-04-16  6:28                           ` Tom Herbert
  2010-04-16  6:32                           ` Eric Dumazet
@ 2010-04-16  7:15                           ` Andi Kleen
  2010-04-16 13:27                             ` jamal
  2 siblings, 1 reply; 86+ messages in thread
From: Andi Kleen @ 2010-04-16  7:15 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, hadi, Rick Jones, David Miller, therbert, netdev,
	robert, andi

> > Come on Changli.
> >
> > How do you wake up a thread on a remote cpu ?
> >
> 
> A resched IPI, apparently. But it is entirely asynchronous, and its IRQ
> handler is lighter.

It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
that's been in the tree for a few releases. So it would surprise me if it made
much difference. In the old days, when there was only a single lock for
s_c_f(), perhaps...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  5:18                       ` Eric Dumazet
  2010-04-16  6:02                         ` Changli Gao
@ 2010-04-16 13:21                         ` jamal
  2010-04-16 13:34                           ` Changli Gao
  2010-04-17  7:35                           ` Eric Dumazet
  1 sibling, 2 replies; 86+ messages in thread
From: jamal @ 2010-04-16 13:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

[-- Attachment #1: Type: text/plain, Size: 1231 bytes --]

On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:

> 
> A kernel module could measure this, and it could be integrated into perf
> bench so that we can run regression tests on upcoming kernels.

Perf would be good - but even a softnet_stat counter, cleaner than the nasty
hack I use (attached), would be a good start; the ping test with and without
RPS gives me a ballpark number.

IPI cost is important to me because I tried this approach before and it
failed miserably. I was thinking the improvement may be due to the hardware
used, but I am having a hard time getting people to tell me what hardware
they used! I am old school - I need data ;-> The RFS patch commit seems to
have more info but is still vague, for example: 
"The benefits of RFS are dependent on cache hierarchy, application
load, and other factors"
Also, what does a "simple" or "complex" benchmark mean? ;->
I think it is only fair to get this info, no?

Please don't take what I say above as anti-RPS.
5 microseconds of extra latency is not bad if it can be amortized.
Unfortunately, the best traffic I could generate was < 20 Kpps of
ping, which still manages to trigger 1 IPI per packet on Nehalem. I am going
to write up some app (lots of cycles available tomorrow). I still think
it is valuable.

cheers,
jamal
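The kind of small traffic-generating app jamal mentions need not be elaborate; a minimal UDP blaster sketch (destination, port, and payload size are placeholders — for serious pps numbers pktgen or netperf would be the tool):

```python
import socket
import time

def blast_udp(dst="127.0.0.1", port=9, payload=b"x" * 64, count=100000):
    """Send `count` small UDP datagrams as fast as possible and return
    the achieved rate in packets per second. An unconnected sendto()
    is used so ICMP port-unreachable errors are not reported back."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.perf_counter()
    for _ in range(count):
        s.sendto(payload, (dst, port))
    elapsed = time.perf_counter() - start
    s.close()
    return count / elapsed
```

Pointing this at a receiver on an RPS-enabled box while watching received_rps and the ipi_rps counter from the attached hack would show how many IPIs get amortized per batch as the rate rises.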

[-- Attachment #2: p1 --]
[-- Type: text/x-patch, Size: 1551 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f8267fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -224,6 +224,7 @@ struct netif_rx_stats {
 	unsigned time_squeeze;
 	unsigned cpu_collision;
 	unsigned received_rps;
+	unsigned ipi_rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9867b6b..8c5dcb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/netdevice.h>
 
 static struct {
 	struct list_head	queue;
@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}
 
 	if (wait)
 		csd_lock_wait(data);
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..0bbbdcf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3563,10 +3563,12 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);
+	s->ipi_rps = 0;
+	s->received_rps = 0;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  7:15                           ` Andi Kleen
@ 2010-04-16 13:27                             ` jamal
  2010-04-16 13:37                               ` Andi Kleen
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-16 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Changli Gao, Eric Dumazet, Rick Jones, David Miller, therbert,
	netdev, robert

On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:

> > A resched IPI, apparently. But it is entirely asynchronous, and its IRQ
> > handler is lighter.
> 
> It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> that's in the tree for a few releases. So it would surprise me if it made
> much difference. In the old days when there was only a single lock for
> s_c_f() perhaps...

So you are saying that the old implementation of IPI (likely what I
tried pre-napi, and as recently as 2-3 years ago) was bad because of a
single lock?

BTW, I directed some questions to you earlier but didn't get a response;
to quote:
---
On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware, like
Xeon-based machines, or even Nehalem with rps going across QPI.
Here's why I think IPIs are bad, please correct me if I am wrong:
- they are synchronous, i.e. an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- the data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did I miss? Andi?
---

Do you know of any specs I could read that would tell me a little more?

cheers,
jamal



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:21                         ` jamal
@ 2010-04-16 13:34                           ` Changli Gao
  2010-04-16 13:49                             ` jamal
  2010-04-17  7:35                           ` Eric Dumazet
  1 sibling, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-16 13:34 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, Apr 16, 2010 at 9:21 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>
>>
>> A kernel module might do this, this could be integrated in perf bench so
>> that we can regression tests upcoming kernels.
>
> Perf would be good - but even softnet_stat cleaner than the the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
>
> IPI is important to me because having tried it before it and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example:
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
>
> Please dont consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tommorow). I still think
> it is valueable.
>

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
the function generic_exec_single(). If there isn't another user of
generic_exec_single(), received_rps should be equal to ipi_rps.

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct
call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:27                             ` jamal
@ 2010-04-16 13:37                               ` Andi Kleen
  2010-04-16 13:58                                 ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Andi Kleen @ 2010-04-16 13:37 UTC (permalink / raw)
  To: jamal
  Cc: Andi Kleen, Changli Gao, Eric Dumazet, Rick Jones, David Miller,
	therbert, netdev, robert

On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:
> On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:
> 
> > > resched IPI, apparently. But it is async absolutely. and its IRQ
> > > handler is lighter.
> > 
> > It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> > that's in the tree for a few releases. So it would surprise me if it made
> > much difference. In the old days when there was only a single lock for
> > s_c_f() perhaps...
> 
> So you are saying that the old implementation of IPI (likely what i
> tried pre-napi and as recent as 2-3 years ago) was bad because of a
> single lock?

Yes.

The old implementation of smp_call_function. Also, in the really old
days there was no smp_call_function_single(), so you tended to broadcast.

Jens did a lot of work on this for his block-device IPI work.

> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.

Nehalem is just fast. I don't know why it's fast in your specific
case. It might be simply because it has lots of bandwidth everywhere.
Atomic operations are also faster than on previous Intel CPUs.


> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).

In the hardware there's no ack, but in the Linux implementation there
usually is one (because the sender needs to know when it can free the
stack state used to pass the information).

However, there's also now support for queued IPIs
via a special API (I believe Tom is using that).

> - data cache has to be synced to main memory
> - the instruction pipeline is flushed

At least on Nehalem, data transfer can often be through the cache.

IPIs involve APIC accesses, which are not very fast (so overall
it's far more than a pipeline's worth of work), but it's still
not an incredibly expensive operation.

There's also X2APIC now, which should be slightly faster, but it's
likely not in your Nehalem (it is only in the high-end Xeon versions).

> Do you know any specs i could read up which will tell me a little more?

If you're just interested in IPI and cache line transfer performance it's
probably best to just measure it.

Some general information is always in the Intel optimization guide.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16  6:32                           ` Eric Dumazet
@ 2010-04-16 13:42                             ` jamal
  0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2010-04-16 13:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, 2010-04-16 at 08:32 +0200, Eric Dumazet wrote:
> Le vendredi 16 avril 2010 à 14:02 +0800, Changli Gao a écrit :
> 
> > resched IPI, apparently. But it is async absolutely. and its IRQ
> > handler is lighter.
> > 
> 
> You still dont answer to the question, and your claims are not grounded
> by hard facts, but by your interpretation of code.

My understanding of the current scheduler is that it does use IPIs to
migrate tasks around - so that's why things may be working for Changli,
i.e. it is scheduler magic if you use kthreads. It is hard to say if
this would work better...

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:34                           ` Changli Gao
@ 2010-04-16 13:49                             ` jamal
  2010-04-16 14:10                               ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-16 13:49 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:

> 
> +	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
>  		   s->total, s->dropped, s->time_squeeze, 0,
>  		   0, 0, 0, 0, /* was fastroute */
> -		   s->cpu_collision, s->received_rps);
> +		   s->cpu_collision, s->received_rps, s->ipi_rps);
> 
> Do you mean that received_rps is equal to ipi_rps? received_rps is the
> number of IPI used by RPS. And ipi_rps is the number of IPIs sent by
> function generic_exec_single(). If there isn't other user of
> generic_exec_single(), received_rps should be equal to ipi_rps.
> 

my observation is:
s->total is the sum of all packets received by a cpu (some directly
from ethernet);
s->received_rps is the count the receiving cpu saw of packets that
were sent to it by another cpu;
s->ipi_rps is the number of times we tried to enqueue to a remote cpu,
found its queue empty, and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without
generating an IPI. What did I miss?

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:37                               ` Andi Kleen
@ 2010-04-16 13:58                                 ` jamal
  0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2010-04-16 13:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Changli Gao, Eric Dumazet, Rick Jones, David Miller, therbert,
	netdev, robert

On Fri, 2010-04-16 at 15:37 +0200, Andi Kleen wrote:
> On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:

> > So you are saying that the old implementation of IPI (likely what i
> > tried pre-napi and as recent as 2-3 years ago) was bad because of a
> > single lock?
> 
> Yes.

> The old implementation of smp_call_function. Also in the really old
> days there was no smp_call_function_single() so you tended to broadcast.
> 
> Jens did a lot of work on this for his block device work IPI implementation.

Nice - thanks for that info! So not only has the h/ware improved, but
the implementation as well..

> > On IPIs:
> > Is anyone familiar with what is going on with Nehalem? Why is it this
> > good? I expect things will get a lot nastier with other hardware like
> > xeon based or even Nehalem with rps going across QPI.
> 
> Nehalem is just fast. I don't know why it's fast in your specific
> case. It might be simply because it has lots of bandwidth everywhere.
> Atomic operations are also faster than on previous Intel CPUs.

Well, the cache architecture is nicer. The on-die MC is nice. No more
shared MC hub/FSB. The 3 MC channels are nice. Intel finally beating
AMD ;-> Someone did a measurement of the memory timings (L1, L2, L3, MM)
and the results were impressive; I have the numbers somewhere.

> 
> > Here's why i think IPIs are bad, please correct me if i am wrong:
> > - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> > is in the form of an IPI).
> 
> In the hardware there's no ack, but in the Linux implementation there
> is usually (because need to know when to free the stack state used
> to pass information)
>
> However there's also now support for queued IPI
> with a special API (I believe Tom is using that)
> 

Which is the non-queued-IPI call?

> > - data cache has to be synced to main memory
> > - the instruction pipeline is flushed
> 
> At least on Nehalem data transfer can be often through the cache.

I thought you had to go all the way to MM in the case of IPIs.

> IPIs involve APIC accesses which are not very fast (so overall
> it's far more than a pipeline worth of work), but it's still
> not a incredible expensive operation.
> 
> There's also X2APIC now which should be slightly faster, but it's 
> likely not in your Nehalem (this is only in the highend Xeon versions)
> 

Ok, true - forgot about the APIC as well...

> > Do you know any specs i could read up which will tell me a little more?
> 
> If you're just interested in IPI and cache line transfer performance it's
> probably best to just measure it.

There are tools like benchit which would give me L1/L2/L3/MM
measurements; for IPIs, the ping + rps test I did may be sufficient.

> Some general information is always in the Intel optimization guide.

Thanks Andi!

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:49                             ` jamal
@ 2010-04-16 14:10                               ` Changli Gao
  2010-04-16 14:43                                 ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-16 14:10 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
>
> my observation is:
> s->total is the sum of all packets received by cpu (some directly from
> ethernet)

It is meaningless currently. If rps is enabled, it may be twice the
number of packets received, because one packet may be counted twice:
once in enqueue_to_backlog(), and again in __netif_receive_skb(). I
had posted a patch to solve this problem.

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for
the total number.

> s->received_rps was what the count receiver cpu saw incoming if they
> were sent by another cpu.

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
        struct softnet_data *queue = data;
        __napi_schedule(&queue->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

The function above is called in the hardirq handler of the IPI; it
counts the number of IPIs received. It is actually the ipi_rps you need.

> s-> ipi_rps is the times we tried to enq to remote cpu but found it to
> be empty and had to send an IPI.
> ipi_rps can be < received_rps if we receive > 1 packet without
> generating an IPI. What did i miss?
>


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 14:10                               ` Changli Gao
@ 2010-04-16 14:43                                 ` jamal
  2010-04-16 14:58                                   ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-16 14:43 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

[-- Attachment #1: Type: text/plain, Size: 1331 bytes --]

On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:

> > my observation is:
> > s->total is the sum of all packets received by cpu (some directly from
> > ethernet)
> 
> It is meaningless currently. If rps is enabled, it may be twice of the
> number of the packets received, because one packet may be count twice:
> one in enqueue_to_backlog(), and the other in __netif_receive_skb(). 

You are probably right - you made me look at my collected data ;->
I will look more closely later, but it seems they are accounting for
different cpus, no?
For example, attached are some of the stats I captured when I was
running the tests redirecting 1M packets from CPU0 to CPU1 at about
20Kpps (just cut to the first and last two columns):

cpu   Total     |rps_recv |rps_ipi
-----+----------+---------+---------
cpu0 | 002dc7f1 |00000000 |000f4246
cpu1 | 002dc804 |000f4240 |00000000
-------------------------------------

So: cpu0 received 0x2dc7f1 pkts accumulated over time and redirected
them to cpu1 (mostly; the few extra may be leftovers, since I clear
the data), and during the test it generated an IPI 0xf4246 times. It
can be seen that the running total for CPU1 is 0x2dc804, but in this
one run it received 1M packets (0xf4240).
I.e. I don't see the double accounting..

cheers,
jamal

[-- Attachment #2: st1 --]
[-- Type: text/plain, Size: 792 bytes --]

002dc7f1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4246
002dc804 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4240 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 14:43                                 ` jamal
@ 2010-04-16 14:58                                   ` Changli Gao
  2010-04-19 12:48                                     ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-16 14:58 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Fri, Apr 16, 2010 at 10:43 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
>> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
>
>> > my observation is:
>> > s->total is the sum of all packets received by cpu (some directly from
>> > ethernet)
>>
>> It is meaningless currently. If rps is enabled, it may be twice of the
>> number of the packets received, because one packet may be count twice:
>> one in enqueue_to_backlog(), and the other in __netif_receive_skb().
>
> You are probably right - you made me look at my collected data ;->
> i will look closely later, but it seems they are accounting for
> different cpus, no?
> Example, attached are some of the stats i captured when i was running
> the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just
> cut to the first and last two columns):
>
> cpu   Total     |rps_recv |rps_ipi
> -----+----------+---------+---------
> cpu0 | 002dc7f1 |00000000 |000f4246
> cpu1 | 002dc804 |000f4240 |00000000
> -------------------------------------
>
> So: cpu0 receive 0x2dc7f1 pkts accummulative over time and
> redirected to cpu1 (mostly, the extra 5 maybe to leftover since i clear
> the data) and for the test 0xf4246 times it generated an IPI. It can be
> seen that total running for CPU1 is 0x2dc804 but in this one run it
> received 1M packets (0xf4240).

I remember you redirected all the traffic from cpu0 to cpu1, and the
data shows:

about 0x2dc7f1 packets were processed, and about 0xf4240 IPIs were
generated.

> i.e i dont see the double accounting..
>

A single packet is counted twice: once by CPU0 and once by CPU1. If you
change the RPS setting with:

echo 1 > ..../rps_cpus

you will find the total numbers are doubled.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-14 18:53           ` jamal
  2010-04-14 19:44             ` Stephen Hemminger
  2010-04-15  8:48             ` David Miller
@ 2010-04-16 15:57             ` Tom Herbert
  2 siblings, 0 replies; 86+ messages in thread
From: Tom Herbert @ 2010-04-16 15:57 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao, Andi Kleen

> It would be valuable to have something like Documentation/networking/rps
> to detail things a little more.
>

Working on it.  Will try to post data for several platforms soon.

> cheers,
> jamal
>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 13:21                         ` jamal
  2010-04-16 13:34                           ` Changli Gao
@ 2010-04-17  7:35                           ` Eric Dumazet
  2010-04-17  8:43                             ` Tom Herbert
  2010-04-17 17:31                             ` rps perfomance WAS(Re: rps: question jamal
  1 sibling, 2 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-17  7:35 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

Le vendredi 16 avril 2010 à 09:21 -0400, jamal a écrit :
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
> 
> > 
> > A kernel module might do this, this could be integrated in perf bench so
> > that we can regression tests upcoming kernels.
> 
> Perf would be good - but even softnet_stat cleaner than the the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
> 
> IPI is important to me because having tried it before it and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example: 
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
> 
> Please dont consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tommorow). I still think
> it is valueable.

I did some tests on a dual quad-core machine (E5450 @ 3.00GHz), not
Nehalem. So a 3-4 year old design.

For each test, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of the target are busy, eating cpu cycles
in user land. I don't want to tweak acpi or whatever smart power-saving
mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queueing the packet onto our own queue
(netif_receive_skb -> enqueue_to_backlog) is about 0.74 us
(74 ms / 100000).

I personally think we should process the packet instead of queueing it,
but Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So the extra cost to enqueue to a remote cpu queue (IPI, softirq
handling...) is 3 us. Note this cost is for the case where we receive a
single packet.

I suspect the IPI itself is in the 1.5 us range, not very far from the
queueing-to-ourselves case.

For me RPS use cases are :

1) Value-added apps handling lots of TCP data, where the cost of cache
misses in the tcp stack easily justifies spending 3 us to gain much
more.

2) A network appliance, where a single cpu is filled 100% handling one
device's hardware and software/RPS interrupts, delegating all
higher-level work to a pool of cpus.

I'll try to do these tests on a Nehalem target.




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17  7:35                           ` Eric Dumazet
@ 2010-04-17  8:43                             ` Tom Herbert
  2010-04-17  9:23                               ` Eric Dumazet
  2010-04-17 14:17                               ` [PATCH net-next-2.6] net: remove time limit in process_backlog() Eric Dumazet
  2010-04-17 17:31                             ` rps perfomance WAS(Re: rps: question jamal
  1 sibling, 2 replies; 86+ messages in thread
From: Tom Herbert @ 2010-04-17  8:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi

> So the cost of queing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
>
> I personally think we should process packet instead of queeing it, but
> Tom disagree with me.
>
You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing; remember, the IPIs are only sent at the end of the NAPI
poll. So unless the upper-stack processing is <0.74us in your case, I
think processing packets directly on the local queue would improve
best-case latency, but would increase average latency, and even more
likely worst-case latency, on loads with multiple flows.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
>
> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.
>
> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.
>
> For me RPS use cases are :
>
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
>
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
>
> I'll try to do these tests on a Nehalem target.
>
>
>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17  8:43                             ` Tom Herbert
@ 2010-04-17  9:23                               ` Eric Dumazet
  2010-04-17 14:27                                 ` Eric Dumazet
  2010-04-17 14:17                               ` [PATCH net-next-2.6] net: remove time limit in process_backlog() Eric Dumazet
  1 sibling, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-17  9:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi

Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
> > So the cost of queing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> >
> > I personally think we should process packet instead of queeing it, but
> > Tom disagree with me.
> >
> You could do that, but then the packet processing becomes HOL blocking
> on all the packets that are being sent to other queues for
> processing-- remember the IPIs is only sent at the end of the NAPI.
> So unless the upper stack processing is <0.74us in your case, I think
> processing packets directly on the local queue would improve best case
> latency, but would increase average latency and even more likely worse
> case latency on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should review how many cache
lines we exchange per skb, and try to reduce that number.




^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH net-next-2.6] net: remove time limit in process_backlog()
  2010-04-17  8:43                             ` Tom Herbert
  2010-04-17  9:23                               ` Eric Dumazet
@ 2010-04-17 14:17                               ` Eric Dumazet
  2010-04-18  9:36                                 ` David Miller
  1 sibling, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-17 14:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

- There is no point in enforcing a time limit in process_backlog(),
since other napi instances don't follow the same rule; with it, we can
exit after only one packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note: /proc/sys/net/core/dev_weight can be used to tune this default
value of 64.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..8092f01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -264,7 +264,7 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 #ifdef CONFIG_LOCKDEP
@@ -3232,7 +3232,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	unsigned long start_time = jiffies;
 
 	napi->weight = weight_p;
 	do {
@@ -3252,7 +3251,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_enable();
 
 		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+	} while (++work < quota);
 
 	return work;
 }



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17  9:23                               ` Eric Dumazet
@ 2010-04-17 14:27                                 ` Eric Dumazet
  2010-04-17 17:26                                   ` Tom Herbert
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-17 14:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi

Le samedi 17 avril 2010 à 11:23 +0200, Eric Dumazet a écrit :
> Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
> > > So the cost of queing the packet into our own queue (netif_receive_skb
> > > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > >
> > > I personally think we should process packet instead of queeing it, but
> > > Tom disagree with me.
> > >
> > You could do that, but then the packet processing becomes HOL blocking
> > on all the packets that are being sent to other queues for
> > processing-- remember the IPIs is only sent at the end of the NAPI.
> > So unless the upper stack processing is <0.74us in your case, I think
> > processing packets directly on the local queue would improve best case
> > latency, but would increase average latency and even more likely worse
> > case latency on loads with multiple flows.


Tom, I am not sure what you describe is even respected for NAPI devices.
(I hope you use napi devices in your company ;) )

If we enqueue an skb to the backlog, we also link our backlog napi onto
our poll_list, if it's not already there.

So the loop in net_rx_action() will make us handle our backlog napi a
bit after this network device's napi (if the time limit of 2 jiffies has
not elapsed) and *before* sending IPIs to remote cpus anyway.





^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17 14:27                                 ` Eric Dumazet
@ 2010-04-17 17:26                                   ` Tom Herbert
  0 siblings, 0 replies; 86+ messages in thread
From: Tom Herbert @ 2010-04-17 17:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi

> Tom, I am not sure what you describe is even respected for NAPI devices.
> (I hope you use napi devices in your company ;) )
>
> If we enqueue a skb to backlog, we also link our backlog napi into our
> poll_list, if not already there.
>
> So the loop in net_rx_action() will make us handle our backlog napi a
> bit after this network device napi (if time limit of 2 jiffies not
> elapsed) and *before* sending IPIS to remote cpus anyway.
>
Then I think that's a bug you've identified ;-)

>
>
>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17  7:35                           ` Eric Dumazet
  2010-04-17  8:43                             ` Tom Herbert
@ 2010-04-17 17:31                             ` jamal
  2010-04-18  9:39                               ` Eric Dumazet
  1 sibling, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-17 17:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:

> I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> nehalem. So a 3-4 years old design.

Eric, I thank you kind sir for going out of your way to do this - it is
certainly a good processor to compare against 

> For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> 192.168.0.2". Yes ping is not very good, but its available ;)

It is a reasonably quick test, no fancy setup required ;->

> Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> user land. 

I didn't keep the cpus busy. I should re-run with such a setup; any
specific app that you used to keep them busy? Keeping them busy could
have consequences; I am speculating you probably ended up with a
greater-than-one packet/IPI ratio, i.e. an amortization benefit..
  
> I dont want to tweak acpi or whatever smart power saving
> mechanisms.

I should mention I turned off ACPI as well in the BIOS; it was consuming
more cpu cycles than net-processing and was interfering with my tests.

> When RPS off
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> 
> RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> 
> > So the cost of queuing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> 

Excellent analysis.

> I personally think we should process the packet instead of queueing it,
> but Tom disagrees with me.

Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean toward agreeing with Tom, but maybe go one step further
(sans packet-reordering): we should never process packets up to the
socket layer on the demuxing cpu.
Enqueue everything you receive on a different cpu - so somehow the
receiving cpu becomes part of the hashing decision ...

The reason is derived from queueing theory - of which I know dangerously
little - but refer you to mr. little his-self[1] (pun fully
intended;->):
i.e. a fixed service time provides more predictable results, as opposed
to once in a while a spike as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds to
variability, but is not as bad as processing to completion up to the
socket layer.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

Good test - it should be the worst-case scenario. But there are two other
scenarios which will give different results in my opinion.
On your setup I think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within the same
die and across dies within the same socket. If I am not mistaken, the
mapping would be something like socket0/die0{core0/2},
socket0/die1{core4/6}, socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles, can you try the same socket+die but different cores,
and the same socket but different die test?

> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.

Which is not too bad if amortized. Were you able to check how many
packets you processed per IPI? One way to achieve that is just standard
ping. On the Nehalem my number for going to a different core was in the
range of a 5 microsecond effect on RTT when the system was not busy. I
think it would be higher going across QPI.

> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.

Sounds about right - maybe 2 us in my case. I am still mystified by
"what damage does an IPI do?" to the system harmony. I have to do some
reading. Andi mentioned the APIC connection - but my gut feeling is you
probably end up going to main memory and invalidating caches.

> For me RPS use cases are :
> 
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
> 
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
> 

Agreed on both. 
The caveats to note:
- what hardware would be reasonable
- within the same hardware, what setups would be good to use
- when it doesn't benefit even with everything correct (e.g. low tcp
throughput)

> I'll try to do these tests on a Nehalem target.

Thanks again Eric.

cheers,
jamal 

[1]http://en.wikipedia.org/wiki/Little's_law


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] net: remove time limit in process_backlog()
  2010-04-17 14:17                               ` [PATCH net-next-2.6] net: remove time limit in process_backlog() Eric Dumazet
@ 2010-04-18  9:36                                 ` David Miller
  0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-18  9:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 17 Apr 2010 16:17:02 +0200

> - There is no point in enforcing a time limit in process_backlog(), since
> other napi instances don't follow the same rule. We can exit after only one
> packet processed...
> The normal quota of 64 packets per napi instance should be the norm, and
> net_rx_action() already has its own time limit.
> Note : /proc/net/core/dev_weight can be used to tune this 64 default
> value.
> 
> - Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Yep, doing this time limit at two levels is pointless.

Applied, thanks Eric!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-17 17:31                             ` rps perfomance WAS(Re: rps: question jamal
@ 2010-04-18  9:39                               ` Eric Dumazet
  2010-04-18 11:34                                 ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-18  9:39 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Sat, 2010-04-17 at 13:31 -0400, jamal wrote:
> On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:
> 
> > I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> > nehalem. So a 3-4 years old design.
> 
> Eric, I thank you kind sir for going out of your way to do this - it is
> certainly a good processor to compare against 
> 
> > For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> > 192.168.0.2". Yes ping is not very good, but its available ;)
> 
> It is a reasonably quick test, no fancy setup required ;->
> 
> > Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> > user land. 
> 
> I didn't keep the cpus busy. I should re-run with such a setup; any
> specific app that you used to keep them busy? Keeping them busy could
> have consequences; I am speculating you probably ended up with a
> greater-than-one packet/IPI ratio, i.e. an amortization benefit..

No, only one packet per IPI: since I set my tg3 coalescing parameters
to the minimum value, I received one packet per interrupt.

The specific app is :

for f in `seq 1 8`; do while :; do :; done& done


>   
> > I dont want to tweak acpi or whatever smart power saving
> > mechanisms.
> 
> I should mention I turned off ACPI as well in the BIOS; it was consuming
> more cpu cycles than net-processing and was interfering with my tests.
> 
> > When RPS off
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> > 
> > RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> > (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> > 
> > So the cost of queuing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > 
> 
> Excellent analysis.
> 
> > I personally think we should process the packet instead of queueing it,
> > but Tom disagrees with me.
> 
> Sorry - I am gonna have to turn on some pedagogy and offer my
> Canadian 2 cents;->
> I would lean toward agreeing with Tom, but maybe go one step further
> (sans packet-reordering): we should never process packets up to the
> socket layer on the demuxing cpu.
> Enqueue everything you receive on a different cpu - so somehow the
> receiving cpu becomes part of the hashing decision ...
> 
> The reason is derived from queueing theory - of which I know dangerously
> little - but refer you to mr. little his-self[1] (pun fully
> intended;->):
> i.e. a fixed service time provides more predictable results, as opposed
> to once in a while a spike as you receive packets destined to "our cpu".
> Queueing packets and later allocating cycles to processing them adds to
> variability, but is not as bad as processing to completion up to the
> socket layer.
> 
> > RPS on, directed on cpu1 (other socket)
> > (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
> 
> Good test - it should be the worst-case scenario. But there are two other
> scenarios which will give different results in my opinion.
> On your setup I think each socket has two dies, each with two cores. So
> my feeling is you will get different numbers if you go within the same
> die and across dies within the same socket. If I am not mistaken, the
> mapping would be something like socket0/die0{core0/2},
> socket0/die1{core4/6}, socket1/die0{core1/3}, socket1/die1{core5/7}.
> If you have cycles, can you try the same socket+die but different cores,
> and the same socket but different die test?

Sure, lets redo a full test, taking lowest time of three ping runs


echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4151ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4254ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4458ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4327ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4571ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4472ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4568ms


# egrep "physical id|core|apicid" /proc/cpuinfo 
physical id	: 0
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0

physical id	: 1
core id		: 0
cpu cores	: 4
apicid		: 4
initial apicid	: 4

physical id	: 0
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2

physical id	: 1
core id		: 2
cpu cores	: 4
apicid		: 6
initial apicid	: 6

physical id	: 0
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1

physical id	: 1
core id		: 1
cpu cores	: 4
apicid		: 5
initial apicid	: 5

physical id	: 0
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3

physical id	: 1
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-18  9:39                               ` Eric Dumazet
@ 2010-04-18 11:34                                 ` Eric Dumazet
  2010-04-19  2:09                                   ` jamal
                                                     ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-18 11:34 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Sun, 2010-04-18 at 11:39 +0200, Eric Dumazet wrote:
> No, only one packet per IPI: since I set my tg3 coalescing parameters
> to the minimum value, I received one packet per interrupt.
> 
> The specific app is :
> 
> for f in `seq 1 8`; do while :; do :; done& done
> 

Another interesting userland app would be a cpu _and_ memory
cruncher, because of the cache misses we'll get.

$ cat nloop.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4*1024*1024

int main(int argc, char *argv[])
{
	int nproc = 8;
	char *buffer;

	if (argc > 1)
		nproc = atoi(argv[1]);
	/* fork nproc-1 children; every process runs the memset loop below */
	while (nproc > 1) {
		if (fork() == 0)
			break;
		nproc--;
	}
	buffer = malloc(SZ);
	/* endlessly rewrite a 4MB buffer to generate cache/memory pressure */
	while (1)
		memset(buffer, 0x55, SZ);
}

$ ./nloop 8 &

echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4861ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4981ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7191ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7128ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7107ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
5505ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7125ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7022ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7157ms


Maximum overhead is (7191 - 4861) ms / 100000 packets = 23.3 us per packet




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-18 11:34                                 ` Eric Dumazet
@ 2010-04-19  2:09                                   ` jamal
  2010-04-19  9:37                                   ` [RFC] rps: shortcut net_rps_action() Eric Dumazet
  2010-04-20 12:02                                   ` rps perfomance WAS(Re: rps: question jamal
  2 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2010-04-19  2:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

[-- Attachment #1: Type: text/plain, Size: 555 bytes --]


Thanks Eric. I tried to visualize your results - attached.
There are 2-3 odd numbers (labelled with *) but other
than that results are as expected...

I did run some experiments with a udp sink server
and I saw the IPIs amortized; unfortunately the sky2 hardware
proved to be the bottleneck (at > 750Kpps incoming, it started
dropping and wasn't recording the drops, so I had to slow things down).
I need to digest my results a little more - but it seems I was getting
better throughput results with RPS (i.e. it was able to sink
more packets)..

cheers,
jamal

[-- Attachment #2: erichw.pdf --]
[-- Type: application/pdf, Size: 187023 bytes --]

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [RFC] rps: shortcut net_rps_action()
  2010-04-18 11:34                                 ` Eric Dumazet
  2010-04-19  2:09                                   ` jamal
@ 2010-04-19  9:37                                   ` Eric Dumazet
  2010-04-19  9:48                                     ` Changli Gao
  2010-04-20 12:02                                   ` rps perfomance WAS(Re: rps: question jamal
  2 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-19  9:37 UTC (permalink / raw)
  To: Tom Herbert, David Miller; +Cc: netdev

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

I add a flag to scan cpumask only if at least one IPI was scheduled.
Even cpumask_weight() might be expensive on some setups, where
nr_cpumask_bits could be very big (4096 for example)

Move all RPS logic into net_rps_action() to clean up net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    5 +-
 net/core/dev.c            |   73 ++++++++++++++++--------------------
 2 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..283d3ef 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1389,8 +1389,11 @@ struct softnet_data {
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	unsigned int		rps_ipis_scheduled;
+	unsigned int		rps_select;
+	cpumask_t		rps_mask[2];
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 	unsigned int		input_queue_head;
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..3e6e420 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2347,19 +2347,14 @@ done:
 }
 
 /*
- * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * softnet_data holds the per-CPU mask of CPUs for which IPIs are scheduled
  * to be sent to kick remote softirq processing.  There are two masks since
- * the sending of IPIs must be done with interrupts enabled.  The select field
+ * the sending of IPIs must be done with interrupts enabled.  The rps_select field
  * indicates the current mask that enqueue_backlog uses to schedule IPIs.
  * select is flipped before net_rps_action is called while still under lock,
  * net_rps_action then uses the non-selected mask to send the IPIs and clears
  * it without conflicting with enqueue_backlog operation.
  */
-struct rps_remote_softirq_cpus {
-	cpumask_t mask[2];
-	int select;
-};
-static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
 
 /* Called from hardirq (IPI) context */
 static void trigger_softirq(void *data)
@@ -2403,10 +2398,10 @@ enqueue:
 		if (napi_schedule_prep(&queue->backlog)) {
 #ifdef CONFIG_RPS
 			if (cpu != smp_processor_id()) {
-				struct rps_remote_softirq_cpus *rcpus =
-				    &__get_cpu_var(rps_remote_softirq_cpus);
+				struct softnet_data *myqueue = &__get_cpu_var(softnet_data);
 
-				cpu_set(cpu, rcpus->mask[rcpus->select]);
+				cpu_set(cpu, myqueue->rps_mask[myqueue->rps_select]);
+				myqueue->rps_ipis_scheduled = 1;
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 				goto enqueue;
 			}
@@ -2911,7 +2906,9 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
-/* Network device is going away, flush any packets still pending  */
+/* Network device is going away, flush any packets still pending
+ * Called with irqs disabled.
+ */
 static void flush_backlog(void *arg)
 {
 	struct net_device *dev = arg;
@@ -3340,24 +3337,36 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
-#ifdef CONFIG_RPS
 /*
- * net_rps_action sends any pending IPI's for rps.  This is only called from
- * softirq and interrupts must be enabled.
+ * net_rps_action sends any pending IPI's for rps.
+ * Note: called with local irq disabled, but exits with local irq enabled.
  */
-static void net_rps_action(cpumask_t *mask)
+static void net_rps_action(void)
 {
-	int cpu;
+#ifdef CONFIG_RPS
+	if (percpu_read(softnet_data.rps_ipis_scheduled)) {
+		struct softnet_data *queue = &__get_cpu_var(softnet_data);
+		int cpu, select = queue->rps_select;
+		cpumask_t *mask;
+		
+		queue->rps_ipis_scheduled = 0;
+		queue->rps_select ^= 1;
 
-	/* Send pending IPI's to kick RPS processing on remote cpus. */
-	for_each_cpu_mask_nr(cpu, *mask) {
-		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
-		if (cpu_online(cpu))
-			__smp_call_function_single(cpu, &queue->csd, 0);
-	}
-	cpus_clear(*mask);
-}
+		local_irq_enable();
+
+		mask = &queue->rps_mask[select];
+
+		/* Send pending IPI's to kick RPS processing on remote cpus. */
+		for_each_cpu_mask_nr(cpu, *mask) {
+			struct softnet_data *remqueue = &per_cpu(softnet_data, cpu);
+			if (cpu_online(cpu))
+				__smp_call_function_single(cpu, &remqueue->csd, 0);
+		}
+		cpus_clear(*mask);
+	} else
 #endif
+		local_irq_enable();
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3365,10 +3374,6 @@ static void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
-#ifdef CONFIG_RPS
-	int select;
-	struct rps_remote_softirq_cpus *rcpus;
-#endif
 
 	local_irq_disable();
 
@@ -3431,17 +3436,7 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
-#ifdef CONFIG_RPS
-	rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
-	select = rcpus->select;
-	rcpus->select ^= 1;
-
-	local_irq_enable();
-
-	net_rps_action(&rcpus->mask[select]);
-#else
-	local_irq_enable();
-#endif
+	net_rps_action();
 
 #ifdef CONFIG_NET_DMA
 	/*



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [RFC] rps: shortcut net_rps_action()
  2010-04-19  9:37                                   ` [RFC] rps: shortcut net_rps_action() Eric Dumazet
@ 2010-04-19  9:48                                     ` Changli Gao
  2010-04-19 12:14                                       ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-19  9:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, David Miller, netdev

On Mon, Apr 19, 2010 at 5:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
> RPS is not active.
>
> I add a flag to scan cpumask only if at least one IPI was scheduled.
> Even cpumask_weight() might be expensive on some setups, where
> nr_cpumask_bits could be very big (4096 for example)

How about using an array to save the cpu IDs? The number of CPUs to
which the IPI will be sent should be small.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [RFC] rps: shortcut net_rps_action()
  2010-04-19  9:48                                     ` Changli Gao
@ 2010-04-19 12:14                                       ` Eric Dumazet
  2010-04-19 12:28                                         ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-19 12:14 UTC (permalink / raw)
  To: Changli Gao; +Cc: Tom Herbert, David Miller, netdev

On Mon, 2010-04-19 at 17:48 +0800, Changli Gao wrote:
> On Mon, Apr 19, 2010 at 5:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
> > RPS is not active.
> >
> > I add a flag to scan cpumask only if at least one IPI was scheduled.
> > Even cpumask_weight() might be expensive on some setups, where
> > nr_cpumask_bits could be very big (4096 for example)
> 
> How about using an array to save the cpu IDs? The number of CPUs to
> which the IPI will be sent should be small.
> 

Yes, it should be small, yet the two arrays would be big enough to make
softnet_data's first part use at least two cache lines instead of one,
even in the case where we handle one cpu/IPI per net_rps_action().

As several packets can be enqueued for a given cpu, we would need to
keep bitmasks.
We would have to add one test in enqueue_to_backlog()

if (!cpu_test_and_set(cpu, mask)) {
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	array[nb++] = cpu;
}




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [RFC] rps: shortcut net_rps_action()
  2010-04-19 12:14                                       ` Eric Dumazet
@ 2010-04-19 12:28                                         ` Changli Gao
  2010-04-19 13:27                                           ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-19 12:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, David Miller, netdev

On Mon, Apr 19, 2010 at 8:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> As several packets can be enqueued for a given cpu, we would need to
> keep bitmasks.
> We would have to add one test in enqueue_to_backlog()
>
> if (cpu_test_and_set(cpu, mask)) {
>        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>        array[nb++] = cpu;
> }

        rps_lock(queue);
        if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
                if (queue->input_pkt_queue.qlen) {
...
                if (napi_schedule_prep(&queue->backlog)) {
#ifdef CONFIG_RPS
                        if (cpu != smp_processor_id()) {
                                struct rps_remote_softirq_cpus *rcpus =
                                    &__get_cpu_var(rps_remote_softirq_cpus);

                                cpu_set(cpu, rcpus->mask[rcpus->select]);
                                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                                goto enqueue;
                        }
#endif
                        __napi_schedule(&queue->backlog);
                }

Only the first packet of a softnet.input_pkt_queue may trigger an IPI, so
we don't need to keep bitmasks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-16 14:58                                   ` Changli Gao
@ 2010-04-19 12:48                                     ` jamal
  0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2010-04-19 12:48 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi


Sorry, I didn't respond to you - I was busy setting up before trying
to think a little more about this..

On Fri, 2010-04-16 at 22:58 +0800, Changli Gao wrote:

> >
> > cpu   Total     |rps_recv |rps_ipi
> > -----+----------+---------+---------
> > cpu0 | 002dc7f1 |00000000 |000f4246
> > cpu1 | 002dc804 |000f4240 |00000000
> > -------------------------------------
> >
> > So: cpu0 received 0x2dc7f1 pkts accumulated over time and
> > redirected to cpu1 (mostly; the extra 5 are maybe leftover since I clear
> > the data) and for the test 0xf4246 times it generated an IPI. It can be
> > seen that total running for CPU1 is 0x2dc804 but in this one run it
> > received 1M packets (0xf4240).
> 
> I remeber you redirected all the traffic from cpu0 to cpu1, and the data shows:
> 
> about 0x2dc7f1 packets are processed, and about 0xf4240 IPI are generated.

If you look at the patch, I am zeroing those stats - so 0xf4240 is only
one test (decimal 1M). I think there is something to what you are
saying; rps_ipi on cpu0 is ambigous because it counts the number of
times cpu0 softirq was scheduled as well as the number of times cpu0
scheduled other cpus. 
The extra six for cpu0 turn out to be the times an ethernet interrupt
scheduled the cpu0 softirq.

> a single packet is counted twice by CPU0 and CPU1. 

Well, the counts have different meanings; rps_ipi applies to source cpu
activity and rps_recv applies to destination. For example, if cpu0 in
total 6 times found some destination cpu's queue to be empty, and 2 of
those each happened to be on cpu1, cpu2, and cpu3, then
cpu1: rps_recv = 2
cpu2: rps_recv = 2
cpu3: rps_recv = 2


> If you change RPS setting by:
> 
> echo 1 > ..../rps_cpus
> 
> you will find the total number are doubled.

This is true. But IMO it is deserved and should be double counted.
It is just more fine-grained accounting.
IOW, I am not sure we need your patch because we would lose the
fine-grained accounting - and mine requires more work to be less ambiguous.

cheers,
jamal 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [RFC] rps: shortcut net_rps_action()
  2010-04-19 12:28                                         ` Changli Gao
@ 2010-04-19 13:27                                           ` Eric Dumazet
  2010-04-19 14:22                                             ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-19 13:27 UTC (permalink / raw)
  To: Changli Gao; +Cc: Tom Herbert, David Miller, netdev

On Mon, 2010-04-19 at 20:28 +0800, Changli Gao wrote:
> On Mon, Apr 19, 2010 at 8:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > As several packets can be enqueued for a given cpu, we would need to
> > keep bitmasks.
> > We would have to add one test in enqueue_to_backlog()
> >
> > if (!cpu_test_and_set(cpu, mask)) {
> >        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >        array[nb++] = cpu;
> > }
> 
>         rps_lock(queue);
>         if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
>                 if (queue->input_pkt_queue.qlen) {
> ...
>                 if (napi_schedule_prep(&queue->backlog)) {
> #ifdef CONFIG_RPS
>                         if (cpu != smp_processor_id()) {
>                                 struct rps_remote_softirq_cpus *rcpus =
>                                     &__get_cpu_var(rps_remote_softirq_cpus);
> 
>                                 cpu_set(cpu, rcpus->mask[rcpus->select]);
>                                 __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>                                 goto enqueue;
>                         }
> #endif
>                         __napi_schedule(&queue->backlog);
>                 }
> 
> Only the first packet of a softnet.input_pkt_queue may trigger IPI, so
> we don't need to keep bitmasks.
> 

This is not true, Changli.

Please read again all previous mails about RPS, or the code.




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [RFC] rps: shortcut net_rps_action()
  2010-04-19 13:27                                           ` Eric Dumazet
@ 2010-04-19 14:22                                             ` Eric Dumazet
  2010-04-19 15:07                                               ` [PATCH net-next-2.6] " Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-19 14:22 UTC (permalink / raw)
  To: Changli Gao; +Cc: Tom Herbert, David Miller, netdev

On Mon, 2010-04-19 at 15:27 +0200, Eric Dumazet wrote:

> This is not true Changli
> 
> Please read again all previous mails about RPS, or the code.
> 

Hmm, I just read again, and I now remember Tom used a single bitmap,
then we had to add a second set because of a possible race.

A list would be enough.




^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-19 14:22                                             ` Eric Dumazet
@ 2010-04-19 15:07                                               ` Eric Dumazet
  2010-04-19 16:02                                                 ` Tom Herbert
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-19 15:07 UTC (permalink / raw)
  To: Changli Gao, David Miller, Tom Herbert; +Cc: netdev

On Mon, 2010-04-19 at 16:22 +0200, Eric Dumazet wrote:

> 
> Hmm, I just read again, and I now remember Tom used a single bitmap,
> then we had to add a second set because of a possible race.
> 
> A list would be enough.
> 

Here is the updated patch, using a single list instead of a bitmap.

RFC status becomes official patch ;)

Thanks Changli for your array suggestion !


[PATCH net-next-2.6] rps: shortcut net_rps_action()

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    9 ++--
 net/core/dev.c            |   79 ++++++++++++++----------------------
 2 files changed, 38 insertions(+), 50 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..83ab3da 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1381,17 +1381,20 @@ static inline int unregister_gifconf(unsigned int family)
 }
 
 /*
- * Incoming packets are placed on per-cpu queues so that
- * no locking is needed.
+ * Incoming packets are placed on per-cpu queues
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	struct softnet_data	*rps_ipi_list;
+
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct softnet_data	*rps_ipi_next;
+	unsigned int		cpu;
 	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..f6ff2cf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2346,21 +2346,6 @@ done:
 	return cpu;
 }
 
-/*
- * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
- * to be sent to kick remote softirq processing.  There are two masks since
- * the sending of IPIs must be done with interrupts enabled.  The select field
- * indicates the current mask that enqueue_backlog uses to schedule IPIs.
- * select is flipped before net_rps_action is called while still under lock,
- * net_rps_action then uses the non-selected mask to send the IPIs and clears
- * it without conflicting with enqueue_backlog operation.
- */
-struct rps_remote_softirq_cpus {
-	cpumask_t mask[2];
-	int select;
-};
-static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
-
 /* Called from hardirq (IPI) context */
 static void trigger_softirq(void *data)
 {
@@ -2403,10 +2388,12 @@ enqueue:
 		if (napi_schedule_prep(&queue->backlog)) {
 #ifdef CONFIG_RPS
 			if (cpu != smp_processor_id()) {
-				struct rps_remote_softirq_cpus *rcpus =
-				    &__get_cpu_var(rps_remote_softirq_cpus);
+				struct softnet_data *myqueue;
+
+				myqueue = &__get_cpu_var(softnet_data);
+				queue->rps_ipi_next = myqueue->rps_ipi_list;
+				myqueue->rps_ipi_list = queue;
 
-				cpu_set(cpu, rcpus->mask[rcpus->select]);
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 				goto enqueue;
 			}
@@ -2911,7 +2898,9 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
-/* Network device is going away, flush any packets still pending  */
+/* Network device is going away, flush any packets still pending
+ * Called with irqs disabled.
+ */
 static void flush_backlog(void *arg)
 {
 	struct net_device *dev = arg;
@@ -3340,24 +3329,33 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
-#ifdef CONFIG_RPS
 /*
- * net_rps_action sends any pending IPI's for rps.  This is only called from
- * softirq and interrupts must be enabled.
+ * net_rps_action sends any pending IPI's for rps.
+ * Note: called with local irq disabled, but exits with local irq enabled.
  */
-static void net_rps_action(cpumask_t *mask)
+static void net_rps_action(void)
 {
-	int cpu;
+#ifdef CONFIG_RPS
+	struct softnet_data *locqueue = &__get_cpu_var(softnet_data);
+	struct softnet_data *remqueue = locqueue->rps_ipi_list;
 
-	/* Send pending IPI's to kick RPS processing on remote cpus. */
-	for_each_cpu_mask_nr(cpu, *mask) {
-		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
-		if (cpu_online(cpu))
-			__smp_call_function_single(cpu, &queue->csd, 0);
-	}
-	cpus_clear(*mask);
-}
+	if (remqueue) {
+		locqueue->rps_ipi_list = NULL;
+
+		local_irq_enable();
+
+		/* Send pending IPI's to kick RPS processing on remote cpus. */
+		while (remqueue) {
+			struct softnet_data *next = remqueue->rps_ipi_next;
+			if (cpu_online(remqueue->cpu))
+				__smp_call_function_single(remqueue->cpu,
+							   &remqueue->csd, 0);
+			remqueue = next;
+		}
+	} else
 #endif
+		local_irq_enable();
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3365,10 +3363,6 @@ static void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
-#ifdef CONFIG_RPS
-	int select;
-	struct rps_remote_softirq_cpus *rcpus;
-#endif
 
 	local_irq_disable();
 
@@ -3431,17 +3425,7 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
-#ifdef CONFIG_RPS
-	rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
-	select = rcpus->select;
-	rcpus->select ^= 1;
-
-	local_irq_enable();
-
-	net_rps_action(&rcpus->mask[select]);
-#else
-	local_irq_enable();
-#endif
+	net_rps_action();
 
 #ifdef CONFIG_NET_DMA
 	/*
@@ -5841,6 +5825,7 @@ static int __init net_dev_init(void)
 		queue->csd.func = trigger_softirq;
 		queue->csd.info = queue;
 		queue->csd.flags = 0;
+		queue->cpu = i;
 #endif
 
 		queue->backlog.poll = process_backlog;



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-19 15:07                                               ` [PATCH net-next-2.6] " Eric Dumazet
@ 2010-04-19 16:02                                                 ` Tom Herbert
  2010-04-19 20:21                                                 ` David Miller
  2010-04-19 23:56                                                 ` [PATCH net-next-2.6] rps: shortcut net_rps_action() Changli Gao
  2 siblings, 0 replies; 86+ messages in thread
From: Tom Herbert @ 2010-04-19 16:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, David Miller, netdev

>
> [PATCH net-next-2.6] rps: shortcut net_rps_action()
>
> net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
> RPS is not active.
>
> Tom Herbert used two bitmasks to hold information needed to send IPI,
> but a single LIFO list seems more appropriate.
>
Yes, this patch is an improvement over that.

> Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
> (remove two ifdefs)
>
> Move rps_remote_softirq_cpus into softnet_data to share its first cache
> line, filling an existing hole.
>
> In a future patch, we could call net_rps_action() from process_backlog()
> to make sure we send IPI before handling this cpu backlog.
>
Yes.  I did some quick experiments last night and there do seem to
be some gains in doing this.

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/netdevice.h |    9 ++--
>  net/core/dev.c            |   79 ++++++++++++++----------------------
>  2 files changed, 38 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 649a025..83ab3da 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1381,17 +1381,20 @@ static inline int unregister_gifconf(unsigned int family)
>  }
>
>  /*
> - * Incoming packets are placed on per-cpu queues so that
> - * no locking is needed.
> + * Incoming packets are placed on per-cpu queues
>  */
>  struct softnet_data {
>        struct Qdisc            *output_queue;
>        struct list_head        poll_list;
>        struct sk_buff          *completion_queue;
>
> -       /* Elements below can be accessed between CPUs for RPS */
>  #ifdef CONFIG_RPS
> +       struct softnet_data     *rps_ipi_list;
> +
> +       /* Elements below can be accessed between CPUs for RPS */
>        struct call_single_data csd ____cacheline_aligned_in_smp;
> +       struct softnet_data     *rps_ipi_next;
> +       unsigned int            cpu;
>        unsigned int            input_queue_head;
>  #endif
>        struct sk_buff_head     input_pkt_queue;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7abf959..f6ff2cf 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2346,21 +2346,6 @@ done:
>        return cpu;
>  }
>
> -/*
> - * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
> - * to be sent to kick remote softirq processing.  There are two masks since
> - * the sending of IPIs must be done with interrupts enabled.  The select field
> - * indicates the current mask that enqueue_backlog uses to schedule IPIs.
> - * select is flipped before net_rps_action is called while still under lock,
> - * net_rps_action then uses the non-selected mask to send the IPIs and clears
> - * it without conflicting with enqueue_backlog operation.
> - */
> -struct rps_remote_softirq_cpus {
> -       cpumask_t mask[2];
> -       int select;
> -};
> -static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
> -
>  /* Called from hardirq (IPI) context */
>  static void trigger_softirq(void *data)
>  {
> @@ -2403,10 +2388,12 @@ enqueue:
>                if (napi_schedule_prep(&queue->backlog)) {
>  #ifdef CONFIG_RPS
>                        if (cpu != smp_processor_id()) {
> -                               struct rps_remote_softirq_cpus *rcpus =
> -                                   &__get_cpu_var(rps_remote_softirq_cpus);
> +                               struct softnet_data *myqueue;
> +
> +                               myqueue = &__get_cpu_var(softnet_data);
> +                               queue->rps_ipi_next = myqueue->rps_ipi_list;
> +                               myqueue->rps_ipi_list = queue;
>
> -                               cpu_set(cpu, rcpus->mask[rcpus->select]);
>                                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>                                goto enqueue;
>                        }
> @@ -2911,7 +2898,9 @@ int netif_receive_skb(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(netif_receive_skb);
>
> -/* Network device is going away, flush any packets still pending  */
> +/* Network device is going away, flush any packets still pending
> + * Called with irqs disabled.
> + */
>  static void flush_backlog(void *arg)
>  {
>        struct net_device *dev = arg;
> @@ -3340,24 +3329,33 @@ void netif_napi_del(struct napi_struct *napi)
>  }
>  EXPORT_SYMBOL(netif_napi_del);
>
> -#ifdef CONFIG_RPS
>  /*
> - * net_rps_action sends any pending IPI's for rps.  This is only called from
> - * softirq and interrupts must be enabled.
> + * net_rps_action sends any pending IPI's for rps.
> + * Note: called with local irq disabled, but exits with local irq enabled.
>  */
> -static void net_rps_action(cpumask_t *mask)
> +static void net_rps_action(void)
>  {
> -       int cpu;
> +#ifdef CONFIG_RPS
> +       struct softnet_data *locqueue = &__get_cpu_var(softnet_data);
> +       struct softnet_data *remqueue = locqueue->rps_ipi_list;
>
> -       /* Send pending IPI's to kick RPS processing on remote cpus. */
> -       for_each_cpu_mask_nr(cpu, *mask) {
> -               struct softnet_data *queue = &per_cpu(softnet_data, cpu);
> -               if (cpu_online(cpu))
> -                       __smp_call_function_single(cpu, &queue->csd, 0);
> -       }
> -       cpus_clear(*mask);
> -}
> +       if (remqueue) {
> +               locqueue->rps_ipi_list = NULL;
> +
> +               local_irq_enable();
> +
> +               /* Send pending IPI's to kick RPS processing on remote cpus. */
> +               while (remqueue) {
> +                       struct softnet_data *next = remqueue->rps_ipi_next;
> +                       if (cpu_online(remqueue->cpu))
> +                               __smp_call_function_single(remqueue->cpu,
> +                                                          &remqueue->csd, 0);
> +                       remqueue = next;
> +               }
> +       } else
>  #endif
> +               local_irq_enable();
> +}
>
>  static void net_rx_action(struct softirq_action *h)
>  {
> @@ -3365,10 +3363,6 @@ static void net_rx_action(struct softirq_action *h)
>        unsigned long time_limit = jiffies + 2;
>        int budget = netdev_budget;
>        void *have;
> -#ifdef CONFIG_RPS
> -       int select;
> -       struct rps_remote_softirq_cpus *rcpus;
> -#endif
>
>        local_irq_disable();
>
> @@ -3431,17 +3425,7 @@ static void net_rx_action(struct softirq_action *h)
>                netpoll_poll_unlock(have);
>        }
>  out:
> -#ifdef CONFIG_RPS
> -       rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
> -       select = rcpus->select;
> -       rcpus->select ^= 1;
> -
> -       local_irq_enable();
> -
> -       net_rps_action(&rcpus->mask[select]);
> -#else
> -       local_irq_enable();
> -#endif
> +       net_rps_action();
>
>  #ifdef CONFIG_NET_DMA
>        /*
> @@ -5841,6 +5825,7 @@ static int __init net_dev_init(void)
>                queue->csd.func = trigger_softirq;
>                queue->csd.info = queue;
>                queue->csd.flags = 0;
> +               queue->cpu = i;
>  #endif
>
>                queue->backlog.poll = process_backlog;
>
>
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-19 15:07                                               ` [PATCH net-next-2.6] " Eric Dumazet
  2010-04-19 16:02                                                 ` Tom Herbert
@ 2010-04-19 20:21                                                 ` David Miller
  2010-04-20  7:17                                                   ` [PATCH net-next-2.6] rps: cleanups Eric Dumazet
  2010-04-19 23:56                                                 ` [PATCH net-next-2.6] rps: shortcut net_rps_action() Changli Gao
  2 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2010-04-19 20:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Apr 2010 17:07:33 +0200

> [PATCH net-next-2.6] rps: shortcut net_rps_action()

Applied, thanks Eric.

It is getting increasingly complicated to follow who enables and
disables local cpu irqs in these code paths.  We could combat
this by adding something like "_irq_enable()" to the function
names.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-19 15:07                                               ` [PATCH net-next-2.6] " Eric Dumazet
  2010-04-19 16:02                                                 ` Tom Herbert
  2010-04-19 20:21                                                 ` David Miller
@ 2010-04-19 23:56                                                 ` Changli Gao
  2010-04-20  0:32                                                   ` Changli Gao
  2 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-19 23:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Tom Herbert, netdev

On Mon, Apr 19, 2010 at 11:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> +
> +               /* Send pending IPI's to kick RPS processing on remote cpus. */
> +               while (remqueue) {
> +                       struct softnet_data *next = remqueue->rps_ipi_next;
> +                       if (cpu_online(remqueue->cpu))
> +                               __smp_call_function_single(remqueue->cpu,
> +                                                          &remqueue->csd, 0);
> +                       remqueue = next;
> +               }

It seems you prefetch rps_ipi_next. I don't think it is necessary, as
the list should be short. If you insist on this, would the prefetch()
macro be better?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-19 23:56                                                 ` [PATCH net-next-2.6] rps: shortcut net_rps_action() Changli Gao
@ 2010-04-20  0:32                                                   ` Changli Gao
  2010-04-20  5:55                                                     ` Eric Dumazet
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-20  0:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Tom Herbert, netdev

On Tue, Apr 20, 2010 at 7:56 AM, Changli Gao <xiaosuo@gmail.com> wrote:
>
> It seems you prefetch rps_ipi_next. I don't think it is necessary, as
> the list should be short. If you insist on this, would the prefetch()
> macro be better?

Oh, I read the code again and got the answer. After the IPI is sent,
this softnet_data can be queued again by other CPUs. We read the pointer
rps_ipi_next in advance to avoid this race condition.

Sorry for noise :)
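That race can be modeled in ordinary C: once the "IPI" lands, the owner may reuse the link field, so the walker must snapshot the next pointer before the send. A small single-threaded sketch (hypothetical names; the relink is simulated inline rather than done by a real remote CPU):

```c
#include <assert.h>
#include <stddef.h>

struct sd {
	int cpu;
	struct sd *ipi_next;	/* may be reused by the owner after the IPI */
};

/* Walk a detached IPI list. We simulate the remote CPU reusing
 * ipi_next right after each "send" by scribbling on the field;
 * reading ipi_next after the send would follow the scribbled value. */
static int ipi_walk(struct sd *cur)
{
	int sent = 0;

	while (cur) {
		struct sd *next = cur->ipi_next; /* snapshot BEFORE the send */
		cur->ipi_next = cur;	/* "remote CPU" relinks the field */
		sent++;
		cur = next;
	}
	return sent;
}
```

If the loop instead read cur->ipi_next after the simulated relink, it would spin on the same node forever, which is the corruption the early read avoids.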


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: shortcut net_rps_action()
  2010-04-20  0:32                                                   ` Changli Gao
@ 2010-04-20  5:55                                                     ` Eric Dumazet
  0 siblings, 0 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-20  5:55 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, Tom Herbert, netdev

On Tuesday 20 April 2010 at 08:32 +0800, Changli Gao wrote:

> Oh, I read the code again and got the answer. After the IPI is sent,
> this softnet_data can be queued again by other CPUs. We read the pointer
> rps_ipi_next in advance to avoid this race condition.
> 

Speaking of prefetch business,

I partly tested the following patch; I will submit it if it turns out to
be a clear win.

diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..fe6fc9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2349,7 +2349,9 @@ done:
 static void trigger_softirq(void *data)
 {
 	struct softnet_data *queue = data;
+
 	__napi_schedule(&queue->backlog);
+	prefetch(queue->input_pkt_queue.next);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
 #endif /* CONFIG_RPS */
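For context, the kernel's prefetch() is only a cache-warming hint: it typically compiles down to a CPU prefetch instruction (or nothing at all on some architectures) and can change timing but never results. In userspace the rough equivalent is the GCC/Clang builtin; my_prefetch below is an illustrative wrapper, not a kernel API:

```c
#include <assert.h>

/* Hint the CPU to pull *p toward the cache for a read (rw = 0) with
 * high expected temporal locality (locality = 3). Purely advisory:
 * it has no observable effect on program state. */
static inline void my_prefetch(const void *p)
{
	__builtin_prefetch(p, 0, 3);
}
```

Because it is advisory, the only way to judge a patch like the one above is measurement, hence "if it happens to be a clear win".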



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH net-next-2.6] rps: cleanups
  2010-04-19 20:21                                                 ` David Miller
@ 2010-04-20  7:17                                                   ` Eric Dumazet
  2010-04-20  8:18                                                     ` David Miller
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-20  7:17 UTC (permalink / raw)
  To: David Miller; +Cc: xiaosuo, therbert, netdev

On Monday 19 April 2010 at 13:21 -0700, David Miller wrote:

> 
> It is getting increasingly complicated to follow who enables and
> disables local cpu irqs in these code paths.  We could combat
> this by adding something like "_irq_enable()" to the function
> names.

Yes, I agree, we need a general cleanup in this file.

Thanks David!

[PATCH net-next-2.6] rps: cleanups

struct softnet_data holds many queues, so consistently using the name
"sd" instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    4 
 net/core/dev.c            |  149 +++++++++++++++++++-----------------
 2 files changed, 82 insertions(+), 71 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 83ab3da..3c5ed5f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,10 +1401,10 @@ struct softnet_data {
 	struct napi_struct	backlog;
 };
 
-static inline void incr_input_queue_head(struct softnet_data *queue)
+static inline void input_queue_head_incr(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	queue->input_queue_head++;
+	sd->input_queue_head++;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..70df048 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -208,17 +208,17 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
 }
 
-static inline void rps_lock(struct softnet_data *queue)
+static inline void rps_lock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_lock(&queue->input_pkt_queue.lock);
+	spin_lock(&sd->input_pkt_queue.lock);
 #endif
 }
 
-static inline void rps_unlock(struct softnet_data *queue)
+static inline void rps_unlock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_unlock(&queue->input_pkt_queue.lock);
+	spin_unlock(&sd->input_pkt_queue.lock);
 #endif
 }
 
@@ -2346,63 +2346,74 @@ done:
 }
 
 /* Called from hardirq (IPI) context */
-static void trigger_softirq(void *data)
+static void rps_trigger_softirq(void *data)
 {
-	struct softnet_data *queue = data;
-	__napi_schedule(&queue->backlog);
+	struct softnet_data *sd = data;
+
+	__napi_schedule(&sd->backlog);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
+
 #endif /* CONFIG_RPS */
 
 /*
+ * Check if this softnet_data structure is another cpu one
+ * If yes, queue it to our IPI list and return 1
+ * If no, return 0
+ */
+static int rps_ipi_queued(struct softnet_data *sd)
+{
+#ifdef CONFIG_RPS
+	struct softnet_data *mysd = &__get_cpu_var(softnet_data);
+
+	if (sd != mysd) {
+		sd->rps_ipi_next = mysd->rps_ipi_list;
+		mysd->rps_ipi_list = sd;
+
+		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+		return 1;
+	}
+#endif /* CONFIG_RPS */
+	return 0;
+}
+
+/*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
  */
 static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 			      unsigned int *qtail)
 {
-	struct softnet_data *queue;
+	struct softnet_data *sd;
 	unsigned long flags;
 
-	queue = &per_cpu(softnet_data, cpu);
+	sd = &per_cpu(softnet_data, cpu);
 
 	local_irq_save(flags);
 	__get_cpu_var(netdev_rx_stat).total++;
 
-	rps_lock(queue);
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
+	rps_lock(sd);
+	if (sd->input_pkt_queue.qlen <= netdev_max_backlog) {
+		if (sd->input_pkt_queue.qlen) {
 enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			__skb_queue_tail(&sd->input_pkt_queue, skb);
 #ifdef CONFIG_RPS
-			*qtail = queue->input_queue_head +
-			    queue->input_pkt_queue.qlen;
+			*qtail = sd->input_queue_head + sd->input_pkt_queue.qlen;
 #endif
-			rps_unlock(queue);
+			rps_unlock(sd);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
 		}
 
 		/* Schedule NAPI for backlog device */
-		if (napi_schedule_prep(&queue->backlog)) {
-#ifdef CONFIG_RPS
-			if (cpu != smp_processor_id()) {
-				struct softnet_data *myqueue;
-
-				myqueue = &__get_cpu_var(softnet_data);
-				queue->rps_ipi_next = myqueue->rps_ipi_list;
-				myqueue->rps_ipi_list = queue;
-
-				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-				goto enqueue;
-			}
-#endif
-			__napi_schedule(&queue->backlog);
+		if (napi_schedule_prep(&sd->backlog)) {
+			if (!rps_ipi_queued(sd))
+				__napi_schedule(&sd->backlog);
 		}
 		goto enqueue;
 	}
 
-	rps_unlock(queue);
+	rps_unlock(sd);
 
 	__get_cpu_var(netdev_rx_stat).dropped++;
 	local_irq_restore(flags);
@@ -2903,17 +2914,17 @@ EXPORT_SYMBOL(netif_receive_skb);
 static void flush_backlog(void *arg)
 {
 	struct net_device *dev = arg;
-	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
 
-	rps_lock(queue);
-	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
+	rps_lock(sd);
+	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp)
 		if (skb->dev == dev) {
-			__skb_unlink(skb, &queue->input_pkt_queue);
+			__skb_unlink(skb, &sd->input_pkt_queue);
 			kfree_skb(skb);
-			incr_input_queue_head(queue);
+			input_queue_head_incr(sd);
 		}
-	rps_unlock(queue);
+	rps_unlock(sd);
 }
 
 static int napi_gro_complete(struct sk_buff *skb)
@@ -3219,23 +3230,23 @@ EXPORT_SYMBOL(napi_gro_frags);
 static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
-	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 
 	napi->weight = weight_p;
 	do {
 		struct sk_buff *skb;
 
 		local_irq_disable();
-		rps_lock(queue);
-		skb = __skb_dequeue(&queue->input_pkt_queue);
+		rps_lock(sd);
+		skb = __skb_dequeue(&sd->input_pkt_queue);
 		if (!skb) {
 			__napi_complete(napi);
-			rps_unlock(queue);
+			rps_unlock(sd);
 			local_irq_enable();
 			break;
 		}
-		incr_input_queue_head(queue);
-		rps_unlock(queue);
+		input_queue_head_incr(sd);
+		rps_unlock(sd);
 		local_irq_enable();
 
 		__netif_receive_skb(skb);
@@ -3331,24 +3342,25 @@ EXPORT_SYMBOL(netif_napi_del);
  * net_rps_action sends any pending IPI's for rps.
  * Note: called with local irq disabled, but exits with local irq enabled.
  */
-static void net_rps_action(void)
+static void net_rps_action_and_irq_disable(void)
 {
 #ifdef CONFIG_RPS
-	struct softnet_data *locqueue = &__get_cpu_var(softnet_data);
-	struct softnet_data *remqueue = locqueue->rps_ipi_list;
+	struct softnet_data *sd = &__get_cpu_var(softnet_data);
+	struct softnet_data *remsd = sd->rps_ipi_list;
 
-	if (remqueue) {
-		locqueue->rps_ipi_list = NULL;
+	if (remsd) {
+		sd->rps_ipi_list = NULL;
 
 		local_irq_enable();
 
 		/* Send pending IPI's to kick RPS processing on remote cpus. */
-		while (remqueue) {
-			struct softnet_data *next = remqueue->rps_ipi_next;
-			if (cpu_online(remqueue->cpu))
-				__smp_call_function_single(remqueue->cpu,
-							   &remqueue->csd, 0);
-			remqueue = next;
+		while (remsd) {
+			struct softnet_data *next = remsd->rps_ipi_next;
+
+			if (cpu_online(remsd->cpu))
+				__smp_call_function_single(remsd->cpu,
+							   &remsd->csd, 0);
+			remsd = next;
 		}
 	} else
 #endif
@@ -3423,7 +3435,7 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
-	net_rps_action();
+	net_rps_action_and_irq_disable();
 
 #ifdef CONFIG_NET_DMA
 	/*
@@ -5595,7 +5607,7 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	/* Process offline CPU's input_pkt_queue */
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
-		incr_input_queue_head(oldsd);
+		input_queue_head_incr(oldsd);
 	}
 
 	return NOTIFY_OK;
@@ -5812,24 +5824,23 @@ static int __init net_dev_init(void)
 	 */
 
 	for_each_possible_cpu(i) {
-		struct softnet_data *queue;
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
 
-		queue = &per_cpu(softnet_data, i);
-		skb_queue_head_init(&queue->input_pkt_queue);
-		queue->completion_queue = NULL;
-		INIT_LIST_HEAD(&queue->poll_list);
+		skb_queue_head_init(&sd->input_pkt_queue);
+		sd->completion_queue = NULL;
+		INIT_LIST_HEAD(&sd->poll_list);
 
 #ifdef CONFIG_RPS
-		queue->csd.func = trigger_softirq;
-		queue->csd.info = queue;
-		queue->csd.flags = 0;
-		queue->cpu = i;
+		sd->csd.func = rps_trigger_softirq;
+		sd->csd.info = sd;
+		sd->csd.flags = 0;
+		sd->cpu = i;
 #endif
 
-		queue->backlog.poll = process_backlog;
-		queue->backlog.weight = weight_p;
-		queue->backlog.gro_list = NULL;
-		queue->backlog.gro_count = 0;
+		sd->backlog.poll = process_backlog;
+		sd->backlog.weight = weight_p;
+		sd->backlog.gro_list = NULL;
+		sd->backlog.gro_count = 0;
 	}
 
 	dev_boot_phase = 0;



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH net-next-2.6] rps: cleanups
  2010-04-20  7:17                                                   ` [PATCH net-next-2.6] rps: cleanups Eric Dumazet
@ 2010-04-20  8:18                                                     ` David Miller
  0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2010-04-20  8:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 09:17:14 +0200

> On Monday 19 April 2010 at 13:21 -0700, David Miller wrote:
> 
>> 
>> It is getting increasingly complicated to follow who enables and
>> disables local cpu irqs in these code paths.  We could combat
>> this by adding something like "_irq_enable()" to the function
>> names.
> 
> Yes, I agree, we need a general cleanup in this file.
> 
> Thanks David!
> 
> [PATCH net-next-2.6] rps: cleanups
> 

Applied.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-18 11:34                                 ` Eric Dumazet
  2010-04-19  2:09                                   ` jamal
  2010-04-19  9:37                                   ` [RFC] rps: shortcut net_rps_action() Eric Dumazet
@ 2010-04-20 12:02                                   ` jamal
  2010-04-20 13:13                                     ` Eric Dumazet
  2 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-20 12:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

folks,

Thanks to everybody (Eric stands out) for your patience. 
I ended up mostly validating what's already been said. I have a lot of data
and can describe in detail how i tested etc, but it would require
patience in reading, so i will spare you ;-> If you are interested, let me
know and i will be happy to share.

Summary is: 
-rps good, gives higher throughput for apps
-rps not so good, latency worse but gets better with higher input rate
or increasing number of flows (which translates to higher pps)
-rps works well with newer hardware that has better cache structures.
[Gives great results on my test machine, a single-socket Nehalem with 4
cores, each with two SMT threads; there is a shared L2 between threads and
a shared L3 between cores.] 
Your selection of what the demux cpu is and where the target cpus are is
an influencing factor in the latency results. If you have a system with
multiple sockets, you should get better numbers if you stay within the
same socket relative to going across sockets.
-rps does a better job of helping schedule apps on the same cpu, thus
localizing the app. The throughput results with rps are very consistent
and better, whereas in the non-rps case variance is _high_.

My next step is to do some forwarding tests - probably next week. I am
concerned here because i expect the cache misses to be higher than in the
app scenario (the netdev structure and attributes could be touched by many
cpus).

cheers,
jamal


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: rps perfomance WAS(Re: rps: question
  2010-04-20 12:02                                   ` rps perfomance WAS(Re: rps: question jamal
@ 2010-04-20 13:13                                     ` Eric Dumazet
       [not found]                                       ` <1271853570.4032.21.camel@bigi>
  0 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2010-04-20 13:13 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Tuesday 20 April 2010 at 08:02 -0400, jamal wrote: 
> folks,
> 
> Thanks to everybody (Eric stands out) for your patience. 
> I ended up mostly validating what's already been said. I have a lot of data
> and can describe in detail how i tested etc, but it would require
> patience in reading, so i will spare you ;-> If you are interested, let me
> know and i will be happy to share.
> 
> Summary is: 
> -rps good, gives higher throughput for apps
> -rps not so good, latency worse but gets better with higher input rate
> or increasing number of flows (which translates to higher pps)
> -rps works well with newer hardware that has better cache structures.
> [Gives great results on my test machine, a single-processor Nehalem: 4
> cores, each with two SMT threads, with a shared L2 between threads and
> a shared L3 between cores]. 
> The choice of which CPU does the demux and where the target CPUs are
> influences the latency results. If you have a system with multiple
> sockets, you should get better numbers if you stay within the same
> socket rather than going across sockets.
> -rps does a better job at helping schedule apps on the same CPU, thus
> localizing the app. The throughput results with rps are very consistent
> and better, whereas in the non-rps case variance is _high_.
> 
> My next step is to do some forwarding tests - probably next week. I am
> concerned here because I expect the cache misses to be higher than in
> the app scenario (the netdev structure and attributes could be touched
> by many CPUs)
> 

Hi Jamal

I think your tests are very interesting; maybe you could publish them
somehow? (I forgot to thank you for the previous report and nice
graph.)

perf reports would be good too, to help spot hot points.





* Re: rps perfomance WAS(Re: rps: question
       [not found]                                       ` <1271853570.4032.21.camel@bigi>
@ 2010-04-21 19:01                                         ` Eric Dumazet
  2010-04-22  1:27                                           ` Changli Gao
  2010-04-22 12:12                                           ` jamal
  2010-04-21 21:53                                         ` Rick Jones
  1 sibling, 2 replies; 86+ messages in thread
From: Eric Dumazet @ 2010-04-21 19:01 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Wed, 21 Apr 2010 at 08:39 -0400, jamal wrote:
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> 
> 
> > I think your tests are very interesting; maybe you could publish them
> > somehow? (I forgot to thank you for the previous report and nice
> > graph.)
> > perf reports would be good too, to help spot hot points.
> 
> Ok ;->
> Let me explain my test setup (which some app types may gasp at ;->):
> 
> SUT (system under test) was a single-processor Nehalem (4 cores, 2 SMT
> threads per core). 
> The SUT runs a UDP sink server I wrote (with apologies to Rick
> Jones[1]) which forks at most one process per detected CPU and binds
> to a different UDP port on each processor.
> The traffic generator sent up to 750 Kpps of UDP packets to the SUT,
> round-robin, varying the destination port to select a different flow
> on each outgoing packet. I could have further increased the number of
> flows by varying the source address and source port number, but in the
> end I settled on a fixed srcip/srcport/dstip and just varied the
> destination port in order to simplify results collection.
> For rps I selected mask "ee" and bound the interrupt to cpu0. "ee"
> leaves cpu0 and cpu4 out of the set of target cpus. Because Nehalem
> has SMT, cpu0 and cpu4 are SMT threads that reside on core0 and steal
> execution cycles from each other - so I didn't want that to happen,
> and instead tried to keep as many of those cycles as possible for
> demuxing incoming packets.
> 
> Overall, in the best-case scenario rps had 5-7% better throughput than
> the non-rps setup. It had up to 10% more CPU use and about 2-5% more
> latency. I am attaching some visualizations of the way 8 flows were
> distributed across the different cpus. The diagrams show some samples -
> but what you see there was a good reflection of what I saw in many runs
> of the tests. Essentially, localization is better with rps, and it gets
> better still if you can somehow map the target cpus selected by rps to
> what the app binds to.
> I've also attached a small annotated perf output - sorry, I didn't have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver's or hardware's poor
> ability to handle lots of traffic.
> 
> 
> cheers,
> jamal
> 
> [1] I want to pump tons of traffic at the SUT and count packets;
> too complex to do with netperf

Thanks a lot Jamal, this is really useful

A drawback of using a fixed src IP from your generator is that all
flows share the same struct dst entry on the SUT. This might explain
some glitches you noticed (ip_route_input + ip_rcv high in the profile
on slave/application cpus).
Also note your test is one-way. If some data were sent in reply, we
would see much more use of the 'flows'.

I notice epoll_ctl() is used a lot - are you re-arming epoll each time
you receive a datagram ?

I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help for the backlog (instead of a doubly
linked queue), or the double-queue trick, if Changli wants to respin
his patch.







* Re: rps perfomance WAS(Re: rps: question
       [not found]                                       ` <1271853570.4032.21.camel@bigi>
  2010-04-21 19:01                                         ` Eric Dumazet
@ 2010-04-21 21:53                                         ` Rick Jones
  1 sibling, 0 replies; 86+ messages in thread
From: Rick Jones @ 2010-04-21 21:53 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Changli Gao, David Miller, therbert, netdev, robert, andi

> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT (system under test) was a single-processor Nehalem (4 cores, 2 SMT
> threads per core). 
> The SUT runs a UDP sink server I wrote (with apologies to Rick Jones[1])
 > ...
> 
> [1] I want to pump tons of traffic at the SUT and count packets;
> too complex to do with netperf

No need to apologize; if you like, I'd be happy to discuss netperf
usage tips offline.  That offer stands for everyone.

happy benchmarking,

rick jones


* Re: rps perfomance WAS(Re: rps: question
  2010-04-21 19:01                                         ` Eric Dumazet
@ 2010-04-22  1:27                                           ` Changli Gao
  2010-04-22 12:12                                           ` jamal
  1 sibling, 0 replies; 86+ messages in thread
From: Changli Gao @ 2010-04-22  1:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi

On Thu, Apr 22, 2010 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Thanks a lot Jamal, this is really useful
>
> A drawback of using a fixed src IP from your generator is that all
> flows share the same struct dst entry on the SUT. This might explain
> some glitches you noticed (ip_route_input + ip_rcv high in the profile
> on slave/application cpus).
> Also note your test is one-way. If some data were sent in reply, we
> would see much more use of the 'flows'.
>
> I notice epoll_ctl() is used a lot - are you re-arming epoll each time
> you receive a datagram ?
>
> I see slave/application cpus hit _raw_spin_lock_irqsave() and
> _raw_spin_unlock_irqrestore().
>
> Maybe a ring buffer could help for the backlog (instead of a doubly
> linked queue), or the double-queue trick, if Changli wants to respin
> his patch.
>
>

OK, I'll post a new patch against the current tree so that Jamal can
have a try. I am sorry, but I don't have a suitable computer for
benchmarking.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: rps perfomance WAS(Re: rps: question
  2010-04-21 19:01                                         ` Eric Dumazet
  2010-04-22  1:27                                           ` Changli Gao
@ 2010-04-22 12:12                                           ` jamal
  2010-04-25  2:31                                             ` Changli Gao
  1 sibling, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-22 12:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert, andi

On Wed, 2010-04-21 at 21:01 +0200, Eric Dumazet wrote:

> A drawback of using a fixed src IP from your generator is that all
> flows share the same struct dst entry on the SUT. This might explain
> some glitches you noticed (ip_route_input + ip_rcv high in the profile
> on slave/application cpus).

Yes, that would explain it ;-> I could have had flows going to each CPU
generating different unique dsts. It is good I didn't ;->

> Also note your test is one-way. If some data were sent in reply, we
> would see much more use of the 'flows'.
> 

In my next step I wanted to "route" these packets at the app level; at
this stage of testing I just wanted to sink the data to reduce
experiment variables. Reason: the netdev structure would take a lot of
cache misses if I started using it to both send and receive, since lots
of things are shared between tx and rx (for example, NAPI tx pruning
could happen on either the tx or the receive path); same thing with the
qdisc path, which is at netdev granularity. I think there may be room
for interesting improvements in this area.

> I notice epoll_ctl() is used a lot - are you re-arming epoll each time
> you receive a datagram ?

I am using the default libevent on Debian. It looks very old and may be
buggy. I will try to upgrade first and, if I still see the same thing,
investigate.
  
> I see slave/application cpus hit _raw_spin_lock_irqsave() and  
> _raw_spin_unlock_irqrestore().
> 
> Maybe a ring buffer could help for the backlog (instead of a doubly
> linked queue), or the double-queue trick, if Changli wants to respin
> his patch.
> 

OK, I will have some cycles later today/tomorrow, or for sure on the
weekend. My setup is still intact - so I can test.

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-22 12:12                                           ` jamal
@ 2010-04-25  2:31                                             ` Changli Gao
  2010-04-26 11:35                                               ` jamal
  0 siblings, 1 reply; 86+ messages in thread
From: Changli Gao @ 2010-04-25  2:31 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Thu, Apr 22, 2010 at 8:12 PM, jamal <hadi@cyberus.ca> wrote:
>
>> I see slave/application cpus hit _raw_spin_lock_irqsave() and
>> _raw_spin_unlock_irqrestore().
>>
>> Maybe a ring buffer could help for the backlog (instead of a doubly
>> linked queue), or the double-queue trick, if Changli wants to respin
>> his patch.
>>
>
> Ok, I will have some cycles later today/tommorow or for sure on weekend.
> My setup is still intact - so i can test.
>

I read the code again, and found that we don't use spin_lock_irqsave();
we use local_irq_save() and spin_lock() instead, so
_raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not
be related to the backlog. The lock may be sk_receive_queue.lock.

Jamal, did you use a single socket to serve all the clients?

BTW: completion_queue and output_queue in softnet_data are both LIFO
queues. For completion_queue, FIFO is better, as the last-used skb is
more likely to be in cache and should be reused first. Since slab
always caches the most recently freed memory at the head, we'd better
free the skbs in FIFO order. For output_queue, FIFO is good for
fairness among qdiscs.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: rps perfomance WAS(Re: rps: question
  2010-04-25  2:31                                             ` Changli Gao
@ 2010-04-26 11:35                                               ` jamal
  2010-04-26 13:35                                                 ` Changli Gao
  0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2010-04-26 11:35 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:

> I read the code again, and found that we don't use spin_lock_irqsave();
> we use local_irq_save() and spin_lock() instead, so
> _raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not
> be related to the backlog. The lock may be sk_receive_queue.lock.

Possible.
I am wondering if there's a way we can precisely nail down where that
is happening - is lockstat of any use? 
Fixing _raw_spin_lock_irqsave and friends is the lowest-hanging fruit.

So looking at your patch now, I see it is likely there was an
improvement made for the non-rps case (moving some irq_enable etc. out
of the loop), i.e. my results may not be crazy after adding your patch
and seeing an improvement for the non-rps case.
However, whatever your patch did, it did not help the rps case:
call_function_single_interrupt() comes out higher in the profile,
and the # of IPIs seems to have gone up (although I did not measure
this, I can see the interrupts/second went up by almost 50-60%)

> Jamal, did you use a single socket to serve all the clients?

Socket per detected cpu.

> BTW: completion_queue and output_queue in softnet_data are both LIFO
> queues. For completion_queue, FIFO is better, as the last-used skb is
> more likely to be in cache and should be reused first. Since slab
> always caches the most recently freed memory at the head, we'd better
> free the skbs in FIFO order. For output_queue, FIFO is good for
> fairness among qdiscs.

I think it will depend on how many of those skbs are sitting in the
completion queue, cache warmth etc. LIFO is always safest: you have a
higher probability of finding a cached skb in front.

cheers,
jamal



* Re: rps perfomance WAS(Re: rps: question
  2010-04-26 11:35                                               ` jamal
@ 2010-04-26 13:35                                                 ` Changli Gao
  0 siblings, 0 replies; 86+ messages in thread
From: Changli Gao @ 2010-04-26 13:35 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert, andi

On Mon, Apr 26, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:
>
>> I read the code again, and found that we don't use spin_lock_irqsave();
>> we use local_irq_save() and spin_lock() instead, so
>> _raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not
>> be related to the backlog. The lock may be sk_receive_queue.lock.
>
> Possible.
> I am wondering if there's a way we can precisely nail down where that
> is happening - is lockstat of any use?
> Fixing _raw_spin_lock_irqsave and friends is the lowest-hanging fruit.
>

Maybe lockstat can help in this case.

> So looking at your patch now, I see it is likely there was an
> improvement made for the non-rps case (moving some irq_enable etc. out
> of the loop), i.e. my results may not be crazy after adding your patch
> and seeing an improvement for the non-rps case.
> However, whatever your patch did, it did not help the rps case:
> call_function_single_interrupt() comes out higher in the profile,
> and the # of IPIs seems to have gone up (although I did not measure
> this, I can see the interrupts/second went up by almost 50-60%)

Did you apply the patch from Eric? It would reduce the number of
local_irq_disable() calls but increase the number of IPIs.

>
>> Jamal, did you use a single socket to serve all the clients?
>
> Socket per detected cpu.

Ignore it. I made a mistake here.

>
>> BTW: completion_queue and output_queue in softnet_data are both LIFO
>> queues. For completion_queue, FIFO is better, as the last-used skb is
>> more likely to be in cache and should be reused first. Since slab
>> always caches the most recently freed memory at the head, we'd better
>> free the skbs in FIFO order. For output_queue, FIFO is good for
>> fairness among qdiscs.
>
> I think it will depend on how many of those skbs are sitting in the
> completion queue, cache warmth etc. LIFO is always safest: you have a
> higher probability of finding a cached skb in front.
>

We call kfree_skb() to release skbs to the slab allocator, and the slab
allocator then stores them in a LIFO queue. If the completion queue is
also a LIFO queue, the most recently used skb will be at the front of
the queue and will be released to the slab allocator first. At the next
alloc_skb() call, the memory used by the skb at the end of the
completion queue will be returned instead of the hot one.

However, as Eric said, new drivers don't rely on the completion queue,
so it isn't a real problem, especially in your test case.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


end of thread, other threads:[~2010-04-26 13:36 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-07 18:42 rps: question jamal
2010-02-08  5:58 ` Tom Herbert
2010-02-08 15:09   ` jamal
2010-04-14 11:53     ` rps perfomance WAS(Re: " jamal
2010-04-14 17:31       ` Tom Herbert
2010-04-14 18:04         ` Eric Dumazet
2010-04-14 18:53           ` jamal
2010-04-14 19:44             ` Stephen Hemminger
2010-04-14 19:58               ` Eric Dumazet
2010-04-15  8:51                 ` David Miller
2010-04-14 20:22               ` jamal
2010-04-14 20:27                 ` Eric Dumazet
2010-04-14 20:38                   ` jamal
2010-04-14 20:45                   ` Tom Herbert
2010-04-14 20:57                     ` Eric Dumazet
2010-04-14 22:51                       ` Changli Gao
2010-04-14 23:02                         ` Stephen Hemminger
2010-04-15  2:40                           ` Eric Dumazet
2010-04-15  2:50                             ` Changli Gao
2010-04-15  8:57                       ` David Miller
2010-04-15 12:10                       ` jamal
2010-04-15 12:32                         ` Changli Gao
2010-04-15 12:50                           ` jamal
2010-04-15 23:51                             ` Changli Gao
2010-04-15  8:51                 ` David Miller
2010-04-14 20:34               ` Andi Kleen
2010-04-15  8:50               ` David Miller
2010-04-15  8:48             ` David Miller
2010-04-15 11:55               ` jamal
2010-04-15 16:41                 ` Rick Jones
2010-04-15 20:16                   ` jamal
2010-04-15 20:25                     ` Rick Jones
2010-04-15 23:56                     ` Changli Gao
2010-04-16  5:18                       ` Eric Dumazet
2010-04-16  6:02                         ` Changli Gao
2010-04-16  6:28                           ` Tom Herbert
2010-04-16  6:32                           ` Eric Dumazet
2010-04-16 13:42                             ` jamal
2010-04-16  7:15                           ` Andi Kleen
2010-04-16 13:27                             ` jamal
2010-04-16 13:37                               ` Andi Kleen
2010-04-16 13:58                                 ` jamal
2010-04-16 13:21                         ` jamal
2010-04-16 13:34                           ` Changli Gao
2010-04-16 13:49                             ` jamal
2010-04-16 14:10                               ` Changli Gao
2010-04-16 14:43                                 ` jamal
2010-04-16 14:58                                   ` Changli Gao
2010-04-19 12:48                                     ` jamal
2010-04-17  7:35                           ` Eric Dumazet
2010-04-17  8:43                             ` Tom Herbert
2010-04-17  9:23                               ` Eric Dumazet
2010-04-17 14:27                                 ` Eric Dumazet
2010-04-17 17:26                                   ` Tom Herbert
2010-04-17 14:17                               ` [PATCH net-next-2.6] net: remove time limit in process_backlog() Eric Dumazet
2010-04-18  9:36                                 ` David Miller
2010-04-17 17:31                             ` rps perfomance WAS(Re: rps: question jamal
2010-04-18  9:39                               ` Eric Dumazet
2010-04-18 11:34                                 ` Eric Dumazet
2010-04-19  2:09                                   ` jamal
2010-04-19  9:37                                   ` [RFC] rps: shortcut net_rps_action() Eric Dumazet
2010-04-19  9:48                                     ` Changli Gao
2010-04-19 12:14                                       ` Eric Dumazet
2010-04-19 12:28                                         ` Changli Gao
2010-04-19 13:27                                           ` Eric Dumazet
2010-04-19 14:22                                             ` Eric Dumazet
2010-04-19 15:07                                               ` [PATCH net-next-2.6] " Eric Dumazet
2010-04-19 16:02                                                 ` Tom Herbert
2010-04-19 20:21                                                 ` David Miller
2010-04-20  7:17                                                   ` [PATCH net-next-2.6] rps: cleanups Eric Dumazet
2010-04-20  8:18                                                     ` David Miller
2010-04-19 23:56                                                 ` [PATCH net-next-2.6] rps: shortcut net_rps_action() Changli Gao
2010-04-20  0:32                                                   ` Changli Gao
2010-04-20  5:55                                                     ` Eric Dumazet
2010-04-20 12:02                                   ` rps perfomance WAS(Re: rps: question jamal
2010-04-20 13:13                                     ` Eric Dumazet
     [not found]                                       ` <1271853570.4032.21.camel@bigi>
2010-04-21 19:01                                         ` Eric Dumazet
2010-04-22  1:27                                           ` Changli Gao
2010-04-22 12:12                                           ` jamal
2010-04-25  2:31                                             ` Changli Gao
2010-04-26 11:35                                               ` jamal
2010-04-26 13:35                                                 ` Changli Gao
2010-04-21 21:53                                         ` Rick Jones
2010-04-16 15:57             ` Tom Herbert
2010-04-14 18:53       ` Stephen Hemminger
2010-04-15  8:42       ` David Miller
