All of lore.kernel.org
* tx queue hashing hot-spots and poor performance (multiq, ixgbe)
@ 2009-04-29 23:00 Andrew Dickinson
  2009-04-30  9:07 ` Jens Låås
  2009-05-01 10:20 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Dickinson @ 2009-04-29 23:00 UTC (permalink / raw)
  To: netdev

Howdy list,

Background...
I'm trying to evaluate a new system for routing performance for some
custom packet modification that we do.  To start, I'm trying to get a
high-water mark of routing performance without our custom cruft in the
middle.  The hardware setup is a dual-package Nehalem box (X5550,
Hyper-Threading disabled) with a dual 10G intel card (pci-id:
8086:10fb).  Because this NIC is freakishly new, I'm running the
latest torvalds kernel in order to get the ixgbe driver to identify it
(<sigh>).  With HT off, I've got 8 cores in the system.  For the sake
of reducing the number of variables that I'm dealing with, I'm only
using one of the NICs to start with and simply routing packets back
out the single 10G NIC.

Interrupts...
I've disabled irqbalance and I'm explicitly pinning interrupts, one
per core, as follows:

-bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done; done
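For reference, the awk one-liner is just printing 2^i as a zero-padded,
one-hot hex bitmask. A small sketch of the same mapping (illustration
only; the shell loop above is what actually writes the files):

```python
def smp_affinity_mask(cpu):
    # One-hot CPU bitmask in hex, as echoed into /proc/irq/<N>/smp_affinity.
    return format(1 << cpu, "04X")

print([smp_affinity_mask(c) for c in range(8)])
# prints ['0001', '0002', '0004', '0008', '0010', '0020', '0040', '0080']
```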

-bash-3.2# for i in `seq 57 72`; do cat /proc/irq/$i/smp_affinity; done
0001
0002
0004
0008
0010
0020
0040
0080
0001
0002
0004
0008
0010
0020
0040
0080

-bash-3.2# cat /proc/interrupts  | grep eth2
  57:      77941          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
  58:         92      59682          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
  59:         92          0      21716          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
  60:         92          0          0      14356          0          0          0          0   PCI-MSI-edge   eth2-rx-3
  61:         92          0          0          0      91483          0          0          0   PCI-MSI-edge   eth2-rx-4
  62:         92          0          0          0          0      19495          0          0   PCI-MSI-edge   eth2-rx-5
  63:         92          0          0          0          0          0         24          0   PCI-MSI-edge   eth2-rx-6
  64:         92          0          0          0          0          0          0      19605   PCI-MSI-edge   eth2-rx-7
  65:      94709          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
  66:         92         24          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
  67:         98          0         24          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
  68:         92          0          0     100208          0          0          0          0   PCI-MSI-edge   eth2-tx-3
  69:         92          0          0          0         24          0          0          0   PCI-MSI-edge   eth2-tx-4
  70:         92          0          0          0          0         24          0          0   PCI-MSI-edge   eth2-tx-5
  71:         92          0          0          0          0          0     144566          0   PCI-MSI-edge   eth2-tx-6
  72:         92          0          0          0          0          0          0         24   PCI-MSI-edge   eth2-tx-7
  73:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc

The output of /proc/interrupts is hinting at the problem that I'm
having... the only TX queues being chosen are 0, 3, and 6.  The
traffic I'm generating uses random source/dest pairs, each within a
/24, so I don't think I'm sending data that should break the
skb_tx_hash() routine.

Further, when I run top, I see that almost all of the interrupt
processing is happening on a single cpu.
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.3%hi,  0.7%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 19.3%hi, 80.7%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

This appears to be due to 'tx'-based activity... if I change my route
table to blackhole the traffic, the CPUs are nearly idle.

My next thought was to try multiqueue...
-bash-3.2# ./tc/tc qdisc add dev eth2 root handle 1: multiq
-bash-3.2# ./tc/tc qdisc show dev eth2
qdisc multiq 1: root refcnt 128 bands 8/128

With multiq scheduling, the CPU load evens out a bunch, but I still
have a soft-interrupt hot-spot (see Cpu3 below; also note that only
CPUs 0, 3, and 6 are handling hardware interrupts):
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 69.9%id,  0.0%wa,  0.3%hi, 29.8%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 64.8%id,  0.0%wa,  0.0%hi, 35.2%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 76.5%id,  0.0%wa,  0.0%hi, 23.5%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  4.8%id,  0.0%wa,  2.6%hi, 92.6%si,  0.0%st
Cpu4  :  0.3%us,  0.3%sy,  0.0%ni, 76.2%id,  0.3%wa,  0.0%hi, 22.8%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 49.4%id,  0.0%wa,  0.0%hi, 50.6%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 56.8%id,  0.0%wa,  1.0%hi, 42.3%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 51.6%id,  0.0%wa,  0.0%hi, 48.4%si,  0.0%st

However, what I see with multiqueue enabled is that I'm dropping 80%
of my traffic (which appears to be due to a large number of
'rx_missed_errors').

Any thoughts on what I'm doing wrong or where I should continue to look?

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson
@ 2009-04-30  9:07 ` Jens Låås
  2009-04-30  9:24   ` David Miller
  2009-05-01 10:20 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 28+ messages in thread
From: Jens Låås @ 2009-04-30  9:07 UTC (permalink / raw)
  To: Andrew Dickinson, netdev

2009/4/30, Andrew Dickinson <andrew@whydna.net>:
> Howdy list,
>
>  Background...
>  I'm trying to evaluate a new system for routing performance for some
>  custom packet modification that we do.  To start, I'm trying to get a
>  high-water mark of routing performance without our custom cruft in the
>  middle.  The hardware setup is a dual-package Nehalem box (X5550,
>  Hyper-Threading disabled) with a dual 10G intel card (pci-id:
>  8086:10fb).  Because this NIC is freakishly new, I'm running the
>  latest torvalds kernel in order to get the ixgbe driver to identify it
>  (<sigh>).  With HT off, I've got 8 cores in the system.  For the sake
>  of reducing the number of variables that I'm dealing with, I'm only
>  using one of the NICs to start with and simply routing packets back
>  out the single 10G NIC.

OK.

We have done quite a bit of 10G testing, so I'll comment based on our
experience.

>
>  Interrupts...
>  I've disabled irqbalance and I'm explicitly pinning interrupts, one
>  per core, as follows:

Yes, setting affinity is a must for high performance.

It is also important that TX affinity matches RX affinity, so that
TX completion runs on the same CPU as RX.
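Andrew's affinity loop (quoted below) already pairs them up: queue i's
RX interrupt (57 + i) and TX interrupt (65 + i) are both pinned to
CPU i. A quick model of that pinning, with the IRQ numbers taken from
his output:

```python
def irq_to_cpu(irq):
    # Mirrors the affinity loop: RX irqs 57..64 and TX irqs 65..72,
    # with queue i pinned to CPU i for both directions.
    for base in (57, 65):
        if base <= irq < base + 8:
            return irq - base
    return None  # not one of eth2's queue interrupts

# eth2-rx-0 (irq 57) and eth2-tx-0 (irq 65) land on the same CPU:
print(irq_to_cpu(57), irq_to_cpu(65))  # prints 0 0
```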

>
>  -bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done; done
>
>  -bash-3.2# for i in `seq 57 72`; do cat /proc/irq/$i/smp_affinity; done
>  0001
>  0002
>  0004
>  0008
>  0010
>  0020
>  0040
>  0080
>  0001
>  0002
>  0004
>  0008
>  0010
>  0020
>  0040
>  0080
>
>  -bash-3.2# cat /proc/interrupts  | grep eth2
>   57:      77941          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>   58:         92      59682          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>   59:         92          0      21716          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>   60:         92          0          0      14356          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>   61:         92          0          0          0      91483          0          0          0   PCI-MSI-edge   eth2-rx-4
>   62:         92          0          0          0          0      19495          0          0   PCI-MSI-edge   eth2-rx-5
>   63:         92          0          0          0          0          0         24          0   PCI-MSI-edge   eth2-rx-6
>   64:         92          0          0          0          0          0          0      19605   PCI-MSI-edge   eth2-rx-7
>   65:      94709          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>   66:         92         24          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>   67:         98          0         24          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>   68:         92          0          0     100208          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>   69:         92          0          0          0         24          0          0          0   PCI-MSI-edge   eth2-tx-4
>   70:         92          0          0          0          0         24          0          0   PCI-MSI-edge   eth2-tx-5
>   71:         92          0          0          0          0          0     144566          0   PCI-MSI-edge   eth2-tx-6
>   72:         92          0          0          0          0          0          0         24   PCI-MSI-edge   eth2-tx-7
>   73:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
>
>  The output of /proc/interrupts is hinting at the problem that I'm
>  having...  The TX queues which are being chosen are only 0, 3, and 6.
>  The flow of traffic that I'm generating is random source/dest pairs,
>  each within a /24, so I don't think that I'm sending data that should
>  be breaking the skb_tx_hash() routine.

RX-side looks good.  TX-side looks like what we also got with vanilla Linux.

What we do is patch all drivers with a custom select_queue function
that selects the same outgoing queue as the incoming queue. With a
one-to-one mapping of queues to CPUs you can also use the processor id.

This way we get good performance.

Another approach we are looking at is an abstraction to help with the
queue mapping (we call it a 'flowtrunk'), which is then configurable
from userspace.


>
>  Further, when I run top, I see that almost all of the interrupt
>  processing is happening on a single cpu.
>  Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>  Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.3%hi,  0.7%si,  0.0%st
>  Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>  Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 19.3%hi, 80.7%si,  0.0%st
>  Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>
>  This appears to be due to 'tx'-based activity... if I change my route
>  table to blackhole the traffic, the CPUs are nearly idle.
>
>  My next thought was to try multiqueue...
>  -bash-3.2# ./tc/tc qdisc add dev eth2 root handle 1: multiq
>  -bash-3.2# ./tc/tc qdisc show dev eth2
>  qdisc multiq 1: root refcnt 128 bands 8/128
>
>  With multiq scheduling, the CPU load evens out a bunch, but I still
>  have a soft-interrupt hot-spot (see CPU3 here.  Also note that only
>  CPU's 0, 3, and 6 are handling hardware interrupts.):
>  Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 69.9%id,  0.0%wa,  0.3%hi, 29.8%si,  0.0%st
>  Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 64.8%id,  0.0%wa,  0.0%hi, 35.2%si,  0.0%st
>  Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 76.5%id,  0.0%wa,  0.0%hi, 23.5%si,  0.0%st
>  Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  4.8%id,  0.0%wa,  2.6%hi, 92.6%si,  0.0%st
>  Cpu4  :  0.3%us,  0.3%sy,  0.0%ni, 76.2%id,  0.3%wa,  0.0%hi, 22.8%si,  0.0%st
>  Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 49.4%id,  0.0%wa,  0.0%hi, 50.6%si,  0.0%st
>  Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 56.8%id,  0.0%wa,  1.0%hi, 42.3%si,  0.0%st
>  Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 51.6%id,  0.0%wa,  0.0%hi, 48.4%si,  0.0%st
>
>  However, what I see with multiqueue enabled is that I'm dropping 80%
>  of my traffic (which appears to be due to a large number of
>  'rx_missed_errors').
>
>  Any thoughts on what I'm doing wrong or where I should continue to look?

Changing the qdisc won't help, since every qdisc except pfifo_fast
serializes all CPUs onto one qdisc; pfifo_fast creates a separate
qdisc per tx_queue.

If you don't want to patch the kernel, you can try increasing the
queue length of the pfifo_fast qdisc.

Cheers,
Jens

>
>  -Andrew
>
> --
>  To unsubscribe from this list: send the line "unsubscribe netdev" in
>  the body of a message to majordomo@vger.kernel.org
>  More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:07 ` Jens Låås
@ 2009-04-30  9:24   ` David Miller
  2009-04-30 10:51     ` Jens Låås
  2009-04-30 14:04     ` Andrew Dickinson
  0 siblings, 2 replies; 28+ messages in thread
From: David Miller @ 2009-04-30  9:24 UTC (permalink / raw)
  To: jelaas; +Cc: andrew, netdev

From: Jens Låås <jelaas@gmail.com>
Date: Thu, 30 Apr 2009 11:07:35 +0200

> RX-side looks good. TX-side looks like what we also got with vanilla linux.
> 
> What we do is patch all drivers with a custom select_queue function
> that selects the same outgoing queue as the incoming queue. With a one
> to one mapping of queues to CPUs you can also use the processor id.
> 
> This way we get performance.

I don't understand why this can even be necessary.

With the current code, the RX queue of a packet becomes
the hash for the TX queue.

If all the TX activity is happening on one TX queue then
there is a bug somewhere.

Either the receiving device isn't invoking skb_record_rx_queue()
correctly, or there is some bug in how we compute the TX hash.

Everyone adds their own hacks, but that absolutely should not be
necessary, the kernel is essentially doing what you are adding
hacks for.

The only possible problems are bugs in the code, and we should find
those bugs instead of constantly talking about 'local select_queue
hacks we add to our cool driver for performance' :-/



* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:24   ` David Miller
@ 2009-04-30 10:51     ` Jens Låås
  2009-04-30 11:05       ` David Miller
  2009-04-30 14:04     ` Andrew Dickinson
  1 sibling, 1 reply; 28+ messages in thread
From: Jens Låås @ 2009-04-30 10:51 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, netdev

2009/4/30, David Miller <davem@davemloft.net>:
> From: Jens Låås <jelaas@gmail.com>
>  Date: Thu, 30 Apr 2009 11:07:35 +0200
>
>
>  > RX-side looks good. TX-side looks like what we also got with vanilla linux.
>  >
>  > What we do is patch all drivers with a custom select_queue function
>  > that selects the same outgoing queue as the incoming queue. With a one
>  > to one mapping of queues to CPUs you can also use the processor id.
>  >
>  > This way we get performance.
>
>
> I don't understand why this can even be necessary.
>
>  With the current code, the RX queue of a packet becomes
>  the hash for the TX queue.
>
>  If all the TX activity is happening on one TX queue then
>  there is a bug somewhere.

If I remember correctly we did get use of several tx-queues, not just
one.  The hashed distribution missed a few of the tx-queues, though,
and it also looked like some rx-queues got mapped on top of the same
tx-queue.

At the time we reasoned that this behaviour was expected from the
hash randomization.  We may well have misunderstood this, and a
one-to-one mapping should be expected.

Hopefully the case where we have several devices and want TX
completion to match the rx-queue can also be solved.  (The assumption
that TX completion needs to run on the same CPU may also be proved
wrong, but we haven't seen that in tests so far.)

The main problem, though, was that the mapping is randomized: we
wanted to set smp_affinity for tx to correctly match rx.  That was
actually the main reason for our local hacks.

>
>  Either the receiving device isn't invoking skb_record_rx_queue()
>  correctly, or there is some bug in how we compute the TX hash.
>
>  Everyone adds their own hacks, but that absolutely should not be
>  necessary, the kernel is essentially doing what you are adding
>  hacks for.
>
>  The only possible problems are bugs in the code, and we should find
>  those bugs instead of constantly talking about 'local select_queue
>  hacks we add to our cool driver for performance' :-/

We certainly don't consider the hacks cool in any way; they were only
for a specific purpose and a specific kernel version.

Cheers,
Jens


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 10:51     ` Jens Låås
@ 2009-04-30 11:05       ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2009-04-30 11:05 UTC (permalink / raw)
  To: jelaas; +Cc: andrew, netdev

From: Jens Låås <jelaas@gmail.com>
Date: Thu, 30 Apr 2009 12:51:16 +0200

> The main problem though was that the mapping is randomized. We wanted
> to set smp_affinity correctly for tx to match rx. That was actually
> the main reason for our local hacks.

It's NOT RANDOMIZED, READ THE CODE!

It takes the RX queue number and uses it to select the TX
queue.

That's anything but random!


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:24   ` David Miller
  2009-04-30 10:51     ` Jens Låås
@ 2009-04-30 14:04     ` Andrew Dickinson
  2009-04-30 14:08       ` David Miller
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Dickinson @ 2009-04-30 14:04 UTC (permalink / raw)
  To: David Miller; +Cc: jelaas, netdev

<snip>

> If all the TX activity is happening on one TX queue then
> there is a bug somewhere.
>
> Either the receiving device isn't invoking skb_record_rx_queue()
> correctly, or there is some bug in how we compute the TX hash.

 I'll do some debugging around skb_tx_hash() and see if I can make
sense of it.  I'll let you know what I find.  My hypothesis is that
skb_record_rx_queue() isn't being called, but I should dig into it
before I start making claims. ;-P

<snip>


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 14:04     ` Andrew Dickinson
@ 2009-04-30 14:08       ` David Miller
  2009-04-30 23:53         ` Andrew Dickinson
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-04-30 14:08 UTC (permalink / raw)
  To: andrew; +Cc: jelaas, netdev

From: Andrew Dickinson <andrew@whydna.net>
Date: Thu, 30 Apr 2009 07:04:33 -0700

>  I'll do some debugging around skb_tx_hash() and see if I can make
> sense of it.  I'll let you know what I find.  My hypothesis is that
> skb_record_rx_queue() isn't being called, but I should dig into it
> before I start making claims. ;-P

That's one possibility.

Another is that the hashing isn't working out.  One way to
play with that is to simply replace the:

		hash = skb_get_rx_queue(skb);

in skb_tx_hash() with something like:

		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;

and see if that improves the situation.
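Read literally, the suggested modulo makes the mapping the identity
whenever the RX and TX queue counts match, which is why it should
spread TX work across all queues. A tiny sketch of the arithmetic
(illustration only, not the kernel code):

```python
def tx_queue_modulo(rx_queue, num_tx_queues):
    # Models the suggested line:
    #   return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
    return rx_queue % num_tx_queues

# With 8 RX queues and 8 TX queues the mapping is the identity, so
# every TX queue gets used and rx/tx queues line up one-to-one.
print([tx_queue_modulo(q, 8) for q in range(8)])
# prints [0, 1, 2, 3, 4, 5, 6, 7]
```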


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 14:08       ` David Miller
@ 2009-04-30 23:53         ` Andrew Dickinson
  2009-05-01  4:19           ` Andrew Dickinson
  2009-05-01  6:14           ` Eric Dumazet
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Dickinson @ 2009-04-30 23:53 UTC (permalink / raw)
  To: David Miller; +Cc: jelaas, netdev

OK... I've got some more data on it...

I passed a small number of packets through the system and added a ton
of printks to it ;-P

Here's the distribution of values as seen by
skb_rx_queue_recorded()... count on the left, value on the right:
     37 0
     31 1
     31 2
     39 3
     37 4
     31 5
     42 6
     39 7

That's nice and even....  Here's what's getting returned from the
skb_tx_hash().  Again, count on the left, value on the right:
     31 0
     81 1
     37 2
     70 3
     37 4
     31 6

Note that we're entirely missing 5 and 7 and that those interrupts
seem to have gotten munged onto 1 and 3.

I think the voodoo lies within:
    return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
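The fold Andrew quotes takes a 32-bit jhash of the recorded RX queue
and scales it into [0, real_num_tx_queues) by multiply-and-shift. A
userspace sketch (jhash_1word reconstructed from memory of the 2.6.x
kernel, so treat the constants as approximate) shows why hashing only
8 distinct inputs into 8 buckets routinely leaves some TX queues
unused:

```python
M = 0xffffffff                # 32-bit wrap
GOLDEN_RATIO = 0x9e3779b9     # JHASH_GOLDEN_RATIO

def _jhash_mix(a, b, c):
    # __jhash_mix from the 2.6.x kernel (reconstructed from memory);
    # returns the folded word c.
    a = (a - b - c) & M; a ^= c >> 13
    b = (b - c - a) & M; b ^= (a << 8) & M
    c = (c - a - b) & M; c ^= b >> 13
    a = (a - b - c) & M; a ^= c >> 12
    b = (b - c - a) & M; b ^= (a << 16) & M
    c = (c - a - b) & M; c ^= b >> 3
    a = (a - b - c) & M; a ^= c >> 10
    b = (b - c - a) & M; b ^= (a << 15) & M
    return c

def jhash_1word(word, initval):
    return _jhash_mix((word + GOLDEN_RATIO) & M, GOLDEN_RATIO, initval & M)

def skb_tx_hash(rx_queue, num_tx_queues, hashrnd):
    # The fold quoted above: (u16)(((u64)hash * real_num_tx_queues) >> 32)
    h = jhash_1word(rx_queue, hashrnd)
    return (h * num_tx_queues) >> 32

# Count how many of 200 sample seeds leave at least one of the 8 TX
# queues unused when the 8 RX queue numbers go through the fold.
collided = sum(
    1 for seed in range(200)
    if len({skb_tx_hash(q, 8, seed * GOLDEN_RATIO % 2**32) for q in range(8)}) < 8)
print(collided, "of 200 sample seeds leave some TX queue unused")
```

With 8 hashed values thrown into 8 buckets, the chance that all 8 land
in distinct buckets is only 8!/8^8, about 0.24%, so on almost any boot
(the seed is random) some TX queues go unused while others double up.
That is consistent with the hot spots on 0, 3 and 6 and the missing 5
and 7 above.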

David,  I made the change that you suggested:
        //hash = skb_get_rx_queue(skb);
        return skb_get_rx_queue(skb) % dev->real_num_tx_queues;

And now, I see a nice even mixing of interrupts on the TX side (yay!).

However, my problem's not solved entirely... here's what top is showing me:
top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
Swap:  2096472k total,        0k used,  2096472k free,   146364k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
    7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
   13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
   19 root      15  -5     0    0    0 R  97.8  0.0   5:34.52 ksoftirqd/5
   25 root      15  -5     0    0    0 R  94.5  0.0   5:13.56 ksoftirqd/7
 3905 root      20   0 12612 1084  820 R   0.3  0.0   0:00.14 top
<snip>


It appears that only the odd CPUs are actually handling the
interrupts, which doesn't jibe with what /proc/interrupts shows me:
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
  67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
  68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
  69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
  70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
  71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
  72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
  73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
  74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
  75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
  76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
  77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
  78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
  79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
  80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
  81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
  82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc


Why would ksoftirqd only run on half of the cores (and only the odd
ones to boot)?  The one commonality that strikes me is that all the
odd CPU#'s are on the same physical processor:

-bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
processor	: 0
physical id	: 0
processor	: 1
physical id	: 1
processor	: 2
physical id	: 0
processor	: 3
physical id	: 1
processor	: 4
physical id	: 0
processor	: 5
physical id	: 1
processor	: 6
physical id	: 0
processor	: 7
physical id	: 1
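Transcribing the cpuinfo output above into a table confirms the
observation: every odd-numbered CPU sits on package 1 and every even
one on package 0.

```python
# processor -> physical id, transcribed from the cpuinfo output above
topology = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1}

odd_packages = {topology[cpu] for cpu in range(1, 8, 2)}
even_packages = {topology[cpu] for cpu in range(0, 8, 2)}
print(odd_packages, even_packages)  # prints {1} {0}
```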

I did compile the kernel with NUMA support... am I being bitten by
something there?  Any other thoughts on where I should look?

Also... is there an incantation to get NAPI to work in the torvalds
kernel?  As you can see, I'm generating quite a few interrupts.

-A


On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
> From: Andrew Dickinson <andrew@whydna.net>
> Date: Thu, 30 Apr 2009 07:04:33 -0700
>
>>  I'll do some debugging around skb_tx_hash() and see if I can make
>> sense of it.  I'll let you know what I find.  My hypothesis is that
>> skb_record_rx_queue() isn't being called, but I should dig into it
>> before I start making claims. ;-P
>
> That's one possibility.
>
> Another is that the hashing isn't working out.  One way to
> play with that is to simply replace the:
>
>                hash = skb_get_rx_queue(skb);
>
> in skb_tx_hash() with something like:
>
>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>
> and see if that improves the situation.
>


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 23:53         ` Andrew Dickinson
@ 2009-05-01  4:19           ` Andrew Dickinson
  2009-05-01  7:32             ` Eric Dumazet
  2009-05-01  6:14           ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Dickinson @ 2009-05-01  4:19 UTC (permalink / raw)
  To: David Miller; +Cc: jelaas, netdev

Adding a bit more info...

I should add that the other 4 ksoftirqd threads _are_ running,
they're just not busy (in case that wasn't clear...).

Also of note, I rebooted the box (after recompiling with NUMA off).
This time when I push traffic through, only the even ksoftirqd's were
busy.  I then tweaked some of the ring settings via ethtool and
suddenly the odd ksoftirqd's became busy (and the even ones went
idle).

Thoughts?  Suggestions?  Driver issue?  I'm at 2.6.30-rc3.

(BTW, I'm assuming that since only 4 of 8 ksoftirqd's are busy, I
still have room to make this box go faster.)

-A


On Thu, Apr 30, 2009 at 4:53 PM, Andrew Dickinson <andrew@whydna.net> wrote:
> OK... I've got some more data on it...
>
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
>
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
>     37 0
>     31 1
>     31 2
>     39 3
>     37 4
>     31 5
>     42 6
>     39 7
>
> That's nice and even....  Here's what's getting returned from the
> skb_tx_hash().  Again, count on the left, value on the right:
>     31 0
>     81 1
>     37 2
>     70 3
>     37 4
>     31 6
>
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
>
> I think the voodoo lies within:
>    return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>
> David,  I made the change that you suggested:
>        //hash = skb_get_rx_queue(skb);
>        return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>
> However, my problem's not solved entirely... here's what top is showing me:
> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>    7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
> ksoftirqd/1
>   13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
> ksoftirqd/3
>   19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
> ksoftirqd/5
>   25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
> ksoftirqd/7
>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
> <snip>
>
>
> It appears that only the odd CPUs are actually handling the
> interrupts, which doesn't jive with what /proc/interrupts shows me:
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>  66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge    eth2-rx-0
>  67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge    eth2-rx-1
>  68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge    eth2-rx-2
>  69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge    eth2-rx-3
>  70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge    eth2-rx-4
>  71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge    eth2-rx-5
>  72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge    eth2-rx-6
>  73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge    eth2-rx-7
>  74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge    eth2-tx-0
>  75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge    eth2-tx-1
>  76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge    eth2-tx-2
>  77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge    eth2-tx-3
>  78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge    eth2-tx-4
>  79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge    eth2-tx-5
>  80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge    eth2-tx-6
>  81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge    eth2-tx-7
>  82:          2          0          0          0          0          0          0          0   PCI-MSI-edge    eth2:lsc
>
>
> Why would ksoftirqd only run on half of the cores (and only the odd
> ones to boot)?  The one commonality that's striking me is that
> all the odd CPU#'s are on the same physical processor:
>
> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
> processor       : 0
> physical id     : 0
> processor       : 1
> physical id     : 1
> processor       : 2
> physical id     : 0
> processor       : 3
> physical id     : 1
> processor       : 4
> physical id     : 0
> processor       : 5
> physical id     : 1
> processor       : 6
> physical id     : 0
> processor       : 7
> physical id     : 1
>
> I did compile the kernel with NUMA support... am I being bitten by
> something there?  Any other thoughts on where I should look?
>
> Also... is there an incantation to get NAPI to work in the torvalds
> kernel?  As you can see, I'm generating quite a few interrupts.
>
> -A
>
>
> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>> From: Andrew Dickinson <andrew@whydna.net>
>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>
>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>> before I start making claims. ;-P
>>
>> That's one possibility.
>>
>> Another is that the hashing isn't working out.  One way to
>> play with that is to simply replace the:
>>
>>                hash = skb_get_rx_queue(skb);
>>
>> in skb_tx_hash() with something like:
>>
>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> and see if that improves the situation.
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 23:53         ` Andrew Dickinson
  2009-05-01  4:19           ` Andrew Dickinson
@ 2009-05-01  6:14           ` Eric Dumazet
  2009-05-01  6:19             ` Andrew Dickinson
                               ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  6:14 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Andrew Dickinson wrote:
> OK... I've got some more data on it...
> 
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
> 
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
>      37 0
>      31 1
>      31 2
>      39 3
>      37 4
>      31 5
>      42 6
>      39 7
> 
> That's nice and even....  Here's what's getting returned from the
> skb_tx_hash().  Again, count on the left, value on the right:
>      31 0
>      81 1
>      37 2
>      70 3
>      37 4
>      31 6
> 
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
> 
> I think the voodoo lies within:
>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> 
> David,  I made the change that you suggested:
>         //hash = skb_get_rx_queue(skb);
>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> 
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
> 
> However, my problem's not solved entirely... here's what top is showing me:
> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
> ksoftirqd/1
>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
> ksoftirqd/3
>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
> ksoftirqd/5
>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
> ksoftirqd/7
>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
> <snip>
> 
> 
> It appears that only the odd CPUs are actually handling the
> interrupts, which doesn't jive with what /proc/interrupts shows me:
>             CPU0       CPU1	  CPU2       CPU3	CPU4	   CPU5       CPU6	 CPU7
>   66:    2970565          0          0          0          0
> 0          0          0   PCI-MSI-edge	  eth2-rx-0
>   67:         28     821122          0          0          0
> 0          0          0   PCI-MSI-edge	  eth2-rx-1
>   68:         28          0    2943299          0          0
> 0          0          0   PCI-MSI-edge	  eth2-rx-2
>   69:         28          0          0     817776          0
> 0          0          0   PCI-MSI-edge	  eth2-rx-3
>   70:         28          0          0          0    2963924
> 0          0          0   PCI-MSI-edge	  eth2-rx-4
>   71:         28          0          0          0          0
> 821032          0          0   PCI-MSI-edge	  eth2-rx-5
>   72:         28          0          0          0          0
> 0    2979987          0   PCI-MSI-edge	  eth2-rx-6
>   73:         28          0          0          0          0
> 0          0     845422   PCI-MSI-edge	  eth2-rx-7
>   74:    4664732          0          0          0          0
> 0          0          0   PCI-MSI-edge	  eth2-tx-0
>   75:         34    4679312          0          0          0
> 0          0          0   PCI-MSI-edge	  eth2-tx-1
>   76:         28          0    4665014          0          0
> 0          0          0   PCI-MSI-edge	  eth2-tx-2
>   77:         28          0          0    4681531          0
> 0          0          0   PCI-MSI-edge	  eth2-tx-3
>   78:         28          0          0          0    4665793
> 0          0          0   PCI-MSI-edge	  eth2-tx-4
>   79:         28          0          0          0          0
> 4671596          0          0   PCI-MSI-edge	  eth2-tx-5
>   80:         28          0          0          0          0
> 0    4665279          0   PCI-MSI-edge	  eth2-tx-6
>   81:         28          0          0          0          0
> 0          0    4664504   PCI-MSI-edge	  eth2-tx-7
>   82:          2          0          0          0          0
> 0          0          0   PCI-MSI-edge	  eth2:lsc
> 
> 
> Why would ksoftirqd only run on half of the cores (and only the odd
> ones to boot)?  The one commonality that's striking me is that that
> all the odd CPU#'s are on the same physical processor:
> 
> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
> processor	: 0
> physical id	: 0
> processor	: 1
> physical id	: 1
> processor	: 2
> physical id	: 0
> processor	: 3
> physical id	: 1
> processor	: 4
> physical id	: 0
> processor	: 5
> physical id	: 1
> processor	: 6
> physical id	: 0
> processor	: 7
> physical id	: 1
> 
> I did compile the kernel with NUMA support... am I being bitten by
> something there?  Other thoughts on where I should look.
> 
> Also... is there an incantation to get NAPI to work in the torvalds
> kernel?  As you can see, I'm generating quite a few interrrupts.
> 
> -A
> 
> 
> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>> From: Andrew Dickinson <andrew@whydna.net>
>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>
>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>> before I start making claims. ;-P
>> That's one possibility.
>>
>> Another is that the hashing isn't working out.  One way to
>> play with that is to simply replace the:
>>
>>                hash = skb_get_rx_queue(skb);
>>
>> in skb_tx_hash() with something like:
>>
>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> and see if that improves the situation.
>>

Hi Andrew

Please try the following patch (I don't have a multi-queue NIC, sorry).

I will do the follow-up patch if this one corrects the distribution problem
you noticed.

Thanks very much for all your findings.

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use the jhash distribution,
as the device driver told us exactly which queue was selected at RX time.
jhash makes a statistical shuffle, but this won't work with 8 static inputs.

A later improvement would be to compute the reciprocal value of real_num_tx_queues
to avoid the divide here. But that computation should be done once,
when real_num_tx_queues is set. It needs a separate patch, and a new
field in struct net_device.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..e2e9e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+	if (skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  6:14           ` Eric Dumazet
@ 2009-05-01  6:19             ` Andrew Dickinson
  2009-05-01  6:40               ` Eric Dumazet
  2009-05-01  8:29             ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet
  2009-05-01 16:08             ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Dickinson @ 2009-05-01  6:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jelaas, netdev

On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson wrote:
>> OK... I've got some more data on it...
>>
>> I passed a small number of packets through the system and added a ton
>> of printks to it ;-P
>>
>> Here's the distribution of values as seen by
>> skb_rx_queue_recorded()... count on the left, value on the right:
>>      37 0
>>      31 1
>>      31 2
>>      39 3
>>      37 4
>>      31 5
>>      42 6
>>      39 7
>>
>> That's nice and even....  Here's what's getting returned from the
>> skb_tx_hash().  Again, count on the left, value on the right:
>>      31 0
>>      81 1
>>      37 2
>>      70 3
>>      37 4
>>      31 6
>>
>> Note that we're entirely missing 5 and 7 and that those interrupts
>> seem to have gotten munged onto 1 and 3.
>>
>> I think the voodoo lies within:
>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>
>> David,  I made the change that you suggested:
>>         //hash = skb_get_rx_queue(skb);
>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>
>> However, my problem's not solved entirely... here's what top is showing me:
>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>> ksoftirqd/1
>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>> ksoftirqd/3
>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>> ksoftirqd/5
>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>> ksoftirqd/7
>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>> <snip>
>>
>>
>> It appears that only the odd CPUs are actually handling the
>> interrupts, which doesn't jive with what /proc/interrupts shows me:
>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>   66:    2970565          0          0          0          0
>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>   67:         28     821122          0          0          0
>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>   68:         28          0    2943299          0          0
>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>   69:         28          0          0     817776          0
>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>   70:         28          0          0          0    2963924
>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>   71:         28          0          0          0          0
>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>   72:         28          0          0          0          0
>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>   73:         28          0          0          0          0
>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>   74:    4664732          0          0          0          0
>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>   75:         34    4679312          0          0          0
>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>   76:         28          0    4665014          0          0
>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>   77:         28          0          0    4681531          0
>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>   78:         28          0          0          0    4665793
>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>   79:         28          0          0          0          0
>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>   80:         28          0          0          0          0
>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>   81:         28          0          0          0          0
>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>   82:          2          0          0          0          0
>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>
>>
>> Why would ksoftirqd only run on half of the cores (and only the odd
>> ones to boot)?  The one commonality that's striking me is that that
>> all the odd CPU#'s are on the same physical processor:
>>
>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>> processor     : 0
>> physical id   : 0
>> processor     : 1
>> physical id   : 1
>> processor     : 2
>> physical id   : 0
>> processor     : 3
>> physical id   : 1
>> processor     : 4
>> physical id   : 0
>> processor     : 5
>> physical id   : 1
>> processor     : 6
>> physical id   : 0
>> processor     : 7
>> physical id   : 1
>>
>> I did compile the kernel with NUMA support... am I being bitten by
>> something there?  Other thoughts on where I should look.
>>
>> Also... is there an incantation to get NAPI to work in the torvalds
>> kernel?  As you can see, I'm generating quite a few interrrupts.
>>
>> -A
>>
>>
>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Andrew Dickinson <andrew@whydna.net>
>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>
>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>> before I start making claims. ;-P
>>> That's one possibility.
>>>
>>> Another is that the hashing isn't working out.  One way to
>>> play with that is to simply replace the:
>>>
>>>                hash = skb_get_rx_queue(skb);
>>>
>>> in skb_tx_hash() with something like:
>>>
>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>
>>> and see if that improves the situation.
>>>
>
> Hi Andrew
>
> Please try following patch (I dont have multi-queue NIC, sorry)
>
> I will do the followup patch if this ones corrects the distribution problem
> you noticed.
>
> Thanks very much for all your findings.
>
> [PATCH] net: skb_tx_hash() improvements
>
> When skb_rx_queue_recorded() is true, we dont want to use jash distribution
> as the device driver exactly told us which queue was selected at RX time.
> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>
> Later improvements would be to compute reciprocal value of real_num_tx_queues
> to avoid a divide here. But this computation should be done once,
> when real_num_tx_queues is set. This needs a separate patch, and a new
> field in struct net_device.
>
> Reported-by: Andrew Dickinson <andrew@whydna.net>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 308a7d0..e2e9e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>  {
>        u32 hash;
>
> -       if (skb_rx_queue_recorded(skb)) {
> -               hash = skb_get_rx_queue(skb);
> -       } else if (skb->sk && skb->sk->sk_hash) {
> +       if (skb_rx_queue_recorded(skb))
> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +       if (skb->sk && skb->sk->sk_hash)
>                hash = skb->sk->sk_hash;
> -       } else
> +       else
>                hash = skb->protocol;
>
>        hash = jhash_1word(hash, skb_tx_hashrnd);
>
>

Eric,

That's exactly what I did!  It solved the problem of hot-spots on some
interrupts.  However, I now have a new problem (documented in
my previous posts).  The short of it is that I'm only seeing 4 (out of
8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
busy 4 are always on one physical package, though not always the same
one (it changes on reboot or when I change some parameters via
ethtool), and never both.  This is despite /proc/interrupts showing
that all 8 interrupts are being hit evenly.  There are more details in
my last mail. ;-D

-Andrew

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  6:19             ` Andrew Dickinson
@ 2009-05-01  6:40               ` Eric Dumazet
  2009-05-01  7:23                 ` Andrew Dickinson
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  6:40 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Andrew Dickinson wrote:
> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Andrew Dickinson wrote:
>>> OK... I've got some more data on it...
>>>
>>> I passed a small number of packets through the system and added a ton
>>> of printks to it ;-P
>>>
>>> Here's the distribution of values as seen by
>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>      37 0
>>>      31 1
>>>      31 2
>>>      39 3
>>>      37 4
>>>      31 5
>>>      42 6
>>>      39 7
>>>
>>> That's nice and even....  Here's what's getting returned from the
>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>      31 0
>>>      81 1
>>>      37 2
>>>      70 3
>>>      37 4
>>>      31 6
>>>
>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>> seem to have gotten munged onto 1 and 3.
>>>
>>> I think the voodoo lies within:
>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>
>>> David,  I made the change that you suggested:
>>>         //hash = skb_get_rx_queue(skb);
>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>
>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>
>>> However, my problem's not solved entirely... here's what top is showing me:
>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>> ksoftirqd/1
>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>> ksoftirqd/3
>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>> ksoftirqd/5
>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>> ksoftirqd/7
>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>> <snip>
>>>
>>>
>>> It appears that only the odd CPUs are actually handling the
>>> interrupts, which doesn't jive with what /proc/interrupts shows me:
>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>   66:    2970565          0          0          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>>   67:         28     821122          0          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>>   68:         28          0    2943299          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>>   69:         28          0          0     817776          0
>>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>>   70:         28          0          0          0    2963924
>>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>>   71:         28          0          0          0          0
>>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>>   72:         28          0          0          0          0
>>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>>   73:         28          0          0          0          0
>>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>>   74:    4664732          0          0          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>>   75:         34    4679312          0          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>>   76:         28          0    4665014          0          0
>>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>>   77:         28          0          0    4681531          0
>>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>>   78:         28          0          0          0    4665793
>>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>>   79:         28          0          0          0          0
>>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>>   80:         28          0          0          0          0
>>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>>   81:         28          0          0          0          0
>>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>>   82:          2          0          0          0          0
>>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>>
>>>
>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>> ones to boot)?  The one commonality that's striking me is that that
>>> all the odd CPU#'s are on the same physical processor:
>>>
>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>> processor     : 0
>>> physical id   : 0
>>> processor     : 1
>>> physical id   : 1
>>> processor     : 2
>>> physical id   : 0
>>> processor     : 3
>>> physical id   : 1
>>> processor     : 4
>>> physical id   : 0
>>> processor     : 5
>>> physical id   : 1
>>> processor     : 6
>>> physical id   : 0
>>> processor     : 7
>>> physical id   : 1
>>>
>>> I did compile the kernel with NUMA support... am I being bitten by
>>> something there?  Other thoughts on where I should look.
>>>
>>> Also... is there an incantation to get NAPI to work in the torvalds
>>> kernel?  As you can see, I'm generating quite a few interrrupts.
>>>
>>> -A
>>>
>>>
>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>
>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>> before I start making claims. ;-P
>>>> That's one possibility.
>>>>
>>>> Another is that the hashing isn't working out.  One way to
>>>> play with that is to simply replace the:
>>>>
>>>>                hash = skb_get_rx_queue(skb);
>>>>
>>>> in skb_tx_hash() with something like:
>>>>
>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>
>>>> and see if that improves the situation.
>>>>
>> Hi Andrew
>>
>> Please try following patch (I dont have multi-queue NIC, sorry)
>>
>> I will do the followup patch if this ones corrects the distribution problem
>> you noticed.
>>
>> Thanks very much for all your findings.
>>
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we dont want to use jash distribution
>> as the device driver exactly told us which queue was selected at RX time.
>> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>>
>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>> to avoid a divide here. But this computation should be done once,
>> when real_num_tx_queues is set. This needs a separate patch, and a new
>> field in struct net_device.
>>
>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 308a7d0..e2e9e4a 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>  {
>>        u32 hash;
>>
>> -       if (skb_rx_queue_recorded(skb)) {
>> -               hash = skb_get_rx_queue(skb);
>> -       } else if (skb->sk && skb->sk->sk_hash) {
>> +       if (skb_rx_queue_recorded(skb))
>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>> +
>> +       if (skb->sk && skb->sk->sk_hash)
>>                hash = skb->sk->sk_hash;
>> -       } else
>> +       else
>>                hash = skb->protocol;
>>
>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>
>>
> 
> Eric,
> 
> That's exactly what I did!  It solved the problem of hot-spots on some
> interrupts.  However, I now have a new problem (which is documented in
> my previous posts).  The short of it is that I'm only seeing 4 (out of
> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
> busy 4 are always on one physical package (but not always the same
> package (it'll change on reboot or when I change some parameters via
> ethtool), but never both.  This, despite /proc/interrupts showing me
> that all 8 interrupts are being hit evenly.  There's more details in
> my last mail. ;-D
> 

Well, I was reacting to your 'voodoo' comment about

return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);

since that line is not the problem. The problem comes from jhash(), which
shuffles the input, while in your case we want to select the same output
queue because of CPU affinities; no shuffle is required.

(assuming cpu0 is handling tx-queue-0 and rx-queue-0,
          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)

So either /proc/interrupts shows that your rx interrupts are not evenly
distributed, or ksoftirqd is triggered only on one physical CPU while on
the other CPUs the softirqs are not run from ksoftirqd. It's only a
matter of load.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  6:40               ` Eric Dumazet
@ 2009-05-01  7:23                 ` Andrew Dickinson
  2009-05-01  7:31                   ` Eric Dumazet
  2009-05-01 21:37                   ` Brandeburg, Jesse
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Dickinson @ 2009-05-01  7:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jelaas, netdev

On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson wrote:
>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>> Andrew Dickinson wrote:
>>>> OK... I've got some more data on it...
>>>>
>>>> I passed a small number of packets through the system and added a ton
>>>> of printks to it ;-P
>>>>
>>>> Here's the distribution of values as seen by
>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>      37 0
>>>>      31 1
>>>>      31 2
>>>>      39 3
>>>>      37 4
>>>>      31 5
>>>>      42 6
>>>>      39 7
>>>>
>>>> That's nice and even....  Here's what's getting returned from the
>>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>>      31 0
>>>>      81 1
>>>>      37 2
>>>>      70 3
>>>>      37 4
>>>>      31 6
>>>>
>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>> seem to have gotten munged onto 1 and 3.
>>>>
>>>> I think the voodoo lies within:
>>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>
>>>> David,  I made the change that you suggested:
>>>>         //hash = skb_get_rx_queue(skb);
>>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>
>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>
>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>>> ksoftirqd/1
>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>>> ksoftirqd/3
>>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>>> ksoftirqd/5
>>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>>> ksoftirqd/7
>>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>>> <snip>
>>>>
>>>>
>>>> It appears that only the odd CPUs are actually handling the
>>>> interrupts, which doesn't jive with what /proc/interrupts shows me:
>>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>   66:    2970565          0          0          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>>>   67:         28     821122          0          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>>>   68:         28          0    2943299          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>>>   69:         28          0          0     817776          0
>>>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>>>   70:         28          0          0          0    2963924
>>>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>>>   71:         28          0          0          0          0
>>>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>>>   72:         28          0          0          0          0
>>>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>>>   73:         28          0          0          0          0
>>>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>>>   74:    4664732          0          0          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>>>   75:         34    4679312          0          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>>>   76:         28          0    4665014          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>>>   77:         28          0          0    4681531          0
>>>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>>>   78:         28          0          0          0    4665793
>>>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>>>   79:         28          0          0          0          0
>>>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>>>   80:         28          0          0          0          0
>>>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>>>   81:         28          0          0          0          0
>>>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>>>   82:          2          0          0          0          0
>>>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>>>
>>>>
>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>> ones to boot)?  The one commonality that's striking me is that that
>>>> all the odd CPU#'s are on the same physical processor:
>>>>
>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>> processor     : 0
>>>> physical id   : 0
>>>> processor     : 1
>>>> physical id   : 1
>>>> processor     : 2
>>>> physical id   : 0
>>>> processor     : 3
>>>> physical id   : 1
>>>> processor     : 4
>>>> physical id   : 0
>>>> processor     : 5
>>>> physical id   : 1
>>>> processor     : 6
>>>> physical id   : 0
>>>> processor     : 7
>>>> physical id   : 1
>>>>
>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>> something there?  Other thoughts on where I should look.
>>>>
>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>> kernel?  As you can see, I'm generating quite a few interrrupts.
>>>>
>>>> -A
>>>>
>>>>
>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>
>>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>> before I start making claims. ;-P
>>>>> That's one possibility.
>>>>>
>>>>> Another is that the hashing isn't working out.  One way to
>>>>> play with that is to simply replace the:
>>>>>
>>>>>                hash = skb_get_rx_queue(skb);
>>>>>
>>>>> in skb_tx_hash() with something like:
>>>>>
>>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>
>>>>> and see if that improves the situation.
>>>>>
>>> Hi Andrew
>>>
>>> Please try following patch (I dont have multi-queue NIC, sorry)
>>>
>>> I will do the followup patch if this ones corrects the distribution problem
>>> you noticed.
>>>
>>> Thanks very much for all your findings.
>>>
>>> [PATCH] net: skb_tx_hash() improvements
>>>
>>> When skb_rx_queue_recorded() is true, we dont want to use jash distribution
>>> as the device driver exactly told us which queue was selected at RX time.
>>> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>>>
>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>> to avoid a divide here. But this computation should be done once,
>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>> field in struct net_device.
>>>
>>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 308a7d0..e2e9e4a 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>  {
>>>        u32 hash;
>>>
>>> -       if (skb_rx_queue_recorded(skb)) {
>>> -               hash = skb_get_rx_queue(skb);
>>> -       } else if (skb->sk && skb->sk->sk_hash) {
>>> +       if (skb_rx_queue_recorded(skb))
>>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>> +
>>> +       if (skb->sk && skb->sk->sk_hash)
>>>                hash = skb->sk->sk_hash;
>>> -       } else
>>> +       else
>>>                hash = skb->protocol;
>>>
>>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>>
>>>
>>
>> Eric,
>>
>> That's exactly what I did!  It solved the problem of hot-spots on some
>> interrupts.  However, I now have a new problem (which is documented in
>> my previous posts).  The short of it is that I'm only seeing 4 (out of
>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
>> busy 4 are always on one physical package (but not always the same
>> package (it'll change on reboot or when I change some parameters via
>> ethtool), but never both.  This, despite /proc/interrupts showing me
>> that all 8 interrupts are being hit evenly.  There's more details in
>> my last mail. ;-D
>>
>
> Well, I was reacting to your 'voodoo' comment about
>
> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>
> Since this is not the problem. The problem comes from jhash(), which shuffles
> the input, while in your case we want to select the same output queue
> because of cpu affinities. No shuffle required.

Agreed.  I don't want to jhash(), and I'm not.

> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)

That's a correct assumption. :D

> Then /proc/interrupts shows that your rx interrupts are not evenly distributed.
>
> Or that ksoftirqd is triggered only on one physical cpu, while on the other
> cpus, softirqs are not run from ksoftirqd. It's only a matter of load.

Hrmm... more fuel for the fire...

The NIC seems to be doing a good job of hashing the incoming data and
the kernel is now finding the right TX queue:
-bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
     rx_packets: 1286009099
     tx_packets: 1287853570
     tx_queue_0_packets: 162469405
     tx_queue_1_packets: 162452446
     tx_queue_2_packets: 162481160
     tx_queue_3_packets: 162441839
     tx_queue_4_packets: 162484930
     tx_queue_5_packets: 162478402
     tx_queue_6_packets: 162492530
     tx_queue_7_packets: 162477162
     rx_queue_0_packets: 162469449
     rx_queue_1_packets: 162452440
     rx_queue_2_packets: 162481186
     rx_queue_3_packets: 162441885
     rx_queue_4_packets: 162484949
     rx_queue_5_packets: 162478427

Here's where it gets juicy.  If I reduce the rate at which I'm pushing
traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
as follows:
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

And if I watch /proc/interrupts, I see that all of the tx and rx
queues are handling a fairly similar number of interrupts (ballpark,
7-8k/sec on rx, 10k on tx).

OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st

And if I watch /proc/interrupts again, I see that the even-CPU (i.e.
0, 2, 4, and 6) RX queues are receiving relatively few interrupts
(5-ish/sec (not 5k... just 5)) and the odd-CPU RX queues are
receiving about 2-3k/sec.  What's extra strange is that the TX queues
are still handling about 10k/sec each.

So, below some magic threshold (approx 2.3Mpps), the box is basically
idle and happily routing all the packets (I can confirm that my
network test device, an Ixia, is showing 0 loss).  Above the magic
threshold, the box starts acting as described above and I'm unable to
push it beyond that threshold.  While I understand that there are
limits to how fast I can route packets (obviously), it seems very
strange that I'm seeing this physical-CPU affinity on the ksoftirqd
"processes".

Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
ksoftirqd processes at 100%.  Never during this did the odd-CPU
ksoftirqd processes show any utilization at all.

These are 64-byte frames, so I shouldn't be hitting any bandwidth
issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
just routing packets back out the one NIC).

=/

-A

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  7:23                 ` Andrew Dickinson
@ 2009-05-01  7:31                   ` Eric Dumazet
  2009-05-01  7:34                     ` Andrew Dickinson
  2009-05-01 21:37                   ` Brandeburg, Jesse
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  7:31 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Andrew Dickinson a écrit :
> On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Andrew Dickinson a écrit :
>>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>>> Andrew Dickinson a écrit :
>>>>> OK... I've got some more data on it...
>>>>>
>>>>> I passed a small number of packets through the system and added a ton
>>>>> of printks to it ;-P
>>>>>
>>>>> Here's the distribution of values as seen by
>>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>>      37 0
>>>>>      31 1
>>>>>      31 2
>>>>>      39 3
>>>>>      37 4
>>>>>      31 5
>>>>>      42 6
>>>>>      39 7
>>>>>
>>>>> That's nice and even....  Here's what's getting returned from the
>>>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>>>      31 0
>>>>>      81 1
>>>>>      37 2
>>>>>      70 3
>>>>>      37 4
>>>>>      31 6
>>>>>
>>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>>> seem to have gotten munged onto 1 and 3.
>>>>>
>>>>> I think the voodoo lies within:
>>>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>>
>>>>> David,  I made the change that you suggested:
>>>>>         //hash = skb_get_rx_queue(skb);
>>>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>
>>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>>
>>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>>>
>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>>>> ksoftirqd/1
>>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>>>> ksoftirqd/3
>>>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>>>> ksoftirqd/5
>>>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>>>> ksoftirqd/7
>>>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>>>> <snip>
>>>>>
>>>>>
>>>>> It appears that only the odd CPUs are actually handling the
>>>>> interrupts, which doesn't jive with what /proc/interrupts shows me:
>>>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>>   66:    2970565          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>>>>   67:         28     821122          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>>>>   68:         28          0    2943299          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>>>>   69:         28          0          0     817776          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>>>>   70:         28          0          0          0    2963924
>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>>>>   71:         28          0          0          0          0
>>>>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>>>>   72:         28          0          0          0          0
>>>>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>>>>   73:         28          0          0          0          0
>>>>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>>>>   74:    4664732          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>>>>   75:         34    4679312          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>>>>   76:         28          0    4665014          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>>>>   77:         28          0          0    4681531          0
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>>>>   78:         28          0          0          0    4665793
>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>>>>   79:         28          0          0          0          0
>>>>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>>>>   80:         28          0          0          0          0
>>>>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>>>>   81:         28          0          0          0          0
>>>>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>>>>   82:          2          0          0          0          0
>>>>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>>>>
>>>>>
>>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>>> ones to boot)?  The one commonality that's striking me is that that
>>>>> all the odd CPU#'s are on the same physical processor:
>>>>>
>>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>>> processor     : 0
>>>>> physical id   : 0
>>>>> processor     : 1
>>>>> physical id   : 1
>>>>> processor     : 2
>>>>> physical id   : 0
>>>>> processor     : 3
>>>>> physical id   : 1
>>>>> processor     : 4
>>>>> physical id   : 0
>>>>> processor     : 5
>>>>> physical id   : 1
>>>>> processor     : 6
>>>>> physical id   : 0
>>>>> processor     : 7
>>>>> physical id   : 1
>>>>>
>>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>>> something there?  Other thoughts on where I should look.
>>>>>
>>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>>> kernel?  As you can see, I'm generating quite a few interrrupts.
>>>>>
>>>>> -A
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>>
>>>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>>> before I start making claims. ;-P
>>>>>> That's one possibility.
>>>>>>
>>>>>> Another is that the hashing isn't working out.  One way to
>>>>>> play with that is to simply replace the:
>>>>>>
>>>>>>                hash = skb_get_rx_queue(skb);
>>>>>>
>>>>>> in skb_tx_hash() with something like:
>>>>>>
>>>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>
>>>>>> and see if that improves the situation.
>>>>>>
>>>> Hi Andrew
>>>>
>>>> Please try following patch (I dont have multi-queue NIC, sorry)
>>>>
>>>> I will do the followup patch if this ones corrects the distribution problem
>>>> you noticed.
>>>>
>>>> Thanks very much for all your findings.
>>>>
>>>> [PATCH] net: skb_tx_hash() improvements
>>>>
>>>> When skb_rx_queue_recorded() is true, we dont want to use jash distribution
>>>> as the device driver exactly told us which queue was selected at RX time.
>>>> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>>>>
>>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>>> to avoid a divide here. But this computation should be done once,
>>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>>> field in struct net_device.
>>>>
>>>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 308a7d0..e2e9e4a 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>>  {
>>>>        u32 hash;
>>>>
>>>> -       if (skb_rx_queue_recorded(skb)) {
>>>> -               hash = skb_get_rx_queue(skb);
>>>> -       } else if (skb->sk && skb->sk->sk_hash) {
>>>> +       if (skb_rx_queue_recorded(skb))
>>>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>> +
>>>> +       if (skb->sk && skb->sk->sk_hash)
>>>>                hash = skb->sk->sk_hash;
>>>> -       } else
>>>> +       else
>>>>                hash = skb->protocol;
>>>>
>>>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>>>
>>>>
>>> Eric,
>>>
>>> That's exactly what I did!  It solved the problem of hot-spots on some
>>> interrupts.  However, I now have a new problem (which is documented in
>>> my previous posts).  The short of it is that I'm only seeing 4 (out of
>>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
>>> busy 4 are always on one physical package (but not always the same
>>> package (it'll change on reboot or when I change some parameters via
>>> ethtool), but never both.  This, despite /proc/interrupts showing me
>>> that all 8 interrupts are being hit evenly.  There's more details in
>>> my last mail. ;-D
>>>
>> Well, I was reacting to your 'voodoo' comment about
>>
>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>
>> Since this is not the problem. The problem comes from jhash(), which shuffles
>> the input, while in your case we want to select the same output queue
>> because of cpu affinities. No shuffle required.
> 
> Agreed.  I don't want to jhash(), and I'm not.
> 
>> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>>          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
> 
> That's a correct assumption. :D
> 
>> Then /proc/interrupts shows that your rx interrupts are not evenly distributed.
>>
>> Or that ksoftirqd is triggered only on one physical cpu, while on the other
>> cpus, softirqs are not run from ksoftirqd. It's only a matter of load.
> 
> Hrmm... more fuel for the fire...
> 
> The NIC seems to be doing a good job of hashing the incoming data and
> the kernel is now finding the right TX queue:
> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
>      rx_packets: 1286009099
>      tx_packets: 1287853570
>      tx_queue_0_packets: 162469405
>      tx_queue_1_packets: 162452446
>      tx_queue_2_packets: 162481160
>      tx_queue_3_packets: 162441839
>      tx_queue_4_packets: 162484930
>      tx_queue_5_packets: 162478402
>      tx_queue_6_packets: 162492530
>      tx_queue_7_packets: 162477162
>      rx_queue_0_packets: 162469449
>      rx_queue_1_packets: 162452440
>      rx_queue_2_packets: 162481186
>      rx_queue_3_packets: 162441885
>      rx_queue_4_packets: 162484949
>      rx_queue_5_packets: 162478427
> 
> Here's where it gets juicy.  If I reduce the rate at which I'm pushing
> traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
> as follows:
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts, I see that all of the tx and rx
> queues are handling a fairly similar number of interrupts (ballpark,
> 7-8k/sec on rx, 10k on tx).
> 
> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
> 
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts again, I see that the even-CPU (i.e.
> 0, 2, 4, and 6) RX queues are receiving relatively few interrupts
> (5-ish/sec (not 5k... just 5)) and the odd-CPU RX queues are
> receiving about 2-3k/sec.  What's extra strange is that the TX queues
> are still handling about 10k/sec each.
> 
> So, below some magic threshold (approx 2.3Mpps), the box is basically
> idle and happily routing all the packets (I can confirm that my
> network test device ixia is showing 0-loss).  Above the magic
> threshold, the box starts acting as described above and I'm unable to
> push it beyond that threshold.  While I understand that there are
> limits to how fast I can route packets (obviously), it seems very
> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
> "processes".
> 

The box is not idle; you hit a bug in the kernel that I already corrected this week :)

Search for "sched: account system time properly" to find the fix:

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 
 	if (user_tick)
 		account_user_time(p, one_jiffy, one_jiffy_scaled);
-	else if (p != rq->idle)
+	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
 				    one_jiffy_scaled);
 	else


> Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
> idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
> ksoftirqd processes at 100%.  Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
> 
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
> just routing packets back out the one NIC).
> 
> =/
> 



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  4:19           ` Andrew Dickinson
@ 2009-05-01  7:32             ` Eric Dumazet
  2009-05-01  7:47               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  7:32 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Andrew Dickinson a écrit :
> Adding a bit more info...
> 
> I should add, the other 4 ksoftirqd threads _are_ running, they're
> just not busy. (In case that wasn't clear...)
> 
> Also of note, I rebooted the box (after recompiling with NUMA off).
> This time when I push traffic through, only the even-ksoftirqd's were
> busy..  I then tweaked some of the ring settings via ethtool and
> suddenly the odd-ksoftirqd's became busy (and the even ones went
> idle).
> 
> Thoughts?  Suggestions?  driver issue?  I'm at 2.6.30-rc3.
> 
> (BTW, I'm under the assumption that since only 4 (of 8) ksoftirqd's
> are busy that I still have room to make this box go faster).

I don't see the point here. ksoftirqd runs only if too much
work has to be done in softirq context, which should be your case
since you want to saturate the cpus with network load.

You could try changing /proc/sys/net/core/netdev_budget if you really
want to trigger ksoftirqd sooner or later, but it won't fundamentally
change routing performance.

If you believe the box is losing frames because the cpus are saturated, please
post some oprofile results.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  7:31                   ` Eric Dumazet
@ 2009-05-01  7:34                     ` Andrew Dickinson
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Dickinson @ 2009-05-01  7:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jelaas, netdev

On Fri, May 1, 2009 at 12:31 AM, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Andrew Dickinson a écrit :
>> On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>> Andrew Dickinson a écrit :
>>>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>>>>> Andrew Dickinson a écrit :
>>>>>> OK... I've got some more data on it...
>>>>>>
>>>>>> I passed a small number of packets through the system and added a ton
>>>>>> of printks to it ;-P
>>>>>>
>>>>>> Here's the distribution of values as seen by
>>>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>>>      37 0
>>>>>>      31 1
>>>>>>      31 2
>>>>>>      39 3
>>>>>>      37 4
>>>>>>      31 5
>>>>>>      42 6
>>>>>>      39 7
>>>>>>
>>>>>> That's nice and even....  Here's what's getting returned from the
>>>>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>>>>      31 0
>>>>>>      81 1
>>>>>>      37 2
>>>>>>      70 3
>>>>>>      37 4
>>>>>>      31 6
>>>>>>
>>>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>>>> seem to have gotten munged onto 1 and 3.
>>>>>>
>>>>>> I think the voodoo lies within:
>>>>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>>>
>>>>>> David,  I made the change that you suggested:
>>>>>>         //hash = skb_get_rx_queue(skb);
>>>>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>
>>>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>>>
>>>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>>>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>>>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>>>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>>>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>>>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>>>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>>>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>>>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>>>>
>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24
>>>>>> ksoftirqd/1
>>>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98
>>>>>> ksoftirqd/3
>>>>>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52
>>>>>> ksoftirqd/5
>>>>>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56
>>>>>> ksoftirqd/7
>>>>>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>>>>>> <snip>
>>>>>>
>>>>>>
>>>>>> It appears that only the odd CPUs are actually handling the
>>>>>> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>>>>>>             CPU0       CPU1     CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>>>>   66:    2970565          0          0          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-0
>>>>>>   67:         28     821122          0          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-1
>>>>>>   68:         28          0    2943299          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-2
>>>>>>   69:         28          0          0     817776          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-3
>>>>>>   70:         28          0          0          0    2963924
>>>>>> 0          0          0   PCI-MSI-edge          eth2-rx-4
>>>>>>   71:         28          0          0          0          0
>>>>>> 821032          0          0   PCI-MSI-edge     eth2-rx-5
>>>>>>   72:         28          0          0          0          0
>>>>>> 0    2979987          0   PCI-MSI-edge          eth2-rx-6
>>>>>>   73:         28          0          0          0          0
>>>>>> 0          0     845422   PCI-MSI-edge          eth2-rx-7
>>>>>>   74:    4664732          0          0          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-0
>>>>>>   75:         34    4679312          0          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-1
>>>>>>   76:         28          0    4665014          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-2
>>>>>>   77:         28          0          0    4681531          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-3
>>>>>>   78:         28          0          0          0    4665793
>>>>>> 0          0          0   PCI-MSI-edge          eth2-tx-4
>>>>>>   79:         28          0          0          0          0
>>>>>> 4671596          0          0   PCI-MSI-edge    eth2-tx-5
>>>>>>   80:         28          0          0          0          0
>>>>>> 0    4665279          0   PCI-MSI-edge          eth2-tx-6
>>>>>>   81:         28          0          0          0          0
>>>>>> 0          0    4664504   PCI-MSI-edge          eth2-tx-7
>>>>>>   82:          2          0          0          0          0
>>>>>> 0          0          0   PCI-MSI-edge          eth2:lsc
>>>>>>
>>>>>>
>>>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>>>> ones to boot)?  The one commonality that's striking me is that all
>>>>>> the odd CPU #s are on the same physical processor:
>>>>>>
>>>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>>>> processor     : 0
>>>>>> physical id   : 0
>>>>>> processor     : 1
>>>>>> physical id   : 1
>>>>>> processor     : 2
>>>>>> physical id   : 0
>>>>>> processor     : 3
>>>>>> physical id   : 1
>>>>>> processor     : 4
>>>>>> physical id   : 0
>>>>>> processor     : 5
>>>>>> physical id   : 1
>>>>>> processor     : 6
>>>>>> physical id   : 0
>>>>>> processor     : 7
>>>>>> physical id   : 1
>>>>>>
>>>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>>>> something there?  Any other thoughts on where I should look?
>>>>>>
>>>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>>>> kernel?  As you can see, I'm generating quite a few interrupts.
>>>>>>
>>>>>> -A
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>>>
>>>>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>>>> before I start making claims. ;-P
>>>>>>> That's one possibility.
>>>>>>>
>>>>>>> Another is that the hashing isn't working out.  One way to
>>>>>>> play with that is to simply replace the:
>>>>>>>
>>>>>>>                hash = skb_get_rx_queue(skb);
>>>>>>>
>>>>>>> in skb_tx_hash() with something like:
>>>>>>>
>>>>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>>
>>>>>>> and see if that improves the situation.
>>>>>>>
>>>>> Hi Andrew
>>>>>
>>>>> Please try the following patch (I don't have a multi-queue NIC, sorry).
>>>>>
>>>>> I will do the follow-up patch if this one corrects the distribution problem
>>>>> you noticed.
>>>>>
>>>>> Thanks very much for all your findings.
>>>>>
>>>>> [PATCH] net: skb_tx_hash() improvements
>>>>>
>>>>> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
>>>>> as the device driver told us exactly which queue was selected at RX time.
>>>>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>>>>
>>>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>>>> to avoid a divide here. But this computation should be done once,
>>>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>>>> field in struct net_device.
>>>>>
>>>>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>>>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index 308a7d0..e2e9e4a 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>>>  {
>>>>>        u32 hash;
>>>>>
>>>>> -       if (skb_rx_queue_recorded(skb)) {
>>>>> -               hash = skb_get_rx_queue(skb);
>>>>> -       } else if (skb->sk && skb->sk->sk_hash) {
>>>>> +       if (skb_rx_queue_recorded(skb))
>>>>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>> +
>>>>> +       if (skb->sk && skb->sk->sk_hash)
>>>>>                hash = skb->sk->sk_hash;
>>>>> -       } else
>>>>> +       else
>>>>>                hash = skb->protocol;
>>>>>
>>>>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>>>>
>>>>>
>>>> Eric,
>>>>
>>>> That's exactly what I did!  It solved the problem of hot-spots on some
>>>> interrupts.  However, I now have a new problem (which is documented in
>>>> my previous posts).  The short of it is that I'm only seeing 4 (out of
>>>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
>>>> busy 4 are always on one physical package (though not always the same
>>>> one; it'll change on reboot or when I change some parameters via
>>>> ethtool), but never both.  This, despite /proc/interrupts showing me
>>>> that all 8 interrupts are being hit evenly.  There are more details in
>>>> my last mail. ;-D
>>>>
>>> Well, I was reacting to your 'voodoo' comment about
>>>
>>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>
>>> This is not the problem. The problem comes from jhash(), which shuffles
>>> the input, while in your case we want to select the same output queue
>>> because of CPU affinities. No shuffle required.
>>
>> Agreed.  I don't want to jhash(), and I'm not.
>>
>>> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>>>          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
>>
>> That's a correct assumption. :D
>>
>>> Then /proc/interrupts shows your rx interrupts are not evenly distributed.
>>>
>>> Or ksoftirqd is triggered only on one physical CPU, while on the other
>>> CPU, softirqs are not run from ksoftirqd. It's only a matter of load.
>>
>> Hrmm... more fuel for the fire...
>>
>> The NIC seems to be doing a good job of hashing the incoming data and
>> the kernel is now finding the right TX queue:
>> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
>>      rx_packets: 1286009099
>>      tx_packets: 1287853570
>>      tx_queue_0_packets: 162469405
>>      tx_queue_1_packets: 162452446
>>      tx_queue_2_packets: 162481160
>>      tx_queue_3_packets: 162441839
>>      tx_queue_4_packets: 162484930
>>      tx_queue_5_packets: 162478402
>>      tx_queue_6_packets: 162492530
>>      tx_queue_7_packets: 162477162
>>      rx_queue_0_packets: 162469449
>>      rx_queue_1_packets: 162452440
>>      rx_queue_2_packets: 162481186
>>      rx_queue_3_packets: 162441885
>>      rx_queue_4_packets: 162484949
>>      rx_queue_5_packets: 162478427
>>
>> Here's where it gets juicy.  If I reduce the rate at which I'm pushing
>> traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
>> as follows:
>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>
>> And if I watch /proc/interrupts, I see that all of the tx and rx
>> queues are handling a fairly similar number of interrupts (ballpark,
>> 7-8k/sec on rx, 10k on tx).
>>
>> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
>>
>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
>> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>
>> And if I watch /proc/interrupts again, I see that the even-CPUs (i.e.
>> 0,2,4, and 6) RX queues are receiving relatively few interrupts
>> (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are
>> receiving about 2-3k/sec.  What's extra strange is that the TX queues
>> are still handling about 10k/sec each.
>>
>> So, below some magic threshold (approx 2.3Mpps), the box is basically
>> idle and happily routing all the packets (I can confirm that my
>> network test device, an Ixia, is showing 0-loss).  Above the magic
>> threshold, the box starts acting as described above and I'm unable to
>> push it beyond that threshold.  While I understand that there are
>> limits to how fast I can route packets (obviously), it seems very
>> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
>> "processes".
>>
>
> The box is not idle; you hit a bug in the kernel that I already corrected this week :)
>
> check for "sched: account system time properly" in google
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index b902e58..26efa47 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
>
>        if (user_tick)
>                account_user_time(p, one_jiffy, one_jiffy_scaled);
> -       else if (p != rq->idle)
> +       else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
>                account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
>                                    one_jiffy_scaled);
>        else
>

<whew>, I'm not crazy! ;-P

I'll apply this patch and let you know how that changes things.

-A


>> Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
>> idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
>> 2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
>> ksoftirqd processes at 100%.  Never during this did the odd-CPU
>> ksoftirqd processes show any utilization at all.
>>
>> These are 64-byte frames, so I shouldn't be hitting any bandwidth
>> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
>> just routing packets back out the one NIC).
>>
>> =/
>>
>
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  7:32             ` Eric Dumazet
@ 2009-05-01  7:47               ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  7:47 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Eric Dumazet a écrit :
> Andrew Dickinson a écrit :
>> Adding a bit more info...
>>
>> I should add, the other 4 ksoftirqd threads _are_ running, they're
>> just not busy. (In case that wasn't clear...)
>>
>> Also of note, I rebooted the box (after recompiling with NUMA off).
>> This time when I push traffic through, only the even-ksoftirqd's were
>> busy..  I then tweaked some of the ring settings via ethtool and
>> suddenly the odd-ksoftirqd's became busy (and the even ones went
>> idle).
>>
>> Thoughts?  Suggestions?  driver issue?  I'm at 2.6.30-rc3.
>>
>> (BTW, I'm under the assumption that since only 4 (of 8) ksoftirqd's
>> are busy that I still have room to make this box go faster).
> 
> I don't see the point here. ksoftirqd runs only if too much
> work has to be done in softirq context, which should be your case
> since you want to saturate the CPUs with network load.
> 
> You could try to change /proc/sys/net/core/netdev_budget if you really
> want to trigger ksoftirqd sooner or later, but it won't fundamentally
> change routing performance.
> 
> If you believe the box is losing frames because the CPUs are saturated, please
> post some oprofile results.

My random feeling is you might have a dst_release() contention, but my
feeling might be wrong, I don't know what kind of network load you really
use...



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] net: skb_tx_hash() improvements
  2009-05-01  6:14           ` Eric Dumazet
  2009-05-01  6:19             ` Andrew Dickinson
@ 2009-05-01  8:29             ` Eric Dumazet
  2009-05-01  8:52               ` Eric Dumazet
  2009-05-01 16:08             ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller
  2 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  8:29 UTC (permalink / raw)
  To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

David, here is the followup I promised

Thanks

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
as the device driver told us exactly which queue was selected at RX time.
jhash makes a statistical shuffle, but this won't work with only 8 different inputs.

We also need to implement a true reciprocal division, to not disturb
symmetric setups (when number of tx queues matches number of rx queues)
and cpu affinities.

This patch introduces a new helper, dev_real_num_tx_queues_set()
to set both real_num_tx_queues and its reciprocal value,
and makes all drivers use this helper.

Many thanks to Andrew Dickinson for letting us see the light here :)

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/bnx2.c              |    2 +-
 drivers/net/bnx2x_main.c        |    2 +-
 drivers/net/cxgb3/cxgb3_main.c  |    2 +-
 drivers/net/igb/igb_main.c      |    2 +-
 drivers/net/ixgbe/ixgbe_main.c  |    2 +-
 drivers/net/mv643xx_eth.c       |    2 +-
 drivers/net/myri10ge/myri10ge.c |    4 ++--
 drivers/net/niu.c               |    2 +-
 drivers/net/vxge/vxge-main.c    |    2 +-
 include/linux/netdevice.h       |    2 ++
 net/core/dev.c                  |   26 ++++++++++++++++++--------
 11 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index d478391..1f674c1 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -5951,7 +5951,7 @@ bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
 	}
 
 	bp->num_tx_rings = rounddown_pow_of_two(bp->irq_nvecs);
-	bp->dev->real_num_tx_queues = bp->num_tx_rings;
+	dev_real_num_tx_queues_set(bp->dev, bp->num_tx_rings);
 
 	bp->num_rx_rings = bp->irq_nvecs;
 }
diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index ad5ef25..d5c641b 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -6800,7 +6800,7 @@ static void bnx2x_set_int_mode(struct bnx2x *bp)
 		}
 		break;
 	}
-	bp->dev->real_num_tx_queues = bp->num_tx_queues;
+	dev_real_num_tx_queues_set(bp->dev, bp->num_tx_queues);
 }
 
 static void bnx2x_set_rx_mode(struct net_device *dev);
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 7ea4841..a84abf3 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -1220,7 +1220,7 @@ static int cxgb_open(struct net_device *dev)
 			       "Could not initialize offload capabilities\n");
 	}
 
-	dev->real_num_tx_queues = pi->nqsets;
+	dev_real_num_tx_queues_set(dev, pi->nqsets);
 	link_start(dev);
 	t3_port_intr_enable(adapter, pi->port_id);
 	netif_tx_start_all_queues(dev);
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 08c8014..48c530d 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -691,7 +691,7 @@ msi_only:
 		adapter->flags |= IGB_FLAG_HAS_MSI;
 out:
 	/* Notify the stack of the (possibly) reduced Tx Queue count. */
-	adapter->netdev->real_num_tx_queues = adapter->num_tx_queues;
+	dev_real_num_tx_queues_set(adapter->netdev, adapter->num_tx_queues);
 	return;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 07e778d..4b4369b 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2737,7 +2737,7 @@ static void ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 
 done:
 	/* Notify the stack of the (possibly) reduced Tx Queue count. */
-	adapter->netdev->real_num_tx_queues = adapter->num_tx_queues;
+	dev_real_num_tx_queues_set(adapter->netdev, adapter->num_tx_queues);
 }
 
 static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter,
diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index b3185bf..cb6d859 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -2904,7 +2904,7 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 	mp->dev = dev;
 
 	set_params(mp, pd);
-	dev->real_num_tx_queues = mp->txq_count;
+	dev_real_num_tx_queues_set(dev, mp->txq_count);
 
 	if (pd->phy_addr != MV643XX_ETH_PHY_NONE)
 		mp->phy = phy_scan(mp, pd->phy_addr);
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index f2c4a66..bfb6a11 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -968,7 +968,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
 		 * RX queues, so if we get an error, first retry using a
 		 * single TX queue before giving up */
 		if (status != 0 && mgp->dev->real_num_tx_queues > 1) {
-			mgp->dev->real_num_tx_queues = 1;
+			dev_real_num_tx_queues_set(mgp->dev, 1);
 			cmd.data0 = mgp->num_slices;
 			cmd.data1 = MXGEFW_SLICE_INTR_MODE_ONE_PER_SLICE;
 			status = myri10ge_send_cmd(mgp,
@@ -3862,7 +3862,7 @@ static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		dev_err(&pdev->dev, "failed to alloc slice state\n");
 		goto abort_with_firmware;
 	}
-	netdev->real_num_tx_queues = mgp->num_slices;
+	dev_real_num_tx_queues_set(netdev, mgp->num_slices);
 	status = myri10ge_reset(mgp);
 	if (status != 0) {
 		dev_err(&pdev->dev, "failed reset\n");
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
index 2b17453..a6eac3b 100644
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -4501,7 +4501,7 @@ static int niu_alloc_channels(struct niu *np)
 	np->num_rx_rings = parent->rxchan_per_port[port];
 	np->num_tx_rings = parent->txchan_per_port[port];
 
-	np->dev->real_num_tx_queues = np->num_tx_rings;
+	dev_real_num_tx_queues_set(np->dev, np->num_tx_rings);
 
 	np->rx_rings = kzalloc(np->num_rx_rings * sizeof(struct rx_ring_info),
 			       GFP_KERNEL);
diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c
index b7f08f3..15602ab 100644
--- a/drivers/net/vxge/vxge-main.c
+++ b/drivers/net/vxge/vxge-main.c
@@ -3331,7 +3331,7 @@ int __devinit vxge_device_register(struct __vxge_hw_device *hldev,
 		ndev->features |= NETIF_F_GRO;
 
 	if (vdev->config.tx_steering_type == TX_MULTIQ_STEERING)
-		ndev->real_num_tx_queues = no_of_vpath;
+		dev_real_num_tx_queues_set(ndev, no_of_vpath);
 
 #ifdef NETIF_F_LLTX
 	ndev->features |= NETIF_F_LLTX;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a96a1a..f3939ec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -790,6 +790,7 @@ struct net_device
 
 	/* Number of TX queues currently active in device  */
 	unsigned int		real_num_tx_queues;
+	unsigned int		rec_real_num_tx_queues; /* reciprocal value */
 
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 	spinlock_t		tx_global_lock;
@@ -1782,6 +1783,7 @@ static inline void netif_addr_unlock_bh(struct net_device *dev)
 
 extern void		ether_setup(struct net_device *dev);
 
+extern void dev_real_num_tx_queues_set(struct net_device *dev, unsigned int count);
 /* Support for loadable net-drivers */
 extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 				       void (*setup)(struct net_device *),
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..dfb8f32 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -126,6 +126,7 @@
 #include <linux/in.h>
 #include <linux/jhash.h>
 #include <linux/random.h>
+#include <linux/reciprocal_div.h>
 
 #include "net-sysfs.h"
 
@@ -1735,19 +1736,28 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb)) {
+	if (skb_rx_queue_recorded(skb))
 		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
-		hash = skb->sk->sk_hash;
-	} else
-		hash = skb->protocol;
+	else {
+		if (skb->sk && skb->sk->sk_hash)
+			hash = skb->sk->sk_hash;
+		else
+			hash = skb->protocol;
 
-	hash = jhash_1word(hash, skb_tx_hashrnd);
+		hash = jhash_1word(hash, skb_tx_hashrnd);
+	}
+	return (u16) reciprocal_divide(hash, dev->rec_real_num_tx_queues);
 
-	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }
 EXPORT_SYMBOL(skb_tx_hash);
 
+void dev_real_num_tx_queues_set(struct net_device *dev, unsigned int count)
+{
+	dev->real_num_tx_queues = count;
+	dev->rec_real_num_tx_queues = reciprocal_value(count);
+}
+EXPORT_SYMBOL(dev_real_num_tx_queues_set);
+
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
@@ -4781,7 +4791,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 
 	dev->_tx = tx;
 	dev->num_tx_queues = queue_count;
-	dev->real_num_tx_queues = queue_count;
+	dev_real_num_tx_queues_set(dev, queue_count);
 
 	dev->gso_max_size = GSO_MAX_SIZE;
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: skb_tx_hash() improvements
  2009-05-01  8:29             ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet
@ 2009-05-01  8:52               ` Eric Dumazet
  2009-05-01  9:29                 ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  8:52 UTC (permalink / raw)
  To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

Eric Dumazet a écrit :
> David, here is the followup I promised
> 
> Thanks
> 
> [PATCH] net: skb_tx_hash() improvements
> 
> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
> as the device driver told us exactly which queue was selected at RX time.
> jhash makes a statistical shuffle, but this won't work with only 8 different inputs.
> 
> We also need to implement a true reciprocal division, to not disturb
> symmetric setups (when number of tx queues matches number of rx queues)
> and cpu affinities.
> 
> This patch introduces a new helper, dev_real_num_tx_queues_set()
> to set both real_num_tx_queues and its reciprocal value,
> and makes all drivers use this helper.

Oh well, this was wrong; I took the divide result while we want a modulo!

Need to think a little bit more :)




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: skb_tx_hash() improvements
  2009-05-01  8:52               ` Eric Dumazet
@ 2009-05-01  9:29                 ` Eric Dumazet
  2009-05-01 16:17                   ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01  9:29 UTC (permalink / raw)
  To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> David, here is the followup I promised
>>
>> Thanks
>>
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
>> as the device driver told us exactly which queue was selected at RX time.
>> jhash makes a statistical shuffle, but this won't work with only 8 different inputs.
>>
>> We also need to implement a true reciprocal division, to not disturb
>> symmetric setups (when number of tx queues matches number of rx queues)
>> and cpu affinities.
>>
>> This patch introduces a new helper, dev_real_num_tx_queues_set()
>> to set both real_num_tx_queues and its reciprocal value,
>> and makes all drivers use this helper.
> 
> Oh well, this was wrong; I took the divide result while we want a modulo!
> 
> Need to think a little bit more :)
> 

So there's no need for a true reciprocal divide, just a refinement of the first patch.

(Avoiding the divide where possible.)

If the incoming device has 4 rx queues and the outgoing device has 8 tx queues,
only 4 of the tx queues are used. I wonder if we need some further improvement
here to better use all available tx queues? Probably not in generic code...

[PATCH] net: skb_tx_hash() improvement

When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
as the device driver told us exactly which queue was selected at RX time.
jhash makes a statistical shuffle, but this won't work with only 8 different inputs.

The same goes for the 'modulo' operation, which works only if the inputs are
sufficiently random (i.e. use all available 32 bits).

This patch avoids the jhash computation (which costs ~50 instructions), but might
still need a modulo operation, in case the number of tx queues is smaller
than the number of rx queues.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..b3acb51 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1737,9 +1737,19 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+		/*
+		 * Try to avoid an expensive divide, for symmetric setups :
+		 *   number of tx queues of output device ==
+		 *   number of rx queues of incoming device
+		 */
+		if (hash >= dev->real_num_tx_queues)
+			hash %= dev->real_num_tx_queues;
+		return hash;
+	}
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson
  2009-04-30  9:07 ` Jens Låås
@ 2009-05-01 10:20 ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2009-05-01 10:20 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: netdev


Interesting thread, Andrew.  I'm also doing some 10G routing performance 
testing, but using Sun Neptune (niu) and SMC's 10G XFP (sfc) NICs.

I'm using pktgen for testing, but it sounds interesting that you have 
Ixia test equipment, nice.

On Wed, 29 Apr 2009, Andrew Dickinson wrote:

> I'm trying to evaluate a new system for routing performance for some
> custom packet modification that we do.  To start, I'm trying to get a
> high-water mark of routing performance without our custom cruft in the
> middle.  The hardware setup is a dual-package Nehalem box (X5550,
> Hyper-Threading disabled) with a dual 10G intel card (pci-id:
> 8086:10fb).  Because this NIC is freakishly new, I'm running the
> latest torvalds kernel in order to get the ixgbe driver to identify it
> (<sigh>).

Is that the Intel 82599 10GbE chip?
Where did you get/buy that NIC?


> Interrupts...
> I've disabled irqbalance and I'm explicitly pinning interrupts, one
> per core, as follows:

I'm doing the same...
I find that keeping the RX and TX queues pinned to the same CPU is 
essential, together with a patch that controls the mapping between RX and TX 
queues.  But with Eric's patch it looks like I can drop my own patch :-)

If I don't do RX to TX mapping, then Oprofile shows that we use too much 
time freeing the skb's, naturally due to cache bounces.


> -bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk
> '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done;
> done

Keep up the good work!

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  6:14           ` Eric Dumazet
  2009-05-01  6:19             ` Andrew Dickinson
  2009-05-01  8:29             ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet
@ 2009-05-01 16:08             ` David Miller
  2009-05-01 16:48               ` Eric Dumazet
  2 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-01 16:08 UTC (permalink / raw)
  To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 08:14:03 +0200

> [PATCH] net: skb_tx_hash() improvements
> 
> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
> as the device driver told us exactly which queue was selected at RX time.
> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
> 
> Later improvements would be to compute reciprocal value of real_num_tx_queues
> to avoid a divide here. But this computation should be done once,
> when real_num_tx_queues is set. This needs a separate patch, and a new
> field in struct net_device.
> 
> Reported-by: Andrew Dickinson <andrew@whydna.net>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Applied, except that I changed the commit message header line to better
reflect that this is in fact a bug fix.

BTW, you don't need the reciprocal when num-tx-queues <= num-rx-queues
(you can just use the RX queue recording as the hash, straight), and
that's the kind of check I intended to add to net-2.6 had you not
beaten me to this patch.

Also, thanks for giving me absolutely no credit for this whole thing
in your commit message.  I know I do that to you all the time :-/ How
can you forget so quickly that I'm the one that even suggested the
exact code change for Andrew to test in the first place?


* Re: [PATCH] net: skb_tx_hash() improvements
  2009-05-01  9:29                 ` Eric Dumazet
@ 2009-05-01 16:17                   ` David Miller
  2009-05-03 21:44                     ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-01 16:17 UTC (permalink / raw)
  To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 11:29:54 +0200

> -	} else if (skb->sk && skb->sk->sk_hash) {
> +		/*
> +		 * Try to avoid an expensive divide, for symmetric setups :
> +		 *   number of tx queues of output device ==
> +		 *   number of rx queues of incoming device
> +		 */
> +		if (hash >= dev->real_num_tx_queues)
> +			hash %= dev->real_num_tx_queues;
> +		return hash;
> +	}

Subtraction in a while() loop is almost certainly a lot
faster.


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01 16:08             ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller
@ 2009-05-01 16:48               ` Eric Dumazet
  2009-05-01 17:22                 ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, jelaas, netdev

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 01 May 2009 08:14:03 +0200
> 
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use the jhash distribution,
>> as the device driver told us exactly which queue was selected at RX time.
>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>
>> A later improvement would be to compute the reciprocal value of real_num_tx_queues
>> to avoid a divide here. But this computation should be done once,
>> when real_num_tx_queues is set. This needs a separate patch, and a new
>> field in struct net_device.
>>
>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> Applied, except that I changed the commit message header line to better
> reflect that this is in fact a bug fix.
> 
> BTW, you don't need the reciprocal when num-tx-queues <= num-rx-queues
> (you can just use the RX queue recording as the hash, straight), and
> that's the kind of check I intended to add to net-2.6 had you not
> beaten me to this patch.
> 
> Also, thanks for giving me absolutely no credit for this whole thing
> in your commit message.  I know I do that to you all the time :-/ How
> can you forget so quickly that I'm the one that even suggested the
> exact code change for Andrew to test in the first place?

Hoho, your Honor, I am totally guilty and sorry, sometimes I think I am
David Miller, silly me ! :)

I am not fighting for credit or whatever, certainly not with you.




* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01 16:48               ` Eric Dumazet
@ 2009-05-01 17:22                 ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2009-05-01 17:22 UTC (permalink / raw)
  To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 18:48:30 +0200

> Hoho, your Honor, I am totally guilty and sorry, sometimes I think I am
> David Miller, silly me ! :)
> 
> I am not fighting for credit or whatever, certainly not with you.

Great, just making sure it wasn't intentional :-)


* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  7:23                 ` Andrew Dickinson
  2009-05-01  7:31                   ` Eric Dumazet
@ 2009-05-01 21:37                   ` Brandeburg, Jesse
  1 sibling, 0 replies; 28+ messages in thread
From: Brandeburg, Jesse @ 2009-05-01 21:37 UTC (permalink / raw)
  To: Andrew Dickinson; +Cc: Eric Dumazet, David Miller, jelaas, netdev

I'm going to try to clarify just a few minor things in the hope of helping 
explain why things look the way they do from the ixgbe perspective.

On Fri, 1 May 2009, Andrew Dickinson wrote:
> >> That's exactly what I did!  It solved the problem of hot-spots on some
> >> interrupts.  However, I now have a new problem (which is documented in
> >> my previous posts).  The short of it is that I'm only seeing 4 (out of
> >> 8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
> >> busy 4 are always on one physical package (but not always the same
> >> package (it'll change on reboot or when I change some parameters via
> >> ethtool), but never both.  This, despite /proc/interrupts showing me
> >> that all 8 interrupts are being hit evenly.  There's more details in
> >> my last mail. ;-D
> >>
> >
> > Well, I was reacting to your 'voodoo' comment about
> >
> > return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> >
> > Since this is not the problem. The problem is coming from jhash(), which shuffles
> > the input, while in your case we want to select the same output queue
> > because of cpu affinities. No shuffle required.
> 
> Agreed.  I don't want to jhash(), and I'm not.
> 
> > (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
> >          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
> 
> That's a correct assumption. :D
> 
> > Then /proc/interrupts show your rx interrupts are not evenly distributed.
> >
> > Or that ksoftirqd is triggered only on one physical cpu, while on the other
> > cpu, softirqds are not run from ksoftirqd.  It's only a matter of load.
> 
> Hrmm... more fuel for the fire...
> 
> The NIC seems to be doing a good job of hashing the incoming data and
> the kernel is now finding the right TX queue:
> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
>      rx_packets: 1286009099
>      tx_packets: 1287853570
>      tx_queue_0_packets: 162469405
>      tx_queue_1_packets: 162452446
>      tx_queue_2_packets: 162481160
>      tx_queue_3_packets: 162441839
>      tx_queue_4_packets: 162484930
>      tx_queue_5_packets: 162478402
>      tx_queue_6_packets: 162492530
>      tx_queue_7_packets: 162477162
>      rx_queue_0_packets: 162469449
>      rx_queue_1_packets: 162452440
>      rx_queue_2_packets: 162481186
>      rx_queue_3_packets: 162441885
>      rx_queue_4_packets: 162484949
>      rx_queue_5_packets: 162478427
> 
> Here's where it gets juicy.  If I reduce the rate at which I'm pushing
> traffic to a 0-loss level (in this case about 2.2Mpps), then top looks
> as follow:
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts, I see that all of the tx and rx
> queues are handling a fairly similar number of interrupts (ballpark,
> 7-8k/sec on rx, 10k on tx).
> 
> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
> 
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  1.9%id,  0.0%wa,  5.5%hi, 92.5%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  2.3%id,  0.0%wa,  4.9%hi, 92.9%si,  0.0%st
> Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,  5.2%id,  0.0%wa,  5.2%hi, 89.6%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.3%hi,  1.9%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  4.9%hi, 94.8%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> 
> And if I watch /proc/interrupts again, I see that the even-CPUs (i.e.
> 0,2,4, and 6) RX queues are receiving relatively few interrupts
> (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are
> receiving about 2-3k/sec.  What's extra strange is that the TX queues
> are still handling about 10k/sec each.

The rx interrupts start polling (100% of the time).
The tx queues keep doing 10K per second because tx queues don't run in NAPI
mode for MSI-X vectors.  They do try to limit the amount of work done at
once so as not to hog a cpu.

> So, below some magic threshold (approx 2.3Mpps), the box is basically
> idle and happily routing all the packets (I can confirm that my
> network test device ixia is showing 0-loss).  Above the magic
> threshold, the box starts acting as described above and I'm unable to
> push it beyond that threshold.  While I understand that there are
> limits to how fast I can route packets (obviously), it seems very
> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
> "processes".
> 
> Here's how fragile this "magic threshold" is...  2.292 Mpps, box looks
> idle, 0 loss.   2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%.  2.323 Mpps, even-CPU
> ksoftirqd proccesses at 100%.  Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
> 
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
> just routing packets back out the one NIC).

Do you have all six memory channels populated?

You're probably just hitting the limits of the OS combined with the 
hardware.  You could try reducing your rx/tx queue count (you have to change 
the code, 'num_rx_queues =') - hope we get ethtool to do that someday.

You could then assign each rx queue to one core and its tx queue to another 
core that shares a cache with it.

On a Nehalem, the kernel in NUMA mode (is your BIOS in NUMA mode?) may not 
be balancing the memory utilization evenly between channels.  Are you 
using SLUB or SLQB?

Changing netdev_alloc_skb to __alloc_skb (be sure to specify node=-1) and 
getting rid of the skb_reserve(NET_IP_ALIGN) and skb_reserve(16)
might help align rx packets for dma.

hope this helps,
  Jesse


* Re: [PATCH] net: skb_tx_hash() improvements
  2009-05-01 16:17                   ` David Miller
@ 2009-05-03 21:44                     ` David Miller
  2009-05-04  6:12                       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-03 21:44 UTC (permalink / raw)
  To: dada1; +Cc: andrew, jelaas, netdev

From: David Miller <davem@davemloft.net>
Date: Fri, 01 May 2009 09:17:47 -0700 (PDT)

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 01 May 2009 11:29:54 +0200
> 
>> -	} else if (skb->sk && skb->sk->sk_hash) {
>> +		/*
>> +		 * Try to avoid an expensive divide, for symmetric setups :
>> +		 *   number of tx queues of output device ==
>> +		 *   number of rx queues of incoming device
>> +		 */
>> +		if (hash >= dev->real_num_tx_queues)
>> +			hash %= dev->real_num_tx_queues;
>> +		return hash;
>> +	}
> 
> Subtraction in a while() loop is almost certainly a lot
> faster.

To move forward on this, I've commited the following to
net-next-2.6, thanks!

net: Avoid modulus in skb_tx_hash() for forwarding case.

Based almost entirely upon a patch by Eric Dumazet.

The common case is to have num-tx-queues <= num_rx_queues
and even if num_tx_queues is larger it will not be significantly
larger.

Therefore, a subtraction loop is always going to be faster than
modulus.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8144295..3c8073f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,8 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb))
-		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+	if (skb_rx_queue_recorded(skb)) {
+		hash = skb_get_rx_queue(skb);
+		while (unlikely (hash >= dev->real_num_tx_queues))
+			hash -= dev->real_num_tx_queues;
+		return hash;
+	}
 
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-- 
1.6.2.4



* Re: [PATCH] net: skb_tx_hash() improvements
  2009-05-03 21:44                     ` David Miller
@ 2009-05-04  6:12                       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-05-04  6:12 UTC (permalink / raw)
  To: David Miller; +Cc: andrew, jelaas, netdev

David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Fri, 01 May 2009 09:17:47 -0700 (PDT)
> 
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 01 May 2009 11:29:54 +0200
>>
>>> -	} else if (skb->sk && skb->sk->sk_hash) {
>>> +		/*
>>> +		 * Try to avoid an expensive divide, for symmetric setups :
>>> +		 *   number of tx queues of output device ==
>>> +		 *   number of rx queues of incoming device
>>> +		 */
>>> +		if (hash >= dev->real_num_tx_queues)
>>> +			hash %= dev->real_num_tx_queues;
>>> +		return hash;
>>> +	}
>> Subtraction in a while() loop is almost certainly a lot
>> faster.
> 
> To move forward on this, I've commited the following to
> net-next-2.6, thanks!
> 
> net: Avoid modulus in skb_tx_hash() for forwarding case.
> 
> Based almost entirely upon a patch by Eric Dumazet.
> 
> The common case is to have num-tx-queues <= num_rx_queues
> and even if num_tx_queues is larger it will not be significantly
> larger.
> 
> Therefore, a subtraction loop is always going to be faster than
> modulus.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  net/core/dev.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8144295..3c8073f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1735,8 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>  {
>  	u32 hash;
>  
> -	if (skb_rx_queue_recorded(skb))
> -		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +	if (skb_rx_queue_recorded(skb)) {
> +		hash = skb_get_rx_queue(skb);
> +		while (unlikely (hash >= dev->real_num_tx_queues))
> +			hash -= dev->real_num_tx_queues;
> +		return hash;
> +	}
>  
>  	if (skb->sk && skb->sk->sk_hash)
>  		hash = skb->sk->sk_hash;

Yes, I checked that the compiler did not use a divide instruction here
(I remember it did on a similar loop in the kernel, related to time).

Thank you



end of thread, other threads:[~2009-05-04  6:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson
2009-04-30  9:07 ` Jens Låås
2009-04-30  9:24   ` David Miller
2009-04-30 10:51     ` Jens Låås
2009-04-30 11:05       ` David Miller
2009-04-30 14:04     ` Andrew Dickinson
2009-04-30 14:08       ` David Miller
2009-04-30 23:53         ` Andrew Dickinson
2009-05-01  4:19           ` Andrew Dickinson
2009-05-01  7:32             ` Eric Dumazet
2009-05-01  7:47               ` Eric Dumazet
2009-05-01  6:14           ` Eric Dumazet
2009-05-01  6:19             ` Andrew Dickinson
2009-05-01  6:40               ` Eric Dumazet
2009-05-01  7:23                 ` Andrew Dickinson
2009-05-01  7:31                   ` Eric Dumazet
2009-05-01  7:34                     ` Andrew Dickinson
2009-05-01 21:37                   ` Brandeburg, Jesse
2009-05-01  8:29             ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet
2009-05-01  8:52               ` Eric Dumazet
2009-05-01  9:29                 ` Eric Dumazet
2009-05-01 16:17                   ` David Miller
2009-05-03 21:44                     ` David Miller
2009-05-04  6:12                       ` Eric Dumazet
2009-05-01 16:08             ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller
2009-05-01 16:48               ` Eric Dumazet
2009-05-01 17:22                 ` David Miller
2009-05-01 10:20 ` Jesper Dangaard Brouer
