* tx queue hashing hot-spots and poor performance (multiq, ixgbe)
@ 2009-04-29 23:00 Andrew Dickinson
  2009-04-30  9:07 ` Jens Låås
  2009-05-01 10:20 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 28+ messages in thread

From: Andrew Dickinson @ 2009-04-29 23:00 UTC (permalink / raw)
To: netdev

Howdy list,

Background...
I'm trying to evaluate a new system for routing performance for some custom packet modification that we do. To start, I'm trying to get a high-water mark of routing performance without our custom cruft in the middle. The hardware setup is a dual-package Nehalem box (X5550, Hyper-Threading disabled) with a dual 10G intel card (pci-id: 8086:10fb). Because this NIC is freakishly new, I'm running the latest torvalds kernel in order to get the ixgbe driver to identify it (<sigh>). With HT off, I've got 8 cores in the system. For the sake of reducing the number of variables that I'm dealing with, I'm only using one of the NICs to start with and simply routing packets back out the single 10G NIC.

Interrupts...
I've disabled irqbalance and I'm explicitly pinning interrupts, one per core, as follows:

-bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done; done

-bash-3.2# for i in `seq 57 72`; do cat /proc/irq/$i/smp_affinity; done
0001
0002
0004
0008
0010
0020
0040
0080
0001
0002
0004
0008
0010
0020
0040
0080

-bash-3.2# cat /proc/interrupts | grep eth2
 57:   77941       0       0       0       0       0       0       0   PCI-MSI-edge   eth2-rx-0
 58:      92   59682       0       0       0       0       0       0   PCI-MSI-edge   eth2-rx-1
 59:      92       0   21716       0       0       0       0       0   PCI-MSI-edge   eth2-rx-2
 60:      92       0       0   14356       0       0       0       0   PCI-MSI-edge   eth2-rx-3
 61:      92       0       0       0   91483       0       0       0   PCI-MSI-edge   eth2-rx-4
 62:      92       0       0       0       0   19495       0       0   PCI-MSI-edge   eth2-rx-5
 63:      92       0       0       0       0       0      24       0   PCI-MSI-edge   eth2-rx-6
 64:      92       0       0       0       0       0       0   19605   PCI-MSI-edge   eth2-rx-7
 65:   94709       0       0       0       0       0       0       0   PCI-MSI-edge   eth2-tx-0
 66:      92      24       0       0       0       0       0       0   PCI-MSI-edge   eth2-tx-1
 67:      98       0      24       0       0       0       0       0   PCI-MSI-edge   eth2-tx-2
 68:      92       0       0  100208       0       0       0       0   PCI-MSI-edge   eth2-tx-3
 69:      92       0       0       0      24       0       0       0   PCI-MSI-edge   eth2-tx-4
 70:      92       0       0       0       0      24       0       0   PCI-MSI-edge   eth2-tx-5
 71:      92       0       0       0       0       0  144566       0   PCI-MSI-edge   eth2-tx-6
 72:      92       0       0       0       0       0       0      24   PCI-MSI-edge   eth2-tx-7
 73:       2       0       0       0       0       0       0       0   PCI-MSI-edge   eth2:lsc

The output of /proc/interrupts is hinting at the problem that I'm having... The TX queues which are being chosen are only 0, 3, and 6. The flow of traffic that I'm generating is random source/dest pairs, each within a /24, so I don't think that I'm sending data that should be breaking the skb_tx_hash() routine.

Further, when I run top, I see that almost all of the interrupt processing is happening on a single CPU.
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 19.3%hi, 80.7%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st

This appears to be due to 'tx'-based activity... if I change my route table to blackhole the traffic, the CPUs are nearly idle.

My next thought was to try multiqueue...

-bash-3.2# ./tc/tc qdisc add dev eth2 root handle 1: multiq
-bash-3.2# ./tc/tc qdisc show dev eth2
qdisc multiq 1: root refcnt 128 bands 8/128

With multiq scheduling, the CPU load evens out a bunch, but I still have a soft-interrupt hot-spot (see CPU3 here; also note that only CPUs 0, 3, and 6 are handling hardware interrupts):

Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 69.9%id, 0.0%wa, 0.3%hi, 29.8%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 64.8%id, 0.0%wa, 0.0%hi, 35.2%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 76.5%id, 0.0%wa, 0.0%hi, 23.5%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,  4.8%id, 0.0%wa, 2.6%hi, 92.6%si, 0.0%st
Cpu4 : 0.3%us, 0.3%sy, 0.0%ni, 76.2%id, 0.3%wa, 0.0%hi, 22.8%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 49.4%id, 0.0%wa, 0.0%hi, 50.6%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 56.8%id, 0.0%wa, 1.0%hi, 42.3%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 51.6%id, 0.0%wa, 0.0%hi, 48.4%si, 0.0%st

However, what I see with multiqueue enabled is that I'm dropping 80% of my traffic (which appears to be due to a large number of 'rx_missed_errors').

Any thoughts on what I'm doing wrong or where I should continue to look?

-Andrew

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson
@ 2009-04-30  9:07 ` Jens Låås
  2009-04-30  9:24   ` David Miller
  2009-05-01 10:20 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 28+ messages in thread

From: Jens Låås @ 2009-04-30 9:07 UTC (permalink / raw)
To: Andrew Dickinson, netdev

2009/4/30, Andrew Dickinson <andrew@whydna.net>:
> Howdy list,
>
> Background...
> I'm trying to evaluate a new system for routing performance for some
> custom packet modification that we do. To start, I'm trying to get a
> high-water mark of routing performance without our custom cruft in the
> middle. The hardware setup is a dual-package Nehalem box (X5550,
> Hyper-Threading disabled) with a dual 10G intel card (pci-id:
> 8086:10fb). Because this NIC is freakishly new, I'm running the
> latest torvalds kernel in order to get the ixgbe driver to identify it
> (<sigh>). With HT off, I've got 8 cores in the system. For the sake
> of reducing the number of variables that I'm dealing with, I'm only
> using one of the NICs to start with and simply routing packets back
> out the single 10G NIC.

OK. We have done quite a bit of 10G testing. I'll comment based on our experiences.

> Interrupts...
> I've disabled irqbalance and I'm explicitly pinning interrupts, one
> per core, as follows:

Setting affinity is a must, yes, for high performance. It is also important that TX affinity matches RX affinity, so that TX completion runs on the same CPU as RX.
> > -bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk > '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done; > done > > -bash-3.2# for i in `seq 57 72`; do cat /proc/irq/$i/smp_affinity; done > 0001 > 0002 > 0004 > 0008 > 0010 > 0020 > 0040 > 0080 > 0001 > 0002 > 0004 > 0008 > 0010 > 0020 > 0040 > 0080 > > -bash-3.2# cat /proc/interrupts | grep eth2 > 57: 77941 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-0 > 58: 92 59682 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-1 > 59: 92 0 21716 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-2 > 60: 92 0 0 14356 0 > 0 0 0 PCI-MSI-edge eth2-rx-3 > 61: 92 0 0 0 91483 > 0 0 0 PCI-MSI-edge eth2-rx-4 > 62: 92 0 0 0 0 > 19495 0 0 PCI-MSI-edge eth2-rx-5 > 63: 92 0 0 0 0 > 0 24 0 PCI-MSI-edge eth2-rx-6 > 64: 92 0 0 0 0 > 0 0 19605 PCI-MSI-edge eth2-rx-7 > 65: 94709 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-0 > 66: 92 24 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-1 > 67: 98 0 24 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-2 > 68: 92 0 0 100208 0 > 0 0 0 PCI-MSI-edge eth2-tx-3 > 69: 92 0 0 0 24 > 0 0 0 PCI-MSI-edge eth2-tx-4 > 70: 92 0 0 0 0 > 24 0 0 PCI-MSI-edge eth2-tx-5 > 71: 92 0 0 0 0 > 0 144566 0 PCI-MSI-edge eth2-tx-6 > 72: 92 0 0 0 0 > 0 0 24 PCI-MSI-edge eth2-tx-7 > 73: 2 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2:lsc > > The output of /proc/interrupts is hinting at the problem that I'm > having... The TX queues which are being chosen are only 0, 3, and 6. > The flow of traffic that I'm generating is random source/dest pairs, > each within a /24, so I don't think that I'm sending data that should > be breaking the skb_tx_hash() routine. RX-side looks good. TX-side looks like what we also got with vanilla linux. What we do is patch all drivers with a custom select_queue function that selects the same outgoing queue as the incoming queue. With a one to one mapping of queues to CPUs you can also use the processor id. This way we get performance. Another way we are looking at is to use an abstraction to help with the queue mapping. (We call it 'flowtrunk'). 
This is then configurable from userspace. > > Further, when I run top, I see that almost all of the interrupt > processing is happening on a single cpu. > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.3%hi, 0.7%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 19.3%hi, 80.7%si, 0.0%st > Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > > This appears to be due to 'tx'-based activity... if I change my route > table to blackhole the traffic, the CPUs are nearly idle. > > My next thought was to try multiqueue... > -bash-3.2# ./tc/tc qdisc add dev eth2 root handle 1: multiq > -bash-3.2# ./tc/tc qdisc show dev eth2 > qdisc multiq 1: root refcnt 128 bands 8/128 > > With multiq scheduling, the CPU load evens out a bunch, but I still > have a soft-interrupt hot-spot (see CPU3 here. 
Also note that only
> CPUs 0, 3, and 6 are handling hardware interrupts.):
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 69.9%id, 0.0%wa, 0.3%hi, 29.8%si, 0.0%st
> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 64.8%id, 0.0%wa, 0.0%hi, 35.2%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 76.5%id, 0.0%wa, 0.0%hi, 23.5%si, 0.0%st
> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,  4.8%id, 0.0%wa, 2.6%hi, 92.6%si, 0.0%st
> Cpu4 : 0.3%us, 0.3%sy, 0.0%ni, 76.2%id, 0.3%wa, 0.0%hi, 22.8%si, 0.0%st
> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 49.4%id, 0.0%wa, 0.0%hi, 50.6%si, 0.0%st
> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 56.8%id, 0.0%wa, 1.0%hi, 42.3%si, 0.0%st
> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 51.6%id, 0.0%wa, 0.0%hi, 48.4%si, 0.0%st
>
> However, what I see with multiqueue enabled is that I'm dropping 80%
> of my traffic (which appears to be due to a large number of
> 'rx_missed_errors').
>
> Any thoughts on what I'm doing wrong or where I should continue to look?

Changing the qdisc won't help, since all qdiscs but pfifo_fast serialize all CPUs onto one qdisc. pfifo_fast creates a separate qdisc per tx_queue.

If you don't want to patch the kernel, you can try increasing the queue length of the pfifo_fast qdisc.

Cheers,
Jens

> -Andrew
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:07 ` Jens Låås
@ 2009-04-30  9:24   ` David Miller
  2009-04-30 10:51     ` Jens Låås
  2009-04-30 14:04     ` Andrew Dickinson
  0 siblings, 2 replies; 28+ messages in thread

From: David Miller @ 2009-04-30 9:24 UTC (permalink / raw)
To: jelaas; +Cc: andrew, netdev

From: Jens Låås <jelaas@gmail.com>
Date: Thu, 30 Apr 2009 11:07:35 +0200

> RX-side looks good. TX-side looks like what we also got with vanilla linux.
>
> What we do is patch all drivers with a custom select_queue function
> that selects the same outgoing queue as the incoming queue. With a one
> to one mapping of queues to CPUs you can also use the processor id.
>
> This way we get performance.

I don't understand why this can even be necessary.

With the current code, the RX queue of a packet becomes the hash for the TX queue. If all the TX activity is happening on one TX queue then there is a bug somewhere.

Either the receiving device isn't invoking skb_record_rx_queue() correctly, or there is some bug in how we compute the TX hash.

Everyone adds their own hacks, but that absolutely should not be necessary; the kernel is essentially doing what you are adding hacks for. The only possible problems are bugs in the code, and we should find those bugs instead of constantly talking about 'local select_queue hacks we add to our cool driver for performance' :-/

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:24 ` David Miller
@ 2009-04-30 10:51   ` Jens Låås
  2009-04-30 11:05     ` David Miller
  1 sibling, 1 reply; 28+ messages in thread

From: Jens Låås @ 2009-04-30 10:51 UTC (permalink / raw)
To: David Miller; +Cc: andrew, netdev

2009/4/30, David Miller <davem@davemloft.net>:
> From: Jens Låås <jelaas@gmail.com>
> Date: Thu, 30 Apr 2009 11:07:35 +0200
>
> > RX-side looks good. TX-side looks like what we also got with vanilla linux.
> >
> > What we do is patch all drivers with a custom select_queue function
> > that selects the same outgoing queue as the incoming queue. With a one
> > to one mapping of queues to CPUs you can also use the processor id.
> >
> > This way we get performance.
>
> I don't understand why this can even be necessary.
>
> With the current code, the RX queue of a packet becomes
> the hash for the TX queue.
>
> If all the TX activity is happening on one TX queue then
> there is a bug somewhere.

If I remember correctly, we got use of several TX queues, not just one. The hashed distribution missed a few of the TX queues, though, and it also looked like some RX queues got mapped on top of the same TX queue.

At the time we reasoned this behaviour was expected from the randomizing effect of the hash. We may certainly have misunderstood this, and a one-to-one mapping should be expected.

Hopefully the case where we have several devices and want TX completion to match the RX queue can also be solved. (The assumption that TX completion needs to run on the same CPU may also be proved wrong, but we haven't seen this in tests so far.)

The main problem, though, was that the mapping is randomized. We wanted to set smp_affinity correctly for TX to match RX. That was actually the main reason for our local hacks.

>
> Either the receiving device isn't invoking skb_record_rx_queue()
> correctly, or there is some bug in how we compute the TX hash.
>
> Everyone adds their own hacks, but that absolutely should not be
> necessary, the kernel is essentially doing what you are adding
> hacks for.
>
> The only possible problems are bugs in the code, and we should find
> those bugs instead of constantly talking about 'local select_queue
> hacks we add to our cool driver for performance' :-/

We certainly don't consider the hacks cool in any way. They were only for a specific purpose and a specific kernel version.

Cheers,
Jens

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 10:51 ` Jens Låås
@ 2009-04-30 11:05   ` David Miller
  0 siblings, 0 replies; 28+ messages in thread

From: David Miller @ 2009-04-30 11:05 UTC (permalink / raw)
To: jelaas; +Cc: andrew, netdev

From: Jens Låås <jelaas@gmail.com>
Date: Thu, 30 Apr 2009 12:51:16 +0200

> The main problem though was that the mapping is randomized. We wanted
> to set smp_affinity correctly for tx to match rx. That was actually
> the main reason for our local hacks.

It's NOT RANDOMIZED, READ THE CODE!

It takes the RX queue number and uses it to select the TX queue. That's anything but random!

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30  9:24 ` David Miller
  2009-04-30 10:51 ` Jens Låås
@ 2009-04-30 14:04 ` Andrew Dickinson
  2009-04-30 14:08   ` David Miller
  1 sibling, 1 reply; 28+ messages in thread

From: Andrew Dickinson @ 2009-04-30 14:04 UTC (permalink / raw)
To: David Miller; +Cc: jelaas, netdev

<snip>
> If all the TX activity is happening on one TX queue then
> there is a bug somewhere.
>
> Either the receiving device isn't invoking skb_record_rx_queue()
> correctly, or there is some bug in how we compute the TX hash.

I'll do some debugging around skb_tx_hash() and see if I can make sense of it. I'll let you know what I find. My hypothesis is that skb_record_rx_queue() isn't being called, but I should dig into it before I start making claims. ;-P

<snip>

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 14:04 ` Andrew Dickinson
@ 2009-04-30 14:08   ` David Miller
  2009-04-30 23:53     ` Andrew Dickinson
  0 siblings, 1 reply; 28+ messages in thread

From: David Miller @ 2009-04-30 14:08 UTC (permalink / raw)
To: andrew; +Cc: jelaas, netdev

From: Andrew Dickinson <andrew@whydna.net>
Date: Thu, 30 Apr 2009 07:04:33 -0700

> I'll do some debugging around skb_tx_hash() and see if I can make
> sense of it. I'll let you know what I find. My hypothesis is that
> skb_record_rx_queue() isn't being called, but I should dig into it
> before I start making claims. ;-P

That's one possibility.

Another is that the hashing isn't working out. One way to play with that is to simply replace the:

	hash = skb_get_rx_queue(skb);

in skb_tx_hash() with something like:

	return skb_get_rx_queue(skb) % dev->real_num_tx_queues;

and see if that improves the situation.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 14:08 ` David Miller
@ 2009-04-30 23:53   ` Andrew Dickinson
  2009-05-01  4:19     ` Andrew Dickinson
  2009-05-01  6:14     ` Eric Dumazet
  0 siblings, 2 replies; 28+ messages in thread

From: Andrew Dickinson @ 2009-04-30 23:53 UTC (permalink / raw)
To: David Miller; +Cc: jelaas, netdev

OK... I've got some more data on it...

I passed a small number of packets through the system and added a ton of printks to it ;-P

Here's the distribution of values as seen by skb_rx_queue_recorded()... count on the left, value on the right:
     37 0
     31 1
     31 2
     39 3
     37 4
     31 5
     42 6
     39 7

That's nice and even... Here's what's getting returned from the skb_tx_hash(). Again, count on the left, value on the right:
     31 0
     81 1
     37 2
     70 3
     37 4
     31 6

Note that we're entirely missing 5 and 7 and that those interrupts seem to have gotten munged onto 1 and 3.

I think the voodoo lies within:

	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);

David, I made the change that you suggested:

	//hash = skb_get_rx_queue(skb);
	return skb_get_rx_queue(skb) % dev->real_num_tx_queues;

And now, I see a nice even mixing of interrupts on the TX side (yay!).

However, my problem's not solved entirely...
here's what top is showing me:

top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,  2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,  5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st
Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
Swap:  2096472k total,        0k used,  2096472k free,   146364k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
    7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
   13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
   19 root      15  -5     0    0    0 R  97.8  0.0   5:34.52 ksoftirqd/5
   25 root      15  -5     0    0    0 R  94.5  0.0   5:13.56 ksoftirqd/7
 3905 root      20   0 12612 1084  820 R   0.3  0.0   0:00.14 top
<snip>

It appears that only the odd CPUs are actually handling the interrupts, which doesn't jive with what /proc/interrupts shows me:

        CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
 66:  2970565        0        0        0        0        0        0        0   PCI-MSI-edge   eth2-rx-0
 67:       28   821122        0        0        0        0        0        0   PCI-MSI-edge   eth2-rx-1
 68:       28        0  2943299        0        0        0        0        0   PCI-MSI-edge   eth2-rx-2
 69:       28        0        0   817776        0        0        0        0   PCI-MSI-edge   eth2-rx-3
 70:       28        0        0        0  2963924        0        0        0   PCI-MSI-edge   eth2-rx-4
 71:       28        0        0        0        0   821032        0        0   PCI-MSI-edge   eth2-rx-5
 72:       28        0        0        0        0        0  2979987        0   PCI-MSI-edge   eth2-rx-6
 73:       28        0        0        0        0        0        0   845422   PCI-MSI-edge   eth2-rx-7
 74:  4664732        0        0        0        0        0        0        0   PCI-MSI-edge   eth2-tx-0
 75:       34  4679312        0        0        0        0        0        0   PCI-MSI-edge   eth2-tx-1
 76:       28        0  4665014        0        0        0        0        0   PCI-MSI-edge   eth2-tx-2
 77:       28        0        0  4681531        0        0        0        0   PCI-MSI-edge   eth2-tx-3
 78:       28        0        0        0  4665793        0        0        0   PCI-MSI-edge   eth2-tx-4
 79:       28        0        0        0        0  4671596        0        0   PCI-MSI-edge   eth2-tx-5
 80:       28        0        0        0        0        0  4665279        0   PCI-MSI-edge   eth2-tx-6
 81:       28        0        0        0        0        0        0  4664504   PCI-MSI-edge   eth2-tx-7
 82:        2        0        0        0        0        0        0        0   PCI-MSI-edge   eth2:lsc

Why would ksoftirqd only run on half of the cores (and only the odd ones to boot)? The one commonality that's striking me is that all the odd CPU#'s are on the same physical processor:

-bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
processor	: 0
physical id	: 0
processor	: 1
physical id	: 1
processor	: 2
physical id	: 0
processor	: 3
physical id	: 1
processor	: 4
physical id	: 0
processor	: 5
physical id	: 1
processor	: 6
physical id	: 0
processor	: 7
physical id	: 1

I did compile the kernel with NUMA support... am I being bitten by something there? Other thoughts on where I should look?

Also... is there an incantation to get NAPI to work in the torvalds kernel? As you can see, I'm generating quite a few interrupts.

-A

On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
> From: Andrew Dickinson <andrew@whydna.net>
> Date: Thu, 30 Apr 2009 07:04:33 -0700
>
>> I'll do some debugging around skb_tx_hash() and see if I can make
>> sense of it. I'll let you know what I find. My hypothesis is that
>> skb_record_rx_queue() isn't being called, but I should dig into it
>> before I start making claims. ;-P
>
> That's one possibility.
>
> Another is that the hashing isn't working out. One way to
> play with that is to simply replace the:
>
>   hash = skb_get_rx_queue(skb);
>
> in skb_tx_hash() with something like:
>
>   return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>
> and see if that improves the situation.
>

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-04-30 23:53 ` Andrew Dickinson
@ 2009-05-01  4:19   ` Andrew Dickinson
  2009-05-01  7:32     ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread

From: Andrew Dickinson @ 2009-05-01 4:19 UTC (permalink / raw)
To: David Miller; +Cc: jelaas, netdev

Adding a bit more info...

I should add, the other 4 ksoftirqd tasklets _are_ running, they're just not busy. (In case that wasn't clear...)

Also of note, I rebooted the box (after recompiling with NUMA off). This time when I push traffic through, only the even-ksoftirqd's were busy... I then tweaked some of the ring settings via ethtool and suddenly the odd-ksoftirqd's became busy (and the even ones went idle).

Thoughts? Suggestions? Driver issue? I'm at 2.6.30-rc3.

(BTW, I'm under the assumption that since only 4 (of 8) ksoftirqd's are busy that I still have room to make this box go faster.)

-A

On Thu, Apr 30, 2009 at 4:53 PM, Andrew Dickinson <andrew@whydna.net> wrote:
> OK... I've got some more data on it...
>
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
>
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
> 37 0
> 31 1
> 31 2
> 39 3
> 37 4
> 31 5
> 42 6
> 39 7
>
> That's nice and even.... Here's what's getting returned from the
> skb_tx_hash(). Again, count on the left, value on the right:
> 31 0
> 81 1
> 37 2
> 70 3
> 37 4
> 31 6
>
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
>
> I think the voodoo lies within:
> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>
> David, I made the change that you suggested:
> //hash = skb_get_rx_queue(skb);
> return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>
> However, my problem's not solved entirely...
here's what top is showing me: > top - 23:37:49 up 9 min, 1 user, load average: 3.93, 2.68, 1.21 > Tasks: 119 total, 5 running, 114 sleeping, 0 stopped, 0 zombie > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st > Mem: 16403476k total, 335884k used, 16067592k free, 10108k buffers > Swap: 2096472k total, 0k used, 2096472k free, 146364k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 7 root 15 -5 0 0 0 R 100.2 0.0 5:35.24 > ksoftirqd/1 > 13 root 15 -5 0 0 0 R 100.2 0.0 5:36.98 > ksoftirqd/3 > 19 root 15 -5 0 0 0 R 97.8 0.0 5:34.52 > ksoftirqd/5 > 25 root 15 -5 0 0 0 R 94.5 0.0 5:13.56 > ksoftirqd/7 > 3905 root 20 0 12612 1084 820 R 0.3 0.0 0:00.14 top > <snip> > > > It appears that only the odd CPUs are actually handling the > interrupts, which doesn't jive with what /proc/interrupts shows me: > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 66: 2970565 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-0 > 67: 28 821122 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-1 > 68: 28 0 2943299 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-2 > 69: 28 0 0 817776 0 > 0 0 0 PCI-MSI-edge eth2-rx-3 > 70: 28 0 0 0 2963924 > 0 0 0 PCI-MSI-edge eth2-rx-4 > 71: 28 0 0 0 0 > 821032 0 0 PCI-MSI-edge eth2-rx-5 > 72: 28 0 0 0 0 > 0 2979987 0 PCI-MSI-edge eth2-rx-6 > 73: 28 0 0 0 0 > 0 0 845422 PCI-MSI-edge eth2-rx-7 > 74: 4664732 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-0 > 75: 34 4679312 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-1 > 76: 28 0 4665014 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-2 > 77: 28 0 0 
4681531 0 > 0 0 0 PCI-MSI-edge eth2-tx-3 > 78: 28 0 0 0 4665793 > 0 0 0 PCI-MSI-edge eth2-tx-4 > 79: 28 0 0 0 0 > 4671596 0 0 PCI-MSI-edge eth2-tx-5 > 80: 28 0 0 0 0 > 0 4665279 0 PCI-MSI-edge eth2-tx-6 > 81: 28 0 0 0 0 > 0 0 4664504 PCI-MSI-edge eth2-tx-7 > 82: 2 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2:lsc > > > Why would ksoftirqd only run on half of the cores (and only the odd > ones to boot)? The one commonality that's striking me is that that > all the odd CPU#'s are on the same physical processor: > > -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual > processor : 0 > physical id : 0 > processor : 1 > physical id : 1 > processor : 2 > physical id : 0 > processor : 3 > physical id : 1 > processor : 4 > physical id : 0 > processor : 5 > physical id : 1 > processor : 6 > physical id : 0 > processor : 7 > physical id : 1 > > I did compile the kernel with NUMA support... am I being bitten by > something there? Other thoughts on where I should look. > > Also... is there an incantation to get NAPI to work in the torvalds > kernel? As you can see, I'm generating quite a few interrrupts. > > -A > > > On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote: >> From: Andrew Dickinson <andrew@whydna.net> >> Date: Thu, 30 Apr 2009 07:04:33 -0700 >> >>> I'll do some debugging around skb_tx_hash() and see if I can make >>> sense of it. I'll let you know what I find. My hypothesis is that >>> skb_record_rx_queue() isn't being called, but I should dig into it >>> before I start making claims. ;-P >> >> That's one possibility. >> >> Another is that the hashing isn't working out. One way to >> play with that is to simply replace the: >> >> hash = skb_get_rx_queue(skb); >> >> in skb_tx_hash() with something like: >> >> return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >> >> and see if that improves the situation. >> > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  4:19 ` Andrew Dickinson
@ 2009-05-01  7:32   ` Eric Dumazet
  2009-05-01  7:47     ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread

From: Eric Dumazet @ 2009-05-01 7:32 UTC (permalink / raw)
To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Andrew Dickinson wrote:
> Adding a bit more info...
>
> I should add, the other 4 ksoftirqd tasklets _are_ running, they're
> just not busy. (In case that wasn't clear...)
>
> Also of note, I rebooted the box (after recompiling with NUMA off).
> This time when I push traffic through, only the even-ksoftirqd's were
> busy.. I then tweaked some of the ring settings via ethtool and
> suddenly the odd-ksoftirqd's became busy (and the even ones went
> idle).
>
> Thoughts? Suggestions? driver issue? I'm at 2.6.30-rc3.
>
> (BTW, I'm under the assumption that since only 4 (of 8) ksoftirqd's
> are busy that I still have room to make this box go faster).

I don't see the point here. ksoftirqd is running only if too much work has to be done in softirq context, which should be your case since you want to saturate CPUs with network load.

You could try to change /proc/sys/net/core/netdev_budget if you really want to trigger ksoftirqd sooner or later, but it won't fundamentally change routing performance.

If you believe the box is losing frames because the CPUs are saturated, please post some oprofile results.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
  2009-05-01  7:32 ` Eric Dumazet
@ 2009-05-01  7:47   ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread

From: Eric Dumazet @ 2009-05-01 7:47 UTC (permalink / raw)
To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev

Eric Dumazet wrote:
> Andrew Dickinson wrote:
>> Adding a bit more info...
>>
>> I should add, the other 4 ksoftirqd tasklets _are_ running, they're
>> just not busy. (In case that wasn't clear...)
>>
>> Also of note, I rebooted the box (after recompiling with NUMA off).
>> This time when I push traffic through, only the even-ksoftirqd's were
>> busy.. I then tweaked some of the ring settings via ethtool and
>> suddenly the odd-ksoftirqd's became busy (and the even ones went
>> idle).
>>
>> Thoughts? Suggestions? driver issue? I'm at 2.6.30-rc3.
>>
>> (BTW, I'm under the assumption that since only 4 (of 8) ksoftirqd's
>> are busy that I still have room to make this box go faster).
>
> I dont see the point here. ksoftirqd is running only if too much
> work has to be done in softirq context. Which should be your case
> since you want to saturate cpus with network load.
>
> You could try to change /proc/sys/net/core/netdev_budget if you really
> want to trigger ksoftirqd sooner or later, but it wont fundamentally
> change routing performance.
>
> If you believe box is loosing frames because cpu are saturated, please
> post some oprofile results.

My random feeling is you might have dst_release() contention, but my feeling might be wrong; I don't know what kind of network load you really use...

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-04-30 23:53 ` Andrew Dickinson 2009-05-01 4:19 ` Andrew Dickinson @ 2009-05-01 6:14 ` Eric Dumazet 2009-05-01 6:19 ` Andrew Dickinson ` (2 more replies) 1 sibling, 3 replies; 28+ messages in thread From: Eric Dumazet @ 2009-05-01 6:14 UTC (permalink / raw) To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev Andrew Dickinson a écrit : > OK... I've got some more data on it... > > I passed a small number of packets through the system and added a ton > of printks to it ;-P > > Here's the distribution of values as seen by > skb_rx_queue_recorded()... count on the left, value on the right: > 37 0 > 31 1 > 31 2 > 39 3 > 37 4 > 31 5 > 42 6 > 39 7 > > That's nice and even.... Here's what's getting returned from the > skb_tx_hash(). Again, count on the left, value on the right: > 31 0 > 81 1 > 37 2 > 70 3 > 37 4 > 31 6 > > Note that we're entirely missing 5 and 7 and that those interrupts > seem to have gotten munged onto 1 and 3. > > I think the voodoo lies within: > return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); > > David, I made the change that you suggested: > //hash = skb_get_rx_queue(skb); > return skb_get_rx_queue(skb) % dev->real_num_tx_queues; > > And now, I see a nice even mixing of interrupts on the TX side (yay!). > > However, my problem's not solved entirely... 
here's what top is showing me: > top - 23:37:49 up 9 min, 1 user, load average: 3.93, 2.68, 1.21 > Tasks: 119 total, 5 running, 114 sleeping, 0 stopped, 0 zombie > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st > Mem: 16403476k total, 335884k used, 16067592k free, 10108k buffers > Swap: 2096472k total, 0k used, 2096472k free, 146364k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 7 root 15 -5 0 0 0 R 100.2 0.0 5:35.24 > ksoftirqd/1 > 13 root 15 -5 0 0 0 R 100.2 0.0 5:36.98 > ksoftirqd/3 > 19 root 15 -5 0 0 0 R 97.8 0.0 5:34.52 > ksoftirqd/5 > 25 root 15 -5 0 0 0 R 94.5 0.0 5:13.56 > ksoftirqd/7 > 3905 root 20 0 12612 1084 820 R 0.3 0.0 0:00.14 top > <snip> > > > It appears that only the odd CPUs are actually handling the > interrupts, which doesn't jive with what /proc/interrupts shows me: > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 66: 2970565 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-0 > 67: 28 821122 0 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-1 > 68: 28 0 2943299 0 0 > 0 0 0 PCI-MSI-edge eth2-rx-2 > 69: 28 0 0 817776 0 > 0 0 0 PCI-MSI-edge eth2-rx-3 > 70: 28 0 0 0 2963924 > 0 0 0 PCI-MSI-edge eth2-rx-4 > 71: 28 0 0 0 0 > 821032 0 0 PCI-MSI-edge eth2-rx-5 > 72: 28 0 0 0 0 > 0 2979987 0 PCI-MSI-edge eth2-rx-6 > 73: 28 0 0 0 0 > 0 0 845422 PCI-MSI-edge eth2-rx-7 > 74: 4664732 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-0 > 75: 34 4679312 0 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-1 > 76: 28 0 4665014 0 0 > 0 0 0 PCI-MSI-edge eth2-tx-2 > 77: 28 0 0 
4681531 0 > 0 0 0 PCI-MSI-edge eth2-tx-3 > 78: 28 0 0 0 4665793 > 0 0 0 PCI-MSI-edge eth2-tx-4 > 79: 28 0 0 0 0 > 4671596 0 0 PCI-MSI-edge eth2-tx-5 > 80: 28 0 0 0 0 > 0 4665279 0 PCI-MSI-edge eth2-tx-6 > 81: 28 0 0 0 0 > 0 0 4664504 PCI-MSI-edge eth2-tx-7 > 82: 2 0 0 0 0 > 0 0 0 PCI-MSI-edge eth2:lsc > > > Why would ksoftirqd only run on half of the cores (and only the odd > ones to boot)? The one commonality that's striking me is that > all the odd CPU#'s are on the same physical processor: > > -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual > processor : 0 > physical id : 0 > processor : 1 > physical id : 1 > processor : 2 > physical id : 0 > processor : 3 > physical id : 1 > processor : 4 > physical id : 0 > processor : 5 > physical id : 1 > processor : 6 > physical id : 0 > processor : 7 > physical id : 1 > > I did compile the kernel with NUMA support... am I being bitten by > something there? Other thoughts on where I should look? > > Also... is there an incantation to get NAPI to work in the torvalds > kernel? As you can see, I'm generating quite a few interrupts. > > -A > > > On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote: >> From: Andrew Dickinson <andrew@whydna.net> >> Date: Thu, 30 Apr 2009 07:04:33 -0700 >> >>> I'll do some debugging around skb_tx_hash() and see if I can make >>> sense of it. I'll let you know what I find. My hypothesis is that >>> skb_record_rx_queue() isn't being called, but I should dig into it >>> before I start making claims. ;-P >> That's one possibility. >> >> Another is that the hashing isn't working out. One way to >> play with that is to simply replace the: >> >> hash = skb_get_rx_queue(skb); >> >> in skb_tx_hash() with something like: >> >> return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >> >> and see if that improves the situation. 
>> Hi Andrew Please try the following patch (I don't have a multi-queue NIC, sorry). I will do the follow-up patch if this one corrects the distribution problem you noticed. Thanks very much for all your findings. [PATCH] net: skb_tx_hash() improvements When skb_rx_queue_recorded() is true, we don't want to use jhash distribution, as the device driver told us exactly which queue was selected at RX time. jhash makes a statistical shuffle, but this won't work with 8 static inputs. A later improvement would be to compute the reciprocal value of real_num_tx_queues to avoid a divide here. But this computation should be done once, when real_num_tx_queues is set. This needs a separate patch, and a new field in struct net_device. Reported-by: Andrew Dickinson <andrew@whydna.net> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/net/core/dev.c b/net/core/dev.c index 308a7d0..e2e9e4a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) { u32 hash; - if (skb_rx_queue_recorded(skb)) { - hash = skb_get_rx_queue(skb); - } else if (skb->sk && skb->sk->sk_hash) { + if (skb_rx_queue_recorded(skb)) + return skb_get_rx_queue(skb) % dev->real_num_tx_queues; + + if (skb->sk && skb->sk->sk_hash) hash = skb->sk->sk_hash; - } else + else hash = skb->protocol; hash = jhash_1word(hash, skb_tx_hashrnd); ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 6:14 ` Eric Dumazet @ 2009-05-01 6:19 ` Andrew Dickinson 2009-05-01 6:40 ` Eric Dumazet 2009-05-01 8:29 ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet 2009-05-01 16:08 ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller 2 siblings, 1 reply; 28+ messages in thread From: Andrew Dickinson @ 2009-05-01 6:19 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, jelaas, netdev On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: > [earlier debugging data trimmed; quoted in full upthread] > Hi Andrew > Please try following patch (I don't have a multi-queue NIC, sorry) > [Eric's skb_tx_hash() patch trimmed; quoted in full upthread] Eric, That's exactly what I did! It solved the problem of hot-spots on some interrupts. However, I now have a new problem (which is documented in my previous posts). The short of it is that I'm only seeing 4 (out of 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The busy 4 are always on one physical package (but not always the same package; it'll change on reboot or when I change some parameters via ethtool), but never both. 
This, despite /proc/interrupts showing me that all 8 interrupts are being hit evenly. There are more details in my last mail. ;-D -Andrew ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 6:19 ` Andrew Dickinson @ 2009-05-01 6:40 ` Eric Dumazet 2009-05-01 7:23 ` Andrew Dickinson 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-01 6:40 UTC (permalink / raw) To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev Andrew Dickinson wrote: > [earlier debugging data and the skb_tx_hash() patch trimmed; quoted in full upthread] > Eric, > That's exactly what I did! It solved the problem of hot-spots on some > interrupts. 
> The short of it is that I'm only seeing 4 (out of 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The busy 4 are always on one physical package (but not always the same package; it'll change on reboot or when I change some parameters via ethtool), but never both. This, despite /proc/interrupts showing me that all 8 interrupts are being hit evenly. There are more details in my last mail. ;-D Well, I was reacting to your 'voodoo' comment about return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); since this is not the problem. The problem comes from jhash(), which shuffles the input, while in your case we want to select the same output queue because of CPU affinities. No shuffle required. (assuming cpu0 is handling tx-queue-0 and rx-queue-0, cpu1 is handling tx-queue-1 and rx-queue-1, and so on...) Then either /proc/interrupts shows your rx interrupts are not evenly distributed, or ksoftirqd is triggered only on one physical CPU, while on the other CPUs softirqs are not run from ksoftirqd. It's only a matter of load. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 6:40 ` Eric Dumazet @ 2009-05-01 7:23 ` Andrew Dickinson 2009-05-01 7:31 ` Eric Dumazet 2009-05-01 21:37 ` Brandeburg, Jesse 0 siblings, 2 replies; 28+ messages in thread From: Andrew Dickinson @ 2009-05-01 7:23 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, jelaas, netdev On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: > Andrew Dickinson wrote: >> [earlier debugging data and the skb_tx_hash() patch trimmed; quoted in full upthread] >> Eric, >> That's exactly what I did! It solved the problem of hot-spots on some >> interrupts. 
However, I now have a new problem (which is documented in >> my previous posts). The short of it is that I'm only seeing 4 (out of 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The busy 4 are always on one physical package (but not always the same package; it'll change on reboot or when I change some parameters via ethtool), but never both. This, despite /proc/interrupts showing me that all 8 interrupts are being hit evenly. There are more details in my last mail. ;-D > Well, I was reacting to your 'voodoo' comment about > return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); > since this is not the problem. The problem comes from jhash(), which shuffles > the input, while in your case we want to select the same output queue > because of CPU affinities. No shuffle required. Agreed. I don't want to jhash(), and I'm not. > (assuming cpu0 is handling tx-queue-0 and rx-queue-0, > cpu1 is handling tx-queue-1 and rx-queue-1, and so on...) That's a correct assumption. :D > Then either /proc/interrupts shows your rx interrupts are not evenly distributed, > or ksoftirqd is triggered only on one physical CPU, while on the other > CPUs softirqs are not run from ksoftirqd. It's only a matter of load. Hrmm... more fuel for the fire... The NIC seems to be doing a good job of hashing the incoming data and the kernel is now finding the right TX queue: -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets rx_packets: 1286009099 tx_packets: 1287853570 tx_queue_0_packets: 162469405 tx_queue_1_packets: 162452446 tx_queue_2_packets: 162481160 tx_queue_3_packets: 162441839 tx_queue_4_packets: 162484930 tx_queue_5_packets: 162478402 tx_queue_6_packets: 162492530 tx_queue_7_packets: 162477162 rx_queue_0_packets: 162469449 rx_queue_1_packets: 162452440 rx_queue_2_packets: 162481186 rx_queue_3_packets: 162441885 rx_queue_4_packets: 162484949 rx_queue_5_packets: 162478427 Here's where it gets juicy. 
If I reduce the rate at which I'm pushing traffic to a 0-loss level (in this case about 2.2Mpps), then top looks as follow: Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st And if I watch /proc/interrupts, I see that all of the tx and rx queues are handling a fairly similar number of interrupts (ballpark, 7-8k/sec on rx, 10k on tx). OK... now let me double the packet rate (to about 4.4Mpps), top looks like this: Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 5.5%hi, 92.5%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 2.3%id, 0.0%wa, 4.9%hi, 92.9%si, 0.0%st Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 5.2%id, 0.0%wa, 5.2%hi, 89.6%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.3%hi, 1.9%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 4.9%hi, 94.8%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st And if I watch /proc/interrupts again, I see that the even-CPUs (i.e. 0,2,4, and 6) RX queues are receiving relatively few interrupts (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are receiving about 2-3k/sec. What's extra strange is that the TX queues are still handling about 10k/sec each. So, below some magic threshold (approx 2.3Mpps), the box is basically idle and happily routing all the packets (I can confirm that my network test device ixia is showing 0-loss). 
Above the magic threshold, the box starts acting as described above and
I'm unable to push it beyond that threshold. While I understand that
there are limits to how fast I can route packets (obviously), it seems
very strange that I'm seeing this physical-CPU affinity on the
ksoftirqd "processes".

Here's how fragile this "magic threshold" is...
2.292 Mpps, box looks idle, 0 loss.
2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
2.307 Mpps, even-CPU ksoftirqd processes at 75%.
2.323 Mpps, even-CPU ksoftirqd processes at 100%.
Never during this did the odd-CPU ksoftirqd processes show any utilization at all.

These are 64-byte frames, so I shouldn't be hitting any bandwidth
issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
just routing packets back out the one NIC).

=/

-A

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 7:23 ` Andrew Dickinson @ 2009-05-01 7:31 ` Eric Dumazet 2009-05-01 7:34 ` Andrew Dickinson 2009-05-01 21:37 ` Brandeburg, Jesse 1 sibling, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-01 7:31 UTC (permalink / raw) To: Andrew Dickinson; +Cc: David Miller, jelaas, netdev Andrew Dickinson a écrit : > On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: >> Andrew Dickinson a écrit : >>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: >>>> Andrew Dickinson a écrit : >>>>> OK... I've got some more data on it... >>>>> >>>>> I passed a small number of packets through the system and added a ton >>>>> of printks to it ;-P >>>>> >>>>> Here's the distribution of values as seen by >>>>> skb_rx_queue_recorded()... count on the left, value on the right: >>>>> 37 0 >>>>> 31 1 >>>>> 31 2 >>>>> 39 3 >>>>> 37 4 >>>>> 31 5 >>>>> 42 6 >>>>> 39 7 >>>>> >>>>> That's nice and even.... Here's what's getting returned from the >>>>> skb_tx_hash(). Again, count on the left, value on the right: >>>>> 31 0 >>>>> 81 1 >>>>> 37 2 >>>>> 70 3 >>>>> 37 4 >>>>> 31 6 >>>>> >>>>> Note that we're entirely missing 5 and 7 and that those interrupts >>>>> seem to have gotten munged onto 1 and 3. >>>>> >>>>> I think the voodoo lies within: >>>>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); >>>>> >>>>> David, I made the change that you suggested: >>>>> //hash = skb_get_rx_queue(skb); >>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >>>>> >>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!). >>>>> >>>>> However, my problem's not solved entirely... 
here's what top is showing me: >>>>> top - 23:37:49 up 9 min, 1 user, load average: 3.93, 2.68, 1.21 >>>>> Tasks: 119 total, 5 running, 114 sleeping, 0 stopped, 0 zombie >>>>> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st >>>>> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st >>>>> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st >>>>> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st >>>>> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st >>>>> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st >>>>> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >>>>> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st >>>>> Mem: 16403476k total, 335884k used, 16067592k free, 10108k buffers >>>>> Swap: 2096472k total, 0k used, 2096472k free, 146364k cached >>>>> >>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>>> 7 root 15 -5 0 0 0 R 100.2 0.0 5:35.24 >>>>> ksoftirqd/1 >>>>> 13 root 15 -5 0 0 0 R 100.2 0.0 5:36.98 >>>>> ksoftirqd/3 >>>>> 19 root 15 -5 0 0 0 R 97.8 0.0 5:34.52 >>>>> ksoftirqd/5 >>>>> 25 root 15 -5 0 0 0 R 94.5 0.0 5:13.56 >>>>> ksoftirqd/7 >>>>> 3905 root 20 0 12612 1084 820 R 0.3 0.0 0:00.14 top >>>>> <snip> >>>>> >>>>> >>>>> It appears that only the odd CPUs are actually handling the >>>>> interrupts, which doesn't jive with what /proc/interrupts shows me: >>>>> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 >>>>> 66: 2970565 0 0 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-rx-0 >>>>> 67: 28 821122 0 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-rx-1 >>>>> 68: 28 0 2943299 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-rx-2 >>>>> 69: 28 0 0 817776 0 >>>>> 0 0 0 PCI-MSI-edge eth2-rx-3 >>>>> 70: 28 0 0 0 2963924 >>>>> 0 0 0 PCI-MSI-edge eth2-rx-4 >>>>> 71: 28 0 0 0 0 >>>>> 821032 0 0 PCI-MSI-edge eth2-rx-5 >>>>> 72: 28 0 0 0 0 >>>>> 0 2979987 0 PCI-MSI-edge eth2-rx-6 >>>>> 73: 28 0 0 0 0 >>>>> 0 0 845422 PCI-MSI-edge 
eth2-rx-7 >>>>> 74: 4664732 0 0 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-tx-0 >>>>> 75: 34 4679312 0 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-tx-1 >>>>> 76: 28 0 4665014 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2-tx-2 >>>>> 77: 28 0 0 4681531 0 >>>>> 0 0 0 PCI-MSI-edge eth2-tx-3 >>>>> 78: 28 0 0 0 4665793 >>>>> 0 0 0 PCI-MSI-edge eth2-tx-4 >>>>> 79: 28 0 0 0 0 >>>>> 4671596 0 0 PCI-MSI-edge eth2-tx-5 >>>>> 80: 28 0 0 0 0 >>>>> 0 4665279 0 PCI-MSI-edge eth2-tx-6 >>>>> 81: 28 0 0 0 0 >>>>> 0 0 4664504 PCI-MSI-edge eth2-tx-7 >>>>> 82: 2 0 0 0 0 >>>>> 0 0 0 PCI-MSI-edge eth2:lsc >>>>> >>>>> >>>>> Why would ksoftirqd only run on half of the cores (and only the odd >>>>> ones to boot)? The one commonality that's striking me is that that >>>>> all the odd CPU#'s are on the same physical processor: >>>>> >>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual >>>>> processor : 0 >>>>> physical id : 0 >>>>> processor : 1 >>>>> physical id : 1 >>>>> processor : 2 >>>>> physical id : 0 >>>>> processor : 3 >>>>> physical id : 1 >>>>> processor : 4 >>>>> physical id : 0 >>>>> processor : 5 >>>>> physical id : 1 >>>>> processor : 6 >>>>> physical id : 0 >>>>> processor : 7 >>>>> physical id : 1 >>>>> >>>>> I did compile the kernel with NUMA support... am I being bitten by >>>>> something there? Other thoughts on where I should look. >>>>> >>>>> Also... is there an incantation to get NAPI to work in the torvalds >>>>> kernel? As you can see, I'm generating quite a few interrrupts. >>>>> >>>>> -A >>>>> >>>>> >>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote: >>>>>> From: Andrew Dickinson <andrew@whydna.net> >>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700 >>>>>> >>>>>>> I'll do some debugging around skb_tx_hash() and see if I can make >>>>>>> sense of it. I'll let you know what I find. My hypothesis is that >>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it >>>>>>> before I start making claims. 
;-P
>>>>>> That's one possibility.
>>>>>>
>>>>>> Another is that the hashing isn't working out. One way to
>>>>>> play with that is to simply replace the:
>>>>>>
>>>>>> hash = skb_get_rx_queue(skb);
>>>>>>
>>>>>> in skb_tx_hash() with something like:
>>>>>>
>>>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>
>>>>>> and see if that improves the situation.
>>>>>>
>>>> Hi Andrew
>>>>
>>>> Please try following patch (I dont have multi-queue NIC, sorry)
>>>>
>>>> I will do the followup patch if this one corrects the distribution problem
>>>> you noticed.
>>>>
>>>> Thanks very much for all your findings.
>>>>
>>>> [PATCH] net: skb_tx_hash() improvements
>>>>
>>>> When skb_rx_queue_recorded() is true, we dont want to use jhash distribution
>>>> as the device driver exactly told us which queue was selected at RX time.
>>>> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>>>>
>>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>>> to avoid a divide here. But this computation should be done once,
>>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>>> field in struct net_device.
>>>> >>>> Reported-by: Andrew Dickinson <andrew@whydna.net> >>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>>> >>>> diff --git a/net/core/dev.c b/net/core/dev.c >>>> index 308a7d0..e2e9e4a 100644 >>>> --- a/net/core/dev.c >>>> +++ b/net/core/dev.c >>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) >>>> { >>>> u32 hash; >>>> >>>> - if (skb_rx_queue_recorded(skb)) { >>>> - hash = skb_get_rx_queue(skb); >>>> - } else if (skb->sk && skb->sk->sk_hash) { >>>> + if (skb_rx_queue_recorded(skb)) >>>> + return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >>>> + >>>> + if (skb->sk && skb->sk->sk_hash) >>>> hash = skb->sk->sk_hash; >>>> - } else >>>> + else >>>> hash = skb->protocol; >>>> >>>> hash = jhash_1word(hash, skb_tx_hashrnd); >>>> >>>> >>> Eric, >>> >>> That's exactly what I did! It solved the problem of hot-spots on some >>> interrupts. However, I now have a new problem (which is documented in >>> my previous posts). The short of it is that I'm only seeing 4 (out of >>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The >>> busy 4 are always on one physical package (but not always the same >>> package (it'll change on reboot or when I change some parameters via >>> ethtool), but never both. This, despite /proc/interrupts showing me >>> that all 8 interrupts are being hit evenly. There's more details in >>> my last mail. ;-D >>> >> Well, I was reacting to your 'voodo' comment about >> >> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); >> >> Since this is not the problem. Problem is coming from jhash() which shuffles >> the input, while in your case we want to select same output queue >> because of cpu affinities. No shuffle required. > > Agreed. I don't want to jhash(), and I'm not. > >> (assuming cpu0 is handling tx-queue-0 and rx-queue-0, >> cpu1 is handling tx-queue-1 and rx-queue-1, and so on...) > > That's a correct assumption. 
:D > >> Then /proc/interrupts show your rx interrupts are not evenly distributed. >> >> Or that ksoftirqd is triggered only on one physical cpu, while on other >> cpu, softirqds are not run from ksoftirqd. Its only a matter of load. > > Hrmm... more fuel for the fire... > > The NIC seems to be doing a good job of hashing the incoming data and > the kernel is now finding the right TX queue: > -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets > rx_packets: 1286009099 > tx_packets: 1287853570 > tx_queue_0_packets: 162469405 > tx_queue_1_packets: 162452446 > tx_queue_2_packets: 162481160 > tx_queue_3_packets: 162441839 > tx_queue_4_packets: 162484930 > tx_queue_5_packets: 162478402 > tx_queue_6_packets: 162492530 > tx_queue_7_packets: 162477162 > rx_queue_0_packets: 162469449 > rx_queue_1_packets: 162452440 > rx_queue_2_packets: 162481186 > rx_queue_3_packets: 162441885 > rx_queue_4_packets: 162484949 > rx_queue_5_packets: 162478427 > > Here's where it gets juicy. If I reduce the rate at which I'm pushing > traffic to a 0-loss level (in this case about 2.2Mpps), then top looks > as follow: > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > > And if I watch /proc/interrupts, I see that all of the tx and rx > queues are handling a fairly similar number of interrupts (ballpark, > 7-8k/sec on rx, 10k on tx). > > OK... 
now let me double the packet rate (to about 4.4Mpps), top looks like this: > > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 5.5%hi, 92.5%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 2.3%id, 0.0%wa, 4.9%hi, 92.9%si, 0.0%st > Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 5.2%id, 0.0%wa, 5.2%hi, 89.6%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.3%hi, 1.9%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 4.9%hi, 94.8%si, 0.0%st > Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > > And if I watch /proc/interrupts again, I see that the even-CPUs (i.e. > 0,2,4, and 6) RX queues are receiving relatively few interrupts > (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are > receiving about 2-3k/sec. What's extra strange is that the TX queues > are still handling about 10k/sec each. > > So, below some magic threshold (approx 2.3Mpps), the box is basically > idle and happily routing all the packets (I can confirm that my > network test device ixia is showing 0-loss). Above the magic > threshold, the box starts acting as described above and I'm unable to > push it beyond that threshold. While I understand that there are > limits to how fast I can route packets (obviously), it seems very > strange that I'm seeing this physical-CPU affinity on the ksoftirqd > "processes". 
>

box is not idle, you hit a bug in kernel, I already corrected this week :)

check for "sched: account system time properly" in google

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)

 	if (user_tick)
 		account_user_time(p, one_jiffy, one_jiffy_scaled);
-	else if (p != rq->idle)
+	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
 				    one_jiffy_scaled);
 	else

> Here's how fragile this "magic threshold" is... 2.292 Mpps, box looks
> idle, 0 loss. 2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%. 2.323 Mpps, even-CPU
> ksoftirqd processes at 100%. Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
>
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm
> just routing packets back out the one NIC).
>
> =/
>

^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 7:31 ` Eric Dumazet @ 2009-05-01 7:34 ` Andrew Dickinson 0 siblings, 0 replies; 28+ messages in thread From: Andrew Dickinson @ 2009-05-01 7:34 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, jelaas, netdev On Fri, May 1, 2009 at 12:31 AM, Eric Dumazet <dada1@cosmosbay.com> wrote: > Andrew Dickinson a écrit : >> On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: >>> Andrew Dickinson a écrit : >>>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote: >>>>> Andrew Dickinson a écrit : >>>>>> OK... I've got some more data on it... >>>>>> >>>>>> I passed a small number of packets through the system and added a ton >>>>>> of printks to it ;-P >>>>>> >>>>>> Here's the distribution of values as seen by >>>>>> skb_rx_queue_recorded()... count on the left, value on the right: >>>>>> 37 0 >>>>>> 31 1 >>>>>> 31 2 >>>>>> 39 3 >>>>>> 37 4 >>>>>> 31 5 >>>>>> 42 6 >>>>>> 39 7 >>>>>> >>>>>> That's nice and even.... Here's what's getting returned from the >>>>>> skb_tx_hash(). Again, count on the left, value on the right: >>>>>> 31 0 >>>>>> 81 1 >>>>>> 37 2 >>>>>> 70 3 >>>>>> 37 4 >>>>>> 31 6 >>>>>> >>>>>> Note that we're entirely missing 5 and 7 and that those interrupts >>>>>> seem to have gotten munged onto 1 and 3. >>>>>> >>>>>> I think the voodoo lies within: >>>>>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); >>>>>> >>>>>> David, I made the change that you suggested: >>>>>> //hash = skb_get_rx_queue(skb); >>>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >>>>>> >>>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!). >>>>>> >>>>>> However, my problem's not solved entirely... 
here's what top is showing me: >>>>>> top - 23:37:49 up 9 min, 1 user, load average: 3.93, 2.68, 1.21 >>>>>> Tasks: 119 total, 5 running, 114 sleeping, 0 stopped, 0 zombie >>>>>> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st >>>>>> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st >>>>>> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st >>>>>> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st >>>>>> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st >>>>>> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st >>>>>> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >>>>>> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st >>>>>> Mem: 16403476k total, 335884k used, 16067592k free, 10108k buffers >>>>>> Swap: 2096472k total, 0k used, 2096472k free, 146364k cached >>>>>> >>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>>>>> 7 root 15 -5 0 0 0 R 100.2 0.0 5:35.24 >>>>>> ksoftirqd/1 >>>>>> 13 root 15 -5 0 0 0 R 100.2 0.0 5:36.98 >>>>>> ksoftirqd/3 >>>>>> 19 root 15 -5 0 0 0 R 97.8 0.0 5:34.52 >>>>>> ksoftirqd/5 >>>>>> 25 root 15 -5 0 0 0 R 94.5 0.0 5:13.56 >>>>>> ksoftirqd/7 >>>>>> 3905 root 20 0 12612 1084 820 R 0.3 0.0 0:00.14 top >>>>>> <snip> >>>>>> >>>>>> >>>>>> It appears that only the odd CPUs are actually handling the >>>>>> interrupts, which doesn't jive with what /proc/interrupts shows me: >>>>>> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 >>>>>> 66: 2970565 0 0 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-rx-0 >>>>>> 67: 28 821122 0 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-rx-1 >>>>>> 68: 28 0 2943299 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-rx-2 >>>>>> 69: 28 0 0 817776 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-rx-3 >>>>>> 70: 28 0 0 0 2963924 >>>>>> 0 0 0 PCI-MSI-edge eth2-rx-4 >>>>>> 71: 28 0 0 0 0 >>>>>> 821032 0 0 PCI-MSI-edge eth2-rx-5 >>>>>> 72: 28 0 0 0 0 >>>>>> 0 2979987 0 PCI-MSI-edge eth2-rx-6 >>>>>> 73: 
28 0 0 0 0 >>>>>> 0 0 845422 PCI-MSI-edge eth2-rx-7 >>>>>> 74: 4664732 0 0 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-tx-0 >>>>>> 75: 34 4679312 0 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-tx-1 >>>>>> 76: 28 0 4665014 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-tx-2 >>>>>> 77: 28 0 0 4681531 0 >>>>>> 0 0 0 PCI-MSI-edge eth2-tx-3 >>>>>> 78: 28 0 0 0 4665793 >>>>>> 0 0 0 PCI-MSI-edge eth2-tx-4 >>>>>> 79: 28 0 0 0 0 >>>>>> 4671596 0 0 PCI-MSI-edge eth2-tx-5 >>>>>> 80: 28 0 0 0 0 >>>>>> 0 4665279 0 PCI-MSI-edge eth2-tx-6 >>>>>> 81: 28 0 0 0 0 >>>>>> 0 0 4664504 PCI-MSI-edge eth2-tx-7 >>>>>> 82: 2 0 0 0 0 >>>>>> 0 0 0 PCI-MSI-edge eth2:lsc >>>>>> >>>>>> >>>>>> Why would ksoftirqd only run on half of the cores (and only the odd >>>>>> ones to boot)? The one commonality that's striking me is that that >>>>>> all the odd CPU#'s are on the same physical processor: >>>>>> >>>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual >>>>>> processor : 0 >>>>>> physical id : 0 >>>>>> processor : 1 >>>>>> physical id : 1 >>>>>> processor : 2 >>>>>> physical id : 0 >>>>>> processor : 3 >>>>>> physical id : 1 >>>>>> processor : 4 >>>>>> physical id : 0 >>>>>> processor : 5 >>>>>> physical id : 1 >>>>>> processor : 6 >>>>>> physical id : 0 >>>>>> processor : 7 >>>>>> physical id : 1 >>>>>> >>>>>> I did compile the kernel with NUMA support... am I being bitten by >>>>>> something there? Other thoughts on where I should look. >>>>>> >>>>>> Also... is there an incantation to get NAPI to work in the torvalds >>>>>> kernel? As you can see, I'm generating quite a few interrrupts. >>>>>> >>>>>> -A >>>>>> >>>>>> >>>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote: >>>>>>> From: Andrew Dickinson <andrew@whydna.net> >>>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700 >>>>>>> >>>>>>>> I'll do some debugging around skb_tx_hash() and see if I can make >>>>>>>> sense of it. I'll let you know what I find. 
My hypothesis is that
>>>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>>>> before I start making claims. ;-P
>>>>>>> That's one possibility.
>>>>>>>
>>>>>>> Another is that the hashing isn't working out. One way to
>>>>>>> play with that is to simply replace the:
>>>>>>>
>>>>>>> hash = skb_get_rx_queue(skb);
>>>>>>>
>>>>>>> in skb_tx_hash() with something like:
>>>>>>>
>>>>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>>
>>>>>>> and see if that improves the situation.
>>>>>>>
>>>>> Hi Andrew
>>>>>
>>>>> Please try following patch (I dont have multi-queue NIC, sorry)
>>>>>
>>>>> I will do the followup patch if this one corrects the distribution problem
>>>>> you noticed.
>>>>>
>>>>> Thanks very much for all your findings.
>>>>>
>>>>> [PATCH] net: skb_tx_hash() improvements
>>>>>
>>>>> When skb_rx_queue_recorded() is true, we dont want to use jhash distribution
>>>>> as the device driver exactly told us which queue was selected at RX time.
>>>>> jhash makes a statistical shuffle, but this wont work with 8 static inputs.
>>>>>
>>>>> Later improvements would be to compute reciprocal value of real_num_tx_queues
>>>>> to avoid a divide here. But this computation should be done once,
>>>>> when real_num_tx_queues is set. This needs a separate patch, and a new
>>>>> field in struct net_device.
>>>>> >>>>> Reported-by: Andrew Dickinson <andrew@whydna.net> >>>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>>>> >>>>> diff --git a/net/core/dev.c b/net/core/dev.c >>>>> index 308a7d0..e2e9e4a 100644 >>>>> --- a/net/core/dev.c >>>>> +++ b/net/core/dev.c >>>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) >>>>> { >>>>> u32 hash; >>>>> >>>>> - if (skb_rx_queue_recorded(skb)) { >>>>> - hash = skb_get_rx_queue(skb); >>>>> - } else if (skb->sk && skb->sk->sk_hash) { >>>>> + if (skb_rx_queue_recorded(skb)) >>>>> + return skb_get_rx_queue(skb) % dev->real_num_tx_queues; >>>>> + >>>>> + if (skb->sk && skb->sk->sk_hash) >>>>> hash = skb->sk->sk_hash; >>>>> - } else >>>>> + else >>>>> hash = skb->protocol; >>>>> >>>>> hash = jhash_1word(hash, skb_tx_hashrnd); >>>>> >>>>> >>>> Eric, >>>> >>>> That's exactly what I did! It solved the problem of hot-spots on some >>>> interrupts. However, I now have a new problem (which is documented in >>>> my previous posts). The short of it is that I'm only seeing 4 (out of >>>> 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The >>>> busy 4 are always on one physical package (but not always the same >>>> package (it'll change on reboot or when I change some parameters via >>>> ethtool), but never both. This, despite /proc/interrupts showing me >>>> that all 8 interrupts are being hit evenly. There's more details in >>>> my last mail. ;-D >>>> >>> Well, I was reacting to your 'voodo' comment about >>> >>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); >>> >>> Since this is not the problem. Problem is coming from jhash() which shuffles >>> the input, while in your case we want to select same output queue >>> because of cpu affinities. No shuffle required. >> >> Agreed. I don't want to jhash(), and I'm not. >> >>> (assuming cpu0 is handling tx-queue-0 and rx-queue-0, >>> cpu1 is handling tx-queue-1 and rx-queue-1, and so on...) 
>> >> That's a correct assumption. :D >> >>> Then /proc/interrupts show your rx interrupts are not evenly distributed. >>> >>> Or that ksoftirqd is triggered only on one physical cpu, while on other >>> cpu, softirqds are not run from ksoftirqd. Its only a matter of load. >> >> Hrmm... more fuel for the fire... >> >> The NIC seems to be doing a good job of hashing the incoming data and >> the kernel is now finding the right TX queue: >> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets >> rx_packets: 1286009099 >> tx_packets: 1287853570 >> tx_queue_0_packets: 162469405 >> tx_queue_1_packets: 162452446 >> tx_queue_2_packets: 162481160 >> tx_queue_3_packets: 162441839 >> tx_queue_4_packets: 162484930 >> tx_queue_5_packets: 162478402 >> tx_queue_6_packets: 162492530 >> tx_queue_7_packets: 162477162 >> rx_queue_0_packets: 162469449 >> rx_queue_1_packets: 162452440 >> rx_queue_2_packets: 162481186 >> rx_queue_3_packets: 162441885 >> rx_queue_4_packets: 162484949 >> rx_queue_5_packets: 162478427 >> >> Here's where it gets juicy. If I reduce the rate at which I'm pushing >> traffic to a 0-loss level (in this case about 2.2Mpps), then top looks >> as follow: >> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st >> Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st >> >> And if I watch /proc/interrupts, I see that all of the tx and rx >> queues are handling a fairly similar number of interrupts (ballpark, >> 7-8k/sec on rx, 10k on tx). >> >> OK... 
now let me double the packet rate (to about 4.4Mpps), top looks like this: >> >> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 5.5%hi, 92.5%si, 0.0%st >> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st >> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 2.3%id, 0.0%wa, 4.9%hi, 92.9%si, 0.0%st >> Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st >> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 5.2%id, 0.0%wa, 5.2%hi, 89.6%si, 0.0%st >> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.3%hi, 1.9%si, 0.0%st >> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 4.9%hi, 94.8%si, 0.0%st >> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st >> >> And if I watch /proc/interrupts again, I see that the even-CPUs (i.e. >> 0,2,4, and 6) RX queues are receiving relatively few interrupts >> (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are >> receiving about 2-3k/sec. What's extra strange is that the TX queues >> are still handling about 10k/sec each. >> >> So, below some magic threshold (approx 2.3Mpps), the box is basically >> idle and happily routing all the packets (I can confirm that my >> network test device ixia is showing 0-loss). Above the magic >> threshold, the box starts acting as described above and I'm unable to >> push it beyond that threshold. While I understand that there are >> limits to how fast I can route packets (obviously), it seems very >> strange that I'm seeing this physical-CPU affinity on the ksoftirqd >> "processes". 
>> > > box is not idle, you hit a bug in kernel, I already corrected this week :) > > check for "sched: account system time properly" in google > > diff --git a/kernel/sched.c b/kernel/sched.c > index b902e58..26efa47 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick) > > if (user_tick) > account_user_time(p, one_jiffy, one_jiffy_scaled); > - else if (p != rq->idle) > + else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET)) > account_system_time(p, HARDIRQ_OFFSET, one_jiffy, > one_jiffy_scaled); > else > <whew>, I'm not crazy! ;-P I'll apply this patch and let you know how that changes things. -A >> Here's how fragile this "magic threshold" is... 2.292 Mpps, box looks >> idle, 0 loss. 2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish. >> 2.307 Mpps, even-CPU ksoftirqd processes at 75%. 2.323 Mpps, even-CPU >> ksoftirqd proccesses at 100%. Never during this did the odd-CPU >> ksoftirqd processes show any utilization at all. >> >> These are 64-byte frames, so I shouldn't be hitting any bandwidth >> issues that I'm aware of, 1.3Gbps in, and 1.3Gbps out (same NIC, I'm >> just routing packets back out the one NIC). >> >> =/ >> > > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 7:23 ` Andrew Dickinson 2009-05-01 7:31 ` Eric Dumazet @ 2009-05-01 21:37 ` Brandeburg, Jesse 1 sibling, 0 replies; 28+ messages in thread From: Brandeburg, Jesse @ 2009-05-01 21:37 UTC (permalink / raw) To: Andrew Dickinson; +Cc: Eric Dumazet, David Miller, jelaas, netdev I'm going to try to clarify just a few minor things in the hope of helping explain why things look the way they do from the ixgbe perspective. On Fri, 1 May 2009, Andrew Dickinson wrote: > >> That's exactly what I did! It solved the problem of hot-spots on some > >> interrupts. However, I now have a new problem (which is documented in > >> my previous posts). The short of it is that I'm only seeing 4 (out of > >> 8) ksoftirqd's busy under heavy load... the other 4 seem idle. The > >> busy 4 are always on one physical package (but not always the same > >> package (it'll change on reboot or when I change some parameters via > >> ethtool), but never both. This, despite /proc/interrupts showing me > >> that all 8 interrupts are being hit evenly. There's more details in > >> my last mail. ;-D > >> > > > > Well, I was reacting to your 'voodo' comment about > > > > return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); > > > > Since this is not the problem. Problem is coming from jhash() which shuffles > > the input, while in your case we want to select same output queue > > because of cpu affinities. No shuffle required. > > Agreed. I don't want to jhash(), and I'm not. > > > (assuming cpu0 is handling tx-queue-0 and rx-queue-0, > > cpu1 is handling tx-queue-1 and rx-queue-1, and so on...) > > That's a correct assumption. :D > > > Then /proc/interrupts show your rx interrupts are not evenly distributed. > > > > Or that ksoftirqd is triggered only on one physical cpu, while on other > > cpu, softirqds are not run from ksoftirqd. Its only a matter of load. > > Hrmm... more fuel for the fire... 
> > The NIC seems to be doing a good job of hashing the incoming data and > the kernel is now finding the right TX queue: > -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets > rx_packets: 1286009099 > tx_packets: 1287853570 > tx_queue_0_packets: 162469405 > tx_queue_1_packets: 162452446 > tx_queue_2_packets: 162481160 > tx_queue_3_packets: 162441839 > tx_queue_4_packets: 162484930 > tx_queue_5_packets: 162478402 > tx_queue_6_packets: 162492530 > tx_queue_7_packets: 162477162 > rx_queue_0_packets: 162469449 > rx_queue_1_packets: 162452440 > rx_queue_2_packets: 162481186 > rx_queue_3_packets: 162441885 > rx_queue_4_packets: 162484949 > rx_queue_5_packets: 162478427 > > Here's where it gets juicy. If I reduce the rate at which I'm pushing > traffic to a 0-loss level (in this case about 2.2Mpps), then top looks > as follow: > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > > And if I watch /proc/interrupts, I see that all of the tx and rx > queues are handling a fairly similar number of interrupts (ballpark, > 7-8k/sec on rx, 10k on tx). > > OK... 
now let me double the packet rate (to about 4.4Mpps), top looks like this: > > Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 5.5%hi, 92.5%si, 0.0%st > Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st > Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 2.3%id, 0.0%wa, 4.9%hi, 92.9%si, 0.0%st > Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st > Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 5.2%id, 0.0%wa, 5.2%hi, 89.6%si, 0.0%st > Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.3%hi, 1.9%si, 0.0%st > Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 4.9%hi, 94.8%si, 0.0%st > Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st > > And if I watch /proc/interrupts again, I see that the even-CPUs (i.e. > 0,2,4, and 6) RX queues are receiving relatively few interrupts > (5-ish/sec (not 5k... just 5)) and the odd-CPUS RX queues are > receiving about 2-3k/sec. What's extra strange is that the TX queues > are still handling about 10k/sec each. rx interrupts start polling (100% time) tx queues keep doing 10K per second because tx queues don't run in NAPI mode for MSI-X vectors. They do try to limit the amount of work done at once as to not hog a cpu. > So, below some magic threshold (approx 2.3Mpps), the box is basically > idle and happily routing all the packets (I can confirm that my > network test device ixia is showing 0-loss). Above the magic > threshold, the box starts acting as described above and I'm unable to > push it beyond that threshold. While I understand that there are > limits to how fast I can route packets (obviously), it seems very > strange that I'm seeing this physical-CPU affinity on the ksoftirqd > "processes". > > Here's how fragile this "magic threshold" is... 2.292 Mpps, box looks > idle, 0 loss. 2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish. > 2.307 Mpps, even-CPU ksoftirqd processes at 75%. 2.323 Mpps, even-CPU > ksoftirqd proccesses at 100%. 
> Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
>
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of: 1.3 Gbps in, and 1.3 Gbps out (same NIC, I'm
> just routing packets back out the one NIC).

Do you have all six memory channels populated? You're probably just
hitting the limits of the OS combined with the hardware. You could try
reducing your rx/tx queue count (you have to change code,
'num_rx_queues =' -- hope we get ethtool to do that someday) and then
assigning each RX queue to one core and a TX queue to another core on a
shared cache.

On a Nehalem, the kernel in NUMA mode (is your BIOS in NUMA mode?) may
not be balancing the memory utilization evenly between channels. Are
you using SLUB or SLQB?

Changing netdev_alloc_skb() to __alloc_skb() (be sure to specify
node=-1) and getting rid of the skb_reserve(NET_IP_ALIGN) and
skb_reserve(16) might help align RX packets for DMA.

Hope this helps,
Jesse

^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] net: skb_tx_hash() improvements 2009-05-01 6:14 ` Eric Dumazet 2009-05-01 6:19 ` Andrew Dickinson @ 2009-05-01 8:29 ` Eric Dumazet 2009-05-01 8:52 ` Eric Dumazet 2009-05-01 16:08 ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller 2 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01 8:29 UTC (permalink / raw)
To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

David, here is the followup I promised.

Thanks

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use the jhash
distribution, as the device driver told us exactly which queue was
selected at RX time. jhash makes a statistical shuffle, but this won't
work with only 8 different inputs.

We also need to implement a true reciprocal division, so as not to
disturb symmetric setups (when the number of tx queues matches the
number of rx queues) and cpu affinities.

This patch introduces a new helper, dev_real_num_tx_queues_set(), to
set both real_num_tx_queues and its reciprocal value, and makes all
drivers use this helper.
Many thanks to Andrew Dickinson to let us see the light here :) Reported-by: Andrew Dickinson <andrew@whydna.net> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- drivers/net/bnx2.c | 2 +- drivers/net/bnx2x_main.c | 2 +- drivers/net/cxgb3/cxgb3_main.c | 2 +- drivers/net/igb/igb_main.c | 2 +- drivers/net/ixgbe/ixgbe_main.c | 2 +- drivers/net/mv643xx_eth.c | 2 +- drivers/net/myri10ge/myri10ge.c | 4 ++-- drivers/net/niu.c | 2 +- drivers/net/vxge/vxge-main.c | 2 +- include/linux/netdevice.h | 2 ++ net/core/dev.c | 26 ++++++++++++++++++-------- 11 files changed, 30 insertions(+), 18 deletions(-) diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index d478391..1f674c1 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -5951,7 +5951,7 @@ bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi) } bp->num_tx_rings = rounddown_pow_of_two(bp->irq_nvecs); - bp->dev->real_num_tx_queues = bp->num_tx_rings; + dev_real_num_tx_queues_set(bp->dev, bp->num_tx_rings); bp->num_rx_rings = bp->irq_nvecs; } diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c index ad5ef25..d5c641b 100644 --- a/drivers/net/bnx2x_main.c +++ b/drivers/net/bnx2x_main.c @@ -6800,7 +6800,7 @@ static void bnx2x_set_int_mode(struct bnx2x *bp) } break; } - bp->dev->real_num_tx_queues = bp->num_tx_queues; + dev_real_num_tx_queues_set(bp->dev, bp->num_tx_queues); } static void bnx2x_set_rx_mode(struct net_device *dev); diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c index 7ea4841..a84abf3 100644 --- a/drivers/net/cxgb3/cxgb3_main.c +++ b/drivers/net/cxgb3/cxgb3_main.c @@ -1220,7 +1220,7 @@ static int cxgb_open(struct net_device *dev) "Could not initialize offload capabilities\n"); } - dev->real_num_tx_queues = pi->nqsets; + dev_real_num_tx_queues_set(dev, pi->nqsets); link_start(dev); t3_port_intr_enable(adapter, pi->port_id); netif_tx_start_all_queues(dev); diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c index 08c8014..48c530d 100644 --- 
a/drivers/net/igb/igb_main.c +++ b/drivers/net/igb/igb_main.c @@ -691,7 +691,7 @@ msi_only: adapter->flags |= IGB_FLAG_HAS_MSI; out: /* Notify the stack of the (possibly) reduced Tx Queue count. */ - adapter->netdev->real_num_tx_queues = adapter->num_tx_queues; + dev_real_num_tx_queues_set(adapter->netdev, adapter->num_tx_queues); return; } diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index 07e778d..4b4369b 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -2737,7 +2737,7 @@ static void ixgbe_set_num_queues(struct ixgbe_adapter *adapter) done: /* Notify the stack of the (possibly) reduced Tx Queue count. */ - adapter->netdev->real_num_tx_queues = adapter->num_tx_queues; + dev_real_num_tx_queues_set(adapter->netdev, adapter->num_tx_queues); } static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter, diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c index b3185bf..cb6d859 100644 --- a/drivers/net/mv643xx_eth.c +++ b/drivers/net/mv643xx_eth.c @@ -2904,7 +2904,7 @@ static int mv643xx_eth_probe(struct platform_device *pdev) mp->dev = dev; set_params(mp, pd); - dev->real_num_tx_queues = mp->txq_count; + dev_real_num_tx_queues_set(dev, mp->txq_count); if (pd->phy_addr != MV643XX_ETH_PHY_NONE) mp->phy = phy_scan(mp, pd->phy_addr); diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c index f2c4a66..bfb6a11 100644 --- a/drivers/net/myri10ge/myri10ge.c +++ b/drivers/net/myri10ge/myri10ge.c @@ -968,7 +968,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp) * RX queues, so if we get an error, first retry using a * single TX queue before giving up */ if (status != 0 && mgp->dev->real_num_tx_queues > 1) { - mgp->dev->real_num_tx_queues = 1; + dev_real_num_tx_queues_set(mgp->dev, 1); cmd.data0 = mgp->num_slices; cmd.data1 = MXGEFW_SLICE_INTR_MODE_ONE_PER_SLICE; status = myri10ge_send_cmd(mgp, @@ -3862,7 +3862,7 @@ static int myri10ge_probe(struct pci_dev 
*pdev, const struct pci_device_id *ent) dev_err(&pdev->dev, "failed to alloc slice state\n"); goto abort_with_firmware; } - netdev->real_num_tx_queues = mgp->num_slices; + dev_real_num_tx_queues_set(netdev, mgp->num_slices); status = myri10ge_reset(mgp); if (status != 0) { dev_err(&pdev->dev, "failed reset\n"); diff --git a/drivers/net/niu.c b/drivers/net/niu.c index 2b17453..a6eac3b 100644 --- a/drivers/net/niu.c +++ b/drivers/net/niu.c @@ -4501,7 +4501,7 @@ static int niu_alloc_channels(struct niu *np) np->num_rx_rings = parent->rxchan_per_port[port]; np->num_tx_rings = parent->txchan_per_port[port]; - np->dev->real_num_tx_queues = np->num_tx_rings; + dev_real_num_tx_queues_set(np->dev, np->num_tx_rings); np->rx_rings = kzalloc(np->num_rx_rings * sizeof(struct rx_ring_info), GFP_KERNEL); diff --git a/drivers/net/vxge/vxge-main.c b/drivers/net/vxge/vxge-main.c index b7f08f3..15602ab 100644 --- a/drivers/net/vxge/vxge-main.c +++ b/drivers/net/vxge/vxge-main.c @@ -3331,7 +3331,7 @@ int __devinit vxge_device_register(struct __vxge_hw_device *hldev, ndev->features |= NETIF_F_GRO; if (vdev->config.tx_steering_type == TX_MULTIQ_STEERING) - ndev->real_num_tx_queues = no_of_vpath; + dev_real_num_tx_queues_set(ndev, no_of_vpath); #ifdef NETIF_F_LLTX ndev->features |= NETIF_F_LLTX; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5a96a1a..f3939ec 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -790,6 +790,7 @@ struct net_device /* Number of TX queues currently active in device */ unsigned int real_num_tx_queues; + unsigned int rec_real_num_tx_queues; /* reciprocal value */ unsigned long tx_queue_len; /* Max frames per queue allowed */ spinlock_t tx_global_lock; @@ -1782,6 +1783,7 @@ static inline void netif_addr_unlock_bh(struct net_device *dev) extern void ether_setup(struct net_device *dev); +extern void dev_real_num_tx_queues_set(struct net_device *dev, unsigned int count); /* Support for loadable net-drivers */ extern 
struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, void (*setup)(struct net_device *), diff --git a/net/core/dev.c b/net/core/dev.c index 308a7d0..dfb8f32 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -126,6 +126,7 @@ #include <linux/in.h> #include <linux/jhash.h> #include <linux/random.h> +#include <linux/reciprocal_div.h> #include "net-sysfs.h" @@ -1735,19 +1736,28 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb) { u32 hash; - if (skb_rx_queue_recorded(skb)) { + if (skb_rx_queue_recorded(skb)) hash = skb_get_rx_queue(skb); - } else if (skb->sk && skb->sk->sk_hash) { - hash = skb->sk->sk_hash; - } else - hash = skb->protocol; + else { + if (skb->sk && skb->sk->sk_hash) + hash = skb->sk->sk_hash; + else + hash = skb->protocol; - hash = jhash_1word(hash, skb_tx_hashrnd); + hash = jhash_1word(hash, skb_tx_hashrnd); + } + return (u16) reciprocal_divide(hash, dev->rec_real_num_tx_queues); - return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32); } EXPORT_SYMBOL(skb_tx_hash); +void dev_real_num_tx_queues_set(struct net_device *dev, unsigned int count) +{ + dev->real_num_tx_queues = count; + dev->rec_real_num_tx_queues = reciprocal_value(count); +} +EXPORT_SYMBOL(dev_real_num_tx_queues_set); + static struct netdev_queue *dev_pick_tx(struct net_device *dev, struct sk_buff *skb) { @@ -4781,7 +4791,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, dev->_tx = tx; dev->num_tx_queues = queue_count; - dev->real_num_tx_queues = queue_count; + dev_real_num_tx_queues_set(dev, queue_count); dev->gso_max_size = GSO_MAX_SIZE; ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] net: skb_tx_hash() improvements 2009-05-01 8:29 ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet @ 2009-05-01 8:52 ` Eric Dumazet 2009-05-01 9:29 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01 8:52 UTC (permalink / raw)
To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

Eric Dumazet a écrit :
> David, here is the followup I promised
>
> Thanks
>
> [PATCH] net: skb_tx_hash() improvements
>
> When skb_rx_queue_recorded() is true, we don't want to use the jhash
> distribution, as the device driver told us exactly which queue was
> selected at RX time. jhash makes a statistical shuffle, but this won't
> work with only 8 different inputs.
>
> We also need to implement a true reciprocal division, so as not to
> disturb symmetric setups (when the number of tx queues matches the
> number of rx queues) and cpu affinities.
>
> This patch introduces a new helper, dev_real_num_tx_queues_set(), to
> set both real_num_tx_queues and its reciprocal value, and makes all
> drivers use this helper.

Oh well, this was wrong: I took the divide result while we want a
modulo!

Need to think a little bit more :)

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: skb_tx_hash() improvements 2009-05-01 8:52 ` Eric Dumazet @ 2009-05-01 9:29 ` Eric Dumazet 2009-05-01 16:17 ` David Miller 0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01 9:29 UTC (permalink / raw)
To: David Miller; +Cc: Andrew Dickinson, jelaas, netdev

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> David, here is the followup I promised
>>
>> Thanks
>>
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use the jhash
>> distribution, as the device driver told us exactly which queue was
>> selected at RX time. jhash makes a statistical shuffle, but this
>> won't work with only 8 different inputs.
>>
>> We also need to implement a true reciprocal division, so as not to
>> disturb symmetric setups (when the number of tx queues matches the
>> number of rx queues) and cpu affinities.
>>
>> This patch introduces a new helper, dev_real_num_tx_queues_set(), to
>> set both real_num_tx_queues and its reciprocal value, and makes all
>> drivers use this helper.
>
> Oh well, this was wrong: I took the divide result while we want a
> modulo!
>
> Need to think a little bit more :)

So no need for a true reciprocal divide, just a refinement of the first
patch (avoiding the divide when possible).

If the incoming device has 4 rx queues and the outgoing device has 8 tx
queues, only 4 of the tx queues are used. I wonder if we need some
further improvement here to better use all the available tx queues?
Probably not in generic code...

[PATCH] net: skb_tx_hash() improvement

When skb_rx_queue_recorded() is true, we don't want to use the jhash
distribution, as the device driver told us exactly which queue was
selected at RX time. jhash makes a statistical shuffle, but this won't
work with only 8 different inputs.
The same applies to the 'modulo' operation, which works only if the
inputs are random enough (i.e. use all available 32 bits).

This patch avoids the jhash computation (which costs ~50 instructions),
but might still need a modulo operation, in case the number of tx
queues is smaller than the number of rx queues.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..b3acb51 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1737,9 +1737,19 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+		/*
+		 * Try to avoid an expensive divide, for symmetric setups :
+		 * number of tx queues of output device ==
+		 * number of rx queues of incoming device
+		 */
+		if (hash >= dev->real_num_tx_queues)
+			hash %= dev->real_num_tx_queues;
+		return hash;
+	}
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);

^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] net: skb_tx_hash() improvements 2009-05-01 9:29 ` Eric Dumazet @ 2009-05-01 16:17 ` David Miller 2009-05-03 21:44 ` David Miller 0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-01 16:17 UTC (permalink / raw)
To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 11:29:54 +0200

> -	} else if (skb->sk && skb->sk->sk_hash) {
> +	/*
> +	 * Try to avoid an expensive divide, for symmetric setups :
> +	 * number of tx queues of output device ==
> +	 * number of rx queues of incoming device
> +	 */
> +	if (hash >= dev->real_num_tx_queues)
> +		hash %= dev->real_num_tx_queues;
> +	return hash;
> +	}

Subtraction in a while() loop is almost certainly a lot faster.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: skb_tx_hash() improvements 2009-05-01 16:17 ` David Miller @ 2009-05-03 21:44 ` David Miller 2009-05-04 6:12 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-03 21:44 UTC (permalink / raw)
To: dada1; +Cc: andrew, jelaas, netdev

From: David Miller <davem@davemloft.net>
Date: Fri, 01 May 2009 09:17:47 -0700 (PDT)

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 01 May 2009 11:29:54 +0200
>
>> -	} else if (skb->sk && skb->sk->sk_hash) {
>> +	/*
>> +	 * Try to avoid an expensive divide, for symmetric setups :
>> +	 * number of tx queues of output device ==
>> +	 * number of rx queues of incoming device
>> +	 */
>> +	if (hash >= dev->real_num_tx_queues)
>> +		hash %= dev->real_num_tx_queues;
>> +	return hash;
>> +	}
>
> Subtraction in a while() loop is almost certainly a lot
> faster.

To move forward on this, I've committed the following to
net-next-2.6, thanks!

net: Avoid modulus in skb_tx_hash() for forwarding case.

Based almost entirely upon a patch by Eric Dumazet.

The common case is to have num-tx-queues <= num-rx-queues,
and even if num_tx_queues is larger it will not be significantly
larger.

Therefore, a subtraction loop is always going to be faster than
modulus.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c | 8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8144295..3c8073f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,8 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb))
-		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+	if (skb_rx_queue_recorded(skb)) {
+		hash = skb_get_rx_queue(skb);
+		while (unlikely (hash >= dev->real_num_tx_queues))
+			hash -= dev->real_num_tx_queues;
+		return hash;
+	}
 
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-- 
1.6.2.4

^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] net: skb_tx_hash() improvements 2009-05-03 21:44 ` David Miller @ 2009-05-04 6:12 ` Eric Dumazet 0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-05-04 6:12 UTC (permalink / raw)
To: David Miller; +Cc: andrew, jelaas, netdev

David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Fri, 01 May 2009 09:17:47 -0700 (PDT)
>
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 01 May 2009 11:29:54 +0200
>>
>>> -	} else if (skb->sk && skb->sk->sk_hash) {
>>> +	/*
>>> +	 * Try to avoid an expensive divide, for symmetric setups :
>>> +	 * number of tx queues of output device ==
>>> +	 * number of rx queues of incoming device
>>> +	 */
>>> +	if (hash >= dev->real_num_tx_queues)
>>> +		hash %= dev->real_num_tx_queues;
>>> +	return hash;
>>> +	}
>> Subtraction in a while() loop is almost certainly a lot
>> faster.
>
> To move forward on this, I've committed the following to
> net-next-2.6, thanks!
>
> net: Avoid modulus in skb_tx_hash() for forwarding case.
>
> Based almost entirely upon a patch by Eric Dumazet.
>
> The common case is to have num-tx-queues <= num-rx-queues,
> and even if num_tx_queues is larger it will not be significantly
> larger.
>
> Therefore, a subtraction loop is always going to be faster than
> modulus.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> net/core/dev.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8144295..3c8073f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1735,8 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
> {
> 	u32 hash;
>
> -	if (skb_rx_queue_recorded(skb))
> -		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +	if (skb_rx_queue_recorded(skb)) {
> +		hash = skb_get_rx_queue(skb);
> +		while (unlikely (hash >= dev->real_num_tx_queues))
> +			hash -= dev->real_num_tx_queues;
> +		return hash;
> +	}
>
> 	if (skb->sk && skb->sk->sk_hash)
> 		hash = skb->sk->sk_hash;

Yes, I checked that the compiler did not use a divide instruction here
(I remember it did on a similar loop in the kernel, related to time).

Thank you

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 6:14 ` Eric Dumazet 2009-05-01 6:19 ` Andrew Dickinson 2009-05-01 8:29 ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet @ 2009-05-01 16:08 ` David Miller 2009-05-01 16:48 ` Eric Dumazet 2 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2009-05-01 16:08 UTC (permalink / raw)
To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 08:14:03 +0200

> [PATCH] net: skb_tx_hash() improvements
>
> When skb_rx_queue_recorded() is true, we don't want to use the jhash
> distribution, as the device driver told us exactly which queue was
> selected at RX time. jhash makes a statistical shuffle, but this won't
> work with 8 static inputs.
>
> A later improvement would be to compute the reciprocal value of
> real_num_tx_queues to avoid a divide here. But this computation should
> be done once, when real_num_tx_queues is set. This needs a separate
> patch, and a new field in struct net_device.
>
> Reported-by: Andrew Dickinson <andrew@whydna.net>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Applied, except that I changed the commit message header line to better
reflect that this is in fact a bug fix.

BTW, you don't need the reciprocal when num-tx-queues <= num-rx-queues
(you can just use the RX queue recording as the hash, straight), and
that's the kind of check I intended to add to net-2.6 had you not
beaten me to this patch.

Also, thanks for giving me absolutely no credit for this whole thing
in your commit message. I know I do that to you all the time :-/ How
can you forget so quickly that I'm the one who suggested the exact
code change for Andrew to test in the first place?

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 16:08 ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller @ 2009-05-01 16:48 ` Eric Dumazet 2009-05-01 17:22 ` David Miller 0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2009-05-01 16:48 UTC (permalink / raw)
To: David Miller; +Cc: andrew, jelaas, netdev

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 01 May 2009 08:14:03 +0200
>
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use the jhash
>> distribution, as the device driver told us exactly which queue was
>> selected at RX time. jhash makes a statistical shuffle, but this
>> won't work with 8 static inputs.
>>
>> A later improvement would be to compute the reciprocal value of
>> real_num_tx_queues to avoid a divide here. But this computation
>> should be done once, when real_num_tx_queues is set. This needs a
>> separate patch, and a new field in struct net_device.
>>
>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> Applied, except that I changed the commit message header line to
> better reflect that this is in fact a bug fix.
>
> BTW, you don't need the reciprocal when num-tx-queues <= num-rx-queues
> (you can just use the RX queue recording as the hash, straight), and
> that's the kind of check I intended to add to net-2.6 had you not
> beaten me to this patch.
>
> Also, thanks for giving me absolutely no credit for this whole thing
> in your commit message. I know I do that to you all the time :-/ How
> can you forget so quickly that I'm the one who suggested the exact
> code change for Andrew to test in the first place?

Hoho, your Honor, I am totally guilty and sorry; sometimes I think I am
David Miller, silly me! :)

I am not fighting for credit or whatever, certainly not with you.

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-05-01 16:48 ` Eric Dumazet @ 2009-05-01 17:22 ` David Miller 0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2009-05-01 17:22 UTC (permalink / raw)
To: dada1; +Cc: andrew, jelaas, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 01 May 2009 18:48:30 +0200

> Hoho, your Honor, I am totally guilty and sorry; sometimes I think I am
> David Miller, silly me! :)
>
> I am not fighting for credit or whatever, certainly not with you.

Great, just making sure it wasn't intentional :-)

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe) 2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson 2009-04-30 9:07 ` Jens Låås @ 2009-05-01 10:20 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2009-05-01 10:20 UTC (permalink / raw)
To: Andrew Dickinson; +Cc: netdev

Interesting thread, Andrew. I'm also doing some 10G routing performance
testing, but using Sun Neptune (niu) and SMC 10G XFP (sfc) NICs. I'm
using pktgen for testing, but it sounds interesting that you have Ixia
test equipment -- nice.

On Wed, 29 Apr 2009, Andrew Dickinson wrote:

> I'm trying to evaluate a new system for routing performance for some
> custom packet modification that we do. To start, I'm trying to get a
> high-water mark of routing performance without our custom cruft in the
> middle. The hardware setup is a dual-package Nehalem box (X5550,
> Hyper-Threading disabled) with a dual 10G intel card (pci-id:
> 8086:10fb). Because this NIC is freakishly new, I'm running the
> latest torvalds kernel in order to get the ixgbe driver to identify it
> (<sigh>).

Is that the Intel 82599 10GbE chip? Where did you get/buy that NIC?

> Interrupts...
> I've disabled irqbalance and I'm explicitly pinning interrupts, one
> per core, as follows:

I'm doing the same... I find that keeping the RX and TX queue pinned to
the same CPU is essential, together with a patch that controls the
mapping between RX and TX queues. But with Eric's patch it looks like I
can drop my own patch :-)

If I don't do RX-to-TX mapping, then oprofile shows that we use too
much time freeing the skbs, naturally due to cache bounces.

> -bash-3.2# for x in 57 65; do for i in `seq 0 7`; do echo $i | awk
> '{printf("%X", (2^$1));}' > /proc/irq/$(($i + $x))/smp_affinity; done;
> done

Keep up the good work!

Hilsen
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2009-05-04 6:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-29 23:00 tx queue hashing hot-spots and poor performance (multiq, ixgbe) Andrew Dickinson
2009-04-30 9:07 ` Jens Låås
2009-04-30 9:24 ` David Miller
2009-04-30 10:51 ` Jens Låås
2009-04-30 11:05 ` David Miller
2009-04-30 14:04 ` Andrew Dickinson
2009-04-30 14:08 ` David Miller
2009-04-30 23:53 ` Andrew Dickinson
2009-05-01 4:19 ` Andrew Dickinson
2009-05-01 7:32 ` Eric Dumazet
2009-05-01 7:47 ` Eric Dumazet
2009-05-01 6:14 ` Eric Dumazet
2009-05-01 6:19 ` Andrew Dickinson
2009-05-01 6:40 ` Eric Dumazet
2009-05-01 7:23 ` Andrew Dickinson
2009-05-01 7:31 ` Eric Dumazet
2009-05-01 7:34 ` Andrew Dickinson
2009-05-01 21:37 ` Brandeburg, Jesse
2009-05-01 8:29 ` [PATCH] net: skb_tx_hash() improvements Eric Dumazet
2009-05-01 8:52 ` Eric Dumazet
2009-05-01 9:29 ` Eric Dumazet
2009-05-01 16:17 ` David Miller
2009-05-03 21:44 ` David Miller
2009-05-04 6:12 ` Eric Dumazet
2009-05-01 16:08 ` tx queue hashing hot-spots and poor performance (multiq, ixgbe) David Miller
2009-05-01 16:48 ` Eric Dumazet
2009-05-01 17:22 ` David Miller
2009-05-01 10:20 ` Jesper Dangaard Brouer