Receive side performance issue with multi-10-GigE and NUMA

From: Bill Fink @ 2009-08-07 21:06 UTC (permalink / raw)
  To: Linux Network Developers; +Cc: brice, gallatin

I've run into a major receive side performance issue with multi-10-GigE
on a NUMA system.  The system uses a SuperMicro X8DAH+-F motherboard
with two 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
1333 MHz DDR3 memory.  It runs Fedora 10 with the latest 2.6.29.6
kernel from Fedora 11 (I originally tried the 2.6.27.29 kernel from
Fedora 10).

The test setup is:

	i7test1----(6)----xeontest1----(6)----i7test2
	         10-GigE             10-GigE

So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total
of 12 10-GigE interfaces.  eth2 through eth7 (which are on the
second Intel 5520 I/O Hub) are connected to i7test1 while
eth8 through eth13 (which are on the first Intel 5520 I/O Hub)
are connected to i7test2.

Previous direct testing between i7test1 and i7test2 (which use an
Asus P6T6 WS Revolution motherboard) demonstrated that they could
achieve ~70 Gbps performance for either transmit or receive using
8 10-GigE interfaces.

The transmit side performance of xeontest1 is fantastic:

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -xc6/3 -p5012 192.168.12.11 &
n12:  9648.0522 MB /  10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
n9: 11130.5320 MB /  10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
n11:  9418.1250 MB /  10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT
n10:  9279.4758 MB /  10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT
n8: 11142.6574 MB /  10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT
n13:  9422.1492 MB /  10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT
n3: 11471.2500 MB /  10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT
n6:  9339.6354 MB /  10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT
n4:  9093.2500 MB /  10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT
n5:  9121.8367 MB /  10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT
n7:  9292.2500 MB /  10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT
n2: 11487.1150 MB /  10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT

Aggregate performance:			100.4637 Gbps
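
The aggregate figure is simply the sum of the per-stream rates.  A
minimal Python sketch of that arithmetic follows, assuming the nuttcp
result line format shown above ("= <rate> Mbps").  The names
aggregate_gbps and RATE_RE are illustrative only and are not part of
nuttcp.

```python
import re

# Matches the rate field of a nuttcp result line, e.g. "... = 8091.4066 Mbps ..."
RATE_RE = re.compile(r"=\s*([0-9]+(?:\.[0-9]+)?)\s+Mbps")


def aggregate_gbps(lines):
    """Sum the per-stream Mbps rates and return the total in Gbps."""
    total_mbps = 0.0
    for line in lines:
        match = RATE_RE.search(line)
        if match:  # skip prompts, blank lines, anything without a rate
            total_mbps += float(match.group(1))
    return total_mbps / 1000.0


# Two of the transmit result lines from the run above.
sample = [
    "n12:  9648.0522 MB /  10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT",
    "n9: 11130.5320 MB /  10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT",
]
print(f"{aggregate_gbps(sample):.4f} Gbps")
```

Feeding it all twelve result lines of a run should reproduce the
aggregate reported for that run, to within rounding in the last digit.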

The problem is with the receive side performance.

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n11:  6983.6359 MB /  10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT
n10:  7000.1557 MB /  10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT
n9:  2451.7206 MB /  10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT
n13:  2453.0887 MB /  10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT
n12:  2446.5303 MB /  10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT
n8:  2462.5890 MB /  10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT
n4:  2763.5091 MB /  10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT
n5:  2770.0887 MB /  10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT
n2:  1777.7277 MB /  10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT
n6:  1772.7962 MB /  10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT
n3:  1779.4535 MB /  10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT
n7:  1770.8359 MB /  10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT

Aggregate performance:			29.9492 Gbps

I suspected that this was because the memory being allocated by the
myri10ge driver was not being allocated on the optimum NUMA node.
Incidentally, the NUMA nodes on this system are numbered 0 and 2 rather
than the 0 and 1 I would have expected, but this is my first experience
with a NUMA system.

Based upon a patch by Peter Zijlstra that I discovered through Google
searching, I tried patching the myri10ge driver to allocate its memory
pages with alloc_pages_node() instead of alloc_pages(), specifying the
NUMA node of the parent device of the Myricom 10-GigE device, which
IIUC should be the PCIe switch.  This didn't help.

This could be because of something I discovered when I ran:

	find /sys -name numa_node -exec grep . {} /dev/null \;

The numa_node associated with all the PCI devices was always 0, whereas
IIUC some of the PCI devices should have been associated with NUMA
node 2.  Perhaps this is what is causing all the memory pages allocated
by the myri10ge driver to be on NUMA node 0, and thus causing the major
performance issue.
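
To narrow the check to the network interfaces only, the same sysfs
attribute can be read per interface.  This is a minimal Python sketch,
assuming the usual Linux layout in which
/sys/class/net/<iface>/device/numa_node exists for PCI-backed
interfaces.  The function names are illustrative.

```python
from pathlib import Path


def parse_numa_node(text):
    """Convert the contents of a sysfs numa_node file to an int (-1 means unknown)."""
    return int(text.strip())


def nic_numa_nodes(sysfs_net="/sys/class/net"):
    """Map each interface name to the numa_node reported for its underlying device."""
    nodes = {}
    root = Path(sysfs_net)
    if not root.is_dir():
        return nodes
    for iface in sorted(root.iterdir()):
        node_file = iface / "device" / "numa_node"
        try:
            nodes[iface.name] = parse_numa_node(node_file.read_text())
        except (OSError, ValueError):
            continue  # e.g. virtual interfaces such as lo have no device directory
    return nodes


if __name__ == "__main__":
    for name, node in nic_numa_nodes().items():
        print(f"{name}: numa_node={node}")
```

On a system showing the behavior described above, every interface would
be listed with numa_node=0 regardless of which I/O hub it sits behind.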

To kludge around this, I made a different patch to the myri10ge driver.
This time I hardcoded the NUMA node in the call to alloc_pages_node()
to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
This is of course tied to our particular system (its NUMA node ids
and Myricom 10-GigE device IRQs), and is not something that would be
generically applicable.  But it was useful as a test, and it did
improve the receive side performance substantially!
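
The selection rule in that test patch is small enough to state as a
lookup.  The sketch below is in Python purely to make the mapping
explicit.  The actual change was made in the driver's C code around the
alloc_pages_node() call and is not reproduced here.  The names
node_for_irq and IRQ_RANGES_TO_NODE are hypothetical.

```python
# IRQ ranges and node ids are the ones quoted above; they apply to this
# one machine only and are not meant to be portable.
IRQ_RANGES_TO_NODE = [
    (range(113, 119), 2),  # IRQs 113-118: eth2 through eth7
    (range(119, 125), 0),  # IRQs 119-124: eth8 through eth13
]


def node_for_irq(irq):
    """Return the hardcoded NUMA node for a NIC IRQ, or None if it is not listed."""
    for irqs, node in IRQ_RANGES_TO_NODE:
        if irq in irqs:
            return node
    return None
```

Returning None for an unlisted IRQ is a choice made for this sketch.
The message does not say what the patch did for other devices.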

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n5:  8221.2911 MB /  10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT
n4:  8237.9524 MB /  10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT
n11:  7935.3750 MB /  10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT
n2:  4543.1621 MB /  10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT
n10:  7916.3925 MB /  10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT
n7:  4558.4817 MB /  10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT
n13:  4390.1875 MB /  10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT
n3:  4572.6478 MB /  10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT
n6:  4564.4776 MB /  10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT
n8:  4409.8551 MB /  10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT
n9:  4412.7836 MB /  10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT
n12:  4413.4061 MB /  10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT

Aggregate performance:			56.4703 Gbps

This was nearly double (about 1.9 times) the previous receive side
performance without the patch.

I don't know if this is fundamentally a myri10ge driver issue or
some underlying Linux kernel issue, so it's not clear to me what
a proper fix would be.

Finally, while definitely a major improvement, I think it should be
possible to do even better, since we achieved 70 Gbps in the i7 to i7
tests, and probably could have done 80 Gbps except for an Asus
motherboard restriction with the interconnect between the Intel X58
and Nvidia NF200 chips.  It's definitely a big step in the right
direction though if this issue can be resolved.

Any help would be greatly appreciated.

						-Thanks

						-Bill
