* Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-07 21:06 UTC
  To: Linux Network Developers; +Cc: brice, gallatin

I've run into a major receive side performance issue with multi-10-GigE
on a NUMA system.  The system is using a SuperMicro X8DAH+-F motherboard
with 2 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
1333 MHz DDR3 memory.  It is a Fedora 10 system but using the latest
2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel
from Fedora 10).

The test setup is:

	i7test1----(6)----xeontest1----(6)----i7test2
	         10-GigE             10-GigE

So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total
of 12 10-GigE interfaces.  eth2 through eth7 (which are on the
second Intel 5520 I/O Hub) are connected to i7test1 while
eth8 through eth13 (which are on the first Intel 5520 I/O Hub)
are connected to i7test2.

Previous direct testing between i7test1 and i7test2 (which use an
Asus P6T6 WS Revolution motherboard) demonstrated that they could
achieve ~70 Gbps performance for either transmit or receive using
8 10-GigE interfaces.

The transmit side performance of xeontest1 is fantastic:

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 &
    numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 &
    numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 &
    numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 &
    nuttcp -In8 -xc0/0 -p5007 192.168.7.11 &
    nuttcp -In9 -xc2/0 -p5008 192.168.8.11 &
    nuttcp -In10 -xc4/1 -p5009 192.168.9.11 &
    nuttcp -In11 -xc6/1 -p5010 192.168.10.11 &
    numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 &
    numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 &
    nuttcp -In12 -xc4/2 -p5011 192.168.11.11 &
    nuttcp -In13 -xc6/3 -p5012 192.168.12.11 &
n12:  9648.0522 MB /  10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
n9: 11130.5320 MB /  10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
n11:  9418.1250 MB /  10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT
n10:  9279.4758 MB /  10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT
n8: 11142.6574 MB /  10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT
n13:  9422.1492 MB /  10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT
n3: 11471.2500 MB /  10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT
n6:  9339.6354 MB /  10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT
n4:  9093.2500 MB /  10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT
n5:  9121.8367 MB /  10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT
n7:  9292.2500 MB /  10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT
n2: 11487.1150 MB /  10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT

Aggregate performance:			100.4637 Gbps
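
The aggregate figure is the sum of the per-stream Mbps values. As a rough sketch of that arithmetic (the function name `sum_nuttcp_gbps` is made up for illustration, and the field layout is assumed to match the nuttcp result lines shown above), the lines can be piped through awk:

```shell
# Sum the per-stream throughput from nuttcp result lines read on stdin
# and print the aggregate in Gbps. The throughput is taken to be the
# field immediately before the literal "Mbps" token on each line.
sum_nuttcp_gbps() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "Mbps") total += $(i - 1)
    }
    END { printf "%.4f Gbps\n", total / 1000 }'
}
```

Feeding it the twelve transmit result lines above should give a total of roughly 100.46 Gbps; the last printed digit may differ from a hand calculation because of rounding.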

The problem is with the receive side performance.

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 &
    numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 &
    numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 &
    numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 &
    nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 &
    nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 &
    nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 &
    nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 &
    numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 &
    numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 &
    nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 &
    nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n11:  6983.6359 MB /  10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT
n10:  7000.1557 MB /  10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT
n9:  2451.7206 MB /  10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT
n13:  2453.0887 MB /  10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT
n12:  2446.5303 MB /  10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT
n8:  2462.5890 MB /  10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT
n4:  2763.5091 MB /  10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT
n5:  2770.0887 MB /  10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT
n2:  1777.7277 MB /  10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT
n6:  1772.7962 MB /  10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT
n3:  1779.4535 MB /  10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT
n7:  1770.8359 MB /  10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT

Aggregate performance:			29.9492 Gbps

I suspected that this was because the memory being allocated by the
myri10ge driver was not being allocated on the optimum NUMA node.
BTW the NUMA nodes on the system are 0 and 2 instead of 0 and 1 which
is what I would have expected, but this is my first experience with
a NUMA system.

Based upon a patch by Peter Zijlstra that I discovered through Google
searching, I tried patching the myri10ge driver to change its page
allocation from alloc_pages() to alloc_pages_node(), specifying the
NUMA node of the parent device of the Myricom 10-GigE device, which
IIUC should be the PCIe switch.  This didn't help.

This could be because I discovered, by running:

	find /sys -name numa_node -exec grep . {} /dev/null \;

that the numa_node associated with all the PCI devices was always 0,
and IIUC some of the PCI devices should have been
associated with NUMA node 2.  Perhaps this is what is causing all
the memory pages allocated by the myri10ge driver to be on NUMA
node 0, and thus causing the major performance issue.
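
For a more compact view of the same information, here is a minimal sketch that prints numa_node next to local_cpus for each PCI device. It assumes the standard `/sys/bus/pci/devices` layout; the function name `pci_numa_report` and its optional directory argument are illustrative only (the argument exists so the function can be pointed at a test tree).

```shell
# Print "<device> numa_node=<n> local_cpus=<mask>" for every PCI device.
# $1 (optional): directory holding per-device sysfs entries;
# defaults to /sys/bus/pci/devices.
pci_numa_report() {
    root=${1:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        # Skip entries that do not expose a numa_node attribute.
        [ -r "$dev/numa_node" ] || continue
        printf '%s numa_node=%s local_cpus=%s\n' \
            "${dev##*/}" \
            "$(cat "$dev/numa_node")" \
            "$(cat "$dev/local_cpus" 2>/dev/null)"
    done
}
```

On a system where locality is reported correctly, devices behind different I/O hubs would be expected to show different numa_node and local_cpus values.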

To kludge around this, I made a different patch to the myri10ge driver.
This time I hardcoded the NUMA node in the call to alloc_pages_node()
to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
This is of course very specific to our system (NUMA node IDs
and Myricom 10-GigE device IRQs), and is not something that would be
generically applicable.  But it was useful as a test, and it did
improve the receive side performance substantially!

[root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 &
    numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 &
    numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 &
    numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 &
    nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 &
    nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 &
    nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 &
    nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 &
    numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 &
    numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 &
    nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 &
    nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
n5:  8221.2911 MB /  10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT
n4:  8237.9524 MB /  10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT
n11:  7935.3750 MB /  10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT
n2:  4543.1621 MB /  10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT
n10:  7916.3925 MB /  10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT
n7:  4558.4817 MB /  10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT
n13:  4390.1875 MB /  10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT
n3:  4572.6478 MB /  10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT
n6:  4564.4776 MB /  10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT
n8:  4409.8551 MB /  10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT
n9:  4412.7836 MB /  10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT
n12:  4413.4061 MB /  10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT

Aggregate performance:			56.4703 Gbps

This is nearly double (about 1.9x) the receive side performance
measured without the patch.

I don't know if this is fundamentally a myri10ge driver issue or
some underlying Linux kernel issue, so it's not clear to me what
a proper fix would be.

Finally, while definitely a major improvement, I think it should be
possible to do even better, since we achieved 70 Gbps in the i7 to i7
tests, and probably could have done 80 Gbps except for an Asus
motherboard restriction with the interconnect between the Intel X58
and Nvidia NF200 chips.  It's definitely a big step in the right
direction though if this issue can be resolved.

Any help greatly appreciated in advance.

						-Thanks

						-Bill

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Brice Goglin @ 2009-08-07 21:18 UTC
  To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
> This could be because I discovered that if I did:
>
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
>
> that the numa_node associated with all the PCI devices was always 0,
> and if IIUC then I believe some of the PCI devices should have been
> associated with NUMA node 2.  Perhaps this is what is causing all
> the memory pages allocated by the myri10ge driver to be on NUMA
> node 0, and thus causing the major performance issue.
>   

I've seen some cases in the past where numa_node was always 0 on
quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
later kernels thanks to patches from Yinghai Lu (CC'ed).
Is the corresponding local_cpus sysfs file wrong as well ?
Maybe your kernel doesn't properly handle the NUMA location of PCI
devices on Nehalem machines yet?

Brice


* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-07 21:51 UTC
  To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Fri, 07 Aug 2009, Brice Goglin wrote:

> Bill Fink wrote:
> > This could be because I discovered that if I did:
> >
> > 	find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > that the numa_node associated with all the PCI devices was always 0,
> > and if IIUC then I believe some of the PCI devices should have been
> > associated with NUMA node 2.  Perhaps this is what is causing all
> > the memory pages allocated by the myri10ge driver to be on NUMA
> > node 0, and thus causing the major performance issue.
> >   
> 
> I've seen some cases in the past where numa_node was always 0 on
> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
> later kernels thanks to patches from Yinghai Lu (CC'ed).

By later kernels do you mean 2.6.30 or 2.6.31?

> Is the corresponding local_cpus sysfs file wrong as well ?

All sysfs local_cpus values are the same (00000000,000000ff),
so yes they are also wrong.

> Maybe your kernel doesn't properly handle the NUMA location of PCI
> devices on Nehalem machines yet?

I assume so, unless there's some secret NUMA system setting that
I'm unaware of that would affect this and needs changing for my
setup.

						-Thanks

						-Bill

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Brice Goglin @ 2009-08-07 21:53 UTC
  To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
>> I've seen some cases in the past where numa_node was always 0 on
>> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
>> later kernels thanks to patches from Yinghai Lu (CC'ed).
>>     
>
> By later kernels do you mean 2.6.30 or 2.6.31?
>   

No, I meant "later than when the problem occurred". I was using 2.6.22 at
that point and the problem was fixed somewhere around 2.6.25.

>> Is the corresponding local_cpus sysfs file wrong as well ?
>>     
>
> All sysfs local_cpus values are the same (00000000,000000ff),
> so yes they are also wrong.
>   

And hyperthreading is enabled, right?

Brice


* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-07 22:08 UTC
  To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Fri, 07 Aug 2009, Brice Goglin wrote:

> Bill Fink wrote:
> >> I've seen some cases in the past where numa_node was always 0 on
> >> quad-Opteron machines with a PCI bus on node 1. IIRC it got fixed in
> >> later kernels thanks to patches from Yinghai Lu (CC'ed).
> >
> > By later kernels do you mean 2.6.30 or 2.6.31?
> 
> No, I meant "later than when the problem occurred". I was using 2.6.22 at
> this point and the problem was fixed somewhere around 2.6.25.

OK.  The tests were run on a 2.6.29.6 kernel so presumably should
have included the fix you mentioned.

> >> Is the corresponding local_cpus sysfs file wrong as well ?
> >
> > All sysfs local_cpus values are the same (00000000,000000ff),
> > so yes they are also wrong.
> 
> And hyperthreading is enabled, right?

No, hyperthreading is disabled.  It's a dual quad-core system so there
are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node 2.

						-Thanks

						-Bill

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Neil Horman @ 2009-08-07 22:12 UTC
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 07, 2009 at 05:06:00PM -0400, Bill Fink wrote:
> I've run into a major receive side performance issue with multi-10-GigE
> on a NUMA system.  The system is using a SuperMicro X8DAH+-F motherboard
> with 2 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
> 1333 MHz DDR3 memory.  It is a Fedora 10 system but using the latest
> 2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel
> from Fedora 10).

Your timing is impeccable!  I just posted a patch for an ftrace module to help
detect just these kinds of conditions:
http://marc.info/?l=linux-netdev&m=124967650218846&w=2

Hope that helps you out
Neil


* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Brice Goglin @ 2009-08-07 22:17 UTC
  To: Bill Fink; +Cc: Linux Network Developers, Yinghai Lu, gallatin

Bill Fink wrote:
> OK.  The tests were run on a 2.6.29.6 kernel so presumably should
> have included the fix you mentioned.
>   

Yes, but I wanted to emphasize that new platforms sometimes need new
code to handle this kind of thing. Some Nehalem-specific changes might
be needed now.

>>>> Is the corresponding local_cpus sysfs file wrong as well ?
>>>>         
>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>> so yes they are also wrong.
>>>       
>> And hyperthreading is enabled, right?
>>     
>
> No, hyperthreading is disabled.  It's a dual quad-core system so there
> are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node2.
>   

So numa_node says that the device is close to node 0 while local_cpus
says that it's close to all 8 cores, i.e. close to both node 0 and node 2
(which may well be wrong as well).

Brice


* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-07 22:55 UTC
  To: Brice Goglin; +Cc: Linux Network Developers, Yinghai Lu, gallatin

On Sat, 08 Aug 2009, Brice Goglin wrote:

> Bill Fink wrote:
> > OK.  The tests were run on a 2.6.29.6 kernel so presumably should
> > have included the fix you mentioned.
> 
> Yes, but I wanted to emphasize that new platforms sometime need some new
> code to handle this kind of things. Some Nehalem-specific changes might
> be needed now.

Thanks for the clarification.

> >>>> Is the corresponding local_cpus sysfs file wrong as well ?
> >>>>         
> >>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>> so yes they are also wrong.
> >>>       
> >> And hyperthreading is enabled, right?
> >>     
> >
> > No, hyperthreading is disabled.  It's a dual quad-core system so there
> > are a total of 8 cores, 4 on NUMA node 0 and 4 on NUMA node2.
> 
> So numa_node says that the device is close to node 0 while local_cpus
> says that it's close to all 8 cores ie close to both node0 and node2
> (which may well be wrong as well).

I believe it is wrong.  The basic system architecture is:

      Memory----CPU1----QPI----CPU2----Memory
                  |              |
                  |              |
                 QPI            QPI
                  |              |
                  |              |
                5520----QPI----5520
                ||||           ||||
                ||||           ||||
                ||||           ||||
                PCIe           PCIe

There are 2 x8, 1 x16, and 1 x4 PCIe 2.0 interfaces on each of the
Intel 5520 I/O Hubs.  The Myricom dual-port 10-GigE NICs are in the
six x8 or better slots.  eth2 through eth7 are on the second
Intel 5520 I/O Hub, so they should presumably show up on NUMA node 2,
and have local CPUs 1, 3, 5, and 7.  eth8 through eth13 are on the
first Intel 5520 I/O Hub, and thus should be on NUMA node 0 with
local CPUs 0, 2, 4, and 6 (CPU info derived from /proc/cpuinfo).
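
To make the expected local_cpus values concrete, the following sketch turns a list of CPU numbers into the hexadecimal bitmask form that sysfs uses. The helper name `cpus_to_mask` is made up for illustration, and it prints only the low 32-bit word of the mask (sysfs shows additional comma-separated words).

```shell
# Build a hexadecimal CPU bitmask from a list of CPU numbers:
# bit N is set when CPU N is in the list.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%08x\n' "$mask"
}

cpus_to_mask 1 3 5 7            # prints 000000aa (expected for eth2-eth7)
cpus_to_mask 0 2 4 6            # prints 00000055 (expected for eth8-eth13)
cpus_to_mask 0 1 2 3 4 5 6 7    # prints 000000ff (what sysfs reports for all)
```

So the reported 000000ff corresponds to all eight cores, rather than the 000000aa or 00000055 that the topology above would suggest.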

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-08 00:54 UTC
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 7 Aug 2009, Neil Horman wrote:

> Your timing is impeccable!  I just posted a patch for an ftrace module to help
> detect just these kinds of conditions:
> http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> 
> Hope that helps you out
> Neil

Thanks!  It could be helpful.  Do you have a pointer to documentation
on how to use it?  And does it require the latest GIT kernel or could
it possibly be used with a 2.6.29.6 kernel?

						-Bill

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Andrew Gallatin @ 2009-08-08 01:03 UTC
  To: Bill Fink; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink wrote:

> All sysfs local_cpus values are the same (00000000,000000ff),
> so yes they are also wrong.

How were you handling IRQ binding?  If local_cpus is wrong,
irqbalance will not be able to make good decisions about
where to bind the NICs' IRQs.  Did you try manually binding
each NIC's interrupt to a separate CPU on the correct node?

Regards,

Drew

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Bill Fink @ 2009-08-08 01:35 UTC
  To: Andrew Gallatin; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

On Fri, 07 Aug 2009, Andrew Gallatin wrote:

> Bill Fink wrote:
> 
> > All sysfs local_cpus values are the same (00000000,000000ff),
> > so yes they are also wrong.
> 
> How were you handling IRQ binding?  If local_cpus is wrong,
> irqbalance will not be able to make good decisions about
> where to bind the NICs' IRQs.  Did you try manually binding
> each NIC's interrupt to a separate CPU on the correct node?

Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
and the nuttcp application had its CPU affinity set to the same
CPU with its memory affinity bound to the same local NUMA node.
And the irqbalance daemon wasn't running.
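
For readers who want to reproduce this kind of manual binding, here is a hedged sketch of pinning one IRQ to one CPU through `/proc/irq/<irq>/smp_affinity`. The function name `bind_irq_to_cpu` and the IRQ-to-CPU pairing in the usage note are illustrative assumptions, not the exact commands used for these tests; the optional third argument exists only so the function can be exercised against a scratch directory.

```shell
# Pin an IRQ to a single CPU by writing a one-bit hex mask to its
# smp_affinity file. Needs root when run against the real /proc.
# $1: IRQ number, $2: CPU number, $3 (optional): proc root, default /proc.
bind_irq_to_cpu() {
    irq=$1
    cpu=$2
    proc_root=${3:-/proc}
    printf '%x\n' "$((1 << cpu))" > "$proc_root/irq/$irq/smp_affinity"
}
```

For example, `bind_irq_to_cpu 113 1` would write the mask 2 (CPU 1 only) for IRQ 113. irqbalance has to be stopped first, as it was here, or it may rewrite the affinity.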

						-Thanks

						-Bill

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Neil Horman @ 2009-08-08 01:56 UTC
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> On Fri, 7 Aug 2009, Neil Horman wrote:
> 
> > Your timing is impeccable!  I just posted a patch for an ftrace module to help
> > detect just these kinds of conditions:
> > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > 
> > Hope that helps you out
> > Neil
> 
> Thanks!  It could be helpful.  Do you have a pointer to documentation
> on how to use it?  And does it require the latest GIT kernel or could
> it possibly be used with a 2.6.29.6 kernel?
> 
> 						-Bill
> 

It should apply to 2.6.29.6 no problem (might take a little massaging, but not
much).

No docs I'm afraid (sorry, I'm horrible about that)

Using it is easy though:

1) Patch, build and boot the kernel (make sure CONFIG_SKB_SOURCES_TRACER is
enabled, along with the other requisite FTRACE options)

2) mount -t debugfs nodev /sys/kernel/debug

3) cd /sys/kernel/debug/tracing

4) echo skb_sources > ./current_tracer

5) echo 1 > trace

6) cat ./trace

Step 5 clears the trace buffer.  Step 6 gives you a list like this:


PID	ANID	CNID	RXQ	CCPU	LEN


Where:
PID - The process receiving an skb
ANID - The node on which the skb being received was allocated
CNID - The node on which the process was running when it read this skb
RXQ - The NIC receive queue that received this skb
CCPU - The CPU the process was running on when it read the skb in question
LEN - The length of the skb being received

Each entry in the list denotes a unique skb (obviously), and with a clever awk
script you can identify which nodes each process in your system is receiving
frames from, so that you can use numactl or taskset to bias that process to run
on the same node's cpus.

Note that step (6) will show a larger list each time you cat that file, as trace
records aren't removed during a read.  Step 5 is what actually clears the trace
buffer and resets the list length to zero.
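
As a rough illustration of the kind of post-processing described above, here
is a minimal Python sketch (standing in for the awk script) that tallies, per
PID, how many skbs were allocated on one node and consumed on another.  It
assumes each trace record is six whitespace-separated integers in the column
order shown (PID ANID CNID RXQ CCPU LEN); the real tracer output may carry
extra header or comment lines, which this sketch simply skips.  The function
names and the sample data are made up for the example.

```python
from collections import Counter, defaultdict


def tally_skb_nodes(trace_text):
    """Return {pid: Counter({(anid, cnid): count})} for every parsable record."""
    per_pid = defaultdict(Counter)
    for line in trace_text.splitlines():
        fields = line.split()
        # Assumption: a record is exactly six integer columns.
        if len(fields) != 6:
            continue
        try:
            pid, anid, cnid, _rxq, _ccpu, _length = (int(f) for f in fields)
        except ValueError:
            continue  # header row or other non-numeric line
        per_pid[pid][(anid, cnid)] += 1
    return per_pid


def cross_node_fraction(counts):
    """Fraction of skbs whose allocation node differs from the consuming node."""
    total = sum(counts.values())
    remote = sum(n for (anid, cnid), n in counts.items() if anid != cnid)
    return remote / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample records, not output from a real run.
    sample = (
        "PID\tANID\tCNID\tRXQ\tCCPU\tLEN\n"
        "4242\t0\t0\t0\t2\t9000\n"
        "4242\t0\t1\t0\t5\t9000\n"
        "4243\t1\t1\t3\t4\t1500\n"
    )
    for pid, counts in sorted(tally_skb_nodes(sample).items()):
        print(pid, dict(counts), f"{cross_node_fraction(counts):.0%} cross-node")
```

A process with a high cross-node fraction is a candidate for rebinding with
numactl or taskset.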

Hope that helps. Please feel free to email me if you have any questions.

Regards
Neil


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08  1:35       ` Bill Fink
@ 2009-08-08 11:08         ` Andrew Gallatin
  2009-08-08 11:26           ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Gallatin @ 2009-08-08 11:08 UTC (permalink / raw)
  To: Bill Fink; +Cc: Brice Goglin, Linux Network Developers, Yinghai Lu

Bill Fink wrote:
 > On Fri, 07 Aug 2009, Andrew Gallatin wrote:
 >
 >> Bill Fink wrote:
 >>
 >>> All sysfs local_cpus values are the same (00000000,000000ff),
 >>> so yes they are also wrong.
 >> How were you handling IRQ binding?  If local_cpus is wrong,
 >> the irqbalance will not be able to make good decisions about
 >> where to bind the NICs' IRQs.  Did you try manually binding
 >> each NICs's interrupt to a separate CPU on the correct node?
 >
 > Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
 > and the nuttcp application had its CPU affinity set to the same
 > CPU with its memory affinity bound to the same local NUMA node.
 > And the irqbalance daemon wasn't running.

I must be misunderstanding something.  I had thought that
alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
would allocate based on default policy which (if not interleaved)
should allocate from the current NUMA node.  And since restocking the
RX ring happens from the driver's NAPI softirq context, then it
should always be restocking on the same node the memory is destined to
be consumed on.

Do I just not understand how alloc_pages() works on NUMA?

Thanks,

Drew


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08 11:08         ` Andrew Gallatin
@ 2009-08-08 11:26           ` Neil Horman
  2009-08-08 18:21             ` Andrew Gallatin
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-08 11:26 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> Bill Fink wrote:
> > On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> >
> >> Bill Fink wrote:
> >>
> >>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>> so yes they are also wrong.
> >> How were you handling IRQ binding?  If local_cpus is wrong,
> >> the irqbalance will not be able to make good decisions about
> >> where to bind the NICs' IRQs.  Did you try manually binding
> >> each NICs's interrupt to a separate CPU on the correct node?
> >
> > Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> > and the nuttcp application had its CPU affinity set to the same
> > CPU with its memory affinity bound to the same local NUMA node.
> > And the irqbalance daemon wasn't running.
>
> I must be misunderstanding something.  I had thought that
> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> would allocate based on default policy which (if not interleaved)
> should allocate from the current NUMA node.  And since restocking the
> RX ring happens from a the driver's NAPI softirq context, then it
> should always be restocking on the same node the memory is destined to
> be consumed on.
>
> Do I just not understand how alloc_pages() works on NUMA?
>

That's how alloc_pages() works, but most drivers use netdev_alloc_skb to refill
their rx ring in their napi context.  netdev_alloc_skb specifically allocates an
skb from memory in the node that the actual NIC is local to (rather than the cpu
that the interrupt is running on).  That cuts out cross numa node chatter when
the device is dma-ing a frame from the hardware to the allocated skb.  The
offshoot of that however (especially in 10G cards with lots of rx queues whose
interrupts are spread out through the system) is that the irq affinity for a
given irq has an increased risk of not being on the same node as the skb memory.
The ftrace module I referenced earlier will help illustrate this, as well as
cases where it's causing applications to run on processors that create lots of
cross-node chatter.
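
To make the affinity concern concrete, here is a small Python sketch that
checks whether an IRQ's affinity mask stays within the CPUs local to a NIC.
It assumes the usual hex mask format found in /proc/irq/<N>/smp_affinity and
/sys/class/net/<iface>/device/local_cpus, and it only helps if the firmware
reports the topology correctly.  The function name is made up for the example.

```python
def affinity_is_local(smp_affinity_hex, local_cpus_hex):
    """True if every CPU in the IRQ affinity mask is also in the NIC's local mask.

    Both arguments are kernel-style hex masks, optionally split into
    comma-separated 32-bit words (for example "00000000,000000ff").
    """
    affinity = int(smp_affinity_hex.strip().replace(",", ""), 16)
    local = int(local_cpus_hex.strip().replace(",", ""), 16)
    # An empty affinity mask cannot be local to anything.
    if affinity == 0:
        return False
    # No affinity bit may fall outside the local mask.
    return (affinity & ~local) == 0


if __name__ == "__main__":
    # Masks chosen for illustration only.
    print(affinity_is_local("00000000,00000004", "00000000,0000000f"))
    print(affinity_is_local("00000000,00000010", "00000000,0000000f"))
```

An IRQ whose mask fails this check is being serviced, at least some of the
time, on a node other than the one holding the skb memory.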

Neil

> Thanks,
>
> Drew
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08 11:26           ` Neil Horman
@ 2009-08-08 18:21             ` Andrew Gallatin
  2009-08-08 18:32               ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Gallatin @ 2009-08-08 18:21 UTC (permalink / raw)
  To: Neil Horman; +Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

Neil Horman wrote:
> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>> Bill Fink wrote:
>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>
>>>> Bill Fink wrote:
>>>>
>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>> so yes they are also wrong.
>>>> How were you handling IRQ binding?  If local_cpus is wrong,
>>>> the irqbalance will not be able to make good decisions about
>>>> where to bind the NICs' IRQs.  Did you try manually binding
>>>> each NICs's interrupt to a separate CPU on the correct node?
>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
>>> and the nuttcp application had its CPU affinity set to the same
>>> CPU with its memory affinity bound to the same local NUMA node.
>>> And the irqbalance daemon wasn't running.
>> I must be misunderstanding something.  I had thought that
>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
>> would allocate based on default policy which (if not interleaved)
>> should allocate from the current NUMA node.  And since restocking the
>> RX ring happens from a the driver's NAPI softirq context, then it
>> should always be restocking on the same node the memory is destined to
>> be consumed on.
>>
>> Do I just not understand how alloc_pages() works on NUMA?
>>
> 
> Thats how alloc_works, but most drivers use netdev_alloc_skb to refill their rx
> ring in their napi context.  netdev_alloc_skb specifically allocates an skb from
> memory in the node that the actually NIC is local to (rather than the cpu that
> the interrupt is running on).  That cuts out cross numa node chatter when the
> device is dma-ing a frame from the hardware to the allocated skb.  The offshoot
> of that however (especially in 10G cards with lots of rx queues whos interrupts
> are spread out through the system) is that the irq affinity for a given irq has
> an increased risk of not being on the same node as the skb memory.  The ftrace
> module I referenced earlier will help illustrate this, as well as cases where
> its causing applications to run on processors that create lots of cross-node
> chatter.

One thing worth noting is that myri10ge is rather unusual in that
it fills its RX rings with pages, then attaches them to skbs  after
the receive is done.   Given how (I think) alloc_page() works, I
don't understand why correct CPU binding does not have the same
benefit as Bill's patch to assign the NUMA node manually.

I'm certainly willing to change myri10ge to use alloc_pages_node()
based on NIC locality, if that provides a benefit, but I'd really
like to understand why CPU binding is not helping.

Drew

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08 18:21             ` Andrew Gallatin
@ 2009-08-08 18:32               ` Neil Horman
  2009-08-11  7:32                 ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-08 18:32 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Bill Fink, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> Neil Horman wrote:
>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>>> Bill Fink wrote:
>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>>
>>>>> Bill Fink wrote:
>>>>>
>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>>> so yes they are also wrong.
>>>>> How were you handling IRQ binding?  If local_cpus is wrong,
>>>>> the irqbalance will not be able to make good decisions about
>>>>> where to bind the NICs' IRQs.  Did you try manually binding
>>>>> each NICs's interrupt to a separate CPU on the correct node?
>>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
>>>> and the nuttcp application had its CPU affinity set to the same
>>>> CPU with its memory affinity bound to the same local NUMA node.
>>>> And the irqbalance daemon wasn't running.
>>> I must be misunderstanding something.  I had thought that
>>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
>>> would allocate based on default policy which (if not interleaved)
>>> should allocate from the current NUMA node.  And since restocking the
>>> RX ring happens from a the driver's NAPI softirq context, then it
>>> should always be restocking on the same node the memory is destined to
>>> be consumed on.
>>>
>>> Do I just not understand how alloc_pages() works on NUMA?
>>>
>>
>> Thats how alloc_works, but most drivers use netdev_alloc_skb to refill their rx
>> ring in their napi context.  netdev_alloc_skb specifically allocates an skb from
>> memory in the node that the actually NIC is local to (rather than the cpu that
>> the interrupt is running on).  That cuts out cross numa node chatter when the
>> device is dma-ing a frame from the hardware to the allocated skb.  The offshoot
>> of that however (especially in 10G cards with lots of rx queues whos interrupts
>> are spread out through the system) is that the irq affinity for a given irq has
>> an increased risk of not being on the same node as the skb memory.  The ftrace
>> module I referenced earlier will help illustrate this, as well as cases where
>> its causing applications to run on processors that create lots of cross-node
>> chatter.
>
> One thing worth noting is that myri10ge is rather unusual in that
> it fills its RX rings with pages, then attaches them to skbs  after
> the receive is done.   Given how (I think) alloc_page() works, I
> don't understand why correct CPU binding does not have the same
> benefit as Bill's patch to assign the NUMA node manually.
>
> I'm certainly willing to change to myri10ge to use alloc_pages_node()
> based on NIC locality, if that provides a benefit, but I'd really
> like to understand why CPU binding is not helping.
>
That's hard to say.  If binding the app to a cpu on the same node doesn't help,
that would suggest to me:

1) That the process binding isn't being honored
2) The cpu you're binding to isn't actually on the same node
3) The node which the skb's are allocated on is not the one you think it is
4) The cross numa chatter is improved, but another problem has taken its place
(like cpu contention between the process and the interrupt handler on the same
cpu)
5) The problem is something else entirely.

Either way, I'd suggest applying and running the patch set that I referenced
previously.  It will give you a good table representation of how skbs for this
process are being allocated and consumed, and let you confirm or eliminate items
1-4 above.

Neil

> Drew
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08 18:32               ` Neil Horman
@ 2009-08-11  7:32                 ` Bill Fink
  2009-08-11 11:02                   ` Neil Horman
                                     ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-11  7:32 UTC (permalink / raw)
  To: Neil Horman
  Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Sat, 8 Aug 2009, Neil Horman wrote:

> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > Neil Horman wrote:
> >> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> >>> Bill Fink wrote:
> >>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> >>>>
> >>>>> Bill Fink wrote:
> >>>>>
> >>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>>>>> so yes they are also wrong.
> >>>>> How were you handling IRQ binding?  If local_cpus is wrong,
> >>>>> the irqbalance will not be able to make good decisions about
> >>>>> where to bind the NICs' IRQs.  Did you try manually binding
> >>>>> each NICs's interrupt to a separate CPU on the correct node?
> >>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> >>>> and the nuttcp application had its CPU affinity set to the same
> >>>> CPU with its memory affinity bound to the same local NUMA node.
> >>>> And the irqbalance daemon wasn't running.
> >>> I must be misunderstanding something.  I had thought that
> >>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> >>> would allocate based on default policy which (if not interleaved)
> >>> should allocate from the current NUMA node.  And since restocking the
> >>> RX ring happens from a the driver's NAPI softirq context, then it
> >>> should always be restocking on the same node the memory is destined to
> >>> be consumed on.
> >>>
> >>> Do I just not understand how alloc_pages() works on NUMA?
> >>>
> >>
> >> Thats how alloc_works, but most drivers use netdev_alloc_skb to refill their rx
> >> ring in their napi context.  netdev_alloc_skb specifically allocates an skb from
> >> memory in the node that the actually NIC is local to (rather than the cpu that
> >> the interrupt is running on).  That cuts out cross numa node chatter when the
> >> device is dma-ing a frame from the hardware to the allocated skb.  The offshoot
> >> of that however (especially in 10G cards with lots of rx queues whos interrupts
> >> are spread out through the system) is that the irq affinity for a given irq has
> >> an increased risk of not being on the same node as the skb memory.  The ftrace
> >> module I referenced earlier will help illustrate this, as well as cases where
> >> its causing applications to run on processors that create lots of cross-node
> >> chatter.
> >
> > One thing worth noting is that myri10ge is rather unusual in that
> > it fills its RX rings with pages, then attaches them to skbs  after
> > the receive is done.   Given how (I think) alloc_page() works, I
> > don't understand why correct CPU binding does not have the same
> > benefit as Bill's patch to assign the NUMA node manually.
> >
> > I'm certainly willing to change to myri10ge to use alloc_pages_node()
> > based on NIC locality, if that provides a benefit, but I'd really
> > like to understand why CPU binding is not helping.

I originally tried to just use alloc_pages_node() instead of alloc_pages(),
but it didn't help.  As mentioned in an earlier e-mail, that seems to
be because running:

	find /sys -name numa_node -exec grep . {} /dev/null \;

revealed that the NUMA node associated with _all_ the PCI devices was
always 0, when at least some of them should have been associated with
NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
 
I discovered today that the NUMA node cpulist/cpumap is also wrong.
A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
cpumask of 00000000,000000ff), while the cpulist for node2 is empty
(with a cpumask of 00000000,00000000).  The distance is correct,
with "10 20" for node 0 and "20 10" for node2.
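
For anyone wanting to cross-check these sysfs values programmatically, here
is a minimal Python sketch that decodes the two formats mentioned above: the
range-list form ("0-7") and the comma-separated hex mask form
("00000000,000000ff").  The helper names are made up for the example, and the
directory walk at the end assumes the standard /sys/devices/system/node
layout.

```python
import glob
import os


def parse_cpulist(text):
    """Decode a sysfs CPU list such as "0-3,8,10-11" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list, as reported for node2 above
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_cpumask(text):
    """Decode a sysfs hex mask (32-bit words, most significant first)."""
    mask = int(text.strip().replace(",", ""), 16)
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


if __name__ == "__main__":
    # Prints nothing on systems without NUMA sysfs entries.
    for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
        with open(path) as f:
            node = os.path.basename(os.path.dirname(path))
            print(node, sorted(parse_cpulist(f.read())))
```

Comparing the two decodings for each node is a quick consistency check: with
the values reported above, both forms give CPUs 0 through 7 for node 0 and an
empty set for node 2.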

Since there seems to be an underlying kernel issue here, what would
be the proper place to address the apparently incorrect assignment
of NUMA node information for this system?

Even with my hacked workaround, which roughly doubled the receive side
performance compared to running without my patch, the performance level
was still below what I would have expected to be possible based on
some other tests I ran, such as the following single and multiple
parallel nuttcp loopback tests.

On Asus P6T6 motherboard with single Intel i7 965 3.2 GHz (overclocked
to 3.4 GHz) quad-core processor (non-NUMA):

Single nuttcp loopback test using CPUs 0 and 1:

[root@i7test1 ~]# nuttcp -xc0/1 192.168.1.10
44948.3125 MB /  10.04 sec = 37554.1394 Mbps 99 %TX 75 %RX 0 retrans 0.04 msRTT

Two parallel nuttcp loopback tests using CPUs 0, 1, 2, and 3:

[root@i7test1 ~]# nuttcp -xc0/1 -p5101 192.168.1.10 & nuttcp -xc2/3 -p5102 192.168.1.10 &
43595.0000 MB /  10.04 sec = 36423.4339 Mbps 99 %TX 82 %RX 0 retrans 0.04 msRTT
43384.5000 MB /  10.04 sec = 36247.5115 Mbps 99 %TX 74 %RX 0 retrans 0.02 msRTT

Aggregate performance:		72.6709 Gbps

On SuperMicro X8DAH+-F motherboard with dual Intel Xeon 5580 3.2 GHz
quad-core processors (NUMA):

Single nuttcp loopback test using CPUs 0 and 2 on NUMA node 0:

[root@xeontest1 ~]# nuttcp -xc0/2 192.168.1.14
39348.0000 MB /  10.04 sec = 32875.4865 Mbps 99 %TX 59 %RX 0 retrans 0.06 msRTT

Two parallel nuttcp loopback tests using CPUs 0, 2, 4, and 6 on NUMA node 0:

[root@xeontest1 ~]# nuttcp -xc0/2 -p5101 192.168.1.14 & nuttcp -xc4/6 -p5102 192.168.1.14 &
36197.0625 MB /  10.04 sec = 30245.0918 Mbps 99 %TX 59 %RX 0 retrans 0.06 msRTT
38153.5000 MB /  10.04 sec = 31876.4556 Mbps 99 %TX 75 %RX 0 retrans 0.04 msRTT

Aggregate performance:		62.1215 Gbps

While the performance using a single Xeon 5580 quad-core processor on
the SuperMicro system was 12.5 % to 14.5 % slower than the single i7 965
quad-core processor on the Asus system, when you use both of the Xeon 5580
quad-core processors:

Four parallel nuttcp loopback tests using CPUs 0, 2, 4, and 6 on NUMA node 0,
and CPUs 1, 3, 5, and 7 on NUMA node 2:

[root@xeontest1 ~]# nuttcp -xc0/2 -p5101 192.168.1.14 & nuttcp -xc4/6 -p5102 192.168.1.14 & numactl --membind=2 nuttcp -xc1/3 -p5103 192.168.1.14 & numactl --membind=2 nuttcp -xc5/7 -p5104 192.168.1.14 &
36340.4375 MB /  10.04 sec = 30363.2672 Mbps 99 %TX 71 %RX 0 retrans 0.06 msRTT
36344.1250 MB /  10.04 sec = 30365.1838 Mbps 99 %TX 70 %RX 0 retrans 0.04 msRTT
34134.5625 MB /  10.04 sec = 28519.0180 Mbps 98 %TX 67 %RX 0 retrans 0.06 msRTT
34812.6875 MB /  10.04 sec = 29085.5312 Mbps 99 %TX 66 %RX 0 retrans 0.04 msRTT

Aggregate performance:		118.3330 Gbps

Overall the SuperMicro system outperforms the Asus system by 62.8 %.
Since a test between a pair of the i7 test systems achieved an aggregate
performance of ~70 Gbps, and could probably have achieved 80 Gbps except
for a motherboard restriction, it would seem the dual Xeon system should
be able to achieve at least the same level of aggregate performance.
On the transmit side it excels, achieving 100 Gbps.  But on the receive
side, even with my hacked workaround, it tops out at 56 Gbps.  I would
welcome any further ideas on what might still be limiting the aggregate
receive side performance of the dual Xeon NUMA system.
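
For reference, the aggregate and percentage figures quoted above can be
reproduced from the individual nuttcp results with a few lines of Python.
This is only a sketch of the arithmetic; the variable and function names are
made up for the example, and the inputs are the Mbps values reported above.

```python
def aggregate_gbps(mbps_values):
    """Sum per-stream rates in Mbps and convert to Gbps."""
    return sum(mbps_values) / 1000.0


def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# Per-stream rates in Mbps, copied from the loopback results above.
i7_single = 37554.1394
i7_pair = [36423.4339, 36247.5115]
xeon_single = 32875.4865
xeon_pair = [30245.0918, 31876.4556]
xeon_quad = [30363.2672, 30365.1838, 28519.0180, 29085.5312]

if __name__ == "__main__":
    print(f"i7 pair:    {aggregate_gbps(i7_pair):.4f} Gbps")
    print(f"Xeon pair:  {aggregate_gbps(xeon_pair):.4f} Gbps")
    print(f"Xeon quad:  {aggregate_gbps(xeon_quad):.4f} Gbps")
    print(f"single:     {percent_change(xeon_single, i7_single):.1f} %")
    print(f"pair:       {percent_change(aggregate_gbps(xeon_pair), aggregate_gbps(i7_pair)):.1f} %")
    print(f"quad vs i7: {percent_change(aggregate_gbps(xeon_quad), aggregate_gbps(i7_pair)):.1f} %")
```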

> Thats hard to say.  If binding the app to a cpu on the same node doesn't help,
> that would suggest to me:
> 
> 1) That the process binding isn't being honored
> 2) The cpu you're binding to isn't actually on the same node
> 3) The node which the skb's are allocated on is not the one you think it is
> 4) The cross numa chatter is improved, but another problem has taken its place
> (like cpu contention between the process and the interrupt handler on the samme
> cpu)
> 5) The problem is something else entirely.
> 
> Either way, I'd suggest applying and running the patch set that I referenced
> previously.  It will give you a good table representation of how skbs for this
> process are being allocated and consumed, and let you confirm or eliminate items
> 1-4 above.

Unfortunately I haven't had a chance to try that yet, as I was away
for the weekend and then there was an emergency at work today.  But
I will hopefully get a chance to try it out shortly.  I had some
initial concerns about just how much trace data would be generated
for a 10-second 10-GigE (or 100-GigE) test, but after doing some
quick calculations for 9000 byte jumbo frames, I guess it's a manageable
amount of data.
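
A back-of-the-envelope version of that calculation, as a Python sketch.  It
assumes the link runs at full line rate, every frame carries 9000 bytes, and
the tracer emits one record per frame; protocol overhead and any receive
coalescing are ignored.  The per-record size is a placeholder for
illustration, since the actual record size of the tracer is not stated here.

```python
def frames_in_test(rate_gbps, seconds, frame_bytes=9000):
    """Upper bound on frames received at the given rate over the test duration."""
    bits = rate_gbps * 1e9 * seconds
    return bits / (frame_bytes * 8)


def trace_megabytes(frames, bytes_per_record):
    """Trace volume in MB for a given (assumed) size of one trace record."""
    return frames * bytes_per_record / 1e6


if __name__ == "__main__":
    ASSUMED_RECORD_BYTES = 32  # hypothetical value, not taken from the tracer
    for rate in (10, 100):
        frames = frames_in_test(rate, 10)
        print(f"{rate} Gbps for 10 s: about {frames:,.0f} records, "
              f"{trace_megabytes(frames, ASSUMED_RECORD_BYTES):.0f} MB "
              f"at {ASSUMED_RECORD_BYTES} bytes per record")
```

Under these assumptions a 10-second test yields on the order of 1.4 million
records at 10 Gbps and ten times that at 100 Gbps.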

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-11  7:32                 ` Bill Fink
@ 2009-08-11 11:02                   ` Neil Horman
  2009-08-11 19:15                     ` Christoph Lameter
  2009-08-11 22:27                   ` Andi Kleen
  2009-08-12  0:02                   ` Brandeburg, Jesse
  2 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-11 11:02 UTC (permalink / raw)
  To: Bill Fink
  Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers, Yinghai Lu

On Tue, Aug 11, 2009 at 03:32:10AM -0400, Bill Fink wrote:
> On Sat, 8 Aug 2009, Neil Horman wrote:
> 
> > On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > > Neil Horman wrote:
> > >> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> > >>> Bill Fink wrote:
> > >>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > >>>>
> > >>>>> Bill Fink wrote:
> > >>>>>
> > >>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> > >>>>>> so yes they are also wrong.
> > >>>>> How were you handling IRQ binding?  If local_cpus is wrong,
> > >>>>> the irqbalance will not be able to make good decisions about
> > >>>>> where to bind the NICs' IRQs.  Did you try manually binding
> > >>>>> each NICs's interrupt to a separate CPU on the correct node?
> > >>>> Yes, all the NIC IRQs were bound to a CPU on the local NUMA node,
> > >>>> and the nuttcp application had its CPU affinity set to the same
> > >>>> CPU with its memory affinity bound to the same local NUMA node.
> > >>>> And the irqbalance daemon wasn't running.
> > >>> I must be misunderstanding something.  I had thought that
> > >>> alloc_pages() on NUMA would wind up doing alloc_pages_current(), which
> > >>> would allocate based on default policy which (if not interleaved)
> > >>> should allocate from the current NUMA node.  And since restocking the
> > >>> RX ring happens from a the driver's NAPI softirq context, then it
> > >>> should always be restocking on the same node the memory is destined to
> > >>> be consumed on.
> > >>>
> > >>> Do I just not understand how alloc_pages() works on NUMA?
> > >>>
> > >>
> > >> Thats how alloc_works, but most drivers use netdev_alloc_skb to refill their rx
> > >> ring in their napi context.  netdev_alloc_skb specifically allocates an skb from
> > >> memory in the node that the actually NIC is local to (rather than the cpu that
> > >> the interrupt is running on).  That cuts out cross numa node chatter when the
> > >> device is dma-ing a frame from the hardware to the allocated skb.  The offshoot
> > >> of that however (especially in 10G cards with lots of rx queues whos interrupts
> > >> are spread out through the system) is that the irq affinity for a given irq has
> > >> an increased risk of not being on the same node as the skb memory.  The ftrace
> > >> module I referenced earlier will help illustrate this, as well as cases where
> > >> its causing applications to run on processors that create lots of cross-node
> > >> chatter.
> > >
> > > One thing worth noting is that myri10ge is rather unusual in that
> > > it fills its RX rings with pages, then attaches them to skbs  after
> > > the receive is done.   Given how (I think) alloc_page() works, I
> > > don't understand why correct CPU binding does not have the same
> > > benefit as Bill's patch to assign the NUMA node manually.
> > >
> > > I'm certainly willing to change to myri10ge to use alloc_pages_node()
> > > based on NIC locality, if that provides a benefit, but I'd really
> > > like to understand why CPU binding is not helping.
> 
> I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> but it didn't help.  As mentioned in an earlier e-mail, that seems to
> be because I discovered that doing:
> 
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
> 
> revealed that the NUMA node associated with _all_ the PCI devices was
> always 0, when at least some of them should have been associated with
> NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
>  
> I discovered today that the NUMA node cpulist/cpumap is also wrong.
> A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> (with a cpumask of 00000000,00000000).  The distance is correct,
> with "10 20" for node 0 and "20 10" for node2.
> 
> Since there seems to be an underlying kernel issue here, what would
> be the proper place to address the apparently incorrect assignment
> of NUMA node information for this system?
> 

Well, it's possible that there is a kernel bug, but those tables that you're
reading are parsed IIRC directly from the system's SRAT table in acpi space.  I'm
not sure of a way to read those directly from user space, but IIRC if you turn
on apic debugging they will get dumped out.  It sounds as though perhaps your
SRAT table is incorrectly reporting the location of your devices.  You may also
want to look at dumping out your smbios via dmidecode to see where that places
all your 10G nic cards.

Neil

> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-11 11:02                   ` Neil Horman
@ 2009-08-11 19:15                     ` Christoph Lameter
  0 siblings, 0 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-08-11 19:15 UTC (permalink / raw)
  To: Neil Horman
  Cc: Bill Fink, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Tue, 11 Aug 2009, Neil Horman wrote:

> Well, its possible that there is a kernel bug, but those tables that you're
> reading are parsed IIRC directly from the systems SRAT table in acpi space.  I'm
> not sure of a way to read those directly from user space, but IIRC if you turn
> on apic debugging they will get dumped out.  It sounds as though perhaps your
> SRAT table is incorrectly reporting the location of your devices.  You may also
> want to look at dumping out your smbios via dmidecode to see where that places
> all your 10G nic cards.

Very likely. Talk to the manufacturer of the machine and make sure that
the ACPI information is correct. NUMA is new to many vendors because of
the recent introduction of newer processor architectures that support NUMA
for the first time in small SMP machines.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-11  7:32                 ` Bill Fink
  2009-08-11 11:02                   ` Neil Horman
@ 2009-08-11 22:27                   ` Andi Kleen
  2009-08-12  4:30                     ` Bill Fink
  2009-08-12  0:02                   ` Brandeburg, Jesse
  2 siblings, 1 reply; 95+ messages in thread
From: Andi Kleen @ 2009-08-11 22:27 UTC (permalink / raw)
  To: Bill Fink
  Cc: Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

Bill Fink <billfink@mindspring.com> writes:
>
> I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> but it didn't help.  As mentioned in an earlier e-mail, that seems to
> be because I discovered that doing:
>
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
>
> revealed that the NUMA node associated with _all_ the PCI devices was
> always 0, when at least some of them should have been associated with
> NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.

> I discovered today that the NUMA node cpulist/cpumap is also wrong.
> A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> (with a cpumask of 00000000,00000000).  The distance is correct,
> with "10 20" for node 0 and "20 10" for node2.


When the CPU nodes are not correct the device nodes are unlikely
to be correct either. In fact your system likely has no node 1 configured,
right?

This information comes from the BIOS. So either your BIOS is broken
or you simply didn't enable NUMA mode in the BIOS, but configured
memory interleaving.

If you post dmesg output somewhere I can take a look.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-11  7:32                 ` Bill Fink
  2009-08-11 11:02                   ` Neil Horman
  2009-08-11 22:27                   ` Andi Kleen
@ 2009-08-12  0:02                   ` Brandeburg, Jesse
  2009-08-12  4:38                     ` Bill Fink
  2 siblings, 1 reply; 95+ messages in thread
From: Brandeburg, Jesse @ 2009-08-12  0:02 UTC (permalink / raw)
  To: Bill Fink, Neil Horman
  Cc: Andrew Gallatin, Brice Goglin, Linux Network Developers,
	Yinghai Lu, jbarnes

Bill Fink wrote:
> On Sat, 8 Aug 2009, Neil Horman wrote:
> 
>> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
>>> Neil Horman wrote:
>>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
>>>>> Bill Fink wrote:
>>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
>>>>>> 
>>>>>>> Bill Fink wrote:
>>>>>>> 
>>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
>>>>>>>> so yes they are also wrong.

Bill, I recently helped Jesse Barnes push a patch that addresses this kind
of issue on Core i7.  The root cause was that the numa_node variable was
initialized based on slot on AMD systems, but needed to be set to -1 by
default on systems with a uniform IOH to slot architecture.

Here is the commit ID:
http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38d674be519109696746192943a6d524019f7f

I'm not sure it is in Linus' tree yet; this link is to linux-next.

Maybe see if it helps?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-11 22:27                   ` Andi Kleen
@ 2009-08-12  4:30                     ` Bill Fink
  2009-08-12  7:21                       ` Andi Kleen
       [not found]                       ` <4A856781.2080301@myri.com>
  0 siblings, 2 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-12  4:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Wed, 12 Aug 2009, Andi Kleen wrote:

> Bill Fink <billfink@mindspring.com> writes:
> >
> > I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> > but it didn't help.  As mentioned in an earlier e-mail, that seems to
> > be because I discovered that doing:
> >
> > 	find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > revealed that the NUMA node associated with _all_ the PCI devices was
> > always 0, when at least some of them should have been associated with
> > NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
> 
> > I discovered today that the NUMA node cpulist/cpumap is also wrong.
> > A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> > cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> > (with a cpumask of 00000000,00000000).  The distance is correct,
> > with "10 20" for node 0 and "20 10" for node2.
> 
> When the CPU nodes are not correct the device nodes are unlikely
> to be correct either. In fact your system likely has no node 1 configured,
> right?

That was right.  There was no node 1, only nodes 0 and 2.

> This information comes from the BIOS. So either your BIOS is broken
> or you simply didn't enable NUMA mode in the BIOS, but configured
> memory interleaving.
> 
> If you post dmesg output somewhere I can take a look.

I did have NUMA enabled, and memory was configured as independent
rather than interleaved.

Based on all the discussions, it seemed a good possibility that the
BIOS was broken.  Today a colleague checked the SuperMicro site, and
discovered and installed a newer version of the BIOS.  Things seem
better now, but not totally correct.

There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
(previously the even CPUs were on the first Xeon 5580 processor while
the odd CPUs were on the second processor).

[root@xeontest1 ~]# numastat
                           node0           node1
numa_hit                28087735        27195340
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             12065           11978
local_node              28081559        27182572
other_node                  6176           12768

[root@xeontest1 ~]# grep 'physical id' /proc/cpuinfo
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1

[root@xeontest1 ~]# cat /sys/devices/system/node/node0/cpulist
0-3
[root@xeontest1 ~]# cat /sys/devices/system/node/node1/cpulist
4-7

But _all_ the PCI devices are still just on node 0.

[root@xeontest1 ~]# find /sys -name numa_node -exec grep . {} /dev/null \;

shows numa_node is always 0.

[root@xeontest1 ~]# find /sys -name local_cpulist -exec grep . {} /dev/null \;

shows local_cpulist is always 0-3.
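
For anyone repeating the two find commands above, the same check can be
scripted.  This is only a minimal sketch, assuming the PCI devices appear
under /sys/bus/pci/devices and expose the numa_node and local_cpulist
attributes; the function names are made up for illustration.

```python
import os


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def pci_numa_report(root="/sys/bus/pci/devices"):
    """Map each PCI device name to its (numa_node, local_cpulist) pair.

    The default root is an assumption about where sysfs is mounted.
    """
    report = {}
    if not os.path.isdir(root):
        return report
    for dev in sorted(os.listdir(root)):
        base = os.path.join(root, dev)
        report[dev] = (read_attr(os.path.join(base, "numa_node")),
                       read_attr(os.path.join(base, "local_cpulist")))
    return report


if __name__ == "__main__":
    for dev, (node, cpus) in pci_numa_report().items():
        print(f"{dev}: numa_node={node} local_cpulist={cpus}")
```

On the system described here, every device would be expected to report
numa_node 0 and local_cpulist 0-3.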

I now can get basically the same level of aggregate receive side
performance (55 Gbps) without my patch that I could previously get
only with my hacked workaround in the myri10ge driver.  But this
still seems significantly subpar to what I believe it should be
capable of.

BTW when I first booted the test system after upgrading the BIOS,
I got a kernel oops because it was still using my hacked myri10ge
driver, and apparently it didn't like that I was specifying to
use a then nonexistent node 2 (I was checking for success of the
alloc_pages_node() call and falling back to the original alloc_pages()
call on failure).  Or it could have been on the __alloc_skb() call
where I had a similar hack for the skb allocation.

Are you still interested in me posting the dmesg output?

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-12  0:02                   ` Brandeburg, Jesse
@ 2009-08-12  4:38                     ` Bill Fink
  2009-08-12 16:00                       ` Jesse Barnes
  2009-08-14 20:31                       ` Bill Fink
  0 siblings, 2 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-12  4:38 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu, jbarnes

On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:

> Bill Fink wrote:
> > On Sat, 8 Aug 2009, Neil Horman wrote:
> > 
> >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> >>> Neil Horman wrote:
> >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> >>>>> Bill Fink wrote:
> >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> >>>>>> 
> >>>>>>> Bill Fink wrote:
> >>>>>>> 
> >>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> >>>>>>>> so yes they are also wrong.
> 
> bill, I recently helped Jesse Barnes push a patch that addresses this kind
> of issue on CoreI7, the root cause was the numa_node variable was
> initialized based on slot on AMD systems, but needed to be set to -1 by
> default on systems with a uniform IOH to slot architecture.
> 
> here is the commit ID:
> http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38
> d674be519109696746192943a6d524019f7f
> 
> I'm not sure it is in linus' tree yet, this link is to net-next
> 
> Maybe see if it helps?

It's worth a shot.

Hopefully I can get a chance to build a new kernel tomorrow to check
out some of the suggestions, like this one, the setting of ACPI_DEBUG,
and the new ftrace module for checking NUMA affinity of skbs.

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-12  4:30                     ` Bill Fink
@ 2009-08-12  7:21                       ` Andi Kleen
       [not found]                       ` <4A856781.2080301@myri.com>
  1 sibling, 0 replies; 95+ messages in thread
From: Andi Kleen @ 2009-08-12  7:21 UTC (permalink / raw)
  To: Bill Fink
  Cc: Andi Kleen, Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu, jbarnes

> There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
> for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
> (previously the even CPUs were on the first Xeon 5580 processor while
> the odd CPUs were on the second processor).

That might be ok, depending on how the APICs are configured.
Of course you should have the same number of CPUs on the different
nodes. Anyways, it's gone now.

> 
> [root@xeontest1 ~]# numastat
>                            node0           node1
> numa_hit                28087735        27195340
> numa_miss                      0               0
> numa_foreign                   0               0
> interleave_hit             12065           11978
> local_node              28081559        27182572
> other_node                  6176           12768
> 
> [root@xeontest1 ~]# grep 'physical id' /proc/cpuinfo
> physical id     : 0
> physical id     : 0
> physical id     : 0
> physical id     : 0
> physical id     : 1
> physical id     : 1
> physical id     : 1
> physical id     : 1
> 
> [root@xeontest1 ~]# cat /sys/devices/system/node/node0/cpulist
> 0-3
> [root@xeontest1 ~]# cat /sys/devices/system/node/node1/cpulist
> 4-7
> 
> But _all_ the PCI devices are still just on node 0.

Most likely you need the appended patch from linux-next.

It should probably be in .31, but I can't see it in Linus' tree, only in -next.
Jesse? 

Unfortunately the patch seems to combine code movement with fixes :-(

> Are you still interested in me posting the dmesg output?

No.

-Andi



commit eaf2f454cc9a76dbe1890af6269e60fe9978a3a5
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Jul 10 14:04:30 2009 -0700

    x86/PCI: initialize PCI bus node numbers early
    
    The current mp_bus_to_node array is initialized only by AMD specific
    code, since AMD platforms have registers that can be used for
    determining node numbers.  On new Intel platforms it's necessary to
    initialize this array as well though, otherwise all PCI node numbers
    will be 0, when in fact they should be -1 (indicating that I/O isn't
    tied to any particular node).
    
    So move the mp_bus_to_node code into the common PCI code, and
    initialize it early with a default value of -1.  This may be overridden
    later by arch code (e.g. the AMD code).
    
    With this change, PCI consistent memory and other node specific
    allocations (e.g. skbuff allocs) should occur on the "current" node.
    If, for performance reasons, applications want to be bound to specific
    nodes, they should open their devices only after being pinned to the
    CPU where they'll run, for maximum locality.
    
    Acked-by: Yinghai Lu <yinghai@kernel.org>
    Tested-by: Jesse Brandeburg <jesse.brandeburg@gmail.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

diff --git a/arch/x86/pci/amd_bus.c b/arch/x86/pci/amd_bus.c
index 3ffa10d..572ee97 100644
--- a/arch/x86/pci/amd_bus.c
+++ b/arch/x86/pci/amd_bus.c
@@ -15,63 +15,6 @@
  * also get peer root bus resource for io,mmio
  */
 
-#ifdef CONFIG_NUMA
-
-#define BUS_NR 256
-
-#ifdef CONFIG_X86_64
-
-static int mp_bus_to_node[BUS_NR];
-
-void set_mp_bus_to_node(int busnum, int node)
-{
-	if (busnum >= 0 &&  busnum < BUS_NR)
-		mp_bus_to_node[busnum] = node;
-}
-
-int get_mp_bus_to_node(int busnum)
-{
-	int node = -1;
-
-	if (busnum < 0 || busnum > (BUS_NR - 1))
-		return node;
-
-	node = mp_bus_to_node[busnum];
-
-	/*
-	 * let numa_node_id to decide it later in dma_alloc_pages
-	 * if there is no ram on that node
-	 */
-	if (node != -1 && !node_online(node))
-		node = -1;
-
-	return node;
-}
-
-#else /* CONFIG_X86_32 */
-
-static unsigned char mp_bus_to_node[BUS_NR];
-
-void set_mp_bus_to_node(int busnum, int node)
-{
-	if (busnum >= 0 &&  busnum < BUS_NR)
-	mp_bus_to_node[busnum] = (unsigned char) node;
-}
-
-int get_mp_bus_to_node(int busnum)
-{
-	int node;
-
-	if (busnum < 0 || busnum > (BUS_NR - 1))
-		return 0;
-	node = mp_bus_to_node[busnum];
-	return node;
-}
-
-#endif /* CONFIG_X86_32 */
-
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_X86_64
 
 /*
@@ -301,11 +244,6 @@ static int __init early_fill_mp_bus_info(void)
 	u64 val;
 	u32 address;
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < BUS_NR; i++)
-		mp_bus_to_node[i] = -1;
-#endif
-
 	if (!early_pci_allowed())
 		return -1;
 
@@ -346,7 +284,7 @@ static int __init early_fill_mp_bus_info(void)
 		node = (reg >> 4) & 0x07;
 #ifdef CONFIG_NUMA
 		for (j = min_bus; j <= max_bus; j++)
-			mp_bus_to_node[j] = (unsigned char) node;
+			set_mp_bus_to_node(j, node);
 #endif
 		link = (reg >> 8) & 0x03;
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 2202b62..5db96d4 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -600,3 +600,72 @@ struct pci_bus * __devinit pci_scan_bus_with_sysdata(int busno)
 {
 	return pci_scan_bus_on_node(busno, &pci_root_ops, -1);
 }
+
+/*
+ * NUMA info for PCI busses
+ *
+ * Early arch code is responsible for filling in reasonable values here.
+ * A node id of "-1" means "use current node".  In other words, if a bus
+ * has a -1 node id, it's not tightly coupled to any particular chunk
+ * of memory (as is the case on some Nehalem systems).
+ */
+#ifdef CONFIG_NUMA
+
+#define BUS_NR 256
+
+#ifdef CONFIG_X86_64
+
+static int mp_bus_to_node[BUS_NR] = {
+	[0 ... BUS_NR - 1] = -1
+};
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+		mp_bus_to_node[busnum] = node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node = -1;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return node;
+
+	node = mp_bus_to_node[busnum];
+
+	/*
+	 * let numa_node_id to decide it later in dma_alloc_pages
+	 * if there is no ram on that node
+	 */
+	if (node != -1 && !node_online(node))
+		node = -1;
+
+	return node;
+}
+
+#else /* CONFIG_X86_32 */
+
+static unsigned char mp_bus_to_node[BUS_NR] = {
+	[0 ... BUS_NR - 1] = -1
+};
+
+void set_mp_bus_to_node(int busnum, int node)
+{
+	if (busnum >= 0 &&  busnum < BUS_NR)
+	mp_bus_to_node[busnum] = (unsigned char) node;
+}
+
+int get_mp_bus_to_node(int busnum)
+{
+	int node;
+
+	if (busnum < 0 || busnum > (BUS_NR - 1))
+		return 0;
+	node = mp_bus_to_node[busnum];
+	return node;
+}
+
+#endif /* CONFIG_X86_32 */
+
+#endif /* CONFIG_NUMA */
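
The lookup semantics of the 64-bit variant in the patch above can be
modelled in user space.  This is a hedged sketch only, not kernel code:
the class and method names are hypothetical, and node_online() is
replaced by a plain set of node ids supplied by the caller.

```python
BUS_NR = 256
NO_NODE = -1  # "use current node", as in the patch comment


class BusNodeMap:
    """User-space model of mp_bus_to_node with a default of -1 per bus."""

    def __init__(self, online_nodes):
        self.online = set(online_nodes)      # stand-in for node_online()
        self.table = [NO_NODE] * BUS_NR      # every bus starts at -1

    def set_node(self, busnum, node):
        # Out-of-range bus numbers are silently ignored, as in the patch.
        if 0 <= busnum < BUS_NR:
            self.table[busnum] = node

    def get_node(self, busnum):
        if not 0 <= busnum < BUS_NR:
            return NO_NODE
        node = self.table[busnum]
        # A node that is not online falls back to -1.
        if node != NO_NODE and node not in self.online:
            return NO_NODE
        return node
```

With no arch code calling set_node(), every bus reports -1, which matches
what an Intel system would see after this patch.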


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-12  4:38                     ` Bill Fink
@ 2009-08-12 16:00                       ` Jesse Barnes
  2009-08-14 20:31                       ` Bill Fink
  1 sibling, 0 replies; 95+ messages in thread
From: Jesse Barnes @ 2009-08-12 16:00 UTC (permalink / raw)
  To: Bill Fink
  Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Wed, 12 Aug 2009 00:38:24 -0400
Bill Fink <billfink@mindspring.com> wrote:

> On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:
> 
> > Bill Fink wrote:
> > > On Sat, 8 Aug 2009, Neil Horman wrote:
> > > 
> > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > >>> Neil Horman wrote:
> > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin
> > >>>> wrote:
> > >>>>> Bill Fink wrote:
> > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > >>>>>> 
> > >>>>>>> Bill Fink wrote:
> > >>>>>>> 
> > >>>>>>>> All sysfs local_cpus values are the same
> > >>>>>>>> (00000000,000000ff), so yes they are also wrong.
> > 
> > bill, I recently helped Jesse Barnes push a patch that addresses
> > this kind of issue on CoreI7, the root cause was the numa_node
> > variable was initialized based on slot on AMD systems, but needed
> > to be set to -1 by default on systems with a uniform IOH to slot
> > architecture.
> > 
> > here is the commit ID:
> > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38
> > d674be519109696746192943a6d524019f7f
> > 
> > I'm not sure it is in linus' tree yet, this link is to net-next
> > 
> > Maybe see if it helps?
> 
> It's worth a shot.
> 
> Hopefully I can get a chance to build a new kernel tomorrow to check
> out some of the suggestions, like this one, the setting of ACPI_DEBUG,
> and the new ftrace module for checking NUMA affinity of skbs.

It's a fairly significant change so I wasn't planning on sending it to
Linus for 2.6.31.  If you think it *should* go into 2.6.31 (and stable
for that matter), please let me know soon.

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
  2009-08-07 21:18 ` Brice Goglin
  2009-08-07 22:12 ` Neil Horman
@ 2009-08-12 23:29 ` David Miller
  2009-08-13  2:35   ` Bill Fink
  2 siblings, 1 reply; 95+ messages in thread
From: David Miller @ 2009-08-12 23:29 UTC (permalink / raw)
  To: billfink; +Cc: netdev, brice, gallatin

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 7 Aug 2009 17:06:00 -0400

> To kludge around this, I made a different patch to the myri10ge driver.
> This time I hardcoded the NUMA node in the call to alloc_pages_node()
> to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
> and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
> This is of course very specific to our specific system (NUMA node ids
> and Myricom 10-GigE device IRQs), and is not something that would be
> generically applicable.  But it was useful as a test, and it did
> improve the receive side performance substantially!

This, unfortunately, won't be comprehensive.  You'd also need to
kludge the NUMA node used for allocation of the skb->data buffer via
the netdev_alloc_skb() calls in myri10ge_rx_done() and friends.

This could possibly account for why, with your kludge, you still
were only getting 56.4703 Gbps

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-12 23:29 ` David Miller
@ 2009-08-13  2:35   ` Bill Fink
  0 siblings, 0 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-13  2:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, brice, gallatin

On Wed, 12 Aug 2009, David Miller wrote:

> From: Bill Fink <billfink@mindspring.com>
> Date: Fri, 7 Aug 2009 17:06:00 -0400
> 
> > To kludge around this, I made a different patch to the myri10ge driver.
> > This time I hardcoded the NUMA node in the call to alloc_pages_node()
> > to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
> > and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
> > This is of course very specific to our specific system (NUMA node ids
> > and Myricom 10-GigE device IRQs), and is not something that would be
> > generically applicable.  But it was useful as a test, and it did
> > improve the receive side performance substantially!
> 
> This, unfortunately, won't be comprehensive.  You'd also need to
> kludge the NUMA node used for allocation of the skb->data buffer via
> the netdev_alloc_skb() calls in myri10ge_rx_done() and friends.
> 
> This could possibly account for why, with your kludge, you still
> were only getting 56.4703 Gbps

I actually did try this.  I changed the netdev_alloc_skb() call in
the myri10ge driver to an __alloc_skb() call and explicitly specified
the correct NUMA node (plus all the necessary extra code that gets
done under the covers by netdev_alloc_skb()).  It didn't help.

Not being a kernel developer, one thing I didn't know though was if
the skb was initially allocated on NUMA node A, as the skb got expanded
during its processing, would it always stay on NUMA node A, or could
it possibly be migrated subsequently to a different NUMA node B.

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
       [not found]                       ` <4A856781.2080301@myri.com>
@ 2009-08-14 16:38                         ` Bill Fink
  2009-08-14 16:55                           ` Andrew Gallatin
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-14 16:38 UTC (permalink / raw)
  To: Andrew Gallatin; +Cc: netdev

Hi Drew,

On Fri, 14 Aug 2009, Andrew Gallatin wrote:

> Hi Bill,
> 
> A few questions.   I was looking at the manual for the
> X8DAH+-F, and it claims to support both I/OAT and DCA.
> Do you have either or both enabled?

I did not explicitly set either one, and the manual indicates they
are both enabled by default, which I also vaguely seem to recall
was the way they were set.  I'm not in at the office today so I
can't physically check.

> If yes, then
> what happens if you disable ioatdma (by setting
> net.ipv4.tcp_dma_copybreak=2147483647 with sysctl)?
> How about if you disable myri10ge's use of dca (load driver
> with myri10ge_dca=0).
> 
> Do you see any changes?

Good suggestions but unfortunately it didn't help (or hurt).
It may have helped a little bit on the transmit side (I saw one
test at 102 Gbps when the previous high I had seen was 101 Gbps),
but the receive side was still at 55 Gbps.

Would there be any difference between disabling I/OAT and DCA in
the BIOS versus the myri10ge module parameter and sysctl setting?
I can try any BIOS changes on Monday.

> I'm worried about ioatdma because I've seen problems with it
> before.  At least on Linux, it tends to busywait for the DMA
> to complete, which is actually slower than a memory copy in
> most cases that I've seen.
> 
> I'm worried about DCA because you've shown that the BIOS is buggy,
> so the tag table could be wrong (resulting in bad prefetching hints).

The new BIOS seems to be better at setting the NUMA node info.

> I'm also worried about DCA because I've never had the chance to
> use it on a 5520 based system, and there is always the chance
> that we may be doing something wrong ourselves in the NIC firmware
> (again resulting in bad prefetching hints).  Bad prefetching hints
> can cause cross-CPU chatter, and kill performance by wasting
> memory bandwidth, and dirtying a cache on another CPU
> for no reason.

Is there any easy way to monitor active memory bandwidth usage?

> Drew

						-Thanks

						-Bill

P.S.  I don't know if it's at all significant, but one time after
      a reboot that required an fsck because of exceeding the number
      of mounts without an fsck, thus incurring a significant delay
      in the boot process, the transmit performance dropped from
      its normal ~100 Gbps to 57 Gbps (similar to the receive side
      performance).  Another reboot restored the normal ~100 Gbps
      transmit side performance.  I have no idea why this might be,
      but I saw it once before when an fsck was required on boot,
      so it may not be a fluke.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 16:38                         ` Bill Fink
@ 2009-08-14 16:55                           ` Andrew Gallatin
  2009-08-14 21:13                             ` Aviv Greenberg
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Gallatin @ 2009-08-14 16:55 UTC (permalink / raw)
  To: Bill Fink; +Cc: netdev

Bill Fink wrote:
> Hi Drew,
> 
> On Fri, 14 Aug 2009, Andrew Gallatin wrote:
> 
>> Hi Bill,
>>
>> A few questions.   I was looking at the manual for the
>> X8DAH+-F, and it claims to support both I/OAT and DCA.
>> Do you have either or both enabled?
> 
> I did not explicitly set either one, and the manual indicates they
> are both enabled by default, which I also vaguely seem to recall
> was the way they were set.  I'm not in at the office today so I
> can't physically check.
> 
>> If yes, then
>> what happens if you disable ioatdma (by setting
>> net.ipv4.tcp_dma_copybreak=2147483647 with sysctl)?
>> How about if you disable myri10ge's use of dca (load driver
>> with myri10ge_dca=0).
>>
>> Do you see any changes?
> 
> Good suggestions but unfortunately it didn't help (or hurt).
> It may have helped a little bit on the transmit side (I saw one
> test at 102 Gbps when the previous high I had seen was 101 Gbps),
> but the receive side was still at 55 Gbps.

Darn.  But it shouldn't matter at all for the transmit side...
Speaking of the send side, have you tried using
netperf -tTCP_SENDFILE rather than nuttcp to make the
transmit side zero-copy?


> Would there be any difference between disabling I/OAT and DCA in
> the BIOS versus the myri10ge module parameter and sysctl setting?
> I can try any BIOS changes on Monday.

There should not be, no.

>> I'm worried about ioatdma because I've seen problems with it
>> before.  At least on Linux, it tends to busywait for the DMA
>> to complete, which is actually slower than a memory copy in
>> most cases that I've seen.
>>
>> I'm worried about DCA because you've shown that the BIOS is buggy,
>> so the tag table could be wrong (resulting in bad prefetching hints).
> 
> The new BIOS seems to be better at setting the NUMA node info.
> 
>> I'm also worried about DCA because I've never had the chance to
>> use it on a 5520 based system, and there is always the chance
>> that we may be doing something wrong ourselves in the NIC firmware
>> (again resulting in bad prefetching hints).  Bad prefetching hints
>> can cause cross-CPU chatter, and kill performance by wasting
>> memory bandwidth, and dirtying a cache on another CPU
>> for no reason.
> 
> Is there any easy way to monitor active memory bandwidth usage?

There may be something in the chipset, and there may be CPU counters,
(via oprofile) but I'm not aware of what they are.  It might be
interesting to run just 1/2 your test (all to, say, NUMA node
1) and then bind some lmbench memory copy (bw_mem) processes to
NUMA node 0, and see if the lmbench slows down (and/or is slowed
down) by the ongoing network traffic.

Drew

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-12  4:38                     ` Bill Fink
  2009-08-12 16:00                       ` Jesse Barnes
@ 2009-08-14 20:31                       ` Bill Fink
  2009-08-17 16:53                         ` Jesse Barnes
  1 sibling, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-14 20:31 UTC (permalink / raw)
  To: Bill Fink
  Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu, jbarnes

On Wed, 12 Aug 2009, Bill Fink wrote:

> On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:
> 
> > Bill Fink wrote:
> > > On Sat, 8 Aug 2009, Neil Horman wrote:
> > > 
> > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin wrote:
> > >>> Neil Horman wrote:
> > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin wrote:
> > >>>>> Bill Fink wrote:
> > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > >>>>>> 
> > >>>>>>> Bill Fink wrote:
> > >>>>>>> 
> > >>>>>>>> All sysfs local_cpus values are the same (00000000,000000ff),
> > >>>>>>>> so yes they are also wrong.
> > 
> > bill, I recently helped Jesse Barnes push a patch that addresses this kind
> > of issue on CoreI7, the root cause was the numa_node variable was
> > initialized based on slot on AMD systems, but needed to be set to -1 by
> > default on systems with a uniform IOH to slot architecture.
> > 
> > here is the commit ID:
> > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38
> > d674be519109696746192943a6d524019f7f
> > 
> > I'm not sure it is in linus' tree yet, this link is to net-next
> > 
> > Maybe see if it helps?
> 
> It's worth a shot.
> 
> Hopefully I can get a chance to build a new kernel tomorrow to check
> out some of the suggestions, like this one, the setting of ACPI_DEBUG,
> and the new ftrace module for checking NUMA affinity of skbs.

I applied this patch to my 2.6.29.6 kernel (from Fedora 11).

Now when I do:

	find /sys -name numa_node -exec grep . {} /dev/null \;

the numa_node for _all_ PCI devices is -1.

When I do:

	find /sys -name local_cpus -exec grep . {} /dev/null \;

I find that local_cpus is always 00000000,00000000.

Is that OK or should it be 00000000,000000ff (for my dual quad-core
Xeon 5580 system with no hyperthreading)?
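
As a reading aid for the two mask values above, here is a small sketch
that decodes a sysfs cpumask string into CPU numbers.  It assumes the
usual layout of comma-separated 32-bit hex words with the most
significant word first; the function name is made up for illustration.

```python
def cpumask_to_cpus(mask):
    """Decode a sysfs cpumask such as '00000000,000000ff' into CPU numbers."""
    # Dropping the commas leaves one hex number, most significant word first.
    value = int(mask.strip().replace(",", ""), 16)
    cpus = []
    cpu = 0
    while value:
        if value & 1:
            cpus.append(cpu)
        value >>= 1
        cpu += 1
    return cpus
```

So 00000000,000000ff decodes to CPUs 0 through 7, while
00000000,00000000 decodes to no CPUs at all.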

Also, is it just not possible on this type of Intel Xeon system to
properly associate the PCI devices with the nearest NUMA node?

In any event, the patch didn't help (or hurt).  The transmit
performance remained at ~100 Gbps while the receive performance
remained at 55 Gbps.

						-Thanks

						-Bill

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-08  1:56     ` Neil Horman
@ 2009-08-14 20:44       ` Bill Fink
  2009-08-14 23:25         ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-14 20:44 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 7 Aug 2009, Neil Horman wrote:

> On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > On Fri, 7 Aug 2009, Neil Horman wrote:
> > 
> > > Your timing is impeccable!  I just posted a patch for an ftrace module to help
> > > detect just these kinds of conditions:
> > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > > 
> > > Hope that helps you out
> > > Neil
> > 
> > Thanks!  It could be helpful.  Do you have a pointer to documentation
> > on how to use it?  And does it require the latest GIT kernel or could
> > it possibly be used with a 2.6.29.6 kernel?
> > 
> > 						-Bill
> 
> It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> much).

It doesn't look like I can apply your patches to my 2.6.29.6 kernel.

For starters, there's no include/trace/events directory, so there's
no include/trace/events/skb.h.  There is an include/trace/skb.h file,
but there's no TRACE_EVENT defined anywhere in the kernel.

I don't suppose it's as simple as defining (from include/linux/tracepoint.h
from Linus's GIT tree):

#define PARAMS(args...) args

#define TRACE_EVENT(name, proto, args, struct, assign, print)   \
	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))

So do you still think it's reasonable to try applying your patches
to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
or 2.6.31-rc6?

						-Thanks

						-Bill



> No docs I'm afraid (sorry, I'm horrible about that)
> 
> Using it is easy though:
> 
> 1) Patch, build and boot the kernel (make sure to have
> CONFIG_SKB_SOURCES_TRACER, along with the other FTRACE requisite options)
> 
> 2) mount -t debugfs nodev /sys/kernel/debug
> 
> 3) cd /sys/kernel/debug/tracing
> 
> 4) echo skb_sources > ./current_tracer
> 
> 5) echo 1 > trace
> 
> 6) cat ./trace
> 
> Step 5 clears the trace buffer.  Step 6 provides you a list like this:
> 
> 
> PID	ANID	CNID	RXQ	CCPU	LEN
> 
> 
> Where:
> PID - The process receiving an skb
> ANID - The node which the skb being received was allocated on
> CNID - The node which the process is running on when it read this skb
> RXQ - The NIC receive queue that received this skb
> CCPU - The cpu the process was running on when it read the skb in question
> LEN - The length of the skb being received
> 
> Each entry in the list denotes a unique skb (obviously), and with a clever awk
> script you can identify which nodes each process in your system is receiving
> frames from, so that you can use numactl or taskset to bias that process to run
> on the same node's CPUs.
> 
> Note that step (6) will show a larger list each time you cat that file (as trace
> records aren't removed during a read).  Step 5 is what actually clears the trace
> buffer and resets the list length to zero.
> 
> Hope that helps. Please feel free to email me if you have any questions.
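
The per-process tally described in the quoted instructions could be done
along these lines.  This is a sketch in Python rather than awk, and it
assumes each trace record is six whitespace-separated integers in the
order PID ANID CNID RXQ CCPU LEN; the real output of the skb_sources
tracer may differ, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def tally_skb_nodes(lines):
    """Per PID, count skbs by (allocation node, consuming node)."""
    per_pid = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # not a trace record in the assumed layout
        try:
            pid, anid, cnid, rxq, ccpu, length = map(int, fields)
        except ValueError:
            continue  # header line or other non-numeric text
        per_pid[pid][(anid, cnid)] += 1
    return per_pid


def remote_fraction(counter):
    """Fraction of skbs that were allocated on a node other than the reader's."""
    total = sum(counter.values())
    remote = sum(n for (anid, cnid), n in counter.items() if anid != cnid)
    return remote / total if total else 0.0
```

A process with a high remote fraction is a candidate for rebinding with
numactl or taskset, as suggested above.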

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 16:55                           ` Andrew Gallatin
@ 2009-08-14 21:13                             ` Aviv Greenberg
  2009-08-20  7:26                               ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Aviv Greenberg @ 2009-08-14 21:13 UTC (permalink / raw)
  To: Andrew Gallatin; +Cc: Bill Fink, netdev

>  There may be something in the chipset

shooting in the dark: when you lspci -vvv and check the MaxPayload and
MaxReadReq values for the myri devices - what are the values and are
they equal? Are they the same on all your platforms?
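
To compare those values across many NICs, the lspci output can be
filtered with a short script.  This is a rough sketch that assumes the
usual "lspci -vvv" layout: a device header starting in column 0, and an
indented DevCtl line of the form "MaxPayload N bytes, MaxReadReq M bytes".
The function name is made up for illustration.

```python
import re

# Matches the DevCtl settings line; the DevCap line lacks MaxReadReq.
PAYLOAD_RE = re.compile(r"MaxPayload (\d+) bytes, MaxReadReq (\d+) bytes")


def max_payload_settings(lspci_text):
    """Map each device address to its (MaxPayload, MaxReadReq) in bytes."""
    settings = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]  # e.g. "03:00.0"
            continue
        m = PAYLOAD_RE.search(line)
        if m and current is not None:
            settings[current] = (int(m.group(1)), int(m.group(2)))
    return settings
```

Feeding it the saved output of lspci -vvv from each host would show at a
glance whether the Myricom devices agree with one another.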

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 20:44       ` Bill Fink
@ 2009-08-14 23:25         ` Neil Horman
  2009-08-20  7:50           ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-14 23:25 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote:
> On Fri, 7 Aug 2009, Neil Horman wrote:
> 
> > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > > On Fri, 7 Aug 2009, Neil Horman wrote:
> > > 
> > > > Your timing is impeccable!  I just posted a patch for an ftrace module to help
> > > > detect just these kinds of conditions:
> > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > > > 
> > > > Hope that helps you out
> > > > Neil
> > > 
> > > Thanks!  It could be helpful.  Do you have a pointer to documentation
> > > on how to use it?  And does it require the latest GIT kernel or could
> > > it possibly be used with a 2.6.29.6 kernel?
> > > 
> > > 						-Bill
> > 
> > It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> > much).
> 
> It doesn't look like I can apply your patches to my 2.6.29.6 kernel.
> 
> For starters, there's no include/trace/events directory, so there's
> no include/trace/events/skb.h.  There is an include/trace/skb.h file,
> but there's no TRACE_EVENT defined anywhere in the kernel.
> 
> I don't suppose it's as simple as defining (from include/linux/tracepoint.h
> from Linus's GIT tree):
> 
> #define PARAMS(args...) args
> 
> #define TRACE_EVENT(name, proto, args, struct, assign, print)   \
> 	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
> 
> So do you still think it's reasonable to try applying your patches
> to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
> or 2.6.31-rc6?
> 
> 						-Thanks
> 
> 						-Bill
> 
> 
> 
I thought the trace stuff went in around 2.6.29 but I might be mistaken.
Easiest thing to do likely would be to find where in the tree those were introduced
and just apply them prior to my patches, or move to the latest kernel if you
can (at least for the purposes of testing)

Neil



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 20:31                       ` Bill Fink
@ 2009-08-17 16:53                         ` Jesse Barnes
  2009-08-18  7:07                           ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Jesse Barnes @ 2009-08-17 16:53 UTC (permalink / raw)
  To: Bill Fink
  Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Fri, 14 Aug 2009 16:31:55 -0400
Bill Fink <billfink@mindspring.com> wrote:

> On Wed, 12 Aug 2009, Bill Fink wrote:
> 
> > On Tue, 11 Aug 2009, Brandeburg, Jesse wrote:
> > 
> > > Bill Fink wrote:
> > > > On Sat, 8 Aug 2009, Neil Horman wrote:
> > > > 
> > > >> On Sat, Aug 08, 2009 at 02:21:36PM -0400, Andrew Gallatin
> > > >> wrote:
> > > >>> Neil Horman wrote:
> > > >>>> On Sat, Aug 08, 2009 at 07:08:20AM -0400, Andrew Gallatin
> > > >>>> wrote:
> > > >>>>> Bill Fink wrote:
> > > >>>>>> On Fri, 07 Aug 2009, Andrew Gallatin wrote:
> > > >>>>>> 
> > > >>>>>>> Bill Fink wrote:
> > > >>>>>>> 
> > > >>>>>>>> All sysfs local_cpus values are the same
> > > >>>>>>>> (00000000,000000ff), so yes they are also wrong.
> > > 
> > > bill, I recently helped Jesse Barnes push a patch that addresses
> > > this kind of issue on CoreI7, the root cause was the numa_node
> > > variable was initialized based on slot on AMD systems, but needed
> > > to be set to -1 by default on systems with a uniform IOH to slot
> > > architecture.
> > > 
> > > here is the commit ID:
> > > http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git;a=commit;h=3c38d674be519109696746192943a6d524019f7f
> > > 
> > > I'm not sure it is in linus' tree yet, this link is to net-next
> > > 
> > > Maybe see if it helps?
> > 
> > It's worth a shot.
> > 
> > Hopefully I can get a chance to build a new kernel tomorrow to check
> > out some of the suggestions, like this one, the setting of
> > ACPI_DEBUG, and the new ftrace module for checking NUMA affinity of
> > skbs.
> 
> I applied this patch to my 2.6.29.6 kernel (from Fedora 11).
> 
> Now when I do:
> 
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
> 
> the numa_node for _all_ PCI devices is -1.

Yeah, that sounds right (indicates they're not really tied to a
specific node).

> When I do:
> 
> 	find /sys -name local_cpus -exec grep . {} /dev/null \;
> 
> I find that local_cpus is always 00000000,00000000.
> 
> Is that OK or should it be 00000000,000000ff (for my dual quad-core
> Xeon 5580 system with no hyperthreading)?

Hm, yeah it probably should have the full CPU mask...
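
A minimal sketch, assuming the usual sysfs cpumask layout (comma-separated
32-bit hex words, most significant word first), of how the two local_cpus
values quoted above translate into CPU numbers.  The function name is made
up for this illustration.

```python
def parse_cpumask(mask):
    """Return the sorted CPU numbers set in a sysfs-style hex cpumask."""
    # Drop the commas so the words form one big hexadecimal number;
    # bit N of that number stands for CPU N.
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# Mask seen before the patch: all eight cores of the dual quad-core box.
print(parse_cpumask("00000000,000000ff"))
# Mask seen after the patch: no CPUs at all.
print(parse_cpumask("00000000,00000000"))
```

An empty list for the second mask is what makes the post-patch value look
suspicious: no CPU is reported as local to the device.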

> Also, is it just not possible on this type of Intel Xeon system to
> properly associate the PCI devices with the nearest NUMA node?

All the PCI devices hang off the root complex, which is the same
distance to each node of memory (at least that's my understanding for
current platforms).

> In any event, the patch didn't help (or hurt).  The transmit
> performance remained at ~100 Gbps while the receive performance
> remained at 55 Gbps.

Maybe the other Jesse has some ideas here.

-- 
Jesse Barnes, Intel Open Source Technology Center


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-17 16:53                         ` Jesse Barnes
@ 2009-08-18  7:07                           ` Bill Fink
  2009-08-18 11:54                             ` Andrew Gallatin
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-18  7:07 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Brandeburg, Jesse, Neil Horman, Andrew Gallatin, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Mon, 17 Aug 2009 09:53:02 -0700, Jesse Barnes wrote:

> On Fri, 14 Aug 2009 16:31:55 -0400
> Bill Fink <billfink@mindspring.com> wrote:
> 
> Hm, yeah it probably should have the full CPU mask...
> 
> > Also, is it just not possible on this type of Intel Xeon system to
> > properly associate the PCI devices with the nearest NUMA node?
> 
> All the PCI devices hang off the root complex, which is the same
> distance to each node of memory (at least that's my understanding for
> current platforms).

I admit to being confused then.  The basic system architecture
of the SuperMicro system is:

      Memory----CPU1----QPI----CPU2----Memory
                  |              |
                  |              |
                 QPI            QPI
                  |              |
                  |              |
                5520----QPI----5520
                ||||           ||||
                ||||           ||||
                ||||           ||||
                PCIe           PCIe

It doesn't appear that a given PCIe device is equidistant to the
two nodes of memory.  It's one QPI hop to the "local" (same side)
node, and two QPI hops to the "remote" (far side) node.  But then
I don't know what a root complex is, and how it fits into the
system architecture above.
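
The hop counts above can be checked with a small sketch that treats the
diagram as a graph of QPI links and runs a breadth-first search.  The node
names (CPU1, CPU2, IOH1, IOH2) are labels invented for this illustration,
and memory is assumed to hang directly off its CPU, so the distance to a
memory node is the distance to the CPU that owns it.

```python
from collections import deque

# QPI links as drawn in the diagram; IOH1/IOH2 stand for the two 5520s.
QPI_LINKS = [
    ("CPU1", "CPU2"),
    ("CPU1", "IOH1"),
    ("CPU2", "IOH2"),
    ("IOH1", "IOH2"),
]


def qpi_hops(src, dst):
    """Breadth-first search: fewest QPI links between two components."""
    neighbours = {}
    for a, b in QPI_LINKS:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in sorted(neighbours[node] - seen):
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None  # not connected


for ioh in ("IOH1", "IOH2"):
    for cpu in ("CPU1", "CPU2"):
        print(ioh, "->", cpu, ":", qpi_hops(ioh, cpu), "QPI hop(s)")
```

Under these assumptions each I/O hub is one hop from its own CPU and two
hops from the other one, whichever way round the square the traffic goes.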

> > In any event, the patch didn't help (or hurt).  The transmit
> > performance remained at ~100 Gbps while the receive performance
> > remained at 55 Gbps.
> 
> Maybe the other Jesse has some ideas here.

Any and all ideas welcome.  I even considered the idea that maybe
instead of transferring 9000 bytes of payload, perhaps it was
transferring the next higher power of 2, namely 16384, since
bc told me that 9000/16384*100 was 54.9316.  But I tried a test
today with an MTU of 8000 and it didn't make any difference.
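
The arithmetic behind that guess, as a short sketch.  The helper names are
made up here; the only inputs are the 9000-byte and 8000-byte MTUs mentioned
above.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def useful_percent(mtu):
    """Share of a power-of-two sized transfer that is real payload."""
    return mtu / next_power_of_two(mtu) * 100


for mtu in (9000, 8000):
    print(mtu, next_power_of_two(mtu), round(useful_percent(mtu), 1))
```

If padding to a power of two were the cause, an 8000-byte MTU would pad
only to 8192 bytes and the ratio should have changed noticeably, which is
why the unchanged result argues against the idea.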

BTW here's a diff of an "lspci -vvvxxxx" on the better receive side
performing Asus system (<) versus on the SuperMicro system (>) for
one of the Myricom 10-GigE interfaces:

[root@xeontest1 ~]# diff -bw /tmp/foo2 /tmp/foo3
1c1
< 06:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
---
> 04:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
3c3
<         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
---
>       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+

I don't know what the ParErr- versus ParErr+ means.

5,9c5,9
<         Latency: 0, Cache Line Size: 64 bytes
<         Interrupt: pin A routed to IRQ 2277
<         Region 0: Memory at da000000 (64-bit, prefetchable) [size=16M]
<         Region 2: Memory at fa900000 (64-bit, non-prefetchable) [size=1M]
<         Expansion ROM at fa880000 [disabled] [size=512K]
---
>       Latency: 0, Cache Line Size: 256 bytes
>       Interrupt: pin A routed to IRQ 121
>       Region 0: Memory at f3000000 (64-bit, prefetchable) [size=16M]
>       Region 2: Memory at fa300000 (64-bit, non-prefetchable) [size=1M]
>       Expansion ROM at fa280000 [disabled] [size=512K]
11c11
<                 Address: 00000000fee0400c  Data: 4183
---
>               Address: 00000000fee00000  Data: 40cc
45c45
<         Capabilities: [1a8] Device Serial Number b6-be-46-ff-ff-dd-60-00
---
>       Capabilities: [1a8] Device Serial Number 88-be-46-ff-ff-dd-60-00

I don't see much difference other than a larger Cache Line Size
on the SuperMicro system.

47,48c47,48
< 00: c1 14 08 00 06 05 10 00 01 00 00 02 10 00 00 00
< 10: 0c 00 00 da 00 00 00 00 04 00 90 fa 00 00 00 00
---
> 00: c1 14 08 00 46 05 10 00 01 00 00 02 40 00 00 00
> 10: 0c 00 00 f3 00 00 00 00 04 00 30 fa 00 00 00 00
50,52c50,52
< 30: 00 00 88 fa 44 00 00 00 00 00 00 00 0b 01 00 00
< 40: 00 00 00 00 05 54 81 00 0c 40 e0 fe 00 00 00 00
< 50: 83 41 00 00 01 5c 03 00 00 20 00 64 10 a0 02 00
---
> 30: 00 00 28 fa 44 00 00 00 00 00 00 00 0e 01 00 00
> 40: 00 00 00 00 05 54 81 00 00 00 e0 fe 00 00 00 00
> 50: cc 40 00 00 01 5c 03 00 00 20 00 64 10 a0 02 00
73c73
< 1a0: 00 00 00 00 00 00 00 00 03 00 01 00 b6 be 46 ff
---
> 1a0: 00 00 00 00 00 00 00 00 03 00 01 00 88 be 46 ff
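
A sketch of how the first row of each hex dump lines up with the decoded
lspci text, assuming the standard PCI configuration header layout (Command
register at offset 0x04, Cache Line Size at offset 0x0c, counted in 32-bit
words).  The byte strings are copied from the "00:" rows of the diff; the
function and key names are made up for this illustration.

```python
# Offsets 0x00-0x0f of config space, from the "00:" rows in the diff.
ASUS = bytes.fromhex("c1 14 08 00 06 05 10 00 01 00 00 02 10 00 00 00")
SUPERMICRO = bytes.fromhex("c1 14 08 00 46 05 10 00 01 00 00 02 40 00 00 00")


def decode(header):
    """Decode the few header fields that differ between the two hosts."""
    command = int.from_bytes(header[0x04:0x06], "little")
    return {
        "vendor_id": hex(int.from_bytes(header[0x00:0x02], "little")),
        "command": hex(command),
        # Command register bit 6 is Parity Error Response (lspci: ParErr).
        "parity_error_response": bool(command & 0x0040),
        # The Cache Line Size register holds a count of 32-bit words.
        "cache_line_bytes": header[0x0C] * 4,
    }


print("Asus      ", decode(ASUS))
print("SuperMicro", decode(SUPERMICRO))
```

Read this way, the raw bytes agree with the decoded text: the only
differences in this row are the ParErr bit and the cache line size
(0x10 words versus 0x40 words).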

And here's part of the dmesg output on the Asus system:

myri10ge: Version 1.4.3-1.358
myri10ge 0000:06:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
myri10ge 0000:06:00.0: setting latency timer to 64
mtrr: type mismatch for da000000,1000000 old: write-back new: write-combining
firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:06:00.0: Not enabling ECRC on non-root port 0000:05:02.0
firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:06:00.0: MSI IRQ 2282, tx bndry 4096, fw myri10ge_eth_z8e.dat, WC Disabled

And on the SuperMicro system:

myri10ge: Version 1.4.4-1.401
  alloc irq_desc for 35 on cpu 0 node 0
  alloc kstat_irqs on cpu 0 node 0
myri10ge 0000:04:00.0: PCI INT A -> GSI 35 (level, low) -> IRQ 35
myri10ge 0000:04:00.0: setting latency timer to 64
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:04:00.0: Not enabling ECRC on non-root port 0000:03:02.0
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
  alloc irq_desc for 112 on cpu 0 node 0
  alloc kstat_irqs on cpu 0 node 0
myri10ge 0000:04:00.0: irq 112 for MSI/MSI-X
myri10ge 0000:04:00.0: MSI IRQ 112, tx bndry 4096, fw myri10ge_eth_z8e.dat, WC Enabled
  alloc irq_desc for 24 on cpu 0 node 0
  alloc kstat_irqs on cpu 0 node 0

Interestingly, the "WC Enabled" is only indicated on the first two
10-GigE interfaces and disabled on the other ten.  For the Asus system
it indicates "WC Disabled" on all the interfaces, but also has that
earlier bit about "old: write-back new: write-combining", which doesn't
appear on the SuperMicro system (although that is using a slightly
newer version of the myri10ge driver).

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-18  7:07                           ` Bill Fink
@ 2009-08-18 11:54                             ` Andrew Gallatin
  2009-08-19 17:59                               ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Gallatin @ 2009-08-18 11:54 UTC (permalink / raw)
  To: Bill Fink
  Cc: Jesse Barnes, Brandeburg, Jesse, Neil Horman, Brice Goglin,
	Linux Network Developers, Yinghai Lu

Bill Fink wrote:

> <         Latency: 0, Cache Line Size: 64 bytes

<...>

>>       Latency: 0, Cache Line Size: 256 bytes


A cache line size of 256 clearly seems wrong for a Xeon.  I assume all
devices on the SuperMicro show the same value?

> Interestingly, the "WC Enabled" is only indicated on the first two

The WC is probably a red herring.

What does ethtool -S show for the DMA write bandwidth of the
NICs on the SuperMicro?

These values are obtained serially, as the driver resets
the NIC (reset happens at load time, and ifconfig up),
so they could easily sum to more than the memory bandwidth
of the system.  But it would be good to check for any anomalies.

I can send you a pointer to a tool we use internally, which loads
some custom firmware on the NIC, and can exercise the DMA engines
on all the NICs in parallel.  This would give an idea of the
aggregate DMA bandwidth available on the system.  Let me know
if you're interested.

Drew


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-18 11:54                             ` Andrew Gallatin
@ 2009-08-19 17:59                               ` Bill Fink
  0 siblings, 0 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-19 17:59 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Jesse Barnes, Brandeburg, Jesse, Neil Horman, Brice Goglin,
	Linux Network Developers, Yinghai Lu

On Tue, 18 Aug 2009, Andrew Gallatin wrote:

> Bill Fink wrote:
> 
> > <         Latency: 0, Cache Line Size: 64 bytes
> 
> <...>
> 
> >>       Latency: 0, Cache Line Size: 256 bytes
> 
> 
> A cache line size of 256 clearly seems wrong for a Xeon.  I assume all
> devices on the SuperMicro show the same value?

I forgot to check that.

> > Interestingly, the "WC Enabled" is only indicated on the first two
> 
> The WC is probably a red herring.
> 
> What does ethtool -S show for the DMA write bandwidth of the
> NICs on the SuperMicro?

I've attached the full "ethtool -S" output from both the Asus and
SuperMicro systems.  Here's just the bandwidth info:

Asus eth2:

[root@i7test1 ~]# ethtool -S eth2
NIC statistics:
...
     read_dma_bw_MBs: 1625
     write_dma_bw_MBs: 1599
     read_write_dma_bw_MBs: 3192

SuperMicro eth2 (on 5520 connected to NUMA node 1):

[root@xeontest1 ~]# ethtool -S eth2
NIC statistics:
...
     read_dma_bw_MBs: 1624
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 1323

SuperMicro eth8 (on 5520 connected to NUMA node 0):

[root@xeontest1 ~]# ethtool -S eth8
NIC statistics:
...
     read_dma_bw_MBs: 1572
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 2113
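
One rough way to put the three excerpts side by side, sketched below: pull
the *_dma_bw_MBs counters out of the ethtool text and compare the
concurrent read/write figure with the sum of the two one-way figures.  That
ratio is an ad hoc indicator chosen for this illustration, not something
the driver reports, and the helper names are made up.

```python
import re

# Bandwidth lines copied from the three excerpts above.
SAMPLES = {
    "Asus eth2": """
     read_dma_bw_MBs: 1625
     write_dma_bw_MBs: 1599
     read_write_dma_bw_MBs: 3192
""",
    "SuperMicro eth2": """
     read_dma_bw_MBs: 1624
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 1323
""",
    "SuperMicro eth8": """
     read_dma_bw_MBs: 1572
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 2113
""",
}


def dma_bw(ethtool_output):
    """Extract the *_dma_bw_MBs counters from `ethtool -S` text."""
    pairs = re.findall(r"^\s*(\w+_dma_bw_MBs):\s*(\d+)\s*$",
                       ethtool_output, re.M)
    return {name: int(value) for name, value in pairs}


def concurrency_ratio(stats):
    """Concurrent read+write bandwidth relative to read plus write."""
    one_way = stats["read_dma_bw_MBs"] + stats["write_dma_bw_MBs"]
    return stats["read_write_dma_bw_MBs"] / one_way


for name, text in SAMPLES.items():
    stats = dma_bw(text)
    print(f"{name:16s} {concurrency_ratio(stats):.2f}")
```

The one-way numbers are nearly identical on all three interfaces; it is
only the concurrent figure that separates the Asus from the SuperMicro.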

> These values are obtained serially, as the driver resets
> the NIC (reset happens at load time, and ifconfig up),
> so they could easily sum to more than the memory bandwidth
> of the system.  But it would be good to check for any anomalies.
> 
> I can send you a pointer to a tool we use internally, which loads
> some custom firmware on the NIC, and can exercise the DMA engines
> on all the NICs in parallel.  This would give an idea of the
> aggregate DMA bandwidth available on the system.  Let me know
> if you're interested.

Yes, I'd be interested.

						-Thanks

						-Bill



Full ethtool output:
--------------------------------------------------------------------------------

Asus eth2:

[root@i7test1 ~]# ethtool -S eth2
NIC statistics:
     rx_packets: 4
     tx_packets: 10
     rx_bytes: 240
     tx_bytes: 708
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_boundary: 4096
     WC: 0
     irq: 2282
     MSI: 1
     MSIX: 0
     read_dma_bw_MBs: 1625
     write_dma_bw_MBs: 1599
     read_write_dma_bw_MBs: 3192
     serial_number: 356055
     watchdog_resets: 0
     link_changes: 6
     link_up: 1
     dropped_link_overflow: 0
     dropped_link_error_or_filtered: 631516
     dropped_pause: 631516
     dropped_bad_phy: 0
     dropped_bad_crc32: 0
     dropped_unicast_filtered: 0
     dropped_multicast_filtered: 11
     dropped_runt: 0
     dropped_overrun: 0
     dropped_no_small_buffer: 0
     dropped_no_big_buffer: 0
     ----------- slice ---------: 0
     tx_pkt_start: 421736
     tx_pkt_done: 421736
     tx_req: 2866189
     tx_done: 2866189
     rx_small_cnt: 257731
     rx_big_cnt: 3830824
     wake_queue: 5698
     stop_queue: 5698
     tx_linearized: 0
     LRO aggregated: 1276950
     LRO flushed: 264545
     LRO avg aggr: 4
     LRO no_desc: 0

SuperMicro eth2 (on 5520 connected to NUMA node 1):

[root@xeontest1 ~]# ethtool -S eth2
NIC statistics:
     rx_packets: 0
     tx_packets: 10
     rx_bytes: 0
     tx_bytes: 708
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_boundary: 4096
     WC: 0
     irq: 112
     MSI: 1
     MSIX: 0
     read_dma_bw_MBs: 1624
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 1323
     serial_number: 363134
     watchdog_resets: 0
     dca_capable_firmware: 1
     dca_device_present: 0
     link_changes: 2
     link_up: 1
     dropped_link_overflow: 0
     dropped_link_error_or_filtered: 200
     dropped_pause: 200
     dropped_bad_phy: 0
     dropped_bad_crc32: 0
     dropped_unicast_filtered: 0
     dropped_multicast_filtered: 0
     dropped_runt: 0
     dropped_overrun: 0
     dropped_no_small_buffer: 0
     dropped_no_big_buffer: 0
     ----------- slice ---------: 0
     tx_pkt_start: 440223
     tx_pkt_done: 440223
     tx_req: 3412102
     tx_done: 3412102
     rx_small_cnt: 213976
     rx_big_cnt: 3071854
     wake_queue: 1846
     stop_queue: 1846
     tx_linearized: 0
     LRO aggregated: 1024029
     LRO flushed: 269709
     LRO avg aggr: 3
     LRO no_desc: 0

SuperMicro eth8 (on 5520 connected to NUMA node 0):

[root@xeontest1 ~]# ethtool -S eth8
NIC statistics:
     rx_packets: 11
     tx_packets: 16
     rx_bytes: 864
     tx_bytes: 1228
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_boundary: 4096
     WC: 0
     irq: 118
     MSI: 1
     MSIX: 0
     read_dma_bw_MBs: 1572
     write_dma_bw_MBs: 1605
     read_write_dma_bw_MBs: 2113
     serial_number: 361233
     watchdog_resets: 0
     dca_capable_firmware: 1
     dca_device_present: 0
     link_changes: 4
     link_up: 1
     dropped_link_overflow: 0
     dropped_link_error_or_filtered: 224
     dropped_pause: 224
     dropped_bad_phy: 0
     dropped_bad_crc32: 0
     dropped_unicast_filtered: 0
     dropped_multicast_filtered: 0
     dropped_runt: 0
     dropped_overrun: 0
     dropped_no_small_buffer: 0
     dropped_no_big_buffer: 0
     ----------- slice ---------: 0
     tx_pkt_start: 575354
     tx_pkt_done: 575354
     tx_req: 3590761
     tx_done: 3590761
     rx_small_cnt: 227078
     rx_big_cnt: 4733499
     wake_queue: 2199
     stop_queue: 2199
     tx_linearized: 0
     LRO aggregated: 1578229
     LRO flushed: 404901
     LRO avg aggr: 3
     LRO no_desc: 0


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 21:13                             ` Aviv Greenberg
@ 2009-08-20  7:26                               ` Bill Fink
  2009-08-20 13:14                                 ` Ben Hutchings
  2009-08-20 13:17                                 ` Aviv Greenberg
  0 siblings, 2 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-20  7:26 UTC (permalink / raw)
  To: Aviv Greenberg; +Cc: Andrew Gallatin, netdev

On Sat, 15 Aug 2009, Aviv Greenberg wrote:

> >  There may be something in the chipset
> 
> shooting in the dark: when you lspci -vvv and check the MaxPayload and
> MaxReadReq values for the myri devices - what are the values and are
> they equal? Are they the same on all your platforms?

IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
and was the same on both the Asus and SuperMicro systems.  I will
doublecheck tomorrow at work.  I am not clear on the meanings of
the different parameters.  And is DevCtl for PCI control messages
and DevCap for actual data transfers or something else?

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-14 23:25         ` Neil Horman
@ 2009-08-20  7:50           ` Bill Fink
  2009-08-20 20:19             ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-20  7:50 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 14 Aug 2009, Neil Horman wrote:

> On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote:
> > On Fri, 7 Aug 2009, Neil Horman wrote:
> > 
> > > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > > > On Fri, 7 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > Your timing is impeccable!  I just posted a patch for an ftrace module to help
> > > > > detect just these kinds of conditions:
> > > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > > > > 
> > > > > Hope that helps you out
> > > > > Neil
> > > > 
> > > > Thanks!  It could be helpful.  Do you have a pointer to documentation
> > > > on how to use it?  And does it require the latest GIT kernel or could
> > > > it possibly be used with a 2.6.29.6 kernel?
> > > > 
> > > > 						-Bill
> > > 
> > > It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> > > much).
> > 
> > It doesn't look like I can apply your patches to my 2.6.29.6 kernel.
> > 
> > For starters, there's no include/trace/events directory, so there's
> > no include/trace/events/skb.h.  There is an include/trace/skb.h file,
> > but there's no TRACE_EVENT defined anywhere in the kernel.
> > 
> > I don't suppose it's as simple as defining (from include/linux/tracepoint.h
> > from Linus's GIT tree):
> > 
> > #define PARAMS(args...) args
> > 
> > #define TRACE_EVENT(name, proto, args, struct, assign, print)   \
> > 	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
> > 
> > So do you still think it's reasonable to try applying your patches
> > to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
> > or 2.6.31-rc6?
> > 
> > 						-Thanks
> > 
> > 						-Bill
> > 
> > 
> > 
> I thought the trace stuff went in around 2.6.29 but I might be mistaken.
> Easiest thing to do likely would be to find where in the tree those were introduced
> and just apply them prior to my patches, or move to the latest kernel if you
> can (at least for the purposes of testing)

I finally got a 2.6.31-rc6 kernel built and had some limited success
with your ftrace patches.  Doing some simple ping tests I was able to
verify that everything was mostly as expected regarding CPU and NUMA
memory affinity, with one weird exception.  eth2 through eth7, which
all connect to the 5520 I/O Hub that connects to NUMA node 1, all
correctly showed their allocations and consumptions on NUMA node 1.
eth8 through eth13 are all connected to the 5520 I/O Hub that connects
to NUMA node 0, and eth9 through eth13 all correctly reflected that
on the ping ftrace tests.  But eth8 showed its allocations being
done on NUMA node 1 instead of the expected NUMA node 0, which just
doesn't make sense since eth8 and eth9 are part of a dual-port 10-GigE
Myricom NIC (and I doublechecked that all the IRQ assignments were
correct).

When I tried an actual nuttcp performance test, even when rate limiting
to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
crashdump via kexec/kdump, but the kexec kernel, instead of just
generating a crashdump, fully booted the new kernel, which was
extremely sluggish until I rebooted it through a BIOS re-init,
and never produced a crashdump.  I tried this several times and
an immediate kernel oops was always the result (with either a TCP
or UDP test).  A ping test of 1000 9000-byte packets with an interval
of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
worked just fine.

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-20  7:26                               ` Bill Fink
@ 2009-08-20 13:14                                 ` Ben Hutchings
  2009-08-21  4:00                                   ` Bill Fink
  2009-08-20 13:17                                 ` Aviv Greenberg
  1 sibling, 1 reply; 95+ messages in thread
From: Ben Hutchings @ 2009-08-20 13:14 UTC (permalink / raw)
  To: Bill Fink; +Cc: Aviv Greenberg, Andrew Gallatin, netdev

On Thu, 2009-08-20 at 03:26 -0400, Bill Fink wrote:
> On Sat, 15 Aug 2009, Aviv Greenberg wrote:
> 
> > >  There may be something in the chipset
> > 
> > shooting in the dark: when you lspci -vvv and check the MaxPayload and
> > MaxReadReq values for the myri devices - what are the values and are
> > they equal? Are they the same on all your platforms?
> 
> IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> and was the same on both the Asus and SuperMicro systems.  I will
> doublecheck tomorrow at work.  I am not clear on the meanings of
> the different parameters.  And is DevCtl for PCI control messages
> and DevCap for actual data transfers or something else?

DevCap is the capability register, which is read-only; DevCtl is the
control register which holds the actual settings.

MaxPayload is the MTU and MRU for PCIe packets.  Each sub-tree of
devices connected to a single PCIe root port needs to have MaxPayload
set consistently.  MaxReadReq is the maximum size of any DMA read
request.  It is a per-device setting (or possibly per-function; I
forget).  It can be much larger than MaxPayload since read completions
can be fragmented.
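
A minimal sketch of how those fields are encoded, assuming the PCI Express
register layout: Max_Payload_Size in DevCtl bits 7:5, Max_Read_Request_Size
in DevCtl bits 14:12, and the supported payload size in DevCap bits 2:0,
each stored as a 3-bit code meaning 128 << code bytes.  The register values
below are made up to match the sizes reported earlier in the thread; they
were not read from the actual devices.

```python
def decode_size(code):
    """PCIe size encoding: 0 -> 128 bytes, 1 -> 256, ... 5 -> 4096."""
    if not 0 <= code <= 5:
        raise ValueError("reserved encoding: %d" % code)
    return 128 << code


def decode_devctl(devctl):
    """Current settings held in the Device Control register."""
    return {
        "MaxPayload": decode_size((devctl >> 5) & 0x7),    # bits 7:5
        "MaxReadReq": decode_size((devctl >> 12) & 0x7),   # bits 14:12
    }


def decode_devcap_max_payload(devcap):
    """Largest payload the device supports (Device Capabilities)."""
    return decode_size(devcap & 0x7)                       # bits 2:0


# Hypothetical values: DevCtl 0x5000 = MaxPayload 128, MaxReadReq 4096;
# DevCap low bits 0x5 = MaxPayload 4096 supported.
print(decode_devctl(0x5000))
print(decode_devcap_max_payload(0x5))
```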

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-20  7:26                               ` Bill Fink
  2009-08-20 13:14                                 ` Ben Hutchings
@ 2009-08-20 13:17                                 ` Aviv Greenberg
  1 sibling, 0 replies; 95+ messages in thread
From: Aviv Greenberg @ 2009-08-20 13:17 UTC (permalink / raw)
  To: Bill Fink; +Cc: Andrew Gallatin, netdev

On Thu, Aug 20, 2009 at 10:26, Bill Fink<billfink@mindspring.com> wrote:
> IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> and was the same on both the Asus and SuperMicro systems.  I will
> doublecheck tomorrow at work.  I am not clear on the meanings of
> the different parameters.  And is DevCtl for PCI control messages
> and DevCap for actual data transfers or something else?

IIRC DevCap is what the device is capable of, and DevCtl is a control
register that is used to limit the device's PCIe MTU if needed (e.g
chipset limit). MaxPayload is the one used for RX DMA writes, and 128
bytes might be too low. I suggest you double check that.
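
To give a feel for why 128 bytes might be low, here is a back-of-envelope
sketch of link efficiency for DMA writes.  The 24 bytes of per-packet
overhead is an assumption made for this illustration (framing, sequence
number, a 4-DW header and LCRC); the real figure depends on the header type
and link generation, and the sketch ignores flow-control and ACK traffic.

```python
TLP_OVERHEAD_BYTES = 24  # assumed per-packet cost, see the note above


def write_efficiency(max_payload, overhead=TLP_OVERHEAD_BYTES):
    """Fraction of link bytes that carry payload at a given MaxPayload."""
    return max_payload / (max_payload + overhead)


for size in (128, 256, 512):
    print(size, "bytes:", round(write_efficiency(size) * 100, 1), "%")
```

Under that assumption, going from 128 to 256 bytes recovers only a few
percent of link capacity, so MaxPayload alone would not account for a
receive rate that is roughly half the transmit rate.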

You have to first figure out if your performance is limited by PCIe
bandwidth, or due to the NUMA stuff.

-- 

Stephen Leacock  - "I detest life-insurance agents: they always argue
that I shall some day die, which is not so." -
http://www.brainyquote.com/quotes/authors/s/stephen_leacock.html


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-20  7:50           ` Bill Fink
@ 2009-08-20 20:19             ` Neil Horman
  2009-08-21  4:14               ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-20 20:19 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> On Fri, 14 Aug 2009, Neil Horman wrote:
> 
> > On Fri, Aug 14, 2009 at 04:44:12PM -0400, Bill Fink wrote:
> > > On Fri, 7 Aug 2009, Neil Horman wrote:
> > > 
> > > > On Fri, Aug 07, 2009 at 08:54:42PM -0400, Bill Fink wrote:
> > > > > On Fri, 7 Aug 2009, Neil Horman wrote:
> > > > > 
> > > > > > Your timing is impeccable!  I just posted a patch for an ftrace module to help
> > > > > > detect just these kinds of conditions:
> > > > > > http://marc.info/?l=linux-netdev&m=124967650218846&w=2
> > > > > > 
> > > > > > Hope that helps you out
> > > > > > Neil
> > > > > 
> > > > > Thanks!  It could be helpful.  Do you have a pointer to documentation
> > > > > on how to use it?  And does it require the latest GIT kernel or could
> > > > > it possibly be used with a 2.6.29.6 kernel?
> > > > > 
> > > > > 						-Bill
> > > > 
> > > > It should apply to 2.6.29.6 no problem (might take a little massaging, but not
> > > > much).
> > > 
> > > It doesn't look like I can apply your patches to my 2.6.29.6 kernel.
> > > 
> > > For starters, there's no include/trace/events directory, so there's
> > > no include/trace/events/skb.h.  There is an include/trace/skb.h file,
> > > but there's no TRACE_EVENT defined anywhere in the kernel.
> > > 
> > > I don't suppose it's as simple as defining (from include/linux/tracepoint.h
> > > from Linus's GIT tree):
> > > 
> > > #define PARAMS(args...) args
> > > 
> > > #define TRACE_EVENT(name, proto, args, struct, assign, print)   \
> > > 	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
> > > 
> > > So do you still think it's reasonable to try applying your patches
> > > to my 2.6.29.6 kernel, or should I get a newer kernel like 2.6.30.4
> > > or 2.6.31-rc6?
> > > 
> > > 						-Thanks
> > > 
> > > 						-Bill
> > > 
> > > 
> > > 
> > I thought the trace stuff went in around 2.6.29 but I might be mistaken.
> > Easiest thing to do likely would be to find where in the tree those were introduced
> > and just apply them prior to my patches, or move to the latest kernel if you
> > can (at least for the purposes of testing)
> 
> I finally got a 2.6.31-rc6 kernel built and had some limited success
> with your ftrace patches.  Doing some simple ping tests I was able to
> verify that everything was mostly as expected regarding CPU and NUMA
> memory affinity, with one weird exception.  eth2 through eth7, which
> all connect to the 5520 I/O Hub that connects to NUMA node 1, all
> correctly showed their allocations and consumptions on NUMA node 1.
> eth8 through eth13 are all connected to the 5520 I/O Hub that connects
> to NUMA node 0, and eth9 through eth13 all correctly reflected that
> on the ping ftrace tests.  But eth8 showed its allocations being
> done on NUMA node 1 instead of the expected NUMA node 0, which just
> doesn't make sense since eth8 and eth9 are part of a dual-port 10-GigE
> Myricom NIC (and I doublechecked that all the IRQ assignments were
> correct).
> 
Hmm, memory pressure on node zero causing netdev_alloc_skb to allocate on a
remote node perhaps?

> When I tried an actual nuttcp performance test, even when rate limiting
> to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> crashdump via kexec/kdump, but the kexec kernel, instead of just
> generating a crashdump, fully booted the new kernel, which was
> extremely sluggish until I rebooted it through a BIOS re-init,
> and never produced a crashdump.  I tried this several times and
> an immediate kernel oops was always the result (with either a TCP
> or UDP test).  A ping test of 1000 9000-byte packets with an interval
> of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> worked just fine.
> 
The sluggishness is expected, since the kdump kernel operates out of such
limited memory.  I don't know why you booted to a full system rather than doing
a crash recovery.  I don't suppose you got a backtrace, did you?

Neil


> 						-Thanks
> 
> 						-Bill
> 


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-20 13:14                                 ` Ben Hutchings
@ 2009-08-21  4:00                                   ` Bill Fink
  0 siblings, 0 replies; 95+ messages in thread
From: Bill Fink @ 2009-08-21  4:00 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Aviv Greenberg, Andrew Gallatin, netdev

On Thu, 20 Aug 2009, Ben Hutchings wrote:

> On Thu, 2009-08-20 at 03:26 -0400, Bill Fink wrote:
> > On Sat, 15 Aug 2009, Aviv Greenberg wrote:
> > 
> > > >  There may be something in the chipset
> > > 
> > > shooting in the dark: when you lspci -vvv and check the MaxPayload and
> > > MaxReadReq values for the myri devices - what are the values and are
> > > they equal? Are they the same on all your platforms?
> > 
> > IIRC, under DevCap they indicated MaxPayload 4096 bytes, and under
> > DevCtl they indicated MaxPayload 128 bytes and MaxReadReq 4096 bytes,
> > and was the same on both the Asus and SuperMicro systems.  I will
> > doublecheck tomorrow at work.  I am not clear on the meanings of
> > the different parameters.  And is DevCtl for PCI control messages
> > and DevCap for actual data transfers or something else?
> 
> DevCap is the capability register, which is read-only; DevCtl is the
> control register which holds the actual settings.
> 
> MaxPayload is the MTU and MRU for PCIe packets.  Each sub-tree of
> devices connected to a single PCIe root port needs to have MaxPayload
> set consistently.  MaxReadReq is the maximum size of any DMA read
> request.  It is a per-device setting (or possibly per-function; I
> forget).  It can be much larger than MaxPayload since read completions
> can be fragmented.

Thanks for the explanation.  I saw a BIOS setting that allowed
increasing the MaxPayload from 128 bytes to 256 bytes, and then
verified that an "lspci -vvv" then showed the DevCtl MaxPayload
to be 256 bytes.  Unfortunately, it didn't improve the receive
side performance at all.
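
For reference, the DevCtl values that lspci prints are decoded from two bit
fields of the PCIe Device Control register: Max_Payload_Size in bits 7:5 and
Max_Read_Request_Size in bits 14:12, both encoded as 128 bytes shifted left
by the field value.  The following is only a minimal userspace sketch of that
decoding; the macro and function names are made up for illustration, and the
register values in the usage note are constructed examples, not dumps from
the systems discussed here.

```c
#include <stdint.h>

/* Field masks for the PCIe Device Control register (names are illustrative). */
#define DEVCTL_PAYLOAD_MASK  0x00e0u	/* bits 7:5, Max_Payload_Size */
#define DEVCTL_PAYLOAD_SHIFT 5
#define DEVCTL_READRQ_MASK   0x7000u	/* bits 14:12, Max_Read_Request_Size */
#define DEVCTL_READRQ_SHIFT  12

/*
 * Both fields share one encoding: 0 -> 128 bytes, 1 -> 256, ... 5 -> 4096.
 * Codes 6 and 7 are reserved, so 0 is returned for them.
 */
unsigned int pcie_size_from_code(unsigned int code)
{
	return code <= 5 ? 128u << code : 0;
}

/* MaxPayload currently programmed in DevCtl, in bytes. */
unsigned int devctl_max_payload(uint16_t devctl)
{
	return pcie_size_from_code((devctl & DEVCTL_PAYLOAD_MASK) >>
				   DEVCTL_PAYLOAD_SHIFT);
}

/* MaxReadReq currently programmed in DevCtl, in bytes. */
unsigned int devctl_max_read_req(uint16_t devctl)
{
	return pcie_size_from_code((devctl & DEVCTL_READRQ_MASK) >>
				   DEVCTL_READRQ_SHIFT);
}
```

With a constructed register value of 0x5000, devctl_max_payload() gives 128
and devctl_max_read_req() gives 4096; setting the payload code to 1 (0x5020)
gives a MaxPayload of 256, which is the kind of change the BIOS setting made.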

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-20 20:19             ` Neil Horman
@ 2009-08-21  4:14               ` Bill Fink
  2009-08-21 15:23                 ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-21  4:14 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Thu, 20 Aug 2009, Neil Horman wrote:

> On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> 
> > When I tried an actual nuttcp performance test, even when rate limiting
> > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > generating a crashdump, fully booted the new kernel, which was
> > extremely sluggish until I rebooted it through a BIOS re-init,
> > and never produced a crashdump.  I tried this several times and
> > an immediate kernel oops was always the result (with either a TCP
> > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > worked just fine.
> 
> The sluggishness is expected, since the kdump kernel operates out of such
> limited memory.  Don't know why you booted to a full system rather than doing a
> crash recovery.  Don't suppose you got a backtrace, did you?

There was a backtrace on the screen but I didn't have a chance to
record it.  BTW did anyone ever think to print the backtrace in
reverse (first to some reserved memory and then output to the display)
so the more interesting parts wouldn't have scrolled off the top of
the screen?

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-21  4:14               ` Bill Fink
@ 2009-08-21 15:23                 ` Neil Horman
  2009-08-21 15:36                   ` Andrew Gallatin
  2009-08-26  7:10                   ` Bill Fink
  0 siblings, 2 replies; 95+ messages in thread
From: Neil Horman @ 2009-08-21 15:23 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> On Thu, 20 Aug 2009, Neil Horman wrote:
> 
> > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > 
> > > When I tried an actual nuttcp performance test, even when rate limiting
> > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > generating a crashdump, fully booted the new kernel, which was
> > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > and never produced a crashdump.  I tried this several times and
> > > an immediate kernel oops was always the result (with either a TCP
> > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > worked just fine.
> > 
> > The sluggishness is expected, since the kdump kernel operates out of such
> > limited memory.  Don't know why you booted to a full system rather than doing a
> > crash recovery.  Don't suppose you got a backtrace, did you?
> 
> There was a backtrace on the screen but I didn't have a chance to
> record it.  BTW did anyone ever think to print the backtrace in
> reverse (first to some reserved memory and then output to the display)
> so the more interesting parts wouldn't have scrolled off the top of
> the screen?
> 
The real solution is to use a console to which the output doesn't scroll off the
screen.  Normally people use a serial console they can log, or a RAC card that
they can record. Even on a regular vga monitor in text mode, you can set up the
vt iirc to allow for scrolling.

Neil

> 						-Bill
> 


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-21 15:23                 ` Neil Horman
@ 2009-08-21 15:36                   ` Andrew Gallatin
  2009-08-26  7:10                   ` Bill Fink
  1 sibling, 0 replies; 95+ messages in thread
From: Andrew Gallatin @ 2009-08-21 15:36 UTC (permalink / raw)
  To: Neil Horman; +Cc: Bill Fink, Linux Network Developers, brice

Neil Horman wrote:
> On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
>> On Thu, 20 Aug 2009, Neil Horman wrote:
>>
>>> On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
>>>
>>>> When I tried an actual nuttcp performance test, even when rate limiting
>>>> to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
>>>> crashdump via kexec/kdump, but the kexec kernel, instead of just
>>>> generating a crashdump, fully booted the new kernel, which was
>>>> extremely sluggish until I rebooted it through a BIOS re-init,
>>>> and never produced a crashdump.  I tried this several times and
>>>> an immediate kernel oops was always the result (with either a TCP
>>>> or UDP test).  A ping test of 1000 9000-byte packets with an interval
>>>> of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
>>>> worked just fine.
>>> The sluggishness is expected, since the kdump kernel operates out of such
>>> limited memory.  Don't know why you booted to a full system rather than doing a
>>> crash recovery.  Don't suppose you got a backtrace, did you?
>> There was a backtrace on the screen but I didn't have a chance to
>> record it.  BTW did anyone ever think to print the backtrace in
>> reverse (first to some reserved memory and then output to the display)
>> so the more interesting parts wouldn't have scrolled off the top of
>> the screen?
>>
> The real solution is to use a console to which the output doesn't scroll off the
> screen.  Normally people use a serial console they can log, or a RAC card that
> they can record. Even on a regular vga monitor in text mode, you can set up the
> vt iirc to allow for scrolling.

Indeed.  Another option, when setting up a serial console is not practical,
is netconsole.  I've captured a few panics this way on machines like
Macs, which had no serial port support (at the time).

Drew


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-21 15:23                 ` Neil Horman
  2009-08-21 15:36                   ` Andrew Gallatin
@ 2009-08-26  7:10                   ` Bill Fink
  2009-08-26 11:00                     ` Neil Horman
  1 sibling, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-26  7:10 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Fri, 21 Aug 2009, Neil Horman wrote:

> On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > On Thu, 20 Aug 2009, Neil Horman wrote:
> > 
> > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > 
> > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > generating a crashdump, fully booted the new kernel, which was
> > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > and never produced a crashdump.  I tried this several times and
> > > > an immediate kernel oops was always the result (with either a TCP
> > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > worked just fine.
> > > 
> > > The sluggishness is expected, since the kdump kernel operates out of such
> > > limited memory.  Don't know why you booted to a full system rather than doing a
> > > crash recovery.  Don't suppose you got a backtrace, did you?
> > 
> > There was a backtrace on the screen but I didn't have a chance to
> > record it.  BTW did anyone ever think to print the backtrace in
> > reverse (first to some reserved memory and then output to the display)
> > so the more interesting parts wouldn't have scrolled off the top of
> > the screen?
> > 
> The real solution is to use a console to which the output doesn't scroll off the
> screen.  Normally people use a serial console they can log, or a RAC card that
> they can record. Even on a regular vga monitor in text mode, you can set up the
> vt iirc to allow for scrolling.

None of our Asus P6T6 systems have serial consoles.  I don't know of
any RAC cards for them either, nor are there spare PCI slots available
in many cases.  I wouldn't think the Shift-PageUp trick would work
with a crashed kernel, but I admit I didn't try it.  I haven't checked
out netconsole yet either, but I'm not sure it would help either in a
case like this that was a network related kernel crash.

In any case, a simple kernel command line that would provide a reversed
backtrace would be a simple thing to facilitate Linux users providing
useful info to Linux kernel developers in helping to debug kernel
problems.  The most useful info would still be on the screen, so it
could be transcribed or a photo image of the screen could be taken.

Fortunately, in this specific case, the SuperMicro X8DAH+-F system
does have a serial console, and after a fair amount of effort I was
able to get it to work as desired, and was able to finally capture
a backtrace of the kernel oops.  BTW I believe the reason the
kexec/kdump didn't work was probably because it couldn't find
a /proc/vmcore file, although I don't know why that would be,
and the Fedora 10 /etc/init.d/kdump script will then just boot
up normally if it fails to find the /proc/vmcore file (or it's
zero size).

The following shows a simple ping test usage of the skb_sources
tracing feature:

[root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms

--- 192.168.1.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms

[root@xeontest1 tracing]# cat trace
# tracer: skb_sources
#
#       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
#        |       |       |       |       |       |       |
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500
        4217    1       1       eth2    0       4       1500

All is as was expected.
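
The LEN of 1500 in the trace matches what ping itself reports as
"1472(1500)": the 1472 bytes of data given with -s plus an 8-byte ICMP echo
header and a 20-byte IPv4 header (no options).  A minimal sketch of that
arithmetic, with a made-up function name:

```c
/* IPv4 header (without options) and ICMP echo header sizes, in bytes. */
enum { IPV4_HDR_LEN = 20, ICMP_ECHO_HDR_LEN = 8 };

/* Total IP datagram length for a ping sent with the given -s payload size. */
unsigned int ping_ip_len(unsigned int icmp_payload)
{
	return IPV4_HDR_LEN + ICMP_ECHO_HDR_LEN + icmp_payload;
}
```

Calling ping_ip_len(1472) gives 1500, the value in the LEN column above.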

But if I try an actual nuttcp performance test (even rate limited
to 1 Mbps), I get the following kernel oops:

[root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
PGD 337d12067 PUD 337d11067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
CPU 4
Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
Stack:
 ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
<0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
<0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
Call Trace:
 [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
 [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
 [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
 [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
 [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
 [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
 [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
 [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
 [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
 [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
 [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
 [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
 [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
 [<ffffffff810f785c>] vfs_read+0xc0/0x107
 [<ffffffff810f7971>] sys_read+0x4c/0x75
 [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
 RSP <ffff8801a5811a88>
CR2: 0000000000000038
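
A note on reading the oops: CR2 holds the faulting address, and a small value
such as 0x38 is what one would expect when a structure member at offset 0x38
is read through a NULL pointer.  The sketch below only illustrates that
relationship in userspace; struct demo_obj and its layout are invented for
the example and do not correspond to any kernel structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 0x38 bytes of other fields, then a pointer member. */
struct demo_obj {
	unsigned char other_fields[0x38];
	void *member;
};

/*
 * Address the CPU would try to load when reading base->member.
 * Computed arithmetically, so nothing is actually dereferenced.
 */
uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct demo_obj, member);
}
```

With a base of 0 (a NULL pointer), member_address() returns 0x38, the same
form of address the oops reports.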

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26  7:10                   ` Bill Fink
@ 2009-08-26 11:00                     ` Neil Horman
  2009-08-26 18:08                       ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-26 11:00 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> On Fri, 21 Aug 2009, Neil Horman wrote:
> 
> > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > 
> > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > 
> > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > and never produced a crashdump.  I tried this several times and
> > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > worked just fine.
> > > > 
> > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > limited memory.  Don't know why you booted to a full system rather than doing a
> > > > crash recovery.  Don't suppose you got a backtrace, did you?
> > > 
> > > There was a backtrace on the screen but I didn't have a chance to
> > > record it.  BTW did anyone ever think to print the backtrace in
> > > reverse (first to some reserved memory and then output to the display)
> > > so the more interesting parts wouldn't have scrolled off the top of
> > > the screen?
> > > 
> > The real solution is to use a console to which the output doesn't scroll off the
> > screen.  Normally people use a serial console they can log, or a RAC card that
> > they can record. Even on a regular vga monitor in text mode, you can set up the
> > vt iirc to allow for scrolling.
> 
> None of our Asus P6T6 systems have serial consoles.  I don't know of
> any RAC cards for them either, nor are there spare PCI slots available
> in many cases.  I wouldn't think the Shift-PageUp trick would work
> with a crashed kernel, but I admit I didn't try it.  I haven't checked
> out netconsole yet either, but I'm not sure it would help either in a
> case like this that was a network related kernel crash.
> 
Any USB ports that you can attach a serial dongle to?  That would work as well,
or, as previously mentioned, netconsole also does the trick.

> In any case, a simple kernel command line that would provide a reversed
> backtrace would be a simple thing to facilitate Linux users providing
> useful info to Linux kernel developers in helping to debug kernel
> problems.  The most useful info would still be on the screen, so it
> could be transcribed or a photo image of the screen could be taken.
> 
I understand what you're saying, I'm just saying there are currently several
options for you that have already solved this problem in different ways.

> Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> does have a serial console, and after a fair amount of effort I was
> able to get it to work as desired, and was able to finally capture
> a backtrace of the kernel oops.  BTW I believe the reason the
> kexec/kdump didn't work was probably because it couldn't find
> a /proc/vmcore file, although I don't know why that would be,
> and the Fedora 10 /etc/init.d/kdump script will then just boot
> up normally if it fails to find the /proc/vmcore file (or it's
> zero size).
> 
I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
happy to look into it further.

> The following shows a simple ping test usage of the skb_sources
> tracing feature:
> 
> [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> 
> --- 192.168.1.10 ping statistics ---
> 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> 
> [root@xeontest1 tracing]# cat trace
> # tracer: skb_sources
> #
> #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> #        |       |       |       |       |       |       |
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
>         4217    1       1       eth2    0       4       1500
> 
> All is as was expected.
> 
> But if I try an actual nuttcp performance test (even rate limited
> to 1 Mbps), I get the following kernel oops:
> 
Thank you, I think I see the problem.  I'll have a patch for you in just a bit.

Thanks
Neil

> [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> PGD 337d12067 PUD 337d11067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> CPU 4
> Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> Stack:
>  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> Call Trace:
>  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
>  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
>  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
>  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
>  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
>  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
>  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
>  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
>  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
>  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
>  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
>  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
>  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
>  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
>  [<ffffffff810f785c>] vfs_read+0xc0/0x107
>  [<ffffffff810f7971>] sys_read+0x4c/0x75
>  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
>  RSP <ffff8801a5811a88>
> CR2: 0000000000000038
> 
> 						-Thanks
> 
> 						-Bill
> 


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 11:00                     ` Neil Horman
@ 2009-08-26 18:08                       ` Neil Horman
  2009-08-26 18:15                         ` Ingo Molnar
                                           ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Neil Horman @ 2009-08-26 18:08 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > On Fri, 21 Aug 2009, Neil Horman wrote:
> > 
> > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > 
> > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > worked just fine.
> > > > > 
> > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > limited memory.  Don't know why you booted to a full system rather than doing a
> > > > > crash recovery.  Don't suppose you got a backtrace, did you?
> > > > 
> > > > There was a backtrace on the screen but I didn't have a chance to
> > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > reverse (first to some reserved memory and then output to the display)
> > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > the screen?
> > > > 
> > > The real solution is to use a console to which the output doesn't scroll off the
> > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > vt iirc to allow for scrolling.
> > 
> > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > any RAC cards for them either, nor are there spare PCI slots available
> > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > out netconsole yet either, but I'm not sure it would help either in a
> > case like this that was a network related kernel crash.
> > 
> Any USB ports that you can attach a serial dongle to?  That would work as well,
> or, as previously mentioned, netconsole also does the trick.
> 
> > In any case, a simple kernel command line that would provide a reversed
> > backtrace would be a simple thing to facilitate Linux users providing
> > useful info to Linux kernel developers in helping to debug kernel
> > problems.  The most useful info would still be on the screen, so it
> > could be transcribed or a photo image of the screen could be taken.
> > 
> I understand what you're saying, I'm just saying there are currently several
> options for you that have already solved this problem in different ways.
> 
> > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > does have a serial console, and after a fair amount of effort I was
> > able to get it to work as desired, and was able to finally capture
> > a backtrace of the kernel oops.  BTW I believe the reason the
> > kexec/kdump didn't work was probably because it couldn't find
> > a /proc/vmcore file, although I don't know why that would be,
> > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > up normally if it fails to find the /proc/vmcore file (or it's
> > zero size).
> > 
> I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> happy to look into it further.
> 
> > The following shows a simple ping test usage of the skb_sources
> > tracing feature:
> > 
> > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > 
> > --- 192.168.1.10 ping statistics ---
> > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > 
> > [root@xeontest1 tracing]# cat trace
> > # tracer: skb_sources
> > #
> > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > #        |       |       |       |       |       |       |
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> > 
> > All is as was expected.
> > 
> > But if I try an actual nuttcp performance test (even rate limited
> > to 1 Mbps), I get the following kernel oops:
> > 
> Thank you, I think I see the problem.  I'll have a patch for you in just a bit.
> 
> Thanks
> Neil
> 
> > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > PGD 337d12067 PUD 337d11067 PMD 0
> > Oops: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > CPU 4
> > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > Stack:
> >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > Call Trace:
> >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> >  RSP <ffff8801a5811a88>
> > CR2: 0000000000000038
> > 
> > 						-Thanks
> > 
> > 						-Bill
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Here you go, I think this will fix your oops.


    Fix NULL pointer deref in skb sources ftracer
    
    It's possible that skb->sk will be NULL in this path, so we shouldn't just
    assume we can pass it to sock_net
    
    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 trace_skb_sources.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
index 40eb071..8bf518f 100644
--- a/kernel/trace/trace_skb_sources.c
+++ b/kernel/trace/trace_skb_sources.c
@@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
 	struct ring_buffer_event *event;
 	struct trace_skb_event *entry;
 	struct trace_array *tr = skb_trace;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 
 	if (!trace_skb_source_enabled)
 		return;
@@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
 	entry->event_data.rx_queue = skb->queue_mapping;
 	entry->event_data.ccpu = smp_processor_id();
 
-	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
+	if (skb->sk)
+		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
+
 	if (dev) {
 		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
 		dev_put(dev);

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 18:08                       ` Neil Horman
@ 2009-08-26 18:15                         ` Ingo Molnar
  2009-08-26 19:04                           ` Neil Horman
  2009-08-27 17:32                         ` Bill Fink
  2009-08-27 17:44                         ` Bill Fink
  2 siblings, 1 reply; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 18:15 UTC (permalink / raw)
  To: Neil Horman; +Cc: Bill Fink, Linux Network Developers, brice, gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > On Fri, 21 Aug 2009, Neil Horman wrote:
> > > 
> > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > > 
> > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > > 
> > > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > > worked just fine.
> > > > > > 
> > > > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > > > limited memory.  I don't know why you booted to a full system rather than
> > > > > > > doing a crash recovery.  I don't suppose you got a backtrace, did you?
> > > > > 
> > > > > There was a backtrace on the screen but I didn't have a chance to
> > > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > > reverse (first to some reserved memory and then output to the display)
> > > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > > the screen?
> > > > > 
> > > > The real solution is to use a console to which the output doesn't scroll off the
> > > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > > vt iirc to allow for scrolling.
> > > 
> > > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > > any RAC cards for them either, nor are there spare PCI slots available
> > > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > > out netconsole yet either, but I'm not sure it would help either in a
> > > case like this that was a network related kernel crash.
> > > 
> > Any USB ports that you can attach a serial dongle to?  That would work as well,
> > or, as previously mentioned, netconsole also does the trick.
> > 
> > > In any case, a simple kernel command line that would provide a reversed
> > > backtrace would be a simple thing to facilitate Linux users providing
> > > useful info to Linux kernel developers in helping to debug kernel
> > > problems.  The most useful info would still be on the screen, so it
> > > could be transcribed or a photo image of the screen could be taken.
> > > 
> > I understand what you're saying; I'm just saying there are currently several
> > options for you that have already solved this problem in different ways.
> > 
> > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > does have a serial console, and after a fair amount of effort I was
> > > able to get it to work as desired, and was able to finally capture
> > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > kexec/kdump didn't work was probably because it couldn't find
> > > a /proc/vmcore file, although I don't know why that would be,
> > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > up normally if it fails to find the /proc/vmcore file (or it's
> > > zero size).
> > > 
> > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > happy to look into it further.
> > 
> > > The following shows a simple ping test usage of the skb_sources
> > > tracing feature:
> > > 
> > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > 
> > > --- 192.168.1.10 ping statistics ---
> > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > 
> > > [root@xeontest1 tracing]# cat trace
> > > # tracer: skb_sources
> > > #
> > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > #        |       |       |       |       |       |       |
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > > 
> > > All is as was expected.
> > > 
> > > But if I try an actual nuttcp performance test (even rate limited
> > > to 1 Mbps), I get the following kernel oops:
> > > 
> > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > 
> > Thanks
> > Neil
> > 
> > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > PGD 337d12067 PUD 337d11067 PMD 0
> > > Oops: 0000 [#1] SMP
> > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > CPU 4
> > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > Stack:
> > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > Call Trace:
> > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > >  RSP <ffff8801a5811a88>
> > > CR2: 0000000000000038
> > > 
> > > 						-Thanks
> > > 
> > > 						-Bill
> > > 
> > 
> 
> 
> Here you go, I think this will fix your oops.
> 
> 
>     Fix NULL pointer deref in skb sources ftracer
>     
>     It's possible that skb->sk will be NULL in this path, so we shouldn't just
>     assume we can pass it to sock_net
>     
>     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
>  trace_skb_sources.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

OK if this is just a temporary fix until TRACE_EVENT() is done, but
we'll get rid of this and do TRACE_EVENT() before net-next-2.6 is
pushed to .32, right?

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 18:15                         ` Ingo Molnar
@ 2009-08-26 19:04                           ` Neil Horman
  2009-08-26 19:08                             ` Ingo Molnar
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-26 19:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Bill Fink, Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > On Fri, 21 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > > > 
> > > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > > > 
> > > > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > > > worked just fine.
> > > > > > > 
> > > > > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > > > > limited memory.  I don't know why you booted to a full system rather than
> > > > > > > > doing a crash recovery.  I don't suppose you got a backtrace, did you?
> > > > > > 
> > > > > > There was a backtrace on the screen but I didn't have a chance to
> > > > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > > > reverse (first to some reserved memory and then output to the display)
> > > > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > > > the screen?
> > > > > > 
> > > > > The real solution is to use a console to which the output doesn't scroll off the
> > > > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > > > vt iirc to allow for scrolling.
> > > > 
> > > > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > > > any RAC cards for them either, nor are there spare PCI slots available
> > > > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > > > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > > > out netconsole yet either, but I'm not sure it would help either in a
> > > > case like this that was a network related kernel crash.
> > > > 
> > > Any USB ports that you can attach a serial dongle to?  That would work as well,
> > > or, as previously mentioned, netconsole also does the trick.
> > > 
> > > > In any case, a simple kernel command line that would provide a reversed
> > > > backtrace would be a simple thing to facilitate Linux users providing
> > > > useful info to Linux kernel developers in helping to debug kernel
> > > > problems.  The most useful info would still be on the screen, so it
> > > > could be transcribed or a photo image of the screen could be taken.
> > > > 
> > > I understand what you're saying; I'm just saying there are currently several
> > > options for you that have already solved this problem in different ways.
> > > 
> > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > does have a serial console, and after a fair amount of effort I was
> > > > able to get it to work as desired, and was able to finally capture
> > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > kexec/kdump didn't work was probably because it couldn't find
> > > > a /proc/vmcore file, although I don't know why that would be,
> > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > zero size).
> > > > 
> > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > happy to look into it further.
> > > 
> > > > The following shows a simple ping test usage of the skb_sources
> > > > tracing feature:
> > > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > 
> > > > --- 192.168.1.10 ping statistics ---
> > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > 
> > > > [root@xeontest1 tracing]# cat trace
> > > > # tracer: skb_sources
> > > > #
> > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > #        |       |       |       |       |       |       |
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > > 
> > > > All is as was expected.
> > > > 
> > > > But if I try an actual nuttcp performance test (even rate limited
> > > > to 1 Mbps), I get the following kernel oops:
> > > > 
> > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > 
> > > Thanks
> > > Neil
> > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > Oops: 0000 [#1] SMP
> > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > CPU 4
> > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > Stack:
> > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > Call Trace:
> > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > >  RSP <ffff8801a5811a88>
> > > > CR2: 0000000000000038
> > > > 
> > > > 						-Thanks
> > > > 
> > > > 						-Bill
> > > > 
> > > 
> > 
> > 
> > Here you go, I think this will fix your oops.
> > 
> > 
> >     Fix NULL pointer deref in skb sources ftracer
> >     
> >     It's possible that skb->sk will be NULL in this path, so we shouldn't just
> >     assume we can pass it to sock_net
> >     
> >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> >  trace_skb_sources.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> OK if this is just a temporary fix until TRACE_EVENT() is done, but
> we'll get rid of this and do TRACE_EVENT() before net-next-2.6 is
> pushed to .32, right?
> 
Not sure that the two are related.  I think you meant to send this to the other
thread, didn't you?
Neil

> 	Ingo
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:04                           ` Neil Horman
@ 2009-08-26 19:08                             ` Ingo Molnar
  2009-08-26 19:36                               ` David Miller
  2009-08-26 20:01                               ` Neil Horman
  0 siblings, 2 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 19:08 UTC (permalink / raw)
  To: Neil Horman, David S. Miller, Steven Rostedt,
	Frédéric Weisbecker
  Cc: Bill Fink, Linux Network Developers, brice, gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > > On Fri, 21 Aug 2009, Neil Horman wrote:
> > > > > 
> > > > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > > > > 
> > > > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > > > > 
> > > > > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > > > > worked just fine.
> > > > > > > > 
> > > > > > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > > > > > limited memory.  I don't know why you booted to a full system rather than
> > > > > > > > > doing a crash recovery.  I don't suppose you got a backtrace, did you?
> > > > > > > 
> > > > > > > There was a backtrace on the screen but I didn't have a chance to
> > > > > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > > > > reverse (first to some reserved memory and then output to the display)
> > > > > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > > > > the screen?
> > > > > > > 
> > > > > > The real solution is to use a console to which the output doesn't scroll off the
> > > > > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > > > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > > > > vt iirc to allow for scrolling.
> > > > > 
> > > > > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > > > > any RAC cards for them either, nor are there spare PCI slots available
> > > > > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > > > > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > > > > out netconsole yet either, but I'm not sure it would help either in a
> > > > > case like this that was a network related kernel crash.
> > > > > 
> > > > Any USB ports that you can attach a serial dongle to?  That would work as well,
> > > > or, as previously mentioned, netconsole also does the trick.
> > > > 
> > > > > In any case, a simple kernel command line that would provide a reversed
> > > > > backtrace would be a simple thing to facilitate Linux users providing
> > > > > useful info to Linux kernel developers in helping to debug kernel
> > > > > problems.  The most useful info would still be on the screen, so it
> > > > > could be transcribed or a photo image of the screen could be taken.
> > > > > 
> > > > I understand what you're saying; I'm just saying there are currently several
> > > > options for you that have already solved this problem in different ways.
> > > > 
> > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > > does have a serial console, and after a fair amount of effort I was
> > > > > able to get it to work as desired, and was able to finally capture
> > > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > > kexec/kdump didn't work was probably because it couldn't find
> > > > > a /proc/vmcore file, although I don't know why that would be,
> > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > > zero size).
> > > > > 
> > > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > > happy to look into it further.
> > > > 
> > > > > The following shows a simple ping test usage of the skb_sources
> > > > > tracing feature:
> > > > > 
> > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > > 
> > > > > --- 192.168.1.10 ping statistics ---
> > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > > 
> > > > > [root@xeontest1 tracing]# cat trace
> > > > > # tracer: skb_sources
> > > > > #
> > > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > > #        |       |       |       |       |       |       |
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > > 
> > > > > All is as was expected.
> > > > > 
> > > > > But if I try an actual nuttcp performance test (even rate limited
> > > > > to 1 Mbps), I get the following kernel oops:
> > > > > 
> > > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > > 
> > > > Thanks
> > > > Neil
> > > > 
> > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > > Oops: 0000 [#1] SMP
> > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > > CPU 4
> > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > > Stack:
> > > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > > Call Trace:
> > > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > >  RSP <ffff8801a5811a88>
> > > > > CR2: 0000000000000038
> > > > > 
> > > > > 						-Thanks
> > > > > 
> > > > > 						-Bill
> > > > > 
> > > > 
> > > 
> > > 
> > > Here  you go, I think this will fix your oops.
> > > 
> > > 
> > >     Fix NULL pointer deref in skb sources ftracer
> > >     
> > >     It's possible that skb->sk will be null in this path, so we shouldn't just assume
> > >     we can pass it to sock_net
> > >     
> > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > 
> > >  trace_skb_sources.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > ok if this is just a temporary fix until TRACE_EVENT() is done, but 
> > we'll get rid of this and do TRACE_EVENT() before net-next-2.6 is 
> > pushed to .32, right?
> 
> Not sure that the two are related.  I think you meant to send this 
> to the other thread, didn't you?

Sigh, no. Please re-read the past discussions about this. 
trace_skb_sources.c is a hack and should be converted to generic 
tracepoints. Is there anything in it that cannot be expressed in 
terms of TRACE_EVENT()?

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:08                             ` Ingo Molnar
@ 2009-08-26 19:36                               ` David Miller
  2009-08-26 19:48                                 ` Ingo Molnar
  2009-08-26 20:01                               ` Neil Horman
  1 sibling, 1 reply; 95+ messages in thread
From: David Miller @ 2009-08-26 19:36 UTC (permalink / raw)
  To: mingo; +Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 26 Aug 2009 21:08:30 +0200

> Sigh, no. Please re-read the past discussions about this. 
> trace_skb_sources.c is a hack and should be converted to generic 
> tracepoints. Is there anything in it that cannot be expressed in 
> terms of TRACE_EVENT()?

Neil explained why he needed to implement it this way in his
reply to Steven Rostedt.  I attach it here for your
convenience.

Subject: Re: [PATCH 3/3] net: skb ftracer - Add actual ftrace code to kernel (v3)
From: Neil Horman <nhorman@tuxdriver.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: netdev@vger.kernel.org, davem@davemloft.net
Date: Tue, 18 Aug 2009 12:39:58 -0400
User-Agent: Mutt/1.5.18 (2008-05-17)
X-Mew: tab/spc characters on Subject: are simplified.

On Mon, Aug 17, 2009 at 04:55:38PM -0400, Steven Rostedt wrote:
> 
> Hi Neil!
> 
> Sorry for the late reply, I've been on vacation for the last week.
> 
> On Thu, 13 Aug 2009, Neil Horman wrote:
> 
> > skb allocation / consumption correlator
> > 
> > Add ftracer module to kernel to print out a list that correlates a process id,
> > > an skb it read, and the numa nodes on which the process was running when it was
> > read along with the numa node the skbuff was allocated on.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> > 
> >  Makefile            |    1 
> >  trace.h             |   19 ++++++
> >  trace_skb_sources.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 174 insertions(+)
> > 
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index 844164d..ee5e5b1 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -49,6 +49,7 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> >  ifeq ($(CONFIG_BLOCK),y)
> >  obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> >  endif
> > +obj-$(CONFIG_SKB_SOURCES_TRACER) += trace_skb_sources.o
> >  obj-$(CONFIG_EVENT_TRACING) += trace_events.o
> >  obj-$(CONFIG_EVENT_TRACING) += trace_export.o
> >  obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 8b9f4f6..8a6281b 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -11,6 +11,7 @@
> >  #include <trace/boot.h>
> >  #include <linux/kmemtrace.h>
> >  #include <trace/power.h>
> > +#include <trace/events/skb.h>
> >  
> >  #include <linux/trace_seq.h>
> >  #include <linux/ftrace_event.h>
> > @@ -40,6 +41,7 @@ enum trace_type {
> >  	TRACE_KMEM_FREE,
> >  	TRACE_POWER,
> >  	TRACE_BLK,
> > +	TRACE_SKB_SOURCE,
> >  
> >  	__TRACE_LAST_TYPE,
> >  };
> > @@ -171,6 +173,21 @@ struct trace_power {
> >  	struct power_trace	state_data;
> >  };
> >  
> > +struct skb_record {
> > +	pid_t pid;		/* pid of the copying process */
> > +	int anid;		/* node where skb was allocated */
> > +	int cnid;		/* node to which skb was copied in userspace */
> > +	char ifname[IFNAMSIZ];	/* Name of the receiving interface */
> > +	int rx_queue;		/* The rx queue the skb was received on */
> > +	int ccpu;		/* Cpu the application got this frame from */
> > +	int len;		/* length of the data copied */
> > +};
> > +
> > +struct trace_skb_event {
> > +	struct trace_entry	ent;
> > +	struct skb_record	event_data;
> > +};
> > +
> >  enum kmemtrace_type_id {
> >  	KMEMTRACE_TYPE_KMALLOC = 0,	/* kmalloc() or kfree(). */
> >  	KMEMTRACE_TYPE_CACHE,		/* kmem_cache_*(). */
> > @@ -323,6 +340,8 @@ extern void __ftrace_bad_type(void);
> >  			  TRACE_SYSCALL_ENTER);				\
> >  		IF_ASSIGN(var, ent, struct syscall_trace_exit,		\
> >  			  TRACE_SYSCALL_EXIT);				\
> > +		IF_ASSIGN(var, ent, struct trace_skb_event,		\
> > +			  TRACE_SKB_SOURCE);				\
> >  		__ftrace_bad_type();					\
> >  	} while (0)
> >  
> > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > new file mode 100644
> > index 0000000..4ba3671
> > --- /dev/null
> > +++ b/kernel/trace/trace_skb_sources.c
> > @@ -0,0 +1,154 @@
> > +/*
> > + * ring buffer based tracer for analyzing per-socket skb sources
> > + *
> > + * Neil Horman <nhorman@tuxdriver.com> 
> > + * Copyright (C) 2009
> > + *
> > + *
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/debugfs.h>
> > +#include <trace/events/skb.h>
> > +#include <linux/kallsyms.h>
> > +#include <linux/module.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/netdevice.h>
> > +#include <net/sock.h>
> > +
> > +#include "trace.h"
> > +#include "trace_output.h"
> > +
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(skb_copy_datagram_iovec);
> > +
> > +static struct trace_array *skb_trace;
> > +static int __read_mostly trace_skb_source_enabled;
> > +
> > +static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > +{
> > +	struct ring_buffer_event *event;
> > +	struct trace_skb_event *entry;
> > +	struct trace_array *tr = skb_trace;
> > +	struct net_device *dev;
> > +
> > +	if (!trace_skb_source_enabled)
> > +		return;
> > +
> > +	if (in_interrupt())
> > +		return;
> 
> Is there a reason for not doing this in an interrupt?
> 
Because the idea is to correlate skb consumption to a process.  If we get in
this tracepoint in an interrupt, it doesn't make sense to record.


> > +
> > +	event = trace_buffer_lock_reserve(tr, TRACE_SKB_SOURCE,
> > +					  sizeof(*entry), 0, 0);
> > +	if (!event)
> > +		return;
> > +	entry = ring_buffer_event_data(event);
> > +
> > +	entry->event_data.pid = current->pid;
> 
> Note, the trace_buffer_lock_reserve will record the current pid, thus you 
> do not need to record it here.
> 
> > +	entry->event_data.anid = page_to_nid(virt_to_page(skb->data));
> > +	entry->event_data.cnid = cpu_to_node(smp_processor_id());
> > +	entry->event_data.len = len;
> > +	entry->event_data.rx_queue = skb->queue_mapping;
> > +	entry->event_data.ccpu = smp_processor_id();
> 
> Also, the cpu is recorded in the ring buffer. They are per cpu ring 
> buffers and that determines the cpu it was recorded on.
> 
> > +
> > +	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > +	if (dev) {
> > +		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > +		dev_put(dev);
> > +	} else {
> > +		strcpy(entry->event_data.ifname, "Unknown");
> > +	}
> > +
> > +	trace_buffer_unlock_commit(tr, event, 0, 0);
> > +}
> > +
> > +static int tracing_skb_source_register(void)
> > +{
> > +	int ret;
> > +
> > +	ret = register_trace_skb_copy_datagram_iovec(probe_skb_dequeue);
> > +	if (ret)
> > +		pr_info("skb source trace: Couldn't activate dequeue tracepoint");
> > +	
> > +	return ret;
> > +}
> > +
> > +static void start_skb_source_trace(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 1;
> > +}
> > +
> > +static void stop_skb_source_trace(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 0;
> > +}
> > +
> > +static void skb_source_trace_reset(struct trace_array *tr)
> > +{
> > +	trace_skb_source_enabled = 0;
> > +	unregister_trace_skb_copy_datagram_iovec(probe_skb_dequeue);
> > +}
> > +
> > +
> > +static int skb_source_trace_init(struct trace_array *tr)
> > +{
> > +	int cpu;
> > +	skb_trace = tr;
> > +
> > +	trace_skb_source_enabled = 1;
> > +	tracing_skb_source_register();
> > +
> > +	for_each_cpu(cpu, cpu_possible_mask)
> > +		tracing_reset(tr, cpu);
> > +	return 0;
> > +}
> > +
> > +static enum print_line_t skb_source_print_line(struct trace_iterator *iter)
> > +{
> > +	int ret = 0;
> > +	struct trace_entry *entry = iter->ent;
> 
> iter->cpu has the cpu that trace was recorded on.
> entry->pid has the pid of the process that did the recording.
> 
ok, I'll clean this up in a subsequent patch, since davem has already rolled
them in.

> > +	struct trace_skb_event *event;
> > +	struct skb_record *record;
> > +	struct trace_seq *s = &iter->seq;
> > +
> > +	trace_assign_type(event, entry);
> > +	record = &event->event_data;
> > +	if (entry->type != TRACE_SKB_SOURCE)
> > +		return TRACE_TYPE_UNHANDLED;
> > +
> > +	ret = trace_seq_printf(s, "	%d	%d	%d	%s	%d	%d	%d\n",
> > +			record->pid,
> > +			record->anid,
> > +			record->cnid,
> > +			record->ifname,
> > +			record->rx_queue,
> > +			record->ccpu,
> > +			record->len);
> > +
> > +	if (!ret)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +static void skb_source_print_header(struct seq_file *s)
> > +{
> > +	seq_puts(s, "#	PID	ANID	CNID	IFC	RXQ	CCPU	LEN\n");
> > +	seq_puts(s, "#	 |	 |	 |	 |	 |	 |	 |\n");
> > +}
> > +
> > +static struct tracer skb_source_tracer __read_mostly =
> > +{
> > +	.name		= "skb_sources",
> > +	.init		= skb_source_trace_init,
> > +	.start		= start_skb_source_trace,
> > +	.stop		= stop_skb_source_trace,
> > +	.reset		= skb_source_trace_reset,
> > +	.print_line	= skb_source_print_line,
> > +	.print_header	= skb_source_print_header,
> > +};
> > +
> > +static int init_skb_source_trace(void)
> > +{
> > +	return register_tracer(&skb_source_tracer);
> > +}
> > +device_initcall(init_skb_source_trace);
> > 
> 
> BTW, why not just do this as events? Or was this just an easy way to 
> communicate with the user space tools?
> 
That's exactly why I did it.  The idea is for me to now write a user space tool
that lets me analyze the events and adjust process scheduling to optimize the rx
path.
Neil

> -- Steve
> 
> 
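
A minimal user-space sketch of the kind of tool described above, assuming
the tab-separated line layout that skb_source_print_line() emits (PID,
ANID, CNID, IFC, RXQ, CCPU, LEN).  The names skb_line, parse_skb_line and
skb_line_is_cross_node are hypothetical and not part of the patch.  It
only parses one line of the trace file and reports whether the skb was
allocated on a different NUMA node than the one the reader ran on.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IFNAME_MAX 16	/* mirrors IFNAMSIZ; redefined to stay self-contained */

/* User-space mirror of the tracer's struct skb_record (illustrative names). */
struct skb_line {
	int pid;		/* pid of the copying process */
	int anid;		/* node where the skb was allocated */
	int cnid;		/* node the reader was running on */
	char ifname[IFNAME_MAX];/* receiving interface */
	int rx_queue;		/* rx queue the skb arrived on */
	int ccpu;		/* cpu the reader was running on */
	int len;		/* bytes copied */
};

/*
 * Parse one data line of the skb_sources trace file.
 * Returns 0 on success, -1 for header/comment lines or malformed input.
 */
static int parse_skb_line(const char *line, struct skb_line *out)
{
	assert(line && out);

	/* Skip leading whitespace so "#" header lines are detected. */
	while (*line == ' ' || *line == '\t')
		line++;
	if (*line == '#' || *line == '\0' || *line == '\n')
		return -1;

	/* %15s leaves room for the terminating NUL in ifname. */
	if (sscanf(line, "%d %d %d %15s %d %d %d",
		   &out->pid, &out->anid, &out->cnid, out->ifname,
		   &out->rx_queue, &out->ccpu, &out->len) != 7)
		return -1;
	return 0;
}

/* Nonzero when allocation node and consuming node differ. */
static int skb_line_is_cross_node(const struct skb_line *r)
{
	return r->anid != r->cnid;
}
```

A tool built on this could feed every line of the trace file through
parse_skb_line() and count cross-node reads per pid to decide which
processes are worth rebinding.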

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:36                               ` David Miller
@ 2009-08-26 19:48                                 ` Ingo Molnar
  2009-08-26 20:23                                   ` Neil Horman
  2009-08-26 20:28                                   ` Ingo Molnar
  0 siblings, 2 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 19:48 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Wed, 26 Aug 2009 21:08:30 +0200
> 
> > Sigh, no. Please re-read the past discussions about this. 
> > trace_skb_sources.c is a hack and should be converted to generic 
> > tracepoints. Is there anything in it that cannot be expressed in 
> > terms of TRACE_EVENT()?
> 
> Neil explained why he needed to implement it this way in his reply 
> to Steven Rostedt.  I attach it here for your convenience.

thanks. The argument is invalid:

> > BTW, why not just do this as events? Or was this just an easy way 
> > to communicate with the user space tools?
> 
> That's exactly why I did it.  The idea is for me to now write a 
> user space tool that lets me analyze the events and adjust process 
> scheduling to optimize the rx path. Neil

All tooling (in fact _more_ tooling) can be done based on generic, 
TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
available, have a generalized format with format parsers and user 
tooling implemented, etc. etc.

	Ingo
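
To make the "generalized format" point concrete, here is a rough
user-space sketch that parses the payload of one event line shaped like
the kfree_skb print format ("skbaddr=%p protocol=%u location=%p").  It
assumes the payload has already been split off from the common prefix of
the trace line and that the addresses are printed as hex.  The names
kfree_skb_event and parse_kfree_skb are hypothetical.  A real tool would
more likely read the event's format file than hard-code the layout as
done here.

```c
#include <assert.h>
#include <stdio.h>

/* Event-specific fields of the kfree_skb tracepoint payload. */
struct kfree_skb_event {
	unsigned long long skbaddr;	/* address of the freed skb */
	unsigned int protocol;		/* skb protocol, as printed with %u */
	unsigned long long location;	/* address of the freeing call site */
};

/*
 * Parse "skbaddr=<hex> protocol=<dec> location=<hex>".
 * Returns 0 on success, -1 if the payload does not match.
 */
static int parse_kfree_skb(const char *payload, struct kfree_skb_event *ev)
{
	assert(payload && ev);

	/* %llx accepts hex digits with or without a leading 0x. */
	if (sscanf(payload, "skbaddr=%llx protocol=%u location=%llx",
		   &ev->skbaddr, &ev->protocol, &ev->location) != 3)
		return -1;
	return 0;
}
```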

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:08                             ` Ingo Molnar
  2009-08-26 19:36                               ` David Miller
@ 2009-08-26 20:01                               ` Neil Horman
  2009-08-26 22:57                                 ` Ingo Molnar
  1 sibling, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-26 20:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David S. Miller, Steven Rostedt, Frédéric Weisbecker,
	Bill Fink, Linux Network Developers, brice, gallatin

On Wed, Aug 26, 2009 at 09:08:30PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Wed, Aug 26, 2009 at 08:15:02PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > > > On Fri, 21 Aug 2009, Neil Horman wrote:
> > > > > > 
> > > > > > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > > > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > > > > > 
> > > > > > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > > > > > 
> > > > > > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > > > > > worked just fine.
> > > > > > > > > 
> > > > > > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > > > > > limited memory.  don't know why you booted to a full system rather than did a
> > > > > > > > > crash recovery.  Don't suppose you got a backtrace did you?
> > > > > > > > 
> > > > > > > > There was a backtrace on the screen but I didn't have a chance to
> > > > > > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > > > > > reverse (first to some reserved memory and then output to the display)
> > > > > > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > > > > > the screen?
> > > > > > > > 
> > > > > > > The real solution is to use a console to which the output doesn't scroll off the
> > > > > > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > > > > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > > > > > vt iirc to allow for scrolling.
> > > > > > 
> > > > > > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > > > > > any RAC cards for them either, nor are there spare PCI slots available
> > > > > > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > > > > > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > > > > > out netconsole yet either, but I'm not sure it would help either in a
> > > > > > case like this that was a network related kernel crash.
> > > > > > 
> > > > > Any USB ports that you can attach a serial dongle to?  That would work as well,
> > > > > or, as previously mentioned, netconsole also does the trick.
> > > > > 
> > > > > > In any case, a simple kernel command line that would provide a reversed
> > > > > > backtrace would be a simple thing to facilitate Linux users providing
> > > > > > useful info to Linux kernel developers in helping to debug kernel
> > > > > > problems.  The most useful info would still be on the screen, so it
> > > > > > could be transcribed or a photo image of the screen could be taken.
> > > > > > 
> > > > > > I understand what you're saying, I'm just saying there are currently several
> > > > > > options for you that have already solved this problem in different ways.
> > > > > 
> > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > > > does have a serial console, and after a fair amount of effort I was
> > > > > > able to get it to work as desired, and was able to finally capture
> > > > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > > > kexec/kdump didn't work was probably because it couldn't find
> > > > > > a /proc/vmcore file, although I don't know why that would be,
> > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > > > zero size).
> > > > > > 
> > > > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > > > happy to look into it further.
> > > > > 
> > > > > > The following shows a simple ping test usage of the skb_sources
> > > > > > tracing feature:
> > > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > > > 
> > > > > > --- 192.168.1.10 ping statistics ---
> > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > > > 
> > > > > > [root@xeontest1 tracing]# cat trace
> > > > > > # tracer: skb_sources
> > > > > > #
> > > > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > > > #        |       |       |       |       |       |       |
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > > 
> > > > > > All is as was expected.
> > > > > > 
> > > > > > But if I try an actual nuttcp performance test (even rate limited
> > > > > > to 1 Mbps), I get the following kernel oops:
> > > > > > 
> > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > > > 
> > > > > Thanks
> > > > > Neil
> > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > > > Oops: 0000 [#1] SMP
> > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > > > CPU 4
> > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > > > Stack:
> > > > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > > > Call Trace:
> > > > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > >  RSP <ffff8801a5811a88>
> > > > > > CR2: 0000000000000038
> > > > > > 
> > > > > > 						-Thanks
> > > > > > 
> > > > > > 						-Bill
> > > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > 
> > > > 
> > > > Here  you go, I think this will fix your oops.
> > > > 
> > > > 
> > > >     Fix NULL pointer deref in skb sources ftracer
> > > >     
> > > >     It's possible that skb->sk will be null in this path, so we shouldn't just assume
> > > >     we can pass it to sock_net
> > > >     
> > > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > 
> > > >  trace_skb_sources.c |    6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > ok if this is just a temporary fix until TRACE_EVENT() is done, but 
> > > we'll get rid of this and do TRACE_EVENT() before net-next-2.6 is 
> > > pushed to .32, right?
> > 
> > Not sure that the two are related.  I think you meant to send this 
> > to the other thread, didn't you?
> 
> Sigh, no. Please re-read the past discussions about this. 
> trace_skb_sources.c is a hack and should be converted to generic 
> tracepoints. Is there anything in it that cannot be expressed in 
> terms of TRACE_EVENT()?
> 
As David noted in my previous posting, no, I don't intend to change this.  It
would certainly be possible to express this in terms of just a TRACE_EVENT, but
it would be much more complex and messy for any user space tool to do so, IMHO.  So
I'd like to leave it as it is.  To say it's a hack as it is would really be to
say any of the current ftrace modules are a hack, as all of them could just as
easily be expressed as a series of trace events which were later parsed by a
user space tool.

I thought your comments were related to the conversion of the napi_poll
tracepoint to a TRACE_EVENT structure, which is currently in progress.

Best
Neil

> 	Ingo
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:48                                 ` Ingo Molnar
@ 2009-08-26 20:23                                   ` Neil Horman
  2009-08-26 20:40                                     ` Ingo Molnar
  2009-08-26 23:46                                     ` Frederic Weisbecker
  2009-08-26 20:28                                   ` Ingo Molnar
  1 sibling, 2 replies; 95+ messages in thread
From: Neil Horman @ 2009-08-26 20:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > 
> > > Sigh, no. Please re-read the past discussions about this. 
> > > trace_skb_sources.c is a hack and should be converted to generic 
> > > tracepoints. Is there anything in it that cannot be expressed in 
> > > terms of TRACE_EVENT()?
> > 
> > Neil explained why he needed to implement it this way in his reply 
> > to Steven Rostedt.  I attach it here for your convenience.
> 
> thanks. The argument is invalid:
> 
Just because you assert that doesn't make it so, Ingo.

> > > BTW, why not just do this as events? Or was this just an easy way 
> > > to communicate with the user space tools?
> > 
> > That's exactly why I did it.  The idea is for me to now write a 
> > user space tool that lets me analyze the events and adjust process 
> > scheduling to optimize the rx path. Neil
> 
> All tooling (in fact _more_ tooling) can be done based on generic, 
> TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> available, have a generalized format with format parsers and user 
> tooling implemented, etc. etc.
> 
Then why allow for ftrace modules at all?  I grant that the skb ftracer is a bit
trivial at the moment for an ftrace module, but I really prefer to leave it as is
so that I can expand it with additional tracepoints.  And looking at them,
anything you've said above applies to any of the currently implemented ftrace
modules.  If you're so adamant that we should just do everything with
TRACE_EVENT log messages, then let's get rid of the ftrace infrastructure
altogether.  Until we do that, however, I like my skb tracer just as it is.

Neil

> 	Ingo
> 
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 19:48                                 ` Ingo Molnar
  2009-08-26 20:23                                   ` Neil Horman
@ 2009-08-26 20:28                                   ` Ingo Molnar
  1 sibling, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 20:28 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin


* Ingo Molnar <mingo@elte.hu> wrote:

> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > 
> > > Sigh, no. Please re-read the past discussions about this. 
> > > trace_skb_sources.c is a hack and should be converted to generic 
> > > tracepoints. Is there anything in it that cannot be expressed in 
> > > terms of TRACE_EVENT()?
> > 
> > Neil explained why he needed to implement it this way in his reply 
> > to Steven Rostedt.  I attach it here for your convenience.
> 
> thanks. The argument is invalid:
> 
> > > BTW, why not just do this as events? Or was this just an easy way 
> > > to communicate with the user space tools?
> > 
> > That's exactly why I did it.  The idea is for me to now write a 
> > user space tool that lets me analyze the events and adjust process 
> > scheduling to optimize the rx path. Neil
> 
> All tooling (in fact _more_ tooling) can be done based on generic, 
> TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> available, have a generalized format with format parsers and user 
> tooling implemented, etc. etc.

To expand on the 'etc. etc.'.

Right now we already have one TRACE_EVENT() based generic 
tracepoint for skbs - the kfree_skb one in 
include/trace/events/skb.h.

Here's a list of examples of what that single generic tracepoint 
allows us to do, which Neil's kernel/trace/trace_skb_sources.c code 
cannot do:

 - structured format/field description:

  aldebaran:~> cat /debug/tracing/events/skb/kfree_skb/format 

 name: kfree_skb
 ID: 603
 format:
	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_tgid;	offset:8;	size:4;

	field:void * skbaddr;	offset:16;	size:8;
	field:unsigned short protocol;	offset:24;	size:2;
	field:void * location;	offset:32;	size:8;

 print fmt: "skbaddr=%p protocol=%u location=%p", REC->skbaddr, REC->protocol, REC->location

  The advantages of that are numerous: we have a user-space parser
  for that, so new tracepoints or changes to tracepoints can be 
  propagated across the tooling automatically. (see below examples 
  about how this works in practice)

 - perfcounters integration:

    - it's enumerated and visible in the list of tracepoints:

        aldebaran:~> perf list 2>&1 | grep skb
        skb:kfree_skb                              [Tracepoint event]

    - the tracepoint can be used for statistics (perf stat):

        aldebaran:~> perf stat -e skb:kfree_skb -a sleep 1

        Performance counter stats for 'sleep 1':

    - noise analysis:

        aldebaran:~> perf stat --repeat 10 -e skb:kfree_skb -a sleep 1

        Performance counter stats for 'sleep 1' (10 runs):

             25  skb:kfree_skb              ( +-   7.692% )

    - the tracepoint can be used for profiling:

        aldebaran:~> perf top -e skb:kfree_skb -c 1

  ------------------------------------------------------------------------------
   PerfTop:     334 irqs/sec  kernel: 0.3% [1 skb:kfree_skb],  (all, 16 CPUs)
  ------------------------------------------------------------------------------
             samples    pcnt         RIP          kernel function
  ______     _______   _____   ________________   _______________
 
               23.00 - 100.0% - ffffffff81266828 : store_bind

    - can be used to do call-graph profiling that captures kernel 
      and user-space call-graphs as well:

     aldebaran:~> perf record --call-graph -e skb:kfree_skb -c 1 -f -a sleep 1
     [ perf record: Captured and wrote 0.035 MB perf.data (~1547 samples) ]

     aldebaran:~> perf report
     ...

# Samples: 4102
#
# Overhead          Command                                                                                             Shared Object  Symbol
# ........  ...............  ........................................................................................................  ......
#
    88.44%          distccd                                                                                                3641efb1d0  [.] 0x00003641efb1d0

     3.07%             Xorg                                                                                                3641ed6590  [.] 0x00003641ed6590

     2.51%  at-spi-registry                                                                                                3642a0db50  [.] 0x00003642a0db50

     2.24%             sshd  /lib64/libc-2.8.so                                                                                        [.] __libc_read

     0.73%             sshd                                                                                              7f71d4e69590  [.] 0x007f71d4e69590

     0.63%             init  [kernel]                                                                                                  [k] store_bind
     0.56%             sshd  /lib64/libc-2.8.so                                                                                        [.] __recvmsg

     0.49%  gnome-settings-                                                                                                3642a0db8b  [.] 0x00003642a0db8b

     0.39%             sshd  /lib64/libc-2.8.so                                                                                        [.] __GI___libc_connect

     0.39%             sshd  /lib64/libc-2.8.so                                                                                        [.] __sendto_nocancel

     0.15%               id  /lib64/libc-2.8.so                                                                                        [.] __GI___libc_connect
                |          
                |--50.00%-- get_mapping
                |          __nscd_get_map_ref
                |          
                 --50.00%-- __nscd_open_socket

     0.10%         metacity                                                                                                3641ed6590  [.] 0x00003641ed6590

     0.07%  gdm-simple-gree                                                                                                3642a0db8b  [.] 0x00003642a0db8b
                |          
                |--66.67%-- 0x3641ed65cb
                |          
                 --33.33%-- 0x3642a0db8b

     0.05%             bash  /lib64/libc-2.8.so                                                                                        [.] __GI___libc_connect
                |          
                |--50.00%-- get_mapping
                |          __nscd_get_map_ref
                |          
                 --50.00%-- __nscd_open_socket

     0.05%            :3129  /lib64/libc-2.8.so                                                                                        [.] __GI___libc_connect
                |          
                |--50.00%-- get_mapping
                |          __nscd_get_map_ref
                |          
                 --50.00%-- __nscd_open_socket

     0.05%            :3098  /lib64/libc-2.8.so                                                                                        [.] __GI___libc_connect
                |          
                |--50.00%-- get_mapping
                |          __nscd_get_map_ref
                |          
                 --50.00%-- __nscd_open_socket

     0.02%             init  [kernel]                                                                                                  [k] bind_con_driver
     0.02%  gnome-power-man                                                                                                3642a0db50  [.] 0x00003642a0db50

     0.02%              cc1  /opt/crosstool/gcc-4.2.2-glibc-2.3.6/i686-unknown-linux-gnu/libexec/gcc/i686-unknown-linux-gnu/4.2.2/cc1  [.] num_positive

   - can be used to capture traces to user-space and analyze them 
     there:

     aldebaran:/home/mingo> perf record -e skb:kfree_skb:r -c 1 -R -f -a sleep 10
     [ perf record: Captured and wrote 4.426 MB perf.data (~193365 samples) ]

     aldebaran:/home/mingo> perf trace
     version = 0.5B6
            init-0     [000]     0.000000: kfree_skb: skbaddr=0xffff8801bcc15300 protocol=2048 location=0xffffffff81461c94
            Xorg-4411  [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
 at-spi-registry-4948  [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
     ...

 - generic tracepoints can be available with lots of other 
   tracepoints at once - while the skb_sources plugin is exclusive.
   (no other plugin can be active at the same time) Generic 
   tracepoints have separate toggles - any sub-set of tracepoints 
   can be active at any time.

 - per tracepoint filter expressions support, such as:

    aldebaran:/debug/tracing/events/skb/kfree_skb> echo 'protocol == 0 && common_pid == 123' > filter 
    aldebaran:/debug/tracing/events/skb/kfree_skb> cat filter
    protocol == 0 && common_pid == 123

   When this filter is modified, the kernel creates a (safe) list of
   (atomically evaluatable) predicates from the expression and the
   data is filtered before it's traced.

   The filter engine works in process, softirq, IRQ, NMI and any
   other context and is very fast as well. (no parsing overhead in 
   the fastpath - we pre-parse the expression and break it down.)
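   As a rough user-space illustration of the pre-parsing idea (this is
   not the kernel's filter code, only a sketch of the same principle),
   the expression can be broken into a flat list of predicates once,
   so that checking an event later involves no parsing at all. The
   names compile_filter and matches are hypothetical, and the sketch
   assumes only '&&'-joined integer comparisons with
   whitespace-separated tokens:

```python
import operator

# Comparison operators supported by this sketch (an assumed subset).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def compile_filter(expr):
    """Pre-parse 'field OP int && field OP int ...' into predicates."""
    predicates = []
    for clause in expr.split("&&"):
        parts = clause.split()
        if len(parts) != 3 or parts[1] not in OPS:
            raise ValueError("unsupported clause: %r" % clause.strip())
        field, op, value = parts[0], OPS[parts[1]], int(parts[2], 0)
        predicates.append((field, op, value))
    return predicates


def matches(predicates, event):
    """Evaluate the pre-parsed predicates against one event (a dict)."""
    return all(op(event[field], value) for field, op, value in predicates)


# Parse once, when the filter is written ...
preds = compile_filter("protocol == 0 && common_pid == 123")

# ... then evaluate cheaply for every event.
print(matches(preds, {"protocol": 0, "common_pid": 123}))
print(matches(preds, {"protocol": 2048, "common_pid": 123}))
```

   The first event passes the filter and the second is rejected. The
   design point is that the cost of parsing is paid when the filter
   is set, not per event.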

In other words, generic tracepoints are _vastly_ superior to the 
skb_sources plugin, and this fact is obvious to all tracing 
developers, that's why every tracing developer who commented on this 
thread asked (in a rather befuddled way) "why not TRACE_EVENT()?".

And note that the above examples were based on a _single_ existing 
generic tracepoint of very limited utility - and still it already 
allowed a lot of interesting data to be captured. If we had a more 
comprehensive set of skb tracepoints, a whole lot of interesting 
possibilities would open up ...
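To give a feel for how little user-space code is needed to work with 
such data, here is a minimal Python sketch that parses lines in the 
shape of the 'perf trace' output shown earlier and counts kfree_skb 
events per protocol. The regular expression and the helper name 
parse_trace_line are assumptions made for this example; they are not 
part of perf or of any existing tool, and the line layout is inferred 
only from the three sample lines above.

```python
import re
from collections import Counter

# Assumed line layout, taken from the sample output:
#   comm-pid [cpu] timestamp: event: key=value key=value ...
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+(?P<fields>.*)$"
)


def parse_trace_line(line):
    """Turn one trace line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    record = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "event": m.group("event"),
    }
    # Each payload field is key=value; base 0 accepts both decimal and 0x... values.
    for pair in m.group("fields").split():
        key, sep, value = pair.partition("=")
        if sep:
            record[key] = int(value, 0)
    return record


SAMPLE = """\
            init-0     [000]     0.000000: kfree_skb: skbaddr=0xffff8801bcc15300 protocol=2048 location=0xffffffff81461c94
            Xorg-4411  [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
 at-spi-registry-4948  [000]     0.000000: kfree_skb: skbaddr=0xffff8801bb955a00 protocol=0 location=0xffffffff814e8aff
"""

records = [r for r in map(parse_trace_line, SAMPLE.splitlines()) if r]
by_protocol = Counter(r["protocol"] for r in records)
print(sorted(by_protocol.items()))
```

For the three sample lines this reports two events with protocol 0 
and one with protocol 2048.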

All in all, we dont do new ftrace plugins that can be done via 
generic tracepoints - we only limit ftrace plugins to vastly 
different things like the function tracer or the latency tracer.

That's why we have things like a tracing tree and a review process, 
to address such issues before patches get committed.

David, please sort this out before sending any bits in this area to 
Linus, Neil's response is basically "i want it this way" which is 
not really acceptable - the maintainers of kernel/trace/* dont want 
it this way, for very good technical reasons.

The skb_sources hack should be converted to a proper 
TRACE_EVENT(skb_dequeue) tracepoint. Also, as we offered at the 
outset, we'd be glad to help out with the conversion. I can do a 
patch if nobody volunteers.

Plus we'd like to encourage more TRACE_EVENT() networking 
tracepoints like the existing kfree_skb. They are a great tool.

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 20:23                                   ` Neil Horman
@ 2009-08-26 20:40                                     ` Ingo Molnar
  2009-08-26 22:39                                       ` Neil Horman
  2009-08-27  0:30                                       ` blktrace ftrace plugin, was " Christoph Hellwig
  2009-08-26 23:46                                     ` Frederic Weisbecker
  1 sibling, 2 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 20:40 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > 
> > * David Miller <davem@davemloft.net> wrote:
> > 
> > > From: Ingo Molnar <mingo@elte.hu>
> > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > 
> > > > Sigh, no. Please re-read the past discussions about this. 
> > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > terms of TRACE_EVENT()?
> > > 
> > > Neil explained why he needed to implement it this way in his reply 
> > > to Steven Rostedt.  I attach it here for your convenience.
> > 
> > thanks. The argument is invalid:
> 
> Just because you assert that doesn't make it so, Ingo.

I stand by that statement, the argument is invalid, for the many 
reasons i outlined in my previous mails. (you'd have gotten those 
same arguments had you submitted that patch to the folks who 
maintain kernel/trace/)

> > > > BTW, why not just do this as events? Or was this just an easy way 
> > > > to communicate with the user space tools?
> > > 
> > > Thats exactly why I did it.  the idea is for me to now write a 
> > > user space tool that lets me analyze the events and adjust process 
> > > scheduling to optimize the rx path. Neil
> > 
> > All tooling (in fact _more_ tooling) can be done based on generic, 
> > TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> > available, have a generalized format with format parsers and user 
> > tooling implemented, etc. etc.
> 
> Then why allow for ftrace modules at all?  [...]

We routinely reject trivial plugins like yours and ask people to use 
the proper mechanism: TRACE_EVENT().

We are also converting non-trivial plugins to generic tracepoints. A 
recent example are the system call tracepoints, but we also 
converted blktrace and kmemtrace to generic tracepoints.

But trace_skb_sources.c got committed to the networking tree, 
without review and acks from the tracing folks. Now you are 
unwilling to fix it and that's not very constructive.

> [...] I grant that the skb ftracer is a bit trivial at the moment 
> for an ftrace module, but I really prefer to leave it as is so that I 
> can expand it with additional tracepoints.  And looking at them, 
> anything you've said above applies to any of the currently 
> implemented ftrace modules.  If you're so adamant that we should 
> just do everything with TRACE_EVENT log messages, then lets get 
> rid of the ftrace infrastructure all together.  Until we do that, 
> however, I like my skb tracer just as it is.

You dont seem to be aware of the breadth of features and capabilities 
that TRACE_EVENT() based tooling allows us to do. Please see my 
previous mail about an (incomplete) list.

( One item i forgot to mention there: using them you can for example 
  trace full workloads, such as a kernel build - without other 
  workloads mixed into that trace. Etc. etc. - the list goes on. )

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 20:40                                     ` Ingo Molnar
@ 2009-08-26 22:39                                       ` Neil Horman
  2009-08-26 22:44                                         ` David Miller
                                                           ` (3 more replies)
  2009-08-27  0:30                                       ` blktrace ftrace plugin, was " Christoph Hellwig
  1 sibling, 4 replies; 95+ messages in thread
From: Neil Horman @ 2009-08-26 22:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > > 
> > > * David Miller <davem@davemloft.net> wrote:
> > > 
> > > > From: Ingo Molnar <mingo@elte.hu>
> > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > > 
> > > > > Sigh, no. Please re-read the past discussions about this. 
> > > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > > terms of TRACE_EVENT()?
> > > > 
> > > > Neil explained why he needed to implement it this way in his reply 
> > > > to Steven Rostedt.  I attach it here for your convenience.
> > > 
> > > thanks. The argument is invalid:
> > 
> > Just because you assert that doesn't make it so, Ingo.
> 
> I stand by that statement, the argument is invalid, for the many 
> reasons i outlined in my previous mails. (you'd have gotten those 
> same arguments had you submitted that patch to the folks who 
> maintain kernel/trace/)
> 
Steven specifically told me to submit the patch to the subsystem maintainer that
I'm adding tracepoints for, and the only feedback I got on it was his one
question, the answer to which I assume satisfied him, since there was no
subsequent discussion.  I'm going to ignore your previous emails because,
despite the various advantages of just using plain TRACE_EVENTs, you
provide the ftrace interface, and I found it useful.  Your observation is
correct, I like it, and thats what I wanted to use, so I used it.  If you don't
want people to use it, don't provide it.

> > > > > BTW, why not just do this as events? Or was this just an easy way 
> > > > > to communicate with the user space tools?
> > > > 
> > > > Thats exactly why I did it.  the idea is for me to now write a 
> > > > user space tool that lets me analyze the events and adjust process 
> > > > scheduling to optimize the rx path. Neil
> > > 
> > > All tooling (in fact _more_ tooling) can be done based on generic, 
> > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> > > available, have a generalized format with format parsers and user 
> > > tooling implemented, etc. etc.
> > 
> > Then why allow for ftrace modules at all?  [...]
> 
> We routinely reject trivial plugins like yours and ask people to use 
> the proper mechanism: TRACE_EVENT().
> 
Again, if you consider there to only be one proper mechanism here, don't provide
others.

> We are also converting non-trivial plugins to generic tracepoints. A 
> recent example are the system call tracepoints, but we also 
> converted blktrace and kmemtrace to generic tracepoints.
> 
If you're getting rid of ftrace, then fine, just say so.  If the interface I
chose is getting removed, I'll change it.  But I'm not going to change it just
because you're going around saying my previous work sucks.  Theres nothing wrong
with it, it works quite well right now as it is.

> But trace_skb_sources.c got committed to the networking tree, 
> without review and acks from the tracing folks. Now you are 
> unwilling to fix it and that's not very constructive.
> 
I'm not willing to fix it because its not broken.  I submitted it where steven
suggested that I submitted it, and the reviews that I got were positive.  All
you've told me is that you think theres a better way.  Its fine if theres a
better way, but the way I have currently is sufficient.  I have actual bugs to
fix.  Rewriting this to suit your opinions after the fact really isn't
productive for me.

> > [...] I grant that the skb ftracer is a bit trivial at the moment 
> > for an ftrace module, but I really prefer to leave it as is so that I 
> > can expand it with additional tracepoints.  And looking at them, 
> > anything you've said above applies to any of the currently 
> > implemented ftrace modules.  If you're so adamant that we should 
> > just do everything with TRACE_EVENT log messages, then lets get 
> > rid of the ftrace infrastructure all together.  Until we do that, 
> > however, I like my skb tracer just as it is.
> 
> You dont seem to be aware of the breadth of features and capabilities 
> that TRACE_EVENT() based tooling allows us to do. Please see my 
> previous mail about an (incomplete) list.
> 
Fine, I grant you that TRACE_EVENT might provide great advantages over an ftrace
module.  What you seem to be missing is that an ftrace module is sufficient for
the needs of what I was tracing.


Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.  The code
I wrote works fairly well in my view, and I feel like the review on it was both
positive and sufficient for inclusion.  But thats not my call, its yours.  I can
meet my own need with a raw TRACE_EVENT for now just as easily.  If you feel
like the skb plugin should be pulled, please do so, and let me know.  All I ask
is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place.  If you pull
the ftrace plugin, I'll submit a subsequent patch to augment the printing format
so that I can gather the numa allocation and consumption data directly there.

Regards
Neil



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:39                                       ` Neil Horman
@ 2009-08-26 22:44                                         ` David Miller
  2009-08-26 23:05                                           ` Ingo Molnar
                                                             ` (2 more replies)
  2009-08-26 23:14                                         ` Ingo Molnar
                                                           ` (2 subsequent siblings)
  3 siblings, 3 replies; 95+ messages in thread
From: David Miller @ 2009-08-26 22:44 UTC (permalink / raw)
  To: nhorman; +Cc: mingo, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 26 Aug 2009 18:39:22 -0400

> Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.

I've gotten this kind of urging, both in private and in public, from
both of you now.  And I'm sorry, that's not how this works.

It is not my job to somehow force you turkeys how to work effectively
together. :-)

What we can do is ask Mr. Rostedt to assess the situation and give
his feedback.

So if Steven could give some feedback about this specific situation
that would be great and might help us move forward.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 20:01                               ` Neil Horman
@ 2009-08-26 22:57                                 ` Ingo Molnar
  0 siblings, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 22:57 UTC (permalink / raw)
  To: Neil Horman
  Cc: David S. Miller, Steven Rostedt, Frédéric Weisbecker, Bill Fink,
	Linux Network Developers, brice, gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > Is there anything in it that cannot be expressed in terms of 
> > TRACE_EVENT()?
>
> As David noted in my previous posting, no, I don't intend to 
> change this. [...]

Well, this change lacks the ack of the maintainers of kernel/trace/* 
for the technical reasons outlined in the (many...) mails sent on 
this topic, so for the .32 networking tree to be properly pushable 
to Linus you'll have to come up with a better answer than "I don't 
intend to change this".

David, i tried to help but i really dont have time to deal with an 
inefficient workflow like this. Two weeks ago you committed a 
clearly broken patch to the tracing code, it had bugs, it was 
objected to because it does the wrong thing altogether and Neil 
refuses to fix it and i'm supposed to convince Neil what the right 
solution is?

That's not how maintenance is supposed to work, it's utterly not 
scalable. Please deal with it one way or another.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:44                                         ` David Miller
@ 2009-08-26 23:05                                           ` Ingo Molnar
  2009-08-26 23:08                                             ` David Miller
  2009-08-26 23:05                                           ` Steven Rostedt
  2009-08-26 23:19                                           ` Neil Horman
  2 siblings, 1 reply; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:05 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin


* David Miller <davem@davemloft.net> wrote:

> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
> 
> > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your 
> > hands.
> 
> I've gotten this kind of urging, both in private and in public, 
> from both of you now.  And I'm sorry, that's not how this works.
> 
> It is not my job to somehow force you turkeys how to work 
> effectively together. :-)

It is definitely your job to ensure that you do not commit deficient 
patches to kernel/trace/ via the networking tree. You created this 
situation to begin with so you might as well take some 
responsibility and help resolve it.

And the thing is, we are not rigid about these things in the tracing 
tree and if this was a good change we would not mind and you'd have 
my Ack. The problem is that it's a crappy change and that Neil is 
refusing to fix it. So please fix it,

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:44                                         ` David Miller
  2009-08-26 23:05                                           ` Ingo Molnar
@ 2009-08-26 23:05                                           ` Steven Rostedt
  2009-08-26 23:09                                             ` David Miller
  2009-08-26 23:23                                             ` Neil Horman
  2009-08-26 23:19                                           ` Neil Horman
  2 siblings, 2 replies; 95+ messages in thread
From: Steven Rostedt @ 2009-08-26 23:05 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, Ingo Molnar, Frederic Weisbecker, billfink, netdev,
	brice, gallatin


On Wed, 26 Aug 2009, David Miller wrote:

> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
> 
> > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.
> 
> I've gotten this kind of urging, both in private and in public, from
> both of you now.  And I'm sorry, that's not how this works.
> 
> It is not my job to somehow force you turkeys how to work effectively
> together. :-)
> 
> What we can do is ask Mr. Rostedt to assess the situation and give
> his feedback.
> 
> So if Steven could give some feedback about this specific situation
> that would be great and might help us move forward.

OK, here's my thought on the matter.

How about Neil try out doing all he can with the existing TRACE_EVENT 
work. I'm sure Ingo and myself would be fine with helping him with any 
issues he comes up with. If there is something that he hates about it, 
that really makes his user space code messy, then he can put the ball back 
in our court, and Ingo and I will need to come up with a solution.

If we truly hit a show stopper, then we can always fall back to the ftrace 
plugin. But until we find out for sure that TRACE_EVENT is not good 
enough, then we should try that out.

How's that sound?

-- Steve


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:05                                           ` Ingo Molnar
@ 2009-08-26 23:08                                             ` David Miller
  2009-08-26 23:58                                               ` Ingo Molnar
  0 siblings, 1 reply; 95+ messages in thread
From: David Miller @ 2009-08-26 23:08 UTC (permalink / raw)
  To: mingo; +Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin

From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 27 Aug 2009 01:05:14 +0200

> The problem is that it's a crappy change and that Neil is 
> refusing to fix it. So please fix it,

Thankfully, Steven Rostedt gave a much more useful and
reasonable response than you.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:05                                           ` Steven Rostedt
@ 2009-08-26 23:09                                             ` David Miller
  2009-08-26 23:30                                               ` Ingo Molnar
  2009-08-26 23:23                                             ` Neil Horman
  1 sibling, 1 reply; 95+ messages in thread
From: David Miller @ 2009-08-26 23:09 UTC (permalink / raw)
  To: rostedt; +Cc: nhorman, mingo, fweisbec, billfink, netdev, brice, gallatin

From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 26 Aug 2009 19:05:20 -0400 (EDT)

> How about Neil try out doing all he can with the existing TRACE_EVENT 
> work. I'm sure Ingo and myself would be fine with helping him with any 
> issues he comes up with. If there is something that he hates about it, 
> that really makes his user space code messy, then he can put the ball back 
> in our court, and Ingo and I will need to come up with a solution.
> 
> If we truly hit a show stopper, then we can always fall back to the ftrace 
> plugin. But until we find out for sure that TRACE_EVENT is not good 
> enough, then we should try that out.
> 
> How's that sound?

That works for me, thanks Steven!

I'll revert Neil's change from net-next-2.6 and we can work on a
usable solution, long-term.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:39                                       ` Neil Horman
  2009-08-26 22:44                                         ` David Miller
@ 2009-08-26 23:14                                         ` Ingo Molnar
  2009-08-26 23:33                                         ` Steven Rostedt
  2009-08-27  0:34                                         ` Christoph Hellwig
  3 siblings, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:14 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Miller, rostedt, fweisbec, billfink, netdev, brice, gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * David Miller <davem@davemloft.net> wrote:
> > > > 
> > > > > From: Ingo Molnar <mingo@elte.hu>
> > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > > > 
> > > > > > Sigh, no. Please re-read the past discussions about this. 
> > > > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > > > terms of TRACE_EVENT()?
> > > > > 
> > > > > Neil explained why he needed to implement it this way in his reply 
> > > > > to Steven Rostedt.  I attach it here for your convenience.
> > > > 
> > > > thanks. The argument is invalid:
> > > 
> > > Just because you assert that doesn't make it so, Ingo.
> > 
> > I stand by that statement, the argument is invalid, for the many 
> > reasons i outlined in my previous mails. (you'd have gotten those 
> > same arguments had you submitted that patch to the folks who 
> > maintain kernel/trace/)
> 
> Steven specifically told me to submit the patch to the subsystem 
> maintainer that I'm adding tracepoints for, and the only feedback 
> I got on it was his one question, the answer to which I assume 
> satisfied him, since there was no subsequent discussion.

I dont speak for Steve but i cannot imagine him suggesting to you to 
add a new plugin to kernel/trace/.

'adding tracepoints' is a shortcut for TRACE_EVENT() these days. 
Those are fundamentally decentralized indeed - but that's not what 
you used.

> I'm going to ignore your previous emails because, despite the 
> various advantages of just using plain TRACE_EVENTs, you 
> provide the ftrace interface, and I found it useful.  Your 
> observation is correct, I like it, and thats what I wanted to use, 
> so I used it.  If you don't want people to use it, don't provide 
> it.

This might be convenient to you, but that's not how kernel 
maintenance works.

By your argument it would be fine for me to add a new networking 
protocol to net/ and ignore the objections from networking 
maintainers, with the argument that 'you provided protocol 
interfaces and i just made use of it and like it'?

> > > > > > BTW, why not just do this as events? Or was this just an easy way 
> > > > > > to communicate with the user space tools?
> > > > > 
> > > > > Thats exactly why I did it.  the idea is for me to now write a 
> > > > > user space tool that lets me analyze the events and adjust process 
> > > > > scheduling to optimize the rx path. Neil
> > > > 
> > > > All tooling (in fact _more_ tooling) can be done based on generic, 
> > > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> > > > available, have a generalized format with format parsers and user 
> > > > tooling implemented, etc. etc.
> > > 
> > > Then why allow for ftrace modules at all?  [...]
> > 
> > We routinely reject trivial plugins like yours and ask people to 
> > use the proper mechanism: TRACE_EVENT().
> 
> Again, if you consider there to only be one proper mechanism here, 
> don't provide others.

We dont provide them. kernel/trace/ is an internal directory to the 
tracing subsystem.

> > We are also converting non-trivial plugins to generic tracepoints. A 
> > recent example are the system call tracepoints, but we also 
> > converted blktrace and kmemtrace to generic tracepoints.
> 
> If you're getting rid of ftrace, then fine, just say so.  If the 
> interface I chose is getting removed, I'll change it.  But I'm not 
> going to change it just because you're going around saying my 
> previous work sucks.  Theres nothing wrong with it, it works quite 
> well right now as it is.

where did i say that we are getting rid of ftrace? We are not 
getting rid of it.

> > But trace_skb_sources.c got committed to the networking tree, 
> > without review and acks from the tracing folks. Now you are 
> > unwilling to fix it and that's not very constructive.
> 
> I'm not willing to fix it because its not broken.  I submitted it 
> where steven suggested that I submitted it, and the reviews that I 
> got were positive.  All you've told me is that you think theres a 
> better way.  Its fine if theres a better way, but the way I have 
> currently is sufficient.  I have actual bugs to fix.  Rewriting 
> this to suit your opinions after the fact really isn't productive 
> for me.

No, you should do it differently because 1) it's in the wrong tree 
2) the maintainers of this code asked you to do that.

We'd never have committed your patch to the tracing tree - it was 
David's mistake to commit it.

Btw., commit 9ec04da74 lacks Steve's ack and has an ugly diffstat 
mixed into the commit log:

 Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

  Makefile            |    1
  trace.h             |   19 ++++++
  trace_skb_sources.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++++++++
  3 files changed, 174 insertions(+)
 Signed-off-by: David S. Miller <davem@davemloft.net>

> > > [...] I grant that the skb ftracer is a bit trivial at the moment 
> > > for an ftrace module, but I really prefer to leave it as is so that I 
> > > can expand it with additional tracepoints.  And looking at them, 
> > > anything you've said above applies to any of the currently 
> > > implemented ftrace modules.  If you're so adamant that we should 
> > > just do everything with TRACE_EVENT log messages, then lets get 
> > > rid of the ftrace infrastructure all together.  Until we do that, 
> > > however, I like my skb tracer just as it is.
> > 
> > You dont seem to be aware of the breadth of features and capabilities 
> > that TRACE_EVENT() based tooling allows us to do. Please see my 
> > previous mail about an (incomplete) list.
> 
> Fine, I grant you that TRACE_EVENT might provide great advantages 
> over an ftrace module.  What you seem to be missing is that an 
> ftrace module is sufficnet for the needs of what I was tracing.
> 
> Ok, I'm rather tired of arguing.  Dave, I'll leave this in your 
> hands.  The code I wrote works fairly well in my view, and I feel 
> like the review on it was both positive and sufficent for 
> inclusion.  But thats not my call, its yours.  I can meet my own 
> need with a raw TRACE_EVENT for now just as easily.  IF you feel 
> like the skb plugin should be pulled, please do so, and let me 
> know.  All I ask is that you keep the skb_copy_datagram_iovec 
> TRACE_EVENT in place.  If you pull the ftrace plugin, I'll submit 
> a subsequent patch to agument the printing format so that I can 
> gather the numa allocation and consumption data directly there.

Well, David is not maintaining kernel/trace/ last I checked, so I'm 
puzzled why you leave it 'in his hands'.

	Ingo


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:44                                         ` David Miller
  2009-08-26 23:05                                           ` Ingo Molnar
  2009-08-26 23:05                                           ` Steven Rostedt
@ 2009-08-26 23:19                                           ` Neil Horman
  2 siblings, 0 replies; 95+ messages in thread
From: Neil Horman @ 2009-08-26 23:19 UTC (permalink / raw)
  To: David Miller; +Cc: mingo, rostedt, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 03:44:10PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Wed, 26 Aug 2009 18:39:22 -0400
> 
> > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.
> 
> I've gotten this kind of urging, both in private and in public, from
> both of you now.  And I'm sorry, that's not how this works.
> 
> It is not my job to somehow force you turkeys how to work effectively
> together. :-)
> 
> What we can do is ask Mr. Rostedt to asses the situation and give
> his feedback.
> 
> So if Steven could give some feedback about this specific situation
> that would be great and might help us move forward.
> 
That's a fine solution by me. I'll go with Steven's decision.
Neil



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:05                                           ` Steven Rostedt
  2009-08-26 23:09                                             ` David Miller
@ 2009-08-26 23:23                                             ` Neil Horman
  2009-08-26 23:29                                               ` David Miller
  1 sibling, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-26 23:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Miller, Ingo Molnar, Frederic Weisbecker, billfink, netdev,
	brice, gallatin

On Wed, Aug 26, 2009 at 07:05:20PM -0400, Steven Rostedt wrote:
> 
> On Wed, 26 Aug 2009, David Miller wrote:
> 
> > From: Neil Horman <nhorman@tuxdriver.com>
> > Date: Wed, 26 Aug 2009 18:39:22 -0400
> > 
> > > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.
> > 
> > I've gotten this kind of urging, both in private and in public, from
> > both of you now.  And I'm sorry, that's not how this works.
> > 
> > It is not my job to somehow force you turkeys how to work effectively
> > together. :-)
> > 
> > What we can do is ask Mr. Rostedt to asses the situation and give
> > his feedback.
> > 
> > So if Steven could give some feedback about this specific situation
> > that would be great and might help us move forward.
> 
> OK, here's my thought on the matter.
> 
> How about Neil try out doing all he can with the existing TRACE_EVENT 
> work. I'm sure Ingo and myself would be fine with helping him with any 
> issues he comes up with. If there is something that he hates about it, 
> that really makes his user space code messy, then he can put the ball back 
> in our court, and Ingo and I will need to come up with a solution.
> 
> If we truly hit a show stopper, than we can always fall back to the ftrace 
> plugin. But until we find out for sure that TRACE_EVENT is not good 
> enough, then we should try that out.
> 
> How's that sound?
> 

OK, that's fine by me.  I really don't have any opposition to just using raw
TRACE_EVENTs for my current purposes, but as Ingo's previous mail shows, there's
_a lot_ to it.  Using the ftrace interface was really, in the end, just simpler
for me.  But if just using TRACE_EVENT is the way it needs to be, so be it.

Dave, would you please revert commit 9ec04da7489d2c9ae01ea6e9b5fa313ccf3d35fb
and 5a165657bef7c47e5ff4cd138f7758ef6278e87b?  That should remove the ftrace
code, and leave the TRACE_EVENT tracepoint for skb_copy_datagram_iovec in
place.  I'll submit a patch in the next few days to augment the TRACE_EVENT
format to export all the data that I need.

Thanks!
Neil

> -- Steve
> 
> 


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:23                                             ` Neil Horman
@ 2009-08-26 23:29                                               ` David Miller
  0 siblings, 0 replies; 95+ messages in thread
From: David Miller @ 2009-08-26 23:29 UTC (permalink / raw)
  To: nhorman; +Cc: rostedt, mingo, fweisbec, billfink, netdev, brice, gallatin

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 26 Aug 2009 19:23:04 -0400

> Dave, would you please revert commit 9ec04da7489d2c9ae01ea6e9b5fa313ccf3d35fb
> and 5a165657bef7c47e5ff4cd138f7758ef6278e87b?  That should remove the ftrace
> code, and leave the TRACE_EVENT tracepoint for skb_copy_datagram_iovec in
> place.  I'll submit a patch in the next few days to augment the TRACE_EVENT
> format to export all the data that I need.

Ok.


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:09                                             ` David Miller
@ 2009-08-26 23:30                                               ` Ingo Molnar
  0 siblings, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:30 UTC (permalink / raw)
  To: David Miller
  Cc: rostedt, nhorman, fweisbec, billfink, netdev, brice, gallatin


* David Miller <davem@davemloft.net> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Wed, 26 Aug 2009 19:05:20 -0400 (EDT)
> 
> > How about Neil try out doing all he can with the existing TRACE_EVENT 
> > work. I'm sure Ingo and myself would be fine with helping him with any 
> > issues he comes up with. If there is something that he hates about it, 
> > that really makes his user space code messy, then he can put the ball back 
> > in our court, and Ingo and I will need to come up with a solution.
> > 
> > If we truly hit a show stopper, than we can always fall back to the ftrace 
> > plugin. But until we find out for sure that TRACE_EVENT is not good 
> > enough, then we should try that out.
> > 
> > How's that sound?
> 
> That works for me, thanks Steven!
> 
> I'll revert Neil's change from net-next-2.6 and we can work on a 
> usable solution, long-term.

Thanks, David!

Also, my prior offer to help out with the TRACE_EVENT conversion 
stands as well, plus more TRACE_EVENT() tracepoints would be welcome 
too in the networking code.

It's a very useful feature and they are a lot easier (and more 
decentralized) to add than new tracing plugins.

Thanks,

	Ingo


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:39                                       ` Neil Horman
  2009-08-26 22:44                                         ` David Miller
  2009-08-26 23:14                                         ` Ingo Molnar
@ 2009-08-26 23:33                                         ` Steven Rostedt
  2009-08-27  0:14                                           ` Neil Horman
  2009-08-27  0:34                                         ` Christoph Hellwig
  3 siblings, 1 reply; 95+ messages in thread
From: Steven Rostedt @ 2009-08-26 23:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin


On Wed, 26 Aug 2009, Neil Horman wrote:

> On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * David Miller <davem@davemloft.net> wrote:
> > > > 
> > > > > From: Ingo Molnar <mingo@elte.hu>
> > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > > > 
> > > > > > Sigh, no. Please re-read the past discussions about this. 
> > > > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > > > terms of TRACE_EVENT()?
> > > > > 
> > > > > Neil explained why he needed to implement it this way in his reply 
> > > > > to Steven Rostedt.  I attach it here for your convenience.
> > > > 
> > > > thanks. The argument is invalid:
> > > 
> > > Just because you assert that doesn't make it so, Ingo.
> > 
> > I stand by that statement, the argument is invalid, for the many 
> > reasons i outlined in my previous mails. (you'd have gotten those 
> > same arguments had you submitted that patch to the folks who 
> > maintain kernel/trace/)
> > 

> Steven specifically told me to submit the patch to the subsystem maintainer that
> I'm adding tracepoints for, and the only feedback I got on it was his one
> question, the answer to which I assume satisfied him, due to that there was no
> subseuqent discussion.  I'm going to ignore your previous emails, because,
> despite the various advantages of just using plain TRACE_EVENTs because you
> provide the ftrace interface, and I found it useful.  Your observation is
> correct, I like it, and thats what I wanted to use, so I used it.  If you don't
> want people to use it, don't provide it.

Actually, I suggested submitting it to the subsystem maintainer if there 
were no changes to the tracing infrastructure. We may have just had a 
misunderstanding there. No biggy.

Yes, the plugins are there for the tracers that are not really events. 
Those are the latency tracers (they have a double trace buffer for 
recording maxes) and the function tracers (they are a separate beast 
themselves). The other tracers are on their way to being obsoleted. 
Ideally the only plugins we should have are:

 function, function_graph, mmiotrace, wakeup_rt, wakeup, irqsoff, 
preemptoff, preemptirqsoff.

The mmiotrace is a neat thing that traps calls of binary drivers to their 
devices, and traces what is written and read.

Thus, the plugins are reserved for the off-the-wall types of tracing: things 
that cannot easily be accomplished with tracepoints.

Currently sched_switch is still there, because the recording of 
task->comm's is associated with that tracer, and until we remove that 
binding, it will stay. But expect it to eventually disappear too.

> 
> > > > > > BTW, why not just do this as events? Or was this just a easy way 
> > > > > > to communicate with the user space tools?
> > > > > 
> > > > > Thats exactly why I did it.  the idea is for me to now write a 
> > > > > user space tool that lets me analyze the events and ajust process 
> > > > > scheduling to optimize the rx path. Neil
> > > > 
> > > > All tooling (in fact _more_ tooling) can be done based on generic, 
> > > > TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> > > > available, have a generalized format with format parsers and user 
> > > > tooling implemented, etc. etc.
> > > 
> > > Then why allow for ftrace modules at all?  [...]
> > 
> > We routinely reject trivial plugins like yours and ask people to use 
> > the proper mechanism: TRACE_EVENT().
> > 
> Again, if you consider there to only be one proper mechanism here, don't provide
> others.

I guess the issue is that the plugins were there first, and that we did 
what trace events do today with the plugins. When TRACE_EVENT became 
mature, it obsoleted a lot of the plugins. Thus we are trying to get rid 
of them. But for those tracers that do not do events, we still need 
the plugin facility.

> 
> > We are also converting non-trivial plugins to generic tracepoints. A 
> > recent example are the system call tracepoints, but we also 
> > converted blktrace and kmemtrace to generic tracepoints.
> > 
> If you're getting rid of ftrace, then fine, just say so.  If the interface I
> chose is getting removed, I'll change it.  But I'm not going to change it just
> because you're going around saying my previous work sucks.  Theres nothing wrong
> with it, it works quite well right now as it is.

It does not suck, but it's "old school" ;-)

> 
> > But trace_skb_sources.c got committed to the networking tree, 
> > without review and acks from the tracing folks. Now you are 
> > unwilling to fix it and that's not very constructive.
> > 
> I'm not willing to fix it because its not broken.  I submitted it where steven
> suggested that I submitted it, and the reviews that I got were positive.  All
> you've told me is that you think theres a better way.  Its fine if theres a
> better way, but the way I have currently is sufficient.  I have acutal bugs to
> fix.  Rewriting this to suit your opinions after the fact really isn't
> productive for me.

I feel guilty here. I misunderstood the scope of your changes, and did not 
realize you were adding a plugin.

> 
> > > [...] I grant that the skb ftracer is a bit trivial at the moment 
> > > for an ftrace module, but I really prefer to leave it is so that I 
> > > can expand it with additional tracepoints.  And looking at them, 
> > > anything you've said above applies to any of the currently 
> > > implemented ftrace modules.  If you're so adamant that we should 
> > > just do everything with TRACE_EVENT log messages, then lets get 
> > > rid of the ftrace infrastructure all together.  Until we do that, 
> > > however, I like my skb tracer just as it is.
> > 
> > You dont seem to be aware of the breath of features and capabilities 
> > that TRACE_EVENT() based tooling allows us to do. Please see my 
> > previous mail about an (incomplete) list.
> > 
> Fine, I grant you that TRACE_EVENT might provide great advantages over an ftrace
> module.  What you seem to be missing is that an ftrace module is sufficnet for
> the needs of what I was tracing.
> 
> 
> Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.  The code
> I wrote works fairly well in my view, and I feel like the review on it was both
> positive and sufficent for inclusion.  But thats not my call, its yours.  I can
> meet my own need with a raw TRACE_EVENT for now just as easily.  IF you feel
> like the skb plugin should be pulled, please do so, and let me know.  All I ask
> is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place.  If you pull
> the ftrace plugin, I'll submit a subsequent patch to agument the printing format
> so that I can gather the numa allocation and consumption data directly there.

Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-).

You probably already read my previous email on the matter. Don't delete 
your plugin patch until we get everything you need with TRACE_EVENT alone.

Thanks,

-- Steve



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 20:23                                   ` Neil Horman
  2009-08-26 20:40                                     ` Ingo Molnar
@ 2009-08-26 23:46                                     ` Frederic Weisbecker
  1 sibling, 0 replies; 95+ messages in thread
From: Frederic Weisbecker @ 2009-08-26 23:46 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, David Miller, rostedt, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 04:23:44PM -0400, Neil Horman wrote:
> On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > 
> > * David Miller <davem@davemloft.net> wrote:
> > 
> > > From: Ingo Molnar <mingo@elte.hu>
> > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > 
> > > > Sigh, no. Please re-read the past discussions about this. 
> > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > terms of TRACE_EVENT()?
> > > 
> > > Neil explained why he needed to implement it this way in his reply 
> > > to Steven Rostedt.  I attach it here for your convenience.
> > 
> > thanks. The argument is invalid:
> > 
> Just because you assert that doesn't make it so, Ingo.
> 
> > > > BTW, why not just do this as events? Or was this just a easy way 
> > > > to communicate with the user space tools?
> > > 
> > > Thats exactly why I did it.  the idea is for me to now write a 
> > > user space tool that lets me analyze the events and ajust process 
> > > scheduling to optimize the rx path. Neil
> > 
> > All tooling (in fact _more_ tooling) can be done based on generic, 
> > TRACE_EVENT() based tracepoints. Generic tracepoints are far more 
> > available, have a generalized format with format parsers and user 
> > tooling implemented, etc. etc.
> > 
> Then why allow for ftrace modules at all?


Well, the old way to implement a tracer was what you did: create
a whole ftrace plugin (i.e. a tracer).

But it's a bit of a burden to implement a tracer: you have to deal
with the ring buffer directly, using code that is pretty much the same
from one trivial tracer to another; you have to deal with output
formatting; and you have to explicitly define your fields, their types
and their formats separately if you want the filters to be supported.

Oh, and you also need to handle your tracepoints by hand and check their
registration results. You also need to implement your own stop and start
callbacks that deactivate your tracepoints.

So that's a lot of repetitive and error-prone work.
Also, kernel/trace hosts a lot of such error-prone code, and it becomes
a maintenance burden not only for you but also for us.

The goal of the TRACE_EVENTs is to reduce the impact of everything I explained
above. You only need to care about the strictly necessary things for your traces:

- field name
- field type
- field formats

And that's pretty much all. All the burden of copying into the ring buffer, filtering,
tracepoints, formats and output is handled in the background.

Also, your tracer becomes non-ABI-dependent because the formats of your fields
are dynamically described in dedicated debugfs files.
Tracer fields, even though we have workarounds to describe their format, are
much more constrained: their format more or less has to stay fixed.
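
As a rough illustration of why self-described formats help userspace: each
event's format file lists its fields together with an offset and a size, so a
reader can discover the record layout at run time instead of hard-coding it.
The following is a minimal user-space sketch, assuming field lines of the form
"field:<decl>; offset:<n>; size:<n>;". The names struct field_desc and
parse_field_line() are hypothetical and not part of any kernel or tool API.

```c
#include <assert.h>
#include <stdio.h>

/* One field entry from an event format description (hypothetical type). */
struct field_desc {
	char decl[64];        /* C declaration of the field, e.g. "int len" */
	unsigned int offset;  /* byte offset of the field within the record */
	unsigned int size;    /* size of the field in bytes */
};

/*
 * Parse one line such as "\tfield:int len;\toffset:12;\tsize:4;".
 * Returns 0 on success, -1 if the line is not a field description.
 * Assumption: the declaration is shorter than 64 bytes and contains no ';'.
 */
static int parse_field_line(const char *line, struct field_desc *out)
{
	assert(line != NULL && out != NULL);

	/* Whitespace in the scanf format matches any run of spaces or tabs. */
	if (sscanf(line, " field:%63[^;]; offset:%u; size:%u;",
		   out->decl, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}
```

A reader would call parse_field_line() on each line of the format description
and skip the lines for which it returns -1 (the event name, ID and print
format lines, for instance).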

Also, a lot of things are developed in userspace that can benefit every TRACE_EVENT,
as Ingo has shown with perf. Steve's trace-cmd tool also handles them.

The ftrace tracer plugins are still used for non-trivial cases where tracing
based on tracepoints is not sufficient. For example, the function/function graph
tracers, which require hot patching and a gcc feature plus a lot of subtle background
things, or the preemptoff/irqsoff/preemptirqsoff tracers, which require a snapshot
of a maximum latency trace, etc...


That's why the ftrace tracer plugins still exist: to cover the non-trivial
cases. But using them for tracing based on simple static tracepoints like yours
is purely legacy.

Frederic.



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:08                                             ` David Miller
@ 2009-08-26 23:58                                               ` Ingo Molnar
  2009-08-27  0:05                                                 ` Steven Rostedt
  2009-08-27  0:35                                                 ` Christoph Hellwig
  0 siblings, 2 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-26 23:58 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, rostedt, fweisbec, billfink, netdev, brice, gallatin


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Thu, 27 Aug 2009 01:05:14 +0200
> 
> > And the thing is, we are not rigid about these things in the 
> > tracing tree and if this was a good change we would not mind and 
> > you'd have my Ack. The problem is that it's a crappy change and 
> > that Neil is refusing to fix it. So please fix it,
> 
> Thankfully, Steven Rostedt gave a much more useful and reasonable 
> response than you.

I'm sorry you got that impression, but you are a maintainer yourself 
so you might perhaps understand why, sooner or later, if a 
maintainer's review does not get acted upon, one has to insist on 
clean patches in stronger terms.

Unfortunately you took away the "do not apply the patch" option from 
me that could have avoided the stronger words and could have kept 
this discussion more polite.

Thanks,

	Ingo


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:58                                               ` Ingo Molnar
@ 2009-08-27  0:05                                                 ` Steven Rostedt
  2009-08-27  0:35                                                 ` Christoph Hellwig
  1 sibling, 0 replies; 95+ messages in thread
From: Steven Rostedt @ 2009-08-27  0:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, nhorman, fweisbec, billfink, netdev, brice, gallatin


On Thu, 27 Aug 2009, Ingo Molnar wrote:

> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Thu, 27 Aug 2009 01:05:14 +0200
> > 
> > > And the thing is, we are not rigid about these things in the 
> > > tracing tree and if this was a good change we would not mind and 
> > > you'd have my Ack. The problem is that it's a crappy change and 
> > > that Neil is refusing to fix it. So please fix it,
> > 
> > Thankfully, Steven Rostedt gave a much more useful and reasonable 
> > response than you.
> 
> I'm sorry you got that impression, but you are a maintainer yourself 
> so you might perhaps understand it why sooner or later, if a 
> maintainer's review does not get acted upon, one has to insist on 
> clean patches in stronger terms.
> 
> Unfortunately you took away the "do not apply the patch" option from 
> me that could have avoided the stronger words and could have kept 
> this discussion more polite.

I feel somewhat at fault here. Neil did give me a heads up on his project, 
but his patches went out when I was getting ready for vacation and had 
other priorities at the time. I could have brought up these issues before 
Dave took them, and he may have taken them because I did not.

But this is all water under the bridge. Time to be more productive.

-- Steve



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:33                                         ` Steven Rostedt
@ 2009-08-27  0:14                                           ` Neil Horman
  2009-08-27  0:29                                             ` Steven Rostedt
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-27  0:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 07:33:55PM -0400, Steven Rostedt wrote:
> 
> On Wed, 26 Aug 2009, Neil Horman wrote:
> 
> > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > On Wed, Aug 26, 2009 at 09:48:35PM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * David Miller <davem@davemloft.net> wrote:
> > > > > 
> > > > > > From: Ingo Molnar <mingo@elte.hu>
> > > > > > Date: Wed, 26 Aug 2009 21:08:30 +0200
> > > > > > 
> > > > > > > Sigh, no. Please re-read the past discussions about this. 
> > > > > > > trace_skb_sources.c is a hack and should be converted to generic 
> > > > > > > tracepoints. Is there anything in it that cannot be expressed in 
> > > > > > > terms of TRACE_EVENT()?
> > > > > > 
> > > > > > Neil explained why he needed to implement it this way in his reply 
> > > > > > to Steven Rostedt.  I attach it here for your convenience.
> > > > > 
> > > > > thanks. The argument is invalid:
> > > > 
> > > > Just because you assert that doesn't make it so, Ingo.
> > > 
> > > I stand by that statement, the argument is invalid, for the many 
> > > reasons i outlined in my previous mails. (you'd have gotten those 
> > > same arguments had you submitted that patch to the folks who 
> > > maintain kernel/trace/)
> > > 
> 
> > Steven specifically told me to submit the patch to the subsystem maintainer that
> > I'm adding tracepoints for, and the only feedback I got on it was his one
> > question, the answer to which I assume satisfied him, due to that there was no
> > subseuqent discussion.  I'm going to ignore your previous emails, because,
> > despite the various advantages of just using plain TRACE_EVENTs because you
> > provide the ftrace interface, and I found it useful.  Your observation is
> > correct, I like it, and thats what I wanted to use, so I used it.  If you don't
> > want people to use it, don't provide it.
> 
> Actually, I suggested to submit it to the subsystem maintainer if there 
> was no changes to the tracing infrastructure. We may have just had a 
> misunderstanding there. No biggy.
> 
I'm not sure how the addition of an ftrace module constitutes a change to the
tracing infrastructure, but whatever, yes, no biggy.  I've begun modifying the
TRACE_EVENT that I added to export the data I need directly.  Should be pretty
straightforward.  Dave, I'll have a patch up on netdev in a day or two after I
test it.  Steven, should this still just go to netdev with a cc to you?  I'd
like to avoid repeating the same confusion a second time around if I can.

> > 
> > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.  The code
> > I wrote works fairly well in my view, and I feel like the review on it was both
> > positive and sufficent for inclusion.  But thats not my call, its yours.  I can
> > meet my own need with a raw TRACE_EVENT for now just as easily.  IF you feel
> > like the skb plugin should be pulled, please do so, and let me know.  All I ask
> > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place.  If you pull
> > the ftrace plugin, I'll submit a subsequent patch to agument the printing format
> > so that I can gather the numa allocation and consumption data directly there.
> 
> Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-).
> 
Yes, that is rather central to what I'm monitoring :)

> You probably already read my previous email on the matter. Don't delete 
> your plugin patch until we get everything you need with TRACE_EVENT alone.
> 
It's OK, I should have the TRACE_EVENT modified to export this stuff directly by
tomorrow or Friday anyway.  I honestly just liked the ftrace interface
better, I found it a bit less confusing :)

Best
Neil

> Thanks,
> 
> -- Steve
> 
> 


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  0:14                                           ` Neil Horman
@ 2009-08-27  0:29                                             ` Steven Rostedt
  2009-08-27  1:17                                               ` Neil Horman
  2009-08-27  9:34                                               ` Ingo Molnar
  0 siblings, 2 replies; 95+ messages in thread
From: Steven Rostedt @ 2009-08-27  0:29 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin


On Wed, 26 Aug 2009, Neil Horman wrote:
> > 
> I'm not sure how the addition of an ftrace module constitutes a change to the
> tracing infrastructure, but whatever, yes, no biggy.  I've bugun modifying the
> TRACE_EVENT that I added to export the data I need directly.  Should be pretty
> straightforward.  Dave I'll have a patch up on netdev in a day or two after I
> test it.  Steven, should this still just go to netdev with a cc to you?  I'd
> like to avoid repeating the same confusion here a second time around if I can

Yes, please Cc me and Ingo on those changes. I see where the confusion
came from. It is about where the code changes. The code in kernel/trace
is considered ftrace internals (there's internal tracing upkeep that is
needed for all plugins). But with TRACE_EVENT, those can happen totally 
inside a subsystem without touching any tracing directory. Those are 
yours, and the TRACE_EVENT is just an API to the rest of the kernel. We 
don't even care if you add a header to include/trace/events/ (if it 
follows the standard format).

But by adding a plugin, it causes more work for us. The plugin types do 
not get automated like TRACE_EVENTs and for binary readers like perf and 
trace-cmd, we need to hand export the binary format for them.
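
To make the point about binary readers concrete: once a reader knows a
field's offset and size from the exported format, pulling the value out of a
raw record is a bounds check and a copy. The sketch below is user-space C
under stated assumptions: read_s32_field() is a hypothetical helper, the field
is a signed 32-bit integer, and the record was written on a machine with the
same endianness as the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a signed 32-bit field located at `offset` in a raw record of
 * `len` bytes. Returns 0 on success, or -1 if the field would fall
 * outside the record. Hypothetical helper for illustration only.
 */
static int read_s32_field(const unsigned char *rec, size_t len,
			  size_t offset, int32_t *out)
{
	assert(rec != NULL && out != NULL);

	/* Written this way to avoid overflow in offset + sizeof(*out). */
	if (offset > len || len - offset < sizeof(*out))
		return -1;

	/* memcpy avoids unaligned access on architectures that care. */
	memcpy(out, rec + offset, sizeof(*out));
	return 0;
}
```

With a plugin-defined record type, the offset and size passed to such a
helper have to come from a hand-maintained export; with TRACE_EVENT they come
from the automatically generated format description.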

> 
> > > 
> > > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.  The code
> > > I wrote works fairly well in my view, and I feel like the review on it was both
> > > positive and sufficent for inclusion.  But thats not my call, its yours.  I can
> > > meet my own need with a raw TRACE_EVENT for now just as easily.  IF you feel
> > > like the skb plugin should be pulled, please do so, and let me know.  All I ask
> > > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place.  If you pull
> > > the ftrace plugin, I'll submit a subsequent patch to agument the printing format
> > > so that I can gather the numa allocation and consumption data directly there.
> > 
> > Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-).
> > 
> Yes, that is rather centeral to what I'm monitoring :)
> 
> > You probably already read my previous email on the matter. Don't delete 
> > your plugin patch until we get everything you need with TRACE_EVENT alone.
> > 
> Its ok, I should have the TRACE_EVENT modified to export this stuff directly by
> tomorrow or friday anyway.  I really honestly just liked the ftrace interface
> better, I found it a bit less confusing :)

Heh, because it was just a bit of cut and paste. But as Frederic said, 
very much prone to errors. And it breaks the binary userspace readers. 
Your new type did not get exported via trace_export.c.

TRACE_EVENT can be a little harder to learn, because it is all MACRO 
magic, but once you understand them, you'll find that they are very easy.

Thanks!

-- Steve



* blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 20:40                                     ` Ingo Molnar
  2009-08-26 22:39                                       ` Neil Horman
@ 2009-08-27  0:30                                       ` Christoph Hellwig
  2009-08-27  5:26                                         ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: rostedt, fweisbec, acme, jens.axboe, linux-kernel

On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> We are also converting non-trivial plugins to generic tracepoints. A 
> recent example are the system call tracepoints, but we also 
> converted blktrace and kmemtrace to generic tracepoints.

On something semi-related:  Any reason to keep the blktrace ftrace
plugin around?  I don't think there's much point in it.  It only got
added in 2.6.29, and all the blktrace tooling just uses the legacy
ioctls.  All new uses should just use the TRACE_EVENT output.


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 22:39                                       ` Neil Horman
                                                           ` (2 preceding siblings ...)
  2009-08-26 23:33                                         ` Steven Rostedt
@ 2009-08-27  0:34                                         ` Christoph Hellwig
  3 siblings, 0 replies; 95+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:34 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, David Miller, rostedt, fweisbec, billfink, netdev,
	brice, gallatin

On Wed, Aug 26, 2009 at 06:39:22PM -0400, Neil Horman wrote:
> Steven specifically told me to submit the patch to the subsystem maintainer that
> I'm adding tracepoints for, and the only feedback I got on it was his one
> question, the answer to which I assume satisfied him, due to that there was no
> subsequent discussion.  I'm going to ignore your previous emails, because,
> despite the various advantages of just using plain TRACE_EVENTs because you
> provide the ftrace interface, and I found it useful.  Your observation is
> correct, I like it, and that's what I wanted to use, so I used it.  If you don't
> want people to use it, don't provide it.

Neil, this attitude is a perfect way to end up on a shitlist.  I think
there is a fair case to make you didn't know that the ftrace plugin was
wrong when you did, but now you do.  And btw, I completely agree with
Ingo here - the TRACE_EVENT stuff is extremely useful to get broader
pictures of what's going on.  E.g. the combination of my unfortunately
not yet included xfs tracer and blktrace allowed debugging quite a lot
of interesting issues.

So instead of playing jackass here listen to what is the right approach
for it and fix it up.  It'll help us all in the end.


Looking forward to the day when plain DECLARE_TRACE goes away so people
can't "accidentally" use it.


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 23:58                                               ` Ingo Molnar
  2009-08-27  0:05                                                 ` Steven Rostedt
@ 2009-08-27  0:35                                                 ` Christoph Hellwig
  2009-08-27  9:28                                                   ` Ingo Molnar
  1 sibling, 1 reply; 95+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, nhorman, rostedt, fweisbec, billfink, netdev,
	brice, gallatin

On Thu, Aug 27, 2009 at 01:58:26AM +0200, Ingo Molnar wrote:
> I'm sorry you got that impression, but you are a maintainer yourself 
> so you might perhaps understand it why sooner or later, if a 
> maintainer's review does not get acted upon, one has to insist on 
> clean patches in stronger terms.

Cool down a bit :)  While I totally agree with you on all the technical
bits here I think a slightly nicer attitude towards Neil and Dave would
help the cause a lot.



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  0:29                                             ` Steven Rostedt
@ 2009-08-27  1:17                                               ` Neil Horman
  2009-08-27  9:06                                                 ` Ingo Molnar
  2009-08-27  9:34                                               ` Ingo Molnar
  1 sibling, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-27  1:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, David Miller, fweisbec, billfink, netdev, brice, gallatin

On Wed, Aug 26, 2009 at 08:29:59PM -0400, Steven Rostedt wrote:
> 
> On Wed, 26 Aug 2009, Neil Horman wrote:
> > > 
> > I'm not sure how the addition of an ftrace module constitutes a change to the
> > tracing infrastructure, but whatever, yes, no biggy.  I've begun modifying the
> > TRACE_EVENT that I added to export the data I need directly.  Should be pretty
> > straightforward.  Dave I'll have a patch up on netdev in a day or two after I
> > test it.  Steven, should this still just go to netdev with a cc to you?  I'd
> > like to avoid repeating the same confusion here a second time around if I can
> 
> Yes, please Cc myself, and Ingo on those changes. I see where the 
> confusion came. It is where the code changes. The code in kernel/trace is 
> considered ftrace internals (there's internal tracing upkeep that is 
> needed for all plugins). But with TRACE_EVENT, those can happen totally 
> inside a subsystem without touching any tracing directory. Those are 
> yours, and the TRACE_EVENT is just an API to the rest of the kernel. We 
> don't even care if you add a header to include/trace/events/ (if it 
> follows the standard format).
> 
> But by adding a plugin, it causes more work for us. The plugin types do 
> not get automated like TRACE_EVENTs and for binary readers like perf and 
> trace-cmd, we need to hand export the binary format for them.
> 
Understood, I'll keep that in mind in the future.

> > 
> > > > 
> > > > Ok, I'm rather tired of arguing.  Dave, I'll leave this in your hands.  The code
> > > > I wrote works fairly well in my view, and I feel like the review on it was both
> > > > positive and sufficient for inclusion.  But that's not my call, it's yours.  I can
> > > > meet my own need with a raw TRACE_EVENT for now just as easily.  If you feel
> > > > like the skb plugin should be pulled, please do so, and let me know.  All I ask
> > > > is that you keep the skb_copy_datagram_iovec TRACE_EVENT in place.  If you pull
> > > > the ftrace plugin, I'll submit a subsequent patch to augment the printing format
> > > > so that I can gather the numa allocation and consumption data directly there.
> > > 
> > > Yes, please keep the TRACE_EVENT (I think we can all agree on that ;-).
> > > 
> > Yes, that is rather central to what I'm monitoring :)
> > 
> > > You probably already read my previous email on the matter. Don't delete 
> > > your plugin patch until we get everything you need with TRACE_EVENT alone.
> > > 
> > It's ok, I should have the TRACE_EVENT modified to export this stuff directly by
> > tomorrow or Friday anyway.  I really honestly just liked the ftrace interface
> > better, I found it a bit less confusing :)
> 
> Heh, because it was just a bit of cut and paste. But as Frederic said, 
> very much prone to errors. And it breaks the binary userspace readers. 
> Your new type did not get exported via trace_export.c.
> 
> TRACE_EVENT can be a little harder to learn, because it is all MACRO 
> magic, but once you understand them, you'll find that they are very easy.
> 
Yeah, macro magic is an understatement.  But I'll have the conversions done in
the next few days, no worries.

Thanks!
Neil

> Thanks!
> 
> -- Steve
> 
> 


* Re: blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  0:30                                       ` blktrace ftrace plugin, was " Christoph Hellwig
@ 2009-08-27  5:26                                         ` Jens Axboe
  2009-08-27  9:12                                           ` Ingo Molnar
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2009-08-27  5:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ingo Molnar, rostedt, fweisbec, acme, linux-kernel

On Wed, Aug 26 2009, Christoph Hellwig wrote:
> On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > We are also converting non-trivial plugins to generic tracepoints. A 
> > recent example are the system call tracepoints, but we also 
> > converted blktrace and kmemtrace to generic tracepoints.
> 
> On something semi-related:  Any reason to keep the blktrace ftrace
> plugin around?  I don't think there's much point in it.  It only got
> added in 2.6.29, and all the blktrace tooling just uses the legacy
> ioctls.  All new uses should just use the TRACE_EVENT output.

Let's kill it.

-- 
Jens Axboe



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  1:17                                               ` Neil Horman
@ 2009-08-27  9:06                                                 ` Ingo Molnar
  0 siblings, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-27  9:06 UTC (permalink / raw)
  To: Neil Horman
  Cc: Steven Rostedt, David Miller, fweisbec, billfink, netdev, brice,
	gallatin


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > TRACE_EVENT can be a little harder to learn, because it is all 
> > MACRO magic, but once you understand them, you'll find that they 
> > are very easy.
> 
> Yeah, macro magic is an understatement.  But I'll have the 
> conversions done in the next few days, no worries.

Cool, thanks Neil!

[ And we tracing folks are rather fond of that macro abuse, so if 
  you can think of ways it could be made even more abusively 
  C-alike, we are all ears ;-) ]

	Ingo


* Re: blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  5:26                                         ` Jens Axboe
@ 2009-08-27  9:12                                           ` Ingo Molnar
  2009-08-27  9:14                                             ` Jens Axboe
  2009-08-28  2:03                                             ` Li Zefan
  0 siblings, 2 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-27  9:12 UTC (permalink / raw)
  To: Jens Axboe, Li Zefan
  Cc: Christoph Hellwig, rostedt, fweisbec, acme, linux-kernel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Wed, Aug 26 2009, Christoph Hellwig wrote:
> > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > > We are also converting non-trivial plugins to generic tracepoints. A 
> > > recent example are the system call tracepoints, but we also 
> > > converted blktrace and kmemtrace to generic tracepoints.
> > 
> > On something semi-related: Any reason to keep the blktrace 
> > ftrace plugin around?  I don't think there's much point in it.  
> > It only got added in 2.6.29, and all the blktrace tooling just 
> > uses the legacy ioctls.  All new uses should just use the 
> > TRACE_EVENT output.
> 
> Let's kill it.

Agreed.

I think we should keep the relayfs and ioctl compatibility bits 
though: blktrace has a mature user-space environment with many
years of installed base.

We could even move those bits back to block/blktrace_compat.c or so 
(after the ftrace plugin bits are removed), to make sure it's nicely 
isolated.

What do you think?

	Ingo


* Re: blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  9:12                                           ` Ingo Molnar
@ 2009-08-27  9:14                                             ` Jens Axboe
  2009-08-27 13:55                                               ` Arnaldo Carvalho de Melo
  2009-08-28  2:03                                             ` Li Zefan
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2009-08-27  9:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Li Zefan, Christoph Hellwig, rostedt, fweisbec, acme, linux-kernel

On Thu, Aug 27 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Wed, Aug 26 2009, Christoph Hellwig wrote:
> > > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > > > We are also converting non-trivial plugins to generic tracepoints. A 
> > > > recent example are the system call tracepoints, but we also 
> > > > converted blktrace and kmemtrace to generic tracepoints.
> > > 
> > > On something semi-related: Any reason to keep the blktrace 
> > > ftrace plugin around?  I don't think there's much point in it.  
> > > It only got added in 2.6.29, and all the blktrace tooling just 
> > > uses the legacy ioctls.  All new uses should just use the 
> > > TRACE_EVENT output.
> > 
> > Let's kill it.
> 
> Agreed.
> 
> I think we should keep the relayfs and ioctl compatibility bits 
> though: blktrace has a mature user-space environment with many
> years of installed base.
> 
> We could even move those bits back to block/blktrace_compat.c or so 
> (after the ftrace plugin bits are removed), to make sure it's nicely 
> isolated.
> 
> What do you think?

Of course, we have to retain the ioctl/relayfs interface, it's been in
use for years. Keeping those out of the other trace/ bits sounds sane.

-- 
Jens Axboe



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  0:35                                                 ` Christoph Hellwig
@ 2009-08-27  9:28                                                   ` Ingo Molnar
  0 siblings, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-27  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Miller, nhorman, rostedt, fweisbec, billfink, netdev,
	brice, gallatin


* Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Aug 27, 2009 at 01:58:26AM +0200, Ingo Molnar wrote:
>
> > I'm sorry you got that impression, but you are a maintainer 
> > yourself so you might perhaps understand it why sooner or later, 
> > if a maintainer's review does not get acted upon, one has to 
> > insist on clean patches in stronger terms.
> 
> Cool down a bit :) [...]

Hello Pot, Kettle here ;-)

I guess i'll have to test the limits of your patience by queueing up 
some bad commit into fs/libfs.c via say the iommu tree, without acks 
and with commit log damage, which patch then triggers a build 
failure and a crash in linux-next (like this one did), and refuse to 
revert and not do anything substantial about your (initially polite) 
review feedback for 2 weeks (like it happened here), and see how 
measured your response will be after the 12th mail that gets faced 
with such passive-aggressive inaction ;-)

At that point, will your wall of patience finally start to crumble a 
tiny bit and will you resort to using the taboo term 'crappy patch' 
perhaps, like i did here? ;-)

Anyway, as Steve said it's now finally water under the bridge, time 
to move on.

	Ingo


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  0:29                                             ` Steven Rostedt
  2009-08-27  1:17                                               ` Neil Horman
@ 2009-08-27  9:34                                               ` Ingo Molnar
  1 sibling, 0 replies; 95+ messages in thread
From: Ingo Molnar @ 2009-08-27  9:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Neil Horman, David Miller, fweisbec, billfink, netdev, brice, gallatin


* Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> On Wed, 26 Aug 2009, Neil Horman wrote:
> > > 
> > I'm not sure how the addition of an ftrace module constitutes a change to the
> > tracing infrastructure, but whatever, yes, no biggy.  I've begun modifying the
> > TRACE_EVENT that I added to export the data I need directly.  Should be pretty
> > straightforward.  Dave I'll have a patch up on netdev in a day or two after I
> > test it.  Steven, should this still just go to netdev with a cc to you?  I'd
> > like to avoid repeating the same confusion here a second time around if I can
> 
> Yes, please Cc myself, and Ingo on those changes. I see where the 
> confusion came. It is where the code changes. The code in 
> kernel/trace is considered ftrace internals (there's internal 
> tracing upkeep that is needed for all plugins). [...]

yeah - i pointed that out in the very first mail to David 9 days ago 
when this patch broke the build in linux-next: kernel/trace/ is like 
net/core/. It would be nice and important if the networking tree 
treated it as such in the future.
 
See the:

  [PATCH -next] trace_skb: fix build when CONFIG_NET is not enabled

discussion on lkml:

  http://lkml.org/lkml/2009/8/17/378

Thanks,

	Ingo


* Re: blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  9:14                                             ` Jens Axboe
@ 2009-08-27 13:55                                               ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 95+ messages in thread
From: Arnaldo Carvalho de Melo @ 2009-08-27 13:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Li Zefan, Christoph Hellwig, rostedt, fweisbec,
	acme, linux-kernel

Em Thu, Aug 27, 2009 at 11:14:54AM +0200, Jens Axboe escreveu:
> On Thu, Aug 27 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Wed, Aug 26 2009, Christoph Hellwig wrote:
> > > > On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
> > > > > We are also converting non-trivial plugins to generic tracepoints. A 
> > > > > recent example are the system call tracepoints, but we also 
> > > > > converted blktrace and kmemtrace to generic tracepoints.
> > > > 
> > > > On something semi-related: Any reason to keep the blktrace 
> > > > ftrace plugin around?  I don't think there's much point in it.  
> > > > It only got added in 2.6.29, and all the blktrace tooling just 
> > > > uses the legacy ioctls.  All new uses should just use the 
> > > > TRACE_EVENT output.
> > > 
> > > Let's kill it.
> > 
> > Agreed.
> > 
> > I think we should keep the relayfs and ioctl compatibility bits 
> > though: blktrace has a mature user-space environment with many
> > years of installed base.
> > 
> > We could even move those bits back to block/blktrace_compat.c or so 
> > (after the ftrace plugin bits are removed), to make sure it's nicely 
> > isolated.
> > 
> > What do you think?
> 
> Of course, we have to retain the ioctl/relayfs interface, it's been in
> use for years. Keeping those out of the other trace/ bits sounds sane.

Yeah, I wonder tho if we couldn't somehow use the ring buffer
infrastructure in such a way as to provide the debugfs visible interface
provided by relayfs, IIRC systemtap is doing such a move too.

- Arnaldo


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 18:08                       ` Neil Horman
  2009-08-26 18:15                         ` Ingo Molnar
@ 2009-08-27 17:32                         ` Bill Fink
  2009-09-02  5:28                           ` Bill Fink
  2009-08-27 17:44                         ` Bill Fink
  2 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-27 17:32 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Wed, 26 Aug 2009, Neil Horman wrote:

> On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > On Fri, 21 Aug 2009, Neil Horman wrote:
> > 
> > > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > > > 
> > > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > > to just 1 Mbps, I immediately got a kernel oops.  I tried to get a
> > > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > > and never produced a crashdump.  I tried this several times and
> > > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > > or UDP test).  A ping test of 1000 9000-byte packets with an interval
> > > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > > worked just fine.
> > > > > 
> > > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > > limited memory.  don't know why you booted to a full system rather than did a
> > > > > crash recovery.  Don't suppose you got a backtrace did you?
> > > > 
> > > > There was a backtrace on the screen but I didn't have a chance to
> > > > record it.  BTW did anyone ever think to print the backtrace in
> > > > reverse (first to some reserved memory and then output to the display)
> > > > so the more interesting parts wouldn't have scrolled off the top of
> > > > the screen?
> > > > 
> > > The real solution is to use a console to which the output doesn't scroll off the
> > > screen.  Normally people use a serial console they can log, or a RAC card that
> > > they can record. Even on a regular vga monitor in text mode, you can set up the
> > > vt iirc to allow for scrolling.
> > 
> > None of our Asus P6T6 systems have serial consoles.  I don't know of
> > any RAC cards for them either, nor are there spare PCI slots available
> > in many cases.  I wouldn't think the Shift-PageUp trick would work
> > with a crashed kernel, but I admit I didn't try it.  I haven't checked
> > out netconsole yet either, but I'm not sure it would help either in a
> > case like this that was a network related kernel crash.
> > 
> Any USB ports that you can attach a serial dongle to?  That would work as well,
> or, as previously mentioned, netconsole also does the trick.

I didn't know you could use a USB serial port as a serial console.
And after wasting several hours yesterday trying to get a USB serial
console to work without any success, I'm giving up on that idea.
Also since it requires building the required usb modules into the
kernel, it wouldn't be practical, since I'd have to rebuild the
kernel quite frequently given the frequency of Fedora kernel updates.
I still need to check into netconsole.

> > In any case, a simple kernel command line that would provide a reversed
> > backtrace would be a simple thing to facilitate Linux users providing
> > useful info to Linux kernel developers in helping to debug kernel
> > problems.  The most useful info would still be on the screen, so it
> > could be transcribed or a photo image of the screen could be taken.
> > 
> I understand what you're saying, I'm just saying there are currently several
> options for you that have already solved this problem in different ways.

I would have been with you if the USB serial console idea had panned out.
But I've just about eliminated all the proposed alternatives as viable,
except for netconsole which I haven't investigated yet.  Sometimes the
additional low tech option of a reversed backtrace would be quite
convenient and not require lots of extra effort from the user.  BTW
ISTR that someone else suggested the same idea a while back, but it
didn't get any traction then either (can't find it in the archives
though from a quick search).

> > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > does have a serial console, and after a fair amount of effort I was
> > able to get it to work as desired, and was able to finally capture
> > a backtrace of the kernel oops.  BTW I believe the reason the
> > kexec/kdump didn't work was probably because it couldn't find
> > a /proc/vmcore file, although I don't know why that would be,
> > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > up normally if it fails to find the /proc/vmcore file (or it's
> > zero size).
> > 
> I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> happy to look into it further.

It's odd.  kexec/kdump works fine with the 2.6.29.6-217.2.3.fc11.x86_64
kernel from Fedora 11 (running on the Fedora 10 system).  I will try
again with the kernel-2.6.31-0.174.rc7.git2.fc12.src.rpm from Fedora 12,
in case it has some secret sauce in one of the Fedora patches to make
the Fedora /etc/init.d/kdump script happy.  kexec/kdump is my preferred
method of dealing with kernel oopses if I can get it to work.

Also, to get the /sbin/mkdumprd to work right, I had to make the
following change to it:

--- .orig/mkdumprd	2009-04-07 10:03:58.000000000 -0400
+++ .mod/mkdumprd	2009-08-19 19:04:38.000000000 -0400
@@ -384,7 +384,7 @@
             vg_list="$vg_list $vg"
             for device in `vgdisplay -v $vg 2>/dev/null | sed -n 's/PV Name//p'`; do
                 IS_UUID=`echo $device | grep UUID`
-                IS_LABEL=`echo $device | grep UUID`
+                IS_LABEL=`echo $device | grep LABEL`
                 if [ -n "$IS_UUID" -o -n "$IS_LABEL" ]
                 then
                     devname=`findfs $device`
@@ -398,7 +398,7 @@
         esac
     else
         IS_UUID=`echo $1 | grep UUID`
-        IS_LABEL=`echo $1 | grep UUID`
+        IS_LABEL=`echo $1 | grep LABEL`
         if [ -n "$IS_UUID" -o -n "$IS_LABEL" ]
         then
             devname=`findfs $1`

Without the patch to the /sbin/mkdumprd script, it couldn't find
my root filesystem on LABEL=root.
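
To illustrate why the one-word change matters, here is a minimal sketch
of the patched check.  It is not the actual mkdumprd code: the
classify_spec helper and the sample UUID value are hypothetical, and
only the two grep tests are taken from the diff above.

```shell
#!/bin/sh
# Hypothetical helper, not part of mkdumprd: classify a device spec
# using the same two grep tests as the patched script.
classify_spec() {
    spec=$1
    IS_UUID=`echo "$spec" | grep UUID`
    IS_LABEL=`echo "$spec" | grep LABEL`
    if [ -n "$IS_UUID" ]; then
        echo uuid      # mkdumprd would resolve this via findfs
    elif [ -n "$IS_LABEL" ]; then
        echo label     # mkdumprd would resolve this via findfs
    else
        echo device    # used as a plain device name
    fi
}

classify_spec LABEL=root        # prints "label"
classify_spec UUID=0123-abcd    # prints "uuid" (made-up UUID)
classify_spec /dev/sda1         # prints "device"
```

In the unpatched script both variables were set from "grep UUID", so a
spec such as LABEL=root left both of them empty and the findfs branch
was never taken for it.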

> > The following shows a simple ping test usage of the skb_sources
> > tracing feature:
> > 
> > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > 
> > --- 192.168.1.10 ping statistics ---
> > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > 
> > [root@xeontest1 tracing]# cat trace
> > # tracer: skb_sources
> > #
> > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > #        |       |       |       |       |       |       |
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> >         4217    1       1       eth2    0       4       1500
> > 
> > All is as was expected.
> > 
> > But if I try an actual nuttcp performance test (even rate limited
> > to 1 Mbps), I get the following kernel oops:
> > 
> thank you, I think I see the problem, I'll have a patch for you in just a bit

Thanks for the patch.  I'll address the results of using the patch
in a separate e-mail.

						-Bill



> > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > PGD 337d12067 PUD 337d11067 PMD 0
> > Oops: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > CPU 4
> > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > Stack:
> >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > Call Trace:
> >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> >  RSP <ffff8801a5811a88>
> > CR2: 0000000000000038


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-26 18:08                       ` Neil Horman
  2009-08-26 18:15                         ` Ingo Molnar
  2009-08-27 17:32                         ` Bill Fink
@ 2009-08-27 17:44                         ` Bill Fink
  2009-08-27 17:51                           ` Neil Horman
  2 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-08-27 17:44 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Wed, 26 Aug 2009, Neil Horman wrote:

> On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > 
> > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > does have a serial console, and after a fair amount of effort I was
> > > able to get it to work as desired, and was able to finally capture
> > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > kexec/kdump didn't work was probably because it couldn't find
> > > a /proc/vmcore file, although I don't know why that would be,
> > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > up normally if it fails to find the /proc/vmcore file (or it's
> > > zero size).
> > > 
> > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > happy to look into it further.
> > 
> > > The following shows a simple ping test usage of the skb_sources
> > > tracing feature:
> > > 
> > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > 
> > > --- 192.168.1.10 ping statistics ---
> > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > 
> > > [root@xeontest1 tracing]# cat trace
> > > # tracer: skb_sources
> > > #
> > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > #        |       |       |       |       |       |       |
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > >         4217    1       1       eth2    0       4       1500
> > > 
> > > All is as was expected.
> > > 
> > > But if I try an actual nuttcp performance test (even rate limited
> > > to 1 Mbps), I get the following kernel oops:
> > > 
> > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > 
> > Thanks
> > Neil
> > 
> > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > PGD 337d12067 PUD 337d11067 PMD 0
> > > Oops: 0000 [#1] SMP
> > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > CPU 4
> > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > Stack:
> > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > Call Trace:
> > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > >  RSP <ffff8801a5811a88>
> > > CR2: 0000000000000038
> 
> 
> 
> Here  you go, I think this will fix your oops.
> 
> 
>     Fix NULL pointer deref in skb sources ftracer
>     
>     Its possible that skb->sk will be null in this path, so we shouldn't just assume
>     we can pass it to sock_net
>     
>     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
>  trace_skb_sources.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> index 40eb071..8bf518f 100644
> --- a/kernel/trace/trace_skb_sources.c
> +++ b/kernel/trace/trace_skb_sources.c
> @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
>  	struct ring_buffer_event *event;
>  	struct trace_skb_event *entry;
>  	struct trace_array *tr = skb_trace;
> -	struct net_device *dev;
> +	struct net_device *dev = NULL;
>  
>  	if (!trace_skb_source_enabled)
>  		return;
> @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
>  	entry->event_data.rx_queue = skb->queue_mapping;
>  	entry->event_data.ccpu = smp_processor_id();
>  
> -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> +	if (skb->sk)
> +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> +
>  	if (dev) {
>  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
>  		dev_put(dev);



On the positive side, it did fix the oops.  But the results of the
skb_sources tracing were not that useful.

[root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
 5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT

First off, only 10 trace entries were made:

[root@xeontest1 tracing]# wc trace
14 90 334 trace

And here they are:

[root@xeontest1 tracing]# cat trace
# tracer: skb_sources
#
#       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
#        |       |       |       |       |       |       |
        5521    0       0       Unknown 0       3       888
        5521    0       0       Unknown 0       3       896
        5521    0       0       Unknown 0       3       20
        5521    0       0       Unknown 0       3       888
        5521    0       0       Unknown 0       3       896
        5521    0       0       Unknown 0       3       20
        5521    1       1       Unknown 0       4       20
        5521    1       1       Unknown 0       4       11
        5521    1       1       Unknown 0       4       540
        5521    1       1       Unknown 0       4       0

Even for these 10 entries, why is the IFC Unknown?  The LENs
seem to be wrong too.
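
As a cross-check on the entry count: wc reports 14 lines, and the listing
has 4 header lines starting with '#', which leaves 10 entries.  A minimal
sketch of that count in C, assuming the trace text is already in memory
(count_trace_entries() is a hypothetical helper written for illustration,
not part of any existing tool):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count data lines in a trace listing held in memory.
 * Lines that are blank, or whose first non-blank character is '#',
 * are treated as header/comment lines and skipped.
 */
size_t count_trace_entries(const char *text)
{
	size_t entries = 0;
	const char *p = text;

	assert(text != NULL);

	while (*p) {
		const char *q = p;

		/* Skip leading indentation on this line. */
		while (*q == ' ' || *q == '\t')
			q++;

		if (*q != '\0' && *q != '\n' && *q != '#')
			entries++;

		/* Advance to the start of the next line. */
		while (*p && *p != '\n')
			p++;
		if (*p == '\n')
			p++;
	}
	return entries;
}
```

Applied to the listing above, every non-header line counts as one entry,
whatever its LEN value.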

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27 17:44                         ` Bill Fink
@ 2009-08-27 17:51                           ` Neil Horman
  2009-09-02  5:11                             ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-08-27 17:51 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> On Wed, 26 Aug 2009, Neil Horman wrote:
> 
> > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > 
> > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > does have a serial console, and after a fair amount of effort I was
> > > > able to get it to work as desired, and was able to finally capture
> > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > kexec/kdump didn't work was probably because it couldn't find
> > > > a /proc/vmcore file, although I don't know why that would be,
> > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > zero size).
> > > > 
> > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > happy to look into it further.
> > > 
> > > > The following shows a simple ping test usage of the skb_sources
> > > > tracing feature:
> > > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > 
> > > > --- 192.168.1.10 ping statistics ---
> > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > 
> > > > [root@xeontest1 tracing]# cat trace
> > > > # tracer: skb_sources
> > > > #
> > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > #        |       |       |       |       |       |       |
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > >         4217    1       1       eth2    0       4       1500
> > > > 
> > > > All is as was expected.
> > > > 
> > > > But if I try an actual nuttcp performance test (even rate limited
> > > > to 1 Mbps), I get the following kernel oops:
> > > > 
> > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > 
> > > Thanks
> > > Neil
> > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > Oops: 0000 [#1] SMP
> > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > CPU 4
> > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > Stack:
> > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > Call Trace:
> > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > >  RSP <ffff8801a5811a88>
> > > > CR2: 0000000000000038
> > 
> > 
> > 
> > Here  you go, I think this will fix your oops.
> > 
> > 
> >     Fix NULL pointer deref in skb sources ftracer
> >     
> >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> >     we can pass it to sock_net
> >     
> >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> >  trace_skb_sources.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > index 40eb071..8bf518f 100644
> > --- a/kernel/trace/trace_skb_sources.c
> > +++ b/kernel/trace/trace_skb_sources.c
> > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> >  	struct ring_buffer_event *event;
> >  	struct trace_skb_event *entry;
> >  	struct trace_array *tr = skb_trace;
> > -	struct net_device *dev;
> > +	struct net_device *dev = NULL;
> >  
> >  	if (!trace_skb_source_enabled)
> >  		return;
> > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> >  	entry->event_data.rx_queue = skb->queue_mapping;
> >  	entry->event_data.ccpu = smp_processor_id();
> >  
> > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > +	if (skb->sk)
> > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > +
> >  	if (dev) {
> >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> >  		dev_put(dev);
> 
> 
> 
> On the positive side, it did fix the oops.  But the results of the
> skb_sources tracing were not that useful.
> 
> [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
>  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> 
> First off, only 10 trace entries were made:
> 
> [root@xeontest1 tracing]# wc trace
> 14 90 334 trace
> 
> And here they are:
> 
> [root@xeontest1 tracing]# cat trace
> # tracer: skb_sources
> #
> #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> #        |       |       |       |       |       |       |
>         5521    0       0       Unknown 0       3       888
>         5521    0       0       Unknown 0       3       896
>         5521    0       0       Unknown 0       3       20
>         5521    0       0       Unknown 0       3       888
>         5521    0       0       Unknown 0       3       896
>         5521    0       0       Unknown 0       3       20
>         5521    1       1       Unknown 0       4       20
>         5521    1       1       Unknown 0       4       11
>         5521    1       1       Unknown 0       4       540
>         5521    1       1       Unknown 0       4       0
> 
> Even for these 10 entries, why is the IFC Unknown?  The LENs
> seem to be wrong too.
> 
> 						-Bill
> 
I'm not sure why you're getting Unknown interface names.  Nominally that
indicates that the skb->iif value in the skb was incorrect or otherwise not set,
which shouldn't be the case.  As for the lengths, those just seem wrong.  That
length value is taken directly from skb->len, so if it's not right, it seems like
it's not getting set correctly someplace.
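
One possibility that may be worth checking is the guard added by the patch
quoted above.  The sketch below models it in user space.  It is only an
illustration under stated assumptions: the struct layouts,
find_dev_by_index(), record_ifname(), and the "Unknown" fallback string are
simplified stand-ins, not the kernel's definitions.  It shows that with the
skb->sk check in place, an skb with no socket attached never reaches the
device lookup, so the name stays at the fallback even when iif is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct net_device { int ifindex; char name[IFNAMSIZ]; };
struct sock { int unused; };
struct sk_buff { const struct sock *sk; int iif; };

/* Stand-in for dev_get_by_index(): linear search over a small table. */
const struct net_device *find_dev_by_index(const struct net_device *devs,
					   size_t ndevs, int ifindex)
{
	for (size_t i = 0; i < ndevs; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Mirrors the shape of the patched lookup in probe_skb_dequeue(). */
void record_ifname(const struct sk_buff *skb, const struct net_device *devs,
		   size_t ndevs, char ifname[IFNAMSIZ])
{
	const struct net_device *dev = NULL;

	assert(skb != NULL && ifname != NULL);

	/* Only look up the device when a socket is attached. */
	if (skb->sk)
		dev = find_dev_by_index(devs, ndevs, skb->iif);

	if (dev)
		memcpy(ifname, dev->name, IFNAMSIZ);
	else
		strncpy(ifname, "Unknown", IFNAMSIZ); /* assumed fallback */
}
```

If that is what is happening here, the lookup would need a network
namespace source that does not depend on skb->sk.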

As you may have seen, we're removing the ftrace module and replacing it with the
use of raw trace events.  When I have that working, I'll see if I get similar
results.  I never did in my local testing of the ftrace module, but perhaps it's
related to load or something.
Neil



* Re: blktrace ftrace plugin, was Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27  9:12                                           ` Ingo Molnar
  2009-08-27  9:14                                             ` Jens Axboe
@ 2009-08-28  2:03                                             ` Li Zefan
  1 sibling, 0 replies; 95+ messages in thread
From: Li Zefan @ 2009-08-28  2:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Christoph Hellwig, rostedt, fweisbec, acme, linux-kernel

Ingo Molnar wrote:
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
>> On Wed, Aug 26 2009, Christoph Hellwig wrote:
>>> On Wed, Aug 26, 2009 at 10:40:27PM +0200, Ingo Molnar wrote:
>>>> We are also converting non-trivial plugins to generic tracepoints. A 
>>>> recent example are the system call tracepoints, but we also 
>>>> converted blktrace and kmemtrace to generic tracepoints.
>>> On something semi-related: Any reason to keep the blktrace 
>>> ftrace plugin around?  I don't think there's much point in it.  
>>> It only got added in 2.6.29, and all the blktrace tooling just 
>>> uses the legacy ioctls.  All new uses should just use the 
>>> TRACE_EVENT output.
>> Lets kill it.
> 
> Agreed.
> 
> I think we should keep the relayfs and ioctl compatibility bits 
> though: blktrace has a mature user-space environment with many
> years of installed base.
> 
> We could even move those bits back to block/blktrace_compat.c or so 
> (after the ftrace plugin bits are removed), to make sure it's nicely 
> isolated.
> 
> What do you think?
> 

I'm all for removing the ftrace plugin. There are 2 concerns:

- dev_t info can't be recorded in some blk trace events. I think
  this will change in the future when we can map a request_queue to
  a unique device?

- Not all the output of the ftrace plugin comes from tracepoint probes;
  some of it comes via blk_add_trace_msg(), which directly writes a string
  into the ring buffer. I think those calls need to be converted to TRACE_EVENT.



* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27 17:51                           ` Neil Horman
@ 2009-09-02  5:11                             ` Bill Fink
  2009-09-02 10:49                               ` Neil Horman
  0 siblings, 1 reply; 95+ messages in thread
From: Bill Fink @ 2009-09-02  5:11 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Thu, 27 Aug 2009, Neil Horman wrote:

> On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > On Wed, 26 Aug 2009, Neil Horman wrote:
> > 
> > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > 
> > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > > does have a serial console, and after a fair amount of effort I was
> > > > > able to get it to work as desired, and was able to finally capture
> > > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > > kexec/kdump didn't work was probably because it couldn't find
> > > > > a /proc/vmcore file, although I don't know why that would be,
> > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > > zero size).
> > > > > 
> > > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > > happy to look into it further.
> > > > 
> > > > > The following shows a simple ping test usage of the skb_sources
> > > > > tracing feature:
> > > > > 
> > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > > 
> > > > > --- 192.168.1.10 ping statistics ---
> > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > > 
> > > > > [root@xeontest1 tracing]# cat trace
> > > > > # tracer: skb_sources
> > > > > #
> > > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > > #        |       |       |       |       |       |       |
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > >         4217    1       1       eth2    0       4       1500
> > > > > 
> > > > > All is as was expected.
> > > > > 
> > > > > But if I try an actual nuttcp performance test (even rate limited
> > > > > to 1 Mbps), I get the following kernel oops:
> > > > > 
> > > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > > 
> > > > Thanks
> > > > Neil
> > > > 
> > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > > Oops: 0000 [#1] SMP
> > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > > CPU 4
> > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > > Stack:
> > > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > > Call Trace:
> > > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > >  RSP <ffff8801a5811a88>
> > > > > CR2: 0000000000000038
> > > 
> > > 
> > > 
> > > Here  you go, I think this will fix your oops.
> > > 
> > > 
> > >     Fix NULL pointer deref in skb sources ftracer
> > >     
> > >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > >     we can pass it to sock_net
> > >     
> > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > 
> > >  trace_skb_sources.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > index 40eb071..8bf518f 100644
> > > --- a/kernel/trace/trace_skb_sources.c
> > > +++ b/kernel/trace/trace_skb_sources.c
> > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > >  	struct ring_buffer_event *event;
> > >  	struct trace_skb_event *entry;
> > >  	struct trace_array *tr = skb_trace;
> > > -	struct net_device *dev;
> > > +	struct net_device *dev = NULL;
> > >  
> > >  	if (!trace_skb_source_enabled)
> > >  		return;
> > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > >  	entry->event_data.rx_queue = skb->queue_mapping;
> > >  	entry->event_data.ccpu = smp_processor_id();
> > >  
> > > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > +	if (skb->sk)
> > > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > +
> > >  	if (dev) {
> > >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > >  		dev_put(dev);
> > 
> > 
> > 
> > On the positive side, it did fix the oops.  But the results of the
> > skb_sources tracing were not that useful.
> > 
> > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > 
> > First off, only 10 trace entries were made:
> > 
> > [root@xeontest1 tracing]# wc trace
> > 14 90 334 trace
> > 
> > And here they are:
> > 
> > [root@xeontest1 tracing]# cat trace
> > # tracer: skb_sources
> > #
> > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > #        |       |       |       |       |       |       |
> >         5521    0       0       Unknown 0       3       888
> >         5521    0       0       Unknown 0       3       896
> >         5521    0       0       Unknown 0       3       20
> >         5521    0       0       Unknown 0       3       888
> >         5521    0       0       Unknown 0       3       896
> >         5521    0       0       Unknown 0       3       20
> >         5521    1       1       Unknown 0       4       20
> >         5521    1       1       Unknown 0       4       11
> >         5521    1       1       Unknown 0       4       540
> >         5521    1       1       Unknown 0       4       0
> > 
> > Even for these 10 entries, why is the IFC Unknown?  The LENs
> > seem to be wrong too.
> > 
> > 						-Bill
> > 
> I'm not sure why you're getting Unknown interface names.  Nominally that
> indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> which shouldn't be the case.  As for the lengths, those just seem wrong.  That
> length value is taken directly from skb->len, so if it's not right, it seems like
> it's not getting set correctly someplace.
> 
> As you may have seen, we're removing the ftrace module and replacing it with the
> use of raw trace events.  When I have that working, I'll see if I get similar
> results.  I never did in my local testing of the ftrace module, but perhaps it's
> related to load or something.

IIUC I should keep the first of your original three ftrace patches,
revert all the rest, and then apply your very latest patch that
augments the skb_copy_datagram_iovec TRACE_EVENT.  Do I have that
basically correct?

Then I just need to ask: how do I use this new method?

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-08-27 17:32                         ` Bill Fink
@ 2009-09-02  5:28                           ` Bill Fink
  0 siblings, 0 replies; 95+ messages in thread
From: Bill Fink @ 2009-09-02  5:28 UTC (permalink / raw)
  To: Bill Fink; +Cc: Neil Horman, Linux Network Developers, brice, gallatin

On Thu, 27 Aug 2009, Bill Fink wrote:

> On Wed, 26 Aug 2009, Neil Horman wrote:
> 
> > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > 
> > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > does have a serial console, and after a fair amount of effort I was
> > > able to get it to work as desired, and was able to finally capture
> > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > kexec/kdump didn't work was probably because it couldn't find
> > > a /proc/vmcore file, although I don't know why that would be,
> > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > up normally if it fails to find the /proc/vmcore file (or it's
> > > zero size).
> > > 
> > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > happy to look into it further.
> 
> It's odd.  kexec/kdump works fine with the 2.6.29.6-217.2.3.fc11.x86_64
> kernel from Fedora 11 (running on the Fedora 10 system).  I will try
> again with the kernel-2.6.31-0.174.rc7.git2.fc12.src.rpm from Fedora 12,
> in case it has some secret sauce in one of the Fedora patches to make
> the Fedora /etc/init.d/kdump script happy.  kexec/kdump is my preferred
> method of dealing with kernel oopses if I can get it to work.

The Fedora 12 kernel-2.6.31-0.174.rc7.git2 kernel didn't help with
the kexec/kdump issue, so I may file a bug if I can't figure anything
out.

Also that kernel had a huge performance hit on my tests.  Where I
usually get ~100 Gbps of aggregate transmit performance, I was instead
getting a mere 3 Gbps, with individual streams only getting about
200 to 400 Mbps.  If I get a chance, I'll have to try the vanilla
version to see if it has the same issue (a vanilla 2.6.31-rc6 is
fine).

						-Thanks

						-Bill


* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-09-02  5:11                             ` Bill Fink
@ 2009-09-02 10:49                               ` Neil Horman
  2009-09-02 15:38                                 ` Bill Fink
  0 siblings, 1 reply; 95+ messages in thread
From: Neil Horman @ 2009-09-02 10:49 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin

On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote:
> On Thu, 27 Aug 2009, Neil Horman wrote:
> 
> > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > > On Wed, 26 Aug 2009, Neil Horman wrote:
> > > 
> > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > > 
> > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > > > does have a serial console, and after a fair amount of effort I was
> > > > > > able to get it to work as desired, and was able to finally capture
> > > > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > > > kexec/kdump didn't work was probably because it couldn't find
> > > > > > a /proc/vmcore file, although I don't know why that would be,
> > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > > > zero size).
> > > > > > 
> > > > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > > > happy to look into it further.
> > > > > 
> > > > > > The following shows a simple ping test usage of the skb_sources
> > > > > > tracing feature:
> > > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > > > 
> > > > > > --- 192.168.1.10 ping statistics ---
> > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > > > 
> > > > > > [root@xeontest1 tracing]# cat trace
> > > > > > # tracer: skb_sources
> > > > > > #
> > > > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > > > #        |       |       |       |       |       |       |
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > > 
> > > > > > All is as was expected.
> > > > > > 
> > > > > > But if I try an actual nuttcp performance test (even rate limited
> > > > > > to 1 Mbps), I get the following kernel oops:
> > > > > > 
> > > > > > Thank you, I think I see the problem.  I'll have a patch for you in just a bit.
> > > > > 
> > > > > Thanks
> > > > > Neil
> > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > > > Oops: 0000 [#1] SMP
> > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > > > CPU 4
> > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > > > Stack:
> > > > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > > > Call Trace:
> > > > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > >  RSP <ffff8801a5811a88>
> > > > > > CR2: 0000000000000038
> > > > 
> > > > 
> > > > 
> > > > Here  you go, I think this will fix your oops.
> > > > 
> > > > 
> > > >     Fix NULL pointer deref in skb sources ftracer
> > > >     
> > > >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > >     we can pass it to sock_net
> > > >     
> > > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > 
> > > >  trace_skb_sources.c |    6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > > index 40eb071..8bf518f 100644
> > > > --- a/kernel/trace/trace_skb_sources.c
> > > > +++ b/kernel/trace/trace_skb_sources.c
> > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > >  	struct ring_buffer_event *event;
> > > >  	struct trace_skb_event *entry;
> > > >  	struct trace_array *tr = skb_trace;
> > > > -	struct net_device *dev;
> > > > +	struct net_device *dev = NULL;
> > > >  
> > > >  	if (!trace_skb_source_enabled)
> > > >  		return;
> > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > >  	entry->event_data.rx_queue = skb->queue_mapping;
> > > >  	entry->event_data.ccpu = smp_processor_id();
> > > >  
> > > > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > +	if (skb->sk)
> > > > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > +
> > > >  	if (dev) {
> > > >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > > >  		dev_put(dev);
> > > 
> > > 
> > > 
> > > On the positive side, it did fix the oops.  But the results of the
> > > skb_sources tracing were not that useful.
> > > 
> > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> > >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > > n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > > 
> > > First off, only 10 trace entries were made:
> > > 
> > > [root@xeontest1 tracing]# wc trace
> > > 14 90 334 trace
> > > 
> > > And here they are:
> > > 
> > > [root@xeontest1 tracing]# cat trace
> > > # tracer: skb_sources
> > > #
> > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > #        |       |       |       |       |       |       |
> > >         5521    0       0       Unknown 0       3       888
> > >         5521    0       0       Unknown 0       3       896
> > >         5521    0       0       Unknown 0       3       20
> > >         5521    0       0       Unknown 0       3       888
> > >         5521    0       0       Unknown 0       3       896
> > >         5521    0       0       Unknown 0       3       20
> > >         5521    1       1       Unknown 0       4       20
> > >         5521    1       1       Unknown 0       4       11
> > >         5521    1       1       Unknown 0       4       540
> > >         5521    1       1       Unknown 0       4       0
> > > 
> > > Even for these 10 entries, why is the IFC Unknown?  The LENs
> > > seem to be wrong too.
> > > 
> > > 						-Bill
> > > 
> > I'm not sure why you're getting Unknown Interface names.  Nominally that
> > indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> > which shouldn't be the case.  As for the lengths, that just seems wrong.  That
> > length value is taken directly from skb->len, so if it's not right, it seems like
> > it's not getting set correctly someplace.
> > 
> > As you may have seen, we're removing the ftrace module and replacing it with the
> > use of raw trace events.  When I have that working, I'll see if I get similar
> > results.  I never did in my local testing of the ftrace module, but perhaps it's
> > related to load or something.
> 
> IIUC I should keep the first of your original three ftrace patches,
> revert all the rest, and then apply your very latest patch that
> augments the skb_copy_datagram_iovec TRACE_EVENT.  Do I have that
> basically correct?
> 
That's exactly correct, yes.

> Then I just need to ask how do I use this new method?
> 
It works in basically the same way.  Except instead of doing this:
echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer
you do this:
echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable
Then the events should show up in /sys/kernel/debug/tracing/trace[_pipe]
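
As a rough sketch of how the steps above might be wrapped for repeated test
runs, the helper functions below toggle the event and print what was
recorded.  The function names and the TRACING_DIR variable are made up for
illustration; only the event path is taken from the commands above, and it
assumes debugfs is mounted in the usual place and the TRACE_EVENT patch is
applied.

```shell
#!/bin/sh
# Sketch only: TRACING_DIR, set_skb_event and show_skb_events are
# illustrative names, not part of any kernel interface.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"
SKB_EVENT="events/skb/skb_copy_datagram_iovec"

# Turn the trace event on (1) or off (0).  Fails with a message if the
# enable file is missing or not writable, e.g. debugfs is not mounted,
# the patch is not applied, or the caller is not root.
set_skb_event() {
    enable_file="$TRACING_DIR/$SKB_EVENT/enable"
    if [ ! -w "$enable_file" ]; then
        echo "cannot write $enable_file" >&2
        return 1
    fi
    echo "$1" > "$enable_file"
}

# Print the recorded events, skipping the '#' header lines.
show_skb_events() {
    grep -v '^#' "$TRACING_DIR/trace"
}

# Typical use (as root, around a test run):
#   set_skb_event 1
#   ... run the receive test ...
#   show_skb_events
#   set_skb_event 0
```

Reading "trace" leaves the buffer contents in place, while "trace_pipe" is a
consuming read that blocks waiting for new events, so the latter is the one to
use for watching a run live.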

Best
Neil

> 						-Thanks
> 
> 						-Bill
> 

* Re: Receive side performance issue with multi-10-GigE and NUMA
  2009-09-02 10:49                               ` Neil Horman
@ 2009-09-02 15:38                                 ` Bill Fink
  0 siblings, 0 replies; 95+ messages in thread
From: Bill Fink @ 2009-09-02 15:38 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Developers, brice, gallatin

On Wed, 2 Sep 2009, Neil Horman wrote:

> On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote:
> > On Thu, 27 Aug 2009, Neil Horman wrote:
> > 
> > > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > > > On Wed, 26 Aug 2009, Neil Horman wrote:
> > > > 
> > > > > Here  you go, I think this will fix your oops.
> > > > > 
> > > > > 
> > > > >     Fix NULL pointer deref in skb sources ftracer
> > > > >     
> > > > >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > > >     we can pass it to sock_net
> > > > >     
> > > > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > 
> > > > >  trace_skb_sources.c |    6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > > > index 40eb071..8bf518f 100644
> > > > > --- a/kernel/trace/trace_skb_sources.c
> > > > > +++ b/kernel/trace/trace_skb_sources.c
> > > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >  	struct ring_buffer_event *event;
> > > > >  	struct trace_skb_event *entry;
> > > > >  	struct trace_array *tr = skb_trace;
> > > > > -	struct net_device *dev;
> > > > > +	struct net_device *dev = NULL;
> > > > >  
> > > > >  	if (!trace_skb_source_enabled)
> > > > >  		return;
> > > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > > >  	entry->event_data.rx_queue = skb->queue_mapping;
> > > > >  	entry->event_data.ccpu = smp_processor_id();
> > > > >  
> > > > > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +	if (skb->sk)
> > > > > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > > +
> > > > >  	if (dev) {
> > > > >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > > > >  		dev_put(dev);
> > > > 
> > > > 
> > > > 
> > > > On the positive side, it did fix the oops.  But the results of the
> > > > skb_sources tracing were not that useful.
> > > > 
> > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> > > >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > > > n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > > > 
> > > > First off, only 10 trace entries were made:
> > > > 
> > > > [root@xeontest1 tracing]# wc trace
> > > > 14 90 334 trace
> > > > 
> > > > And here they are:
> > > > 
> > > > [root@xeontest1 tracing]# cat trace
> > > > # tracer: skb_sources
> > > > #
> > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > #        |       |       |       |       |       |       |
> > > >         5521    0       0       Unknown 0       3       888
> > > >         5521    0       0       Unknown 0       3       896
> > > >         5521    0       0       Unknown 0       3       20
> > > >         5521    0       0       Unknown 0       3       888
> > > >         5521    0       0       Unknown 0       3       896
> > > >         5521    0       0       Unknown 0       3       20
> > > >         5521    1       1       Unknown 0       4       20
> > > >         5521    1       1       Unknown 0       4       11
> > > >         5521    1       1       Unknown 0       4       540
> > > >         5521    1       1       Unknown 0       4       0
> > > > 
> > > > Even for these 10 entries, why is the IFC Unknown?  The LENs
> > > > seem to be wrong too.
> > > > 
> > > > 						-Bill
> > > > 
> > > I'm not sure why you're getting Unknown Interface names.  Nominally that
> > > indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> > > which shouldn't be the case.  As for the lengths, that just seems wrong.  That
> > > length value is taken directly from skb->len, so if it's not right, it seems like
> > > it's not getting set correctly someplace.
> > > 
> > > As you may have seen, we're removing the ftrace module and replacing it with the
> > > use of raw trace events.  When I have that working, I'll see if I get similar
> > > results.  I never did in my local testing of the ftrace module, but perhaps it's
> > > related to load or something.
> > 
> > IIUC I should keep the first of your original three ftrace patches,
> > revert all the rest, and then apply your very latest patch that
> > augments the skb_copy_datagram_iovec TRACE_EVENT.  Do I have that
> > basically correct?
> > 
> That's exactly correct, yes.
> 
> > Then I just need to ask how do I use this new method?
> > 
> It works in basically the same way.  Except instead of doing this:
> echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer
> you do this:
> echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable
> Then the events should show up in /sys/kernel/debug/tracing/trace[_pipe]

Thanks!  I'll probably give this a try later today and report back.

						-Bill

Thread overview: 95+ messages
2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
2009-08-07 21:18 ` Brice Goglin
2009-08-07 21:51   ` Bill Fink
2009-08-07 21:53     ` Brice Goglin
2009-08-07 22:08       ` Bill Fink
2009-08-07 22:17         ` Brice Goglin
2009-08-07 22:55           ` Bill Fink
2009-08-08  1:03     ` Andrew Gallatin
2009-08-08  1:35       ` Bill Fink
2009-08-08 11:08         ` Andrew Gallatin
2009-08-08 11:26           ` Neil Horman
2009-08-08 18:21             ` Andrew Gallatin
2009-08-08 18:32               ` Neil Horman
2009-08-11  7:32                 ` Bill Fink
2009-08-11 11:02                   ` Neil Horman
2009-08-11 19:15                     ` Christoph Lameter
2009-08-11 22:27                   ` Andi Kleen
2009-08-12  4:30                     ` Bill Fink
2009-08-12  7:21                       ` Andi Kleen
     [not found]                       ` <4A856781.2080301@myri.com>
2009-08-14 16:38                         ` Bill Fink
2009-08-14 16:55                           ` Andrew Gallatin
2009-08-14 21:13                             ` Aviv Greenberg
2009-08-20  7:26                               ` Bill Fink
2009-08-20 13:14                                 ` Ben Hutchings
2009-08-21  4:00                                   ` Bill Fink
2009-08-20 13:17                                 ` Aviv Greenberg
2009-08-12  0:02                   ` Brandeburg, Jesse
2009-08-12  4:38                     ` Bill Fink
2009-08-12 16:00                       ` Jesse Barnes
2009-08-14 20:31                       ` Bill Fink
2009-08-17 16:53                         ` Jesse Barnes
2009-08-18  7:07                           ` Bill Fink
2009-08-18 11:54                             ` Andrew Gallatin
2009-08-19 17:59                               ` Bill Fink
2009-08-07 22:12 ` Neil Horman
2009-08-08  0:54   ` Bill Fink
2009-08-08  1:56     ` Neil Horman
2009-08-14 20:44       ` Bill Fink
2009-08-14 23:25         ` Neil Horman
2009-08-20  7:50           ` Bill Fink
2009-08-20 20:19             ` Neil Horman
2009-08-21  4:14               ` Bill Fink
2009-08-21 15:23                 ` Neil Horman
2009-08-21 15:36                   ` Andrew Gallatin
2009-08-26  7:10                   ` Bill Fink
2009-08-26 11:00                     ` Neil Horman
2009-08-26 18:08                       ` Neil Horman
2009-08-26 18:15                         ` Ingo Molnar
2009-08-26 19:04                           ` Neil Horman
2009-08-26 19:08                             ` Ingo Molnar
2009-08-26 19:36                               ` David Miller
2009-08-26 19:48                                 ` Ingo Molnar
2009-08-26 20:23                                   ` Neil Horman
2009-08-26 20:40                                     ` Ingo Molnar
2009-08-26 22:39                                       ` Neil Horman
2009-08-26 22:44                                         ` David Miller
2009-08-26 23:05                                           ` Ingo Molnar
2009-08-26 23:08                                             ` David Miller
2009-08-26 23:58                                               ` Ingo Molnar
2009-08-27  0:05                                                 ` Steven Rostedt
2009-08-27  0:35                                                 ` Christoph Hellwig
2009-08-27  9:28                                                   ` Ingo Molnar
2009-08-26 23:05                                           ` Steven Rostedt
2009-08-26 23:09                                             ` David Miller
2009-08-26 23:30                                               ` Ingo Molnar
2009-08-26 23:23                                             ` Neil Horman
2009-08-26 23:29                                               ` David Miller
2009-08-26 23:19                                           ` Neil Horman
2009-08-26 23:14                                         ` Ingo Molnar
2009-08-26 23:33                                         ` Steven Rostedt
2009-08-27  0:14                                           ` Neil Horman
2009-08-27  0:29                                             ` Steven Rostedt
2009-08-27  1:17                                               ` Neil Horman
2009-08-27  9:06                                                 ` Ingo Molnar
2009-08-27  9:34                                               ` Ingo Molnar
2009-08-27  0:34                                         ` Christoph Hellwig
2009-08-27  0:30                                       ` blktrace ftrace plugin, was " Christoph Hellwig
2009-08-27  5:26                                         ` Jens Axboe
2009-08-27  9:12                                           ` Ingo Molnar
2009-08-27  9:14                                             ` Jens Axboe
2009-08-27 13:55                                               ` Arnaldo Carvalho de Melo
2009-08-28  2:03                                             ` Li Zefan
2009-08-26 23:46                                     ` Frederic Weisbecker
2009-08-26 20:28                                   ` Ingo Molnar
2009-08-26 20:01                               ` Neil Horman
2009-08-26 22:57                                 ` Ingo Molnar
2009-08-27 17:32                         ` Bill Fink
2009-09-02  5:28                           ` Bill Fink
2009-08-27 17:44                         ` Bill Fink
2009-08-27 17:51                           ` Neil Horman
2009-09-02  5:11                             ` Bill Fink
2009-09-02 10:49                               ` Neil Horman
2009-09-02 15:38                                 ` Bill Fink
2009-08-12 23:29 ` David Miller
2009-08-13  2:35   ` Bill Fink