* Performance hit - NICs on different CPU sockets
@ 2016-06-13 14:07 Take Ceara
  2016-06-13 14:28 ` Bruce Richardson
  2016-06-13 19:35 ` Wiles, Keith
  0 siblings, 2 replies; 19+ messages in thread
From: Take Ceara @ 2016-06-13 14:07 UTC (permalink / raw)
  To: dev

Hi,

I'm reposting here as I didn't get any answers on the dpdk-users mailing list.

We're working on a stateful traffic generator (www.warp17.net) using
DPDK and we would like to control two XL710 NICs (one on each socket)
to maximize CPU usage. It looks like we run into the following
limitation:

http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
section 7.2, point 3

We completely split memory/cpu/NICs across the two sockets. However,
the performance with a single CPU and both NICs on the same socket is
better.
Why do all the NICs have to be on the same socket? Is there a
driver/HW limitation?

Thanks,
Dumitru Ceara


* Re: Performance hit - NICs on different CPU sockets
  2016-06-13 14:07 Performance hit - NICs on different CPU sockets Take Ceara
@ 2016-06-13 14:28 ` Bruce Richardson
  2016-06-14  7:47   ` Take Ceara
  2016-06-13 19:35 ` Wiles, Keith
  1 sibling, 1 reply; 19+ messages in thread
From: Bruce Richardson @ 2016-06-13 14:28 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev

On Mon, Jun 13, 2016 at 04:07:37PM +0200, Take Ceara wrote:
> Hi,
> 
> I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
> 
> We're working on a stateful traffic generator (www.warp17.net) using
> DPDK and we would like to control two XL710 NICs (one on each socket)
> to maximize CPU usage. It looks that we run into the following
> limitation:
> 
> http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
> section 7.2, point 3
> 
> We completely split memory/cpu/NICs across the two sockets. However,
> the performance with a single CPU and both NICs on the same socket is
> better.
> Why do all the NICs have to be on the same socket, is there a
> driver/hw limitation?
> 
Hi,

So long as each thread only ever accesses the NIC on its own local socket,
there is no performance penalty. It's only when a thread on one socket works
using a NIC on a remote socket that you start seeing a penalty, with all
NIC-core communication having to go across QPI.
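
As a rough illustration, something along these lines (a hypothetical, untested
sketch using the standard rte_eth_dev_socket_id()/rte_lcore_to_socket_id()
helpers, not code from any particular app) can be used at startup to flag any
lcore that would poll a remote-socket port:

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Hypothetical helper: warn if an lcore polls a port on a remote socket. */
static void check_port_affinity(uint8_t port_id, unsigned lcore_id)
{
        int port_socket = rte_eth_dev_socket_id(port_id);
        int core_socket = (int)rte_lcore_to_socket_id(lcore_id);

        /* rte_eth_dev_socket_id() returns -1 when the socket is unknown. */
        if (port_socket >= 0 && port_socket != core_socket)
                printf("lcore %u (socket %d) polls port %u (socket %d): "
                       "traffic will cross QPI\n",
                       lcore_id, core_socket, (unsigned)port_id, port_socket);
}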

/Bruce


* Re: Performance hit - NICs on different CPU sockets
  2016-06-13 14:07 Performance hit - NICs on different CPU sockets Take Ceara
  2016-06-13 14:28 ` Bruce Richardson
@ 2016-06-13 19:35 ` Wiles, Keith
  2016-06-14  7:46   ` Take Ceara
  1 sibling, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-13 19:35 UTC (permalink / raw)
  To: Take Ceara, dev


On 6/13/16, 9:07 AM, "dev on behalf of Take Ceara" <dev-bounces@dpdk.org on behalf of dumitru.ceara@gmail.com> wrote:

>Hi,
>
>I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
>
>We're working on a stateful traffic generator (www.warp17.net) using
>DPDK and we would like to control two XL710 NICs (one on each socket)
>to maximize CPU usage. It looks that we run into the following
>limitation:
>
>http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>section 7.2, point 3
>
>We completely split memory/cpu/NICs across the two sockets. However,
>the performance with a single CPU and both NICs on the same socket is
>better.
>Why do all the NICs have to be on the same socket, is there a
>driver/hw limitation?

Normally the limitation is in the hardware, basically how the PCI bus is connected to the CPUs (or sockets). How the PCI buses are connected to the system depends on the motherboard design. I normally see the buses attached to socket 0, but you could have some of the buses attached to the other sockets, or all on one socket via a PCI bridge device.

There is no easy way around the problem if some of your PCI buses are split across sockets or are all on a single socket. You need to look at your system docs, or at lspci; it has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.
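
For example (assuming sysfs on a reasonably recent Linux kernel), you can also
ask the kernel directly which NUMA node a given device sits on:

lspci -tv                                             # PCI bus as an ASCII tree
cat /sys/bus/pci/devices/<pci-address>/numa_node      # prints the NUMA node, or -1 if unknown
cat /sys/bus/pci/devices/<pci-address>/local_cpulist  # CPUs local to that device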
>
>Thanks,
>Dumitru Ceara
>





* Re: Performance hit - NICs on different CPU sockets
  2016-06-13 19:35 ` Wiles, Keith
@ 2016-06-14  7:46   ` Take Ceara
  2016-06-14 13:47     ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-14  7:46 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

Hi Keith,

On Mon, Jun 13, 2016 at 9:35 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
> On 6/13/16, 9:07 AM, "dev on behalf of Take Ceara" <dev-bounces@dpdk.org on behalf of dumitru.ceara@gmail.com> wrote:
>
>>Hi,
>>
>>I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
>>
>>We're working on a stateful traffic generator (www.warp17.net) using
>>DPDK and we would like to control two XL710 NICs (one on each socket)
>>to maximize CPU usage. It looks that we run into the following
>>limitation:
>>
>>http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>>section 7.2, point 3
>>
>>We completely split memory/cpu/NICs across the two sockets. However,
>>the performance with a single CPU and both NICs on the same socket is
>>better.
>>Why do all the NICs have to be on the same socket, is there a
>>driver/hw limitation?
>
> Normally the limitation is in the hardware, basically how the PCI bus is connected to the CPUs (or sockets). How the PCI buses are connected to the system depends on the Mother board design. I normally see the buses attached to socket 0, but you could have some of the buses attached to the other sockets or all on one socket via a PCI bridge device.
>
> No easy way around the problem if some of your PCI buses are split or all on a single socket. Need to look at your system docs or look at lspci it has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.

This is the motherboard we use on our system:

http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm

I need to swap some NICs around (we have now moved everything to socket
1) before I can share the lspci output.

Thanks,
Dumitru


* Re: Performance hit - NICs on different CPU sockets
  2016-06-13 14:28 ` Bruce Richardson
@ 2016-06-14  7:47   ` Take Ceara
  0 siblings, 0 replies; 19+ messages in thread
From: Take Ceara @ 2016-06-14  7:47 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce,

On Mon, Jun 13, 2016 at 4:28 PM, Bruce Richardson
<bruce.richardson@intel.com> wrote:
> On Mon, Jun 13, 2016 at 04:07:37PM +0200, Take Ceara wrote:
>> Hi,
>>
>> I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
>>
>> We're working on a stateful traffic generator (www.warp17.net) using
>> DPDK and we would like to control two XL710 NICs (one on each socket)
>> to maximize CPU usage. It looks that we run into the following
>> limitation:
>>
>> http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>> section 7.2, point 3
>>
>> We completely split memory/cpu/NICs across the two sockets. However,
>> the performance with a single CPU and both NICs on the same socket is
>> better.
>> Why do all the NICs have to be on the same socket, is there a
>> driver/hw limitation?
>>
> Hi,
>
> so long as each thread only ever accesses the NIC on it's own local socket, then
> there is no performance penalty. It's only when a thread on one socket works
> using a NIC on a remote socket that you start seeing a penalty, with all
> NIC-core communication having to go across QPI.
>
> /Bruce

Thanks for the confirmation. We'll go through our code again to double
check that no thread accesses the NIC or memory on a remote socket.

Regards,
Dumitru


* Re: Performance hit - NICs on different CPU sockets
  2016-06-14  7:46   ` Take Ceara
@ 2016-06-14 13:47     ` Wiles, Keith
  2016-06-16 14:36       ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-14 13:47 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev


On 6/14/16, 2:46 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>Hi Keith,
>
>On Mon, Jun 13, 2016 at 9:35 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>
>> On 6/13/16, 9:07 AM, "dev on behalf of Take Ceara" <dev-bounces@dpdk.org on behalf of dumitru.ceara@gmail.com> wrote:
>>
>>>Hi,
>>>
>>>I'm reposting here as I didn't get any answers on the dpdk-users mailing list.
>>>
>>>We're working on a stateful traffic generator (www.warp17.net) using
>>>DPDK and we would like to control two XL710 NICs (one on each socket)
>>>to maximize CPU usage. It looks that we run into the following
>>>limitation:
>>>
>>>http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
>>>section 7.2, point 3
>>>
>>>We completely split memory/cpu/NICs across the two sockets. However,
>>>the performance with a single CPU and both NICs on the same socket is
>>>better.
>>>Why do all the NICs have to be on the same socket, is there a
>>>driver/hw limitation?
>>
>> Normally the limitation is in the hardware, basically how the PCI bus is connected to the CPUs (or sockets). How the PCI buses are connected to the system depends on the Mother board design. I normally see the buses attached to socket 0, but you could have some of the buses attached to the other sockets or all on one socket via a PCI bridge device.
>>
>> No easy way around the problem if some of your PCI buses are split or all on a single socket. Need to look at your system docs or look at lspci it has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.
>
>This is the motherboard we use on our system:
>
>http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm
>
>I need to swap some NICs around (as now we moved everything on socket
>1) before I can share the lspci output.

FYI: the option for lspci is ‘lspci -tv’, but there may be more options too.

>
>Thanks,
>Dumitru
>





* Re: Performance hit - NICs on different CPU sockets
  2016-06-14 13:47     ` Wiles, Keith
@ 2016-06-16 14:36       ` Take Ceara
  2016-06-16 14:58         ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-16 14:36 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

Hi Keith,

On Tue, Jun 14, 2016 at 3:47 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>> Normally the limitation is in the hardware, basically how the PCI bus is connected to the CPUs (or sockets). How the PCI buses are connected to the system depends on the Mother board design. I normally see the buses attached to socket 0, but you could have some of the buses attached to the other sockets or all on one socket via a PCI bridge device.
>>>
>>> No easy way around the problem if some of your PCI buses are split or all on a single socket. Need to look at your system docs or look at lspci it has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.
>>
>>This is the motherboard we use on our system:
>>
>>http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm
>>
>>I need to swap some NICs around (as now we moved everything on socket
>>1) before I can share the lspci output.
>
> FYI: the option for lspci is ‘lspci -tv’, but maybe more options too.
>

I retested with two 10G X710 ports connected back to back:
port 0: 0000:01:00.3 - socket 0
port 1: 0000:81:00.3 - socket 1

I ran the following scenarios:
- assign 16 threads from CPU 0 on socket 0 to port 0 and 16 threads
from CPU 1 to port 1 => setup rate of 1.6M sess/s
- assign only the 16 threads from CPU 0 for both ports (so 8 threads on
socket 0 for port 0 and 8 threads on socket 0 for port 1) => setup
rate of 3M sess/s
- assign only the 16 threads from CPU 1 for both ports (so 8 threads on
socket 1 for port 0 and 8 threads on socket 1 for port 1) => setup
rate of 3M sess/s

I also tried a scenario with two machines connected back to back, each
of which had a NIC on socket 1. I assigned 16 threads from socket 1 on
each machine to the port and performance scaled to 6M sess/s as
expected.

I double checked all our memory allocations and, at least in the
tested scenario, we never use memory that's not on the same socket as
the core.
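
(For context, the allocation pattern is roughly the following; this is an
illustrative sketch only, not our exact code, with made-up pool names and
sizes, and it assumes a two-socket box:)

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* One mbuf pool per socket; every port uses the pool of its own socket. */
static struct rte_mempool *mbuf_pool[2];

static struct rte_mempool *pool_for_port(uint8_t port_id)
{
        int s = rte_eth_dev_socket_id(port_id);

        if (s < 0)
                s = 0;
        if (mbuf_pool[s] == NULL)
                mbuf_pool[s] = rte_pktmbuf_pool_create(s ? "mbuf-s1" : "mbuf-s0",
                                                       262143, 256, 0,
                                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                                       s);
        return mbuf_pool[s];
}

The Rx queues are then set up with rte_eth_rx_queue_setup() using the port's
socket id and pool_for_port(), so descriptor rings and mbufs stay local to
the port.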

I pasted below the output of lspci -tv. I see that 0000:01:00.3 and
0000:81:00.3 are connected to different PCI bridges but on each of
those bridges there are also "Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 DMA Channel <X>" devices.

It would be great if you could also take a look in case I
missed/misunderstood something.

Thanks,
Dumitru

# lspci -tv
-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 0 ERROR Registers
 |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 1 ERROR Registers
 |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 0 ERROR Registers
 |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 1 ERROR Registers
 |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 +-[0000:80]-+-02.0-[81]--+-00.0  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
 |           |            +-00.1  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
 |           |            +-00.2  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
 |           |            \-00.3  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
 |           +-03.0-[82]----00.0  Intel Corporation Ethernet
Controller XL710 for 40GbE QSFP+
 |           +-03.2-[83]----00.0  Intel Corporation Ethernet
Controller XL710 for 40GbE QSFP+
 |           +-04.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 0
 |           +-04.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 1
 |           +-04.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 2
 |           +-04.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 3
 |           +-04.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 4
 |           +-04.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 5
 |           +-04.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 6
 |           +-04.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 7
 |           +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Address Map, VTd_Misc, System Management
 |           +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
 |           +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
RAS, Control Status and Global Errors
 |           \-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
 +-[0000:7f]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
R3 QPI Link 0 & 1 Monitoring
 |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Unicast Registers
 |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Buffered Ring Agent
 |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
System Address Decoder & Broadcast Registers
 |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
PCIe Ring Interface
 |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Scratchpad & Semaphore Registers
 |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
 |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 0 ERROR Registers
 |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 0 Channel 1 ERROR Registers
 |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 0 & 1
 |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
 |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 0 ERROR Registers
 |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Integrated Memory Controller 1 Channel 1 ERROR Registers
 |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DDRIO (VMSE) 2 & 3
 |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
             +-01.0-[01]--+-00.0  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
             |            +-00.1  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
             |            +-00.2  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
             |            \-00.3  Intel Corporation Ethernet
Controller X710 for 10GbE SFP+
             +-02.0-[02]--+-00.0  Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection
             |            \-00.1  Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection
             +-02.2-[03]--+-00.0  Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection
             |            \-00.1  Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection
             +-03.0-[04]--
             +-03.2-[05]--
             +-04.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 0
             +-04.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 1
             +-04.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 2
             +-04.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 3
             +-04.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 4
             +-04.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 5
             +-04.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 6
             +-04.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMA Channel 7
             +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
Address Map, VTd_Misc, System Management
             +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
             +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
RAS, Control Status and Global Errors
             +-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
             +-11.0  Intel Corporation C610/X99 series chipset SPSR
             +-11.4  Intel Corporation C610/X99 series chipset sSATA
Controller [AHCI mode]
             +-14.0  Intel Corporation C610/X99 series chipset USB
xHCI Host Controller
             +-16.0  Intel Corporation C610/X99 series chipset MEI Controller #1
             +-16.1  Intel Corporation C610/X99 series chipset MEI Controller #2
             +-1a.0  Intel Corporation C610/X99 series chipset USB
Enhanced Host Controller #2
             +-1c.0-[06]--
             +-1c.3-[07-08]----00.0-[08]----00.0  ASPEED Technology,
Inc. ASPEED Graphics Family
             +-1c.4-[09]--+-00.0  Intel Corporation I350 Gigabit
Network Connection
             |            \-00.1  Intel Corporation I350 Gigabit
Network Connection
             +-1d.0  Intel Corporation C610/X99 series chipset USB
Enhanced Host Controller #1
             +-1f.0  Intel Corporation C610/X99 series chipset LPC Controller
             +-1f.2  Intel Corporation C610/X99 series chipset 6-Port
SATA Controller [AHCI mode]
             +-1f.3  Intel Corporation C610/X99 series chipset SMBus Controller
             \-1f.6  Intel Corporation C610/X99 series chipset Thermal Subsystem


* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 14:36       ` Take Ceara
@ 2016-06-16 14:58         ` Wiles, Keith
  2016-06-16 15:16           ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 14:58 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev

On 6/16/16, 9:36 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>Hi Keith,
>
>On Tue, Jun 14, 2016 at 3:47 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>> Normally the limitation is in the hardware, basically how the PCI bus is connected to the CPUs (or sockets). How the PCI buses are connected to the system depends on the Mother board design. I normally see the buses attached to socket 0, but you could have some of the buses attached to the other sockets or all on one socket via a PCI bridge device.
>>>>
>>>> No easy way around the problem if some of your PCI buses are split or all on a single socket. Need to look at your system docs or look at lspci it has an option to dump the PCI bus as an ASCII tree, at least on Ubuntu.
>>>
>>>This is the motherboard we use on our system:
>>>
>>>http://www.supermicro.com/products/motherboard/Xeon/C600/X10DRX.cfm
>>>
>>>I need to swap some NICs around (as now we moved everything on socket
>>>1) before I can share the lspci output.
>>
>> FYI: the option for lspci is ‘lspci -tv’, but maybe more options too.
>>
>
>I retested with two 10G X710 ports connected back to back:
>port 0: 0000:01:00.3 - socket 0
>port 1: 0000:81:00.3 - socket 1

Please provide the output from tools/cpu_layout.py.
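(From a DPDK source tree that would be something like "$ python tools/cpu_layout.py";
it prints which logical cores sit on which socket and physical core.)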

>
>I ran the following scenarios:
>- assign 16 threads from CPU 0 on socket 0 to port 0 and 16 threads
>from CPU 1 to port 1 => setup rate of 1.6M sess/s
>- assign only the 16 threads from CPU0 for both ports (so 8 threads on
>socket 0 for port 0 and 8 threads on socket 0 for port 1) => setup
>rate of 3M sess/s
>- assign only the 16 threads from CPU1 for both ports (so 8 threads on
>socket 1 for port 0 and 8 threads on socket 1 for port 1) => setup
>rate of 3M sess/s
>
>I also tried a scenario with two machines connected back to back each
>of which had a NIC on socket 1. I assigned 16 threads from socket 1 on
>each machine to the port and performance scaled to 6M sess/s as
>expected.
>
>I double checked all our memory allocations and, at least in the
>tested scenario, we never use memory that's not on the same socket as
>the core.
>
>I pasted below the output of lspci -tv. I see that 0000:01:00.3 and
>0000:81:00.3 are connected to different PCI bridges but on each of
>those bridges there are also "Intel Corporation Xeon E7 v3/Xeon E5
>v3/Core i7 DMA Channel <X>" devices.
>
>It would be great if you could also take a look in case I
>missed/misunderstood something.
>
>Thanks,
>Dumitru
>

From the output below it appears the x710 devices 01:00.[0-3] are on socket 0,
and the x710 devices 02:00.[0-3] sit on socket 1.

This means the ports on 01:00.xx should be handled by socket 0 CPUs and 02:00.xx should be handled by socket 1. I cannot tell if that is the case for you here. The CPUs or lcores from cpu_layout.py should help you understand the layout.

># lspci -tv
>-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>PCIe Ring Interface
> |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>PCIe Ring Interface
> |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 0
> |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 0
> |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 1
> |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 1
> |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
> |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
> |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel Target Address Decoder
> |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel Target Address Decoder
> |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Channel 0/1 Broadcast
> |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Global Broadcast
> |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 0 Thermal Control
> |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 1 Thermal Control
> |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 0 ERROR Registers
> |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 1 ERROR Registers
> |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
> |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
> |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel Target Address Decoder
> |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel Target Address Decoder
> |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Channel 2/3 Broadcast
> |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Global Broadcast
> |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 0 Thermal Control
> |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 1 Thermal Control
> |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 0 ERROR Registers
> |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 1 ERROR Registers
> |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
> |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
> +-[0000:80]-+-02.0-[81]--+-00.0  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
> |           |            +-00.1  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
> |           |            +-00.2  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
> |           |            \-00.3  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
> |           +-03.0-[82]----00.0  Intel Corporation Ethernet
>Controller XL710 for 40GbE QSFP+
> |           +-03.2-[83]----00.0  Intel Corporation Ethernet
>Controller XL710 for 40GbE QSFP+
> |           +-04.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 0
> |           +-04.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 1
> |           +-04.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 2
> |           +-04.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 3
> |           +-04.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 4
> |           +-04.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 5
> |           +-04.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 6
> |           +-04.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 7
> |           +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Address Map, VTd_Misc, System Management
> |           +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
> |           +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>RAS, Control Status and Global Errors
> |           \-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
> +-[0000:7f]-+-08.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-08.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-08.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 0
> |           +-09.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-09.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-09.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 QPI Link 1
> |           +-0b.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0b.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0b.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>R3 QPI Link 0 & 1 Monitoring
> |           +-0c.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0c.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0d.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0d.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Unicast Registers
> |           +-0f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Buffered Ring Agent
> |           +-0f.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-0f.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-0f.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>System Address Decoder & Broadcast Registers
> |           +-10.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>PCIe Ring Interface
> |           +-10.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>PCIe Ring Interface
> |           +-10.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-10.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-10.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Scratchpad & Semaphore Registers
> |           +-12.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 0
> |           +-12.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 0
> |           +-12.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 1
> |           +-12.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Home Agent 1
> |           +-13.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
> |           +-13.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Target Address, Thermal & RAS Registers
> |           +-13.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel Target Address Decoder
> |           +-13.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel Target Address Decoder
> |           +-13.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Channel 0/1 Broadcast
> |           +-13.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Global Broadcast
> |           +-14.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 0 Thermal Control
> |           +-14.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 1 Thermal Control
> |           +-14.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 0 ERROR Registers
> |           +-14.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 0 Channel 1 ERROR Registers
> |           +-14.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-14.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 0 & 1
> |           +-16.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
> |           +-16.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Target Address, Thermal & RAS Registers
> |           +-16.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel Target Address Decoder
> |           +-16.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel Target Address Decoder
> |           +-16.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Channel 2/3 Broadcast
> |           +-16.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO Global Broadcast
> |           +-17.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 0 Thermal Control
> |           +-17.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 1 Thermal Control
> |           +-17.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 0 ERROR Registers
> |           +-17.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Integrated Memory Controller 1 Channel 1 ERROR Registers
> |           +-17.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-17.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DDRIO (VMSE) 2 & 3
> |           +-1e.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1e.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Power Control Unit
> |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
> |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
> \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
>             +-01.0-[01]--+-00.0  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
>             |            +-00.1  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
>             |            +-00.2  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
>             |            \-00.3  Intel Corporation Ethernet
>Controller X710 for 10GbE SFP+
>             +-02.0-[02]--+-00.0  Intel Corporation 82599ES 10-Gigabit
>SFI/SFP+ Network Connection
>             |            \-00.1  Intel Corporation 82599ES 10-Gigabit
>SFI/SFP+ Network Connection
>             +-02.2-[03]--+-00.0  Intel Corporation 82599ES 10-Gigabit
>SFI/SFP+ Network Connection
>             |            \-00.1  Intel Corporation 82599ES 10-Gigabit
>SFI/SFP+ Network Connection
>             +-03.0-[04]--
>             +-03.2-[05]--
>             +-04.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 0
>             +-04.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 1
>             +-04.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 2
>             +-04.3  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 3
>             +-04.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 4
>             +-04.5  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 5
>             +-04.6  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 6
>             +-04.7  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>DMA Channel 7
>             +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>Address Map, VTd_Misc, System Management
>             +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
>             +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
>RAS, Control Status and Global Errors
>             +-05.4  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
>             +-11.0  Intel Corporation C610/X99 series chipset SPSR
>             +-11.4  Intel Corporation C610/X99 series chipset sSATA
>Controller [AHCI mode]
>             +-14.0  Intel Corporation C610/X99 series chipset USB
>xHCI Host Controller
>             +-16.0  Intel Corporation C610/X99 series chipset MEI Controller #1
>             +-16.1  Intel Corporation C610/X99 series chipset MEI Controller #2
>             +-1a.0  Intel Corporation C610/X99 series chipset USB
>Enhanced Host Controller #2
>             +-1c.0-[06]--
>             +-1c.3-[07-08]----00.0-[08]----00.0  ASPEED Technology,
>Inc. ASPEED Graphics Family
>             +-1c.4-[09]--+-00.0  Intel Corporation I350 Gigabit
>Network Connection
>             |            \-00.1  Intel Corporation I350 Gigabit
>Network Connection
>             +-1d.0  Intel Corporation C610/X99 series chipset USB
>Enhanced Host Controller #1
>             +-1f.0  Intel Corporation C610/X99 series chipset LPC Controller
>             +-1f.2  Intel Corporation C610/X99 series chipset 6-Port
>SATA Controller [AHCI mode]
>             +-1f.3  Intel Corporation C610/X99 series chipset SMBus Controller
>             \-1f.6  Intel Corporation C610/X99 series chipset Thermal Subsystem
>





* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 14:58         ` Wiles, Keith
@ 2016-06-16 15:16           ` Take Ceara
  2016-06-16 15:29             ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-16 15:16 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Jun 16, 2016 at 4:58 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
> From the output below it appears the x710 devices 01:00.[0-3] are on socket 0
> And the x710 devices 02:00.[0-3] sit on socket 1.
>

I assume there's a mistake here. The x710 devices on socket 0 are:
$ lspci | grep -ie "01:.*x710"
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.2 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
01:00.3 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)

and the X710 devices on socket 1 are:
$ lspci | grep -ie "81:.*x710"
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.2 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)
81:00.3 Ethernet controller: Intel Corporation Ethernet Controller
X710 for 10GbE SFP+ (rev 01)

> This means the ports on 01.00.xx should be handled by socket 0 CPUs and 02:00.xx should be handled by Socket 1. I can not tell if that is the case for you here. The CPUs or lcores from the cpu_layout.py should help understand the layout.
>

That was the first scenario I tried:
- assign 16 CPUs from socket 0 to port 0 (01:00.3)
- assign 16 CPUs from socket 1 to port 1 (81:00.3)

Our performance measurements then show a setup rate of 1.6M sess/s,
which is less than half of what I get when I install both X710s on
socket 1 and use only 16 CPUs from socket 1 for both ports.

I double-checked the CPU layout. We also have our own CLI and warnings
when using cores that are not on the same socket as the port they're
assigned to, so the mapping should be fine.

Thanks,
Dumitru


* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 15:16           ` Take Ceara
@ 2016-06-16 15:29             ` Wiles, Keith
  2016-06-16 16:20               ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 15:29 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev

On 6/16/16, 10:16 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>On Thu, Jun 16, 2016 at 4:58 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>
>> From the output below it appears the x710 devices 01:00.[0-3] are on socket 0
>> And the x710 devices 02:00.[0-3] sit on socket 1.
>>
>
>I assume there's a mistake here. The x710 devices on socket 0 are:
>$ lspci | grep -ie "01:.*x710"
>01:00.0 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>01:00.1 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>01:00.2 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>01:00.3 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>
>and the X710 devices on socket 1 are:
>$ lspci | grep -ie "81:.*x710"
>81:00.0 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>81:00.1 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>81:00.2 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)
>81:00.3 Ethernet controller: Intel Corporation Ethernet Controller
>X710 for 10GbE SFP+ (rev 01)

Yes, you are correct, I misread the lspci output.

>
>> This means the ports on 01.00.xx should be handled by socket 0 CPUs and 02:00.xx should be handled by Socket 1. I can not tell if that is the case for you here. The CPUs or lcores from the cpu_layout.py should help understand the layout.
>>
>
>That was the first scenario I tried:
>- assign 16 CPUs from socket 0 to port 0 (01:00.3)
>- assign 16 CPUs from socket 1 to port 1 (81:00.3)
>
>Our performance measurements show then a setup rate of 1.6M sess/s
>which is less then half of what I get when i install both X710 on
>socket 1 and use only 16 CPUs from socket 1 for both ports.

Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance when you have 8 cores per port. I am not really seeing the whole picture of how DPDK is configured, which makes it hard to help more. Sorry.

Maybe seeing the DPDK command line would help.

++Keith

>
>I double checked the cpu layout. We also have our own CLI and warnings
>when using cores that are not on the same socket as the port they're
>assigned too so the mapping should be fine.
>
>Thanks,
>Dumitru
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 15:29             ` Wiles, Keith
@ 2016-06-16 16:20               ` Take Ceara
  2016-06-16 16:56                 ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-16 16:20 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:

>
> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.

I doubt that there is a limitation wrt running 16 cores per port vs 8
cores per port as I've tried with two different machines connected
back to back each with one X710 port and 16 cores on each of them
running on that port. In that case our performance doubled as
expected.

>
> Maybe seeing the DPDK command line would help.

The command line I use with ports 01:00.3 and 81:00.3 is:
./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00

Our own qmap args allow the user to control exactly how cores are
split between ports. In this case we end up with:

warp17> show port map
Port 0[socket: 0]:
   Core 4[socket:0] (Tx: 0, Rx: 0)
   Core 5[socket:0] (Tx: 1, Rx: 1)
   Core 6[socket:0] (Tx: 2, Rx: 2)
   Core 7[socket:0] (Tx: 3, Rx: 3)
   Core 8[socket:0] (Tx: 4, Rx: 4)
   Core 9[socket:0] (Tx: 5, Rx: 5)
   Core 20[socket:0] (Tx: 6, Rx: 6)
   Core 21[socket:0] (Tx: 7, Rx: 7)
   Core 22[socket:0] (Tx: 8, Rx: 8)
   Core 23[socket:0] (Tx: 9, Rx: 9)
   Core 24[socket:0] (Tx: 10, Rx: 10)
   Core 25[socket:0] (Tx: 11, Rx: 11)
   Core 26[socket:0] (Tx: 12, Rx: 12)
   Core 27[socket:0] (Tx: 13, Rx: 13)
   Core 28[socket:0] (Tx: 14, Rx: 14)
   Core 29[socket:0] (Tx: 15, Rx: 15)

Port 1[socket: 1]:
   Core 10[socket:1] (Tx: 0, Rx: 0)
   Core 11[socket:1] (Tx: 1, Rx: 1)
   Core 12[socket:1] (Tx: 2, Rx: 2)
   Core 13[socket:1] (Tx: 3, Rx: 3)
   Core 14[socket:1] (Tx: 4, Rx: 4)
   Core 15[socket:1] (Tx: 5, Rx: 5)
   Core 16[socket:1] (Tx: 6, Rx: 6)
   Core 17[socket:1] (Tx: 7, Rx: 7)
   Core 18[socket:1] (Tx: 8, Rx: 8)
   Core 19[socket:1] (Tx: 9, Rx: 9)
   Core 30[socket:1] (Tx: 10, Rx: 10)
   Core 31[socket:1] (Tx: 11, Rx: 11)
   Core 32[socket:1] (Tx: 12, Rx: 12)
   Core 33[socket:1] (Tx: 13, Rx: 13)
   Core 34[socket:1] (Tx: 14, Rx: 14)
   Core 35[socket:1] (Tx: 15, Rx: 15)
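
(To make the qmap semantics concrete, here is a simplified sketch of the
expansion - not the real WARP17 parser: each set bit in the mask is an
lcore id and gets the next Tx/Rx queue pair in ascending order.)

#include <stdint.h>
#include <stdio.h>

/* Simplified qmap expansion, for illustration only. */
static void
qmap_show(int port, uint64_t mask)
{
    int queue = 0;
    unsigned int lcore;

    printf("Port %d:\n", port);
    for (lcore = 0; lcore < 64; lcore++) {
        if (mask & (UINT64_C(1) << lcore)) {
            printf("   Core %u (Tx: %d, Rx: %d)\n", lcore, queue, queue);
            queue++;
        }
    }
}

/* qmap_show(0, UINT64_C(0x003FF003F0)) reproduces the Port 0 table above:
 * lcores 4-9 and 20-29 get Tx/Rx queues 0-15. */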

Just for reference, the cpu_layout script shows:
$ $RTE_SDK/tools/cpu_layout.py
============================================================
Core and Socket Information (as reported by '/proc/cpuinfo')
============================================================

cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets =  [0, 1]

        Socket 0        Socket 1
        --------        --------
Core 0  [0, 20]         [10, 30]
Core 1  [1, 21]         [11, 31]
Core 2  [2, 22]         [12, 32]
Core 3  [3, 23]         [13, 33]
Core 4  [4, 24]         [14, 34]
Core 8  [5, 25]         [15, 35]
Core 9  [6, 26]         [16, 36]
Core 10 [7, 27]         [17, 37]
Core 11 [8, 28]         [18, 38]
Core 12 [9, 29]         [19, 39]

I know it might be complicated to figure out exactly what's happening
in our setup with our own code, so please let me know if you need
additional information.

I appreciate the help!

Thanks,
Dumitru

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 16:20               ` Take Ceara
@ 2016-06-16 16:56                 ` Wiles, Keith
  2016-06-16 16:59                   ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 16:56 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev


On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
>>
>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>
>I doubt that there is a limitation wrt running 16 cores per port vs 8
>cores per port as I've tried with two different machines connected
>back to back each with one X710 port and 16 cores on each of them
>running on that port. In that case our performance doubled as
>expected.
>
>>
>> Maybe seeing the DPDK command line would help.
>
>The command line I use with ports 01:00.3 and 81:00.3 is:
>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>
>Our own qmap args allow the user to control exactly how cores are
>split between ports. In this case we end up with:
>
>warp17> show port map
>Port 0[socket: 0]:
>   Core 4[socket:0] (Tx: 0, Rx: 0)
>   Core 5[socket:0] (Tx: 1, Rx: 1)
>   Core 6[socket:0] (Tx: 2, Rx: 2)
>   Core 7[socket:0] (Tx: 3, Rx: 3)
>   Core 8[socket:0] (Tx: 4, Rx: 4)
>   Core 9[socket:0] (Tx: 5, Rx: 5)
>   Core 20[socket:0] (Tx: 6, Rx: 6)
>   Core 21[socket:0] (Tx: 7, Rx: 7)
>   Core 22[socket:0] (Tx: 8, Rx: 8)
>   Core 23[socket:0] (Tx: 9, Rx: 9)
>   Core 24[socket:0] (Tx: 10, Rx: 10)
>   Core 25[socket:0] (Tx: 11, Rx: 11)
>   Core 26[socket:0] (Tx: 12, Rx: 12)
>   Core 27[socket:0] (Tx: 13, Rx: 13)
>   Core 28[socket:0] (Tx: 14, Rx: 14)
>   Core 29[socket:0] (Tx: 15, Rx: 15)
>
>Port 1[socket: 1]:
>   Core 10[socket:1] (Tx: 0, Rx: 0)
>   Core 11[socket:1] (Tx: 1, Rx: 1)
>   Core 12[socket:1] (Tx: 2, Rx: 2)
>   Core 13[socket:1] (Tx: 3, Rx: 3)
>   Core 14[socket:1] (Tx: 4, Rx: 4)
>   Core 15[socket:1] (Tx: 5, Rx: 5)
>   Core 16[socket:1] (Tx: 6, Rx: 6)
>   Core 17[socket:1] (Tx: 7, Rx: 7)
>   Core 18[socket:1] (Tx: 8, Rx: 8)
>   Core 19[socket:1] (Tx: 9, Rx: 9)
>   Core 30[socket:1] (Tx: 10, Rx: 10)
>   Core 31[socket:1] (Tx: 11, Rx: 11)
>   Core 32[socket:1] (Tx: 12, Rx: 12)
>   Core 33[socket:1] (Tx: 13, Rx: 13)
>   Core 34[socket:1] (Tx: 14, Rx: 14)
>   Core 35[socket:1] (Tx: 15, Rx: 15)

On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.

The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think are interchangeable. The problem is that the hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.

To get the best performance when hyper-threading is enabled, do not run both threads of a single physical core; run only hyper-thread 0 of each core.

The table below lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
Core 1 [1, 21]    [11, 31]
Use lcore 1 or 11 depending on the socket you are on.
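
(If you want to automate picking only the first hyper-thread of each
physical core, a rough sketch using the kernel topology files could look
like this - illustrative only, not something DPDK provides, and the
function names are made up.)

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Read one integer from a sysfs topology file for a given CPU; -1 on error. */
static int
topology_read(unsigned int cpu, const char *leaf)
{
    char path[128];
    int val = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/topology/%s", cpu, leaf);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* Build a coremask keeping only the first hyper-thread of each physical
 * core. The 64-lcore limit and array sizes are arbitrary for the sketch. */
static uint64_t
first_thread_mask(unsigned int nb_cpus)
{
    bool seen[64][64] = { { false } };  /* [package_id][core_id] */
    uint64_t mask = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < nb_cpus && cpu < 64; cpu++) {
        int pkg = topology_read(cpu, "physical_package_id");
        int core = topology_read(cpu, "core_id");

        if (pkg < 0 || core < 0 || pkg >= 64 || core >= 64)
            continue;
        if (!seen[pkg][core]) {         /* first sibling of this core wins */
            seen[pkg][core] = true;
            mask |= UINT64_C(1) << cpu;
        }
    }
    return mask;
}

/* On the 2 x 10-core layout quoted below, first_thread_mask(40) yields
 * 0x000FFFFF, i.e. lcores 0-19 only, dropping the second threads 20-39. */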

The info below is most likely the best performance and utilization of your system. If I got the values right ☺

./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
--qmap 0.0x00000003FE --qmap 1.0x00000FFE00

Port 0[socket: 0]:
   Core 2[socket:0] (Tx: 0, Rx: 0)
   Core 3[socket:0] (Tx: 1, Rx: 1)
   Core 4[socket:0] (Tx: 2, Rx: 2)
   Core 5[socket:0] (Tx: 3, Rx: 3)
   Core 6[socket:0] (Tx: 4, Rx: 4)
   Core 7[socket:0] (Tx: 5, Rx: 5)
   Core 8[socket:0] (Tx: 6, Rx: 6)
   Core 9[socket:0] (Tx: 7, Rx: 7)

8 cores on the first socket, leaving lcores 0-1 for Linux.

Port 1[socket: 1]:
   Core 10[socket:1] (Tx: 0, Rx: 0)
   Core 11[socket:1] (Tx: 1, Rx: 1)
   Core 12[socket:1] (Tx: 2, Rx: 2)
   Core 13[socket:1] (Tx: 3, Rx: 3)
   Core 14[socket:1] (Tx: 4, Rx: 4)
   Core 15[socket:1] (Tx: 5, Rx: 5)
   Core 16[socket:1] (Tx: 6, Rx: 6)
   Core 17[socket:1] (Tx: 7, Rx: 7)
   Core 18[socket:1] (Tx: 8, Rx: 8)
   Core 19[socket:1] (Tx: 9, Rx: 9)

All 10 cores on the second socket.

++Keith

>
>Just for reference, the cpu_layout script shows:
>$ $RTE_SDK/tools/cpu_layout.py
>============================================================
>Core and Socket Information (as reported by '/proc/cpuinfo')
>============================================================
>
>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>sockets =  [0, 1]
>
>        Socket 0        Socket 1
>        --------        --------
>Core 0  [0, 20]         [10, 30]
>Core 1  [1, 21]         [11, 31]
>Core 2  [2, 22]         [12, 32]
>Core 3  [3, 23]         [13, 33]
>Core 4  [4, 24]         [14, 34]
>Core 8  [5, 25]         [15, 35]
>Core 9  [6, 26]         [16, 36]
>Core 10 [7, 27]         [17, 37]
>Core 11 [8, 28]         [18, 38]
>Core 12 [9, 29]         [19, 39]
>
>I know it might be complicated to gigure out exactly what's happening
>in our setup with our own code so please let me know if you need
>additional information.
>
>I appreciate the help!
>
>Thanks,
>Dumitru
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 16:56                 ` Wiles, Keith
@ 2016-06-16 16:59                   ` Wiles, Keith
  2016-06-16 18:20                     ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 16:59 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev


On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:

>
>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>
>>>
>>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>
>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>cores per port as I've tried with two different machines connected
>>back to back each with one X710 port and 16 cores on each of them
>>running on that port. In that case our performance doubled as
>>expected.
>>
>>>
>>> Maybe seeing the DPDK command line would help.
>>
>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>
>>Our own qmap args allow the user to control exactly how cores are
>>split between ports. In this case we end up with:
>>
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>
>>Port 1[socket: 1]:
>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>
>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>
>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think they are interchangeable. The problem is the hyper-threads are logically interchangeable, but not performance wise. If you have two run-to-completion threads on a single physical core each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 30-20% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>
>To get the best performance when hyper-threading is enable is to not run both threads on a single physical core, but only run one hyper-thread-0.
>
>In the table below the table lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>Core 1 [1, 21]    [11, 31]
>Use lcore 1 or 11 depending on the socket you are on.
>
>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>
>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>
>Port 0[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>   Core 6[socket:0] (Tx: 4, Rx: 4)
>   Core 7[socket:0] (Tx: 5, Rx: 5)
>   Core 8[socket:0] (Tx: 6, Rx: 6)
>   Core 9[socket:0] (Tx: 7, Rx: 7)
>
>8 cores on first socket leaving 0-1 lcores for Linux.

9 cores and leaving the first core or two lcores for Linux
>
>Port 1[socket: 1]:
>   Core 10[socket:1] (Tx: 0, Rx: 0)
>   Core 11[socket:1] (Tx: 1, Rx: 1)
>   Core 12[socket:1] (Tx: 2, Rx: 2)
>   Core 13[socket:1] (Tx: 3, Rx: 3)
>   Core 14[socket:1] (Tx: 4, Rx: 4)
>   Core 15[socket:1] (Tx: 5, Rx: 5)
>   Core 16[socket:1] (Tx: 6, Rx: 6)
>   Core 17[socket:1] (Tx: 7, Rx: 7)
>   Core 18[socket:1] (Tx: 8, Rx: 8)
>   Core 19[socket:1] (Tx: 9, Rx: 9)
>
>All 10 cores on the second socket.
>
>++Keith
>
>>
>>Just for reference, the cpu_layout script shows:
>>$ $RTE_SDK/tools/cpu_layout.py
>>============================================================
>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>============================================================
>>
>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>sockets =  [0, 1]
>>
>>        Socket 0        Socket 1
>>        --------        --------
>>Core 0  [0, 20]         [10, 30]
>>Core 1  [1, 21]         [11, 31]
>>Core 2  [2, 22]         [12, 32]
>>Core 3  [3, 23]         [13, 33]
>>Core 4  [4, 24]         [14, 34]
>>Core 8  [5, 25]         [15, 35]
>>Core 9  [6, 26]         [16, 36]
>>Core 10 [7, 27]         [17, 37]
>>Core 11 [8, 28]         [18, 38]
>>Core 12 [9, 29]         [19, 39]
>>
>>I know it might be complicated to gigure out exactly what's happening
>>in our setup with our own code so please let me know if you need
>>additional information.
>>
>>I appreciate the help!
>>
>>Thanks,
>>Dumitru
>>
>
>
>
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 16:59                   ` Wiles, Keith
@ 2016-06-16 18:20                     ` Take Ceara
  2016-06-16 19:33                       ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-16 18:20 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>
>>
>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>
>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>
>>>>
>>>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>>
>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>cores per port as I've tried with two different machines connected
>>>back to back each with one X710 port and 16 cores on each of them
>>>running on that port. In that case our performance doubled as
>>>expected.
>>>
>>>>
>>>> Maybe seeing the DPDK command line would help.
>>>
>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>
>>>Our own qmap args allow the user to control exactly how cores are
>>>split between ports. In this case we end up with:
>>>
>>>warp17> show port map
>>>Port 0[socket: 0]:
>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>
>>>Port 1[socket: 1]:
>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>
>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>
>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think they are interchangeable. The problem is the hyper-threads are logically interchangeable, but not performance wise. If you have two run-to-completion threads on a single physical core each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 30-20% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>
>>To get the best performance when hyper-threading is enable is to not run both threads on a single physical core, but only run one hyper-thread-0.
>>
>>In the table below the table lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>>Core 1 [1, 21]    [11, 31]
>>Use lcore 1 or 11 depending on the socket you are on.
>>
>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>
>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>
>>8 cores on first socket leaving 0-1 lcores for Linux.
>
> 9 cores and leaving the first core or two lcores for Linux
>>
>>Port 1[socket: 1]:
>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>
>>All 10 cores on the second socket.

The values were almost right :) But that's because we reserve the
first two lcores that are passed to DPDK for our own management part.
I was aware that lcores are not physical cores, so we don't expect
performance to scale linearly with the number of lcores. However, if
there's a chance that another hyperthread can run while the paired one
is stalling, we'd like to take advantage of those cycles if possible.

Leaving that aside I just ran two more tests while using only one of
the two hwthreads in a core.

a. 2 ports on different sockets with 8 cores/port:
./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
-- --qmap 0.0x3FC --qmap 1.0xFF000
warp17> show port map
Port 0[socket: 0]:
   Core 2[socket:0] (Tx: 0, Rx: 0)
   Core 3[socket:0] (Tx: 1, Rx: 1)
   Core 4[socket:0] (Tx: 2, Rx: 2)
   Core 5[socket:0] (Tx: 3, Rx: 3)
   Core 6[socket:0] (Tx: 4, Rx: 4)
   Core 7[socket:0] (Tx: 5, Rx: 5)
   Core 8[socket:0] (Tx: 6, Rx: 6)
   Core 9[socket:0] (Tx: 7, Rx: 7)

Port 1[socket: 1]:
   Core 12[socket:1] (Tx: 0, Rx: 0)
   Core 13[socket:1] (Tx: 1, Rx: 1)
   Core 14[socket:1] (Tx: 2, Rx: 2)
   Core 15[socket:1] (Tx: 3, Rx: 3)
   Core 16[socket:1] (Tx: 4, Rx: 4)
   Core 17[socket:1] (Tx: 5, Rx: 5)
   Core 18[socket:1] (Tx: 6, Rx: 6)
   Core 19[socket:1] (Tx: 7, Rx: 7)

This gives a session setup rate of only 2M sessions/s.

b. 2 ports on socket 0 with 4 cores/port:
./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
--qmap 0.0x3C0 --qmap 1.0x03C
warp17> show port map
Port 0[socket: 0]:
   Core 6[socket:0] (Tx: 0, Rx: 0)
   Core 7[socket:0] (Tx: 1, Rx: 1)
   Core 8[socket:0] (Tx: 2, Rx: 2)
   Core 9[socket:0] (Tx: 3, Rx: 3)

Port 1[socket: 0]:
   Core 2[socket:0] (Tx: 0, Rx: 0)
   Core 3[socket:0] (Tx: 1, Rx: 1)
   Core 4[socket:0] (Tx: 2, Rx: 2)
   Core 5[socket:0] (Tx: 3, Rx: 3)

Surprisingly this gives a session setup rate of 3M sess/s!!

The packet processing cores are totally independent and only access
local socket memory/ports.
There is no locking or atomic variable access in fast path in our code.
The mbuf pools are not shared between cores handling the same port so
there should be no contention when allocating/freeing mbufs.
In this specific test scenario all the cores handling port 0 are
essentially executing the same code (TCP clients) and the cores on
port 1 as well (TCP servers).
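
(Conceptually, the per-queue setup is equivalent to the sketch below -
names and sizes are invented for illustration, this is not the actual
WARP17 code.)

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

#define NB_MBUF    8192
#define MBUF_CACHE 256
#define NB_RXD     512

/* One private mbuf pool per (port, queue), allocated on the port's socket,
 * so the core polling that queue never shares a pool or touches remote
 * memory. Sizes above are placeholders. */
static int
setup_local_rx_queue(uint8_t port_id, uint16_t queue_id)
{
    char name[RTE_MEMPOOL_NAMESIZE];
    int socket_id = rte_eth_dev_socket_id(port_id);
    struct rte_mempool *pool;

    if (socket_id < 0)
        socket_id = SOCKET_ID_ANY;

    snprintf(name, sizeof(name), "mbuf_p%u_q%u",
             (unsigned int)port_id, (unsigned int)queue_id);
    pool = rte_pktmbuf_pool_create(name, NB_MBUF, MBUF_CACHE, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
    if (pool == NULL)
        return -1;

    /* The Rx descriptor ring is allocated on the same (local) socket. */
    return rte_eth_rx_queue_setup(port_id, queue_id, NB_RXD,
                                  socket_id, NULL, pool);
}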

Do you have any tips about what other things to check for?

Thanks,
Dumitru



>>
>>++Keith
>>
>>>
>>>Just for reference, the cpu_layout script shows:
>>>$ $RTE_SDK/tools/cpu_layout.py
>>>============================================================
>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>============================================================
>>>
>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>sockets =  [0, 1]
>>>
>>>        Socket 0        Socket 1
>>>        --------        --------
>>>Core 0  [0, 20]         [10, 30]
>>>Core 1  [1, 21]         [11, 31]
>>>Core 2  [2, 22]         [12, 32]
>>>Core 3  [3, 23]         [13, 33]
>>>Core 4  [4, 24]         [14, 34]
>>>Core 8  [5, 25]         [15, 35]
>>>Core 9  [6, 26]         [16, 36]
>>>Core 10 [7, 27]         [17, 37]
>>>Core 11 [8, 28]         [18, 38]
>>>Core 12 [9, 29]         [19, 39]
>>>
>>>I know it might be complicated to gigure out exactly what's happening
>>>in our setup with our own code so please let me know if you need
>>>additional information.
>>>
>>>I appreciate the help!
>>>
>>>Thanks,
>>>Dumitru
>>>
>>
>>
>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 18:20                     ` Take Ceara
@ 2016-06-16 19:33                       ` Wiles, Keith
  2016-06-16 20:00                         ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 19:33 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev

On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>
>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>>
>>>
>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>
>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>
>>>>>
>>>>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>>>
>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>cores per port as I've tried with two different machines connected
>>>>back to back each with one X710 port and 16 cores on each of them
>>>>running on that port. In that case our performance doubled as
>>>>expected.
>>>>
>>>>>
>>>>> Maybe seeing the DPDK command line would help.
>>>>
>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>
>>>>Our own qmap args allow the user to control exactly how cores are
>>>>split between ports. In this case we end up with:
>>>>
>>>>warp17> show port map
>>>>Port 0[socket: 0]:
>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>
>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>>
>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think they are interchangeable. The problem is the hyper-threads are logically interchangeable, but not performance wise. If you have two run-to-completion threads on a single physical core each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 30-20% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>>
>>>To get the best performance when hyper-threading is enable is to not run both threads on a single physical core, but only run one hyper-thread-0.
>>>
>>>In the table below the table lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>>>Core 1 [1, 21]    [11, 31]
>>>Use lcore 1 or 11 depending on the socket you are on.
>>>
>>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>>
>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>
>>>Port 0[socket: 0]:
>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>
>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>
>> 9 cores and leaving the first core or two lcores for Linux
>>>
>>>Port 1[socket: 1]:
>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>
>>>All 10 cores on the second socket.
>
>The values were almost right :) But that's because we reserve the
>first two lcores that are passed to dpdk for our own management part.
>I was aware that lcores are not physical cores so we don't expect
>performance to scale linearly with the number of lcores. However, if
>there's a chance that another hyperthread can run while the paired one
>is stalling we'd like to take advantage of those cycles if possible.
>
>Leaving that aside I just ran two more tests while using only one of
>the two hwthreads in a core.
>
>a. 2 ports on different sockets with 8 cores/port:
>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>-- --qmap 0.0x3FC --qmap 1.0xFF000
>warp17> show port map
>Port 0[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>   Core 6[socket:0] (Tx: 4, Rx: 4)
>   Core 7[socket:0] (Tx: 5, Rx: 5)
>   Core 8[socket:0] (Tx: 6, Rx: 6)
>   Core 9[socket:0] (Tx: 7, Rx: 7)
>
>Port 1[socket: 1]:
>   Core 12[socket:1] (Tx: 0, Rx: 0)
>   Core 13[socket:1] (Tx: 1, Rx: 1)
>   Core 14[socket:1] (Tx: 2, Rx: 2)
>   Core 15[socket:1] (Tx: 3, Rx: 3)
>   Core 16[socket:1] (Tx: 4, Rx: 4)
>   Core 17[socket:1] (Tx: 5, Rx: 5)
>   Core 18[socket:1] (Tx: 6, Rx: 6)
>   Core 19[socket:1] (Tx: 7, Rx: 7)
>
>This gives a session setup rate of only 2M sessions/s.
>
>b. 2 ports on socket 0 with 4 cores/port:
>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>--qmap 0.0x3C0 --qmap 1.0x03C

One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/hugepages/* files or wherever you put them.

What is the dpdk -n option set to on your system? Mine is set to '-n 4'

>warp17> show port map
>Port 0[socket: 0]:
>   Core 6[socket:0] (Tx: 0, Rx: 0)
>   Core 7[socket:0] (Tx: 1, Rx: 1)
>   Core 8[socket:0] (Tx: 2, Rx: 2)
>   Core 9[socket:0] (Tx: 3, Rx: 3)
>
>Port 1[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>
>Surprisingly this gives a session setup rate of 3M sess/s!!
>
>The packet processing cores are totally independent and only access
>local socket memory/ports.
>There is no locking or atomic variable access in fast path in our code.
>The mbuf pools are not shared between cores handling the same port so
>there should be no contention when allocating/freeing mbufs.
>In this specific test scenario all the cores handling port 0 are
>essentially executing the same code (TCP clients) and the cores on
>port 1 as well (TCP servers).
>
>Do you have any tips about what other things to check for?
>
>Thanks,
>Dumitru
>
>
>
>>>
>>>++Keith
>>>
>>>>
>>>>Just for reference, the cpu_layout script shows:
>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>============================================================
>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>============================================================
>>>>
>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>sockets =  [0, 1]
>>>>
>>>>        Socket 0        Socket 1
>>>>        --------        --------
>>>>Core 0  [0, 20]         [10, 30]
>>>>Core 1  [1, 21]         [11, 31]
>>>>Core 2  [2, 22]         [12, 32]
>>>>Core 3  [3, 23]         [13, 33]
>>>>Core 4  [4, 24]         [14, 34]
>>>>Core 8  [5, 25]         [15, 35]
>>>>Core 9  [6, 26]         [16, 36]
>>>>Core 10 [7, 27]         [17, 37]
>>>>Core 11 [8, 28]         [18, 38]
>>>>Core 12 [9, 29]         [19, 39]
>>>>
>>>>I know it might be complicated to gigure out exactly what's happening
>>>>in our setup with our own code so please let me know if you need
>>>>additional information.
>>>>
>>>>I appreciate the help!
>>>>
>>>>Thanks,
>>>>Dumitru
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 19:33                       ` Wiles, Keith
@ 2016-06-16 20:00                         ` Take Ceara
  2016-06-16 20:16                           ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Take Ceara @ 2016-06-16 20:00 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>
>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>>>
>>>>
>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>
>>>>>>
>>>>>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>>>>
>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>cores per port as I've tried with two different machines connected
>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>running on that port. In that case our performance doubled as
>>>>>expected.
>>>>>
>>>>>>
>>>>>> Maybe seeing the DPDK command line would help.
>>>>>
>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>
>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>split between ports. In this case we end up with:
>>>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>
>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>>>
>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think they are interchangeable. The problem is the hyper-threads are logically interchangeable, but not performance wise. If you have two run-to-completion threads on a single physical core each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 30-20% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>>>
>>>>To get the best performance when hyper-threading is enable is to not run both threads on a single physical core, but only run one hyper-thread-0.
>>>>
>>>>In the table below the table lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>>>>Core 1 [1, 21]    [11, 31]
>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>
>>>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>>>
>>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>
>>>>Port 0[socket: 0]:
>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>
>>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>>
>>> 9 cores and leaving the first core or two lcores for Linux
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>
>>>>All 10 cores on the second socket.
>>
>>The values were almost right :) But that's because we reserve the
>>first two lcores that are passed to dpdk for our own management part.
>>I was aware that lcores are not physical cores so we don't expect
>>performance to scale linearly with the number of lcores. However, if
>>there's a chance that another hyperthread can run while the paired one
>>is stalling we'd like to take advantage of those cycles if possible.
>>
>>Leaving that aside I just ran two more tests while using only one of
>>the two hwthreads in a core.
>>
>>a. 2 ports on different sockets with 8 cores/port:
>>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>
>>Port 1[socket: 1]:
>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>
>>This gives a session setup rate of only 2M sessions/s.
>>
>>b. 2 ports on socket 0 with 4 cores/port:
>>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>--qmap 0.0x3C0 --qmap 1.0x03C
>
> One more thing to try change the –m 32768 to –socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/huepages/* files or wherever you put them.
>
> What is the dpdk –n option set to on your system? Mine is set to ‘–n 4’
>

I tried with --socket-mem 16384,16384 but it doesn't make any
difference. We call rte_malloc_socket anyway for everything that might
be accessed in the fast path, and the mempools are per-core and created
with the correct socket-id. Even when starting with '-m 32768' I see
that 16 hugepages get allocated on each of the sockets.
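
(I.e., per-core fast-path state is allocated roughly like this - a
simplified sketch with made-up struct and field names, not the exact
code.)

#include <stdint.h>
#include <rte_malloc.h>
#include <rte_memory.h>
#include <rte_lcore.h>

/* Hypothetical per-core fast-path state, allocated on the socket the
 * polling lcore runs on so it is never touched across QPI. */
struct pkt_core_state {
    uint64_t pkts_rx;
    uint64_t pkts_tx;
    /* ... per-core session tables, timers, mempool pointers, etc. ... */
} __rte_cache_aligned;

static struct pkt_core_state *core_state[RTE_MAX_LCORE];

static int
alloc_core_state(unsigned int lcore_id)
{
    core_state[lcore_id] =
        rte_malloc_socket("core_state", sizeof(struct pkt_core_state),
                          RTE_CACHE_LINE_SIZE,
                          rte_lcore_to_socket_id(lcore_id));
    return core_state[lcore_id] != NULL ? 0 : -1;
}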

On the test server I have 4 memory channels so '-n 4'.

>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>
>>Port 1[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>Surprisingly this gives a session setup rate of 3M sess/s!!
>>
>>The packet processing cores are totally independent and only access
>>local socket memory/ports.
>>There is no locking or atomic variable access in fast path in our code.
>>The mbuf pools are not shared between cores handling the same port so
>>there should be no contention when allocating/freeing mbufs.
>>In this specific test scenario all the cores handling port 0 are
>>essentially executing the same code (TCP clients) and the cores on
>>port 1 as well (TCP servers).
>>
>>Do you have any tips about what other things to check for?
>>
>>Thanks,
>>Dumitru
>>
>>
>>
>>>>
>>>>++Keith
>>>>
>>>>>
>>>>>Just for reference, the cpu_layout script shows:
>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>============================================================
>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>============================================================
>>>>>
>>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>sockets =  [0, 1]
>>>>>
>>>>>        Socket 0        Socket 1
>>>>>        --------        --------
>>>>>Core 0  [0, 20]         [10, 30]
>>>>>Core 1  [1, 21]         [11, 31]
>>>>>Core 2  [2, 22]         [12, 32]
>>>>>Core 3  [3, 23]         [13, 33]
>>>>>Core 4  [4, 24]         [14, 34]
>>>>>Core 8  [5, 25]         [15, 35]
>>>>>Core 9  [6, 26]         [16, 36]
>>>>>Core 10 [7, 27]         [17, 37]
>>>>>Core 11 [8, 28]         [18, 38]
>>>>>Core 12 [9, 29]         [19, 39]
>>>>>
>>>>>I know it might be complicated to gigure out exactly what's happening
>>>>>in our setup with our own code so please let me know if you need
>>>>>additional information.
>>>>>
>>>>>I appreciate the help!
>>>>>
>>>>>Thanks,
>>>>>Dumitru
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 20:00                         ` Take Ceara
@ 2016-06-16 20:16                           ` Wiles, Keith
  2016-06-16 20:19                             ` Wiles, Keith
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 20:16 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev


On 6/16/16, 3:00 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:

>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>
>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>
>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>>>>
>>>>>
>>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>>>
>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Right now I do not know what the issue is with the system. Could be too many Rx/Tx ring pairs per port and limiting the memory in the NICs, which is why you get better performance when you have 8 core per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>>>>>
>>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>>cores per port as I've tried with two different machines connected
>>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>>running on that port. In that case our performance doubled as
>>>>>>expected.
>>>>>>
>>>>>>>
>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>
>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>
>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>split between ports. In this case we end up with:
>>>>>>
>>>>>>warp17> show port map
>>>>>>Port 0[socket: 0]:
>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>
>>>>>>Port 1[socket: 1]:
>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>>
>>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>>>>
>>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think they are interchangeable. The problem is the hyper-threads are logically interchangeable, but not performance wise. If you have two run-to-completion threads on a single physical core each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 30-20% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>>>>
>>>>>To get the best performance when hyper-threading is enable is to not run both threads on a single physical core, but only run one hyper-thread-0.
>>>>>
>>>>>In the table below the table lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>>>>>Core 1 [1, 21]    [11, 31]
>>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>>
>>>>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>>>>
>>>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>
>>>>>Port 0[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>
>>>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>>>
>>>> 9 cores and leaving the first core or two lcores for Linux
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>
>>>>>All 10 cores on the second socket.
>>>
>>>The values were almost right :) But that's because we reserve the
>>>first two lcores that are passed to dpdk for our own management part.
>>>I was aware that lcores are not physical cores so we don't expect
>>>performance to scale linearly with the number of lcores. However, if
>>>there's a chance that another hyperthread can run while the paired one
>>>is stalling we'd like to take advantage of those cycles if possible.
>>>
>>>Leaving that aside I just ran two more tests while using only one of
>>>the two hwthreads in a core.
>>>
>>>a. 2 ports on different sockets with 8 cores/port:
>>>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>>warp17> show port map
>>>Port 0[socket: 0]:
>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>
>>>Port 1[socket: 1]:
>>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>>
>>>This gives a session setup rate of only 2M sessions/s.
>>>
>>>b. 2 ports on socket 0 with 4 cores/port:
>>>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>>--qmap 0.0x3C0 --qmap 1.0x03C
>>
>> One more thing to try change the –m 32768 to –socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/huepages/* files or wherever you put them.
>>
>> What is the dpdk –n option set to on your system? Mine is set to ‘–n 4’
>>
>
>I tried with –socket-mem 16384,16384 but it doesn't make any
>difference. We call anyway rte_malloc_socket for everything that might
>be accessed in fast path and the mempools are per-core and created
>with the correct socket-id. Even when starting with '-m 32768' I see
>that 16 hugepages get allocated on each of the sockets.
>
>On the test server I have 4 memory channels so '-n 4'.
>
>>>warp17> show port map
>>>Port 0[socket: 0]:
>>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>>
>>>Port 1[socket: 0]:
>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>   Core 5[socket:0] (Tx: 3, Rx: 3)

I do not know now. It seems like something else is going on here that we have not identified.

>>>
>>>Surprisingly this gives a session setup rate of 3M sess/s!!
>>>
>>>The packet processing cores are totally independent and only access
>>>local socket memory/ports.
>>>There is no locking or atomic variable access in fast path in our code.
>>>The mbuf pools are not shared between cores handling the same port so
>>>there should be no contention when allocating/freeing mbufs.
>>>In this specific test scenario all the cores handling port 0 are
>>>essentially executing the same code (TCP clients) and the cores on
>>>port 1 as well (TCP servers).
>>>
>>>Do you have any tips about what other things to check for?
>>>
>>>Thanks,
>>>Dumitru
>>>
>>>
>>>
>>>>>
>>>>>++Keith
>>>>>
>>>>>>
>>>>>>Just for reference, the cpu_layout script shows:
>>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>>============================================================
>>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>============================================================
>>>>>>
>>>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>sockets =  [0, 1]
>>>>>>
>>>>>>        Socket 0        Socket 1
>>>>>>        --------        --------
>>>>>>Core 0  [0, 20]         [10, 30]
>>>>>>Core 1  [1, 21]         [11, 31]
>>>>>>Core 2  [2, 22]         [12, 32]
>>>>>>Core 3  [3, 23]         [13, 33]
>>>>>>Core 4  [4, 24]         [14, 34]
>>>>>>Core 8  [5, 25]         [15, 35]
>>>>>>Core 9  [6, 26]         [16, 36]
>>>>>>Core 10 [7, 27]         [17, 37]
>>>>>>Core 11 [8, 28]         [18, 38]
>>>>>>Core 12 [9, 29]         [19, 39]
>>>>>>
>>>>>>I know it might be complicated to gigure out exactly what's happening
>>>>>>in our setup with our own code so please let me know if you need
>>>>>>additional information.
>>>>>>
>>>>>>I appreciate the help!
>>>>>>
>>>>>>Thanks,
>>>>>>Dumitru
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 20:16                           ` Wiles, Keith
@ 2016-06-16 20:19                             ` Wiles, Keith
  2016-06-16 20:27                               ` Take Ceara
  0 siblings, 1 reply; 19+ messages in thread
From: Wiles, Keith @ 2016-06-16 20:19 UTC (permalink / raw)
  To: Take Ceara; +Cc: dev


On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:

>
>On 6/16/16, 3:00 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>
>>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>
>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>>>>>
>>>>>>
>>>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>>>>
>>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance when you have 8 cores per port. I am not really seeing the whole picture of how DPDK is configured, so it is hard to help more. Sorry.
>>>>>>>
>>>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>>>cores per port as I've tried with two different machines connected
>>>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>>>running on that port. In that case our performance doubled as
>>>>>>>expected.
>>>>>>>
>>>>>>>>
>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>
>>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>
>>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>>split between ports. In this case we end up with:
>>>>>>>
>>>>>>>warp17> show port map
>>>>>>>Port 0[socket: 0]:
>>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>
>>>>>>>Port 1[socket: 1]:
>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>>>
>>>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>>>>>
>>>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think are interchangeable. The problem is that the hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>>>>>
>>>>>>To get the best performance when hyper-threading is enabled, do not run both threads on a single physical core; run only hyper-thread 0 of each core.
>>>>>>
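(To make that rule mechanical: the lcore list can be filtered at startup so
only the first sibling of each physical core is used. A rough sketch, based
on the lcore_config[] table that <rte_lcore.h> exposes; adapt it to however
you build your qmaps:)

/*
 * True if 'lcore' is the lowest-numbered enabled lcore on its physical
 * core, i.e. the only hyper-thread worth using for packet processing.
 */
#include <stdbool.h>
#include <rte_lcore.h>

static bool lcore_is_first_sibling(unsigned lcore)
{
    unsigned other;

    RTE_LCORE_FOREACH(other) {
        if (other == lcore)
            break;                      /* no lower-numbered sibling found */
        if (lcore_config[other].socket_id == lcore_config[lcore].socket_id &&
            lcore_config[other].core_id == lcore_config[lcore].core_id)
            return false;               /* its sibling is already in use */
    }
    return true;
}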
>>>>>>The table below lists the physical core id and each of the lcore ids per socket. Use the first lcore of each physical core for the best performance:
>>>>>>Core 1 [1, 21]    [11, 31]
>>>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>>>
>>>>>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>>>>>
>>>>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>>
>>>>>>Port 0[socket: 0]:
>>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>>
>>>>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>>>>
>>>>> 9 cores and leaving the first core or two lcores for Linux
>>>>>>
>>>>>>Port 1[socket: 1]:
>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>
>>>>>>All 10 cores on the second socket.
>>>>
>>>>The values were almost right :) But that's because we reserve the
>>>>first two lcores that are passed to dpdk for our own management part.
>>>>I was aware that lcores are not physical cores so we don't expect
>>>>performance to scale linearly with the number of lcores. However, if
>>>>there's a chance that another hyperthread can run while the paired one
>>>>is stalling we'd like to take advantage of those cycles if possible.
>>>>
>>>>Leaving that aside I just ran two more tests while using only one of
>>>>the two hwthreads in a core.
>>>>
>>>>a. 2 ports on different sockets with 8 cores/port:
>>>>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>>>warp17> show port map
>>>>Port 0[socket: 0]:
>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>>>
>>>>This gives a session setup rate of only 2M sessions/s.
>>>>
>>>>b. 2 ports on socket 0 with 4 cores/port:
>>>>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>>>--qmap 0.0x3C0 --qmap 1.0x03C
>>>
>>> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/hugepages/* files or wherever you put them.
>>>
>>> What is the dpdk -n option set to on your system? Mine is set to '-n 4'
>>>
>>
>>I tried with --socket-mem 16384,16384 but it doesn't make any
>>difference. In any case we call rte_malloc_socket for everything that
>>might be accessed in the fast path, and the mempools are per-core and
>>created with the correct socket-id. Even when starting with '-m 32768'
>>I see that 16 hugepages get allocated on each of the sockets.
>>
>>On the test server I have 4 memory channels so '-n 4'.
>>
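(You can also confirm that split from inside the application instead of
counting files in the hugepage mount — a small sketch using the memseg
layout the EAL exposes, with field names as in the 16.04 headers:)

/*
 * Print how much EAL hugepage memory ended up on each NUMA node.
 */
#include <stdio.h>
#include <inttypes.h>
#include <rte_memory.h>

static void print_mem_per_socket(void)
{
    const struct rte_memseg *ms = rte_eal_get_physmem_layout();
    uint64_t per_socket[RTE_MAX_NUMA_NODES] = { 0 };
    unsigned i;

    for (i = 0; i < RTE_MAX_MEMSEG; i++) {
        if (ms[i].addr == NULL)
            break;
        if (ms[i].socket_id >= 0 && ms[i].socket_id < RTE_MAX_NUMA_NODES)
            per_socket[ms[i].socket_id] += ms[i].len;
    }

    for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
        if (per_socket[i] != 0)
            printf("socket %u: %" PRIu64 " MB\n",
                   i, per_socket[i] / (1024 * 1024));
}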
>>>>warp17> show port map
>>>>Port 0[socket: 0]:
>>>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>>>
>>>>Port 1[socket: 0]:
>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>
>I do not know now. It seems like something else is going on here that we have not identified.

Maybe vTune or some other type of debug performance tool would be the next step here.
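Before going that far it may be worth dumping the basic port counters while
the test runs: if imissed/rx_nombuf keep climbing, the cores simply cannot
keep up with the NIC; if they stay at zero, the time is being lost in the
application or memory path, which is exactly where vTune/perf would help.
A quick sketch:

/*
 * Dump the counters that usually show whether the cores keep up with
 * the NIC (imissed / rx_nombuf grow when they do not).
 */
#include <stdio.h>
#include <inttypes.h>
#include <rte_ethdev.h>

static void dump_port_drops(uint8_t port_id)
{
    struct rte_eth_stats stats;

    rte_eth_stats_get(port_id, &stats);
    printf("port %u: rx=%" PRIu64 " tx=%" PRIu64 " imissed=%" PRIu64
           " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
           port_id, stats.ipackets, stats.opackets,
           stats.imissed, stats.ierrors, stats.rx_nombuf);
}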

>
>>>>
>>>>Surprisingly this gives a session setup rate of 3M sess/s!!
>>>>
>>>>The packet processing cores are totally independent and only access
>>>>local socket memory/ports.
>>>>There is no locking or atomic variable access in the fast path in our code.
>>>>The mbuf pools are not shared between cores handling the same port so
>>>>there should be no contention when allocating/freeing mbufs.
>>>>In this specific test scenario all the cores handling port 0 are
>>>>essentially executing the same code (TCP clients) and the cores on
>>>>port 1 as well (TCP servers).
>>>>
>>>>Do you have any tips about what other things to check for?
>>>>
>>>>Thanks,
>>>>Dumitru
>>>>
>>>>
>>>>
>>>>>>
>>>>>>++Keith
>>>>>>
>>>>>>>
>>>>>>>Just for reference, the cpu_layout script shows:
>>>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>>>============================================================
>>>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>>============================================================
>>>>>>>
>>>>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>>sockets =  [0, 1]
>>>>>>>
>>>>>>>        Socket 0        Socket 1
>>>>>>>        --------        --------
>>>>>>>Core 0  [0, 20]         [10, 30]
>>>>>>>Core 1  [1, 21]         [11, 31]
>>>>>>>Core 2  [2, 22]         [12, 32]
>>>>>>>Core 3  [3, 23]         [13, 33]
>>>>>>>Core 4  [4, 24]         [14, 34]
>>>>>>>Core 8  [5, 25]         [15, 35]
>>>>>>>Core 9  [6, 26]         [16, 36]
>>>>>>>Core 10 [7, 27]         [17, 37]
>>>>>>>Core 11 [8, 28]         [18, 38]
>>>>>>>Core 12 [9, 29]         [19, 39]
>>>>>>>
>>>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>>>in our setup with our own code so please let me know if you need
>>>>>>>additional information.
>>>>>>>
>>>>>>>I appreciate the help!
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Dumitru
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
>




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Performance hit - NICs on different CPU sockets
  2016-06-16 20:19                             ` Wiles, Keith
@ 2016-06-16 20:27                               ` Take Ceara
  0 siblings, 0 replies; 19+ messages in thread
From: Take Ceara @ 2016-06-16 20:27 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Jun 16, 2016 at 10:19 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>
> On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>
>>
>>On 6/16/16, 3:00 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>
>>>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>>
>>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces@dpdk.org on behalf of keith.wiles@intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara@gmail.com> wrote:
>>>>>>>
>>>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles@intel.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance when you have 8 cores per port. I am not really seeing the whole picture of how DPDK is configured, so it is hard to help more. Sorry.
>>>>>>>>
>>>>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>>>>cores per port as I've tried with two different machines connected
>>>>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>>>>running on that port. In that case our performance doubled as
>>>>>>>>expected.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>>
>>>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>>
>>>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>>>split between ports. In this case we end up with:
>>>>>>>>
>>>>>>>>warp17> show port map
>>>>>>>>Port 0[socket: 0]:
>>>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>>
>>>>>>>>Port 1[socket: 1]:
>>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>>>>
>>>>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40 lcores total.
>>>>>>>
>>>>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think are interchangeable. The problem is that the hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>>>>>>
>>>>>>>To get the best performance when hyper-threading is enabled, do not run both threads on a single physical core; run only hyper-thread 0 of each core.
>>>>>>>
>>>>>>>The table below lists the physical core id and each of the lcore ids per socket. Use the first lcore of each physical core for the best performance:
>>>>>>>Core 1 [1, 21]    [11, 31]
>>>>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>>>>
>>>>>>>The info below is most likely the best performance and utilization of your system. If I got the values right ☺
>>>>>>>
>>>>>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>>>
>>>>>>>Port 0[socket: 0]:
>>>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>>>
>>>>>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>>>>>
>>>>>> 9 cores and leaving the first core or two lcores for Linux
>>>>>>>
>>>>>>>Port 1[socket: 1]:
>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>
>>>>>>>All 10 cores on the second socket.
>>>>>
>>>>>The values were almost right :) But that's because we reserve the
>>>>>first two lcores that are passed to dpdk for our own management part.
>>>>>I was aware that lcores are not physical cores so we don't expect
>>>>>performance to scale linearly with the number of lcores. However, if
>>>>>there's a chance that another hyperthread can run while the paired one
>>>>>is stalling we'd like to take advantage of those cycles if possible.
>>>>>
>>>>>Leaving that aside I just ran two more tests while using only one of
>>>>>the two hwthreads in a core.
>>>>>
>>>>>a. 2 ports on different sockets with 8 cores/port:
>>>>>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>>>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>>>>
>>>>>This gives a session setup rate of only 2M sessions/s.
>>>>>
>>>>>b. 2 ports on socket 0 with 4 cores/port:
>>>>>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>>>>--qmap 0.0x3C0 --qmap 1.0x03C
>>>>
>>>> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/hugepages/* files or wherever you put them.
>>>>
>>>> What is the dpdk -n option set to on your system? Mine is set to '-n 4'
>>>>
>>>
>>>I tried with --socket-mem 16384,16384 but it doesn't make any
>>>difference. In any case we call rte_malloc_socket for everything that
>>>might be accessed in the fast path, and the mempools are per-core and
>>>created with the correct socket-id. Even when starting with '-m 32768'
>>>I see that 16 hugepages get allocated on each of the sockets.
>>>
>>>On the test server I have 4 memory channels so '-n 4'.
>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>>>>
>>>>>Port 1[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>I do not know now. It seems like something else is going on here that we have not identified.
>
> Maybe vTune or some other type of debug performance tool would be the next step here.
>

Thanks for the patience, Keith.
I'll try some profiling and see where it takes us from there. I'll
update this thread when I have some new info.

Regards,
Dumitru

>>
>>>>>
>>>>>Surprisingly this gives a session setup rate of 3M sess/s!!
>>>>>
>>>>>The packet processing cores are totally independent and only access
>>>>>local socket memory/ports.
>>>>>There is no locking or atomic variable access in the fast path in our code.
>>>>>The mbuf pools are not shared between cores handling the same port so
>>>>>there should be no contention when allocating/freeing mbufs.
>>>>>In this specific test scenario all the cores handling port 0 are
>>>>>essentially executing the same code (TCP clients) and the cores on
>>>>>port 1 as well (TCP servers).
>>>>>
>>>>>Do you have any tips about what other things to check for?
>>>>>
>>>>>Thanks,
>>>>>Dumitru
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>++Keith
>>>>>>>
>>>>>>>>
>>>>>>>>Just for reference, the cpu_layout script shows:
>>>>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>>>>============================================================
>>>>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>>>============================================================
>>>>>>>>
>>>>>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>>>sockets =  [0, 1]
>>>>>>>>
>>>>>>>>        Socket 0        Socket 1
>>>>>>>>        --------        --------
>>>>>>>>Core 0  [0, 20]         [10, 30]
>>>>>>>>Core 1  [1, 21]         [11, 31]
>>>>>>>>Core 2  [2, 22]         [12, 32]
>>>>>>>>Core 3  [3, 23]         [13, 33]
>>>>>>>>Core 4  [4, 24]         [14, 34]
>>>>>>>>Core 8  [5, 25]         [15, 35]
>>>>>>>>Core 9  [6, 26]         [16, 36]
>>>>>>>>Core 10 [7, 27]         [17, 37]
>>>>>>>>Core 11 [8, 28]         [18, 38]
>>>>>>>>Core 12 [9, 29]         [19, 39]
>>>>>>>>
>>>>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>>>>in our setup with our own code so please let me know if you need
>>>>>>>>additional information.
>>>>>>>>
>>>>>>>>I appreciate the help!
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Dumitru
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

