Re: Could not achieve wire speed for 40GE with any DPDK version on XL710 NIC's

From: Pavel Odintsov <pavel.odintsov@gmail.com>
To: Anuj Kalia <anujkaliaiitd@gmail.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: Could not achieve wire speed for 40GE with any DPDK version on XL710 NIC's
Date: Fri, 3 Jul 2015 11:35:45 +0300
Message-ID: <CALgsdbfWibuZ-Znn4p7USB=GDT5WgbChJB=4au-fzzCXDpCeKA@mail.gmail.com>
In-Reply-To: <CADPSxAhXQWUgqkBacwSNU=qYUv61k4Ox2EN6qKq_xOhi3KMTkQ@mail.gmail.com>

Hello, folks!

We have found the root of the issue.

Intel does not offer wire speed for 64-byte packets on the XL710 at all.

As mentioned in the data sheet
http://www.intel.ru/content/dam/www/public/us/en/documents/product-briefs/xl710-10-40-gbe-controller-brief.pdf
we have:

Small packet performance: Maintains wire-rate throughput on smaller
payload sizes (>128 Bytes for 40 GbE and >64 Bytes for 10 GbE)
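
For reference, here is a minimal sketch of the packet rates these frame
sizes imply. It assumes standard Ethernet framing overhead of 20 bytes
per frame on the wire (7 preamble + 1 SFD + 12 inter-frame gap), which
the data sheet text above does not state. The function name is made up
for illustration.

```python
# Assumption: 20 bytes of per-frame wire overhead
# (7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap).
ETH_OVERHEAD_BYTES = 20


def line_rate_mpps(link_gbps, frame_bytes):
    """Packets per second (in millions) needed to fill the link."""
    bits_on_wire = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire / 1e6


if __name__ == "__main__":
    for gbps, size in ((40, 64), (40, 128), (10, 64)):
        print(f"{gbps}GE, {size}-byte frames: "
              f"{line_rate_mpps(gbps, size):.2f} Mpps")
```

Under that assumption, 64-byte frames at 40GE need roughly 59.5 Mpps,
while 128-byte frames need roughly 33.8 Mpps.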

Could anybody recommend NICs that can truly achieve wire rate for 40GE?

On Wed, Jul 1, 2015 at 9:01 PM, Anuj Kalia <anujkaliaiitd@gmail.com> wrote:
> Thanks for the comments.
>
> On Wed, Jul 1, 2015 at 1:32 PM, Vladimir Medvedkin <medvedkinv@gmail.com> wrote:
>> Hi Anuj,
>>
>> Thanks for fixes!
>> I have 2 comments
>> - from i40e_ethdev.h : #define I40E_DEFAULT_RX_WTHRESH      0
>> - (26 + 32) / 4 (batched descriptor writeback) should be
>>   (26 + 4 * 32) / 4 (batched descriptor writeback), thus we have
>>   135 bytes/packet
>>
>> This corresponds to 58.8 Mpps
>>
>> Regards,
>> Vladimir
>>
>> 2015-07-01 17:22 GMT+03:00 Anuj Kalia <anujkaliaiitd@gmail.com>:
>>>
>>> Vladimir,
>>>
>>> Few possible fixes to your PCIe analysis (let me know if I'm wrong):
>>> - ECRC is probably disabled (check using sudo lspci -vvv | grep
>>> CGenEn-), so TLP header is 26 bytes
>>> - Descriptor writeback can be batched using high value of WTHRESH,
>>> which is what DPDK uses by default
>>> - Read request contains full TLP header (26 bytes)
>>>
>>> Assuming WTHRESH = 4, bytes transferred from NIC to host per packet =
>>> 26 + 64 (packet itself) +
>>> (26 + 32) / 4 (batched descriptor writeback) +
>>> (26 / 4) (read request for new descriptors) =
>>> 111 bytes / packet
>>>
>>> This corresponds to 70.9 Mpps over PCIe 3.0 x8. Assuming 5% DLLP
>>> overhead, rate = 67.4 Mpps
>>>
>>> --Anuj
>>>
>>>
>>>
>>> On Wed, Jul 1, 2015 at 9:40 AM, Vladimir Medvedkin <medvedkinv@gmail.com>
>>> wrote:
>>> > In the SYN flood case you should take into account the return SYN-ACK
>>> > traffic, which generates PCIe DLLPs from NIC to host, so the PCIe
>>> > bandwidth is exhausted faster. And don't forget about the DLLPs
>>> > generated by rx traffic, which load the host-to-NIC direction of the bus.
>>> >
>>> > 2015-07-01 16:05 GMT+03:00 Pavel Odintsov <pavel.odintsov@gmail.com>:
>>> >
>>> >> Yes, Bruce, we understand this. But we are working with huge SYN
>>> >> attacks processing and they are 64byte only :(
>>> >>
>>> >> On Wed, Jul 1, 2015 at 3:59 PM, Bruce Richardson
>>> >> <bruce.richardson@intel.com> wrote:
>>> >> > On Wed, Jul 01, 2015 at 03:44:57PM +0300, Pavel Odintsov wrote:
>>> >> >> Thanks for the answer, Vladimir! So we need to look for an x16 NIC if
>>> >> >> we want to achieve 40GE line rate...
>>> >> >>
>>> >> > Note that this would only apply for your minimal i.e. 64-byte, packet
>>> >> > sizes. Once you go up to larger e.g. 128B packets, your PCI bandwidth
>>> >> > requirements are lower and you can easier achieve line rate.
>>> >> >
>>> >> > /Bruce
>>> >> >
>>> >> >> On Wed, Jul 1, 2015 at 3:06 PM, Vladimir Medvedkin <medvedkinv@gmail.com> wrote:
>>> >> >> > Hi Pavel,
>>> >> >> >
>>> >> >> > Looks like you ran into pcie bottleneck. So let's calculate xl710 rx
>>> >> >> > only case.
>>> >> >> > Assume we have 32byte descriptors (if we want more offload).
>>> >> >> > DMA makes one pcie transaction with packet payload, one descriptor
>>> >> >> > writeback and one memory request for free descriptors for every 4
>>> >> >> > packets. For Transaction Layer Packet (TLP) there is 30 bytes
>>> >> >> > overhead (4 PHY + 6 DLL + 16 header + 4 ECRC). So for 1 rx packet dma
>>> >> >> > sends 30 + 64 (packet itself) + 30 + 32 (writeback descriptor) +
>>> >> >> > (16 / 4) (read request for new descriptors). Note that we do not take
>>> >> >> > into account PCIe ACK/NACK/FC Update DLLP. So we have 160 bytes per
>>> >> >> > packet. One lane PCIe 3.0 transmits 1 byte in 1 ns, so x8 transmits 8
>>> >> >> > bytes in 1 ns. 1 packet transmits in 20 ns. Thus in theory pcie 3.0
>>> >> >> > x8 may transfer not more than 50mpps.
>>> >> >> > Correct me if I'm wrong.
>>> >> >> >
>>> >> >> > Regards,
>>> >> >> > Vladimir
>>> >> >> >
>>> >> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Sincerely yours, Pavel Odintsov
>>> >>
>>
>>

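The per-packet PCIe budgets discussed in the quoted thread can be
reproduced with a short sketch. All byte counts are taken from the
messages above; the function names are made up for illustration, and
the 8 bytes/ns figure for PCIe 3.0 x8 is the rough value used in the
first estimate. Using a different usable bandwidth (for example after
128b/130b encoding) will shift the resulting Mpps somewhat.

```python
PKT_BYTES = 64    # minimal Ethernet frame, as in the thread
DESC_BYTES = 32   # 32-byte rx descriptor, as in the thread
BATCH = 4         # descriptors fetched / written back per batch


def initial_estimate():
    # 30-byte TLP overhead (with ECRC), unbatched descriptor writeback,
    # 16-byte read request amortized over 4 packets.
    return 30 + PKT_BYTES + 30 + DESC_BYTES + 16 / BATCH


def batched_estimate():
    # 26-byte TLP overhead (no ECRC), one descriptor per writeback TLP
    # amortized over 4 packets, full 26-byte read request header.
    return 26 + PKT_BYTES + (26 + DESC_BYTES) / BATCH + 26 / BATCH


def corrected_batched_estimate():
    # Same as above, but the batched writeback carries 4 descriptors.
    return 26 + PKT_BYTES + (26 + BATCH * DESC_BYTES) / BATCH + 26 / BATCH


def pcie_mpps(bytes_per_packet, bytes_per_ns=8.0):
    """Upper bound in Mpps for a link moving bytes_per_ns bytes per ns."""
    return bytes_per_ns * 1e3 / bytes_per_packet


if __name__ == "__main__":
    for name, fn in (("initial", initial_estimate),
                     ("batched", batched_estimate),
                     ("corrected batched", corrected_batched_estimate)):
        b = fn()
        print(f"{name}: {b:.1f} bytes/packet, <= {pcie_mpps(b):.1f} Mpps")
```
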
-- 
Sincerely yours, Pavel Odintsov