From: Shailja Pandey
Subject: Re: Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?
Date: Thu, 19 Apr 2018 20:00:08 +0530
To: "Wiles, Keith"
Cc: "dev@dpdk.org"
Message-ID: <611770bb-d0c9-7f4e-9c3d-3b572c9e8023@iitd.ac.in>
References: <598ada8c-194d-e07e-6121-5dc74cf208a1@iitd.ac.in>
List-Id: DPDK patches and discussions

> The two code fragments are doing this in two different ways: the first is using a loop to create possibly more than one replication and the second one is not, correct? The loop can cause performance hits, but should be small.

Sorry for the confusion; for the memcpy version we are also using a loop outside of this function. Essentially, we are making the same number of copies in both cases.

> The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64 bytes then you hit the second cacheline, are you starting to see the problem here.

We also performed the same experiment for different packet sizes (64B, 128B, 256B, 512B, 1024B, 1518B); the sharp drop in throughput is observed only when the packet size increases from 64B to 128B, and not after that. So the cacheline miss should happen for the other packet sizes as well. I am not sure why this is the case. Why is the drop not sharp after 128B packets when replicating using rte_mbuf_refcnt_update()?

> Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.

I don't understand the complexity here; could you please explain it in detail?

> > Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> > I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

We are already using both of these functions; I used memcpy and rte_pktmbuf_alloc() just to simplify the pseudo-code.
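For reference, a minimal sketch of the two replication paths being compared (simplified, not the exact code from the earlier mail: single-segment mbufs are assumed, error handling and the outer loop over the number of replicas are omitted, and the real code uses rte_pktmbuf_alloc_bulk() and rte_memcpy() as mentioned above):

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Path 1: full copy -- every replica gets its own data buffer,
 * allocated from the same mempool as the source packet. */
static struct rte_mbuf *
replicate_copy(struct rte_mbuf *src, struct rte_mempool *pool)
{
        struct rte_mbuf *dst = rte_pktmbuf_alloc(pool);

        if (dst == NULL)
                return NULL;
        rte_memcpy(rte_pktmbuf_mtod(dst, void *),
                   rte_pktmbuf_mtod(src, const void *),
                   src->data_len);
        dst->data_len = src->data_len;
        dst->pkt_len  = src->pkt_len;
        return dst;
}

/* Path 2: reference counting -- no packet data is copied; the
 * refcnt of the source mbuf is bumped and the same mbuf pointer
 * is handed to the TX path once per replica. */
static struct rte_mbuf *
replicate_refcnt(struct rte_mbuf *src)
{
        rte_mbuf_refcnt_update(src, 1);
        return src;
}

So in the refcnt path nothing is written to the packet data at all; only the mbuf header is touched.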
# pktsz 1 (64 bytes)  |  pktsz 2 (128 bytes) |  pktsz 3 (256 bytes) |  pktsz 4 (512 bytes) |  pktsz 5 (1024 bytes) |
# memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt    |
  5949888    5806720  |  5831360    2890816  |  5640379    2886016  |  5107840    2863264  |  4510121    2692876   |

Throughput is in packets per second.

-- 
Thanks,
Shailja