From: Shailja Pandey
Subject: Re: Why packet replication is more efficient when done using memcpy() as compared to rte_mbuf_refcnt_update() function?
Date: Thu, 19 Apr 2018 20:00:08 +0530
To: "Wiles, Keith"
Cc: "dev@dpdk.org"
Message-ID: <611770bb-d0c9-7f4e-9c3d-3b572c9e8023@iitd.ac.in>
References: <598ada8c-194d-e07e-6121-5dc74cf208a1@iitd.ac.in>
List-Id: DPDK patches and discussions

> The two code fragments are doing this in two different ways: the first is using a loop to create possibly more than one replication and the second one is not, correct? The loop can cause performance hits, but should be small.

Sorry for the confusion; for the memcpy version we are also using a loop outside of this function. Essentially, we are making the same number of copies in both cases.

> The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64 bytes then you hit the second cacheline, are you starting to see the problem here.

We also performed the same experiment for different packet sizes (64B, 128B, 256B, 512B, 1024B, 1518B); the sharp drop in throughput is observed only when the packet size increases from 64B to 128B, and not after that. So the cacheline miss should happen for the other packet sizes as well. I am not sure why this is the case. Why is the drop not sharp after 128B packets when replicating using rte_mbuf_refcnt_update()?

> Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.

I don't understand the complexity here; could you please explain it in detail?

> > Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> > I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.

We are already using both of these functions; I used memcpy and rte_pktmbuf_alloc() just to simplify the pseudo-code.
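For reference, a minimal sketch of the two replication paths being compared (simplified, not the exact code from the earlier mail: single-segment mbufs are assumed, error handling and the outer loop over the number of replicas are omitted, and the real code uses rte_pktmbuf_alloc_bulk() and rte_memcpy() as mentioned above):

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Path 1: full copy -- every replica gets its own data buffer,
 * allocated from the same mempool as the source packet. */
static struct rte_mbuf *
replicate_copy(struct rte_mbuf *src, struct rte_mempool *pool)
{
        struct rte_mbuf *dst = rte_pktmbuf_alloc(pool);

        if (dst == NULL)
                return NULL;
        rte_memcpy(rte_pktmbuf_mtod(dst, void *),
                   rte_pktmbuf_mtod(src, const void *),
                   src->data_len);
        dst->data_len = src->data_len;
        dst->pkt_len  = src->pkt_len;
        return dst;
}

/* Path 2: reference counting -- no packet data is copied; the
 * refcnt of the source mbuf is bumped and the same mbuf pointer
 * is handed to the TX path once per replica. */
static struct rte_mbuf *
replicate_refcnt(struct rte_mbuf *src)
{
        rte_mbuf_refcnt_update(src, 1);
        return src;
}

So in the refcnt path nothing is written to the packet data at all; only the mbuf header is touched.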
# pktsz 1 (64 bytes)  |  pktsz 2 (128 bytes) |  pktsz 3 (256 bytes) |  pktsz 4 (512 bytes) |  pktsz 5 (1024 bytes) |
# memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt   |  memcpy     refcnt    |
  5949888    5806720  |  5831360    2890816  |  5640379    2886016  |  5107840    2863264  |  4510121    2692876   |

Throughput is in packets per second.

-- 
Thanks,
Shailja