Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?

From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: Shailja Pandey <csz168117@iitd.ac.in>,
	"Wiles, Keith" <keith.wiles@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
Date: Fri, 20 Apr 2018 10:05:45 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258AE9189DB@IRSMSX102.ger.corp.intel.com>
In-Reply-To: <611770bb-d0c9-7f4e-9c3d-3b572c9e8023@iitd.ac.in>

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Shailja Pandey
> Sent: Thursday, April 19, 2018 3:30 PM
> To: Wiles, Keith <keith.wiles@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update()
> function?
> 
> > The two code fragments work in two different ways: the first uses a loop to create possibly more than one replica and the second
> > does not, correct? The loop can cost some performance, but the hit should be small.
> Sorry for the confusion; for the memcpy version we also use a loop
> outside of this function. Essentially, we make the same number of
> copies in both cases.
> > The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss
> > and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes
> > beyond 64 bytes you hit the second cacheline. Are you starting to see the problem here?
> We also ran the same experiment for different packet sizes (64B, 128B,
> 256B, 512B, 1024B, 1518B). The sharp drop in throughput is observed only
> when the packet size increases from 64B to 128B and not after that, even
> though the cacheline miss should happen for the other packet sizes as
> well. I am not sure why this is the case: why is the drop not sharp
> beyond 128B packets when replicating with rte_pktmbuf_refcnt_update()?
> 
> >   Every time you touch a new cacheline, performance will drop unless the cacheline is prefetched into cache first, but in this case that
> > really cannot be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
> I don't understand the complexity here; could you please explain it in
> detail?
> >
> > Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> >
> > I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you
> > know the number of packets you need to replicate up front.
> We are already using both of these functions; I used memcpy and
> rte_pktmbuf_alloc() only to simplify the pseudo-code.
> 
> # pktsz 1 (64 bytes)   | pktsz 2 (128 bytes)  | pktsz 3 (256 bytes)  | pktsz 4 (512 bytes)  | pktsz 5 (1024 bytes) |
> # memcpy     refcnt    | memcpy     refcnt    | memcpy     refcnt    | memcpy     refcnt    | memcpy     refcnt    |
>   5949888    5806720   | 5831360    2890816   | 5640379    2886016   | 5107840    2863264   | 4510121    2692876   |
> 
> Throughput is in MPPS.
> 

I wonder what NIC and TX function you use.
Is there any chance that multi-seg support is not on?
Konstantin
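
For readers following the thread, here is a minimal sketch of why the multi-seg question may matter. It assumes, based only on the hdr->next remark in the quoted text, that the refcnt variant chains a freshly allocated header segment to the shared payload, while the memcpy variant produces one private segment. The struct toy_mbuf and every function below are hypothetical stand-ins written for illustration; none of them is the real rte_mbuf layout or a DPDK API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf; NOT the real struct rte_mbuf. */
struct toy_mbuf {
    struct toy_mbuf *next;  /* next segment in the chain, or NULL */
    uint16_t refcnt;        /* number of packets referencing this segment */
    uint16_t data_len;      /* bytes stored in this segment */
    uint8_t data[2048];
};

/* Allocate a zeroed segment owned by exactly one packet. */
static struct toy_mbuf *toy_alloc(void)
{
    struct toy_mbuf *m = calloc(1, sizeof(*m));
    if (m != NULL)
        m->refcnt = 1;
    return m;
}

/* Copy-based replica: the payload is duplicated into one private segment. */
static struct toy_mbuf *replicate_by_copy(const struct toy_mbuf *pkt)
{
    struct toy_mbuf *c = toy_alloc();
    if (c == NULL)
        return NULL;
    assert(pkt->data_len <= sizeof(c->data));
    memcpy(c->data, pkt->data, pkt->data_len);
    c->data_len = pkt->data_len;
    return c;
}

/* Refcount-based replica: a new header segment is chained to the shared
 * payload segment, whose reference count is bumped instead of copying. */
static struct toy_mbuf *replicate_by_refcnt(struct toy_mbuf *pkt,
                                            const uint8_t *hdr,
                                            uint16_t hdr_len)
{
    struct toy_mbuf *h = toy_alloc();
    if (h == NULL)
        return NULL;
    assert(hdr_len <= sizeof(h->data));
    memcpy(h->data, hdr, hdr_len);
    h->data_len = hdr_len;
    h->next = pkt;   /* the replica now spans two segments */
    pkt->refcnt++;   /* payload is shared, not copied */
    return h;
}

/* Walk the chain and count the segments that make up one packet. */
static unsigned toy_nb_segs(const struct toy_mbuf *m)
{
    unsigned n = 0;
    for (; m != NULL; m = m->next)
        n++;
    return n;
}
```

In this toy model a replica from replicate_by_copy() always has one segment, whereas a replica from replicate_by_refcnt() has two, so the transmit side has to walk a chain for every packet. If the real code behaves the same way, a TX path without multi-segment support would presumably not be able to send such packets as intended, which is one possible reading of the question above.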