Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt

* Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
@ 2018-04-18 16:43 Shailja Pandey
  2018-04-18 18:36 ` Wiles, Keith
  0 siblings, 1 reply; 6+ messages in thread
From: Shailja Pandey @ 2018-04-18 16:43 UTC (permalink / raw)
  To: dev

Hello,

I am doing packet replication, and I need to change the Ethernet and IP
header fields of each replicated packet. I did it in two different ways:

 1. Share the payload of the original packet using rte_mbuf_refcnt_update()
    and allocate a new mbuf for the L2-L4 headers.
 2. memcpy() the payload from the original packet into a newly created mbuf
    and prepend the L2-L4 headers to that mbuf.

I ran experiments with varying replication factors and packet sizes and
found that memcpy() performs far better than rte_mbuf_refcnt_update().
I am not sure why this happens, or what makes rte_mbuf_refcnt_update()
even slower than memcpy().

Here is the sample code for both implementations:

1. Using rte_mbuf_refcnt_update():

         struct rte_mbuf *pkt = original packet;

         /* Strip the original L2/L3 headers; only the payload is shared. */
         rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) +
                                         sizeof(struct ipv4_hdr)));
         /* Take one extra reference per replica. */
         rte_pktmbuf_refcnt_update(pkt, replication_factor);
         for (int i = 0; i < replication_factor; i++) {
                 struct rte_mbuf *hdr;

                 if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
                         printf("Failed while cloning $$$\n");
                         return NULL;
                 }
                 /* Chain the shared payload behind the new header mbuf. */
                 hdr->next = pkt;
                 hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
                 hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
                 /* Update more metadata fields */

                 /* Each prepend lands in front of the previous one, so add
                  * the IPv4 header first and the Ethernet header last. */
                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
                 /* Modify L3 fields */

                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
                 /* Modify L2 fields */
                 .
                 .
                 .
         }

2. Using memcpy():

         struct rte_mbuf *pkt = original packet;
         struct rte_mbuf *hdr;

         if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
                 printf("Failed while cloning $$$\n");
                 return NULL;
         }

         /* Reserve room for the whole packet; this fails (returns NULL)
          * if pkt_len does not fit in the headroom of hdr. */
         char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
         if (eth_hdr == NULL) {
                 printf("panic\n");
                 rte_pktmbuf_free(hdr);
                 return NULL;
         }
         /* Copy headers and payload; assumes pkt is a single segment. */
         char *b = rte_pktmbuf_mtod(pkt, char *);
         memcpy(eth_hdr, b, pkt->pkt_len);
         /* Change L2-L4 header fields in the new packet */

The throughput roughly halves when the packet size is increased from
64 bytes to 128 bytes and replication is done using
rte_mbuf_refcnt_update(). With memcpy(), the throughput stays more or
less the same as the packet size increases.

Any help would be appreciated.

--

Thanks,
Shailja

* Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
  2018-04-18 16:43 Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function? Shailja Pandey
@ 2018-04-18 18:36 ` Wiles, Keith
  2018-04-19 14:30   ` Shailja Pandey
  0 siblings, 1 reply; 6+ messages in thread
From: Wiles, Keith @ 2018-04-18 18:36 UTC (permalink / raw)
  To: Shailja Pandey; +Cc: dev

> On Apr 18, 2018, at 11:43 AM, Shailja Pandey <csz168117@iitd.ac.in> wrote:
> 
> Hello,
> 
> I am doing packet replication and I need to change the ethernet and IP header field for each replicated packet. I did it in two different ways:
> 
> 1. Share payload from the original packet using rte_mbuf_refcnt_update
>   and allocate new mbuf for L2-L4 headers.
> 2. memcpy() payload from the original packet to newly created mbuf and
>   prepend L2-L4 headers to the mbuf.
> 
> I performed experiments with varying replication factor as well as varying packet size and found that memcpy() is performing way better than using rte_mbuf_refcnt_update(). But I am not sure why it is happening and what is making rte_mbuf_refcnt_update() even worse than memcpy().
> 
> Here is the sample code for both implementations:

The two code fragments do different things: the first uses a loop to create possibly more than one replica and the second does not, correct? The loop can cost some performance, but the hit should be small.

The first one uses the hdr->next pointer, which is in the second cacheline of the mbuf header; this can and will cause a cacheline miss and degrade your performance. The second fragment does not touch hdr->next and will not cause that cacheline miss. When the packet goes beyond 64 bytes you hit a second cacheline; are you starting to see the problem here? Every time you touch a new cacheline, performance will drop unless the cacheline is prefetched into cache first, but in this case that really cannot be done easily. Count the cachelines you are touching and make sure the number is the same in each case.

On Intel x86 systems the cacheline size is 64 bytes; other architectures may use different sizes.
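
A small sketch may help with the cacheline counting suggested above. It uses a hypothetical two-cacheline `struct mock_mbuf` (not the real `struct rte_mbuf` layout) and assumes a 64-byte cacheline; `cache_line_index()` and `cache_lines_touched()` are illustrative helper names, not DPDK functions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical x86 cacheline size */

/* Hypothetical stand-in for a two-cacheline buffer header.
 * This is NOT the real struct rte_mbuf layout. */
struct mock_mbuf {
    unsigned char rx_fields[CACHE_LINE_SIZE]; /* stand-in for first-line fields */
    struct mock_mbuf *next;                   /* first field of the second line */
    unsigned char tx_fields[CACHE_LINE_SIZE - sizeof(struct mock_mbuf *)];
};

/* The model only makes sense if the header is exactly two lines long. */
static_assert(sizeof(struct mock_mbuf) == 2 * CACHE_LINE_SIZE,
              "mock header must span exactly two cachelines");

/* Index of the cacheline, relative to the object start, holding a byte offset. */
static size_t cache_line_index(size_t offset)
{
    return offset / CACHE_LINE_SIZE;
}

/* Number of cachelines spanned by len bytes starting at address addr. */
static size_t cache_lines_touched(size_t addr, size_t len)
{
    if (len == 0)
        return 0;
    size_t first = addr / CACHE_LINE_SIZE;
    size_t last = (addr + len - 1) / CACHE_LINE_SIZE;
    return last - first + 1;
}
```

In this model `next` sits at offset 64, so it falls in cacheline index 1, and a 128-byte packet that starts on a cacheline boundary spans two cachelines where a 64-byte one spans a single line.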

> 
> 1. Using rte_mbuf_refcnt_update():
>
>         struct rte_mbuf *pkt = original packet;
>
>         /* Strip the original L2/L3 headers; only the payload is shared. */
>         rte_pktmbuf_adj(pkt, (uint16_t)(sizeof(struct ether_hdr) +
>                                         sizeof(struct ipv4_hdr)));
>         /* Take one extra reference per replica. */
>         rte_pktmbuf_refcnt_update(pkt, replication_factor);
>         for (int i = 0; i < replication_factor; i++) {
>                 struct rte_mbuf *hdr;
>
>                 if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>                         printf("Failed while cloning $$$\n");
>                         return NULL;
>                 }
>                 /* Chain the shared payload behind the new header mbuf. */
>                 hdr->next = pkt;
>                 hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
>                 hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
>                 /* Update more metadata fields */
>
>                 /* Each prepend lands in front of the previous one, so add
>                  * the IPv4 header first and the Ethernet header last. */
>                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ipv4_hdr));
>                 /* Modify L3 fields */
>
>                 rte_pktmbuf_prepend(hdr, (uint16_t)sizeof(struct ether_hdr));
>                 /* Modify L2 fields */
>                 .
>                 .
>                 .
>         }
>
> 2. Using memcpy():
>
>         struct rte_mbuf *pkt = original packet;
>         struct rte_mbuf *hdr;
>
>         if (unlikely((hdr = rte_pktmbuf_alloc(header_pool)) == NULL)) {
>                 printf("Failed while cloning $$$\n");
>                 return NULL;
>         }
>
>         /* Reserve room for the whole packet; this fails (returns NULL)
>          * if pkt_len does not fit in the headroom of hdr. */
>         char *eth_hdr = (char *)rte_pktmbuf_prepend(hdr, pkt->pkt_len);
>         if (eth_hdr == NULL) {
>                 printf("panic\n");
>                 rte_pktmbuf_free(hdr);
>                 return NULL;
>         }
>         /* Copy headers and payload; assumes pkt is a single segment. */
>         char *b = rte_pktmbuf_mtod(pkt, char *);
>         memcpy(eth_hdr, b, pkt->pkt_len);
>         /* Change L2-L4 header fields in the new packet */
>
> The throughput roughly halves when the packet size is increased from 64 bytes to 128 bytes and replication is done using rte_mbuf_refcnt_update(). With memcpy(), the throughput stays more or less the same as the packet size increases.

Why did you use memcpy() and not rte_memcpy() here? rte_memcpy() should be faster.

I believe DPDK now has an rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know up front how many packets you need to replicate.

> 
> Any help would be appreciated.
> 
> --
> 
> Thanks,
> Shailja
> 

Regards,
Keith

* Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
  2018-04-18 18:36 ` Wiles, Keith
@ 2018-04-19 14:30   ` Shailja Pandey
  2018-04-19 16:08     ` Wiles, Keith
  2018-04-20 10:05     ` Ananyev, Konstantin
  0 siblings, 2 replies; 6+ messages in thread
From: Shailja Pandey @ 2018-04-19 14:30 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

> The two code fragments are doing two different ways the first is using a loop to create possible more then one replication and the second one is not, correct? The loop can cause performance hits, but should be small.
Sorry for the confusion. For the memcpy version we also use a loop
outside of this function, so we make the same number of copies in
both cases.
> The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64bytes then you hit the second cacheline, are you starting to see the problem here.
We also ran the same experiment for different packet sizes (64B, 128B,
256B, 512B, 1024B, 1518B). The sharp drop in throughput is observed only
when the packet size increases from 64B to 128B, and not after that. But
cacheline misses should happen for the other packet sizes too, so I am
not sure why this is the case. Why is the drop not sharp beyond 128B
packets when replicating with rte_pktmbuf_refcnt_update()?

>   Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
I don't understand the complexity here; could you please explain it in
detail?
>
> Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
>
> I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.
We are already using both of these functions; I used memcpy() and
rte_pktmbuf_alloc() only to simplify the pseudo-code.

Packet size   memcpy     refcnt
64 bytes      5949888    5806720
128 bytes     5831360    2890816
256 bytes     5640379    2886016
512 bytes     5107840    2863264
1024 bytes    4510121    2692876

Throughput is in packets per second (roughly 2.7 to 5.9 Mpps).
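
To make the comparison easier to read, the refcnt-to-memcpy throughput ratio can be computed for each packet size from the figures reported above. The sketch below is only arithmetic on those numbers; the `samples` array and `refcnt_to_memcpy_ratio()` are names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the reported measurements. */
struct sample {
    unsigned size_bytes;
    unsigned long memcpy_pps;
    unsigned long refcnt_pps;
};

/* Figures copied from the results in this message. */
static const struct sample samples[] = {
    {   64, 5949888UL, 5806720UL },
    {  128, 5831360UL, 2890816UL },
    {  256, 5640379UL, 2886016UL },
    {  512, 5107840UL, 2863264UL },
    { 1024, 4510121UL, 2692876UL },
};

#define NUM_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Ratio of refcnt throughput to memcpy throughput for row i. */
static double refcnt_to_memcpy_ratio(size_t i)
{
    assert(i < NUM_SAMPLES);
    return (double)samples[i].refcnt_pps / (double)samples[i].memcpy_pps;
}
```

For 64-byte packets the ratio is about 0.98, while for 128-byte packets it is about 0.50.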

-- 

Thanks,
Shailja

* Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
  2018-04-19 14:30   ` Shailja Pandey
@ 2018-04-19 16:08     ` Wiles, Keith
  2018-04-23 14:12       ` Shailja Pandey
  2018-04-20 10:05     ` Ananyev, Konstantin
  1 sibling, 1 reply; 6+ messages in thread
From: Wiles, Keith @ 2018-04-19 16:08 UTC (permalink / raw)
  To: Shailja Pandey; +Cc: dev



> On Apr 19, 2018, at 9:30 AM, Shailja Pandey <csz168117@iitd.ac.in> wrote:
> 
>> The two code fragments are doing two different ways the first is using a loop to create possible more then one replication and the second one is not, correct? The loop can cause performance hits, but should be small.
> Sorry for the confusion, for memcpy version also we are using a loop outside of this function. Essentially, we are making same number of copies in both the cases.
>> The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64bytes then you hit the second cacheline, are you starting to see the problem here.
> We also performed same experiment for different packet sizes(64B, 128B, 256B, 512B, 1024B, 1518B), the sharp drop in throughput is observed only when the packet size increases from 64B to 128B and not after that. So, cacheline miss should happen for other packet sizes also. I am not sure why this is the case. Why the drop is not sharp after 128 B packets when replicated using rte_pktmbuf_refcnt_update().
> 
>>  Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
> I don't understand the complexity here, could you please explain it in detail.

In this case you cannot issue a prefetch for the other cachelines far enough in advance to avoid a CPU stall on a cacheline miss.
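
For context, software prefetching only helps when the address is known several iterations before the data is used. The sketch below shows that general pattern with the GCC/Clang builtin `__builtin_prefetch()`; the `struct item` type, the `PREFETCH_DISTANCE` value, and `sum_with_prefetch()` are assumptions made for illustration and are not taken from the code under discussion.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 4 /* hypothetical; the right value must be measured */

/* Hypothetical element padded to one 64-byte cacheline. */
struct item {
    unsigned long value;
    unsigned char pad[64 - sizeof(unsigned long)];
};

/* Sum the values, requesting each element a few iterations before use. */
static unsigned long sum_with_prefetch(struct item *const *items, size_t n)
{
    unsigned long sum = 0;

    assert(items != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Read-only prefetch (rw = 0) with high temporal locality (3). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(items[i + PREFETCH_DISTANCE], 0, 3);
        sum += items[i]->value;
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result. In the replication case the header mbuf is presumably written right after it is allocated, which leaves little room to issue such a hint early enough.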

>> 
>> Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?

Still did not answer this question.

>> 
>> I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.
> We are already using both of these functions, just to simplify the pseudo-code I used memcpy and rte_pktmbuf_alloc().

Then please show the real code fragment as your example was confusing.

> 
> Packet size   memcpy     refcnt
> 64 bytes      5949888    5806720
> 128 bytes     5831360    2890816
> 256 bytes     5640379    2886016
> 512 bytes     5107840    2863264
> 1024 bytes    4510121    2692876
> 

Refcnt also needs to adjust the value using an atomic update, and you still have not told me what type of system you are on: x86 or something else?
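
The atomic update mentioned here can be illustrated with C11 atomics. This is a simplified model of a shared-buffer reference count, not DPDK's implementation; `struct shared_buf`, `buf_ref_add()`, and `buf_release()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared buffer carrying only a reference count. */
struct shared_buf {
    atomic_ushort refcnt;
};

/* Atomically add n references and return the new count. */
static unsigned short buf_ref_add(struct shared_buf *b, unsigned short n)
{
    assert(b != NULL);
    return (unsigned short)(atomic_fetch_add(&b->refcnt, n) + n);
}

/* Atomically drop one reference; true if the caller held the last one
 * and is therefore responsible for freeing the buffer. */
static bool buf_release(struct shared_buf *b)
{
    assert(b != NULL);
    return atomic_fetch_sub(&b->refcnt, 1) == 1;
}
```

Each call is a read-modify-write on the cacheline that holds the counter, which is the extra cost the refcnt approach pays compared with a plain store.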

Please describe your whole system: host OS, DPDK version, NICs used, … A number of people have run similar performance tests and do not see the problem you are describing. Maybe modify, say, L3fwd (which does something similar to your example code) and see if you still see the difference. Then you can post the patch to that example app and we can try to figure it out.

> Throughput is in MPPS.
> 
> -- 
> 
> Thanks,
> Shailja
> 

Regards,
Keith


* Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
  2018-04-19 16:08     ` Wiles, Keith
@ 2018-04-23 14:12       ` Shailja Pandey
  0 siblings, 0 replies; 6+ messages in thread
From: Shailja Pandey @ 2018-04-23 14:12 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev



On Thursday 19 April 2018 09:38 PM, Wiles, Keith wrote:
>
>> On Apr 19, 2018, at 9:30 AM, Shailja Pandey <csz168117@iitd.ac.in> wrote:
>>
>>> The two code fragments are doing two different ways the first is using a loop to create possible more then one replication and the second one is not, correct? The loop can cause performance hits, but should be small.
>> Sorry for the confusion, for memcpy version also we are using a loop outside of this function. Essentially, we are making same number of copies in both the cases.
>>> The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes beyond 64bytes then you hit the second cacheline, are you starting to see the problem here.
>> We also performed same experiment for different packet sizes(64B, 128B, 256B, 512B, 1024B, 1518B), the sharp drop in throughput is observed only when the packet size increases from 64B to 128B and not after that. So, cacheline miss should happen for other packet sizes also. I am not sure why this is the case. Why the drop is not sharp after 128 B packets when replicated using rte_pktmbuf_refcnt_update().
>>
>>>   Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
>> I don't understand the complexity here, could you please explain it in detail.
> In this case you can not do a prefetch on other cache lines far enough in advance to not get a CPU stall for a cacheline.
>
>>> Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> Still did not answer this question.
>
>>> I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you know the number of packets you need to replicate up front.
>> We are already using both of these functions, just to simplify the pseudo-code I used memcpy and rte_pktmbuf_alloc().
> Then please show the real code fragment as your example was confusing.

In our experiments with packet replication using
rte_pktmbuf_refcnt_update(), we observed a sharp drop in throughput
when the packet size changed from 64B to 128B because the replicated
packets were not being sent. Only the original packets were sent, so
throughput dropped to roughly half of the 64B case, where both
replicated and original packets were sent. The actual cause was that
the ether_type field was not being set correctly for replicated
packets, so the replicated packets were dropped at the hardware level.

We did not notice this because with 64B packets it was not a problem:
the NIC transmitted both original and replicated packets even though
the ether_type field was not set correctly. For packets of 128B and
larger, the driver handed the replicated packets to the NIC, but the
NIC did not transmit them on the wire, hence the drop in throughput.

After setting the ether_type field correctly for packets of 128B and
larger, the throughput is similar for all packet sizes.
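
As an illustration of the fix described above, the sketch below writes the EtherType into a raw Ethernet header in network byte order. It is a standalone model rather than the code used in the experiments: `fill_l2_header()` and `read_ether_type()` are hypothetical helpers, and IPv4 (0x0800) is assumed as the payload type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_ETHER_ADDR_LEN 6
#define MOCK_ETHERTYPE_IPV4 0x0800u /* assumption: replicas carry IPv4 */

/* Fill the 14-byte Ethernet header at the start of frame.
 * The EtherType occupies bytes 12 and 13, most significant byte first. */
static void fill_l2_header(uint8_t *frame, const uint8_t *dst,
                           const uint8_t *src, uint16_t ether_type)
{
    assert(frame != NULL && dst != NULL && src != NULL);
    memcpy(frame, dst, MOCK_ETHER_ADDR_LEN);
    memcpy(frame + MOCK_ETHER_ADDR_LEN, src, MOCK_ETHER_ADDR_LEN);
    frame[12] = (uint8_t)(ether_type >> 8);
    frame[13] = (uint8_t)(ether_type & 0xff);
}

/* Read the EtherType back in host byte order. */
static uint16_t read_ether_type(const uint8_t *frame)
{
    assert(frame != NULL);
    return (uint16_t)((frame[12] << 8) | frame[13]);
}
```

Writing the two bytes explicitly keeps the sketch independent of host endianness. A freshly allocated header buffer holds whatever was there before, so every replica needs this field written along with the addresses.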

>
>> Packet size   memcpy     refcnt
>> 64 bytes      5949888    5806720
>> 128 bytes     5831360    2890816
>> 256 bytes     5640379    2886016
>> 512 bytes     5107840    2863264
>> 1024 bytes    4510121    2692876
>>
> Refcnt also needs to adjust the value using a atomic update and you still have not told me the type of system you are on x86 or ???
>
> Please describe your total system Host OS, DPDK version, NICs used, … a number of people have performance similar test and do not see the problem you are suggesting. Maybe modify say L3fwd (which does some thing similar to your example code) and see if you still see the difference. They you can post the patch to that example app and we can try to figure it out.
>
>> Throughput is in MPPS.
>>
>> -- 
>>
>> Thanks,
>> Shailja
>>
> Regards,
> Keith
>
Thanks again!

-- 

Thanks,
Shailja

* Re: Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update() function?
  2018-04-19 14:30   ` Shailja Pandey
  2018-04-19 16:08     ` Wiles, Keith
@ 2018-04-20 10:05     ` Ananyev, Konstantin
  1 sibling, 0 replies; 6+ messages in thread
From: Ananyev, Konstantin @ 2018-04-20 10:05 UTC (permalink / raw)
  To: Shailja Pandey, Wiles, Keith; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Shailja Pandey
> Sent: Thursday, April 19, 2018 3:30 PM
> To: Wiles, Keith <keith.wiles@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Why packet replication is more efficient when done using memcpy( ) as compared to rte_mbuf_refcnt_update()
> function?
> 
> > The two code fragments are doing two different ways the first is using a loop to create possible more then one replication and the second
> one is not, correct? The loop can cause performance hits, but should be small.
> Sorry for the confusion, for memcpy version also we are using a loop
> outside of this function. Essentially, we are making same number of
> copies in both the cases.
> > The first one is using the hdr->next pointer which is in the second cacheline of the mbuf header, this can and will cause a cacheline miss
> and degrade your performance. The second code does not touch hdr->next and will not cause a cacheline miss. When the packet goes
> beyond 64bytes then you hit the second cacheline, are you starting to see the problem here.
> We also performed same experiment for different packet sizes(64B, 128B,
> 256B, 512B, 1024B, 1518B), the sharp drop in throughput is observed only
> when the packet size increases from 64B to 128B and not after that. So,
> cacheline miss should happen for other packet sizes also. I am not sure
> why this is the case. Why the drop is not sharp after 128 B packets when
> replicated using rte_pktmbuf_refcnt_update().
> 
> >   Every time you touch a new cache line performance will drop unless the cacheline is prefetched into memory first, but in this case it
> really can not be done easily. Count the cachelines you are touching and make sure they are the same number in each case.
> I don't understand the complexity here, could you please explain it in
> detail.
> >
> > Why did you use memcpy and not rte_memcpy here as rte_memcpy should be faster?
> >
> > I believe now DPDK has a rte_pktmbuf_alloc_bulk() function to reduce the number of rte_pktmbuf_alloc() calls, which should help if you
> know the number of packets you need to replicate up front.
> We are already using both of these functions, just to simplify the
> pseudo-code I used memcpy and rte_pktmbuf_alloc().
> 
> Packet size   memcpy     refcnt
> 64 bytes      5949888    5806720
> 128 bytes     5831360    2890816
> 256 bytes     5640379    2886016
> 512 bytes     5107840    2863264
> 1024 bytes    4510121    2692876
> 
> Throughput is in MPPS.
> 

I wonder which NIC and TX function you use.
Is there any chance that multi-segment support is not enabled?
Konstantin
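
One way to sanity-check a chained packet before handing it to a TX function is to verify that the first segment's totals match the chain. The sketch below does this on a simplified `struct mock_seg` (an assumption for illustration, not `struct rte_mbuf`); `chain_is_consistent()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified packet segment. */
struct mock_seg {
    struct mock_seg *next; /* next segment in the chain, or NULL */
    uint32_t pkt_len;      /* total packet length; valid on the first segment */
    uint16_t data_len;     /* bytes held by this segment */
    uint16_t nb_segs;      /* segment count; valid on the first segment */
};

/* True if pkt_len and nb_segs on the head agree with the actual chain. */
static bool chain_is_consistent(const struct mock_seg *head)
{
    assert(head != NULL);

    uint32_t total = 0;
    uint16_t count = 0;

    for (const struct mock_seg *s = head; s != NULL; s = s->next) {
        total += s->data_len;
        count++;
    }
    return total == head->pkt_len && count == head->nb_segs;
}
```

A chain that passes this check still needs a TX path that handles multi-segment packets; whether it does depends on the driver and its configuration.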
