* [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
@ 2016-05-24 14:50 Jerin Jacob
  2016-05-24 14:59 ` Olivier Matz
  2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
  0 siblings, 2 replies; 32+ messages in thread
From: Jerin Jacob @ 2016-05-24 14:50 UTC (permalink / raw)
  To: dev; +Cc: thomas.monjalon, bruce.richardson, konstantin.ananyev, Jerin Jacob

Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_mempool/rte_mempool.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index ed2c110..ebe399a 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -74,6 +74,7 @@
 #include <rte_memory.h>
 #include <rte_branch_prediction.h>
 #include <rte_ring.h>
+#include <rte_memcpy.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 		    unsigned n, __rte_unused int is_mp)
 {
 	struct rte_mempool_cache *cache;
-	uint32_t index;
 	void **cache_objs;
 	unsigned lcore_id = rte_lcore_id();
 	uint32_t cache_size = mp->cache_size;
@@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
-	for (index = 0; index < n; ++index, obj_table++)
-		cache_objs[index] = *obj_table;
+	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
 	cache->len += n;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-24 14:50 [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy Jerin Jacob
@ 2016-05-24 14:59 ` Olivier Matz
  2016-05-24 15:17   ` Jerin Jacob
  2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
  1 sibling, 1 reply; 32+ messages in thread
From: Olivier Matz @ 2016-05-24 14:59 UTC (permalink / raw)
  To: Jerin Jacob, dev; +Cc: thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,


On 05/24/2016 04:50 PM, Jerin Jacob wrote:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  lib/librte_mempool/rte_mempool.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> index ed2c110..ebe399a 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -74,6 +74,7 @@
>  #include <rte_memory.h>
>  #include <rte_branch_prediction.h>
>  #include <rte_ring.h>
> +#include <rte_memcpy.h>
>  
>  #ifdef __cplusplus
>  extern "C" {
> @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  		    unsigned n, __rte_unused int is_mp)
>  {
>  	struct rte_mempool_cache *cache;
> -	uint32_t index;
>  	void **cache_objs;
>  	unsigned lcore_id = rte_lcore_id();
>  	uint32_t cache_size = mp->cache_size;
> @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  	 */
>  
>  	/* Add elements back into the cache */
> -	for (index = 0; index < n; ++index, obj_table++)
> -		cache_objs[index] = *obj_table;
> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>  
>  	cache->len += n;
>  
> 

The commit title should be "mempool" instead of "mbuf".
Are you seeing some performance improvement by using rte_memcpy()?

Regards
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-24 14:59 ` Olivier Matz
@ 2016-05-24 15:17   ` Jerin Jacob
  2016-05-27 10:24     ` Hunt, David
  2016-06-24 15:56     ` Hunt, David
  0 siblings, 2 replies; 32+ messages in thread
From: Jerin Jacob @ 2016-05-24 15:17 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
> Hi Jerin,
> 
> 
> On 05/24/2016 04:50 PM, Jerin Jacob wrote:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> >  lib/librte_mempool/rte_mempool.h | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > index ed2c110..ebe399a 100644
> > --- a/lib/librte_mempool/rte_mempool.h
> > +++ b/lib/librte_mempool/rte_mempool.h
> > @@ -74,6 +74,7 @@
> >  #include <rte_memory.h>
> >  #include <rte_branch_prediction.h>
> >  #include <rte_ring.h>
> > +#include <rte_memcpy.h>
> >  
> >  #ifdef __cplusplus
> >  extern "C" {
> > @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  		    unsigned n, __rte_unused int is_mp)
> >  {
> >  	struct rte_mempool_cache *cache;
> > -	uint32_t index;
> >  	void **cache_objs;
> >  	unsigned lcore_id = rte_lcore_id();
> >  	uint32_t cache_size = mp->cache_size;
> > @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  	 */
> >  
> >  	/* Add elements back into the cache */
> > -	for (index = 0; index < n; ++index, obj_table++)
> > -		cache_objs[index] = *obj_table;
> > +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >  
> >  	cache->len += n;
> >  
> > 
> 
> The commit title should be "mempool" instead of "mbuf".

I will fix it.

> Are you seeing some performance improvement by using rte_memcpy()?

Yes, in some cases. In the default case, the loop was replaced with memcpy by
the compiler itself (gcc 5.3). But when I tried the external mempool manager
patch, performance dropped by almost 800Kpps. Debugging further, it turned out
that an unrelated change in the external mempool manager was knocking out the
memcpy; an explicit rte_memcpy brought back 500Kpps. The remaining 300Kpps drop
is still unknown (in my test setup, packets are in the local cache, so it must
be something to do with a __mempool_put_bulk text alignment change or similar).

Anyone else observed performance drop with external poolmanager?

Jerin

> 
> Regards
> Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-24 14:50 [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy Jerin Jacob
  2016-05-24 14:59 ` Olivier Matz
@ 2016-05-26  8:07 ` Jerin Jacob
  2016-05-30  8:45   ` Olivier Matz
                     ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Jerin Jacob @ 2016-05-26  8:07 UTC (permalink / raw)
  To: dev
  Cc: thomas.monjalon, bruce.richardson, konstantin.ananyev,
	olivier.matz, Jerin Jacob

Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
v1..v2
Corrected the git commit message (s/mbuf/mempool/g)
---
 lib/librte_mempool/rte_mempool.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 60339bd..24876a2 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -73,6 +73,7 @@
 #include <rte_memory.h>
 #include <rte_branch_prediction.h>
 #include <rte_ring.h>
+#include <rte_memcpy.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -739,7 +740,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 		    unsigned n, int is_mp)
 {
 	struct rte_mempool_cache *cache;
-	uint32_t index;
 	void **cache_objs;
 	unsigned lcore_id = rte_lcore_id();
 	uint32_t cache_size = mp->cache_size;
@@ -768,8 +768,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
-	for (index = 0; index < n; ++index, obj_table++)
-		cache_objs[index] = *obj_table;
+	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
 	cache->len += n;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-24 15:17   ` Jerin Jacob
@ 2016-05-27 10:24     ` Hunt, David
  2016-05-27 11:42       ` Jerin Jacob
  2016-05-27 13:45       ` Hunt, David
  2016-06-24 15:56     ` Hunt, David
  1 sibling, 2 replies; 32+ messages in thread
From: Hunt, David @ 2016-05-27 10:24 UTC (permalink / raw)
  To: Jerin Jacob, Olivier Matz
  Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev



On 5/24/2016 4:17 PM, Jerin Jacob wrote:
> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>
>> Are you seeing some performance improvement by using rte_memcpy()?
> Yes, In some case, In default case, It was replaced with memcpy by the
> compiler itself(gcc 5.3). But when I tried external mempool manager patch and
> then performance dropped almost 800Kpps. Debugging further it turns out that
> external mempool managers unrelated change was knocking out the memcpy.
> explicit rte_memcpy brought back 500Kpps. Remaing 300Kpps drop is still
> unknown(In my test setup, packets are in the local cache, so it must be
> something do with __mempool_put_bulk text alignment change or similar.
>
> Anyone else observed performance drop with external poolmanager?
>
> Jerin

Jerin,
     I'm seeing a 300kpps drop in throughput when I apply this on top of 
the external
mempool manager patch. If you're seeing an increase if you apply this 
patch first, then
a drop when applying the mempool manager, the two patches must be 
conflicting in
some way. We probably need to investigate further.
Regards,
Dave.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-27 10:24     ` Hunt, David
@ 2016-05-27 11:42       ` Jerin Jacob
  2016-05-27 15:05         ` Thomas Monjalon
  2016-05-27 13:45       ` Hunt, David
  1 sibling, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-05-27 11:42 UTC (permalink / raw)
  To: Hunt, David
  Cc: Olivier Matz, dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Fri, May 27, 2016 at 11:24:57AM +0100, Hunt, David wrote:
> 
> 
> On 5/24/2016 4:17 PM, Jerin Jacob wrote:
> > On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
> > 
> > > Are you seeing some performance improvement by using rte_memcpy()?
> > Yes, In some case, In default case, It was replaced with memcpy by the
> > compiler itself(gcc 5.3). But when I tried external mempool manager patch and
> > then performance dropped almost 800Kpps. Debugging further it turns out that
> > external mempool managers unrelated change was knocking out the memcpy.
> > explicit rte_memcpy brought back 500Kpps. Remaing 300Kpps drop is still
> > unknown(In my test setup, packets are in the local cache, so it must be
> > something do with __mempool_put_bulk text alignment change or similar.
> > 
> > Anyone else observed performance drop with external poolmanager?
> > 
> > Jerin
> 
> Jerin,
>     I'm seeing a 300kpps drop in throughput when I apply this on top of the
> external
> mempool manager patch. If you're seeing an increase if you apply this patch
> first, then
> a drop when applying the mempool manager, the two patches must be
> conflicting in
> some way. We probably need to investigate further.

In general, my concern is that this patch will most probably also get dropped
on the floor due to conflicts between architectures, with some
architectures/platforms having to maintain it out of tree.

Unlike other projects, DPDK modules are hand optimized, so some changes depend
on register allocation, compiler version, text alignment, etc.

IMHO, we should have a means to abstract such _logical_ changes under
conditional compilation flags, so that any arch/platform can select whatever
suits it best.

We may NOT need frequent patches to select a specific configuration, but
logical patches under compilation flags could be accepted, and each
arch/platform could then choose its specific configuration when we make
the final release candidate for the release.
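
For example, such a guarded change in __mempool_put_bulk() could look roughly
like the sketch below (the flag name is purely hypothetical, and the existing
loop, with its index variable, is kept as the fallback):

#ifdef RTE_MEMPOOL_USE_RTE_MEMCPY	/* hypothetical config flag */
	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
#else
	for (index = 0; index < n; ++index, obj_table++)
		cache_objs[index] = *obj_table;
#endif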

Any thoughts?

Jerin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-27 10:24     ` Hunt, David
  2016-05-27 11:42       ` Jerin Jacob
@ 2016-05-27 13:45       ` Hunt, David
  1 sibling, 0 replies; 32+ messages in thread
From: Hunt, David @ 2016-05-27 13:45 UTC (permalink / raw)
  To: Jerin Jacob, Olivier Matz
  Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev



On 5/27/2016 11:24 AM, Hunt, David wrote:
>
>
> On 5/24/2016 4:17 PM, Jerin Jacob wrote:
>> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>>
>>> Are you seeing some performance improvement by using rte_memcpy()?
>> Yes, In some case, In default case, It was replaced with memcpy by the
>> compiler itself(gcc 5.3). But when I tried external mempool manager 
>> patch and
>> then performance dropped almost 800Kpps. Debugging further it turns 
>> out that
>> external mempool managers unrelated change was knocking out the memcpy.
>> explicit rte_memcpy brought back 500Kpps. Remaing 300Kpps drop is still
>> unknown(In my test setup, packets are in the local cache, so it must be
>> something do with __mempool_put_bulk text alignment change or similar.
>>
>> Anyone else observed performance drop with external poolmanager?
>>
>> Jerin
>
> Jerin,
>     I'm seeing a 300kpps drop in throughput when I apply this on top 
> of the external
> mempool manager patch. If you're seeing an increase if you apply this 
> patch first, then
> a drop when applying the mempool manager, the two patches must be 
> conflicting in
> some way. We probably need to investigate further.
> Regards,
> Dave.
>

On further investigation, I now have a setup with no performance 
degradation. My previous tests
were accessing the NICS on a different NUMA node. Once I initiated 
testPMD with the correct coremask,
the difference between pre and post rte_memcpy patch is negligible 
(maybe 0.1% drop).

Regards,
Dave.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-27 11:42       ` Jerin Jacob
@ 2016-05-27 15:05         ` Thomas Monjalon
  2016-05-30  8:44           ` Olivier Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Monjalon @ 2016-05-27 15:05 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Hunt, David, Olivier Matz, dev, bruce.richardson, konstantin.ananyev

2016-05-27 17:12, Jerin Jacob:
> IMHO, I think we should have means to abstract this _logical_ changes
> under conditional compilation flags and any arch/platform can choose
> to select what it suites better for that arch/platform.
> 
> We may NOT need to have frequent patches to select the specific
> configuration, but logical patches under compilation flags can be accepted and
> each arch/platform can choose specific set configuration when we make
> the final release candidate for the release.
> 
> Any thoughts?

Yes, having some #ifdefs for arch configuration may be reasonable.
But other methods should be preferred first:
1/ try implementing the function in arch-specific files
2/ and check at runtime if rte_cpu_get_flag_enabled(RTE_CPUFLAG_X)
3/ or check #ifdef RTE_MACHINE_CPUFLAG_X
4/ or check #ifdef RTE_ARCH_Y
5/ or check a specific #ifdef RTE_FEATURE_NAME to choose in config files

Option 2 is a nice-to-have which implies the other options; a sketch of it is
shown below.
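
As an illustration only (the helper name and the CPU flag picked here are
hypothetical, not existing code), option 2 could look like:

/* needs <rte_cpuflags.h> and <rte_memcpy.h> */
static inline void
mempool_cache_copy(void **dst, void * const *src, unsigned n)
{
	/* Runtime dispatch on a CPU flag; in practice the flag would rather
	 * be checked once at init time to select a function pointer. */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		rte_memcpy(dst, src, sizeof(void *) * n);
	else
		while (n-- > 0)
			dst[n] = src[n];
}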

Maybe that doc/guides/contributing/design.rst needs to be updated.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-27 15:05         ` Thomas Monjalon
@ 2016-05-30  8:44           ` Olivier Matz
  0 siblings, 0 replies; 32+ messages in thread
From: Olivier Matz @ 2016-05-30  8:44 UTC (permalink / raw)
  To: Thomas Monjalon, Jerin Jacob
  Cc: Hunt, David, dev, bruce.richardson, konstantin.ananyev



On 05/27/2016 05:05 PM, Thomas Monjalon wrote:
> 2016-05-27 17:12, Jerin Jacob:
>> IMHO, I think we should have means to abstract this _logical_ changes
>> under conditional compilation flags and any arch/platform can choose
>> to select what it suites better for that arch/platform.
>>
>> We may NOT need to have frequent patches to select the specific
>> configuration, but logical patches under compilation flags can be accepted and
>> each arch/platform can choose specific set configuration when we make
>> the final release candidate for the release.
>>
>> Any thoughts?
> 
> Yes having some #ifdefs for arch configuration may be reasonnable.
> But other methods must be preffered first:
> 1/ try implementing the function in arch-specific files

I agree with Thomas. This option should be preferred, and I think we
should avoid as much as possible having:

#if ARCH1
  do stuff optimized for arch1
#elif ARCH2
  do the same stuff optimized for arch2
#else
  ...
#endif


In this particular case, rte_memcpy() seems to be the appropriate
function, because it should already be arch-optimized.


> 2/ and check at runtime if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_X
> 3/ or check #ifdef RTE_MACHINE_CPUFLAG_X
> 4/ or check #ifdef RTE_ARCH_Y
> 5/ or check a specific #ifdef RTE_FEATURE_NAME to choose in config files
> 
> The option 2 is a nice to have which implies other options.
> 
> Maybe that doc/guides/contributing/design.rst needs to be updated.


Regards,
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
@ 2016-05-30  8:45   ` Olivier Matz
  2016-05-31 12:58     ` Jerin Jacob
  2016-06-30  9:41   ` Thomas Monjalon
  2016-06-30 12:16   ` [PATCH v3] " Jerin Jacob
  2 siblings, 1 reply; 32+ messages in thread
From: Olivier Matz @ 2016-05-30  8:45 UTC (permalink / raw)
  To: Jerin Jacob, dev; +Cc: thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

On 05/26/2016 10:07 AM, Jerin Jacob wrote:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
> v1..v2
> Corrected the the git commit message(s/mbuf/mempool/g)
> ---
>  lib/librte_mempool/rte_mempool.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> index 60339bd..24876a2 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -73,6 +73,7 @@
>  #include <rte_memory.h>
>  #include <rte_branch_prediction.h>
>  #include <rte_ring.h>
> +#include <rte_memcpy.h>
>  
>  #ifdef __cplusplus
>  extern "C" {
> @@ -739,7 +740,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  		    unsigned n, int is_mp)
>  {
>  	struct rte_mempool_cache *cache;
> -	uint32_t index;
>  	void **cache_objs;
>  	unsigned lcore_id = rte_lcore_id();
>  	uint32_t cache_size = mp->cache_size;
> @@ -768,8 +768,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>  	 */
>  
>  	/* Add elements back into the cache */
> -	for (index = 0; index < n; ++index, obj_table++)
> -		cache_objs[index] = *obj_table;
> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>  
>  	cache->len += n;
>  
> 

I also checked the get_bulk() function, which looks like this:

	/* Now fill in the response ... */
	for (index = 0, len = cache->len - 1;
			index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];

I think we could replace it by something like:

	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);

The only difference is that it won't reverse the pointers in the
table, but I don't see any problem with that.
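
Spelled out against the current code, the non-reversing copy would take the
last n entries of the cache, e.g. (a sketch only, assuming cache->len >= n
after the refill above):

	/* Now fill in the response, keeping insertion order (no reversal) */
	rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);
	cache->len -= n;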

What do you think?


Regards,
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-30  8:45   ` Olivier Matz
@ 2016-05-31 12:58     ` Jerin Jacob
  2016-05-31 21:05       ` Olivier MATZ
  0 siblings, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-05-31 12:58 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Mon, May 30, 2016 at 10:45:11AM +0200, Olivier Matz wrote:
> Hi Jerin,
> 
> On 05/26/2016 10:07 AM, Jerin Jacob wrote:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> > v1..v2
> > Corrected the the git commit message(s/mbuf/mempool/g)
> > ---
> >  lib/librte_mempool/rte_mempool.h | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > index 60339bd..24876a2 100644
> > --- a/lib/librte_mempool/rte_mempool.h
> > +++ b/lib/librte_mempool/rte_mempool.h
> > @@ -73,6 +73,7 @@
> >  #include <rte_memory.h>
> >  #include <rte_branch_prediction.h>
> >  #include <rte_ring.h>
> > +#include <rte_memcpy.h>
> >  
> >  #ifdef __cplusplus
> >  extern "C" {
> > @@ -739,7 +740,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  		    unsigned n, int is_mp)
> >  {
> >  	struct rte_mempool_cache *cache;
> > -	uint32_t index;
> >  	void **cache_objs;
> >  	unsigned lcore_id = rte_lcore_id();
> >  	uint32_t cache_size = mp->cache_size;
> > @@ -768,8 +768,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> >  	 */
> >  
> >  	/* Add elements back into the cache */
> > -	for (index = 0; index < n; ++index, obj_table++)
> > -		cache_objs[index] = *obj_table;
> > +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >  
> >  	cache->len += n;
> >  
> > 
> 
> I also checked in the get_bulk() function, which looks like that:
> 
> 	/* Now fill in the response ... */
> 	for (index = 0, len = cache->len - 1;
> 			index < n;
> 			++index, len--, obj_table++)
> 		*obj_table = cache_objs[len];
> 
> I think we could replace it by something like:
> 
> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
> 
> The only difference is that it won't reverse the pointers in the
> table, but I don't see any problem with that.
> 
> What do you think?

In the true sense, it will _not_ be LIFO. I'm not sure about the cache usage
implications for specific use cases.

Jerin

> 
> 
> Regards,
> Olivier
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-31 12:58     ` Jerin Jacob
@ 2016-05-31 21:05       ` Olivier MATZ
  2016-06-01  7:00         ` Jerin Jacob
  0 siblings, 1 reply; 32+ messages in thread
From: Olivier MATZ @ 2016-05-31 21:05 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

>>>  	/* Add elements back into the cache */
>>> -	for (index = 0; index < n; ++index, obj_table++)
>>> -		cache_objs[index] = *obj_table;
>>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>>>  
>>>  	cache->len += n;
>>>  
>>>
>>
>> I also checked in the get_bulk() function, which looks like that:
>>
>> 	/* Now fill in the response ... */
>> 	for (index = 0, len = cache->len - 1;
>> 			index < n;
>> 			++index, len--, obj_table++)
>> 		*obj_table = cache_objs[len];
>>
>> I think we could replace it by something like:
>>
>> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
>>
>> The only difference is that it won't reverse the pointers in the
>> table, but I don't see any problem with that.
>>
>> What do you think?
> 
> In true sense, it will _not_ be LIFO. Not sure about cache usage implications
> on the specific use cases.

Today, the object pointers are reversed only in get(). It means
that this code:

	rte_mempool_get_bulk(mp, table, 4);
	for (i = 0; i < 4; i++)
		printf("obj = %p\n", table[i]);
	rte_mempool_put_bulk(mp, table, 4);


	printf("-----\n");
	rte_mempool_get_bulk(mp, table, 4);
	for (i = 0; i < 4; i++)
		printf("obj = %p\n", table[i]);
	rte_mempool_put_bulk(mp, table, 4);

prints:

	addr1
	addr2
	addr3
	addr4
	-----
	addr4
	addr3
	addr2
	addr1

Which is quite strange.

I don't think it would be an issue to replace the loop with an
rte_memcpy(); it may increase the copy speed and it would be
more coherent with the put().


Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-31 21:05       ` Olivier MATZ
@ 2016-06-01  7:00         ` Jerin Jacob
  2016-06-02  7:36           ` Olivier MATZ
  0 siblings, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-06-01  7:00 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
> Hi Jerin,

Hi Olivier,

> 
> >>>  	/* Add elements back into the cache */
> >>> -	for (index = 0; index < n; ++index, obj_table++)
> >>> -		cache_objs[index] = *obj_table;
> >>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >>>  
> >>>  	cache->len += n;
> >>>  
> >>>
> >>
> >> I also checked in the get_bulk() function, which looks like that:
> >>
> >> 	/* Now fill in the response ... */
> >> 	for (index = 0, len = cache->len - 1;
> >> 			index < n;
> >> 			++index, len--, obj_table++)
> >> 		*obj_table = cache_objs[len];
> >>
> >> I think we could replace it by something like:
> >>
> >> 	rte_memcpy(obj_table, &cache_objs[len - n], sizeof(void *) * n);
> >>
> >> The only difference is that it won't reverse the pointers in the
> >> table, but I don't see any problem with that.
> >>
> >> What do you think?
> > 
> > In true sense, it will _not_ be LIFO. Not sure about cache usage implications
> > on the specific use cases.
> 
> Today, the objects pointers are reversed only in the get(). It means
> that this code:
> 
> 	rte_mempool_get_bulk(mp, table, 4);
> 	for (i = 0; i < 4; i++)
> 		printf("obj = %p\n", t[i]);
> 	rte_mempool_put_bulk(mp, table, 4);
> 
> 
> 	printf("-----\n");
> 	rte_mempool_get_bulk(mp, table, 4);
> 	for (i = 0; i < 4; i++)
> 		printf("obj = %p\n", t[i]);
> 	rte_mempool_put_bulk(mp, table, 4);
> 
> prints:
> 
> 	addr1
> 	addr2
> 	addr3
> 	addr4
> 	-----
> 	addr4
> 	addr3
> 	addr2
> 	addr1
> 
> Which is quite strange.

IMO, it is the expected LIFO behavior, right?

What is not expected is the following, which is the case after the change. Or
am I missing something here?

addr1
addr2
addr3
addr4
-----
addr1
addr2
addr3
addr4

> 
> I don't think it would be an issue to replace the loop by a
> rte_memcpy(), it may increase the copy speed and it will be
> more coherent with the put().
> 
> 
> Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-01  7:00         ` Jerin Jacob
@ 2016-06-02  7:36           ` Olivier MATZ
  2016-06-02  9:39             ` Jerin Jacob
  0 siblings, 1 reply; 32+ messages in thread
From: Olivier MATZ @ 2016-06-02  7:36 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

On 06/01/2016 09:00 AM, Jerin Jacob wrote:
> On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
>> Today, the objects pointers are reversed only in the get(). It means
>> that this code:
>>
>> 	rte_mempool_get_bulk(mp, table, 4);
>> 	for (i = 0; i < 4; i++)
>> 		printf("obj = %p\n", t[i]);
>> 	rte_mempool_put_bulk(mp, table, 4);
>>
>>
>> 	printf("-----\n");
>> 	rte_mempool_get_bulk(mp, table, 4);
>> 	for (i = 0; i < 4; i++)
>> 		printf("obj = %p\n", t[i]);
>> 	rte_mempool_put_bulk(mp, table, 4);
>>
>> prints:
>>
>> 	addr1
>> 	addr2
>> 	addr3
>> 	addr4
>> 	-----
>> 	addr4
>> 	addr3
>> 	addr2
>> 	addr1
>>
>> Which is quite strange.
> 
> IMO, It is the expected LIFO behavior. Right ?
> 
> What is not expected is the following, which is the case after change. Or Am I
> missing something here?
> 
> addr1
> addr2
> addr3
> addr4
> -----
> addr1
> addr2
> addr3
> addr4
> 
>>
>> I don't think it would be an issue to replace the loop by a
>> rte_memcpy(), it may increase the copy speed and it will be
>> more coherent with the put().
>>

I think the LIFO behavior should occur on a per-bulk basis. I mean,
it should behave like in the examples below:

  // pool cache is in state X
  obj1 = mempool_get(mp)
  obj2 = mempool_get(mp)
  mempool_put(mp, obj2)
  mempool_put(mp, obj1)
  // pool cache is back in state X

  // pool cache is in state X
  bulk1 = mempool_get_bulk(mp, 16)
  bulk2 = mempool_get_bulk(mp, 16)
  mempool_put_bulk(mp, bulk2, 16)
  mempool_put_bulk(mp, bulk1, 16)
  // pool cache is back in state X

Note that today this is not the case for bulks: since object addresses
are reversed only in get(), we are not back in the original state.
I don't really see the advantage of this.

Removing the reversing may accelerate the cache in the bulk get case,
I think.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-02  7:36           ` Olivier MATZ
@ 2016-06-02  9:39             ` Jerin Jacob
  2016-06-02 21:16               ` Olivier MATZ
  0 siblings, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-06-02  9:39 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
> Hi Jerin,
> 
> On 06/01/2016 09:00 AM, Jerin Jacob wrote:
> > On Tue, May 31, 2016 at 11:05:30PM +0200, Olivier MATZ wrote:
> >> Today, the objects pointers are reversed only in the get(). It means
> >> that this code:
> >>
> >> 	rte_mempool_get_bulk(mp, table, 4);
> >> 	for (i = 0; i < 4; i++)
> >> 		printf("obj = %p\n", t[i]);
> >> 	rte_mempool_put_bulk(mp, table, 4);
> >>
> >>
> >> 	printf("-----\n");
> >> 	rte_mempool_get_bulk(mp, table, 4);
> >> 	for (i = 0; i < 4; i++)
> >> 		printf("obj = %p\n", t[i]);
> >> 	rte_mempool_put_bulk(mp, table, 4);
> >>
> >> prints:
> >>
> >> 	addr1
> >> 	addr2
> >> 	addr3
> >> 	addr4
> >> 	-----
> >> 	addr4
> >> 	addr3
> >> 	addr2
> >> 	addr1
> >>
> >> Which is quite strange.
> > 
> > IMO, It is the expected LIFO behavior. Right ?
> > 
> > What is not expected is the following, which is the case after change. Or Am I
> > missing something here?
> > 
> > addr1
> > addr2
> > addr3
> > addr4
> > -----
> > addr1
> > addr2
> > addr3
> > addr4
> > 
> >>
> >> I don't think it would be an issue to replace the loop by a
> >> rte_memcpy(), it may increase the copy speed and it will be
> >> more coherent with the put().
> >>
> 
> I think the LIFO behavior should occur on a per-bulk basis. I mean,
> it should behave like in the exemplaes below:
> 
>   // pool cache is in state X
>   obj1 = mempool_get(mp)
>   obj2 = mempool_get(mp)
>   mempool_put(mp, obj2)
>   mempool_put(mp, obj1)
>   // pool cache is back in state X
> 
>   // pool cache is in state X
>   bulk1 = mempool_get_bulk(mp, 16)
>   bulk2 = mempool_get_bulk(mp, 16)
>   mempool_put_bulk(mp, bulk2, 16)
>   mempool_put_bulk(mp, bulk1, 16)
>   // pool cache is back in state X
> 

Per-entry LIFO behavior makes more sense in the _bulk_ case, as the most recently
enqueued buffer comes out on the next "get", which makes it more likely that the
buffer is still in the last-level cache.

> Note that today it's not the case for bulks, since object addresses
> are reversed only in get(), we are not back in the original state.
> I don't really see the advantage of this.
> 
> Removing the reversing may accelerate the cache in case of bulk get,
> I think.

I tried it in my setup, and it was dropping the performance. Have you got an
improvement in any setup?

Jerin

> 
> Regards,
> Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-02  9:39             ` Jerin Jacob
@ 2016-06-02 21:16               ` Olivier MATZ
  2016-06-03  7:02                 ` Jerin Jacob
  0 siblings, 1 reply; 32+ messages in thread
From: Olivier MATZ @ 2016-06-02 21:16 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

On 06/02/2016 11:39 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
>> I think the LIFO behavior should occur on a per-bulk basis. I mean,
>> it should behave like in the exemplaes below:
>>
>>   // pool cache is in state X
>>   obj1 = mempool_get(mp)
>>   obj2 = mempool_get(mp)
>>   mempool_put(mp, obj2)
>>   mempool_put(mp, obj1)
>>   // pool cache is back in state X
>>
>>   // pool cache is in state X
>>   bulk1 = mempool_get_bulk(mp, 16)
>>   bulk2 = mempool_get_bulk(mp, 16)
>>   mempool_put_bulk(mp, bulk2, 16)
>>   mempool_put_bulk(mp, bulk1, 16)
>>   // pool cache is back in state X
>>
> 
> Per entry LIFO behavior make more sense in _bulk_ case as recently en-queued buffer
> comes out for next "get" makes more chance that buffer in Last level cache.

Yes, from a memory cache perspective, I think you are right.

In practice, I'm not sure it's so important because on many hw drivers,
a packet buffer returns to the pool only after a round of the tx ring.
So I'd say it won't make a big difference here.

>> Note that today it's not the case for bulks, since object addresses
>> are reversed only in get(), we are not back in the original state.
>> I don't really see the advantage of this.
>>
>> Removing the reversing may accelerate the cache in case of bulk get,
>> I think.
> 
> I tried in my setup, it was dropping the performance. Have you got
> improvement in any setup?

I know that the mempool_perf autotest is not representative
of a real use case, but it gives a trend. I did a quick test with
- the legacy code,
- the rte_memcpy in put(),
- the rte_memcpy in both put() and get() (no reverse).
It seems that removing the reversing brings a ~50% improvement
with bulks of 32 (on a Westmere):

legacy
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=839922483
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=849792204
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1617022156
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1675087052
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=3202914713
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=3268725963

rte_memcpy in put() (your patch proposal)
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=762157465
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=744593817
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1500276326
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1461347942
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2974076107
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2928122264

rte_memcpy in put() and get()
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=974834892
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1129329459
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2147798220
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2232457625
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=4510816664
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=4582421298

This is probably more a measure of the pure CPU cost of the mempool
function, without considering the memory cache aspect. So, of course,
a real use-case test should be done to confirm or not that it increases
the performance. I'll manage to do a test and let you know the result.

By the way, not all drivers allocate or free the mbufs by
bulk, so this modification would only affect those drivers. What driver
are you using for your test?


Regards,
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-02 21:16               ` Olivier MATZ
@ 2016-06-03  7:02                 ` Jerin Jacob
  2016-06-17 10:40                   ` Olivier Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-06-03  7:02 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
Hi Olivier,

> This is probably more a measure of the pure CPU cost of the mempool
> function, without considering the memory cache aspect. So, of course,
> a real use-case test should be done to confirm or not that it increases
> the performance. I'll manage to do a test and let you know the result.

OK

IMO, rte_memcpy in put() makes sense (this patch) as there is no behavior change.
However, if rte_memcpy in get(), with its behavioral change, makes sense on some
platform, then we can enable it on a conditional basis (I am OK with that).

> 
> By the way, not all drivers are allocating or freeing the mbufs by
> bulk, so this modification would only affect these ones. What driver
> are you using for your test?

I have tested with the ThunderX nicvf PMD (it uses the bulk mode).
The driver was recently sent out on the mailing list for review.

Jerin

> 
> 
> Regards,
> Olivier
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-03  7:02                 ` Jerin Jacob
@ 2016-06-17 10:40                   ` Olivier Matz
  2016-06-24 16:04                     ` Olivier Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Olivier Matz @ 2016-06-17 10:40 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

On 06/03/2016 09:02 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
> Hi Olivier,
> 
>> This is probably more a measure of the pure CPU cost of the mempool
>> function, without considering the memory cache aspect. So, of course,
>> a real use-case test should be done to confirm or not that it increases
>> the performance. I'll manage to do a test and let you know the result.
> 
> OK
> 
> IMO, put rte_memcpy makes sense(this patch) as their no behavior change.
> However, if get rte_memcpy with behavioral changes makes sense some platform
> then we can enable it on conditional basics(I am OK with that)
> 
>>
>> By the way, not all drivers are allocating or freeing the mbufs by
>> bulk, so this modification would only affect these ones. What driver
>> are you using for your test?
> 
> I have tested with ThunderX nicvf pmd(uses the bulk mode).
> Recently sent out driver in ml for review

Just to let you know, I have not forgotten about this. I still need to
find some time to do a performance test.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-24 15:17   ` Jerin Jacob
  2016-05-27 10:24     ` Hunt, David
@ 2016-06-24 15:56     ` Hunt, David
  2016-06-24 16:02       ` Olivier Matz
  1 sibling, 1 reply; 32+ messages in thread
From: Hunt, David @ 2016-06-24 15:56 UTC (permalink / raw)
  To: Jerin Jacob, Olivier Matz
  Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Jerin,

I just ran a couple of tests on this patch on the latest master head on
a couple of machines: an older quad-socket E5-4650 and a quad-socket
E5-2699 v3.

E5-4650:
I'm seeing a gain of 2% for un-cached tests and a gain of 9% on the 
cached tests.

E5-2699 v3:
I'm seeing a loss of 0.1% for un-cached tests and a gain of 11% on the 
cached tests.

This is purely the autotest comparison; I don't have traffic generator
results. But based on the above, I don't think there are any performance
issues with the patch.

Regards,
Dave.




On 24/5/2016 4:17 PM, Jerin Jacob wrote:
> On Tue, May 24, 2016 at 04:59:47PM +0200, Olivier Matz wrote:
>> Hi Jerin,
>>
>>
>> On 05/24/2016 04:50 PM, Jerin Jacob wrote:
>>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>> ---
>>>   lib/librte_mempool/rte_mempool.h | 5 ++---
>>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>>> index ed2c110..ebe399a 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -74,6 +74,7 @@
>>>   #include <rte_memory.h>
>>>   #include <rte_branch_prediction.h>
>>>   #include <rte_ring.h>
>>> +#include <rte_memcpy.h>
>>>   
>>>   #ifdef __cplusplus
>>>   extern "C" {
>>> @@ -917,7 +918,6 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>>>   		    unsigned n, __rte_unused int is_mp)
>>>   {
>>>   	struct rte_mempool_cache *cache;
>>> -	uint32_t index;
>>>   	void **cache_objs;
>>>   	unsigned lcore_id = rte_lcore_id();
>>>   	uint32_t cache_size = mp->cache_size;
>>> @@ -946,8 +946,7 @@ __mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
>>>   	 */
>>>   
>>>   	/* Add elements back into the cache */
>>> -	for (index = 0; index < n; ++index, obj_table++)
>>> -		cache_objs[index] = *obj_table;
>>> +	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
>>>   
>>>   	cache->len += n;
>>>   
>>>
>> The commit title should be "mempool" instead of "mbuf".
> I will fix it.
>
>> Are you seeing some performance improvement by using rte_memcpy()?
> Yes, In some case, In default case, It was replaced with memcpy by the
> compiler itself(gcc 5.3). But when I tried external mempool manager patch and
> then performance dropped almost 800Kpps. Debugging further it turns out that
> external mempool managers unrelated change was knocking out the memcpy.
> explicit rte_memcpy brought back 500Kpps. Remaing 300Kpps drop is still
> unknown(In my test setup, packets are in the local cache, so it must be
> something do with __mempool_put_bulk text alignment change or similar.
>
> Anyone else observed performance drop with external poolmanager?
>
> Jerin
>
>> Regards
>> Olivier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-24 15:56     ` Hunt, David
@ 2016-06-24 16:02       ` Olivier Matz
  0 siblings, 0 replies; 32+ messages in thread
From: Olivier Matz @ 2016-06-24 16:02 UTC (permalink / raw)
  To: Hunt, David, Jerin Jacob
  Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev

Hi Dave,

On 06/24/2016 05:56 PM, Hunt, David wrote:
> Hi Jerin,
> 
> I just ran a couple of tests on this patch on the latest master head on
> a couple of machines. An older quad socket E5-4650 and a quad socket
> E5-2699 v3
> 
> E5-4650:
> I'm seeing a gain of 2% for un-cached tests and a gain of 9% on the
> cached tests.
> 
> E5-2699 v3:
> I'm seeing a loss of 0.1% for un-cached tests and a gain of 11% on the
> cached tests.
> 
> This is purely the autotest comparison, I don't have traffic generator
> results. But based on the above, I don't think there are any performance
> issues with the patch.
> 

Thanks for doing the test on your side. I think it's probably enough
to integrate Jerin's patch.

About using a rte_memcpy() in the mempool_get(), I don't think I'll have
the time to do a more exhaustive test before the 16.07, so I'll come
back with it later.

I'm sending an ack on the v2 thread.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-17 10:40                   ` Olivier Matz
@ 2016-06-24 16:04                     ` Olivier Matz
  0 siblings, 0 replies; 32+ messages in thread
From: Olivier Matz @ 2016-06-24 16:04 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, thomas.monjalon, bruce.richardson, konstantin.ananyev



On 06/17/2016 12:40 PM, Olivier Matz wrote:
> Hi Jerin,
> 
> On 06/03/2016 09:02 AM, Jerin Jacob wrote:
>> On Thu, Jun 02, 2016 at 11:16:16PM +0200, Olivier MATZ wrote:
>> Hi Olivier,
>>
>>> This is probably more a measure of the pure CPU cost of the mempool
>>> function, without considering the memory cache aspect. So, of course,
>>> a real use-case test should be done to confirm or not that it increases
>>> the performance. I'll manage to do a test and let you know the result.
>>
>> OK
>>
>> IMO, put rte_memcpy makes sense(this patch) as their no behavior change.
>> However, if get rte_memcpy with behavioral changes makes sense some platform
>> then we can enable it on conditional basics(I am OK with that)
>>
>>>
>>> By the way, not all drivers are allocating or freeing the mbufs by
>>> bulk, so this modification would only affect these ones. What driver
>>> are you using for your test?
>>
>> I have tested with ThunderX nicvf pmd(uses the bulk mode).
>> Recently sent out driver in ml for review
> 
> Just to let you know I do not forget this. I still need to
> find some time to do a performance test.


Quoting from the other thread [1] too to save this in patchwork:
[1] http://dpdk.org/ml/archives/dev/2016-June/042701.html


> On 06/24/2016 05:56 PM, Hunt, David wrote:
>> Hi Jerin,
>>
>> I just ran a couple of tests on this patch on the latest master head on
>> a couple of machines. An older quad socket E5-4650 and a quad socket
>> E5-2699 v3
>>
>> E5-4650:
>> I'm seeing a gain of 2% for un-cached tests and a gain of 9% on the
>> cached tests.
>>
>> E5-2699 v3:
>> I'm seeing a loss of 0.1% for un-cached tests and a gain of 11% on the
>> cached tests.
>>
>> This is purely the autotest comparison, I don't have traffic generator
>> results. But based on the above, I don't think there are any performance
>> issues with the patch.
>>
> 
> Thanks for doing the test on your side. I think it's probably enough
> to integrate Jerin's patch .
> 
> About using a rte_memcpy() in the mempool_get(), I don't think I'll have
> the time to do a more exhaustive test before the 16.07, so I'll come
> back with it later.
> 
> I'm sending an ack on the v2 thread.


Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
  2016-05-30  8:45   ` Olivier Matz
@ 2016-06-30  9:41   ` Thomas Monjalon
  2016-06-30 11:38     ` Jerin Jacob
  2016-06-30 12:16   ` [PATCH v3] " Jerin Jacob
  2 siblings, 1 reply; 32+ messages in thread
From: Thomas Monjalon @ 2016-06-30  9:41 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, bruce.richardson, konstantin.ananyev, olivier.matz

2016-05-26 13:37, Jerin Jacob:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

Please Jerin (or anyone else), could you rebase this patch?
Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-30  9:41   ` Thomas Monjalon
@ 2016-06-30 11:38     ` Jerin Jacob
  0 siblings, 0 replies; 32+ messages in thread
From: Jerin Jacob @ 2016-06-30 11:38 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, bruce.richardson, konstantin.ananyev, olivier.matz

On Thu, Jun 30, 2016 at 11:41:59AM +0200, Thomas Monjalon wrote:
> 2016-05-26 13:37, Jerin Jacob:
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> 
> Please Jerin (or anyone else), could you rebase this patch?

OK. I will send the rebased version

> Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
  2016-05-30  8:45   ` Olivier Matz
  2016-06-30  9:41   ` Thomas Monjalon
@ 2016-06-30 12:16   ` Jerin Jacob
  2016-06-30 17:28     ` Thomas Monjalon
  2 siblings, 1 reply; 32+ messages in thread
From: Jerin Jacob @ 2016-06-30 12:16 UTC (permalink / raw)
  To: dev; +Cc: thomas.monjalon, olivier.matz, Jerin Jacob

Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---
v1..v2
Corrected the git commit message (s/mbuf/mempool/g)
v2..v3
re-base to master
---
---
 lib/librte_mempool/rte_mempool.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index b2a5197..c8a81e2 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -74,6 +74,7 @@
 #include <rte_memory.h>
 #include <rte_branch_prediction.h>
 #include <rte_ring.h>
+#include <rte_memcpy.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1028,7 +1029,6 @@ static inline void __attribute__((always_inline))
 __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
 		      unsigned n, struct rte_mempool_cache *cache, int flags)
 {
-	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1052,8 +1052,7 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
-	for (index = 0; index < n; ++index, obj_table++)
-		cache_objs[index] = *obj_table;
+	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
 
 	cache->len += n;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-30 12:16   ` [PATCH v3] " Jerin Jacob
@ 2016-06-30 17:28     ` Thomas Monjalon
  2016-07-05  8:43       ` Ferruh Yigit
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Monjalon @ 2016-06-30 17:28 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: dev, olivier.matz

2016-06-30 17:46, Jerin Jacob:
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Applied, thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-06-30 17:28     ` Thomas Monjalon
@ 2016-07-05  8:43       ` Ferruh Yigit
  2016-07-05 11:32         ` Yuanhan Liu
  0 siblings, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2016-07-05  8:43 UTC (permalink / raw)
  To: Thomas Monjalon, Jerin Jacob; +Cc: dev, olivier.matz, Pablo de Lara

On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
> 2016-06-30 17:46, Jerin Jacob:
>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Applied, thanks
> 

Hi Jerin,

This commit causes a compilation error on target i686-native-linuxapp-gcc
with gcc6.

Compilation error is:
== Build drivers/net/virtio
  CC virtio_rxtx_simple.o
In file included from
/root/development/dpdk/build/include/rte_mempool.h:77:0,
                 from
/root/development/dpdk/drivers/net/virtio/virtio_rxtx_simple.c:46:
/root/development/dpdk/drivers/net/virtio/virtio_rxtx_simple.c: In
function ‘virtio_xmit_pkts_simple’:
/root/development/dpdk/build/include/rte_memcpy.h:551:2: error: array
subscript is above array bounds [-Werror=array-bounds]
  rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/development/dpdk/build/include/rte_memcpy.h:552:2: error: array
subscript is above array bounds [-Werror=array-bounds]
  rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/development/dpdk/build/include/rte_memcpy.h:553:2: error: array
subscript is above array bounds [-Werror=array-bounds]
  rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/development/dpdk/build/include/rte_memcpy.h:554:2: error: array
subscript is above array bounds [-Werror=array-bounds]
  rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-05  8:43       ` Ferruh Yigit
@ 2016-07-05 11:32         ` Yuanhan Liu
  2016-07-05 13:13           ` Jerin Jacob
  0 siblings, 1 reply; 32+ messages in thread
From: Yuanhan Liu @ 2016-07-05 11:32 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ferruh Yigit, Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
> On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
> > 2016-06-30 17:46, Jerin Jacob:
> >> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> >> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > 
> > Applied, thanks
> > 
> 
> Hi Jerin,
> 
> This commit cause a compilation error on target i686-native-linuxapp-gcc
> with gcc6.

Besides that, I'm more curious to know: have you actually seen any
performance boost?

	--yliu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-05 11:32         ` Yuanhan Liu
@ 2016-07-05 13:13           ` Jerin Jacob
  2016-07-05 13:42             ` Yuanhan Liu
  2016-07-05 14:09             ` Ferruh Yigit
  0 siblings, 2 replies; 32+ messages in thread
From: Jerin Jacob @ 2016-07-05 13:13 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: Ferruh Yigit, Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On Tue, Jul 05, 2016 at 07:32:46PM +0800, Yuanhan Liu wrote:
> On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
> > On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
> > > 2016-06-30 17:46, Jerin Jacob:
> > >> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > >> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > > 
> > > Applied, thanks
> > > 
> > 
> > Hi Jerin,
> > 
> > This commit cause a compilation error on target i686-native-linuxapp-gcc
> > with gcc6.
> 
> Besides that, I'm more curious to know have you actually seen any
> performance boost?

Let me first address your curiosity:
http://dpdk.org/dev/patchwork/patch/12993/ (check the second comment)
http://dpdk.org/ml/archives/dev/2016-June/042701.html

Ferruh,

I have tested on an x86 machine with gcc 6.1. I couldn't see any issues
with the i686-native-linuxapp-gcc target.

Steps followed to create the gcc 6.1 toolchain:
https://sahas.ra.naman.ms/2016/05/31/building-gcc6-1-on-fedora-23/
(removed --disable-multilib to have support for -m32)

➜ [dpdk-master] $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/gcc-6.1.0/libexec/gcc/x86_64-pc-linux-gnu/6.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-6.1.0/configure --prefix=/opt/gcc-6.1.0
--enable-languages=c,c++ --enable-libmudflap --with-system-zlib
Thread model: posix
gcc version 6.1.0 (GCC)

Moreover, this issue seems to come from the x86 rte_memcpy implementation.

Jerin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-05 13:13           ` Jerin Jacob
@ 2016-07-05 13:42             ` Yuanhan Liu
  2016-07-05 14:09             ` Ferruh Yigit
  1 sibling, 0 replies; 32+ messages in thread
From: Yuanhan Liu @ 2016-07-05 13:42 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ferruh Yigit, Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On Tue, Jul 05, 2016 at 06:43:57PM +0530, Jerin Jacob wrote:
> On Tue, Jul 05, 2016 at 07:32:46PM +0800, Yuanhan Liu wrote:
> > On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
> > > On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
> > > > 2016-06-30 17:46, Jerin Jacob:
> > > >> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > >> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > > > 
> > > > Applied, thanks
> > > > 
> > > 
> > > Hi Jerin,
> > > 
> > > This commit cause a compilation error on target i686-native-linuxapp-gcc
> > > with gcc6.
> > 
> > Besides that, I'm more curious to know have you actually seen any
> > performance boost?
> 
> let me first address your curiosity,
> http://dpdk.org/dev/patchwork/patch/12993/( check the second comment)
> http://dpdk.org/ml/archives/dev/2016-June/042701.html

Thanks. Maybe it would be good to include that info in the commit log next
time.

	--yliu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-05 13:13           ` Jerin Jacob
  2016-07-05 13:42             ` Yuanhan Liu
@ 2016-07-05 14:09             ` Ferruh Yigit
  2016-07-06 16:21               ` Ferruh Yigit
  1 sibling, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2016-07-05 14:09 UTC (permalink / raw)
  To: Jerin Jacob, Yuanhan Liu
  Cc: Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On 7/5/2016 2:13 PM, Jerin Jacob wrote:
> On Tue, Jul 05, 2016 at 07:32:46PM +0800, Yuanhan Liu wrote:
>> On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
>>> On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
>>>> 2016-06-30 17:46, Jerin Jacob:
>>>>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>> Acked-by: Olivier Matz <olivier.matz@6wind.com>
>>>>
>>>> Applied, thanks
>>>>
>>>
>>> Hi Jerin,
>>>
>>> This commit cause a compilation error on target i686-native-linuxapp-gcc
>>> with gcc6.
>>
>> Besides that, I'm more curious to know have you actually seen any
>> performance boost?
> 
> let me first address your curiosity,
> http://dpdk.org/dev/patchwork/patch/12993/( check the second comment)
> http://dpdk.org/ml/archives/dev/2016-June/042701.html
> 
> Ferruh,

Hi Jerin,

> 
> I have tested on an x86 machine with gcc 6.1. I couldn't see any issues
> with the i686-native-linuxapp-gcc target.
Thanks for investigating the issue.

> 
> Steps followed to create the gcc 6.1 toolchain:
> https://sahas.ra.naman.ms/2016/05/31/building-gcc6-1-on-fedora-23/
> (removed --disable-multilib to have support for -m32)
> 
> ➜ [dpdk-master] $ gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/opt/gcc-6.1.0/libexec/gcc/x86_64-pc-linux-gnu/6.1.0/lto-wrapper
> Target: x86_64-pc-linux-gnu
> Configured with: ../gcc-6.1.0/configure --prefix=/opt/gcc-6.1.0
> --enable-languages=c,c++ --enable-libmudflap --with-system-zlib
> Thread model: posix
> gcc version 6.1.0 (GCC)
I am using Fedora 24, which has gcc6 (6.1.1) as the default.

> 
> Moreover, this issue seems to be an issue in the x86 rte_memcpy implementation.
You are right. But the i686 compilation starts failing with this commit,
and reverting this commit in the current HEAD solves the compilation
problem.
I am not really clear about the reason for the compilation error.

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-05 14:09             ` Ferruh Yigit
@ 2016-07-06 16:21               ` Ferruh Yigit
  2016-07-07 13:51                 ` Ferruh Yigit
  0 siblings, 1 reply; 32+ messages in thread
From: Ferruh Yigit @ 2016-07-06 16:21 UTC (permalink / raw)
  To: Jerin Jacob, Yuanhan Liu
  Cc: Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On 7/5/2016 3:09 PM, Ferruh Yigit wrote:
> On 7/5/2016 2:13 PM, Jerin Jacob wrote:
>> On Tue, Jul 05, 2016 at 07:32:46PM +0800, Yuanhan Liu wrote:
>>> On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
>>>> On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
>>>>> 2016-06-30 17:46, Jerin Jacob:
>>>>>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>>> Acked-by: Olivier Matz <olivier.matz@6wind.com>
>>>>>
>>>>> Applied, thanks
>>>>>
>>>>
>>>> Hi Jerin,
>>>>
>>>> This commit cause a compilation error on target i686-native-linuxapp-gcc
>>>> with gcc6.
>>>
>>> Besides that, I'm more curious to know have you actually seen any
>>> performance boost?
>>
>> let me first address your curiosity,
>> http://dpdk.org/dev/patchwork/patch/12993/( check the second comment)
>> http://dpdk.org/ml/archives/dev/2016-June/042701.html
>>
>> Ferruh,
> 
> Hi Jerin,
> 
>>
>> I have tested on an x86 machine with gcc 6.1. I couldn't see any issues
>> with the i686-native-linuxapp-gcc target.
> Thanks for investigating the issue.
> 
>>
>> Steps followed to create the gcc 6.1 toolchain:
>> https://sahas.ra.naman.ms/2016/05/31/building-gcc6-1-on-fedora-23/
>> (removed --disable-multilib to have support for -m32)
>>
>> ➜ [dpdk-master] $ gcc -v
>> Using built-in specs.
>> COLLECT_GCC=gcc
>> COLLECT_LTO_WRAPPER=/opt/gcc-6.1.0/libexec/gcc/x86_64-pc-linux-gnu/6.1.0/lto-wrapper
>> Target: x86_64-pc-linux-gnu
>> Configured with: ../gcc-6.1.0/configure --prefix=/opt/gcc-6.1.0
>> --enable-languages=c,c++ --enable-libmudflap --with-system-zlib
>> Thread model: posix
>> gcc version 6.1.0 (GCC)
> I am using Fedora 24, which has gcc6 (6.1.1) as the default.
> 
>>
>> Moreover, this issue seems to be an issue in the x86 rte_memcpy implementation.
> You are right. But the i686 compilation starts failing with this commit,
> and reverting this commit in the current HEAD solves the compilation
> problem.
> I am not really clear about the reason for the compilation error.

The compile error happens because the compiler is now very smart, and at
the same time not quite smart enough.

The call stack is as follows:

virtio_xmit_pkts_simple
  virtio_xmit_cleanup
    rte_mempool_put_bulk
      rte_mempool_generic_put
        __mempool_generic_put
	  rte_memcpy

The array used as the source buffer in virtio_xmit_cleanup (free) is a
pointer array with 32 elements; on 32-bit this makes 128 bytes.

In the rte_memcpy() implementation, there is a code piece like the following:
if (size > 256) {
    rte_move128(...);
    rte_move128(...); <--- [1]
    ....
}

The compiler traces the array all through the call stack, knows the
size of the array is 128 bytes, and generates a warning at [1] above,
where the code tries to access beyond byte 128.
But unfortunately it ignores the "(size > 256)" check.

On 64-bit, this warning is not generated because the array size becomes
256 bytes.

So this warning is a false positive. Although I am working on it, if
anybody has a suggestion to prevent the warning, it is quite welcome. At
worst I will disable this compiler warning for the file.
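
To make that concrete, here is a stripped-down sketch (simplified names
only; not the actual DPDK or virtio sources) of the shape of code the
compiler is complaining about: a 128-byte source array flowing into an
inlined copy routine that contains a guarded access past byte 128.

#include <stdint.h>
#include <string.h>

/* Stand-in for rte_memcpy(): the large-size path also reads bytes
 * 128..255 of the source buffer, but only behind a size > 256 guard. */
static inline void
copy_inline(void *dst, const void *src, size_t size)
{
	if (size > 256) {
		memcpy(dst, src, 128);
		memcpy((uint8_t *)dst + 128,
		       (const uint8_t *)src + 128, 128); /* [1] */
		return;
	}
	memcpy(dst, src, size);
}

/* Stand-in for the virtio cleanup path: 32 pointers make 128 bytes on
 * a 32-bit target, and the caller guarantees n <= 32.  After inlining,
 * the compiler knows the source array is only 128 bytes and may warn
 * about the access at [1], even though the guard makes that path
 * unreachable for this call. */
void
cleanup_sketch(void **cache_objs, unsigned int n)
{
	void *free_list[32] = { NULL };

	copy_inline(cache_objs, free_list, sizeof(void *) * n);
}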

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mempool: replace c memcpy code semantics with optimized rte_memcpy
  2016-07-06 16:21               ` Ferruh Yigit
@ 2016-07-07 13:51                 ` Ferruh Yigit
  0 siblings, 0 replies; 32+ messages in thread
From: Ferruh Yigit @ 2016-07-07 13:51 UTC (permalink / raw)
  To: Jerin Jacob, Yuanhan Liu
  Cc: Thomas Monjalon, dev, olivier.matz, Pablo de Lara

On 7/6/2016 5:21 PM, Ferruh Yigit wrote:
> On 7/5/2016 3:09 PM, Ferruh Yigit wrote:
>> On 7/5/2016 2:13 PM, Jerin Jacob wrote:
>>> On Tue, Jul 05, 2016 at 07:32:46PM +0800, Yuanhan Liu wrote:
>>>> On Tue, Jul 05, 2016 at 09:43:03AM +0100, Ferruh Yigit wrote:
>>>>> On 6/30/2016 6:28 PM, Thomas Monjalon wrote:
>>>>>> 2016-06-30 17:46, Jerin Jacob:
>>>>>>> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>>>> Acked-by: Olivier Matz <olivier.matz@6wind.com>
>>>>>>
>>>>>> Applied, thanks
>>>>>>
>>>>>
>>>>> Hi Jerin,
>>>>>
>>>>> This commit cause a compilation error on target i686-native-linuxapp-gcc
>>>>> with gcc6.
>>>>
>>>> Besides that, I'm more curious to know have you actually seen any
>>>> performance boost?
>>>
>>> let me first address your curiosity,
>>> http://dpdk.org/dev/patchwork/patch/12993/( check the second comment)
>>> http://dpdk.org/ml/archives/dev/2016-June/042701.html
>>>
>>> Ferruh,
>>
>> Hi Jerin,
>>
>>>
>>> I have tested on an x86 machine with gcc 6.1. I couldn't see any issues
>>> with the i686-native-linuxapp-gcc target.
>> Thanks for investigating the issue.
>>
>>>
>>> Steps followed to create the gcc 6.1 toolchain:
>>> https://sahas.ra.naman.ms/2016/05/31/building-gcc6-1-on-fedora-23/
>>> (removed --disable-multilib to have support for -m32)
>>>
>>> ➜ [dpdk-master] $ gcc -v
>>> Using built-in specs.
>>> COLLECT_GCC=gcc
>>> COLLECT_LTO_WRAPPER=/opt/gcc-6.1.0/libexec/gcc/x86_64-pc-linux-gnu/6.1.0/lto-wrapper
>>> Target: x86_64-pc-linux-gnu
>>> Configured with: ../gcc-6.1.0/configure --prefix=/opt/gcc-6.1.0
>>> --enable-languages=c,c++ --enable-libmudflap --with-system-zlib
>>> Thread model: posix
>>> gcc version 6.1.0 (GCC)
>> I am using Fedora 24, which has gcc6 (6.1.1) as the default.
>>
>>>
>>> Moreover, this issue seems to be an issue in the x86 rte_memcpy implementation.
>> You are right. But the i686 compilation starts failing with this commit,
>> and reverting this commit in the current HEAD solves the compilation
>> problem.
>> I am not really clear about the reason for the compilation error.
> 
> The compile error happens because the compiler is now very smart, and at
> the same time not quite smart enough.
> 
> The call stack is as follows:
> 
> virtio_xmit_pkts_simple
>   virtio_xmit_cleanup
>     rte_mempool_put_bulk
>       rte_mempool_generic_put
>         __mempool_generic_put
> 	  rte_memcpy
> 
> The array used as the source buffer in virtio_xmit_cleanup (free) is a
> pointer array with 32 elements; on 32-bit this makes 128 bytes.
>
> In the rte_memcpy() implementation, there is a code piece like the following:
> if (size > 256) {
>     rte_move128(...);
>     rte_move128(...); <--- [1]
>     ....
> }
> 
> The compiler traces the array all through the call stack, knows the
> size of the array is 128 bytes, and generates a warning at [1] above,
> where the code tries to access beyond byte 128.
> But unfortunately it ignores the "(size > 256)" check.
> 
> On 64-bit, this warning is not generated because the array size becomes
> 256 bytes.
> 
> So this warning is a false positive. Although I am working on it, if
> anybody has a suggestion to prevent the warning, it is quite welcome. At
> worst I will disable this compiler warning for the file.

I have sent a patch:
http://dpdk.org/ml/archives/dev/2016-July/043492.html

Giving the compiler a hint that the variable "size" is related to the size of
the source buffer fixes the compiler warning.
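
For illustration, a minimal sketch (hypothetical names, not the actual
patch) of the kind of hint meant here: clamping the element count to the
capacity of the source array ties the copy size to the source buffer, so
the compiler can rule out the guarded out-of-bounds path.

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define FREE_LIST_SZ 32

/* Hypothetical helper, not the actual patch: return up to 'n' mbufs from
 * a fixed-size pointer array to their mempool.  Clamping 'n' to the array
 * capacity tells the compiler that the size passed down can never exceed
 * the source buffer, which avoids the false positive described above when
 * rte_memcpy() gets inlined underneath rte_mempool_put_bulk(). */
static inline void
free_list_put_bulk(struct rte_mbuf *free_list[FREE_LIST_SZ], unsigned int n)
{
	if (n == 0)
		return;
	if (n > FREE_LIST_SZ)	/* hint: bounded by the source buffer */
		n = FREE_LIST_SZ;
	rte_mempool_put_bulk(free_list[0]->pool, (void **)free_list, n);
}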

- This update is in the virtio fast path; can you please review it from
the point of view of its performance effect?

- Isn't this surprisingly smart of the compiler, or am I missing something? :)

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-07-07 13:51 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-24 14:50 [PATCH] mbuf: replace c memcpy code semantics with optimized rte_memcpy Jerin Jacob
2016-05-24 14:59 ` Olivier Matz
2016-05-24 15:17   ` Jerin Jacob
2016-05-27 10:24     ` Hunt, David
2016-05-27 11:42       ` Jerin Jacob
2016-05-27 15:05         ` Thomas Monjalon
2016-05-30  8:44           ` Olivier Matz
2016-05-27 13:45       ` Hunt, David
2016-06-24 15:56     ` Hunt, David
2016-06-24 16:02       ` Olivier Matz
2016-05-26  8:07 ` [PATCH v2] mempool: " Jerin Jacob
2016-05-30  8:45   ` Olivier Matz
2016-05-31 12:58     ` Jerin Jacob
2016-05-31 21:05       ` Olivier MATZ
2016-06-01  7:00         ` Jerin Jacob
2016-06-02  7:36           ` Olivier MATZ
2016-06-02  9:39             ` Jerin Jacob
2016-06-02 21:16               ` Olivier MATZ
2016-06-03  7:02                 ` Jerin Jacob
2016-06-17 10:40                   ` Olivier Matz
2016-06-24 16:04                     ` Olivier Matz
2016-06-30  9:41   ` Thomas Monjalon
2016-06-30 11:38     ` Jerin Jacob
2016-06-30 12:16   ` [PATCH v3] " Jerin Jacob
2016-06-30 17:28     ` Thomas Monjalon
2016-07-05  8:43       ` Ferruh Yigit
2016-07-05 11:32         ` Yuanhan Liu
2016-07-05 13:13           ` Jerin Jacob
2016-07-05 13:42             ` Yuanhan Liu
2016-07-05 14:09             ` Ferruh Yigit
2016-07-06 16:21               ` Ferruh Yigit
2016-07-07 13:51                 ` Ferruh Yigit
