All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH ] examples/l3fwd: fix aliasing in port grouping
@ 2017-11-02 14:31 Guduri Prathyusha
  2017-11-02 14:46 ` Ananyev, Konstantin
  0 siblings, 1 reply; 7+ messages in thread
From: Guduri Prathyusha @ 2017-11-02 14:31 UTC (permalink / raw)
  To: tomasz.kantecki
  Cc: Jianbo.Liu, guduriprathyusha, konstantin.ananyev, dev, Guduri Prathyusha

With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
pointers pointing to same chunk of memory and with -f-strict-aliasing the
pointers are assumed to be pointing to different memory and compiler
reorders instructions that depend on pnum and pn. This breaks port
grouping algorithm.

This patch eliminates the usage of union and uses memcpy for copying
gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
not call its library implementation but uses appropriate LD and ST
instructions directly and hence no performance overhead.

Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
---
 examples/l3fwd/l3fwd_neon.h | 11 +++--------
 examples/l3fwd/l3fwd_sse.h  | 11 +++--------
 2 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
index 4bc161394..10a602a04 100644
--- a/examples/l3fwd/l3fwd_neon.h
+++ b/examples/l3fwd/l3fwd_neon.h
@@ -100,11 +100,6 @@ static inline uint16_t *
 port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
 	     uint16x8_t dp2)
 {
-	union {
-		uint16_t u16[FWDSTEP + 1];
-		uint64_t u64;
-	} *pnum = (void *)pn;
-
 	int32_t v;
 	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};

@@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,

 	/* if dest port value has changed. */
 	if (v != GRPMSK) {
-		pnum->u64 = gptbl[v].pnum;
-		pnum->u16[FWDSTEP] = 1;
-		lp = pnum->u16 + gptbl[v].idx;
+		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
+		pn[FWDSTEP] = 1;
+		lp = pn + gptbl[v].idx;
 	}

 	return lp;
diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 831760f02..79a71d77e 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 static inline uint16_t *
 port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
 {
-	union {
-		uint16_t u16[FWDSTEP + 1];
-		uint64_t u64;
-	} *pnum = (void *)pn;
-
 	int32_t v;

 	dp1 = _mm_cmpeq_epi16(dp1, dp2);
@@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)

 	/* if dest port value has changed. */
 	if (v != GRPMSK) {
-		pnum->u64 = gptbl[v].pnum;
-		pnum->u16[FWDSTEP] = 1;
-		lp = pnum->u16 + gptbl[v].idx;
+		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
+		pn[FWDSTEP] = 1;
+		lp = pn + gptbl[v].idx;
 	}

 	return lp;
--
2.14.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-02 14:31 [PATCH ] examples/l3fwd: fix aliasing in port grouping Guduri Prathyusha
@ 2017-11-02 14:46 ` Ananyev, Konstantin
  2017-11-02 15:33   ` Guduri Prathyusha
  0 siblings, 1 reply; 7+ messages in thread
From: Ananyev, Konstantin @ 2017-11-02 14:46 UTC (permalink / raw)
  To: Guduri Prathyusha, Kantecki, Tomasz; +Cc: Jianbo.Liu, guduriprathyusha, dev

Hi,

> -----Original Message-----
> From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> Sent: Thursday, November 2, 2017 2:31 PM
> To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> Prathyusha <gprathyusha@caviumnetworks.com>
> Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> 
> With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> pointers pointing to same chunk of memory and with -f-strict-aliasing the
> pointers are assumed to be pointing to different memory and compiler
> reorders instructions that depend on pnum and pn. This breaks port
> grouping algorithm.
> 
> This patch eliminates the usage of union and uses memcpy for copying
> gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> not call its library implementation but uses appropriate LD and ST
> instructions directly and hence no performance overhead.
> 
> Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> ---
>  examples/l3fwd/l3fwd_neon.h | 11 +++--------
>  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
>  2 files changed, 6 insertions(+), 16 deletions(-)
> 
> diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> index 4bc161394..10a602a04 100644
> --- a/examples/l3fwd/l3fwd_neon.h
> +++ b/examples/l3fwd/l3fwd_neon.h
> @@ -100,11 +100,6 @@ static inline uint16_t *
>  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
>  	     uint16x8_t dp2)
>  {
> -	union {
> -		uint16_t u16[FWDSTEP + 1];
> -		uint64_t u64;
> -	} *pnum = (void *)pn;
> -
>  	int32_t v;
>  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> 
> @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> 
>  	/* if dest port value has changed. */
>  	if (v != GRPMSK) {
> -		pnum->u64 = gptbl[v].pnum;
> -		pnum->u16[FWDSTEP] = 1;
> -		lp = pnum->u16 + gptbl[v].idx;
> +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> +		pn[FWDSTEP] = 1;
> +		lp = pn + gptbl[v].idx;
>  	}
> 
>  	return lp;
> diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> index 831760f02..79a71d77e 100644
> --- a/examples/l3fwd/l3fwd_sse.h
> +++ b/examples/l3fwd/l3fwd_sse.h
> @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>  static inline uint16_t *
>  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
>  {
> -	union {
> -		uint16_t u16[FWDSTEP + 1];
> -		uint64_t u64;
> -	} *pnum = (void *)pn;
> -
>  	int32_t v;
> 
>  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> 
>  	/* if dest port value has changed. */
>  	if (v != GRPMSK) {
> -		pnum->u64 = gptbl[v].pnum;
> -		pnum->u16[FWDSTEP] = 1;
> -		lp = pnum->u16 + gptbl[v].idx;
> +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> +		pn[FWDSTEP] = 1;
> +		lp = pn + gptbl[v].idx;

Could you explain a bit more here - which exactly instructions were reordered
and what kind of problems did it cause?
Specially on IA?
In any case I don't think using rte_memcpy is a good thing to use here:
it is a huge inline function - way too much to copy just 64 bit variable.
Konstantin

>  	}
> 
>  	return lp;
> --
> 2.14.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-02 14:46 ` Ananyev, Konstantin
@ 2017-11-02 15:33   ` Guduri Prathyusha
  2017-11-02 15:52     ` Ananyev, Konstantin
  0 siblings, 1 reply; 7+ messages in thread
From: Guduri Prathyusha @ 2017-11-02 15:33 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Jianbo.Liu, guduriprathyusha, tomasz.kantecki

On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> Hi,
Hi
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 2:31 PM
> > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > Prathyusha <gprathyusha@caviumnetworks.com>
> > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > pointers are assumed to be pointing to different memory and compiler
> > reorders instructions that depend on pnum and pn. This breaks port
> > grouping algorithm.
> >
> > This patch eliminates the usage of union and uses memcpy for copying
> > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > not call its library implementation but uses appropriate LD and ST
> > instructions directly and hence no performance overhead.
> >
> > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > ---
> >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> >  2 files changed, 6 insertions(+), 16 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > index 4bc161394..10a602a04 100644
> > --- a/examples/l3fwd/l3fwd_neon.h
> > +++ b/examples/l3fwd/l3fwd_neon.h
> > @@ -100,11 +100,6 @@ static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> >  	     uint16x8_t dp2)
> >  {
> > -	union {
> > -		uint16_t u16[FWDSTEP + 1];
> > -		uint64_t u64;
> > -	} *pnum = (void *)pn;
> > -
> >  	int32_t v;
> >  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> >
> > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> >
> >  	/* if dest port value has changed. */
> >  	if (v != GRPMSK) {
> > -		pnum->u64 = gptbl[v].pnum;
> > -		pnum->u16[FWDSTEP] = 1;
> > -		lp = pnum->u16 + gptbl[v].idx;
> > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +		pn[FWDSTEP] = 1;
> > +		lp = pn + gptbl[v].idx;
> >  	}
> >
> >  	return lp;
> > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > index 831760f02..79a71d77e 100644
> > --- a/examples/l3fwd/l3fwd_sse.h
> > +++ b/examples/l3fwd/l3fwd_sse.h
> > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> >  static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> >  {
> > -	union {
> > -		uint16_t u16[FWDSTEP + 1];
> > -		uint64_t u64;
> > -	} *pnum = (void *)pn;
> > -
> >  	int32_t v;
> >
> >  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> >
> >  	/* if dest port value has changed. */
> >  	if (v != GRPMSK) {
> > -		pnum->u64 = gptbl[v].pnum;
> > -		pnum->u16[FWDSTEP] = 1;
> > -		lp = pnum->u16 + gptbl[v].idx;
> > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +		pn[FWDSTEP] = 1;
> > +		lp = pn + gptbl[v].idx;
>
> Could you explain a bit more here - which exactly instructions were reordered
> and what kind of problems did it cause?
> Specially on IA?

This issue is observed on ARM since ARM gcc is more aggressive in
reordering than x86 gcc. In ARM when v != GRPMSK, the following
instructions ordering is not guarenteed because of strict aliasing.

lp[0] += gptbl[v].lpv;
pnum->u64 = gptbl[v].pnum;
pnum->u16[FWDSTEP] = 1;
lp = pnum->u16 + gptbl[v].idx;

That results in wrong lp[0] updation.
memcpy in this case will avoid this problem.

> In any case I don't think using rte_memcpy is a good thing to use here:
> it is a huge inline function - way too much to copy just 64 bit variable.

I agree that rte_memcpy is overhead in this case but how about using
memcpy that will not use library implementation if the size is constant.
memcpy with constant size uses built_in_memcpy that does not add
performance overhead.

Thoughts?

> Konstantin
>
> >  	}
> >
> >  	return lp;
> > --
> > 2.14.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-02 15:33   ` Guduri Prathyusha
@ 2017-11-02 15:52     ` Ananyev, Konstantin
  2017-11-02 17:38       ` Prathyusha, Guduri
  2017-11-03  3:21       ` Jianbo.Liu
  0 siblings, 2 replies; 7+ messages in thread
From: Ananyev, Konstantin @ 2017-11-02 15:52 UTC (permalink / raw)
  To: Guduri Prathyusha; +Cc: dev, Jianbo.Liu, guduriprathyusha, Kantecki, Tomasz



> -----Original Message-----
> From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> Sent: Thursday, November 2, 2017 3:34 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> 
> On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > Hi,
> Hi
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 2:31 PM
> > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > pointers are assumed to be pointing to different memory and compiler
> > > reorders instructions that depend on pnum and pn. This breaks port
> > > grouping algorithm.
> > >
> > > This patch eliminates the usage of union and uses memcpy for copying
> > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > not call its library implementation but uses appropriate LD and ST
> > > instructions directly and hence no performance overhead.
> > >
> > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > ---
> > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > index 4bc161394..10a602a04 100644
> > > --- a/examples/l3fwd/l3fwd_neon.h
> > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >  	     uint16x8_t dp2)
> > >  {
> > > -	union {
> > > -		uint16_t u16[FWDSTEP + 1];
> > > -		uint64_t u64;
> > > -	} *pnum = (void *)pn;
> > > -
> > >  	int32_t v;
> > >  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > >
> > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >
> > >  	/* if dest port value has changed. */
> > >  	if (v != GRPMSK) {
> > > -		pnum->u64 = gptbl[v].pnum;
> > > -		pnum->u16[FWDSTEP] = 1;
> > > -		lp = pnum->u16 + gptbl[v].idx;
> > > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +		pn[FWDSTEP] = 1;
> > > +		lp = pn + gptbl[v].idx;
> > >  	}
> > >
> > >  	return lp;
> > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > index 831760f02..79a71d77e 100644
> > > --- a/examples/l3fwd/l3fwd_sse.h
> > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > >  static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >  {
> > > -	union {
> > > -		uint16_t u16[FWDSTEP + 1];
> > > -		uint64_t u64;
> > > -	} *pnum = (void *)pn;
> > > -
> > >  	int32_t v;
> > >
> > >  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >
> > >  	/* if dest port value has changed. */
> > >  	if (v != GRPMSK) {
> > > -		pnum->u64 = gptbl[v].pnum;
> > > -		pnum->u16[FWDSTEP] = 1;
> > > -		lp = pnum->u16 + gptbl[v].idx;
> > > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +		pn[FWDSTEP] = 1;
> > > +		lp = pn + gptbl[v].idx;
> >
> > Could you explain a bit more here - which exactly instructions were reordered
> > and what kind of problems did it cause?
> > Specially on IA?
> 
> This issue is observed on ARM since ARM gcc is more aggressive in
> reordering than x86 gcc.

Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
Unless there is a reproducible problem with x86 -
my preference would be to keep that file intact.

> In ARM when v != GRPMSK, the following
> instructions ordering is not guarenteed because of strict aliasing.
> 
> lp[0] += gptbl[v].lpv;
> pnum->u64 = gptbl[v].pnum;
> pnum->u16[FWDSTEP] = 1;
> lp = pnum->u16 + gptbl[v].idx;

Ok, so what in particular is reordered by the compiler:

 lp[0] += gptbl[v].lpv; (1)
 pnum->u64 = gptbl[v].pnum; (2)
 pnum->u16[FWDSTEP] = 1;   (3)
 lp = pnum->u16 + gptbl[v].idx; (4)

(2) and (3)?
If so I am not sure how it could be a problem:
they do stores to the different locations.
(1) and (4) as I can see shouldn't be reordered.
Anyway - if you think this a compiler reordering issue,
then adding rte_compiler_barrier() should fix the issue, right?

> 
> That results in wrong lp[0] updation.
> memcpy in this case will avoid this problem.
> 
> > In any case I don't think using rte_memcpy is a good thing to use here:
> > it is a huge inline function - way too much to copy just 64 bit variable.
> 
> I agree that rte_memcpy is overhead in this case but how about using
> memcpy that will not use library implementation if the size is constant.
> memcpy with constant size uses built_in_memcpy that does not add
> performance overhead.

On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
ib/librte_eal/common/include/arch/x86/rte_memcpy.h

> 
> Thoughts?

As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
If it does (still not sure how) - check would compiler barrier help here.
Konstantin 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-02 15:52     ` Ananyev, Konstantin
@ 2017-11-02 17:38       ` Prathyusha, Guduri
  2017-11-03  3:21       ` Jianbo.Liu
  1 sibling, 0 replies; 7+ messages in thread
From: Prathyusha, Guduri @ 2017-11-02 17:38 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev, Jianbo.Liu, guduriprathyusha, Kantecki, Tomasz


Hi,

Some issue with my mutt command line and hence apologies for unpleasant formatting in the mail. Please see inline

________________________________
From: dev <dev-bounces@dpdk.org> on behalf of Ananyev, Konstantin <konstantin.ananyev@intel.com>
Sent: Thursday, November 2, 2017 9:22 PM
To: Prathyusha, Guduri
Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz
Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping



> -----Original Message-----
> From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> Sent: Thursday, November 2, 2017 3:34 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
>
> On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > Hi,
> Hi
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 2:31 PM
> > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > pointers are assumed to be pointing to different memory and compiler
> > > reorders instructions that depend on pnum and pn. This breaks port
> > > grouping algorithm.
> > >
> > > This patch eliminates the usage of union and uses memcpy for copying
> > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > not call its library implementation but uses appropriate LD and ST
> > > instructions directly and hence no performance overhead.
> > >
> > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > ---
> > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > index 4bc161394..10a602a04 100644
> > > --- a/examples/l3fwd/l3fwd_neon.h
> > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >         uint16x8_t dp2)
> > >  {
> > > - union {
> > > -         uint16_t u16[FWDSTEP + 1];
> > > -         uint64_t u64;
> > > - } *pnum = (void *)pn;
> > > -
> > >    int32_t v;
> > >    uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > >
> > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >
> > >    /* if dest port value has changed. */
> > >    if (v != GRPMSK) {
> > > -         pnum->u64 = gptbl[v].pnum;
> > > -         pnum->u16[FWDSTEP] = 1;
> > > -         lp = pnum->u16 + gptbl[v].idx;
> > > +         rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +         pn[FWDSTEP] = 1;
> > > +         lp = pn + gptbl[v].idx;
> > >    }
> > >
> > >    return lp;
> > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > index 831760f02..79a71d77e 100644
> > > --- a/examples/l3fwd/l3fwd_sse.h
> > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > >  static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >  {
> > > - union {
> > > -         uint16_t u16[FWDSTEP + 1];
> > > -         uint64_t u64;
> > > - } *pnum = (void *)pn;
> > > -
> > >    int32_t v;
> > >
> > >    dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >
> > >    /* if dest port value has changed. */
> > >    if (v != GRPMSK) {
> > > -         pnum->u64 = gptbl[v].pnum;
> > > -         pnum->u16[FWDSTEP] = 1;
> > > -         lp = pnum->u16 + gptbl[v].idx;
> > > +         rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +         pn[FWDSTEP] = 1;
> > > +         lp = pn + gptbl[v].idx;
> >
> > Could you explain a bit more here - which exactly instructions were reordered
> > and what kind of problems did it cause?
> > Specially on IA?
>
> This issue is observed on ARM since ARM gcc is more aggressive in
> reordering than x86 gcc.

Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
Unless there is a reproducible problem with x86 -
my preference would be to keep that file intact.

> In ARM when v != GRPMSK, the following
> instructions ordering is not guarenteed because of strict aliasing.
>
> lp[0] += gptbl[v].lpv;
> pnum->u64 = gptbl[v].pnum;
> pnum->u16[FWDSTEP] = 1;
> lp = pnum->u16 + gptbl[v].idx;

Ok, so what in particular is reordered by the compiler:

 lp[0] += gptbl[v].lpv; (1)
 pnum->u64 = gptbl[v].pnum; (2)
 pnum->u16[FWDSTEP] = 1;   (3)
 lp = pnum->u16 + gptbl[v].idx; (4)

(2) and (3)?
If so I am not sure how it could be a problem:
they do stores to the different locations.
(1) and (4) as I can see shouldn't be reordered.

Anyway - if you think this a compiler reordering issue,
then adding rte_compiler_barrier() should fix the issue, right?

[prathyusha] : Yes, adding a rte_compiler_barrier() fixes the issue in
case of ARM. We think it is needed in x86 case also but If you still
think to keep l3fwd_sse.h in tact then I will not modify l3fwd_sse.h in V2.

But certainly change l3fwd_neon.h to use rte_compiler_barrier()

Let me know what you prefer will spin a V2 accordingly.

Thanks,
Prathyusha

>
> That results in wrong lp[0] updation.
> memcpy in this case will avoid this problem.
>
> > In any case I don't think using rte_memcpy is a good thing to use here:
> > it is a huge inline function - way too much to copy just 64 bit variable.
>
> I agree that rte_memcpy is overhead in this case but how about using
> memcpy that will not use library implementation if the size is constant.
> memcpy with constant size uses built_in_memcpy that does not add
> performance overhead.

On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
ib/librte_eal/common/include/arch/x86/rte_memcpy.h

>
> Thoughts?

As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
If it does (still not sure how) - check would compiler barrier help here.
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-02 15:52     ` Ananyev, Konstantin
  2017-11-02 17:38       ` Prathyusha, Guduri
@ 2017-11-03  3:21       ` Jianbo.Liu
  2017-11-03  5:42         ` Guduri Prathyusha
  1 sibling, 1 reply; 7+ messages in thread
From: Jianbo.Liu @ 2017-11-03  3:21 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Guduri Prathyusha, dev, guduriprathyusha, Kantecki, Tomasz

The 11/02/2017 15:52, Ananyev, Konstantin wrote:
>
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 3:34 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > Hi,
> > Hi
> > >
> > > > -----Original Message-----
> > > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > >
> > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives

May I ask the detail version about the gcc you are using?

> > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > > pointers are assumed to be pointing to different memory and compiler
> > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > grouping algorithm.
> > > >
> > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > > not call its library implementation but uses appropriate LD and ST
> > > > instructions directly and hence no performance overhead.
> > > >
> > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > > ---
> > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > index 4bc161394..10a602a04 100644
> > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >              uint16x8_t dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > >
> > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > > >         }
> > > >
> > > >         return lp;
> > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > index 831760f02..79a71d77e 100644
> > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > >  static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >
> > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > >
> > > Could you explain a bit more here - which exactly instructions were reordered
> > > and what kind of problems did it cause?
> > > Specially on IA?
> >
> > This issue is observed on ARM since ARM gcc is more aggressive in
> > reordering than x86 gcc.
>
> Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> Unless there is a reproducible problem with x86 -
> my preference would be to keep that file intact.
>
> > In ARM when v != GRPMSK, the following
> > instructions ordering is not guarenteed because of strict aliasing.
> >
> > lp[0] += gptbl[v].lpv;
> > pnum->u64 = gptbl[v].pnum;
> > pnum->u16[FWDSTEP] = 1;
> > lp = pnum->u16 + gptbl[v].idx;
>
> Ok, so what in particular is reordered by the compiler:
>
>  lp[0] += gptbl[v].lpv; (1)
>  pnum->u64 = gptbl[v].pnum; (2)
>  pnum->u16[FWDSTEP] = 1;   (3)
>  lp = pnum->u16 + gptbl[v].idx; (4)
>
> (2) and (3)?
> If so I am not sure how it could be a problem:
> they do stores to the different locations.
> (1) and (4) as I can see shouldn't be reordered.
> Anyway - if you think this a compiler reordering issue,
> then adding rte_compiler_barrier() should fix the issue, right?

Agree.

>
> >
> > That results in wrong lp[0] updation.
> > memcpy in this case will avoid this problem.
> >
> > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > it is a huge inline function - way too much to copy just 64 bit variable.
> >
> > I agree that rte_memcpy is overhead in this case but how about using
> > memcpy that will not use library implementation if the size is constant.
> > memcpy with constant size uses built_in_memcpy that does not add
> > performance overhead.
>
> On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> ib/librte_eal/common/include/arch/x86/rte_memcpy.h
>
> >
> > Thoughts?
>
> As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> If it does (still not sure how) - check would compiler barrier help here.
> Konstantin
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ] examples/l3fwd: fix aliasing in port grouping
  2017-11-03  3:21       ` Jianbo.Liu
@ 2017-11-03  5:42         ` Guduri Prathyusha
  0 siblings, 0 replies; 7+ messages in thread
From: Guduri Prathyusha @ 2017-11-03  5:42 UTC (permalink / raw)
  To: konstantin.ananyev
  Cc: dev, guduri.prathyusha, guduriprathyusha, tomasz.kantecki,
	konstantin.ananyev

On Fri, Nov 03, 2017 at 11:21:43AM +0800, Jianbo.Liu@arm.com wrote:
> The 11/02/2017 15:52, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 3:34 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > > Hi,
> > > Hi
> > > >
> > > > > -----Original Message-----
> > > > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > > >
> > > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
>
> May I ask the detail version about the gcc you are using?
>
Am using gcc provided by ubuntu rootfs

root@localhost:~/dpdk# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/5/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr --program-suffix=-5 --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-arm64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-arm64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-arm64
--with-arch-directory=aarch64
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu
--host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)

> > > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > > > pointers are assumed to be pointing to different memory and compiler
> > > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > > grouping algorithm.
> > > > >
> > > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > > > not call its library implementation but uses appropriate LD and ST
> > > > > instructions directly and hence no performance overhead.
> > > > >
> > > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > > > ---
> > > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > > index 4bc161394..10a602a04 100644
> > > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >              uint16x8_t dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > > >
> > > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > > >         }
> > > > >
> > > > >         return lp;
> > > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > > index 831760f02..79a71d77e 100644
> > > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > > >  static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >
> > > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > >
> > > > Could you explain a bit more here - which exactly instructions were reordered
> > > > and what kind of problems did it cause?
> > > > Specially on IA?
> > >
> > > This issue is observed on ARM since ARM gcc is more aggressive in
> > > reordering than x86 gcc.
> >
> > Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> > Unless there is a reproducible problem with x86 -
> > my preference would be to keep that file intact.
> >
> > > In ARM when v != GRPMSK, the following
> > > instructions ordering is not guarenteed because of strict aliasing.
> > >
> > > lp[0] += gptbl[v].lpv;
> > > pnum->u64 = gptbl[v].pnum;
> > > pnum->u16[FWDSTEP] = 1;
> > > lp = pnum->u16 + gptbl[v].idx;
> >
> > Ok, so what in particular is reordered by the compiler:
> >
> >  lp[0] += gptbl[v].lpv; (1)
> >  pnum->u64 = gptbl[v].pnum; (2)
> >  pnum->u16[FWDSTEP] = 1;   (3)
> >  lp = pnum->u16 + gptbl[v].idx; (4)
> >
> > (2) and (3)?
> > If so I am not sure how it could be a problem:
> > they do stores to the different locations.
> > (1) and (4) as I can see shouldn't be reordered.
> > Anyway - if you think this a compiler reordering issue,
> > then adding rte_compiler_barrier() should fix the issue, right?
>
> Agree.
>
Yes, rte_compiler_barrier() solves the issue. I will spin a V2 if you
are ok with it.
> >
> > >
> > > That results in wrong lp[0] updation.
> > > memcpy in this case will avoid this problem.
> > >
> > > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > > it is a huge inline function - way too much to copy just 64 bit variable.
> > >
> > > I agree that rte_memcpy is overhead in this case but how about using
> > > memcpy that will not use library implementation if the size is constant.
> > > memcpy with constant size uses built_in_memcpy that does not add
> > > performance overhead.
> >
> > On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> > ib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >
> > >
> > > Thoughts?
> >
> > As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> > If it does (still not sure how) - check would compiler barrier help here.
> > Konstantin
> >
>
> --
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-11-03  5:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-02 14:31 [PATCH ] examples/l3fwd: fix aliasing in port grouping Guduri Prathyusha
2017-11-02 14:46 ` Ananyev, Konstantin
2017-11-02 15:33   ` Guduri Prathyusha
2017-11-02 15:52     ` Ananyev, Konstantin
2017-11-02 17:38       ` Prathyusha, Guduri
2017-11-03  3:21       ` Jianbo.Liu
2017-11-03  5:42         ` Guduri Prathyusha

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.