Re: [PATCH 3/4] eal/arm: Enable lpm/table/pipeline libs

From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
To: Jianbo Liu <jianbo.liu@linaro.org>
Cc: dev@dpdk.org
Subject: Re: [PATCH 3/4] eal/arm: Enable lpm/table/pipeline libs
Date: Wed, 2 Dec 2015 13:33:03 +0530	[thread overview]
Message-ID: <20151202080259.GA32494@localhost.localdomain> (raw)
In-Reply-To: <CAP4Qi3-5ofDU-2-4KsxFzMC1OpTsc5WjmxcFT2Eu_URA0UBzDw@mail.gmail.com>

On Wed, Dec 02, 2015 at 02:54:52PM +0800, Jianbo Liu wrote:
> On 2 December 2015 at 00:41, Jerin Jacob <jerin.jacob@caviumnetworks.com> wrote:
> > On Tue, Dec 01, 2015 at 01:41:15PM -0500, Jianbo Liu wrote:
> >> Adds ARM NEON support for lpm.
> >> And enables table/pipeline libraries which depend on lpm.
> >
> > I already sent the patch on the same yesterday.
> > We can converge the patches after the discussion.
> > Please check "[dpdk-dev] [PATCH 0/3] add lpm support for NEON" on ml
> >
> Yes, I have read your patch. But there are many differences, so I sent
> mine for your reviewing :)
> 
> >
> >>
> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >> ---
> >>  config/defconfig_arm-armv7a-linuxapp-gcc          |  3 -
> >>  config/defconfig_arm64-armv8a-linuxapp-gcc        |  3 -
> >>  lib/librte_eal/common/include/arch/arm/rte_vect.h | 28 ++++++++++
> >>  lib/librte_lpm/rte_lpm.h                          | 68 ++++++++++++++++-------
> >>  4 files changed, 77 insertions(+), 25 deletions(-)
> >>
> >> diff --git a/config/defconfig_arm-armv7a-linuxapp-gcc b/config/defconfig_arm-armv7a-linuxapp-gcc
> >> index cbebd64..efffa1f 100644
> >> --- a/config/defconfig_arm-armv7a-linuxapp-gcc
> >> +++ b/config/defconfig_arm-armv7a-linuxapp-gcc
> >> @@ -53,9 +53,6 @@ CONFIG_RTE_LIBRTE_KNI=n
> >>  CONFIG_RTE_EAL_IGB_UIO=n
> >>
> >>  # fails to compile on ARM
> >> -CONFIG_RTE_LIBRTE_LPM=n
> >> -CONFIG_RTE_LIBRTE_TABLE=n
> >> -CONFIG_RTE_LIBRTE_PIPELINE=n
> >>  CONFIG_RTE_SCHED_VECTOR=n
> >>
> >>  # cannot use those on ARM
> >> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc
> >> index 504f3ed..57f7941 100644
> >> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc
> >> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
> >> @@ -51,7 +51,4 @@ CONFIG_RTE_LIBRTE_IVSHMEM=n
> >>  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> >>  CONFIG_RTE_LIBRTE_I40E_PMD=n
> >>
> >> -CONFIG_RTE_LIBRTE_LPM=n
> >> -CONFIG_RTE_LIBRTE_TABLE=n
> >> -CONFIG_RTE_LIBRTE_PIPELINE=n
> >>  CONFIG_RTE_SCHED_VECTOR=n
> >> diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h b/lib/librte_eal/common/include/arch/arm/rte_vect.h
> >> index a33c054..7437711 100644
> >> --- a/lib/librte_eal/common/include/arch/arm/rte_vect.h
> >> +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h
> >> @@ -41,6 +41,8 @@ extern "C" {
> >>
> >>  typedef int32x4_t xmm_t;
> >>
> >> +typedef int32x4_t __m128i;
> >> +
> >>  #define      XMM_SIZE        (sizeof(xmm_t))
> >>  #define      XMM_MASK        (XMM_SIZE - 1)
> >>
> >> @@ -53,6 +55,32 @@ typedef union rte_xmm {
> >>       double   pd[XMM_SIZE / sizeof(double)];
> >>  } __attribute__((aligned(16))) rte_xmm_t;
> >>
> >> +static __inline __m128i
> >> +_mm_set_epi32(int i3, int i2, int i1, int i0)
> >> +{
> >> +     int32_t r[4] = {i0, i1, i2, i3};
> >> +
> >> +     return vld1q_s32(r);
> >> +}
> >> +
> >> +static __inline __m128i
> >> +_mm_loadu_si128(__m128i *p)
> >> +{
> >> +     return vld1q_s32((int32_t *)p);
> >> +}
> >> +
> >> +static __inline __m128i
> >> +_mm_set1_epi32(int i)
> >> +{
> >> +     return vdupq_n_s32(i);
> >> +}
> >> +
> >> +static __inline __m128i
> >> +_mm_and_si128(__m128i a, __m128i b)
> >> +{
> >> +     return vandq_s32(a, b);
> >> +}
> >> +

IMO, it's not always good to emulate GCC defined intrinsics of
other architecture. What if a legacy DPDK application has such mappings
then BOOM, multiple definition, which one is correct? which one
to comment it out? Integration pain starts for DPDK library consumer:-(

> >
> > IMO, it makes sense to not emulate the SSE intrinsics with NEON
> > Let's create the rte_vect_* as required. look at the existing patch.
> >
> I thought of creating a layer of SIMD over all the platforms before.
> But can't you see it make things complicated, considering there are
> only few simple intrinsic to implement?

Not true, There were, a lot of SSE intrinsics needs be to emulated for ACL NEON
implementation if I were to take this approach and emulation comes with
the cost.

So my take is,
lets the each architecture implementation for specific SIMD version of DPDK
API in the library should have the freedom to implement the API in
NATIVE.

And let's create only rte_vect_* abstraction only for using
that API/library. Which boils down to have very minimal rte_vect_*
abstraction to load, store, set not beyond that.

This makes clear "contract" between DPDK library and the applications.
and make easy for remaning new architecture  porting effort in DPDK.

Imagine how your proposed function will look like if new architecture
wants to implement "optimized" version of rte_lpm_lookupx4

> If do so, we also need to explain to others how to use these interfaces.
> Besides, this patch did the smallest changes to the original code, and
> more likely to be accepted by others.

other patch makes no changes to IA version of rte_lpm_lookupx4.I thought
that make reviewer easy to review the changes in architecture
perspective.

> 
> >
> >>  #ifdef RTE_ARCH_ARM
> >>  /* NEON intrinsic vqtbl1q_u8() is not supported in ARMv7-A(AArch32) */
> >>  static __inline uint8x16_t
> >> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
> >> index c299ce2..c76c07d 100644
> >> --- a/lib/librte_lpm/rte_lpm.h
> >> +++ b/lib/librte_lpm/rte_lpm.h
> >> @@ -361,6 +361,47 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips,
> >>  /* Mask four results. */
> >>  #define       RTE_LPM_MASKX4_RES     UINT64_C(0x00ff00ff00ff00ff)
> >>
> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
> >
> > Separate out arm implementation to the different header file.
> > Too many ifdef looks odd in the header file and difficult to manage.
> >
> But there are many ifdefs already.
> And It seems unreasonable to add a new file only for one small function.
> 

small or big, its matter of each architecture to have
the freedom for the optimized version for the implementation.

What if  other architecture demands to write this function in assembly
or restructure it for performance improvement?

> >
> >> +static inline void
> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t tbl[4])
> >> +{
> >> +     uint32x4_t i24;
> >> +     uint32_t idx[4];
> >> +
> >> +     /* get 4 indexes for tbl24[]. */
> >> +     i24 = vshrq_n_u32(vreinterpretq_u32_s32(ip), CHAR_BIT);
> >> +     vst1q_u32(idx, i24);
> >> +
> >> +     /* extract values from tbl24[] */
> >> +     tbl[0] = *(const uint16_t *)&lpm->tbl24[idx[0]];
> >> +     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx[1]];
> >> +     tbl[2] = *(const uint16_t *)&lpm->tbl24[idx[2]];
> >> +     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx[3]];
> >> +}
> >
> > Nice. There is an improvement in this portion code wrt my patch. This is
> > a candidate for convergence.
> >
> >
> >> +#else
> >> +static inline void
> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, __m128i ip, uint16_t tbl[4])
> >> +{
> >> +     __m128i i24;
> >> +     uint64_t idx;
> >> +
> >> +     /* get 4 indexes for tbl24[]. */
> >> +     i24 = _mm_srli_epi32(ip, CHAR_BIT);
> >> +
> >> +     /* extract values from tbl24[] */
> >> +     idx = _mm_cvtsi128_si64(i24);
> >> +     i24 = _mm_srli_si128(i24, sizeof(uint64_t));
> >> +
> >> +     tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> >> +     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
> >> +
> >> +     idx = _mm_cvtsi128_si64(i24);
> >> +
> >> +     tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> >> +     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
> >> +}
> >> +#endif
> >> +
> >>  /**
> >>   * Lookup four IP addresses in an LPM table.
> >>   *
> >> @@ -381,17 +422,19 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips,
> >>   *   if lookup would fail.
> >>   */
> >>  static inline void
> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
> >> +rte_lpm_lookupx4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t hop[4],
> >> +     uint16_t defv)
> >
> > This would call for change in the change the ABI,
> > IMO, __m128i can be used to represent 128bit vector to avoid ABI chang
> >
> This redefine rte_lpm_lookupx4 is unncessary, I will remove it, so no
> ABI change.
> And there only one ifdef for ARM platforms left.
> 
> >
> >> +#else
> > separate out arm implementation to the different header file. Too many
> > ifdef looks odd in the header file.
> >
> > Could you  rebase your patch based on existing patch and send the
> > improvement portion as separate patch or I can send update patch with
> > your improvements and with your signoff.
> >
> >
> >>  rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4],
> >>       uint16_t defv)
> >> +#endif
> >>  {
> >> -     __m128i i24;
> >>       rte_xmm_t i8;
> >>       uint16_t tbl[4];
> >> -     uint64_t idx, pt;
> >> -
> >> -     const __m128i mask8 =
> >> -             _mm_set_epi32(UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX);
> >> +     uint64_t pt;
> >>
> >> +     const __m128i mask8 = _mm_set1_epi32(UINT8_MAX);
> >>       /*
> >>        * RTE_LPM_VALID_EXT_ENTRY_BITMASK for 4 LPM entries
> >>        * as one 64-bit value (0x0300030003000300).
> >> @@ -412,20 +455,7 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4],
> >>               (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 32 |
> >>               (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 48);
> >>
> >> -     /* get 4 indexes for tbl24[]. */
> >> -     i24 = _mm_srli_epi32(ip, CHAR_BIT);
> >> -
> >> -     /* extract values from tbl24[] */
> >> -     idx = _mm_cvtsi128_si64(i24);
> >> -     i24 = _mm_srli_si128(i24, sizeof(uint64_t));
> >> -
> >> -     tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> >> -     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
> >> -
> >> -     idx = _mm_cvtsi128_si64(i24);
> >> -
> >> -     tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> >> -     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
> >> +     rte_lpm_tbl24_val4(lpm, ip, tbl);
> >>
> >>       /* get 4 indexes for tbl8[]. */
> >>       i8.x = _mm_and_si128(ip, mask8);
> >> --
> >> 1.8.3.1
> >>