From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jianbo Liu Subject: Re: [PATCH 3/4] eal/arm: Enable lpm/table/pipeline libs Date: Wed, 2 Dec 2015 21:13:51 +0800 Message-ID: References: <1448995276-9599-1-git-send-email-jianbo.liu@linaro.org> <1448995276-9599-4-git-send-email-jianbo.liu@linaro.org> <20151201164139.GA12144@localhost.localdomain> <20151202080259.GA32494@localhost.localdomain> <20151202103903.GA4940@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: dev@dpdk.org To: Jerin Jacob Return-path: Received: from mail-vk0-f47.google.com (mail-vk0-f47.google.com [209.85.213.47]) by dpdk.org (Postfix) with ESMTP id C9F552E81 for ; Wed, 2 Dec 2015 14:13:51 +0100 (CET) Received: by vkha189 with SMTP id a189so24060501vkh.2 for ; Wed, 02 Dec 2015 05:13:51 -0800 (PST) In-Reply-To: <20151202103903.GA4940@localhost.localdomain> List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 2 December 2015 at 18:39, Jerin Jacob wrote: > On Wed, Dec 02, 2015 at 05:49:41PM +0800, Jianbo Liu wrote: >> On 2 December 2015 at 16:03, Jerin Jacob wrote: >> > On Wed, Dec 02, 2015 at 02:54:52PM +0800, Jianbo Liu wrote: >> >> On 2 December 2015 at 00:41, Jerin Jacob wrote: >> >> > On Tue, Dec 01, 2015 at 01:41:15PM -0500, Jianbo Liu wrote: >> >> >> Adds ARM NEON support for lpm. >> >> >> And enables table/pipeline libraries which depend on lpm. >> >> > >> >> > I already sent the patch on the same yesterday. >> >> > We can converge the patches after the discussion. >> >> > Please check "[dpdk-dev] [PATCH 0/3] add lpm support for NEON" on ml >> >> > >> >> Yes, I have read your patch. But there are many differences, so I sent >> >> mine for your reviewing :) >> >> >> >> > >> >> >> >> >> >> Signed-off-by: Jianbo Liu >> >> >> --- >> >> >> config/defconfig_arm-armv7a-linuxapp-gcc | 3 - >> >> >> config/defconfig_arm64-armv8a-linuxapp-gcc | 3 - >> >> >> lib/librte_eal/common/include/arch/arm/rte_vect.h | 28 ++++++++++ >> >> >> lib/librte_lpm/rte_lpm.h | 68 ++++++++++++++++------- >> >> >> 4 files changed, 77 insertions(+), 25 deletions(-) >> >> >> >> >> >> diff --git a/config/defconfig_arm-armv7a-linuxapp-gcc b/config/defconfig_arm-armv7a-linuxapp-gcc >> >> >> index cbebd64..efffa1f 100644 >> >> >> --- a/config/defconfig_arm-armv7a-linuxapp-gcc >> >> >> +++ b/config/defconfig_arm-armv7a-linuxapp-gcc >> >> >> @@ -53,9 +53,6 @@ CONFIG_RTE_LIBRTE_KNI=n >> >> >> CONFIG_RTE_EAL_IGB_UIO=n >> >> >> >> >> >> # fails to compile on ARM >> >> >> -CONFIG_RTE_LIBRTE_LPM=n >> >> >> -CONFIG_RTE_LIBRTE_TABLE=n >> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n >> >> >> CONFIG_RTE_SCHED_VECTOR=n >> >> >> >> >> >> # cannot use those on ARM >> >> >> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc >> >> >> index 504f3ed..57f7941 100644 >> >> >> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc >> >> >> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc >> >> >> @@ -51,7 +51,4 @@ CONFIG_RTE_LIBRTE_IVSHMEM=n >> >> >> CONFIG_RTE_LIBRTE_FM10K_PMD=n >> >> >> CONFIG_RTE_LIBRTE_I40E_PMD=n >> >> >> >> >> >> -CONFIG_RTE_LIBRTE_LPM=n >> >> >> -CONFIG_RTE_LIBRTE_TABLE=n >> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n >> >> >> CONFIG_RTE_SCHED_VECTOR=n >> >> >> diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h b/lib/librte_eal/common/include/arch/arm/rte_vect.h >> >> >> index a33c054..7437711 100644 >> >> >> --- a/lib/librte_eal/common/include/arch/arm/rte_vect.h >> >> >> +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h >> >> >> @@ -41,6 +41,8 @@ extern "C" { >> >> >> >> >> >> typedef int32x4_t xmm_t; >> >> >> >> >> >> +typedef int32x4_t __m128i; >> >> >> + >> >> >> #define XMM_SIZE (sizeof(xmm_t)) >> >> >> #define XMM_MASK (XMM_SIZE - 1) >> >> >> >> >> >> @@ -53,6 +55,32 @@ typedef union rte_xmm { >> >> >> double pd[XMM_SIZE / sizeof(double)]; >> >> >> } __attribute__((aligned(16))) rte_xmm_t; >> >> >> >> >> >> +static __inline __m128i >> >> >> +_mm_set_epi32(int i3, int i2, int i1, int i0) >> >> >> +{ >> >> >> + int32_t r[4] = {i0, i1, i2, i3}; >> >> >> + >> >> >> + return vld1q_s32(r); >> >> >> +} >> >> >> + >> >> >> +static __inline __m128i >> >> >> +_mm_loadu_si128(__m128i *p) >> >> >> +{ >> >> >> + return vld1q_s32((int32_t *)p); >> >> >> +} >> >> >> + >> >> >> +static __inline __m128i >> >> >> +_mm_set1_epi32(int i) >> >> >> +{ >> >> >> + return vdupq_n_s32(i); >> >> >> +} >> >> >> + >> >> >> +static __inline __m128i >> >> >> +_mm_and_si128(__m128i a, __m128i b) >> >> >> +{ >> >> >> + return vandq_s32(a, b); >> >> >> +} >> >> >> + >> > >> > IMO, it's not always good to emulate GCC defined intrinsics of >> > other architecture. What if a legacy DPDK application has such mappings >> > then BOOM, multiple definition, which one is correct? which one >> > to comment it out? Integration pain starts for DPDK library consumer:-( >> > >> They can include rte_vect.h in build/include directly, which is linked correctly >> to the one for that ARCH, so there is no need to worry about. > > I think you missed the point,I was trying to say that > legacy DPDK application and third party stacks uses SSE2NEON kind of > libraries > for quick integration, for example, something like this > https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h > > AND they include "rte_lpm.h"(it internally includes rte_vect.h) > that lead to multiple definition and its not good. > But you will have similar issue since "typedef int32x4_t __m128i" appears in both your patch and this header file. >> >> >> >> > >> >> > IMO, it makes sense to not emulate the SSE intrinsics with NEON >> >> > Let's create the rte_vect_* as required. look at the existing patch. >> >> > >> >> I thought of creating a layer of SIMD over all the platforms before. >> >> But can't you see it make things complicated, considering there are >> >> only few simple intrinsic to implement? >> > >> > Not true, There were, a lot of SSE intrinsics needs be to emulated for ACL NEON >> > implementation if I were to take this approach and emulation comes with >> > the cost. >> > >> No, I will not re-implement all the intrinsic like that . >> I only do with the simple intrinsic, such as load/store, as you said below. > > but you forced to add _mm_and_si128 also to the list and emulated > _mm_and_si128 intrinsic. Am just saying no emulation. > I means simple intrinsic, not load/store only. Depends on how you define emulation. Actually, these simple intrisinic could be only one NEON instruction, and will not bring cost. > >> >> > So my take is, >> > lets the each architecture implementation for specific SIMD version of DPDK >> > API in the library should have the freedom to implement the API in >> > NATIVE. >> > >> > And let's create only rte_vect_* abstraction only for using >> > that API/library. Which boils down to have very minimal rte_vect_* >> > abstraction to load, store, set not beyond that. >> > >> > This makes clear "contract" between DPDK library and the applications. >> > and make easy for remaning new architecture porting effort in DPDK. >> > >> Agree. >> But I reuse existing intrinsic names, and you recreate new ones. >> And I try to do as few changes as possible, and try to avoid any >> mistaken which may cause code un-compiled. > > Its trival to verify. Just compile it > >> I think it's design level question, we need to hear what others talk about it. >> >> > Imagine how your proposed function will look like if new architecture >> > wants to implement "optimized" version of rte_lpm_lookupx4 >> > >> There is no optimization for this (simple) rte_lpm_lookupx4, otherwise >> you have done that in your patch. >> If there is for other new platform, defintely they should do like >> yours, as you did for NEON ACL. >> >> > >> >> If do so, we also need to explain to others how to use these interfaces. >> >> Besides, this patch did the smallest changes to the original code, and >> >> more likely to be accepted by others. >> > >> > other patch makes no changes to IA version of rte_lpm_lookupx4.I thought >> > that make reviewer easy to review the changes in architecture >> > perspective. >> > >> As I know, they don't enable LPM for PPC, and ARM is the first one to >> touch this issue. >> >> >> >> >> > >> >> >> #ifdef RTE_ARCH_ARM >> >> >> /* NEON intrinsic vqtbl1q_u8() is not supported in ARMv7-A(AArch32) */ >> >> >> static __inline uint8x16_t >> >> >> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h >> >> >> index c299ce2..c76c07d 100644 >> >> >> --- a/lib/librte_lpm/rte_lpm.h >> >> >> +++ b/lib/librte_lpm/rte_lpm.h >> >> >> @@ -361,6 +361,47 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips, >> >> >> /* Mask four results. */ >> >> >> #define RTE_LPM_MASKX4_RES UINT64_C(0x00ff00ff00ff00ff) >> >> >> >> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64) >> >> > >> >> > Separate out arm implementation to the different header file. >> >> > Too many ifdef looks odd in the header file and difficult to manage. >> >> > >> >> But there are many ifdefs already. >> >> And It seems unreasonable to add a new file only for one small function. >> >> >> > >> > small or big, its matter of each architecture to have >> > the freedom for the optimized version for the implementation. >> > >> > What if other architecture demands to write this function in assembly >> > or restructure it for performance improvement? >> > >> If there is such demands, should do like that. >> But I don't see any restructure in your patch, and you still follow >> the logic as x86, is it worth adding a new file? > > SIMD Logic on getting 4 indexes for tbl24[] is different. > > /* get 4 indexes for tbl24[]. */ > i24 = _mm_srli_epi32(ip, CHAR_BIT); > > /* extract values from tbl24[] */ > idx = _mm_cvtsi128_si64(i24); > i24 = _mm_srli_si128(i24, sizeof(uint64_t)); > > tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > idx = _mm_cvtsi128_si64(i24); > > tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > VS > > /* extract values from tbl24[] */ > idx = vgetq_lane_u64((uint64x2_t)i24, 0); > > tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > idx = vgetq_lane_u64((uint64x2_t)i24, 1); > > tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > It's only the optimazation of part of code in that function. I did the similar in my patch. But, looking from the whole, this function is not restructured, and the logic is the same as x86. >> >> > >> >> > >> >> >> +static inline void >> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t tbl[4]) >> >> >> +{ >> >> >> + uint32x4_t i24; >> >> >> + uint32_t idx[4]; >> >> >> + >> >> >> + /* get 4 indexes for tbl24[]. */ >> >> >> + i24 = vshrq_n_u32(vreinterpretq_u32_s32(ip), CHAR_BIT); >> >> >> + vst1q_u32(idx, i24); >> >> >> + >> >> >> + /* extract values from tbl24[] */ >> >> >> + tbl[0] = *(const uint16_t *)&lpm->tbl24[idx[0]]; >> >> >> + tbl[1] = *(const uint16_t *)&lpm->tbl24[idx[1]]; >> >> >> + tbl[2] = *(const uint16_t *)&lpm->tbl24[idx[2]]; >> >> >> + tbl[3] = *(const uint16_t *)&lpm->tbl24[idx[3]]; >> >> >> +} >> >> > >> >> > Nice. There is an improvement in this portion code wrt my patch. This is >> >> > a candidate for convergence. >> >> > >> >> > >> >> >> +#else >> >> >> +static inline void >> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, __m128i ip, uint16_t tbl[4]) >> >> >> +{ >> >> >> + __m128i i24; >> >> >> + uint64_t idx; >> >> >> + >> >> >> + /* get 4 indexes for tbl24[]. */ >> >> >> + i24 = _mm_srli_epi32(ip, CHAR_BIT); >> >> >> + >> >> >> + /* extract values from tbl24[] */ >> >> >> + idx = _mm_cvtsi128_si64(i24); >> >> >> + i24 = _mm_srli_si128(i24, sizeof(uint64_t)); >> >> >> + >> >> >> + tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; >> >> >> + tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; >> >> >> + >> >> >> + idx = _mm_cvtsi128_si64(i24); >> >> >> + >> >> >> + tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; >> >> >> + tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; >> >> >> +} >> >> >> +#endif >> >> >> + >> >> >> /** >> >> >> * Lookup four IP addresses in an LPM table. >> >> >> * >> >> >> @@ -381,17 +422,19 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips, >> >> >> * if lookup would fail. >> >> >> */ >> >> >> static inline void >> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64) >> >> >> +rte_lpm_lookupx4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t hop[4], >> >> >> + uint16_t defv) >> >> > >> >> > This would call for change in the change the ABI, >> >> > IMO, __m128i can be used to represent 128bit vector to avoid ABI chang >> >> > >> >> This redefine rte_lpm_lookupx4 is unncessary, I will remove it, so no >> >> ABI change. >> >> And there only one ifdef for ARM platforms left. >> >> >> >> > >> >> >> +#else >> >> > separate out arm implementation to the different header file. Too many >> >> > ifdef looks odd in the header file. >> >> > >> >> > Could you rebase your patch based on existing patch and send the >> >> > improvement portion as separate patch or I can send update patch with >> >> > your improvements and with your signoff. >> >> > >> >> > >> >> >> rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4], >> >> >> uint16_t defv) >> >> >> +#endif >> >> >> { >> >> >> - __m128i i24; >> >> >> rte_xmm_t i8; >> >> >> uint16_t tbl[4]; >> >> >> - uint64_t idx, pt; >> >> >> - >> >> >> - const __m128i mask8 = >> >> >> - _mm_set_epi32(UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX); >> >> >> + uint64_t pt; >> >> >> >> >> >> + const __m128i mask8 = _mm_set1_epi32(UINT8_MAX); >> >> >> /* >> >> >> * RTE_LPM_VALID_EXT_ENTRY_BITMASK for 4 LPM entries >> >> >> * as one 64-bit value (0x0300030003000300). >> >> >> @@ -412,20 +455,7 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4], >> >> >> (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 32 | >> >> >> (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 48); >> >> >> >> >> >> - /* get 4 indexes for tbl24[]. */ >> >> >> - i24 = _mm_srli_epi32(ip, CHAR_BIT); >> >> >> - >> >> >> - /* extract values from tbl24[] */ >> >> >> - idx = _mm_cvtsi128_si64(i24); >> >> >> - i24 = _mm_srli_si128(i24, sizeof(uint64_t)); >> >> >> - >> >> >> - tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; >> >> >> - tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; >> >> >> - >> >> >> - idx = _mm_cvtsi128_si64(i24); >> >> >> - >> >> >> - tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; >> >> >> - tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; >> >> >> + rte_lpm_tbl24_val4(lpm, ip, tbl); >> >> >> >> >> >> /* get 4 indexes for tbl8[]. */ >> >> >> i8.x = _mm_and_si128(ip, mask8); >> >> >> -- >> >> >> 1.8.3.1 >> >> >>