Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

From: linhaifeng <haifeng.lin@huawei.com>
To: Zhihong Wang <zhihong.wang@intel.com>, <dev@dpdk.org>
Subject: Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
Date: Wed, 30 Aug 2017 17:37:42 +0800
Message-ID: <59A68766.7080307@huawei.com>
In-Reply-To: <1453086314-30158-1-git-send-email-zhihong.wang@intel.com>

On 2016/1/18 11:05, Zhihong Wang wrote:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
>   1. Read CPUID to check if AVX512 is supported by CPU
>
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
>
>   4. Decide alignment unit for memcpy perf test based on predefined macros
>
> --------------
> Changes in v2:
>
>   1. Tune performance for prior platforms
>
> Zhihong Wang (5):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
>   lib/librte_eal: Tune memcpy for prior platforms
>
>  app/test/test_memcpy_perf.c                        |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
>  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
>  mk/rte.cpuflags.mk                                 |   4 +
>  4 files changed, 268 insertions(+), 13 deletions(-)
>

Hi Zhihong Wang,

I tested the AVX512 rte_memcpy and found that OVS-DPDK performance is lower with it than with the AVX2 rte_memcpy.

Results of the VM loop test with OVS-DPDK:
avx512: *15* Gbps
perf data:
  0.52 │      vmovdq (%r8,%r10,1),%zmm0
 95.33 │      sub    $0x40,%r9
  0.45 │      add    $0x40,%r8
  0.60 │      vmovdq %zmm0,-0x40(%r8)
  1.84 │      cmp    $0x3f,%r9
       │    ↓ ja     f20
       │      lea    -0x40(%rsi),%r8
  0.15 │      or     $0xffffffffffffffc0,%rsi
  0.21 │      and    $0xffffffffffffffc0,%r8
  0.00 │      lea    0x40(%rsi,%r8,1),%rsi
  0.00 │      vmovdq (%rcx,%rsi,1),%zmm0
  0.22 │      vmovdq %zmm0,(%rdx,%rsi,1)
  0.67 │    ↓ jmpq   c78
       │      mov    -0x128(%rbp),%rdi
       │      rex.R
       │      .byte  0x89
       │      popfq

avx2: *18.8* Gbps
perf data:
  0.96 │      add    %r9,%r13
 66.04 │      vmovdq (%rdx),%ymm0
  1.20 │      sub    $0x40,%rdi
  1.53 │      add    $0x40,%rdx
 10.83 │      vmovdq %ymm0,-0x40(%rdx,%r15,1)
  8.64 │      vmovdq -0x20(%rdx),%ymm0
  7.58 │      vmovdq %ymm0,-0x40(%rdx,%r13,1)
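
For reference, the gap between the two throughput figures above can be worked out with a short Python sketch. The constant and function names are illustrative only and are not part of any DPDK or OVS tool.

```python
# Throughput figures reported above for the OVS-DPDK VM loop test (Gbps).
AVX512_GBPS = 15.0
AVX2_GBPS = 18.8


def relative_drop(baseline: float, measured: float) -> float:
    """Return how far `measured` falls below `baseline`, as a fraction of `baseline`."""
    return (baseline - measured) / baseline


drop = relative_drop(AVX2_GBPS, AVX512_GBPS)
print(f"avx512 throughput is {drop:.1%} lower than avx2")
```

That is, the avx512 build delivers roughly a fifth less throughput than the avx2 build in this test.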

dpdk version: v17.05
ovs version: 2.8.90
qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)

gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
kernel version: 3.10.0

compile dpdk:
CONFIG_RTE_ENABLE_AVX512=y
export DPDK_DIR=$PWD
export DPDK_TARGET=x86_64-native-linuxapp-gcc
export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
make install T=$DPDK_TARGET DESTDIR=install

compile ovs:
sh boot.sh
./configure  CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
make -j
make install

Results of the DPDK test_memcpy_perf test:
avx2:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 32B aligned ============================
     64       6 -   10      27 -   52      30 -   39      56 -   97
    512      24 -   44     251 -  271     145 -  217     396 -  447
   1024      35 -   78     394 -  433     252 -  319     609 -  670
------- -------------- -------------- -------------- --------------
C    64       3 -    9      28 -   31      29 -   40      55 -   66
C   512      25 -   55     253 -  268     139 -  268     397 -  410
C  1024      32 -   83     394 -  416     250 -  396     612 -  687
=========================== Unaligned =============================
     64       8 -    9      85 -   71      45 -   45     125 -  121
    512      33 -   49     282 -  305     153 -  252     420 -  478
   1024      42 -   83     409 -  491     259 -  389     640 -  748
------- -------------- -------------- -------------- --------------
C    64       4 -    9      42 -   46      39 -   46      76 -   90
C   512      33 -   55     280 -  272     153 -  281     421 -  415
C  1024      41 -   83     407 -  427     258 -  405     578 -  701
======= ============== ============== ============== ==============

avx512:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 64B aligned ============================
     64       6 -    9      18 -   33      24 -   38      40 -   65
    512      18 -   44     178 -  262     138 -  218     309 -  429
   1024      27 -   79     338 -  430     250 -  322     560 -  674
------- -------------- -------------- -------------- --------------
C    64       3 -    9      18 -   20      23 -   41      39 -   50
C   512      15 -   54     205 -  270     134 -  268     304 -  409
C  1024      24 -   83     371 -  414     242 -  400     550 -  692
=========================== Unaligned =============================
     64       8 -    9      87 -   74      45 -   48     125 -  118
    512      23 -   49     298 -  311     150 -  250     437 -  482
   1024      36 -   83     427 -  505     259 -  406     633 -  754
------- -------------- -------------- -------------- --------------
C    64       4 -    9      42 -   46      39 -   46      76 -   94
C   512      23 -   55     246 -  277     152 -  290     349 -  426
C  1024      38 -   83     398 -  431     258 -  416     634 -  725
======= ============== ============== ============== ==============
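
To compare the two tables row by row, a small Python sketch like the one below may help. It only parses rows in the format printed above; `parse_row`, `COLUMNS`, and the row constants are hypothetical names for illustration, and the sketch does not interpret what the first and second value in each pair mean.

```python
import re

# Matching rows copied from the aligned sections of the two tables above (size 1024).
AVX2_ROW = "   1024      35 -   78     394 -  433     252 -  319     609 -  670"
AVX512_ROW = "   1024      27 -   79     338 -  430     250 -  322     560 -  674"

COLUMNS = ("Cache to cache", "Cache to mem", "Mem to cache", "Mem to mem")


def parse_row(row: str):
    """Split one table row into its size and a list of (first, second) tick pairs."""
    # A leading "C" (compile-time constant) is ignored because only digits are matched.
    numbers = [int(n) for n in re.findall(r"\d+", row)]
    size, ticks = numbers[0], numbers[1:]
    pairs = list(zip(ticks[0::2], ticks[1::2]))
    return size, pairs


size, avx2 = parse_row(AVX2_ROW)
_, avx512 = parse_row(AVX512_ROW)
for name, (a2, _), (a512, _) in zip(COLUMNS, avx2, avx512):
    # Ratio below 1.0 means the avx512 build needed fewer ticks for the first value.
    print(f"{size}B {name}: avx512/avx2 first-value ratio = {a512 / a2:.2f}")
```

For this row, the first value in every column is lower for avx512 than for avx2 (27 vs 35, 338 vs 394, 250 vs 252, 560 vs 609).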