All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Add crypto PMD optimized for ARMv8
@ 2016-12-04 11:33 zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 1/3] mk: fix build of assembly files for ARM64 zbigniew.bodek
                   ` (5 more replies)
  0 siblings, 6 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-04 11:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided. 
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)	  OPS(M)      Throughput(Gbps)
64	  729 %	      742 %
128	  577 %	      592 %
256	  483 %	      476 %
512	  336 %	      351 %
768	  300 %	      286 %
1024	  263 %	      250 %
1280	  225 %	      229 %
1536	  214 %	      213 %
1792	  186 %	      203 %
2048	  200 %	      193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 MAC, SHA256 HMAC
and SHA1 HMAC.

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA256 MAC
and SHA1 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

The assembly code requires some style
improvements to avoid using >80 character lines.
This issue will be addressed in v2 patch.
Further performance improvements are planned
in the following patch revisions.

Zbigniew Bodek (3):
  mk: fix build of assembly files for ARM64
  crypto/armv8: add PMD optimized for ARMv8 processors
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                        |    6 +
 app/test/test_cryptodev.c                          |   63 +
 app/test/test_cryptodev_aes_test_vectors.h         |  211 ++-
 app/test/test_cryptodev_blockcipher.c              |    4 +
 app/test/test_cryptodev_blockcipher.h              |    1 +
 app/test/test_cryptodev_perf.c                     |  508 ++++++
 config/common_base                                 |    6 +
 config/defconfig_arm64-armv8a-linuxapp-gcc         |    2 +
 doc/guides/cryptodevs/armv8.rst                    |   82 +
 doc/guides/cryptodevs/index.rst                    |    1 +
 doc/guides/rel_notes/release_17_02.rst             |    5 +
 drivers/crypto/Makefile                            |    3 +
 drivers/crypto/armv8/Makefile                      |   84 +
 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S     | 1678 ++++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256.S        | 1518 ++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S   | 1854 ++++++++++++++++++++
 drivers/crypto/armv8/asm/aes_core.S                |  151 ++
 drivers/crypto/armv8/asm/include/rte_armv8_defs.h  |   78 +
 drivers/crypto/armv8/asm/sha1_core.S               |  515 ++++++
 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S | 1598 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S    | 1619 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_core.S             |  519 ++++++
 .../crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S   | 1791 +++++++++++++++++++
 drivers/crypto/armv8/genassym.c                    |   55 +
 drivers/crypto/armv8/rte_armv8_pmd.c               |  905 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c           |  390 ++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h       |  210 +++
 drivers/crypto/armv8/rte_armv8_pmd_version.map     |    3 +
 lib/librte_cryptodev/rte_cryptodev.h               |    3 +
 mk/arch/arm64/rte.vars.mk                          |    1 -
 mk/rte.app.mk                                      |    3 +
 mk/toolchain/gcc/rte.vars.mk                       |    6 +-
 32 files changed, 13862 insertions(+), 11 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes_core.S
 create mode 100644 drivers/crypto/armv8/asm/include/rte_armv8_defs.h
 create mode 100644 drivers/crypto/armv8/asm/sha1_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/genassym.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH 1/3] mk: fix build of assembly files for ARM64
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2016-12-04 11:33 ` zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 2/3] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-04 11:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Avoid using incorrect assembler (nasm) and unsupported flags
when building for ARM64.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 mk/arch/arm64/rte.vars.mk    | 1 -
 mk/toolchain/gcc/rte.vars.mk | 6 ++++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mk/arch/arm64/rte.vars.mk b/mk/arch/arm64/rte.vars.mk
index c168426..3b1178a 100644
--- a/mk/arch/arm64/rte.vars.mk
+++ b/mk/arch/arm64/rte.vars.mk
@@ -53,7 +53,6 @@ CROSS ?=
 
 CPU_CFLAGS  ?=
 CPU_LDFLAGS ?=
-CPU_ASFLAGS ?= -felf
 
 export ARCH CROSS CPU_CFLAGS CPU_LDFLAGS CPU_ASFLAGS
 
diff --git a/mk/toolchain/gcc/rte.vars.mk b/mk/toolchain/gcc/rte.vars.mk
index ff70f3d..94f6412 100644
--- a/mk/toolchain/gcc/rte.vars.mk
+++ b/mk/toolchain/gcc/rte.vars.mk
@@ -41,9 +41,11 @@
 CC        = $(CROSS)gcc
 KERNELCC  = $(CROSS)gcc
 CPP       = $(CROSS)cpp
-# for now, we don't use as but nasm.
-# AS      = $(CROSS)as
+ifeq ($(CONFIG_RTE_ARCH_X86),y)
 AS        = nasm
+else
+AS        = $(CROSS)as
+endif
 AR        = $(CROSS)ar
 LD        = $(CROSS)ld
 OBJCOPY   = $(CROSS)objcopy
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 2/3] crypto/armv8: add PMD optimized for ARMv8 processors
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 1/3] mk: fix build of assembly files for ARM64 zbigniew.bodek
@ 2016-12-04 11:33 ` zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 3/3] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-04 11:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek, Emery Davis

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver using ARMv8
cryptographic extensions.
CPU compatibility with this driver is detected in run-time
and virtual crypto device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost for chained
crypto operations processing, such as encryption + HMAC generation,
decryption + HMAC validation. In particular, cipher only or hash
only operations are not provided.

The driver currently supports AES-128-CBC in combination with:
SHA256 MAC, SHA256 HMAC and SHA1 HMAC.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Signed-off-by: Emery Davis <emery.davis@caviumnetworks.com>
---
 MAINTAINERS                                        |    6 +
 config/common_base                                 |    6 +
 config/defconfig_arm64-armv8a-linuxapp-gcc         |    2 +
 doc/guides/cryptodevs/armv8.rst                    |   82 +
 doc/guides/cryptodevs/index.rst                    |    1 +
 doc/guides/rel_notes/release_17_02.rst             |    5 +
 drivers/crypto/Makefile                            |    3 +
 drivers/crypto/armv8/Makefile                      |   84 +
 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S     | 1678 ++++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256.S        | 1518 ++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S   | 1854 ++++++++++++++++++++
 drivers/crypto/armv8/asm/aes_core.S                |  151 ++
 drivers/crypto/armv8/asm/include/rte_armv8_defs.h  |   78 +
 drivers/crypto/armv8/asm/sha1_core.S               |  515 ++++++
 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S | 1598 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S    | 1619 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_core.S             |  519 ++++++
 .../crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S   | 1791 +++++++++++++++++++
 drivers/crypto/armv8/genassym.c                    |   55 +
 drivers/crypto/armv8/rte_armv8_pmd.c               |  905 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c           |  390 ++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h       |  210 +++
 drivers/crypto/armv8/rte_armv8_pmd_version.map     |    3 +
 lib/librte_cryptodev/rte_cryptodev.h               |    3 +
 mk/rte.app.mk                                      |    3 +
 25 files changed, 13079 insertions(+)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes_core.S
 create mode 100644 drivers/crypto/armv8/asm/include/rte_armv8_defs.h
 create mode 100644 drivers/crypto/armv8/asm/sha1_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/genassym.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 26d9590..ef1f25b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -445,6 +445,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
diff --git a/config/common_base b/config/common_base
index 4bff83a..b410a3b 100644
--- a/config/common_base
+++ b/config/common_base
@@ -406,6 +406,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc
index 6321884..a99ceb9 100644
--- a/config/defconfig_arm64-armv8a-linuxapp-gcc
+++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
@@ -47,3 +47,5 @@ CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_LIBRTE_FM10K_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y
diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..67d8bc3
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,82 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2016.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+================================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level assembly code specific to all supported cipher and hash
+combinations.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+* ``RTE_CRYPTO_AUTH_SHA1``
+* ``RTE_CRYPTO_AUTH_SHA256``
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+To compile ARMv8 Crypto PMD, it has to be enabled in the config/common_base
+file. No additional packages need to be installed.
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Input data has to be a multiple of 16 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 3b65038..c6c92b0 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -38,6 +38,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..a5de944 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
+endif
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..8fdd374
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,84 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2016.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(SRCDIR)/asm/include
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+VPATH += $(SRCDIR)/asm
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+# library asm files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha1_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha1_hmac.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha256.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha256_hmac.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha1_hmac_aes128cbc_dec.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_aes128cbc_dec.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_hmac_aes128cbc_dec.S
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+# runtime generated assembly symbols
+all: clean assym.s
+
+assym.s: genassym.c
+	@$(CC) $(CFLAGS) -O0 -S $< -o - | \
+		awk '($$1 == "<genassym>") { print "#define " $$2 "\t" $$3 }' > \
+		$(SRCDIR)/asm/$@
+
+.PHONY:	clean
+clean:
+	@rm -f $(SRCDIR)/asm/assym.s
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S b/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
new file mode 100644
index 0000000..efa1cdd
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
@@ -0,0 +1,1678 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha1_hmac
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA1(o_key_pad | SHA1(i_key_pad | out))
+ *
+ * Prototype:
+ * void aes128cbc_sha1_hmac(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha1_hmac(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 -- temp register for SHA1
+ * v20 -- ABCD copy (q20)
+ * v21 -- sha working state (q21)
+ * v22 -- sha working state (q22)
+ * v23 -- temp register for SHA1
+ * v24 -- sha state ABCD
+ * v25 -- sha state E
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha1_hmac.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha1_hmac
+	.type	aes128cbc_sha1_hmac,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+aes128cbc_sha1_hmac:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	ld1		{v24.4s, v25.4s},[x6]			/* init ABCD, EFGH. (2 cycs) */
+	ldr		x6, [x5, #HMAC_OKEYPAD]			/* save pointer to o_key_pad partial hash */
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,12					/* no main loop if <12 */
+	b.lt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+/* proceed */
+	ld1		{v3.16b},[x5]				/* get 1st ivec */
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	mov		x11,x4					/* len -> x11 needed at end */
+	lsr		x12,x11,6				/* total_blocks */
+
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out  */
+	adr		x8,.Lrcon				/* base address for sha round consts */
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	aese		v0.16b,v10.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1					/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]			/* rcon */
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0					/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	sub		x7,x12,1				/* main_blocks = total_blocks - 1 */
+	and		x13,x10,3				/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+
+/* Note, aes_blocks_left := number after the main (sha) block is done. Can be 0 */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * Thats OK since there are 6 cycles before we can use the load anyway; so this goes
+ * as fast as it can without SW pipelining (too complicated given the code size)
+ */
+	rev32		v26.16b,v0.16b				/* fix endian w0, aes res 0 */
+	ld1		{v0.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b				/* fix endian w1, aes res 1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming  */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	aese		v0.16b,v8.16b
+	rev32		v28.16b,v2.16b				/* fix endian w2, aes res 2 */
+	aesmc		v0.16b,v0.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v0.16b,v9.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v0.16b,v10.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v23.4s,v4.4s,v27.4s
+/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b				/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aese		v0.16b,v12.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aese		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aese		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aese		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	eor		v1.16b,v1.16b,v0.16b			/* mode op 1 xor w/ prev value */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v1.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v10.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	sha1h		s22,s24
+	aese		v1.16b,v12.16b
+	sha1p		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aese		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v14.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v1.16b,v15.16b
+	sha1h		s22,s24
+	add		v23.4s,v5.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v16.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v1.16b,v17.16b
+	sha1h		s21,s24
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	sha1p		q24,s22,v23.4s
+	add		v23.4s,v6.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b			/* mode of 2 xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	aese		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v2.16b,v2.16b
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+
+	aese		v2.16b,v11.16b
+	add		v19.4s,v6.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aese		v2.16b,v12.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aese		v2.16b,v13.16b
+	sha1su1		v29.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v2.16b,v14.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v2.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v16.16b
+	add		v19.4s,v6.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	sha1su1		v26.4s,v29.4s
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3, sha quad 3 */
+	aese		v3.16b,v8.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aese		v3.16b,v10.16b
+	sha1su1		v29.4s,v28.4s
+	aesmc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v26.4s
+	aese		v3.16b,v11.16b
+	sha1h		s22,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aese		v3.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesmc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v28.4s
+	aese		v3.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v29.4s
+	aese		v3.16b,v17.16b
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+
+
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+
+	rev32		v26.16b,v0.16b				/* fix endian w0 */
+	rev32		v27.16b,v1.16b				/* fix endian w1 */
+	rev32		v28.16b,v2.16b				/* fix endian w2 */
+	rev32		v29.16b,v3.16b				/* fix endian w3 */
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	cbz		x13, .Lbm2fromQ0			/* skip if none left */
+	subs		x14,x13,1				/* local copy of aes_blocks_left */
+
+/* mode op 0 */
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v19.4s,v4.4s,v26.4s
+	aese		v0.16b,v8.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v0.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aese		v0.16b,v10.16b
+	sha1su1		v26.4s,v29.4s
+	add		v19.4s,v4.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	sha1h		s21,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aese		v0.16b,v12.16b
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v4.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aese		v0.16b,v14.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	sha1h		s21,s24
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha1c		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ1				/* if aes_blocks_left_count == 0 */
+
+/* mode op 1 */
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v23.4s,v5.4s,v27.4s
+	aese		v1.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v10.16b
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v5.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1				/* dec counter */
+	aese		v1.16b,v11.16b
+	sha1h		s22,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v12.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v5.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v14.16b
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v16.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ2				/* if aes_blocks_left_count == 0 */
+
+/* mode op 2 */
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v19.4s,v6.4s,v28.4s
+	aese		v2.16b,v8.16b
+	add		v23.4s,v6.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aese		v2.16b,v12.16b
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v14.16b
+	sha1su1		v26.4s,v29.4s
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha1m		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lbm2fromQ3				/* join common code at Quad 3 */
+
+/*
+ * now there is the b-2 sha block before the final one.  Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+.Lbm2fromQ0:
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+.Lbm2fromQ1:
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+.Lbm2fromQ2:
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+.Lbm2fromQ3:
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	eor		v26.16b,v26.16b,v26.16b			/* zero reg */
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	eor		v27.16b,v27.16b,v27.16b			/* zero reg */
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	eor		v28.16b,v28.16b,v28.16b			/* zero reg */
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		w15,0x80				/* that's the 1 of the pad */
+	/* Add one SHA-1 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32				/* len_hi */
+	and		x9,x11,0xffffffff			/* len_lo */
+	mov		v26.b[0],w15				/* assume block 0 is dst */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x9,x9,3					/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b			/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100				/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1				/* dec amount left */
+	mov		v27.b[0],w15				/* assume block 1 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1				/* dec amount left */
+	mov		v28.b[0],w15				/* assume block 2 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15				/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi,LenLo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+/*
+ * do last sha of pad block
+ */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+
+	mov		x11, #64+20				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11				/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	mov		w15,0x80				/* sha padding word */
+
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+
+	mov		x9,x8					/* top of rcon */
+
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+
+	mov		v27.b[3],w15				/* assume this was final block */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v26.16b,v0.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+
+	mov		v28.b[3],w15				/* assume this was final block */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v27.16b,v1.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+
+	mov		v29.b[3],w15				/* assume this was final block */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v28.16b,v2.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+
+	rev32		v29.16b,v3.16b				/* load res to sha 0, endian swap */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * now we have the sha1 to do for these 4 aes blocks
+ */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+
+	mov		v26.b[3],w15				/* assume this was final block */
+
+	sub		x10,x10,1				/* dec num_blocks */
+	cbnz		x10,.Lshort_loop			/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha1 blocks
+ */
+.Lpost_short_loop:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add	x11, x11, #64
+	lsr	x12,x11,32					/* len_hi */
+	and	x13,x11,0xffffffff				/* len_lo */
+	lsl	x12,x12,3					/* len_hi in bits */
+	lsl	x13,x13,3					/* len_lo in bits */
+
+	mov	v29.s[3],w13					/* len_lo */
+	mov	v29.s[2],w12					/* len_hi */
+
+/* do final block */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+
+	mov		x11, #64+20				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11				/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+	.size	aes128cbc_sha1_hmac, .-aes128cbc_sha1_hmac
diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha256.S b/drivers/crypto/armv8/asm/aes128cbc_sha256.S
new file mode 100644
index 0000000..c203925
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha256.S
@@ -0,0 +1,1518 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha256
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA256(out)
+ *
+ * Prototype:
+ * void aes128cbc_sha256(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha256(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- regShaStateABCD
+ * v25 -- regShaStateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha256.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha256
+	.type	aes128cbc_sha256,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+aes128cbc_sha256:
+/* fetch args */
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next aes_ptr_in */
+	adr		x12,.Linit_sha_state			/* address of sha init state consts */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,12					/* no main loop if <12 */
+	ld1		{v24.4s, v25.4s},[x12]			/* init ABCD, EFGH. (2 cycs) */
+	b.lt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+/* proceed */
+	ld1		{v3.16b},[x5]				/* get 1st ivec */
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	mov		x11,x4					/* len -> x11 needed at end */
+	lsr		x12,x11,6				/* total_blocks */
+
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out  */
+	adr		x8,.Lrcon				/* base address for sha round consts */
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	aese		v0.16b,v10.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1					/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]			/* rcon */
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0					/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	sub		x7,x12,1				/* main_blocks = total_blocks - 1 */
+	and		x13,x10,3				/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+
+/* Note, aes_blocks_left := number after the main (sha) block is done. Can be 0 */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * Thats OK since there are 6 cycles before we can use the load anyway; so this goes
+ * as fast as it can without SW pipelining (too complicated given the code size)
+ */
+	rev32		v26.16b,v0.16b				/* fix endian w0, aes res 0 */
+	ld1		{v0.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b				/* fix endian w1, aes res 1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming  */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+	ld1		{v5.16b},[x9],16			/* key1 */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aese		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	rev32		v28.16b,v2.16b				/* fix endian w2, aes res 2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+	aesmc		v0.16b,v0.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b				/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aese		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	eor		v1.16b,v1.16b,v0.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v7.16b},[x9],16			/* key7  */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v10.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aese		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aese		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aese		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b			/* mode of 2 xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesmc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v3.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+
+	rev32		v26.16b,v0.16b				/* fix endian w0 */
+	rev32		v27.16b,v1.16b				/* fix endian w1 */
+	rev32		v28.16b,v2.16b				/* fix endian w2 */
+	rev32		v29.16b,v3.16b				/* fix endian w3 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	cbz		x13, .Lbm2fromQ0			/* skip if none left */
+	subs		x14,x13,1				/* local copy of aes_blocks_left */
+
+/* mode op 0 */
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aese		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ1				/* if aes_blocks_left_count == 0 */
+
+/* mode op 1 */
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aese		v1.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v1.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v1.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1				/* dec counter */
+	aese		v1.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v1.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v16.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ2				/* if aes_blocks_left_count == 0 */
+
+/* mode op 2 */
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aese		v2.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lbm2fromQ3				/* join common code at Quad 3 */
+
+/*
+ * now there is the b-2 sha block before the final one.  Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+/* quad 0 */
+.Lbm2fromQ0:
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lbm2fromQ1:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lbm2fromQ2:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lbm2fromQ3:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b			/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b			/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b			/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		w15,0x80				/* that's the 1 of the pad */
+	lsr		x12,x11,32				/* len_hi */
+	and		x9,x11,0xffffffff			/* len_lo */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		v26.b[0],w15				/* assume block 0 is dst */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x9,x9,3					/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b			/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100				/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1				/* dec amount left */
+	mov		v27.b[0],w15				/* assume block 1 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1				/* dec amount left */
+	mov		v28.b[0],w15				/* assume block 2 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15				/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi, len_lo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+/*
+ * do last sha of pad block
+ */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	mov		x9,sp
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		sp,sp,8*16
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+/*
+ * now we just have to put this into big endian and store!
+ */
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b				/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b				/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	mov		w15,0x80				/* sha padding word */
+
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+
+	mov		v27.b[3],w15				/* assume this was final block */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v26.16b,v0.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+
+	mov		v28.b[3],w15				/* assume this was final block */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v27.16b,v1.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+
+	mov		v29.b[3],w15				/* assume this was final block */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v28.16b,v2.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+
+	rev32		v29.16b,v3.16b				/* load res to sha 0, endian swap */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * now we have the sha256 to do for these 4 aes blocks
+ */
+
+	mov	v22.16b,v24.16b					/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b					/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+
+	mov		v26.b[3],w15				/* assume this was final block */
+
+	sub		x10,x10,1				/* dec num_blocks */
+	cbnz		x10,.Lshort_loop			/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha256 blocks
+ */
+.Lpost_short_loop:
+	lsr	x12,x11,32					/* len_hi */
+	and	x13,x11,0xffffffff				/* len_lo */
+	lsl	x12,x12,3					/* len_hi in bits */
+	lsl	x13,x13,3					/* len_lo in bits */
+
+	mov	v29.s[3],w13					/* len_lo */
+	mov	v29.s[2],w12					/* len_hi */
+
+/* do final block */
+
+	mov	v22.16b,v24.16b					/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b					/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	mov		x9,sp
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		sp,sp,8*16
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b				/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b				/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+	ret
+
+	.size	aes128cbc_sha256, .-aes128cbc_sha256
diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S b/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
new file mode 100644
index 0000000..3a32eb2
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
@@ -0,0 +1,1854 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha256_hmac
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA256(o_key_pad | SHA256(i_key_pad | out))
+ *
+ * Prototype:
+ * void aes128cbc_sha256_hmac(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha256_hmac(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- sha state ABCD
+ * v25 -- sha state EFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha256_hmac.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha256_hmac
+	.type	aes128cbc_sha256_hmac,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+aes128cbc_sha256_hmac:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	ld1		{v24.4s, v25.4s},[x6]			/* init ABCD, EFGH. (2 cycs) */
+	ldr		x6, [x5, #HMAC_OKEYPAD]			/* save pointer to o_key_pad partial hash */
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next aes_ptr_in */
+	adr		x12,.Linit_sha_state			/* address of sha init state consts */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,12					/* no main loop if <12 */
+	b.lt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+/* proceed */
+	ld1		{v3.16b},[x5]				/* get 1st ivec */
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	mov		x11,x4					/* len -> x11 needed at end */
+	lsr		x12,x11,6				/* total_blocks */
+
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out  */
+	adr		x8,.Lrcon				/* base address for sha round consts */
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	aese		v0.16b,v10.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1					/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]			/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]			/* rcon */
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0					/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]			/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	sub		x7,x12,1				/* main_blocks = total_blocks - 1 */
+	and		x13,x10,3				/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+
+/* Note, aes_blocks_left := number after the main (sha) block is done. Can be 0 */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * Thats OK since there are 6 cycles before we can use the load anyway; so this goes
+ * as fast as it can without SW pipelining (too complicated given the code size)
+ */
+	rev32		v26.16b,v0.16b				/* fix endian w0, aes res 0 */
+	ld1		{v0.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b				/* fix endian w1, aes res 1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming  */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+	ld1		{v5.16b},[x9],16			/* key1 */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aese		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	rev32		v28.16b,v2.16b				/* fix endian w2, aes res 2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+	aesmc		v0.16b,v0.16b
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b				/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aese		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	eor		v1.16b,v1.16b,v0.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v7.16b},[x9],16			/* key7  */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v10.16b
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aese		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aese		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aese		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b			/* mode of 2 xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	aese		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesmc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aese		v3.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+
+	rev32		v26.16b,v0.16b				/* fix endian w0 */
+	rev32		v27.16b,v1.16b				/* fix endian w1 */
+	rev32		v28.16b,v2.16b				/* fix endian w2 */
+	rev32		v29.16b,v3.16b				/* fix endian w3 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	cbz		x13, .Lbm2fromQ0			/* skip if none left */
+	subs		x14,x13,1				/* local copy of aes_blocks_left */
+
+/* mode op 0 */
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aese		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ1				/* if aes_blocks_left_count == 0 */
+
+/* mode op 1 */
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aese		v1.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v1.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v1.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1				/* dec counter */
+	aese		v1.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v1.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v16.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	beq		.Lbm2fromQ2				/* if aes_blocks_left_count == 0 */
+
+/* mode op 2 */
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aese		v2.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lbm2fromQ3				/* join common code at Quad 3 */
+
+/*
+ * now there is the b-2 sha block before the final one.  Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+/* quad 0 */
+.Lbm2fromQ0:
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lbm2fromQ1:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lbm2fromQ2:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lbm2fromQ3:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b			/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b			/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b			/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		w15,0x80				/* that's the 1 of the pad */
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32				/* len_hi */
+	and		x9,x11,0xffffffff			/* len_lo */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		v26.b[0],w15				/* assume block 0 is dst */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x9,x9,3					/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b			/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100				/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1				/* dec amount left */
+	mov		v27.b[0],w15				/* assume block 1 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1				/* dec amount left */
+	mov		v28.b[0],w15				/* assume block 2 is dst */
+	cbz		x14,.Lpad100				/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15				/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi,LenLo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+/*
+ * do last sha of pad block
+ */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+
+	mov		x11, #64+32				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+
+	ld1		{v4.16b},[x8],16			/* key0 */
+	ld1		{v5.16b},[x8],16			/* key1 */
+	ld1		{v6.16b},[x8],16			/* key2 */
+	ld1		{v7.16b},[x8],16			/* key3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key4 */
+	ld1		{v5.16b},[x8],16			/* key5 */
+	ld1		{v6.16b},[x8],16			/* key6 */
+	ld1		{v7.16b},[x8],16			/* key7 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key8 */
+	ld1		{v5.16b},[x8],16			/* key9 */
+	ld1		{v6.16b},[x8],16			/* key10 */
+	ld1		{v7.16b},[x8],16			/* key11 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key12 */
+	ld1		{v5.16b},[x8],16			/* key13 */
+	ld1		{v6.16b},[x8],16			/* key14 */
+	ld1		{v7.16b},[x8],16			/* key15 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	mov		w15,0x80				/* sha padding word */
+
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+	ld1		{v0.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v0.16b,v0.16b,v3.16b			/* xor w/ prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+
+	mov		v27.b[3],w15				/* assume this was final block */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v26.16b,v0.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v1.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v1.16b,v1.16b,v0.16b			/* xor w/ prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+
+	mov		v28.b[3],w15				/* assume this was final block */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v27.16b,v1.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v2.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v2.16b,v2.16b,v1.16b			/* xor w/ prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+
+	mov		v29.b[3],w15				/* assume this was final block */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	rev32		v28.16b,v2.16b				/* load res to sha 0, endian swap */
+	sub		x10,x10,1				/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop			/* break if no more */
+
+	ld1		{v3.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	eor		v3.16b,v3.16b,v2.16b			/* xor w/ prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+
+	rev32		v29.16b,v3.16b				/* load res to sha 0, endian swap */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * now we have the sha256 to do for these 4 aes blocks
+ */
+
+	mov	v22.16b,v24.16b					/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b					/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	eor		v26.16b,v26.16b,v26.16b			/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b			/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b			/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero sha src 3 */
+
+	mov		v26.b[3],w15				/* assume this was final block */
+
+	sub		x10,x10,1				/* dec num_blocks */
+	cbnz		x10,.Lshort_loop			/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha256 blocks
+ */
+.Lpost_short_loop:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add	x11, x11, #64
+	lsr	x12,x11,32					/* len_hi */
+	and	x13,x11,0xffffffff				/* len_lo */
+	lsl	x12,x12,3					/* len_hi in bits */
+	lsl	x13,x13,3					/* len_lo in bits */
+
+	mov	v29.s[3],w13					/* len_lo */
+	mov	v29.s[2],w12					/* len_hi */
+
+/* do final block */
+
+	mov	v22.16b,v24.16b					/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b					/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+
+	mov		x11, #64+32				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11				/* and the higher part */
+
+	ld1		{v4.16b},[x8],16			/* key0 */
+	ld1		{v5.16b},[x8],16			/* key1 */
+	ld1		{v6.16b},[x8],16			/* key2 */
+	ld1		{v7.16b},[x8],16			/* key3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key4 */
+	ld1		{v5.16b},[x8],16			/* key5 */
+	ld1		{v6.16b},[x8],16			/* key6 */
+	ld1		{v7.16b},[x8],16			/* key7 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key8 */
+	ld1		{v5.16b},[x8],16			/* key9 */
+	ld1		{v6.16b},[x8],16			/* key10 */
+	ld1		{v7.16b},[x8],16			/* key11 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key12 */
+	ld1		{v5.16b},[x8],16			/* key13 */
+	ld1		{v6.16b},[x8],16			/* key14 */
+	ld1		{v7.16b},[x8],16			/* key15 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+	.size	aes128cbc_sha256_hmac, .-aes128cbc_sha256_hmac
diff --git a/drivers/crypto/armv8/asm/aes_core.S b/drivers/crypto/armv8/asm/aes_core.S
new file mode 100644
index 0000000..b7ceae6
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes_core.S
@@ -0,0 +1,151 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+	.file	"aes_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global	aes128_key_sched_enc
+	.type	aes128_key_sched_enc, %function
+	.global	aes128_key_sched_dec
+	.type	aes128_key_sched_dec, %function
+
+	/*
+	 * AES key expand algorithm for single round.
+	 */
+	.macro	key_expand res, key, shuffle_mask, rcon, tq0, tq1, td
+	/* temp = rotword(key[3]) */
+	tbl	\td\().8b,{\key\().16b},\shuffle_mask\().8b
+	dup	\tq0\().2d,\td\().d[0]
+	/* temp = subbytes(temp) */
+	aese	\tq0\().16b,v19\().16b			/* q19 := 0 */
+	/* temp = temp + rcon */
+	mov	w11,\rcon
+	dup	\tq1\().4s,w11
+	eor	\tq0\().16b,\tq0\().16b,\tq1\().16b
+	/* tq1 = [0, a, b, c] */
+	ext	\tq1\().16b,v19\().16b,\key\().16b,12  	/* q19 := 0 */
+	eor	\res\().16b,\key\().16b,\tq1\().16b
+	/* tq1 = [0, 0, a, b] */
+	ext	\tq1\().16b,v19\().16b,\tq1\().16b,12  	/* q19 := 0 */
+	eor	\res\().16b,\res\().16b,\tq1\().16b
+	/* tq1 = [0, 0, 0, a] */
+	ext	\tq1\().16b,v19\().16b,\tq1\().16b,12	/* q19 := 0 */
+	eor	\res\().16b,\res\().16b,\tq1\().16b
+	/* + temp */
+	eor	\res\().16b,\res\().16b,\tq0\().16b
+	.endm
+/*
+ * *expanded_key, *user_key
+ */
+	.align	4
+aes128_key_sched_enc:
+	sub	sp,sp,4*16
+	st1	{v8.16b - v11.16b},[sp]
+	ld1	{v0.16b},[x1]				/* user_key */
+	mov	w10,0x0e0d				/* form shuffle_word */
+	mov	w11,0x0c0f
+	orr	w10,w10,w11,lsl 16
+	dup	v20.4s,w10				/* shuffle_mask */
+	eor	v19.16b,v19.16b,v19.16b			/* zero */
+	/* Expand key */
+	key_expand v1,v0,v20,0x1,v21,v16,v17
+	key_expand v2,v1,v20,0x2,v21,v16,v17
+	key_expand v3,v2,v20,0x4,v21,v16,v17
+	key_expand v4,v3,v20,0x8,v21,v16,v17
+	key_expand v5,v4,v20,0x10,v21,v16,v17
+	key_expand v6,v5,v20,0x20,v21,v16,v17
+	key_expand v7,v6,v20,0x40,v21,v16,v17
+	key_expand v8,v7,v20,0x80,v21,v16,v17
+	key_expand v9,v8,v20,0x1b,v21,v16,v17
+	key_expand v10,v9,v20,0x36,v21,v16,v17
+	/* Store round keys in the correct order */
+	st1	{v0.16b - v3.16b},[x0],64
+	st1	{v4.16b - v7.16b},[x0],64
+	st1	{v8.16b - v10.16b},[x0],48
+
+	ld1	{v8.16b - v11.16b},[sp]
+	add	sp,sp,4*16
+	ret
+
+	.size	aes128_key_sched_enc, .-aes128_key_sched_enc
+
+/*
+ * *expanded_key, *user_key
+ */
+	.align	4
+aes128_key_sched_dec:
+	sub	sp,sp,4*16
+	st1	{v8.16b-v11.16b},[sp]
+	ld1	{v0.16b},[x1]				/* user_key */
+	mov	w10,0x0e0d				/* form shuffle_word */
+	mov	w11,0x0c0f
+	orr	w10,w10,w11,lsl 16
+	dup	v20.4s,w10				/* shuffle_mask */
+	eor	v19.16b,v19.16b,v19.16b			/* zero */
+	/*
+	 * Expand key.
+	 * Intentionally reverse registers order to allow
+	 * for multiple store later.
+	 * (Store must be performed in the ascending registers' order)
+	 */
+	key_expand v10,v0,v20,0x1,v21,v16,v17
+	key_expand v9,v10,v20,0x2,v21,v16,v17
+	key_expand v8,v9,v20,0x4,v21,v16,v17
+	key_expand v7,v8,v20,0x8,v21,v16,v17
+	key_expand v6,v7,v20,0x10,v21,v16,v17
+	key_expand v5,v6,v20,0x20,v21,v16,v17
+	key_expand v4,v5,v20,0x40,v21,v16,v17
+	key_expand v3,v4,v20,0x80,v21,v16,v17
+	key_expand v2,v3,v20,0x1b,v21,v16,v17
+	key_expand v1,v2,v20,0x36,v21,v16,v17
+	/* Inverse mixcolumns for keys 1-9 (registers v10-v2) */
+	aesimc	v10.16b, v10.16b
+	aesimc	v9.16b, v9.16b
+	aesimc	v8.16b, v8.16b
+	aesimc	v7.16b, v7.16b
+	aesimc	v6.16b, v6.16b
+	aesimc	v5.16b, v5.16b
+	aesimc	v4.16b, v4.16b
+	aesimc	v3.16b, v3.16b
+	aesimc	v2.16b, v2.16b
+	/* Store round keys in the correct order */
+	st1	{v1.16b - v4.16b},[x0],64
+	st1	{v5.16b - v8.16b},[x0],64
+	st1	{v9.16b, v10.16b},[x0],32
+	st1	{v0.16b},[x0],16
+
+	ld1	{v8.16b - v11.16b},[sp]
+	add	sp,sp,4*16
+	ret
+
+	.size	aes128_key_sched_dec, .-aes128_key_sched_dec
diff --git a/drivers/crypto/armv8/asm/include/rte_armv8_defs.h b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
new file mode 100644
index 0000000..a1d4d24
--- /dev/null
+++ b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
@@ -0,0 +1,78 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_DEFS_H_
+#define _RTE_ARMV8_DEFS_H_
+
+struct crypto_arg {
+	struct {
+		uint8_t		*key;
+		uint8_t		*iv;
+	} cipher;
+	struct {
+		struct {
+			uint8_t	*key;
+			uint8_t *i_key_pad;
+			uint8_t *o_key_pad;
+		} hmac;
+	} digest;
+};
+
+typedef struct crypto_arg crypto_arg_t;
+
+void aes128_key_sched_enc(uint8_t *expanded_key, const uint8_t *user_key);
+void aes128_key_sched_dec(uint8_t *expanded_key, const uint8_t *user_key);
+
+void aes128cbc_sha1_hmac(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void aes128cbc_sha256(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void aes128cbc_sha256_hmac(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void aes128cbc_dec_sha256(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void sha1_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void sha256_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void sha256_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+void sha256_aes128cbc(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc, uint8_t *ddst,
+			uint64_t len, crypto_arg_t *arg);
+
+int sha1_block_partial(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+int sha1_block(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+
+int sha256_block_partial(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+int sha256_block(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+
+#endif /* _RTE_ARMV8_DEFS_H_ */
diff --git a/drivers/crypto/armv8/asm/sha1_core.S b/drivers/crypto/armv8/asm/sha1_core.S
new file mode 100644
index 0000000..cf5bff3
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha1_core.S
@@ -0,0 +1,515 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Core SHA-1 Primitives
+ *
+ * Operations:
+ * sha1_block_partial:
+ * 	out = partial_sha1(init, in, len)	<- no final block
+ *
+ * sha1_block:
+ * 	out = sha1(init, in, len)
+ *
+ * Prototype:
+ *
+ * int sha1_block_partial(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * int sha1_block(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * returns: 0 (sucess), -1 (failure)
+ *
+ * Registers used:
+ *
+ * sha1_block_partial(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * sha1_block(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v4 - v7 -- round consts for sha
+ * v22 -- sha working state ABCD (q22)
+ * v24 -- reg_sha_stateABCD
+ * v25 -- reg_sha_stateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16 (+20 for the HMAC),
+ * otherwise error code is returned.
+ *
+ */
+	.file "sha1_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global sha1_block_partial
+	.type	sha1_block_partial,%function
+	.global sha1_block
+	.type	sha1_block,%function
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+	.align	4
+.Linit_sha_state:
+	.word		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476
+	.word		0xc3d2e1f0, 0x00000000, 0x00000000, 0x00000000
+
+	.align	4
+
+sha1_block_partial:
+	mov		x6, #1					/* indicate partial hash */
+	ands		x5, x3, #0x3f				/* Check size mod 1 SHA block */
+	b.ne		.Lsha1_error
+	cbnz		x0, 1f
+	adr		x0,.Linit_sha_state			/* address of sha init state consts */
+1:
+	ld1		{v24.4s},[x0],16			/* init ABCD */
+	ld1		{v25.4s},[x0]				/* and E */
+
+	/* Load SHA-1 constants */
+	adr		x4,.Lrcon
+	ld1		{v4.16b},[x4],16			/* key0 */
+	ld1		{v5.16b},[x4],16			/* key1 */
+	ld1		{v6.16b},[x4],16			/* key2 */
+	ld1		{v7.16b},[x4],16			/* key3 */
+
+	lsr		x5, x3, 2				/* number of 4B blocks */
+	b		.Lsha1_loop
+
+sha1_block:
+	mov		x6, xzr					/* indicate full hash */
+	and		x5, x3, #0xf				/* check size mod 16B block */
+	cmp		x5, #4					/* additional word is accepted */
+	b.eq		1f
+	cbnz		x5, .Lsha1_error
+1:
+	cbnz		x0, 2f
+	adr		x0,.Linit_sha_state			/* address of sha init state consts */
+2:
+	ld1		{v24.4s},[x0],16			/* init ABCD */
+	ld1		{v25.4s},[x0]				/* and E */
+
+	/* Load SHA-1 constants */
+	adr		x4,.Lrcon
+	ld1		{v4.16b},[x4],16			/* key0 */
+	ld1		{v5.16b},[x4],16			/* key1 */
+	ld1		{v6.16b},[x4],16			/* key2 */
+	ld1		{v7.16b},[x4],16			/* key3 */
+
+	lsr		x5, x3, 2				/* number of 4B blocks */
+	cmp		x5, #16					/* at least 16 4B blocks give 1 SHA block */
+	b.lo		.Lsha1_last
+
+	.align	4
+
+.Lsha1_loop:
+	sub		x5, x5, #16				/* substract 1 SHA block */
+
+	ld1		{v26.16b},[x1],16			/* dsrc[0] */
+	ld1		{v27.16b},[x1],16			/* dsrc[1] */
+	ld1		{v28.16b},[x1],16			/* dsrc[2] */
+	ld1		{v29.16b},[x1],16			/* dsrc[3] */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+/* quad 0 */
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s25,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v4.4s,v27.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v4.4s,v28.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v4.4s,v29.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v5.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v5.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v5.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v6.4s,v29.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v6.4s,v26.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v6.4s,v27.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v7.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v16.4s,v7.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v24.4s,v24.4s,v22.4s
+	add		v25.4s,v25.4s,v18.4s
+
+	cmp		x5, #16
+	b.hs		.Lsha1_loop
+
+	/* Store partial hash and return or complete hash */
+	cbz		x6, .Lsha1_last
+
+	st1		{v24.16b},[x2],16
+	st1		{v25.16b},[x2]
+
+	mov		x0, xzr
+	ret
+
+	/*
+	 * Last block with padding. v24-v25[0] contain hash state.
+	 */
+.Lsha1_last:
+
+	eor		v26.16b, v26.16b, v26.16b
+	eor		v27.16b, v27.16b, v27.16b
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x4,.Lrcon
+	/* Number of bits in message */
+	lsl		x3, x3, 3
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+
+	/* Fill out the first vector register and the end of the block */
+	mov		v29.s[3], w3				/* move length to the end of the block */
+	lsr		x3, x3, 32
+	mov		v29.s[2], w3				/* and the higher part */
+
+	/* The remaining part is up to 3 16B blocks and up to 1 4B block */
+	mov		w6, #0x80				/* that's the 1 of the pad */
+	mov		v26.b[3], w6
+	cbz		x5,.Lsha1_final
+	/* Are there 3 16B blocks? */
+	cmp		x5, #12
+	b.lo		1f
+	ld1		{v26.16b},[x1],16
+	ld1		{v27.16b},[x1],16
+	ld1		{v28.16b},[x1],16
+	rev32		v26.16b, v26.16b
+	rev32		v27.16b, v27.16b
+	rev32		v28.16b, v28.16b
+	sub		x5,x5,#12
+	mov		v29.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v29.b[7], wzr
+	ld1		{v29.s}[0],[x1],4
+	rev32		v29.16b,v29.16b
+	mov		v29.b[7], w6
+	b		.Lsha1_final
+1:
+	/* Are there 2 16B blocks? */
+	cmp		x5, #8
+	b.lo		2f
+	ld1		{v26.16b},[x1],16
+	ld1		{v27.16b},[x1],16
+	rev32		v26.16b,v26.16b
+	rev32		v27.16b,v27.16b
+	sub		x5,x5,#8
+	mov		v28.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v28.b[7], wzr
+	ld1		{v28.s}[0],[x1],4
+	rev32		v28.16b,v28.16b
+	mov		v28.b[7], w6
+	b		.Lsha1_final
+2:
+	/* Is there 1 16B block? */
+	cmp		x5, #4
+	b.lo		3f
+	ld1		{v26.16b},[x1],16
+	rev32		v26.16b,v26.16b
+	sub		x5,x5,#4
+	mov		v27.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v27.b[7], wzr
+	ld1		{v27.s}[0],[x1],4
+	rev32		v27.16b,v27.16b
+	mov		v27.b[7], w6
+	b		.Lsha1_final
+3:
+	ld1		{v26.s}[0],[x1],4
+	rev32		v26.16b,v26.16b
+	mov		v26.b[7], w6
+
+.Lsha1_final:
+	ld1		{v4.16b},[x4],16			/* key0 */
+	ld1		{v5.16b},[x4],16			/* key1 */
+	ld1		{v6.16b},[x4],16			/* key2 */
+	ld1		{v7.16b},[x4],16			/* key3 */
+/* quad 0 */
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s25,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v4.4s,v27.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v4.4s,v28.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v4.4s,v29.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v5.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v5.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v5.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v6.4s,v29.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v6.4s,v26.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v6.4s,v27.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v7.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v16.4s,v7.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v25.4s,v25.4s,v18.4s
+	add		v24.4s,v24.4s,v22.4s
+
+	rev32		v24.16b,v24.16b
+	rev32		v25.16b,v25.16b
+
+	st1		{v24.16b}, [x2],16
+	st1		{v25.s}[0], [x2]
+
+	mov		x0, xzr
+	ret
+
+.Lsha1_error:
+	mov		x0, #-1
+	ret
+
+	.size	sha1_block_partial, .-sha1_block_partial
+	.size	sha1_block, .-sha1_block
diff --git a/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
new file mode 100644
index 0000000..f38d0a6
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
@@ -0,0 +1,1598 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha1_hmac/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA1(o_key_pad | SHA1(i_key_pad | in))
+ *
+ * Prototype:
+ *
+ * void sha1_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha1_hmac_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 -- temp register for SHA1
+ * v20 -- ABCD copy (q20)
+ * v21 -- sha working state (q21)
+ * v22 -- sha working state (q22)
+ * v23 -- temp register for SHA1
+ * v24 -- sha state ABCD
+ * v25 -- sha state E
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks, however they are somewhat optimized,
+ * and more so than the enc/auth versions.
+ */
+	.file "sha1_hmac_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha1_hmac_aes128cbc_dec
+	.type	sha1_hmac_aes128cbc_dec,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+sha1_hmac_aes128cbc_dec:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	ld1		{v24.4s, v25.4s},[x6]			/* init ABCD, EFGH. (2 cycs) */
+	ldr		x6, [x5, #HMAC_OKEYPAD]			/* save pointer to o_key_pad partial hash */
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next *in */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,16					/* no main loop if <16 */
+	blt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4					/* len -> x11 needed at end */
+	mov		x7,sp					/* copy for address mode */
+	ld1		{v30.16b},[x5]				/* get 1st ivec */
+	lsr		x12,x11,6				/* total_blocks (sha) */
+	mov		x4,x0					/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	ld1		{v29.16b},[x4],16			/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha1 block
+ */
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out */
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+/*
+ * do the first sha1 block on the plaintext
+ */
+	mov		v20.16b,v24.16b				/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	st1		{v9.16b},[x7],16
+	rev32		v26.16b,v26.16b				/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b				/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b				/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b				/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+	add		v19.4s,v4.4s,v26.4s
+	st1		{v14.16b},[x7],16
+	add		v23.4s,v4.4s,v27.4s
+	st1		{v15.16b},[x7],16
+/* quad 0 */
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	sha1c		q24,s25,v19.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+	sha1c		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v4.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+	sha1c		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	sha1c		q24,s21,v19.4s
+	add		v19.4s,v5.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+/* quad 1 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	sha1p		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v5.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+	sha1p		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	sha1p		q24,s21,v19.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	sha1p		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v23.4s,v6.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	add		v19.4s,v6.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	sha1p		q24,s21,v19.4s
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	sha1p		q24,s21,v19.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v29.16b},[x4],16			/* next w3 */
+	sha1p		q24,s22,v23.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2				/* main_blocks = total_blocks - 5 */
+	add		v24.4s,v24.4s,v20.4s
+	and		x13,x10,3				/* aes_blocks_left */
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	add		v25.4s,v25.4s,v21.4s
+	add		x2,x0,128				/* lead_ptr = *in */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * I've rewritten this to hoist the v0 loads but there is still no way to hide the
+ * required latency of these sha-associated instructions. It is a perfect example of
+ * why putting to much time into an NP-complete and NP-hard problem can be a mistake,
+ * even if it looks like a reasonable thing at the surface.
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+/* aes xform 0, sha quad 0 */
+	aesd		v0.16b,v8.16b
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v0.16b,v10.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v23.4s,v4.4s,v27.4s
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aesd		v0.16b,v12.16b
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aesd		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* get next aes block, with update */
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v1.16b,v8.16b
+	sha1h		s21,s24
+	add		v19.4s,v5.4s,v28.4s
+	sha1p		q24,s22,v23.4s
+	aesimc		v1.16b,v1.16b
+	sha1su1		v27.4s,v26.4s
+	aesd		v1.16b,v9.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v10.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v28.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1h		s21,s24
+	aesd		v1.16b,v12.16b
+	sha1p		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	sha1h		s22,s24
+	add		v19.4s,v5.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v14.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v16.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v23.4s,v6.4s,v29.4s
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+/* aes xform 2, sha quad 2 */
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v9.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v2.16b,v10.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v11.16b
+	sha1su1		v29.4s,v28.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v13.16b
+	sha1su1		v26.4s,v29.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v15.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	sha1h		s22,s24
+	aesd		v2.16b,v16.16b
+	sha1m		q24,s21,v19.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v2.16b,v17.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v7.4s,v29.4s
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v19.4s,v7.4s,v26.4s
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+/* aes xform 3, sha quad 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v10.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v3.16b,v3.16b
+	sha1su1		v29.4s,v28.4s
+	aesd		v3.16b,v11.16b
+	sha1h		s22,s24
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aesd		v3.16b,v13.16b
+	sha1h		s21,s24
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesimc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v28.4s
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	sha1h		s22,s24
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v29.4s
+	aesd		v3.16b,v17.16b
+	sha1h		s21,s24
+	ld1		{v29.16b},[x4],16			/* next w3 */
+	sha1p		q24,s22,v23.4s
+	add		v24.4s,v24.4s,v20.4s
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	add		v25.4s,v25.4s,v21.4s
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+/*
+ * now the loop epilog. Since the reads for sha have already been done in advance, we
+ * have to have an extra unwind. This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+/* aes xform 0, sha quad 0 */
+	aesd		v0.16b,v8.16b
+	add		v19.4s,v4.4s,v26.4s
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesd		v0.16b,v9.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aesd		v0.16b,v11.16b
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha1su1		v26.4s,v29.4s
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aesd		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	add		v23.4s,v5.4s,v27.4s
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v1.16b,v8.16b
+	sha1h		s21,s24
+	add		v19.4s,v5.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesd		v1.16b,v10.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	aesimc		v1.16b,v1.16b
+	sha1h		s22,s24
+	aesd		v1.16b,v11.16b
+	sha1p		q24,s21,v19.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v14.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v23.4s,v5.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v1.16b,v16.16b
+	sha1h		s22,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v17.16b
+	add		v19.4s,v6.4s,v28.4s
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	sha1su1		v26.4s,v29.4s
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	add		v23.4s,v6.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* mode op 2 */
+/* aes xform 2, sha quad 2 */
+	aesd		v2.16b,v8.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	add		v19.4s,v6.4s,v26.4s
+	aesd		v2.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v13.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v2.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v16.16b
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	sha1su1		v26.4s,v29.4s
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+/* mode op 3 */
+/* aes xform 3, sha quad 3 */
+	aesd		v3.16b,v8.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v10.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v26.4s
+	aesd		v3.16b,v11.16b
+	sha1h		s22,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v12.16b
+	ld1		{v0.16b},[x0]				/* read first aes block, no bump */
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aesd		v3.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v7.4s,v28.4s
+	aesd		v3.16b,v14.16b
+	sha1h		s22,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v15.16b
+	add		v23.4s,v7.4s,v29.4s
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ ivec (modeop) */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * Now, there is the final b-1 sha1 padded block. This contains between 0-3 aes blocks.
+ * we take some pains to avoid read spill by only reading the blocks that are actually defined.
+ * this is also the final sha block code for the shortCases.
+ */
+.Ljoin_common:
+	mov		w15,0x80				/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100				/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b			/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v26.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1				/* dec amount left */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	cbnz		x14,.Lpad200				/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v27.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1				/* dec amount left */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	cbnz		x14,.Lpad300				/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v28.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v29.b[3],w15				/* all data is bogus */
+
+.Lpad_done:
+	/* Add one SHA-1 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32				/* len_hi */
+	and		x14,x11,0xffffffff			/* len_lo */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x14,x14,3				/* len_lo in bits */
+
+	mov		v29.s[3],w14				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left		/* none to do */
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesd		v0.16b,v10.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v12.16b
+	sha1c		q24,s25,v19.4s
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v4.4s,v28.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v15.16b
+	sha1c		q24,s21,v19.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v4.4s,v29.4s
+	aesd		v0.16b,v16.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v17.16b
+	sha1su1		v29.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	add		v23.4s,v5.4s,v27.4s
+	aesd		v0.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v0.16b,v11.16b
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v12.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v0.16b,v13.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v5.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v0.16b,v14.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v0.16b,v15.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v0.16b,v16.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v0.16b,v17.16b
+	sha1su1		v26.4s,v29.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	add		v19.4s,v6.4s,v28.4s
+	aesd		v0.16b,v8.16b
+	add		v23.4s,v6.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v10.16b
+	sha1m		q24,s21,v19.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v28.4s,v27.4s
+	aesd		v0.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v12.16b
+	sha1m		q24,s22,v23.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v29.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v0.16b,v15.16b
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesd		v0.16b,v16.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1m		q24,s22,v23.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	sha1su1		v27.4s,v26.4s
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+
+	mov		x11, #64+20				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11				/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+	mov		x4,x0					/* sha_ptr_in = in */
+
+	mov		x9,x8					/* top of rcon */
+
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block and 0-3 aes blocks
+ * Note that everything is done serially to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4					/* check if 4 or more */
+	blt		.Llast_sha_block			/* if less, bail to last block */
+
+	ld1		{v31.16b},[x4]				/* next w no update */
+	ld1		{v0.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v26.16b,v0.16b				/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v1.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v27.16b,v1.16b				/* endian swap for sha */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ prev value */
+
+	ld1		{v31.16b},[x4]				/* read no update */
+	ld1		{v2.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v28.16b,v2.16b				/* endian swap for sha */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v3.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v29.16b,v3.16b				/* endian swap for sha */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+/*
+ * now we have the sha1 to do for these 4 aes blocks. Note that.
+ */
+
+	mov		v20.16b,v24.16b				/* working ABCD <- ABCD */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/* quad 0 */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+/* quad 3 */
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	sub		x10,x10,4				/* 4 less */
+	b		.Lshort_loop				/* keep looping */
+/*
+ * this is arranged so that we can join the common unwind code that does the last
+ * sha block and the final 0-3 aes blocks
+ */
+.Llast_sha_block:
+	mov		x13,x10					/* copy aes blocks for common */
+	b		.Ljoin_common				/* join common code */
+
+	.size	sha1_hmac_aes128cbc_dec, .-sha1_hmac_aes128cbc_dec
diff --git a/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
new file mode 100644
index 0000000..403d329
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
@@ -0,0 +1,1619 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha256/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA256(in)
+ *
+ * Prototype:
+ *
+ * void sha256_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha256_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- regShaStateABCD
+ * v25 -- regShaStateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks, however they are somewhat optimized,
+ * and more so than the enc/auth versions.
+ */
+	.file "sha256_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha256_aes128cbc_dec
+	.type   sha256_aes128cbc_dec,%function
+
+
+	.align  4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+sha256_aes128cbc_dec:
+/* fetch args */
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next *in */
+	adr		x12,.Linit_sha_state			/* address of sha init state consts */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,16					/* no main loop if <16 */
+	ld1		{v24.4s, v25.4s},[x12]			/* init ABCD, EFGH. (2 cycs) */
+	blt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4					/* len -> x11 needed at end */
+	mov		x7,sp					/* copy for address mode */
+	ld1		{v30.16b},[x5]				/* get 1st ivec */
+	lsr		x12,x11,6				/* total_blocks (sha) */
+	mov		x4,x0					/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	ld1		{v29.16b},[x4],16			/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha256 block
+ */
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out */
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+/*
+ * do the first sha256 block on the plaintext
+ */
+	mov		v22.16b,v24.16b				/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	mov		v23.16b,v25.16b				/* init working EFGH */
+	st1		{v9.16b},[x7],16
+
+	rev32		v26.16b,v26.16b				/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b				/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b				/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b				/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	st1		{v14.16b},[x7],16
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	st1		{v15.16b},[x7],16
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	ld1		{v29.16b},[x4],16			/* next w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2				/* main_blocks = total_blocks - 5 */
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	and		x13,x10,3				/* aes_blocks_left */
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+	add		x2,x0,128				/* lead_ptr = *in */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * I've rewritten this to hoist the v0 loads but there is still no way to hide the
+ * required latency of these sha-associated instructions.  It is a perfect example of
+ * why putting to much time into an NP-complete and NP-hard problem can be a mistake,
+ * even if it looks like a reasonable thing at the surface.
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v4.16b},[x9],16			/* key0 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* get next aes block, with update */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	ld1		{v29.16b},[x4],16			/* next w3 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+/*
+ * now the loop epilog.  Since the reads for sha have already been done in advance, we
+ * have to have an extra unwind.  This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	ld1		{v7.16b},[x9],16			/* key3  */
+	aesimc		v0.16b,v0.16b
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* mode op 2 */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* mode op 3 */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	ld1		{v0.16b},[x0]				/* read first aes block, no bump */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ ivec (modeop) */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * Now, there is the final b-1 sha256 padded block. This contains between 0-3 aes blocks.
+ * we take some pains to avoid read spill by only reading the blocks that are actually defined.
+ * this is also the final sha block code for the shortCases.
+ */
+.Ljoin_common:
+	mov		w15,0x80				/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100				/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b			/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v26.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1				/* dec amount left */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	cbnz		x14,.Lpad200				/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v27.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1				/* dec amount left */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	cbnz		x14,.Lpad300				/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v28.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v29.b[3],w15				/* all data is bogus */
+
+.Lpad_done:
+	lsr		x12,x11,32				/* len_hi */
+	and		x14,x11,0xffffffff			/* len_lo */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x14,x14,3				/* len_lo in bits */
+
+	mov		v29.s[3],w14				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left		/* none to do */
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	aesd		v0.16b,v8.16b
+	ld1		{v7.16b},[x9],16			/* key3 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesd		v0.16b,v10.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	aesd		v0.16b,v11.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256h2	q23, q21, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesd		v0.16b,v10.16b
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v11.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v13.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v29.4s,v26.4s
+	aesd		v0.16b,v16.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v10.16b
+	sha256h2	q23, q21, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v0.16b,v11.16b
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v5.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesd		v0.16b,v13.16b
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesd		v0.16b,v14.16b
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b			/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b			/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b			/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+/*
+ * now we just have to put this into big endian and store! and clean up stack...
+ */
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b				/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b				/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+	mov		x4,x0					/* sha_ptr_in = in */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block and 0-3 aes blocks
+ * Note that everything is done serially to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4					/* check if 4 or more */
+	blt		.Llast_sha_block			/* if less, bail to last block */
+
+	ld1		{v31.16b},[x4]				/* next w no update */
+	ld1		{v0.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v26.16b,v0.16b				/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v1.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v27.16b,v1.16b				/* endian swap for sha */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ prev value */
+
+	ld1		{v31.16b},[x4]				/* read no update */
+	ld1		{v2.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v28.16b,v2.16b				/* endian swap for sha */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v3.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v29.16b,v3.16b				/* endian swap for sha */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+
+/*
+ * now we have the sha256 to do for these 4 aes blocks. Note that.
+ */
+
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* quad 0 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	sub		x10,x10,4				/* 4 less */
+	b		.Lshort_loop				/* keep looping */
+/*
+ * this is arranged so that we can join the common unwind code that does the last
+ * sha block and the final 0-3 aes blocks
+ */
+.Llast_sha_block:
+	mov		x13,x10					/* copy aes blocks for common */
+	b		.Ljoin_common				/* join common code */
+
+	.size	sha256_aes128cbc_dec, .-sha256_aes128cbc_dec
diff --git a/drivers/crypto/armv8/asm/sha256_core.S b/drivers/crypto/armv8/asm/sha256_core.S
new file mode 100644
index 0000000..1280a49
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_core.S
@@ -0,0 +1,519 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Core SHA-2 Primitives
+ *
+ * Operations:
+ * sha256_block_partial:
+ * 	out = partial_sha256(init, in, len)	<- no final block
+ *
+ * sha256_block:
+ * 	out = sha256(init, in, len)
+ *
+ * Prototype:
+ *
+ * int sha256_block_partial(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * int sha256_block(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * returns: 0 (sucess), -1 (failure)
+ *
+ * Registers used:
+ *
+ * sha256_block_partial(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * sha256_block(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v4 - v7 -- round consts for sha
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- reg_sha_stateABCD
+ * v25 -- reg_sha_stateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise error code is returned.
+ *
+ */
+	.file "sha256_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global sha256_block_partial
+	.type	sha256_block_partial,%function
+	.global sha256_block
+	.type	sha256_block,%function
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+	.align	4
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+	.align	4
+
+sha256_block_partial:
+	mov		x6, #1					/* indicate partial hash */
+	ands		x5, x3, #0x3f				/* Check size mod 1 SHA block */
+	b.ne		.Lsha256_error
+	cbnz		x0, 1f
+	adr		x0,.Linit_sha_state			/* address of sha init state consts */
+1:
+	ld1		{v24.4s, v25.4s},[x0]			/* init ABCD, EFGH. (2 cycs) */
+	lsr		x5, x3, 4				/* number of 16B blocks (will be at least 4) */
+	b		.Lsha256_loop
+
+sha256_block:
+	mov		x6, xzr					/* indicate full hash */
+	ands		x5, x3, #0xf				/* Check size mod 16B block */
+	b.ne		.Lsha256_error
+	cbnz		x0, 1f
+	adr		x0,.Linit_sha_state			/* address of sha init state consts */
+1:
+	ld1		{v24.4s, v25.4s},[x0]			/* init ABCD, EFGH. (2 cycs) */
+	lsr		x5, x3, 4				/* number of 16B blocks */
+	cmp		x5, #4					/* at least 4 16B blocks give 1 SHA block */
+	b.lo		.Lsha256_last
+
+	.align	4
+.Lsha256_loop:
+	sub		x5, x5, #4				/* substract 1 SHA block */
+	adr		x4,.Lrcon
+
+	ld1		{v26.16b},[x1],16			/* dsrc[0] */
+	ld1		{v27.16b},[x1],16			/* dsrc[1] */
+	ld1		{v28.16b},[x1],16			/* dsrc[2] */
+	ld1		{v29.16b},[x1],16			/* dsrc[3] */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+
+	ld1		{v4.16b},[x4],16			/* key0 */
+	ld1		{v5.16b},[x4],16			/* key1 */
+	ld1		{v6.16b},[x4],16			/* key2 */
+	ld1		{v7.16b},[x4],16			/* key3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key4 */
+	ld1		{v5.16b},[x4],16			/* key5 */
+	ld1		{v6.16b},[x4],16			/* key6 */
+	ld1		{v7.16b},[x4],16			/* key7 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key8 */
+	ld1		{v5.16b},[x4],16			/* key9 */
+	ld1		{v6.16b},[x4],16			/* key10 */
+	ld1		{v7.16b},[x4],16			/* key11 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key12 */
+	ld1		{v5.16b},[x4],16			/* key13 */
+	ld1		{v6.16b},[x4],16			/* key14 */
+	ld1		{v7.16b},[x4],16			/* key15 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	cmp		x5, #4
+	b.hs		.Lsha256_loop
+
+	/* Store partial hash and return or complete hash */
+	cbz		x6, .Lsha256_last
+
+	st1		{v24.16b, v25.16b}, [x2]
+
+	mov		x0, xzr
+	ret
+
+	/*
+	 * Last block with padding. v24-v25 contain hash state.
+	 */
+.Lsha256_last:
+	eor		v26.16b, v26.16b, v26.16b
+	eor		v27.16b, v27.16b, v27.16b
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x4,.Lrcon
+	lsl		x3, x3, 3
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+
+	/* Fill out the first vector register and the end of the block */
+	mov		v29.s[3], w3				/* move length to the end of the block */
+	lsr		x3, x3, 32
+	mov		v29.s[2], w3				/* and the higher part */
+	/* Set padding 1 to the first reg */
+	mov		w6, #0x80				/* that's the 1 of the pad */
+	mov		v26.b[3], w6
+	cbz		x5,.Lsha256_final
+
+	sub		x5, x5, #1
+	mov		v27.16b, v26.16b
+	ld1		{v26.16b},[x1],16
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	cbz		x5,.Lsha256_final
+
+	sub		x5, x5, #1
+	mov		v28.16b, v27.16b
+	ld1		{v27.16b},[x1],16
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	cbz		x5,.Lsha256_final
+
+	mov		v29.b[0], w6
+	ld1		{v28.16b},[x1],16
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+.Lsha256_final:
+
+	ld1		{v4.16b},[x4],16			/* key0 */
+	ld1		{v5.16b},[x4],16			/* key1 */
+	ld1		{v6.16b},[x4],16			/* key2 */
+	ld1		{v7.16b},[x4],16			/* key3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key4 */
+	ld1		{v5.16b},[x4],16			/* key5 */
+	ld1		{v6.16b},[x4],16			/* key6 */
+	ld1		{v7.16b},[x4],16			/* key7 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key8 */
+	ld1		{v5.16b},[x4],16			/* key9 */
+	ld1		{v6.16b},[x4],16			/* key10 */
+	ld1		{v7.16b},[x4],16			/* key11 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16			/* key12 */
+	ld1		{v5.16b},[x4],16			/* key13 */
+	ld1		{v6.16b},[x4],16			/* key14 */
+	ld1		{v7.16b},[x4],16			/* key15 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x2]			/* save them both */
+
+	mov		x0, xzr
+	ret
+
+.Lsha256_error:
+	mov		x0, #-1
+	ret
+
+	.size	sha256_block_partial, .-sha256_block_partial
diff --git a/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
new file mode 100644
index 0000000..3256327
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
@@ -0,0 +1,1791 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha256_hmac/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA256(o_key_pad | SHA256(i_key_pad | in))
+ *
+ * Prototype:
+ *
+ * void sha256_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha256_hmac_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- sha state ABCD
+ * v25 -- sha state EFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not defined.
+ * For AES partial blocks the user is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks, however they are somewhat optimized,
+ * and more so than the enc/auth versions.
+ */
+	.file "sha256_hmac_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha256_hmac_aes128cbc_dec
+	.type	sha256_hmac_aes128cbc_dec,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+sha256_hmac_aes128cbc_dec:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	ld1		{v24.4s, v25.4s},[x6]			/* init ABCD, EFGH. (2 cycs) */
+	ldr		x6, [x5, #HMAC_OKEYPAD]			/* save pointer to o_key_pad partial hash */
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]			/* pref next *in */
+	adr		x12,.Linit_sha_state			/* address of sha init state consts */
+	prfm		PLDL1KEEP,[x1,0]			/* pref next aes_ptr_out */
+	lsr		x10,x4,4				/* aes_blocks = len/16 */
+	cmp		x10,16					/* no main loop if <16 */
+	blt		.Lshort_cases				/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4					/* len -> x11 needed at end */
+	mov		x7,sp					/* copy for address mode */
+	ld1		{v30.16b},[x5]				/* get 1st ivec */
+	lsr		x12,x11,6				/* total_blocks (sha) */
+	mov		x4,x0					/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	ld1		{v29.16b},[x4],16			/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha256 block
+ */
+	prfm		PLDL1KEEP,[x0,64]			/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out */
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+/*
+ * do the first sha256 block on the plaintext
+ */
+
+	mov		v22.16b,v24.16b				/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	mov		v23.16b,v25.16b				/* init working EFGH */
+	st1		{v9.16b},[x7],16
+
+	rev32		v26.16b,v26.16b				/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b				/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b				/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b				/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+/* quad 0 */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	st1		{v14.16b},[x7],16
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	st1		{v15.16b},[x7],16
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v8.16b},[x2],16			/* rk[0] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v9.16b},[x2],16			/* rk[1] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v10.16b},[x2],16			/* rk[2] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v11.16b},[x2],16			/* rk[3] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v12.16b},[x2],16			/* rk[4] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v13.16b},[x2],16			/* rk[5] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v14.16b},[x2],16			/* rk[6] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v15.16b},[x2],16			/* rk[7] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v16.16b},[x2],16			/* rk[8] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v17.16b},[x2],16			/* rk[9] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v18.16b},[x2],16			/* rk[10] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	ld1		{v29.16b},[x4],16			/* next w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2				/* main_blocks = total_blocks - 5 */
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	and		x13,x10,3				/* aes_blocks_left */
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+	add		x2,x0,128				/* lead_ptr = *in */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle, this takes longer than it looks.
+ * I've rewritten this to hoist the v0 loads but there is still no way to hide the
+ * required latency of these sha-associated instructions. It is a perfect example of
+ * why putting to much time into an NP-complete and NP-hard problem can be a mistake,
+ * even if it looks like a reasonable thing at the surface.
+ */
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v4.16b},[x9],16			/* key0 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* get next aes block, with update */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	ld1		{v29.16b},[x4],16			/* next w3 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1					/* dec block count */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]				/* next aes block, no update */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* next aes block, update aes_ptr_in */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbnz		x7,.Lmain_loop				/* loop if more to do */
+/*
+ * now the loop epilog. Since the reads for sha have already been done in advance, we
+ * have to have an extra unwind. This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]			/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	prfm		PLDL1KEEP,[x1,64]			/* pref next aes_ptr_out, streaming */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16			/* key2 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+	aesimc		v0.16b,v0.16b
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	rev32		v29.16b,v29.16b				/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16			/* key5 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd (1 cyc stall on v22) */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b			/* final res 0 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16			/* key7 */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16			/* key5 (extra stall from mov) */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16			/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64				/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b			/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* mode op 1 xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* mode op 2 */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16			/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16			/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16			/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b			/* mode of 2 xor w/ prev value */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+/* mode op 3 */
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesd		v3.16b,v9.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	ld1		{v0.16b},[x0]				/* read first aes block, no bump */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b			/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+	ld1		{v31.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	ld1		{v1.16b},[x0]				/* read next aes block, no update */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	ld1		{v2.16b},[x0]				/* read next aes block, no update */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b			/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ ivec (modeop) */
+	ld1		{v31.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	ld1		{v3.16b},[x0]				/* read next aes block, no update */
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b			/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ ivec (modeop) */
+	ld1		{v30.16b},[x0],16			/* read next aes block, update aes_ptr_in */
+
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b			/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+/*
+ * Now, there is the final b-1 sha256 padded block. This contains between 0-3 aes blocks.
+ * we take some pains to avoid read spill by only reading the blocks that are actually defined.
+ * this is also the final sha block code for the shortCases.
+ */
+.Ljoin_common:
+	mov		w15,0x80				/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100				/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b			/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v26.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1				/* dec amount left */
+	ld1		{v26.16b},[x4],16			/* next w0 */
+	cbnz		x14,.Lpad200				/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b			/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v27.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1				/* dec amount left */
+	ld1		{v27.16b},[x4],16			/* next w1 */
+	cbnz		x14,.Lpad300				/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b			/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v28.b[0],w15				/* all data is bogus */
+	b		.Lpad_done				/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16			/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b			/* zero the rest */
+	mov		v29.b[3],w15				/* all data is bogus */
+
+.Lpad_done:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32				/* len_hi */
+	and		x14,x11,0xffffffff			/* len_lo */
+	lsl		x12,x12,3				/* len_hi in bits */
+	lsl		x14,x14,3				/* len_lo in bits */
+
+	mov		v29.s[3],w14				/* len_lo */
+	mov		v29.s[2],w12				/* len_hi */
+
+	rev32		v26.16b,v26.16b				/* fix endian w0 */
+	rev32		v27.16b,v27.16b				/* fix endian w1 */
+	rev32		v28.16b,v28.16b				/* fix endian w2 */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left		/* none to do */
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	aesd		v0.16b,v8.16b
+	ld1		{v7.16b},[x9],16			/* key3 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	aesd		v0.16b,v10.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	aesd		v0.16b,v11.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256h2	q23, q21, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+
+	ld1		{v0.16b},[x0]				/* read first aes block, bump aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesd		v0.16b,v10.16b
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v11.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v12.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v13.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v14.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v29.4s,v26.4s
+	aesd		v0.16b,v16.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1				/* dec counter */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+
+	ld1		{v0.16b},[x0],16			/* read first aes block, bump aes_ptr_in */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v10.16b
+	sha256h2	q23, q21, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v0.16b,v11.16b
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v5.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesd		v0.16b,v13.16b
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesd		v0.16b,v14.16b
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	eor		v3.16b,v0.16b,v18.16b			/* res 0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v3.16b,v30.16b			/* xor w/ ivec (modeop) */
+
+
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b			/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b			/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b			/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x8,.Lrcon				/* base address for sha round consts */
+
+	ld1		{v24.16b,v25.16b}, [x6]			/* load o_key_pad partial hash */
+
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80				/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+
+	mov		x11, #64+32				/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11				/* move length to the end of the block */
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11				/* and the higher part */
+
+	ld1		{v4.16b},[x8],16			/* key0 */
+	ld1		{v5.16b},[x8],16			/* key1 */
+	ld1		{v6.16b},[x8],16			/* key2 */
+	ld1		{v7.16b},[x8],16			/* key3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key4 */
+	ld1		{v5.16b},[x8],16			/* key5 */
+	ld1		{v6.16b},[x8],16			/* key6 */
+	ld1		{v7.16b},[x8],16			/* key7 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key8 */
+	ld1		{v5.16b},[x8],16			/* key9 */
+	ld1		{v6.16b},[x8],16			/* key10 */
+	ld1		{v7.16b},[x8],16			/* key11 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16			/* key12 */
+	ld1		{v5.16b},[x8],16			/* key13 */
+	ld1		{v6.16b},[x8],16			/* key14 */
+	ld1		{v7.16b},[x8],16			/* key15 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s			/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s			/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s			/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	st1		{v24.4s,v25.4s},[x3]			/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp					/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]				/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64		/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64		/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]			/* rk[8-10] */
+	adr		x8,.Lrcon				/* rcon */
+	lsl		x11,x10,4				/* len = aes_blocks*16 */
+	mov		x4,x0					/* sha_ptr_in = in */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block and 0-3 aes blocks
+ * Note that everything is done serially to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4					/* check if 4 or more */
+	blt		.Llast_sha_block			/* if less, bail to last block */
+
+	ld1		{v31.16b},[x4]				/* next w no update */
+	ld1		{v0.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v26.16b,v0.16b				/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v1.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v27.16b,v1.16b				/* endian swap for sha */
+	st1		{v0.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b			/* xor w/ prev value */
+
+	ld1		{v31.16b},[x4]				/* read no update */
+	ld1		{v2.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v28.16b,v2.16b				/* endian swap for sha */
+	st1		{v1.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b			/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]				/* read no update */
+	ld1		{v3.16b},[x4],16			/* read next aes block, update aes_ptr_in */
+	rev32		v29.16b,v3.16b				/* endian swap for sha */
+	st1		{v2.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b			/* xor w/ prev value */
+
+/*
+ * now we have the sha256 to do for these 4 aes blocks. Note that.
+ */
+
+	mov		x9,x8					/* top of rcon */
+	ld1		{v4.16b},[x9],16			/* key0 */
+	mov		v22.16b,v24.16b				/* working ABCD <- ABCD */
+	ld1		{v5.16b},[x9],16			/* key1 */
+	mov		v23.16b,v25.16b				/* working EFGH <- EFGH */
+	st1		{v3.16b},[x1],16			/* save aes res, bump aes_out_ptr */
+
+/* quad 0 */
+	ld1		{v6.16b},[x9],16			/* key2 */
+	ld1		{v7.16b},[x9],16			/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+
+	ld1		{v4.16b},[x9],16			/* key4 */
+	ld1		{v5.16b},[x9],16			/* key5 */
+	ld1		{v6.16b},[x9],16			/* key6 */
+	ld1		{v7.16b},[x9],16			/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s			/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s			/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s			/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s			/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b			/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s			/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s			/* EFGH += working copy */
+
+	sub		x10,x10,4				/* 4 less */
+	b		.Lshort_loop				/* keep looping */
+/*
+ * this is arranged so that we can join the common unwind code that does the last
+ * sha block and the final 0-3 aes blocks
+ */
+.Llast_sha_block:
+	mov		x13,x10					/* copy aes blocks for common */
+	b		.Ljoin_common				/* join common code */
+
+	.size	sha256_hmac_aes128cbc_dec, .-sha256_hmac_aes128cbc_dec
diff --git a/drivers/crypto/armv8/genassym.c b/drivers/crypto/armv8/genassym.c
new file mode 100644
index 0000000..44604ce
--- /dev/null
+++ b/drivers/crypto/armv8/genassym.c
@@ -0,0 +1,55 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_common.h>
+
+#include "rte_armv8_defs.h"
+
+#define	ASSYM(name, offset)						\
+do {									\
+	asm volatile("----------\n");					\
+	/* Place pattern, name + value in the assembly code */		\
+	asm volatile("\n<genassym> " #name " %0\n" :: "i" (offset));	\
+} while (0)
+
+
+static void __rte_unused
+generate_as_symbols(void)
+{
+
+	ASSYM(CIPHER_KEY, offsetof(struct crypto_arg, cipher.key));
+	ASSYM(CIPHER_IV, offsetof(struct crypto_arg, cipher.iv));
+
+	ASSYM(HMAC_KEY, offsetof(struct crypto_arg, digest.hmac.key));
+	ASSYM(HMAC_IKEYPAD, offsetof(struct crypto_arg, digest.hmac.i_key_pad));
+	ASSYM(HMAC_OKEYPAD, offsetof(struct crypto_arg, digest.hmac.o_key_pad));
+}
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..8b9a7bb
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,905 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "rte_armv8_defs.h"
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be decribed by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+#define	CRYPTO_KEY(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA1_HMAC][CRYPTO_KEY(128)] = aes128cbc_sha1_hmac,
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA256][CRYPTO_KEY(128)] = aes128cbc_sha256,
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA256_HMAC][CRYPTO_KEY(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA1_HMAC][CRYPTO_KEY(128)] = sha1_hmac_aes128cbc_dec,
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA256][CRYPTO_KEY(128)] = sha256_aes128cbc_dec,
+	[RTE_CRYPTO_CIPHER_AES_CBC][RTE_CRYPTO_AUTH_SHA256_HMAC][CRYPTO_KEY(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define	CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][CRYPTO_KEY(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[RTE_CRYPTO_CIPHER_AES_CBC][CRYPTO_KEY(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[RTE_CRYPTO_CIPHER_AES_CBC][CRYPTO_KEY(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define	CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][CRYPTO_KEY(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * Global static parameter used to create a unique name for each
+ * ARMV8 crypto device.
+ */
+static unsigned int unique_name_id;
+
+static inline int
+create_unique_device_name(char *name, size_t size)
+{
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+			unique_name_id++);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA256:
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
+		break;
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 192);
+		break;
+	case 256:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 256);
+		break;
+	default:
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (sess == NULL)
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	uint8_t *src, *dst;
+	uint8_t *adst, *asrc;
+	uint64_t srclen;
+
+	srclen = op->sym->cipher.data.length;
+	ARMV8_CRYPTO_ASSERT(
+		op->sym->cipher.data.length == op->sym->auth.data.length);
+
+	src = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	dst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		asrc = dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		asrc = src;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(mbuf_dst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(mbuf_src,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	crypto_func(src, dst, asrc, adst, srclen, &arg);
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+	}
+}
+
+/** Process crypto operation for mbuf */
+static int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+	int retval;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (op->status != RTE_CRYPTO_OP_STATUS_ERROR)
+		retval = rte_ring_enqueue(qp->processed_ops, (void *)op);
+	else
+		retval = -1;
+
+	return retval;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	qp->stats.enqueued_count += i;
+	return i;
+
+enqueue_err:
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return i;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(const char *name,
+		struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
+	struct armv8_crypto_private *internals;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* create a unique device name */
+	if (create_unique_device_name(crypto_dev_name,
+			RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
+		return -EINVAL;
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed", name);
+
+	cryptodev_armv8_crypto_uninit(crypto_dev_name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id()
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(name, &init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..0f768f4
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,390 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "rte_armv8_defs.h"
+#include "rte_armv8_pmd_private.h"
+
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA256 */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256,
+					.block_size = 64,
+					.key_size = {
+						.min = 0,
+						.max = 0,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 32,
+						.increment = 8
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..fc1dae4
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,210 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define	ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define	ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define	NBBY		8		/* Number of bits in a byte */
+#define	BYTE_LENGTH(x)	((x) / 8)	/* Number of bytes in x (roun down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define	CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define	CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define	CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define	CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define	CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define	HMAC_IPAD_VALUE			(0x36)
+#define	HMAC_OPAD_VALUE			(0x5C)
+
+#define	SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define	SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define	SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define	SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define	SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define	SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef void (*crypto_func_t)(uint8_t *, uint8_t *, uint8_t *, uint8_t *,
+				uint64_t, crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index 8f63e8f..7bab79d 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 CM device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..a1d332d 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -145,6 +145,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+endif
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 3/3] app/test: add ARMv8 crypto tests and test vectors
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 1/3] mk: fix build of assembly files for ARM64 zbigniew.bodek
  2016-12-04 11:33 ` [PATCH 2/3] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2016-12-04 11:33 ` zbigniew.bodek
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-04 11:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.
Add test vectors for cipher + SHA256 MAC generation.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  63 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 211 +++++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 508 +++++++++++++++++++++++++++++
 5 files changed, 779 insertions(+), 8 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 872f8b4..a0540d6 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,27 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1545,6 +1566,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -6504,6 +6541,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -6567,6 +6621,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -6575,3 +6637,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index 1c68f93..470c2d9 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,136 @@
 	}
 };
 
+/** AES-128-CBC SHA256 MAC test vector */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 512
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 512
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256,
+	.digest = {
+		.data = {
+			0xA8, 0xBC, 0xDB, 0x99, 0xAA, 0x45, 0x91, 0xA3,
+			0x2D, 0x75, 0x41, 0x92, 0x28, 0x01, 0x87, 0x5D,
+			0x45, 0xED, 0x49, 0x05, 0xD3, 0xAE, 0x32, 0x57,
+			0xB7, 0x79, 0x65, 0xFC, 0xFA, 0x6C, 0xFA, 0xDF
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_14 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -878,37 +1008,69 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_14,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_14,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -954,7 +1116,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -963,7 +1126,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1006,7 +1170,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1015,7 +1180,37 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Encryption Digest",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Decryption Digest "
+			"Verify",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Encryption Digest "
+			"Sessionless",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Decryption Digest "
+			"Verify Sessionless",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index 37b10cf..6963241 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -82,6 +82,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -508,6 +509,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index 04ff1ee..bd362c7 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -49,6 +49,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 59a6891..3598edf 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,27 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2422,6 +2449,136 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			start_cycles = rte_rdtsc_precise();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				((num_to_submit - num_sent) < burst_size) ?
+				num_to_submit - num_sent : burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			/* Wait until requests have been sent. */
+			rte_delay_ms(1);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc_precise();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2683,6 +2840,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3356,6 +3563,138 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc_precise();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))],
+					sess, pparams->buf_size, digest_length);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc_precise();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3664,6 +4003,153 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4214,6 +4700,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4270,6 +4769,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4279,3 +4786,4 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest, perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD
  2016-12-07  2:32   ` [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2016-12-06 20:27     ` Thomas Monjalon
  2016-12-07 19:04       ` Zbigniew Bodek
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-06 20:27 UTC (permalink / raw)
  To: dev; +Cc: zbigniew.bodek, pablo.de.lara.guarch, jerin.jacob, declan.doherty

2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Add type and name for ARMv8 crypto PMD
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
[...]
> --- a/lib/librte_cryptodev/rte_cryptodev.h
> +++ b/lib/librte_cryptodev/rte_cryptodev.h
> @@ -66,6 +66,8 @@
>  /**< KASUMI PMD device name */
>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
>  /**< KASUMI PMD device name */
> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
> +/**< ARMv8 CM device name */
>  
>  /** Crypto device type */
>  enum rte_cryptodev_type {
> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
>  };

Can we remove all these types and names in the generic crypto API?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-07  2:32   ` [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8 zbigniew.bodek
@ 2016-12-06 20:29     ` Thomas Monjalon
  2016-12-06 21:18       ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-06 20:29 UTC (permalink / raw)
  To: zbigniew.bodek; +Cc: dev, pablo.de.lara.guarch, jerin.jacob, Emery Davis

2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> This patch adds core low-level crypto operations
> for ARMv8 processors. The assembly code is a base
> for an optimized PMD and is currently excluded
> from the build.

It's a bit sad that you cannot achieve the same performance with
C code and a good compiler.
Have you tried it? How much is the difference?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 20:29     ` Thomas Monjalon
@ 2016-12-06 21:18       ` Jerin Jacob
  2016-12-06 21:42         ` Thomas Monjalon
  0 siblings, 1 reply; 100+ messages in thread
From: Jerin Jacob @ 2016-12-06 21:18 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > 
> > This patch adds core low-level crypto operations
> > for ARMv8 processors. The assembly code is a base
> > for an optimized PMD and is currently excluded
> > from the build.
> 
> It's a bit sad that you cannot achieve the same performance with
> C code and a good compiler.
> Have you tried it? How much is the difference?

Like AES-NI on IA side(exposed as separate PMD in dpdk),
armv8 has special dedicated instructions for crypto operation using SIMD.
This patch is using the "dedicated" armv8 crypto instructions and SIMD
operation to achieve better performance.

We had compared with openssl implementation.Here is the performance
improvement for chained crypto operations case WRT openssl pmd

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 21:18       ` Jerin Jacob
@ 2016-12-06 21:42         ` Thomas Monjalon
  2016-12-06 22:05           ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-06 21:42 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

2016-12-07 02:48, Jerin Jacob:
> On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > 
> > > This patch adds core low-level crypto operations
> > > for ARMv8 processors. The assembly code is a base
> > > for an optimized PMD and is currently excluded
> > > from the build.
> > 
> > It's a bit sad that you cannot achieve the same performance with
> > C code and a good compiler.
> > Have you tried it? How much is the difference?
> 
> Like AES-NI on IA side(exposed as separate PMD in dpdk),
> armv8 has special dedicated instructions for crypto operation using SIMD.
> This patch is using the "dedicated" armv8 crypto instructions and SIMD
> operation to achieve better performance.

It does not justify to have all the code in asm.

> We had compared with openssl implementation.Here is the performance
> improvement for chained crypto operations case WRT openssl pmd
> 
> Buffer
> Size(B)   OPS(M)      Throughput(Gbps)
> 64        729 %        742 %
> 128       577 %        592 %
> 256       483 %        476 %
> 512       336 %        351 %
> 768       300 %        286 %
> 1024      263 %        250 %
> 1280      225 %        229 %
> 1536      214 %        213 %
> 1792      186 %        203 %
> 2048      200 %        193 %

OK but what is the performance difference between this asm code
and a C equivalent?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 21:42         ` Thomas Monjalon
@ 2016-12-06 22:05           ` Jerin Jacob
  2016-12-06 22:41             ` Thomas Monjalon
  0 siblings, 1 reply; 100+ messages in thread
From: Jerin Jacob @ 2016-12-06 22:05 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

On Tue, Dec 06, 2016 at 10:42:51PM +0100, Thomas Monjalon wrote:
> 2016-12-07 02:48, Jerin Jacob:
> > On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > 
> > > > This patch adds core low-level crypto operations
> > > > for ARMv8 processors. The assembly code is a base
> > > > for an optimized PMD and is currently excluded
> > > > from the build.
> > > 
> > > It's a bit sad that you cannot achieve the same performance with
> > > C code and a good compiler.
> > > Have you tried it? How much is the difference?
> > 
> > Like AES-NI on IA side(exposed as separate PMD in dpdk),
> > armv8 has special dedicated instructions for crypto operation using SIMD.
> > This patch is using the "dedicated" armv8 crypto instructions and SIMD
> > operation to achieve better performance.
> 
> It does not justify to have all the code in asm.

Why ? if we can have separate dpdk pmd for AES-NI on IA . Why not for ARM?

> 
> > We had compared with openssl implementation.Here is the performance
> > improvement for chained crypto operations case WRT openssl pmd
> > 
> > Buffer
> > Size(B)   OPS(M)      Throughput(Gbps)
> > 64        729 %        742 %
> > 128       577 %        592 %
> > 256       483 %        476 %
> > 512       336 %        351 %
> > 768       300 %        286 %
> > 1024      263 %        250 %
> > 1280      225 %        229 %
> > 1536      214 %        213 %
> > 1792      186 %        203 %
> > 2048      200 %        193 %
> 
> OK but what is the performance difference between this asm code
> and a C equivalent?

Do you you want compare against the scalar version of C code? its not
even worth to think about it. The vector version will use
dedicated armv8 instruction for crypto so its not portable anyway.
We would like to asm code so that we can have better control on what we do
and we cant rely compiler for that.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 22:05           ` Jerin Jacob
@ 2016-12-06 22:41             ` Thomas Monjalon
  2016-12-06 23:24               ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-06 22:41 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

2016-12-07 03:35, Jerin Jacob:
> On Tue, Dec 06, 2016 at 10:42:51PM +0100, Thomas Monjalon wrote:
> > 2016-12-07 02:48, Jerin Jacob:
> > > On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > > > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > > 
> > > > > This patch adds core low-level crypto operations
> > > > > for ARMv8 processors. The assembly code is a base
> > > > > for an optimized PMD and is currently excluded
> > > > > from the build.
> > > > 
> > > > It's a bit sad that you cannot achieve the same performance with
> > > > C code and a good compiler.
> > > > Have you tried it? How much is the difference?
> > > 
> > > Like AES-NI on IA side(exposed as separate PMD in dpdk),
> > > armv8 has special dedicated instructions for crypto operation using SIMD.
> > > This patch is using the "dedicated" armv8 crypto instructions and SIMD
> > > operation to achieve better performance.
> > 
> > It does not justify to have all the code in asm.
> 
> Why ? if we can have separate dpdk pmd for AES-NI on IA . Why not for ARM?

Jerin, you or me is not understanding the other.
It is perfectly fine to have a separate PMD.
I am just talking about the language C vs ASM.

> > > We had compared with openssl implementation.Here is the performance
> > > improvement for chained crypto operations case WRT openssl pmd
> > > 
> > > Buffer
> > > Size(B)   OPS(M)      Throughput(Gbps)
> > > 64        729 %        742 %
> > > 128       577 %        592 %
> > > 256       483 %        476 %
> > > 512       336 %        351 %
> > > 768       300 %        286 %
> > > 1024      263 %        250 %
> > > 1280      225 %        229 %
> > > 1536      214 %        213 %
> > > 1792      186 %        203 %
> > > 2048      200 %        193 %
> > 
> > OK but what is the performance difference between this asm code
> > and a C equivalent?
> 
> Do you you want compare against the scalar version of C code? its not
> even worth to think about it. The vector version will use
> dedicated armv8 instruction for crypto so its not portable anyway.
> We would like to asm code so that we can have better control on what we do
> and we cant rely compiler for that.

No I'm talking about comparing a PMD written in C vs this one in ASM.
It"s just harder to read ASM. Most of DPDK code is in C.
And only some small functions are written in ASM.
The vector instructions use some C intrinsics.
Do you mean that the instructions that you are using have no intrinsics
equivalent? Nobody made it into GCC?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 22:41             ` Thomas Monjalon
@ 2016-12-06 23:24               ` Jerin Jacob
  2016-12-07 15:00                 ` Thomas Monjalon
  0 siblings, 1 reply; 100+ messages in thread
From: Jerin Jacob @ 2016-12-06 23:24 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

On Tue, Dec 06, 2016 at 02:41:01PM -0800, Thomas Monjalon wrote:
> 2016-12-07 03:35, Jerin Jacob:
> > On Tue, Dec 06, 2016 at 10:42:51PM +0100, Thomas Monjalon wrote:
> > > 2016-12-07 02:48, Jerin Jacob:
> > > > On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > > > > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > > > 
> > > > > > This patch adds core low-level crypto operations
> > > > > > for ARMv8 processors. The assembly code is a base
> > > > > > for an optimized PMD and is currently excluded
> > > > > > from the build.
> > > > > 
> > > > > It's a bit sad that you cannot achieve the same performance with
> > > > > C code and a good compiler.
> > > > > Have you tried it? How much is the difference?
> > > > 
> > > > Like AES-NI on IA side(exposed as separate PMD in dpdk),
> > > > armv8 has special dedicated instructions for crypto operation using SIMD.
> > > > This patch is using the "dedicated" armv8 crypto instructions and SIMD
> > > > operation to achieve better performance.
> > > 
> > > It does not justify to have all the code in asm.
> > 
> > Why ? if we can have separate dpdk pmd for AES-NI on IA . Why not for ARM?
> 
> Jerin, you or me is not understanding the other.
> It is perfectly fine to have a separate PMD.
> I am just talking about the language C vs ASM.

Hmm. Both are bit connected topic :-)

If you check the AES-NI PMD installation guide, We need to download the
"ASM" optimized AES-NI library and build with yasm.
We all uses fine grained ASM code such work.
So AES-NI case those are still ASM code but reside in some other
library.

http://dpdk.org/doc/guides/cryptodevs/aesni_mb.html(Check Installation section)
https://downloadcenter.intel.com/download/22972

Even linux kernel use, hardcore ASM for crypto work.
https://github.com/torvalds/linux/blob/master/arch/arm/crypto/aes-ce-core.S

> 
> > > > We had compared with openssl implementation.Here is the performance
> > > > improvement for chained crypto operations case WRT openssl pmd
> > > > 
> > > > Buffer
> > > > Size(B)   OPS(M)      Throughput(Gbps)
> > > > 64        729 %        742 %
> > > > 128       577 %        592 %
> > > > 256       483 %        476 %
> > > > 512       336 %        351 %
> > > > 768       300 %        286 %
> > > > 1024      263 %        250 %
> > > > 1280      225 %        229 %
> > > > 1536      214 %        213 %
> > > > 1792      186 %        203 %
> > > > 2048      200 %        193 %
> > > 
> > > OK but what is the performance difference between this asm code
> > > and a C equivalent?
> > 
> > Do you you want compare against the scalar version of C code? its not
> > even worth to think about it. The vector version will use
> > dedicated armv8 instruction for crypto so its not portable anyway.
> > We would like to asm code so that we can have better control on what we do
> > and we cant rely compiler for that.
> 
> No I'm talking about comparing a PMD written in C vs this one in ASM.

Only fast stuff written in ASM. Remaining pmd is written in C.
Look  "crypto/armv8: add PMD optimized for ARMv8 processors"

> It"s just harder to read ASM. Most of DPDK code is in C.
> And only some small functions are written in ASM.
> The vector instructions use some C intrinsics.
> Do you mean that the instructions that you are using have no intrinsics
> equivalent? Nobody made it into GCC?
There is intrinsic equivalent for crypto but that will work only on
armv8. If we start using the arch specific intrinsic then it better to
plain ASM code, it is clean and we all do similar scheme for core crypto
work(like AES-NI library, linux etc)

We did a lot of effort to make clean armv8 ASM code _optimized_ for DPDK workload.
Just because someone doesn't familiar with armv8 Assembly its not fair to
say write it in C.


> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                   ` (2 preceding siblings ...)
  2016-12-04 11:33 ` [PATCH 3/3] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2016-12-07  2:32 ` zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
                     ` (10 more replies)
  2016-12-07  2:36 ` [PATCH v2 11/12] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
  2016-12-07  2:37 ` [PATCH v2 12/12] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  5 siblings, 11 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided. 
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 MAC, SHA256 HMAC
and SHA1 HMAC.

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA256 MAC
and SHA1 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

Further performance improvements are planned
in the following patch revisions.

---

v2:
* Fixed checkpatch warnings
* Divide patches into smaller logical parts

Zbigniew Bodek (12):
  mk: fix build of assembly files for ARM64
  lib: add cryptodev type for the upcoming ARMv8 PMD
  crypto/armv8: Add core crypto operations for ARMv8
  crypto/armv8: Add AES+SHA256 crypto operations for ARMv8
  crypto/armv8: Add AES+SHA1 crypto operations for ARMv8
  crypto/armv8: add PMD optimized for ARMv8 processors
  crypto/armv8: generate ASM symbols automatically
  mk/crypto/armv8: add PMD to the build system
  doc/armv8: update documentation about crypto PMD
  crypto/armv8: enable ARMv8 PMD in the configuration
  crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                        |    6 +
 app/test/test_cryptodev.c                          |   63 +
 app/test/test_cryptodev_aes_test_vectors.h         |  211 ++-
 app/test/test_cryptodev_blockcipher.c              |    4 +
 app/test/test_cryptodev_blockcipher.h              |    1 +
 app/test/test_cryptodev_perf.c                     |  508 ++++++
 config/common_base                                 |    6 +
 config/defconfig_arm64-armv8a-linuxapp-gcc         |    2 +
 doc/guides/cryptodevs/armv8.rst                    |   82 +
 doc/guides/cryptodevs/index.rst                    |    1 +
 doc/guides/rel_notes/release_17_02.rst             |    5 +
 drivers/crypto/Makefile                            |    3 +
 drivers/crypto/armv8/Makefile                      |   84 +
 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S     | 1719 ++++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256.S        | 1544 ++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S   | 1879 ++++++++++++++++++++
 drivers/crypto/armv8/asm/aes_core.S                |  151 ++
 drivers/crypto/armv8/asm/include/rte_armv8_defs.h  |   80 +
 drivers/crypto/armv8/asm/sha1_core.S               |  518 ++++++
 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S | 1650 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S    | 1658 +++++++++++++++++
 drivers/crypto/armv8/asm/sha256_core.S             |  525 ++++++
 .../crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S   | 1832 +++++++++++++++++++
 drivers/crypto/armv8/genassym.c                    |   55 +
 drivers/crypto/armv8/rte_armv8_pmd.c               |  915 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c           |  390 ++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h       |  210 +++
 drivers/crypto/armv8/rte_armv8_pmd_version.map     |    3 +
 lib/librte_cryptodev/rte_cryptodev.h               |    3 +
 mk/arch/arm64/rte.vars.mk                          |    1 -
 mk/rte.app.mk                                      |    3 +
 mk/toolchain/gcc/rte.vars.mk                       |    6 +-
 32 files changed, 14107 insertions(+), 11 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/aes_core.S
 create mode 100644 drivers/crypto/armv8/asm/include/rte_armv8_defs.h
 create mode 100644 drivers/crypto/armv8/asm/sha1_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/genassym.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v2 01/12] mk: fix build of assembly files for ARM64
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-21 14:46     ` De Lara Guarch, Pablo
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
                     ` (9 subsequent siblings)
  10 siblings, 2 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Avoid using incorrect assembler (nasm) and unsupported flags
when building for ARM64.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 mk/arch/arm64/rte.vars.mk    | 1 -
 mk/toolchain/gcc/rte.vars.mk | 6 ++++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mk/arch/arm64/rte.vars.mk b/mk/arch/arm64/rte.vars.mk
index c168426..3b1178a 100644
--- a/mk/arch/arm64/rte.vars.mk
+++ b/mk/arch/arm64/rte.vars.mk
@@ -53,7 +53,6 @@ CROSS ?=
 
 CPU_CFLAGS  ?=
 CPU_LDFLAGS ?=
-CPU_ASFLAGS ?= -felf
 
 export ARCH CROSS CPU_CFLAGS CPU_LDFLAGS CPU_ASFLAGS
 
diff --git a/mk/toolchain/gcc/rte.vars.mk b/mk/toolchain/gcc/rte.vars.mk
index ff70f3d..94f6412 100644
--- a/mk/toolchain/gcc/rte.vars.mk
+++ b/mk/toolchain/gcc/rte.vars.mk
@@ -41,9 +41,11 @@
 CC        = $(CROSS)gcc
 KERNELCC  = $(CROSS)gcc
 CPP       = $(CROSS)cpp
-# for now, we don't use as but nasm.
-# AS      = $(CROSS)as
+ifeq ($(CONFIG_RTE_ARCH_X86),y)
 AS        = nasm
+else
+AS        = $(CROSS)as
+endif
 AR        = $(CROSS)ar
 LD        = $(CROSS)ld
 OBJCOPY   = $(CROSS)objcopy
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-06 20:27     ` Thomas Monjalon
  2016-12-07  2:32   ` [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8 zbigniew.bodek
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add type and name for ARMv8 crypto PMD

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 lib/librte_cryptodev/rte_cryptodev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index 8f63e8f..7bab79d 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 CM device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-06 20:29     ` Thomas Monjalon
  2016-12-07  2:32   ` [PATCH v2 04/12] crypto/armv8: Add AES+SHA256 " zbigniew.bodek
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek, Emery Davis

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch adds core low-level crypto operations
for ARMv8 processors. The assembly code is a base
for an optimized PMD and is currently excluded
from the build.

Standalone SHA1 and SHA256 are provided to support
partial hashing of inner/outer key+padding and
authentication keys longer than 160/256 bits.
Optimized AES key schedule is also included.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Signed-off-by: Emery Davis <emery.davis@caviumnetworks.com>
---
 drivers/crypto/armv8/asm/aes_core.S    | 151 ++++++++++
 drivers/crypto/armv8/asm/sha1_core.S   | 518 ++++++++++++++++++++++++++++++++
 drivers/crypto/armv8/asm/sha256_core.S | 525 +++++++++++++++++++++++++++++++++
 3 files changed, 1194 insertions(+)
 create mode 100644 drivers/crypto/armv8/asm/aes_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha1_core.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_core.S

diff --git a/drivers/crypto/armv8/asm/aes_core.S b/drivers/crypto/armv8/asm/aes_core.S
new file mode 100644
index 0000000..b7ceae6
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes_core.S
@@ -0,0 +1,151 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+	.file	"aes_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global	aes128_key_sched_enc
+	.type	aes128_key_sched_enc, %function
+	.global	aes128_key_sched_dec
+	.type	aes128_key_sched_dec, %function
+
+	/*
+	 * AES key expand algorithm for single round.
+	 */
+	.macro	key_expand res, key, shuffle_mask, rcon, tq0, tq1, td
+	/* temp = rotword(key[3]) */
+	tbl	\td\().8b,{\key\().16b},\shuffle_mask\().8b
+	dup	\tq0\().2d,\td\().d[0]
+	/* temp = subbytes(temp) */
+	aese	\tq0\().16b,v19\().16b			/* q19 := 0 */
+	/* temp = temp + rcon */
+	mov	w11,\rcon
+	dup	\tq1\().4s,w11
+	eor	\tq0\().16b,\tq0\().16b,\tq1\().16b
+	/* tq1 = [0, a, b, c] */
+	ext	\tq1\().16b,v19\().16b,\key\().16b,12  	/* q19 := 0 */
+	eor	\res\().16b,\key\().16b,\tq1\().16b
+	/* tq1 = [0, 0, a, b] */
+	ext	\tq1\().16b,v19\().16b,\tq1\().16b,12  	/* q19 := 0 */
+	eor	\res\().16b,\res\().16b,\tq1\().16b
+	/* tq1 = [0, 0, 0, a] */
+	ext	\tq1\().16b,v19\().16b,\tq1\().16b,12	/* q19 := 0 */
+	eor	\res\().16b,\res\().16b,\tq1\().16b
+	/* + temp */
+	eor	\res\().16b,\res\().16b,\tq0\().16b
+	.endm
+/*
+ * *expanded_key, *user_key
+ */
+	.align	4
+aes128_key_sched_enc:
+	sub	sp,sp,4*16
+	st1	{v8.16b - v11.16b},[sp]
+	ld1	{v0.16b},[x1]				/* user_key */
+	mov	w10,0x0e0d				/* form shuffle_word */
+	mov	w11,0x0c0f
+	orr	w10,w10,w11,lsl 16
+	dup	v20.4s,w10				/* shuffle_mask */
+	eor	v19.16b,v19.16b,v19.16b			/* zero */
+	/* Expand key */
+	key_expand v1,v0,v20,0x1,v21,v16,v17
+	key_expand v2,v1,v20,0x2,v21,v16,v17
+	key_expand v3,v2,v20,0x4,v21,v16,v17
+	key_expand v4,v3,v20,0x8,v21,v16,v17
+	key_expand v5,v4,v20,0x10,v21,v16,v17
+	key_expand v6,v5,v20,0x20,v21,v16,v17
+	key_expand v7,v6,v20,0x40,v21,v16,v17
+	key_expand v8,v7,v20,0x80,v21,v16,v17
+	key_expand v9,v8,v20,0x1b,v21,v16,v17
+	key_expand v10,v9,v20,0x36,v21,v16,v17
+	/* Store round keys in the correct order */
+	st1	{v0.16b - v3.16b},[x0],64
+	st1	{v4.16b - v7.16b},[x0],64
+	st1	{v8.16b - v10.16b},[x0],48
+
+	ld1	{v8.16b - v11.16b},[sp]
+	add	sp,sp,4*16
+	ret
+
+	.size	aes128_key_sched_enc, .-aes128_key_sched_enc
+
+/*
+ * *expanded_key, *user_key
+ */
+	.align	4
+aes128_key_sched_dec:
+	sub	sp,sp,4*16
+	st1	{v8.16b-v11.16b},[sp]
+	ld1	{v0.16b},[x1]				/* user_key */
+	mov	w10,0x0e0d				/* form shuffle_word */
+	mov	w11,0x0c0f
+	orr	w10,w10,w11,lsl 16
+	dup	v20.4s,w10				/* shuffle_mask */
+	eor	v19.16b,v19.16b,v19.16b			/* zero */
+	/*
+	 * Expand key.
+	 * Intentionally reverse registers order to allow
+	 * for multiple store later.
+	 * (Store must be performed in the ascending registers' order)
+	 */
+	key_expand v10,v0,v20,0x1,v21,v16,v17
+	key_expand v9,v10,v20,0x2,v21,v16,v17
+	key_expand v8,v9,v20,0x4,v21,v16,v17
+	key_expand v7,v8,v20,0x8,v21,v16,v17
+	key_expand v6,v7,v20,0x10,v21,v16,v17
+	key_expand v5,v6,v20,0x20,v21,v16,v17
+	key_expand v4,v5,v20,0x40,v21,v16,v17
+	key_expand v3,v4,v20,0x80,v21,v16,v17
+	key_expand v2,v3,v20,0x1b,v21,v16,v17
+	key_expand v1,v2,v20,0x36,v21,v16,v17
+	/* Inverse mixcolumns for keys 1-9 (registers v10-v2) */
+	aesimc	v10.16b, v10.16b
+	aesimc	v9.16b, v9.16b
+	aesimc	v8.16b, v8.16b
+	aesimc	v7.16b, v7.16b
+	aesimc	v6.16b, v6.16b
+	aesimc	v5.16b, v5.16b
+	aesimc	v4.16b, v4.16b
+	aesimc	v3.16b, v3.16b
+	aesimc	v2.16b, v2.16b
+	/* Store round keys in the correct order */
+	st1	{v1.16b - v4.16b},[x0],64
+	st1	{v5.16b - v8.16b},[x0],64
+	st1	{v9.16b, v10.16b},[x0],32
+	st1	{v0.16b},[x0],16
+
+	ld1	{v8.16b - v11.16b},[sp]
+	add	sp,sp,4*16
+	ret
+
+	.size	aes128_key_sched_dec, .-aes128_key_sched_dec
diff --git a/drivers/crypto/armv8/asm/sha1_core.S b/drivers/crypto/armv8/asm/sha1_core.S
new file mode 100644
index 0000000..283c946
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha1_core.S
@@ -0,0 +1,518 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Core SHA-1 Primitives
+ *
+ * Operations:
+ * sha1_block_partial:
+ * 	out = partial_sha1(init, in, len)	<- no final block
+ *
+ * sha1_block:
+ * 	out = sha1(init, in, len)
+ *
+ * Prototype:
+ *
+ * int sha1_block_partial(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * int sha1_block(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * returns: 0 (success), -1 (failure)
+ *
+ * Registers used:
+ *
+ * sha1_block_partial(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * sha1_block(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v4 - v7 -- round consts for sha
+ * v22 -- sha working state ABCD (q22)
+ * v24 -- reg_sha_stateABCD
+ * v25 -- reg_sha_stateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16 (+20 for the HMAC),
+ * otherwise error code is returned.
+ *
+ */
+	.file "sha1_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global sha1_block_partial
+	.type	sha1_block_partial,%function
+	.global sha1_block
+	.type	sha1_block,%function
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+	.align	4
+.Linit_sha_state:
+	.word		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476
+	.word		0xc3d2e1f0, 0x00000000, 0x00000000, 0x00000000
+
+	.align	4
+
+sha1_block_partial:
+	mov		x6, #1			/* indicate partial hash */
+	ands		x5, x3, #0x3f		/* Check size mod 1 SHA block */
+	b.ne		.Lsha1_error
+	cbnz		x0, 1f
+	/* address of sha init state consts */
+	adr		x0,.Linit_sha_state
+1:
+	ld1		{v24.4s},[x0],16	/* init ABCD */
+	ld1		{v25.4s},[x0]		/* and E */
+
+	/* Load SHA-1 constants */
+	adr		x4,.Lrcon
+	ld1		{v4.16b},[x4],16	/* key0 */
+	ld1		{v5.16b},[x4],16	/* key1 */
+	ld1		{v6.16b},[x4],16	/* key2 */
+	ld1		{v7.16b},[x4],16	/* key3 */
+
+	lsr		x5, x3, 2		/* number of 4B blocks */
+	b		.Lsha1_loop
+
+sha1_block:
+	mov		x6, xzr		/* indicate full hash */
+	and		x5, x3, #0xf	/* check size mod 16B block */
+	cmp		x5, #4		/* additional word is accepted */
+	b.eq		1f
+	cbnz		x5, .Lsha1_error
+1:
+	cbnz		x0, 2f
+	/* address of sha init state consts */
+	adr		x0,.Linit_sha_state
+2:
+	ld1		{v24.4s},[x0],16	/* init ABCD */
+	ld1		{v25.4s},[x0]		/* and E */
+
+	/* Load SHA-1 constants */
+	adr		x4,.Lrcon
+	ld1		{v4.16b},[x4],16	/* key0 */
+	ld1		{v5.16b},[x4],16	/* key1 */
+	ld1		{v6.16b},[x4],16	/* key2 */
+	ld1		{v7.16b},[x4],16	/* key3 */
+
+	lsr		x5, x3, 2		/* number of 4B blocks */
+	/* at least 16 4B blocks give 1 SHA block */
+	cmp		x5, #16
+	b.lo		.Lsha1_last
+
+	.align	4
+
+.Lsha1_loop:
+	sub		x5, x5, #16		/* substract 1 SHA block */
+
+	ld1		{v26.16b},[x1],16	/* dsrc[0] */
+	ld1		{v27.16b},[x1],16	/* dsrc[1] */
+	ld1		{v28.16b},[x1],16	/* dsrc[2] */
+	ld1		{v29.16b},[x1],16	/* dsrc[3] */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+/* quad 0 */
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s25,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v4.4s,v27.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v4.4s,v28.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v4.4s,v29.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v5.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v5.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v5.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v6.4s,v29.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v6.4s,v26.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v6.4s,v27.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v7.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v16.4s,v7.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v24.4s,v24.4s,v22.4s
+	add		v25.4s,v25.4s,v18.4s
+
+	cmp		x5, #16
+	b.hs		.Lsha1_loop
+
+	/* Store partial hash and return or complete hash */
+	cbz		x6, .Lsha1_last
+
+	st1		{v24.16b},[x2],16
+	st1		{v25.16b},[x2]
+
+	mov		x0, xzr
+	ret
+
+	/*
+	 * Last block with padding. v24-v25[0] contain hash state.
+	 */
+.Lsha1_last:
+
+	eor		v26.16b, v26.16b, v26.16b
+	eor		v27.16b, v27.16b, v27.16b
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x4,.Lrcon
+	/* Number of bits in message */
+	lsl		x3, x3, 3
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	/* move length to the end of the block */
+	mov		v29.s[3], w3
+	lsr		x3, x3, 32
+	/* and the higher part */
+	mov		v29.s[2], w3
+
+	/* The remaining part is up to 3 16B blocks and up to 1 4B block */
+	mov		w6, #0x80		/* that's the 1 of the pad */
+	mov		v26.b[3], w6
+	cbz		x5,.Lsha1_final
+	/* Are there 3 16B blocks? */
+	cmp		x5, #12
+	b.lo		1f
+	ld1		{v26.16b},[x1],16
+	ld1		{v27.16b},[x1],16
+	ld1		{v28.16b},[x1],16
+	rev32		v26.16b, v26.16b
+	rev32		v27.16b, v27.16b
+	rev32		v28.16b, v28.16b
+	sub		x5,x5,#12
+	mov		v29.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v29.b[7], wzr
+	ld1		{v29.s}[0],[x1],4
+	rev32		v29.16b,v29.16b
+	mov		v29.b[7], w6
+	b		.Lsha1_final
+1:
+	/* Are there 2 16B blocks? */
+	cmp		x5, #8
+	b.lo		2f
+	ld1		{v26.16b},[x1],16
+	ld1		{v27.16b},[x1],16
+	rev32		v26.16b,v26.16b
+	rev32		v27.16b,v27.16b
+	sub		x5,x5,#8
+	mov		v28.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v28.b[7], wzr
+	ld1		{v28.s}[0],[x1],4
+	rev32		v28.16b,v28.16b
+	mov		v28.b[7], w6
+	b		.Lsha1_final
+2:
+	/* Is there 1 16B block? */
+	cmp		x5, #4
+	b.lo		3f
+	ld1		{v26.16b},[x1],16
+	rev32		v26.16b,v26.16b
+	sub		x5,x5,#4
+	mov		v27.b[7], w6
+	cbz		x5,.Lsha1_final
+	mov		v27.b[7], wzr
+	ld1		{v27.s}[0],[x1],4
+	rev32		v27.16b,v27.16b
+	mov		v27.b[7], w6
+	b		.Lsha1_final
+3:
+	ld1		{v26.s}[0],[x1],4
+	rev32		v26.16b,v26.16b
+	mov		v26.b[7], w6
+
+.Lsha1_final:
+	ld1		{v4.16b},[x4],16	/* key0 */
+	ld1		{v5.16b},[x4],16	/* key1 */
+	ld1		{v6.16b},[x4],16	/* key2 */
+	ld1		{v7.16b},[x4],16	/* key3 */
+/* quad 0 */
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s25,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v4.4s,v27.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v4.4s,v28.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v4.4s,v29.4s
+	sha1h		s18,s24
+	sha1c		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v4.4s,v26.4s
+	sha1h		s19,s24
+	sha1c		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v5.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v5.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v5.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v5.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v17.4s,v6.4s,v29.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v6.4s,v26.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v17.4s,v6.4s,v27.4s
+	sha1h		s18,s24
+	sha1m		q24,s19,v17.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v16.4s,v6.4s,v28.4s
+	sha1h		s19,s24
+	sha1m		q24,s18,v16.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v16.4s,v7.4s,v26.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v27.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v16.4s,v7.4s,v28.4s
+	sha1h		s19,s24
+	sha1p		q24,s18,v16.4s
+
+	add		v17.4s,v7.4s,v29.4s
+	sha1h		s18,s24
+	sha1p		q24,s19,v17.4s
+
+	add		v25.4s,v25.4s,v18.4s
+	add		v24.4s,v24.4s,v22.4s
+
+	rev32		v24.16b,v24.16b
+	rev32		v25.16b,v25.16b
+
+	st1		{v24.16b}, [x2],16
+	st1		{v25.s}[0], [x2]
+
+	mov		x0, xzr
+	ret
+
+.Lsha1_error:
+	mov		x0, #-1
+	ret
+
+	.size	sha1_block_partial, .-sha1_block_partial
+	.size	sha1_block, .-sha1_block
diff --git a/drivers/crypto/armv8/asm/sha256_core.S b/drivers/crypto/armv8/asm/sha256_core.S
new file mode 100644
index 0000000..2b2da7f
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_core.S
@@ -0,0 +1,525 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Core SHA-2 Primitives
+ *
+ * Operations:
+ * sha256_block_partial:
+ * 	out = partial_sha256(init, in, len)	<- no final block
+ *
+ * sha256_block:
+ * 	out = sha256(init, in, len)
+ *
+ * Prototype:
+ *
+ * int sha256_block_partial(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * int sha256_block(uint8_t *init,
+ *			uint8_t *dsrc, uint8_t *ddst, uint64_t len)
+ *
+ * returns: 0 (success), -1 (failure)
+ *
+ * Registers used:
+ *
+ * sha256_block_partial(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * sha256_block(
+ *	init,			x0	(hash init state - NULL for default)
+ *	dsrc,			x1	(digest src address)
+ *	ddst,			x2	(digest dst address)
+ *	len,			x3	(length)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v4 - v7 -- round consts for sha
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- reg_sha_stateABCD
+ * v25 -- reg_sha_stateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16,
+ * otherwise error code is returned.
+ *
+ */
+	.file "sha256_core.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.align	4
+	.global sha256_block_partial
+	.type	sha256_block_partial,%function
+	.global sha256_block
+	.type	sha256_block,%function
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+	.align	4
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+	.align	4
+
+sha256_block_partial:
+	mov		x6, #1			/* indicate partial hash */
+	ands		x5, x3, #0x3f		/* check size mod 1 SHA block */
+	b.ne		.Lsha256_error
+	cbnz		x0, 1f
+	/* address of sha init state consts */
+	adr		x0,.Linit_sha_state
+1:
+	ld1		{v24.4s, v25.4s},[x0]	/* init ABCD, EFGH */
+	/* number of 16B blocks (will be at least 4) */
+	lsr		x5, x3, 4
+	b		.Lsha256_loop
+
+sha256_block:
+	mov		x6, xzr			/* indicate full hash */
+	ands		x5, x3, #0xf		/* check size mod 16B block */
+	b.ne		.Lsha256_error
+	cbnz		x0, 1f
+	/* address of sha init state consts */
+	adr		x0,.Linit_sha_state
+1:
+	ld1		{v24.4s, v25.4s},[x0]	/* init ABCD, EFGH. (2 cycs) */
+	lsr		x5, x3, 4		/* number of 16B blocks */
+	cmp		x5, #4	/* at least 4 16B blocks give 1 SHA block */
+	b.lo		.Lsha256_last
+
+	.align	4
+.Lsha256_loop:
+	sub		x5, x5, #4		/* substract 1 SHA block */
+	adr		x4,.Lrcon
+
+	ld1		{v26.16b},[x1],16	/* dsrc[0] */
+	ld1		{v27.16b},[x1],16	/* dsrc[1] */
+	ld1		{v28.16b},[x1],16	/* dsrc[2] */
+	ld1		{v29.16b},[x1],16	/* dsrc[3] */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+
+	ld1		{v4.16b},[x4],16	/* key0 */
+	ld1		{v5.16b},[x4],16	/* key1 */
+	ld1		{v6.16b},[x4],16	/* key2 */
+	ld1		{v7.16b},[x4],16	/* key3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key4 */
+	ld1		{v5.16b},[x4],16	/* key5 */
+	ld1		{v6.16b},[x4],16	/* key6 */
+	ld1		{v7.16b},[x4],16	/* key7 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key8 */
+	ld1		{v5.16b},[x4],16	/* key9 */
+	ld1		{v6.16b},[x4],16	/* key10 */
+	ld1		{v7.16b},[x4],16	/* key11 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key12 */
+	ld1		{v5.16b},[x4],16	/* key13 */
+	ld1		{v6.16b},[x4],16	/* key14 */
+	ld1		{v7.16b},[x4],16	/* key15 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	cmp		x5, #4
+	b.hs		.Lsha256_loop
+
+	/* Store partial hash and return or complete hash */
+	cbz		x6, .Lsha256_last
+
+	st1		{v24.16b, v25.16b}, [x2]
+
+	mov		x0, xzr
+	ret
+
+	/*
+	 * Last block with padding. v24-v25 contain hash state.
+	 */
+.Lsha256_last:
+	eor		v26.16b, v26.16b, v26.16b
+	eor		v27.16b, v27.16b, v27.16b
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+
+	adr		x4,.Lrcon
+	lsl		x3, x3, 3
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+
+	/* Fill out the first vector register and the end of the block */
+
+	/* move length to the end of the block */
+	mov		v29.s[3], w3
+	lsr		x3, x3, 32
+	mov		v29.s[2], w3		/* and the higher part */
+	/* set padding 1 to the first reg */
+	mov		w6, #0x80		/* that's the 1 of the pad */
+	mov		v26.b[3], w6
+	cbz		x5,.Lsha256_final
+
+	sub		x5, x5, #1
+	mov		v27.16b, v26.16b
+	ld1		{v26.16b},[x1],16
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	cbz		x5,.Lsha256_final
+
+	sub		x5, x5, #1
+	mov		v28.16b, v27.16b
+	ld1		{v27.16b},[x1],16
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	cbz		x5,.Lsha256_final
+
+	mov		v29.b[0], w6
+	ld1		{v28.16b},[x1],16
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+
+.Lsha256_final:
+
+	ld1		{v4.16b},[x4],16	/* key0 */
+	ld1		{v5.16b},[x4],16	/* key1 */
+	ld1		{v6.16b},[x4],16	/* key2 */
+	ld1		{v7.16b},[x4],16	/* key3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key4 */
+	ld1		{v5.16b},[x4],16	/* key5 */
+	ld1		{v6.16b},[x4],16	/* key6 */
+	ld1		{v7.16b},[x4],16	/* key7 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key8 */
+	ld1		{v5.16b},[x4],16	/* key9 */
+	ld1		{v6.16b},[x4],16	/* key10 */
+	ld1		{v7.16b},[x4],16	/* key11 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x4],16	/* key12 */
+	ld1		{v5.16b},[x4],16	/* key13 */
+	ld1		{v6.16b},[x4],16	/* key14 */
+	ld1		{v7.16b},[x4],16	/* key15 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x2]	/* save them both */
+
+	mov		x0, xzr
+	ret
+
+.Lsha256_error:
+	mov		x0, #-1
+	ret
+
+	.size	sha256_block_partial, .-sha256_block_partial
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 04/12] crypto/armv8: Add AES+SHA256 crypto operations for ARMv8
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (2 preceding siblings ...)
  2016-12-07  2:32   ` [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8 zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 05/12] crypto/armv8: Add AES+SHA1 " zbigniew.bodek
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek, Emery Davis

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch adds AES-128-CBC + SHA256 low-level
crypto operations for ARMv8 processors.
The assembly code is a base for an optimized PMD
and is currently excluded from the build.

This code is optimized to provide performance boost
for combined operations such as encryption + HMAC
generation, decryption + HMAC validation.

Introduced operations add support for AES-128-CBC in
combination with:
SHA256 MAC, SHA256 HMAC

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Signed-off-by: Emery Davis <emery.davis@caviumnetworks.com>
---
 drivers/crypto/armv8/asm/aes128cbc_sha256.S        | 1544 ++++++++++++++++
 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S   | 1879 ++++++++++++++++++++
 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S    | 1658 +++++++++++++++++
 .../crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S   | 1832 +++++++++++++++++++
 4 files changed, 6913 insertions(+)
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256.S
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
 create mode 100644 drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S

diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha256.S b/drivers/crypto/armv8/asm/aes128cbc_sha256.S
new file mode 100644
index 0000000..caed87d
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha256.S
@@ -0,0 +1,1544 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha256
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA256(out)
+ *
+ * Prototype:
+ * void aes128cbc_sha256(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha256(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- regShaStateABCD
+ * v25 -- regShaStateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results
+ * are not defined. For AES partial blocks the user is required
+ * to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha256.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha256
+	.type	aes128cbc_sha256,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+aes128cbc_sha256:
+/* fetch args */
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next aes_ptr_in */
+	/* address of sha init state consts */
+	adr		x12,.Linit_sha_state
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,12			/* no main loop if <12 */
+	ld1		{v24.4s, v25.4s},[x12]	/* init ABCD, EFGH. (2 cycs) */
+	b.lt		.Lshort_cases		/* branch if < 12 */
+
+	/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	/* proceed */
+	ld1		{v3.16b},[x5]		/* get 1st ivec */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		x11,x4			/* len -> x11 needed at end */
+	lsr		x12,x11,6		/* total_blocks */
+
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out  */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	aese		v0.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1			/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]	/* rcon */
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0			/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	sub		x7,x12,1	/* main_blocks = total_blocks - 1 */
+	and		x13,x10,3	/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+/*
+ * Note, aes_blocks_left := number after the main (sha)
+ * block is done. Can be 0
+ */
+/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle,
+ * this takes longer than it looks.
+ * Thats OK since there are 6 cycles before we can use
+ * the load anyway; so this goes as fast as it can without
+ * SW pipelining (too complicated given the code size)
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0, aes res 0 */
+/* next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b		/* fix endian w1, aes res 1 */
+/* pref next aes_ptr_out, streaming  */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+	ld1		{v5.16b},[x9],16	/* key1 */
+/*
+ * aes xform 0, sha quad 0
+ */
+	aese		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	rev32		v28.16b,v2.16b		/* fix endian w2, aes res 2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+	aesmc		v0.16b,v0.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aese		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b		/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aese		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	eor		v1.16b,v1.16b,v0.16b	/* mode op 1 xor w/prev value */
+	ld1		{v7.16b},[x9],16	/* key7  */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aese		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aese		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aese		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aese		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b	/* mode of 2 xor w/prev value */
+
+/* aes xform 2, sha quad 2 */
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aese		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesmc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/ prev value */
+
+/* aes xform 3, sha quad 3 (hash only) */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aese		v3.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbnz		x7,.Lmain_loop		/* loop if more to do */
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0 */
+	rev32		v27.16b,v1.16b		/* fix endian w1 */
+	rev32		v28.16b,v2.16b		/* fix endian w2 */
+	rev32		v29.16b,v3.16b		/* fix endian w3 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	cbz		x13, .Lbm2fromQ0	/* skip if none left */
+	subs		x14,x13,1	/* local copy of aes_blocks_left */
+
+/*
+ * mode op 0
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v0.16b},[x0],16
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aese		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ1
+/*
+ * mode op 1
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v1.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aese		v1.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v1.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v1.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1		/* dec counter */
+	aese		v1.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v1.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v16.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ2
+
+/*
+ * mode op 2
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v2.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aese		v2.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	/* join common code at Quad 3 */
+	b		.Lbm2fromQ3
+
+/*
+ * now there is the b-2 sha block before the final one.  Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+/* quad 0 */
+.Lbm2fromQ0:
+	mov		x9,x8				/* top of rcon */
+	ld1		{v4.16b},[x9],16		/* key0 */
+	ld1		{v5.16b},[x9],16		/* key1 */
+	ld1		{v6.16b},[x9],16		/* key2 */
+	ld1		{v7.16b},[x9],16		/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s		/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lbm2fromQ1:
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s		/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lbm2fromQ2:
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s		/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lbm2fromQ3:
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s		/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b		/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b		/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b		/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		w15,0x80		/* that's the 1 of the pad */
+	lsr		x12,x11,32		/* len_hi */
+	and		x9,x11,0xffffffff	/* len_lo */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		v26.b[0],w15		/* assume block 0 is dst */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x9,x9,3			/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b	/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100		/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1		/* dec amount left */
+	mov		v27.b[0],w15		/* assume block 1 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1		/* dec amount left */
+	mov		v28.b[0],w15		/* assume block 2 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15		/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi, len_lo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+
+/*
+ * do last sha of pad block
+ */
+
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	mov		x9,sp
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		sp,sp,8*16
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+/*
+ * now we just have to put this into big endian and store!
+ */
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b			/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b			/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]		/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	mov		w15,0x80			/* sha padding word */
+
+	lsl		x11,x10,4		/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	eor		v0.16b,v0.16b,v3.16b		/* xor w/prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	/* assume this was final block */
+	mov		v27.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	/* load res to sha 0, endian swap */
+	rev32		v26.16b,v0.16b
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/ prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	/* assume this was final block */
+	mov		v28.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	rev32		v27.16b,v1.16b	/* load res to sha 0, endian swap */
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	/* assume this was final block */
+	mov		v29.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	rev32		v28.16b,v2.16b	/* load res to sha 0, endian swap */
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+
+	rev32		v29.16b,v3.16b	/* load res to sha 0, endian swap */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * now we have the sha256 to do for these 4 aes blocks
+ */
+
+	mov	v22.16b,v24.16b			/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b			/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8				/* top of rcon */
+	ld1		{v4.16b},[x9],16		/* key0 */
+	ld1		{v5.16b},[x9],16		/* key1 */
+	ld1		{v6.16b},[x9],16		/* key2 */
+	ld1		{v7.16b},[x9],16		/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s		/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s		/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s		/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16		/* key4 */
+	ld1		{v5.16b},[x9],16		/* key5 */
+	ld1		{v6.16b},[x9],16		/* key6 */
+	ld1		{v7.16b},[x9],16		/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s		/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s		/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s		/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s		/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b		/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+	/* assume this was final block */
+	mov		v26.b[3],w15
+
+	sub		x10,x10,1		/* dec num_blocks */
+	cbnz		x10,.Lshort_loop	/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha256 blocks
+ */
+.Lpost_short_loop:
+	lsr	x12,x11,32			/* len_hi */
+	and	x13,x11,0xffffffff		/* len_lo */
+	lsl	x12,x12,3			/* len_hi in bits */
+	lsl	x13,x13,3			/* len_lo in bits */
+
+	mov	v29.s[3],w13			/* len_lo */
+	mov	v29.s[2],w12			/* len_hi */
+
+/* do final block */
+	mov	v22.16b,v24.16b			/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b			/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	mov		x9,sp
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		sp,sp,8*16
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b			/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b			/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]		/* save them both */
+	ret
+
+	.size	aes128cbc_sha256, .-aes128cbc_sha256
diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S b/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
new file mode 100644
index 0000000..499e8eb
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha256_hmac.S
@@ -0,0 +1,1879 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha256_hmac
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA256(o_key_pad | SHA256(i_key_pad | out))
+ *
+ * Prototype:
+ * void aes128cbc_sha256_hmac(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha256_hmac(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- sha state ABCD
+ * v25 -- sha state EFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results
+ * are not defined. For AES partial blocks the user is required
+ * to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha256_hmac.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha256_hmac
+	.type	aes128cbc_sha256_hmac,%function
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+aes128cbc_sha256_hmac:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	/* init ABCD, EFGH. */
+	ld1		{v24.4s, v25.4s},[x6]
+	/* save pointer to o_key_pad partial hash */
+	ldr		x6, [x5, #HMAC_OKEYPAD]
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next aes_ptr_in */
+	/* address of sha init state consts */
+	adr		x12,.Linit_sha_state
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,12			/* no main loop if <12 */
+	b.lt		.Lshort_cases		/* branch if < 12 */
+
+	/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+/* proceed */
+	ld1		{v3.16b},[x5]		/* get 1st ivec */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		x11,x4			/* len -> x11 needed at end */
+	lsr		x12,x11,6		/* total_blocks */
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out  */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	aese		v0.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1			/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]	/* rcon */
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0			/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	sub		x7,x12,1	/* main_blocks = total_blocks - 1 */
+	and		x13,x10,3	/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+
+/*
+ * Note, aes_blocks_left := number after the main (sha)
+ * block is done. Can be 0
+ */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+
+/*
+ * because both mov, rev32 and eor have a busy cycle,
+ * this takes longer than it looks. Thats OK since there are 6 cycles
+ * before we can use the load anyway; so this goes as fast as it can without
+ * SW pipelining (too complicated given the code size)
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0, aes res 0 */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b		/* fix endian w1, aes res 1 */
+	/* pref next aes_ptr_out, streaming  */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+	ld1		{v5.16b},[x9],16	/* key1 */
+/*
+ * aes xform 0, sha quad 0
+ */
+	aese		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	rev32		v28.16b,v2.16b		/* fix endian w2, aes res 2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+	aesmc		v0.16b,v0.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aese		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b		/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aese		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	eor		v1.16b,v1.16b,v0.16b	/* mode op 1 xor w/prev value */
+	ld1		{v7.16b},[x9],16	/* key7  */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aese		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aese		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aese		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aese		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b	/* mode of 2 xor w/prev value */
+
+/* aes xform 2, sha quad 2 */
+
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aese		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aese		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesmc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/prev value */
+
+/* aes xform 3, sha quad 3 (hash only) */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aese		v3.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbnz		x7,.Lmain_loop		/* loop if more to do */
+
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0 */
+	rev32		v27.16b,v1.16b		/* fix endian w1 */
+	rev32		v28.16b,v2.16b		/* fix endian w2 */
+	rev32		v29.16b,v3.16b		/* fix endian w3 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	cbz		x13, .Lbm2fromQ0	/* skip if none left */
+	subs		x14,x13,1	/* local copy of aes_blocks_left */
+/*
+ * mode op 0
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v0.16b},[x0],16
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aese		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesmc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v0.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v0.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v0.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v0.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ1
+/*
+ * mode op 1
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v1.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aese		v1.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesmc		v1.16b,v1.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v1.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v1.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1		/* dec counter */
+	aese		v1.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v1.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v1.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aese		v1.16b,v16.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ2
+/*
+ * mode op 2
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v2.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aese		v2.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesmc		v2.16b,v2.16b
+	sha256su0	v26.4s,v27.4s
+	aese		v2.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aese		v2.16b,v10.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256su0	v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aese		v2.16b,v12.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesmc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aese		v2.16b,v14.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	sha256su0	v29.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	/* join common code at Quad 3 */
+	b		.Lbm2fromQ3
+/*
+ * now there is the b-2 sha block before the final one.  Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+/* quad 0 */
+.Lbm2fromQ0:
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lbm2fromQ1:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lbm2fromQ2:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lbm2fromQ3:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b	/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b	/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b	/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		w15,0x80		/* that's the 1 of the pad */
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32		/* len_hi */
+	and		x9,x11,0xffffffff	/* len_lo */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		v26.b[0],w15		/* assume block 0 is dst */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x9,x9,3			/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b	/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100		/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1		/* dec amount left */
+	mov		v27.b[0],w15		/* assume block 1 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1		/* dec amount left */
+	mov		v28.b[0],w15		/* assume block 2 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15		/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi,LenLo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+/*
+ * do last sha of pad block
+ */
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80		/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+	/* size of o_key_pad + inner hash */
+	mov		x11, #64+32
+	lsl		x11, x11, 3
+	/* move length to the end of the block */
+	mov		v29.s[3], w11
+
+	ld1		{v4.16b},[x8],16	/* key0 */
+	ld1		{v5.16b},[x8],16	/* key1 */
+	ld1		{v6.16b},[x8],16	/* key2 */
+	ld1		{v7.16b},[x8],16	/* key3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key4 */
+	ld1		{v5.16b},[x8],16	/* key5 */
+	ld1		{v6.16b},[x8],16	/* key6 */
+	ld1		{v7.16b},[x8],16	/* key7 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key8 */
+	ld1		{v5.16b},[x8],16	/* key9 */
+	ld1		{v6.16b},[x8],16	/* key10 */
+	ld1		{v7.16b},[x8],16	/* key11 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key12 */
+	ld1		{v5.16b},[x8],16	/* key13 */
+	ld1		{v6.16b},[x8],16	/* key14 */
+	ld1		{v7.16b},[x8],16	/* key15 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]	/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	mov		w15,0x80			/* sha padding word */
+
+	lsl		x11,x10,4		/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	eor		v0.16b,v0.16b,v3.16b		/* xor w/prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	/* assume this was final block */
+	mov		v27.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	rev32		v26.16b,v0.16b	/* load res to sha 0, endian swap */
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	/* assume this was final block */
+	mov		v28.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	rev32		v27.16b,v1.16b	/* load res to sha 0, endian swap */
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	/* assume this was final block */
+	mov		v29.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	rev32		v28.16b,v2.16b	/* load res to sha 0, endian swap */
+	sub		x10,x10,1		/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop	/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/ prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+
+	rev32		v29.16b,v3.16b	/* load res to sha 0, endian swap */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * now we have the sha256 to do for these 4 aes blocks
+ */
+	mov	v22.16b,v24.16b			/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b			/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+	/* assume this was final block */
+	mov		v26.b[3],w15
+
+	sub		x10,x10,1		/* dec num_blocks */
+	cbnz		x10,.Lshort_loop	/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha256 blocks
+ */
+.Lpost_short_loop:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add	x11, x11, #64
+	lsr	x12,x11,32			/* len_hi */
+	and	x13,x11,0xffffffff		/* len_lo */
+	lsl	x12,x12,3			/* len_hi in bits */
+	lsl	x13,x13,3			/* len_lo in bits */
+
+	mov	v29.s[3],w13			/* len_lo */
+	mov	v29.s[2],w12			/* len_hi */
+
+/* do final block */
+	mov	v22.16b,v24.16b			/* working ABCD <- ABCD */
+	mov	v23.16b,v25.16b			/* working EFGH <- EFGH */
+
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80		/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+	/* size of o_key_pad + inner hash */
+	mov		x11, #64+32
+	lsl		x11, x11, 3
+	/* move length to the end of the block */
+	mov		v29.s[3], w11
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11		/* and the higher part */
+
+	ld1		{v4.16b},[x8],16	/* key0 */
+	ld1		{v5.16b},[x8],16	/* key1 */
+	ld1		{v6.16b},[x8],16	/* key2 */
+	ld1		{v7.16b},[x8],16	/* key3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key4 */
+	ld1		{v5.16b},[x8],16	/* key5 */
+	ld1		{v6.16b},[x8],16	/* key6 */
+	ld1		{v7.16b},[x8],16	/* key7 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key8 */
+	ld1		{v5.16b},[x8],16	/* key9 */
+	ld1		{v6.16b},[x8],16	/* key10 */
+	ld1		{v7.16b},[x8],16	/* key11 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key12 */
+	ld1		{v5.16b},[x8],16	/* key13 */
+	ld1		{v6.16b},[x8],16	/* key14 */
+	ld1		{v7.16b},[x8],16	/* key15 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]	/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+	.size	aes128cbc_sha256_hmac, .-aes128cbc_sha256_hmac
diff --git a/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
new file mode 100644
index 0000000..e33c77b
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_aes128cbc_dec.S
@@ -0,0 +1,1658 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha256/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA256(in)
+ *
+ * Prototype:
+ *
+ * void sha256_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha256_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- regShaStateABCD
+ * v25 -- regShaStateEFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16,
+ * otherwise results are not defined. For AES partial blocks the user
+ * is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks,
+ * however they are somewhat optimized, and more so than the enc/auth versions.
+ */
+	.file "sha256_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha256_aes128cbc_dec
+	.type   sha256_aes128cbc_dec,%function
+
+
+	.align  4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+sha256_aes128cbc_dec:
+/* fetch args */
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next *in */
+	/* address of sha init state consts */
+	adr		x12,.Linit_sha_state
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,16			/* no main loop if <16 */
+	ld1		{v24.4s, v25.4s},[x12]	/* init ABCD, EFGH. (2 cycs) */
+	blt		.Lshort_cases		/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4			/* len -> x11 needed at end */
+	mov		x7,sp			/* copy for address mode */
+	ld1		{v30.16b},[x5]		/* get 1st ivec */
+	lsr		x12,x11,6		/* total_blocks (sha) */
+	mov		x4,x0			/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	ld1		{v29.16b},[x4],16	/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha256 block
+ */
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+/*
+ * do the first sha256 block on the plaintext
+ */
+	mov		v22.16b,v24.16b		/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	mov		v23.16b,v25.16b		/* init working EFGH */
+	st1		{v9.16b},[x7],16
+
+	rev32		v26.16b,v26.16b		/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b		/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b		/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b		/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	st1		{v14.16b},[x7],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	st1		{v15.16b},[x7],16
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	ld1		{v29.16b},[x4],16	/* next w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2		/* main_blocks=total_blocks-5 */
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	and		x13,x10,3		/* aes_blocks_left */
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	add		x2,x0,128		/* lead_ptr = *in */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+
+/*
+ * Because both mov, rev32 and eor have a busy cycle,
+ * this takes longer than it looks.
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v4.16b},[x9],16	/* key0 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* get next aes block, with update */
+	ld1		{v30.16b},[x0],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/* aes xform 2, sha quad 2 */
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+
+/* aes xform 3, sha quad 3 (hash only) */
+
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	ld1		{v29.16b},[x4],16	/* next w3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/prev value */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbnz		x7,.Lmain_loop		/* loop if more to do */
+/*
+ * now the loop epilog. Since the reads for sha have already been done
+ * in advance, we have to have an extra unwind.
+ * This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+	aesimc		v0.16b,v0.16b
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/* mode op 2 */
+
+/* aes xform 2, sha quad 2 */
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+
+/* mode op 3 */
+
+/* aes xform 3, sha quad 3 (hash only) */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	/* read first aes block, no bump */
+	ld1		{v0.16b},[x0]
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ prev value */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * Now, there is the final b-1 sha256 padded block.
+ * This contains between 0-3 aes blocks. We take some pains to avoid read spill
+ * by only reading the blocks that are actually defined.
+ * this is also the final sha block code for the short_cases.
+ */
+.Ljoin_common:
+	mov		w15,0x80		/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100	/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b	/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b	/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v26.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1		/* dec amount left */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	cbnz		x14,.Lpad200	/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b	/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v27.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1		/* dec amount left */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	cbnz		x14,.Lpad300	/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v28.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v29.b[3],w15		/* all data is bogus */
+
+.Lpad_done:
+	lsr		x12,x11,32		/* len_hi */
+	and		x14,x11,0xffffffff	/* len_lo */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x14,x14,3		/* len_lo in bits */
+
+	mov		v29.s[3],w14		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left	/* none to do */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v31.16b},[x0],16
+
+	mov		x9,x8				/* top of rcon */
+	ld1		{v4.16b},[x9],16		/* key0 */
+	ld1		{v5.16b},[x9],16		/* key1 */
+	ld1		{v6.16b},[x9],16		/* key2 */
+	aesd		v0.16b,v8.16b
+	ld1		{v7.16b},[x9],16		/* key3 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesd		v0.16b,v10.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	aesd		v0.16b,v11.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256h2	q23, q21, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1		/* dec counter */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v30.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesd		v0.16b,v10.16b
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v11.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v13.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v29.4s,v26.4s
+	aesd		v0.16b,v16.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1		/* dec counter */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v10.16b
+	sha256h2	q23, q21, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v0.16b,v11.16b
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v5.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesd		v0.16b,v13.16b
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesd		v0.16b,v14.16b
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b	/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b	/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b	/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+/*
+ * now we just have to put this into big endian and store! and clean up stack...
+ */
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	rev32		v24.16b,v24.16b			/* big endian ABCD */
+	ld1		{v12.16b - v15.16b},[x9]
+	rev32		v25.16b,v25.16b			/* big endian EFGH */
+
+	st1		{v24.4s,v25.4s},[x3]		/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	lsl		x11,x10,4		/* len = aes_blocks*16 */
+	mov		x4,x0			/* sha_ptr_in = in */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block
+ * and 0-3 aes blocks. Note that everything is done serially
+ * to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4			/* check if 4 or more */
+	/* if less, bail to last block */
+	blt		.Llast_sha_block
+
+	ld1		{v31.16b},[x4]		/* next w no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x4],16
+	rev32		v26.16b,v0.16b		/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x4],16
+	rev32		v27.16b,v1.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/ prev value */
+
+	ld1		{v31.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x4],16
+	rev32		v28.16b,v2.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x4],16
+	rev32		v29.16b,v3.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b		/* xor w/prev value */
+
+/*
+ * now we have the sha256 to do for these 4 aes blocks. Note that.
+ */
+
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+
+/* quad 0 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	sub		x10,x10,4		/* 4 less */
+	b		.Lshort_loop		/* keep looping */
+/*
+ * this is arranged so that we can join the common unwind code that does
+ * the last sha block and the final 0-3 aes blocks
+ */
+.Llast_sha_block:
+	mov		x13,x10			/* copy aes blocks for common */
+	b		.Ljoin_common		/* join common code */
+
+	.size	sha256_aes128cbc_dec, .-sha256_aes128cbc_dec
diff --git a/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
new file mode 100644
index 0000000..4ca34c1
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha256_hmac_aes128cbc_dec.S
@@ -0,0 +1,1832 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha256_hmac/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA256(o_key_pad | SHA256(i_key_pad | in))
+ *
+ * Prototype:
+ *
+ * void sha256_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha256_hmac_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 - v20 -- round keys
+ * v21 -- ABCD tmp
+ * v22 -- sha working state ABCD (q22)
+ * v23 -- sha working state EFGH (q23)
+ * v24 -- sha state ABCD
+ * v25 -- sha state EFGH
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16,
+ * otherwise results are not defined. For AES partial blocks the user
+ * is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks,
+ * however they are somewhat optimized, and more so than the enc/auth versions.
+ */
+	.file "sha256_hmac_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha256_hmac_aes128cbc_dec
+	.type	sha256_hmac_aes128cbc_dec,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
+	.word		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
+	.word		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
+	.word		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
+	.word		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
+	.word		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
+	.word		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
+	.word		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
+	.word		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
+	.word		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
+	.word		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
+	.word		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
+	.word		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
+	.word		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
+	.word		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
+	.word		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
+
+.Linit_sha_state:
+	.word		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a
+	.word		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
+
+sha256_hmac_aes128cbc_dec:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	/* init ABCD, EFGH */
+	ld1		{v24.4s, v25.4s},[x6]
+	/* save pointer to o_key_pad partial hash */
+	ldr		x6, [x5, #HMAC_OKEYPAD]
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next *in */
+	/* address of sha init state consts */
+	adr		x12,.Linit_sha_state
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,16			/* no main loop if <16 */
+	blt		.Lshort_cases		/* branch if < 12 */
+
+	/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4			/* len -> x11 needed at end */
+	mov		x7,sp			/* copy for address mode */
+	ld1		{v30.16b},[x5]		/* get 1st ivec */
+	lsr		x12,x11,6		/* total_blocks (sha) */
+	mov		x4,x0			/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	ld1		{v29.16b},[x4],16	/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha256 block
+ */
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+/*
+ * do the first sha256 block on the plaintext
+ */
+
+	mov		v22.16b,v24.16b		/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	mov		v23.16b,v25.16b		/* init working EFGH */
+	st1		{v9.16b},[x7],16
+
+	rev32		v26.16b,v26.16b		/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b		/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b		/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b		/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+/* quad 0 */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	st1		{v14.16b},[x7],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	st1		{v15.16b},[x7],16
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h2	q23, q21, v7.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h2	q23, q21, v4.4s
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v5.4s
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	ld1		{v29.16b},[x4],16	/* next w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2		/* main_blocks=total_blocks-5 */
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	and		x13,x10,3		/* aes_blocks_left */
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	add		x2,x0,128		/* lead_ptr = *in */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+
+/*
+ * Because both mov, rev32 and eor have a busy cycle, this takes longer
+ * than it looks.
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v4.16b},[x9],16	/* key0 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* get next aes block, with update */
+	ld1		{v30.16b},[x0],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/* aes xform 2, sha quad 2 */
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+
+/* aes xform 3, sha quad 3 (hash only) */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	ld1		{v29.16b},[x4],16	/* next w3 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ prev value */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbnz		x7,.Lmain_loop		/* loop if more to do */
+/*
+ * Now the loop epilog. Since the reads for sha have already been done
+ * in advance, we have to have an extra unwind.
+ * This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+
+/*
+ * aes xform 0, sha quad 0
+ */
+	aesd		v0.16b,v8.16b
+	ld1		{v6.16b},[x9],16	/* key2 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+	aesimc		v0.16b,v0.16b
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesd		v0.16b,v9.16b
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v26.4s,v27.4s
+	aesd		v0.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	sha256h		q22, q23, v5.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v16.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+
+/* aes xform 1, sha quad 1 */
+	sha256su0	v26.4s,v27.4s
+	ld1		{v7.16b},[x9],16	/* key7 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesd		v1.16b,v8.16b
+	sha256h		q22, q23, v4.4s
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h2	q23, q21, v4.4s
+	aesimc		v1.16b,v1.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v1.16b,v9.16b
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v1.16b,v10.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesd		v1.16b,v11.16b
+	ld1		{v5.16b},[x9],16	/* key5 */
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v1.16b,v12.16b
+	sha256h2	q23, q21, v6.4s
+	ld1		{v6.16b},[x9],16	/* key6 */
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha256su0	v29.4s,v26.4s
+	aesd		v1.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v1.16b,v1.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v1.16b,v14.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/* mode op 2 */
+
+/* aes xform 2, sha quad 2 */
+	sha256su0	v26.4s,v27.4s
+	aesd		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v2.16b,v9.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	ld1		{v4.16b},[x9],16	/* key4 */
+	aesimc		v2.16b,v2.16b
+	sha256su0	v27.4s,v28.4s
+	aesd		v2.16b,v10.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v2.16b,v11.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	ld1		{v5.16b},[x9],16	/* key5 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v2.16b,v13.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256su0	v29.4s,v26.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v2.16b,v2.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v2.16b,v15.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	aesimc		v2.16b,v2.16b
+	ld1		{v7.16b},[x9],16	/* key7 */
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+
+/* mode op 3 */
+
+/* aes xform 3, sha quad 3 (hash only) */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v3.16b,v12.16b
+	/* read first aes block, no bump */
+	ld1		{v0.16b},[x0]
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v3.16b,v3.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/prev value */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * Now, there is the final b-1 sha256 padded block.
+ * This contains between 0-3 aes blocks. We take some pains to avoid read spill
+ * by only reading the blocks that are actually defined.
+ * This is also the final sha block code for the shortCases.
+ */
+.Ljoin_common:
+	mov		w15,0x80		/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100	/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b	/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b	/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v26.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1		/* dec amount left */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	cbnz		x14,.Lpad200	/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b	/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v27.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1		/* dec amount left */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	cbnz		x14,.Lpad300	/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v28.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v29.b[3],w15		/* all data is bogus */
+
+.Lpad_done:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32		/* len_hi */
+	and		x14,x11,0xffffffff	/* len_lo */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x14,x14,3		/* len_lo in bits */
+
+	mov		v29.s[3],w14		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left	/* none to do */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v31.16b},[x0],16
+
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	aesd		v0.16b,v8.16b
+	ld1		{v7.16b},[x9],16	/* key3 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	aesd		v0.16b,v10.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	aesd		v0.16b,v11.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v14.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256h2	q23, q21, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1		/* dec counter */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v30.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesd		v0.16b,v10.16b
+	sha256h		q22, q23, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v4.4s
+	aesd		v0.16b,v11.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v27.4s,v28.4s
+	aesd		v0.16b,v12.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v5.4s
+	aesd		v0.16b,v13.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	aesimc		v0.16b,v0.16b
+	sha256su0	v28.4s,v29.4s
+	aesd		v0.16b,v14.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	sha256su0	v29.4s,v26.4s
+	aesd		v0.16b,v16.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+
+	sub		x13,x13,1		/* dec counter */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	aesd		v0.16b,v8.16b
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	aesimc		v0.16b,v0.16b
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	aesd		v0.16b,v9.16b
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v4.4s
+	aesd		v0.16b,v10.16b
+	sha256h2	q23, q21, v4.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v26.4s,v28.4s,v29.4s
+	aesd		v0.16b,v11.16b
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v5.4s
+	aesd		v0.16b,v12.16b
+	sha256h2	q23, q21, v5.4s
+	aesimc		v0.16b,v0.16b
+	sha256su1	v27.4s,v29.4s,v26.4s
+	aesd		v0.16b,v13.16b
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesd		v0.16b,v14.16b
+	sha256h		q22, q23, v6.4s
+	aesimc		v0.16b,v0.16b
+	sha256h2	q23, q21, v6.4s
+	aesd		v0.16b,v15.16b
+	sha256su1	v28.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+
+	aesd		v0.16b,v16.16b
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	aesimc		v0.16b,v0.16b
+	sha256h		q22, q23, v7.4s
+	aesd		v0.16b,v17.16b
+	sha256h2	q23, q21, v7.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	sha256su1	v29.4s,v27.4s,v28.4s
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	eor		v26.16b,v26.16b,v26.16b	/* zero reg */
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	eor		v27.16b,v27.16b,v27.16b	/* zero reg */
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	eor		v28.16b,v28.16b,v28.16b	/* zero reg */
+	sha256h2	q23, q21, v7.4s
+
+	add		v26.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v27.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80		/* that's the 1 of the pad */
+	mov		v28.b[3], w11
+	/* size of o_key_pad + inner hash */
+	mov		x11, #64+32
+	lsl		x11, x11, 3
+	/* move length to the end of the block */
+	mov		v29.s[3], w11
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11		/* and the higher part */
+
+	ld1		{v4.16b},[x8],16	/* key0 */
+	ld1		{v5.16b},[x8],16	/* key1 */
+	ld1		{v6.16b},[x8],16	/* key2 */
+	ld1		{v7.16b},[x8],16	/* key3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key4 */
+	ld1		{v5.16b},[x8],16	/* key5 */
+	ld1		{v6.16b},[x8],16	/* key6 */
+	ld1		{v7.16b},[x8],16	/* key7 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key8 */
+	ld1		{v5.16b},[x8],16	/* key9 */
+	ld1		{v6.16b},[x8],16	/* key10 */
+	ld1		{v7.16b},[x8],16	/* key11 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key8+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su0	v26.4s,v27.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key9+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su0	v27.4s,v28.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key10+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su0	v28.4s,v29.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key11+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su0	v29.4s,v26.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+	ld1		{v4.16b},[x8],16	/* key12 */
+	ld1		{v5.16b},[x8],16	/* key13 */
+	ld1		{v6.16b},[x8],16	/* key14 */
+	ld1		{v7.16b},[x8],16	/* key15 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key12+w0 */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v5.4s,v5.4s,v27.4s	/* wk = key13+w1 */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v6.4s,v6.4s,v28.4s	/* wk = key14+w2 */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+
+	add		v7.4s,v7.4s,v29.4s	/* wk = key15+w3 */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+	st1		{v24.4s,v25.4s},[x3]	/* save them both */
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	st1		{v24.4s,v25.4s},[x3]	/* save them both */
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	lsl		x11,x10,4			/* len=aes_blocks*16 */
+	mov		x4,x0				/* sha_ptr_in = in */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block
+ * and 0-3 aes blocks.
+ * Note that everything is done serially to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4			/* check if 4 or more */
+	/* if less, bail to last block */
+	blt		.Llast_sha_block
+
+	ld1		{v31.16b},[x4]		/* next w no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x4],16
+	rev32		v26.16b,v0.16b		/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x4],16
+	rev32		v27.16b,v1.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/prev value */
+
+	ld1		{v31.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x4],16
+	rev32		v28.16b,v2.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x4],16
+	rev32		v29.16b,v3.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b		/* xor w/prev value */
+
+/*
+ * now we have the sha256 to do for these 4 aes blocks. Note that.
+ */
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	mov		v22.16b,v24.16b		/* working ABCD <- ABCD */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	mov		v23.16b,v25.16b		/* working EFGH <- EFGH */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+
+/* quad 0 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 1 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 2 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key4+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key5+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key6+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key7+w3 */
+
+	sha256su0	v26.4s,v27.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+	sha256su1	v26.4s,v28.4s,v29.4s
+
+	sha256su0	v27.4s,v28.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+	sha256su1	v27.4s,v29.4s,v26.4s
+
+	sha256su0	v28.4s,v29.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+	sha256su1	v28.4s,v26.4s,v27.4s
+
+	sha256su0	v29.4s,v26.4s
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+	sha256su1	v29.4s,v27.4s,v28.4s
+
+/* quad 3 */
+	ld1		{v4.16b},[x9],16	/* key4 */
+	ld1		{v5.16b},[x9],16	/* key5 */
+	ld1		{v6.16b},[x9],16	/* key6 */
+	ld1		{v7.16b},[x9],16	/* key7 */
+
+	add		v4.4s,v4.4s,v26.4s	/* wk = key0+w0 */
+	add		v5.4s,v5.4s,v27.4s	/* wk = key1+w1 */
+	add		v6.4s,v6.4s,v28.4s	/* wk = key2+w2 */
+	add		v7.4s,v7.4s,v29.4s	/* wk = key3+w3 */
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v4.4s
+	sha256h2	q23, q21, v4.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v5.4s
+	sha256h2	q23, q21, v5.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v6.4s
+	sha256h2	q23, q21, v6.4s
+
+	mov		v21.16b, v22.16b	/* copy abcd */
+	sha256h		q22, q23, v7.4s
+	sha256h2	q23, q21, v7.4s
+
+	add		v24.4s,v24.4s,v22.4s	/* ABCD += working copy */
+	add		v25.4s,v25.4s,v23.4s	/* EFGH += working copy */
+
+	sub		x10,x10,4		/* 4 less */
+	b		.Lshort_loop		/* keep looping */
+/*
+ * This is arranged so that we can join the common unwind code that does
+ * the last sha block and the final 0-3 aes blocks.
+ */
+.Llast_sha_block:
+	mov		x13,x10			/* copy aes blocks for common */
+	b		.Ljoin_common		/* join common code */
+
+	.size	sha256_hmac_aes128cbc_dec, .-sha256_hmac_aes128cbc_dec
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 05/12] crypto/armv8: Add AES+SHA1 crypto operations for ARMv8
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (3 preceding siblings ...)
  2016-12-07  2:32   ` [PATCH v2 04/12] crypto/armv8: Add AES+SHA256 " zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-07  2:32   ` [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek, Emery Davis

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch adds AES-128-CBC + SHA1 low-level
crypto operations for ARMv8 processors.
The assembly code is a base for an optimized PMD
and is currently excluded from the build.

This code is optimized to provide performance boost
for combined operations such as encryption + HMAC
generation, decryption + HMAC validation.

Introduced operations add support for AES-128-CBC in
combination with:
SHA1 MAC, SHA1 HMAC

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Signed-off-by: Emery Davis <emery.davis@caviumnetworks.com>
---
 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S     | 1719 ++++++++++++++++++++
 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S | 1650 +++++++++++++++++++
 2 files changed, 3369 insertions(+)
 create mode 100644 drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
 create mode 100644 drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S

diff --git a/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S b/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
new file mode 100644
index 0000000..8b8348a
--- /dev/null
+++ b/drivers/crypto/armv8/asm/aes128cbc_sha1_hmac.S
@@ -0,0 +1,1719 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Enc/Auth Primitive = aes128cbc/sha1_hmac
+ *
+ * Operations:
+ *
+ * out = encrypt-AES128CBC(in)
+ * return_hash_ptr = SHA1(o_key_pad | SHA1(i_key_pad | out))
+ *
+ * Prototype:
+ * void aes128cbc_sha1_hmac(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * aes128cbc_sha1_hmac(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 -- temp register for SHA1
+ * v20 -- ABCD copy (q20)
+ * v21 -- sha working state (q21)
+ * v22 -- sha working state (q22)
+ * v23 -- temp register for SHA1
+ * v24 -- sha state ABCD
+ * v25 -- sha state E
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16, otherwise results are not
+ * defined. For AES partial blocks the user is required to pad the input
+ * to modulus 16 = 0.
+ *
+ * Short lengths are not optimized at < 12 AES blocks
+ */
+
+	.file "aes128cbc_sha1_hmac.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global aes128cbc_sha1_hmac
+	.type	aes128cbc_sha1_hmac,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+aes128cbc_sha1_hmac:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	/* init ABCD, E */
+	ld1		{v24.4s, v25.4s},[x6]
+	/* save pointer to o_key_pad partial hash */
+	ldr		x6, [x5, #HMAC_OKEYPAD]
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,12			/* no main loop if <12 */
+	b.lt		.Lshort_cases		/* branch if < 12 */
+
+	/* protect registers */
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	/* proceed */
+	ld1		{v3.16b},[x5]		/* get 1st ivec */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		x11,x4			/* len -> x11 needed at end */
+	lsr		x12,x11,6		/* total_blocks */
+/*
+ * now we can do the loop prolog, 1st aes sequence of 4 blocks
+ */
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ ivec (modeop) */
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	aesmc		v0.16b,v0.16b
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+	aese		v0.16b,v9.16b
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out  */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+	aesmc		v0.16b,v0.16b
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	aese		v0.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aesmc		v0.16b,v0.16b
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	aese		v0.16b,v16.16b
+	mov		x4,x1			/* sha_ptr_in = aes_ptr_out */
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	prfm		PLDL1KEEP,[x8,0*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,2*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,4*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	prfm		PLDL1KEEP,[x8,6*64]	/* rcon */
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	prfm		PLDL1KEEP,[x8,8*64]	/* rcon */
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ivec (modeop) */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	mov		x2,x0			/* lead_ptr = aes_ptr_in */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	prfm		PLDL1KEEP,[x8,10*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	prfm		PLDL1KEEP,[x8,12*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	prfm		PLDL1KEEP,[x8,14*64]	/* rcon */
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/ ivec (modeop) */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	/* main_blocks = total_blocks - 1 */
+	sub		x7,x12,1
+	and		x13,x10,3		/* aes_blocks_left */
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+
+/*
+ * Note, aes_blocks_left := number after
+ * the main (sha) block is done. Can be 0
+ */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3  */
+/*
+ * main combined loop CBC
+ */
+.Lmain_loop:
+/*
+ * because both mov, rev32 and eor have a busy cycle,
+ * this takes longer than it looks.
+ * Thats OK since there are 6 cycles before we can use the load anyway;
+ * so this goes as fast as it can without SW pipelining (too complicated
+ * given the code size)
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0, aes res 0 */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v1.16b		/* fix endian w1, aes res 1 */
+	/* pref next aes_ptr_out, streaming  */
+	prfm		PLDL1KEEP,[x1,64]
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	aese		v0.16b,v8.16b
+	rev32		v28.16b,v2.16b		/* fix endian w2, aes res 2 */
+	aesmc		v0.16b,v0.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	aese		v0.16b,v9.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v0.16b,v10.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	add		v23.4s,v4.4s,v27.4s
+	/* no place to get rid of this stall */
+	rev32		v29.16b,v3.16b		/* fix endian w3, aes res 3 */
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aese		v0.16b,v12.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aese		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aese		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aese		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	eor		v1.16b,v1.16b,v0.16b	/* mode op 1 xor w/prev value */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aese		v1.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v10.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	sha1h		s22,s24
+	aese		v1.16b,v12.16b
+	sha1p		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aese		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v14.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v1.16b,v15.16b
+	sha1h		s22,s24
+	add		v23.4s,v5.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v16.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v1.16b,v17.16b
+	sha1h		s21,s24
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	sha1p		q24,s22,v23.4s
+	add		v23.4s,v6.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+/* mode op 2 */
+	eor		v2.16b,v2.16b,v1.16b	/* mode of 2 xor w/prev value */
+
+/* aes xform 2, sha quad 2 */
+	aese		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	aesmc		v2.16b,v2.16b
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+
+	aese		v2.16b,v11.16b
+	add		v19.4s,v6.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aese		v2.16b,v12.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aese		v2.16b,v13.16b
+	sha1su1		v29.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	aese		v2.16b,v14.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v2.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v16.16b
+	add		v19.4s,v6.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	sha1su1		v26.4s,v29.4s
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+
+/* mode op 3 */
+	eor		v3.16b,v3.16b,v2.16b	/* xor w/prev value */
+
+/* aes xform 3, sha quad 3 */
+	aese		v3.16b,v8.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aese		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aese		v3.16b,v10.16b
+	sha1su1		v29.4s,v28.4s
+	aesmc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v26.4s
+	aese		v3.16b,v11.16b
+	sha1h		s22,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aese		v3.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aese		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesmc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v28.4s
+	aese		v3.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v29.4s
+	aese		v3.16b,v17.16b
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbnz		x7,.Lmain_loop		/* loop if more to do */
+
+
+/*
+ * epilog, process remaining aes blocks and b-2 sha block
+ * do this inline (no loop) to overlap with the sha part
+ * note there are 0-3 aes blocks left.
+ */
+	rev32		v26.16b,v0.16b		/* fix endian w0 */
+	rev32		v27.16b,v1.16b		/* fix endian w1 */
+	rev32		v28.16b,v2.16b		/* fix endian w2 */
+	rev32		v29.16b,v3.16b		/* fix endian w3 */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	cbz		x13, .Lbm2fromQ0	/* skip if none left */
+	/* local copy of aes_blocks_left */
+	subs		x14,x13,1
+
+/*
+ * mode op 0
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v0.16b},[x0],16
+	eor		v0.16b,v0.16b,v3.16b	/* xor w/ prev value */
+
+/* aes xform 0, sha quad 0 */
+	add		v19.4s,v4.4s,v26.4s
+	aese		v0.16b,v8.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aese		v0.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aese		v0.16b,v10.16b
+	sha1su1		v26.4s,v29.4s
+	add		v19.4s,v4.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	sha1h		s21,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aese		v0.16b,v12.16b
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v4.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	sha1h		s22,s24
+	aesmc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aese		v0.16b,v14.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	sha1h		s21,s24
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	sha1c		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ1
+/*
+ * mode op 1
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v1.16b},[x0],16
+
+	eor		v1.16b,v1.16b,v0.16b	/* xor w/ prev value */
+
+/* aes xform 1, sha quad 1 */
+	add		v23.4s,v5.4s,v27.4s
+	aese		v1.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aese		v1.16b,v9.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v10.16b
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v5.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesmc		v1.16b,v1.16b
+	subs		x14,x14,1		/* dec counter */
+	aese		v1.16b,v11.16b
+	sha1h		s22,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v12.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v5.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aese		v1.16b,v14.16b
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	sha1h		s22,s24
+	aesmc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aese		v1.16b,v16.16b
+	sha1su1		v26.4s,v29.4s
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	/* if aes_blocks_left_count == 0 */
+	beq		.Lbm2fromQ2
+
+/*
+ * mode op 2
+ * read next aes block, update aes_ptr_in
+ */
+	ld1		{v2.16b},[x0],16
+	eor		v2.16b,v2.16b,v1.16b	/* xor w/ prev value */
+
+/* aes xform 2, sha quad 2 */
+	add		v19.4s,v6.4s,v28.4s
+	aese		v2.16b,v8.16b
+	add		v23.4s,v6.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aese		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aese		v2.16b,v12.16b
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	sha1h		s22,s24
+	aesmc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aese		v2.16b,v14.16b
+	sha1su1		v26.4s,v29.4s
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	sha1h		s21,s24
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	sha1m		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	/* join common code at Quad 3 */
+	b		.Lbm2fromQ3
+
+/*
+ * now there is the b-2 sha block before the final one. Execution takes over
+ * in the appropriate part of this depending on how many aes blocks were left.
+ * If there were none, the whole thing is executed.
+ */
+.Lbm2fromQ0:
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+.Lbm2fromQ1:
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+.Lbm2fromQ2:
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+.Lbm2fromQ3:
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	eor		v26.16b,v26.16b,v26.16b		/* zero reg */
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	eor		v27.16b,v27.16b,v27.16b		/* zero reg */
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	eor		v28.16b,v28.16b,v28.16b		/* zero reg */
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+/*
+ * now we can do the final block, either all padding or 1-3 aes blocks
+ * len in x11, aes_blocks_left in x13. should move the aes data setup of this
+ * to the last aes bit.
+ */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	mov		w15,0x80		/* that's the 1 of the pad */
+	/* Add one SHA-1 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32		/* len_hi */
+	and		x9,x11,0xffffffff	/* len_lo */
+	mov		v26.b[0],w15		/* assume block 0 is dst */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x9,x9,3			/* len_lo in bits */
+	eor		v29.16b,v29.16b,v29.16b	/* zero reg */
+/*
+ * places the 0x80 in the correct block, copies the appropriate data
+ */
+	cbz		x13,.Lpad100		/* no data to get */
+	mov		v26.16b,v0.16b
+	sub		x14,x13,1		/* dec amount left */
+	mov		v27.b[0],w15		/* assume block 1 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v27.16b,v1.16b
+	sub		x14,x14,1		/* dec amount left */
+	mov		v28.b[0],w15		/* assume block 2 is dst */
+	cbz		x14,.Lpad100		/* branch if done */
+	mov		v28.16b,v2.16b
+	mov		v29.b[3],w15		/* block 3, doesn't get rev'd */
+/*
+ * get the len_hi,LenLo in bits according to
+ *     len_hi = (uint32_t)(((len>>32) & 0xffffffff)<<3); (x12)
+ *     len_lo = (uint32_t)((len & 0xffffffff)<<3); (x9)
+ * this is done before the if/else above
+ */
+.Lpad100:
+	mov		v29.s[3],w9		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+/*
+ * note that q29 is already built in the correct format, so no swap required
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+/*
+ * do last sha of pad block
+ */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+
+	mov		v20.16b,v24.16b	/* working ABCD <- ABCD */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80	/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+
+	mov		x11, #64+20	/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	/* move length to the end of the block */
+	mov		v29.s[3], w11
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11	/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v3.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	mov		w15,0x80			/* sha padding word */
+
+	lsl		x11,x10,4		/* len = aes_blocks*16 */
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+
+	mov		x9,x8				/* top of rcon */
+
+	ld1		{v4.16b},[x9],16		/* key0 */
+	ld1		{v5.16b},[x9],16		/* key1 */
+	ld1		{v6.16b},[x9],16		/* key2 */
+	ld1		{v7.16b},[x9],16		/* key3 */
+/*
+ * the idea in the short loop (at least 1) is to break out with the padding
+ * already in place excepting the final word.
+ */
+.Lshort_loop:
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	eor		v0.16b,v0.16b,v3.16b		/* xor w/ prev value */
+
+/* aes xform 0 */
+	aese		v0.16b,v8.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v9.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v10.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v11.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v12.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v13.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v14.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v15.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v16.16b
+	aesmc		v0.16b,v0.16b
+	aese		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	/* assume this was final block */
+	mov		v27.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	/* load res to sha 0, endian swap */
+	rev32		v26.16b,v0.16b
+	sub		x10,x10,1			/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop		/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x0],16
+	eor		v1.16b,v1.16b,v0.16b		/* xor w/ prev value */
+
+/* aes xform 1 */
+	aese		v1.16b,v8.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v9.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v10.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v11.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v12.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v13.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v14.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v15.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v16.16b
+	aesmc		v1.16b,v1.16b
+	aese		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	/* assume this was final block */
+	mov		v28.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	/* load res to sha 0, endian swap */
+	rev32		v27.16b,v1.16b
+	sub		x10,x10,1			/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop		/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x0],16
+	eor		v2.16b,v2.16b,v1.16b		/* xor w/ prev value */
+
+/* aes xform 2 */
+	aese		v2.16b,v8.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v9.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v10.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v11.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v12.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v13.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v14.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v15.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v16.16b
+	aesmc		v2.16b,v2.16b
+	aese		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	/* assume this was final block */
+	mov		v29.b[3],w15
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	/* load res to sha 0, endian swap */
+	rev32		v28.16b,v2.16b
+	sub		x10,x10,1			/* dec num_blocks */
+	cbz		x10,.Lpost_short_loop		/* break if no more */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x0],16
+	eor		v3.16b,v3.16b,v2.16b		/* xor w/prev value */
+
+/* aes xform 3 */
+	aese		v3.16b,v8.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v9.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v10.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v11.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v12.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v13.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v14.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v15.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v16.16b
+	aesmc		v3.16b,v3.16b
+	aese		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	/* load res to sha 0, endian swap */
+	rev32		v29.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * now we have the sha1 to do for these 4 aes blocks
+ */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	eor		v26.16b,v26.16b,v26.16b		/* zero sha src 0 */
+	eor		v27.16b,v27.16b,v27.16b		/* zero sha src 1 */
+	eor		v28.16b,v28.16b,v28.16b		/* zero sha src 2 */
+	eor		v29.16b,v29.16b,v29.16b		/* zero sha src 3 */
+	/* assume this was final block */
+	mov		v26.b[3],w15
+
+	sub		x10,x10,1		/* dec num_blocks */
+	cbnz		x10,.Lshort_loop	/* keep looping if more */
+/*
+ * there are between 0 and 3 aes blocks in the final sha1 blocks
+ */
+.Lpost_short_loop:
+	/* Add one SHA-2 block since hash is calculated including i_key_pad */
+	add	x11, x11, #64
+	lsr	x12,x11,32			/* len_hi */
+	and	x13,x11,0xffffffff		/* len_lo */
+	lsl	x12,x12,3			/* len_hi in bits */
+	lsl	x13,x13,3			/* len_lo in bits */
+
+	mov	v29.s[3],w13			/* len_lo */
+	mov	v29.s[2],w12			/* len_hi */
+
+	/* do final block */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80		/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+
+	mov		x11, #64+20	/* size of o_key_pad + inner hash */
+	lsl		x11, x11, 3
+	/* move length to the end of the block */
+	mov		v29.s[3], w11
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11	/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+	.size	aes128cbc_sha1_hmac, .-aes128cbc_sha1_hmac
diff --git a/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S b/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
new file mode 100644
index 0000000..a5a9e85
--- /dev/null
+++ b/drivers/crypto/armv8/asm/sha1_hmac_aes128cbc_dec.S
@@ -0,0 +1,1650 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "assym.s"
+
+/*
+ * Description:
+ *
+ * Combined Auth/Dec Primitive = sha1_hmac/aes128cbc
+ *
+ * Operations:
+ *
+ * out = decrypt-AES128CBC(in)
+ * return_ash_ptr = SHA1(o_key_pad | SHA1(i_key_pad | in))
+ *
+ * Prototype:
+ *
+ * void sha1_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst,
+ *			uint8_t *dsrc, uint8_t *ddst,
+ *			uint64_t len, crypto_arg_t *arg)
+ *
+ * Registers used:
+ *
+ * sha1_hmac_aes128cbc_dec(
+ *	csrc,			x0	(cipher src address)
+ *	cdst,			x1	(cipher dst address)
+ *	dsrc,			x2	(digest src address - ignored)
+ *	ddst,			x3	(digest dst address)
+ *	len,			x4	(length)
+ *	arg			x5	:
+ *		arg->cipher.key		(round keys)
+ *		arg->cipher.iv		(initialization vector)
+ *		arg->digest.hmac.i_key_pad	(partially hashed i_key_pad)
+ *		arg->digest.hmac.o_key_pad	(partially hashed o_key_pad)
+ *	)
+ *
+ * Routine register definitions:
+ *
+ * v0 - v3 -- aes results
+ * v4 - v7 -- round consts for sha
+ * v8 - v18 -- round keys
+ * v19 -- temp register for SHA1
+ * v20 -- ABCD copy (q20)
+ * v21 -- sha working state (q21)
+ * v22 -- sha working state (q22)
+ * v23 -- temp register for SHA1
+ * v24 -- sha state ABCD
+ * v25 -- sha state E
+ * v26 -- sha block 0
+ * v27 -- sha block 1
+ * v28 -- sha block 2
+ * v29 -- sha block 3
+ * v30 -- reserved
+ * v31 -- reserved
+ *
+ *
+ * Constraints:
+ *
+ * The variable "len" must be a multiple of 16,
+ * otherwise results are not defined. For AES partial blocks the user
+ * is required to pad the input to modulus 16 = 0.
+ *
+ * Short lengths are less optimized at < 16 AES blocks,
+ * however they are somewhat optimized, and more so than the enc/auth versions.
+ */
+	.file "sha1_hmac_aes128cbc_dec.S"
+	.text
+	.cpu generic+fp+simd+crypto+crc
+	.global sha1_hmac_aes128cbc_dec
+	.type	sha1_hmac_aes128cbc_dec,%function
+
+
+	.align	4
+.Lrcon:
+	.word		0x5a827999, 0x5a827999, 0x5a827999, 0x5a827999
+	.word		0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1, 0x6ed9eba1
+	.word		0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc, 0x8f1bbcdc
+	.word		0xca62c1d6, 0xca62c1d6, 0xca62c1d6, 0xca62c1d6
+
+sha1_hmac_aes128cbc_dec:
+/* fetch args */
+	ldr		x6, [x5, #HMAC_IKEYPAD]
+	/* init ABCD, E */
+	ld1		{v24.4s, v25.4s},[x6]
+	/* save pointer to o_key_pad partial hash */
+	ldr		x6, [x5, #HMAC_OKEYPAD]
+
+	ldr		x2, [x5, #CIPHER_KEY]
+	ldr		x5, [x5, #CIPHER_IV]
+/*
+ * init sha state, prefetch, check for small cases.
+ * Note that the output is prefetched as a load, for the in-place case
+ */
+	prfm		PLDL1KEEP,[x0,0]	/* pref next *in */
+	prfm		PLDL1KEEP,[x1,0]	/* pref next aes_ptr_out */
+	lsr		x10,x4,4		/* aes_blocks = len/16 */
+	cmp		x10,16			/* no main loop if <16 */
+	blt		.Lshort_cases		/* branch if < 12 */
+
+/* protect registers */
+	sub		sp,sp,8*16
+	mov		x11,x4			/* len -> x11 needed at end */
+	mov		x7,sp			/* copy for address mode */
+	ld1		{v30.16b},[x5]		/* get 1st ivec */
+	lsr		x12,x11,6		/* total_blocks (sha) */
+	mov		x4,x0			/* sha_ptr_in = *in */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	ld1		{v29.16b},[x4],16	/* next w3 */
+
+/*
+ * now we can do the loop prolog, 1st sha1 block
+ */
+	prfm		PLDL1KEEP,[x0,64]	/* pref next aes_ptr_in */
+	prfm		PLDL1KEEP,[x1,64]	/* pref next aes_ptr_out */
+	/* base address for sha round consts */
+	adr		x8,.Lrcon
+/*
+ * do the first sha1 block on the plaintext
+ */
+	mov		v20.16b,v24.16b		/* init working ABCD */
+	st1		{v8.16b},[x7],16
+	st1		{v9.16b},[x7],16
+	rev32		v26.16b,v26.16b		/* endian swap w0 */
+	st1		{v10.16b},[x7],16
+	rev32		v27.16b,v27.16b		/* endian swap w1 */
+	st1		{v11.16b},[x7],16
+	rev32		v28.16b,v28.16b		/* endian swap w2 */
+	st1		{v12.16b},[x7],16
+	rev32		v29.16b,v29.16b		/* endian swap w3 */
+	st1		{v13.16b},[x7],16
+	mov		x9,x8			/* top of rcon */
+	ld1		{v4.16b},[x9],16	/* key0 */
+	ld1		{v5.16b},[x9],16	/* key1 */
+	ld1		{v6.16b},[x9],16	/* key2 */
+	ld1		{v7.16b},[x9],16	/* key3 */
+	add		v19.4s,v4.4s,v26.4s
+	st1		{v14.16b},[x7],16
+	add		v23.4s,v4.4s,v27.4s
+	st1		{v15.16b},[x7],16
+/* quad 0 */
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v8.16b},[x2],16	/* rk[0] */
+	sha1c		q24,s25,v19.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v9.16b},[x2],16	/* rk[1] */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	ld1		{v10.16b},[x2],16	/* rk[2] */
+	sha1c		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v4.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v11.16b},[x2],16	/* rk[3] */
+	sha1c		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v12.16b},[x2],16	/* rk[4] */
+	sha1c		q24,s21,v19.4s
+	add		v19.4s,v5.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v13.16b},[x2],16	/* rk[5] */
+/* quad 1 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v14.16b},[x2],16	/* rk[6] */
+	sha1p		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+	add		v23.4s,v5.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v15.16b},[x2],16	/* rk[7] */
+	sha1p		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v16.16b},[x2],16	/* rk[8] */
+	sha1p		q24,s21,v19.4s
+	sha1su1		v26.4s,v29.4s
+	ld1		{v17.16b},[x2],16	/* rk[9] */
+	add		v19.4s,v6.4s,v28.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v18.16b},[x2],16	/* rk[10] */
+	sha1p		q24,s22,v23.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v23.4s,v6.4s,v29.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	add		v19.4s,v6.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+/* quad 3 */
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	sha1p		q24,s21,v19.4s
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	sha1p		q24,s21,v19.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	ld1		{v29.16b},[x4],16	/* next w3 */
+	sha1p		q24,s22,v23.4s
+
+/*
+ * aes_blocks_left := number after the main (sha) block is done.
+ * can be 0 note we account for the extra unwind in main_blocks
+ */
+	sub		x7,x12,2		/* main_blocks=total_blocks-5 */
+	add		v24.4s,v24.4s,v20.4s
+	and		x13,x10,3		/* aes_blocks_left */
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	add		v25.4s,v25.4s,v21.4s
+	add		x2,x0,128		/* lead_ptr = *in */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+/*
+ * main combined loop CBC, can be used by auth/enc version
+ */
+.Lmain_loop:
+/*
+ * Because both mov, rev32 and eor have a busy cycle,
+ * this takes longer than it looks.
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+/* aes xform 0, sha quad 0 */
+	aesd		v0.16b,v8.16b
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v0.16b,v10.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	add		v23.4s,v4.4s,v27.4s
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aesd		v0.16b,v12.16b
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aesd		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* get next aes block, with update */
+	ld1		{v30.16b},[x0],16
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	add		v23.4s,v5.4s,v27.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	aesd		v1.16b,v8.16b
+	sha1h		s21,s24
+	add		v19.4s,v5.4s,v28.4s
+	sha1p		q24,s22,v23.4s
+	aesimc		v1.16b,v1.16b
+	sha1su1		v27.4s,v26.4s
+	aesd		v1.16b,v9.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v10.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v28.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1h		s21,s24
+	aesd		v1.16b,v12.16b
+	sha1p		q24,s22,v23.4s
+	sha1su1		v29.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	sha1h		s22,s24
+	add		v19.4s,v5.4s,v26.4s
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v14.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v23.4s,v5.4s,v27.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v16.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	add		v23.4s,v6.4s,v29.4s
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+/* aes xform 2, sha quad 2 */
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v2.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v9.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v2.16b,v10.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v11.16b
+	sha1su1		v29.4s,v28.4s
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v13.16b
+	sha1su1		v26.4s,v29.4s
+	add		v23.4s,v6.4s,v27.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v15.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	sha1h		s22,s24
+	aesd		v2.16b,v16.16b
+	sha1m		q24,s21,v19.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v2.16b,v17.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v7.4s,v29.4s
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	add		v19.4s,v7.4s,v26.4s
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+/* aes xform 3, sha quad 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v10.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v3.16b,v3.16b
+	sha1su1		v29.4s,v28.4s
+	aesd		v3.16b,v11.16b
+	sha1h		s22,s24
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aesd		v3.16b,v13.16b
+	sha1h		s21,s24
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v14.16b
+	sub		x7,x7,1			/* dec block count */
+	aesimc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v28.4s
+	aesd		v3.16b,v15.16b
+	ld1		{v0.16b},[x0]		/* next aes block, no update */
+	sha1h		s22,s24
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v29.4s
+	aesd		v3.16b,v17.16b
+	sha1h		s21,s24
+	ld1		{v29.16b},[x4],16	/* next w3 */
+	sha1p		q24,s22,v23.4s
+	add		v24.4s,v24.4s,v20.4s
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ prev value */
+	/* next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	add		v25.4s,v25.4s,v21.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	/* loop if more to do */
+	cbnz		x7,.Lmain_loop
+/*
+ * now the loop epilog. Since the reads for sha have already been done
+ * in advance, we have to have an extra unwind.
+ * This is why the test for the short cases is 16 and not 12.
+ *
+ * the unwind, which is just the main loop without the tests or final reads.
+ */
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	prfm		PLDL1KEEP,[x2,64]	/* pref next lead_ptr */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	/* pref next aes_ptr_out, streaming */
+	prfm		PLDL1KEEP,[x1,64]
+/* aes xform 0, sha quad 0 */
+	aesd		v0.16b,v8.16b
+	add		v19.4s,v4.4s,v26.4s
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+	aesimc		v0.16b,v0.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesd		v0.16b,v9.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s25,v19.4s
+	aesd		v0.16b,v11.16b
+	rev32		v29.16b,v29.16b		/* fix endian w3 */
+	aesimc		v0.16b,v0.16b
+	sha1su1		v26.4s,v29.4s
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	add		v19.4s,v4.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	add		v23.4s,v4.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s21,v19.4s
+	aesd		v0.16b,v16.16b
+	sha1su1		v28.4s,v27.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1c		q24,s22,v23.4s
+	add		v19.4s,v4.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	eor		v0.16b,v0.16b,v18.16b	/* final res 0 */
+	add		v23.4s,v5.4s,v27.4s
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su1		v26.4s,v29.4s
+/* aes xform 1, sha quad 1 */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v1.16b,v8.16b
+	sha1h		s21,s24
+	add		v19.4s,v5.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	add		v23.4s,v5.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	aesd		v1.16b,v10.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	aesimc		v1.16b,v1.16b
+	sha1h		s22,s24
+	aesd		v1.16b,v11.16b
+	sha1p		q24,s21,v19.4s
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v1.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v1.16b,v14.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su1		v29.4s,v28.4s
+	aesimc		v1.16b,v1.16b
+	add		x2,x2,64		/* bump lead_ptr */
+	aesd		v1.16b,v15.16b
+	add		v23.4s,v5.4s,v27.4s
+	aesimc		v1.16b,v1.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v1.16b,v16.16b
+	sha1h		s22,s24
+	aesimc		v1.16b,v1.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v1.16b,v17.16b
+	add		v19.4s,v6.4s,v28.4s
+	eor		v1.16b,v1.16b,v18.16b	/* res xf 1 */
+	sha1su1		v26.4s,v29.4s
+	eor		v1.16b,v1.16b,v31.16b	/* mode op 1 xor w/prev value */
+	sha1su0		v27.4s,v28.4s,v29.4s
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	add		v23.4s,v6.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* mode op 2 */
+/* aes xform 2, sha quad 2 */
+	aesd		v2.16b,v8.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v10.16b
+	sha1su1		v28.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	add		v19.4s,v6.4s,v26.4s
+	aesd		v2.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	sha1h		s21,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s22,v23.4s
+	aesd		v2.16b,v13.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesd		v2.16b,v14.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesimc		v2.16b,v2.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v2.16b,v15.16b
+	sha1h		s22,s24
+	aesimc		v2.16b,v2.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v2.16b,v16.16b
+	add		v19.4s,v6.4s,v28.4s
+	aesimc		v2.16b,v2.16b
+	sha1su1		v26.4s,v29.4s
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b	/* mode of 2 xor w/prev value */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	add		v23.4s,v7.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su1		v28.4s,v27.4s
+/* mode op 3 */
+/* aes xform 3, sha quad 3 */
+	aesd		v3.16b,v8.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v3.16b,v3.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+	aesd		v3.16b,v9.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v10.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v3.16b,v3.16b
+	add		v19.4s,v7.4s,v26.4s
+	aesd		v3.16b,v11.16b
+	sha1h		s22,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v12.16b
+	/* read first aes block, no bump */
+	ld1		{v0.16b},[x0]
+	aesimc		v3.16b,v3.16b
+	add		v23.4s,v7.4s,v27.4s
+	aesd		v3.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	add		v19.4s,v7.4s,v28.4s
+	aesd		v3.16b,v14.16b
+	sha1h		s22,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v3.16b,v15.16b
+	add		v23.4s,v7.4s,v29.4s
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	sha1h		s21,s24
+	aesimc		v3.16b,v3.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* aes res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ prev value */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+/*
+ * now we have to do the 4 aes blocks (b-2) that catch up to where sha is
+ */
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	/* read next aes block, no update */
+	ld1		{v1.16b},[x0]
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v2.16b},[x0]
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b	/* res 1 */
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v31.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	/* read next aes block, no update */
+	ld1		{v3.16b},[x0]
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b	/* res 2 */
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/ ivec (modeop) */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v30.16b},[x0],16
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b	/* res 3 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/*
+ * Now, there is the final b-1 sha1 padded block.
+ * This contains between 0-3 aes blocks. We take some pains to avoid read spill
+ * by only reading the blocks that are actually defined.
+ * this is also the final sha block code for the short_cases.
+ */
+.Ljoin_common:
+	mov		w15,0x80	/* that's the 1 of the pad */
+	cbnz		x13,.Lpad100	/* branch if there is some real data */
+	eor		v26.16b,v26.16b,v26.16b		/* zero the rest */
+	eor		v27.16b,v27.16b,v27.16b		/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b		/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b		/* zero the rest */
+	mov		v26.b[0],w15			/* all data is bogus */
+	b		.Lpad_done			/* go do rest */
+
+.Lpad100:
+	sub		x14,x13,1		/* dec amount left */
+	ld1		{v26.16b},[x4],16	/* next w0 */
+	cbnz		x14,.Lpad200	/* branch if there is some real data */
+	eor		v27.16b,v27.16b,v27.16b	/* zero the rest */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v27.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad200:
+	sub		x14,x14,1		/* dec amount left */
+	ld1		{v27.16b},[x4],16	/* next w1 */
+	cbnz		x14,.Lpad300	/* branch if there is some real data */
+	eor		v28.16b,v28.16b,v28.16b	/* zero the rest */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v28.b[0],w15		/* all data is bogus */
+	b		.Lpad_done		/* go do rest */
+
+.Lpad300:
+	ld1		{v28.16b},[x4],16	/* next w2 */
+	eor		v29.16b,v29.16b,v29.16b	/* zero the rest */
+	mov		v29.b[3],w15		/* all data is bogus */
+
+.Lpad_done:
+	/* Add one SHA-1 block since hash is calculated including i_key_pad */
+	add		x11, x11, #64
+	lsr		x12,x11,32		/* len_hi */
+	and		x14,x11,0xffffffff	/* len_lo */
+	lsl		x12,x12,3		/* len_hi in bits */
+	lsl		x14,x14,3		/* len_lo in bits */
+
+	mov		v29.s[3],w14		/* len_lo */
+	mov		v29.s[2],w12		/* len_hi */
+
+	rev32		v26.16b,v26.16b		/* fix endian w0 */
+	rev32		v27.16b,v27.16b		/* fix endian w1 */
+	rev32		v28.16b,v28.16b		/* fix endian w2 */
+
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+/*
+ * final sha block
+ * the strategy is to combine the 0-3 aes blocks, which is faster but
+ * a little gourmand on code space.
+ */
+	cbz		x13,.Lzero_aes_blocks_left	/* none to do */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v31.16b},[x0],16
+
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	add		v19.4s,v4.4s,v26.4s
+	aesd		v0.16b,v10.16b
+	add		v23.4s,v4.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v12.16b
+	sha1c		q24,s25,v19.4s
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesd		v0.16b,v13.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v14.16b
+	sha1su1		v27.4s,v26.4s
+	add		v19.4s,v4.4s,v28.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v15.16b
+	sha1c		q24,s21,v19.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v4.4s,v29.4s
+	aesd		v0.16b,v16.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1c		q24,s22,v23.4s
+	aesd		v0.16b,v17.16b
+	sha1su1		v29.4s,v28.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+	/* dec counter */
+	sub		x13,x13,1
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad1
+
+/* aes xform 1 */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0]
+	ld1		{v30.16b},[x0],16
+	add		v23.4s,v5.4s,v27.4s
+	aesd		v0.16b,v8.16b
+	add		v19.4s,v5.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v0.16b,v11.16b
+	sha1su1		v27.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesd		v0.16b,v12.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v0.16b,v13.16b
+	sha1su1		v28.4s,v27.4s
+	add		v23.4s,v5.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesd		v0.16b,v14.16b
+	sha1h		s21,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s22,v23.4s
+	aesd		v0.16b,v15.16b
+	sha1su1		v29.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	add		v19.4s,v5.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesd		v0.16b,v16.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1p		q24,s21,v19.4s
+	aesd		v0.16b,v17.16b
+	sha1su1		v26.4s,v29.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ ivec (modeop) */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	sub		x13,x13,1		/* dec counter */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	cbz		x13,.Lfrmquad2
+
+/* aes xform 2 */
+	/* read first aes block, bump aes_ptr_in */
+	ld1		{v0.16b},[x0],16
+	add		v19.4s,v6.4s,v28.4s
+	aesd		v0.16b,v8.16b
+	add		v23.4s,v6.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	sha1su0		v28.4s,v29.4s,v26.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s22,s24
+	aesd		v0.16b,v10.16b
+	sha1m		q24,s21,v19.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v28.4s,v27.4s
+	aesd		v0.16b,v11.16b
+	sha1su0		v29.4s,v26.4s,v27.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v12.16b
+	sha1m		q24,s22,v23.4s
+	aesimc		v0.16b,v0.16b
+	sha1su1		v29.4s,v28.4s
+	aesd		v0.16b,v13.16b
+	add		v19.4s,v6.4s,v26.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	sha1h		s22,s24
+	aesimc		v0.16b,v0.16b
+	sha1m		q24,s21,v19.4s
+	aesd		v0.16b,v15.16b
+	sha1su1		v26.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	add		v23.4s,v6.4s,v27.4s
+	aesd		v0.16b,v16.16b
+	sha1su0		v27.4s,v28.4s,v29.4s
+	aesimc		v0.16b,v0.16b
+	sha1h		s21,s24
+	aesd		v0.16b,v17.16b
+	sha1m		q24,s22,v23.4s
+	eor		v3.16b,v0.16b,v18.16b	/* res 0 */
+	sha1su1		v27.4s,v26.4s
+	eor		v3.16b,v3.16b,v30.16b	/* xor w/ ivec (modeop) */
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+	b		.Lfrmquad3
+/*
+ * the final block with no aes component, i.e from here there were zero blocks
+ */
+
+.Lzero_aes_blocks_left:
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+/* quad 1 */
+.Lfrmquad1:
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+/* quad 2 */
+.Lfrmquad2:
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+/* quad 3 */
+.Lfrmquad3:
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v26.4s,v24.4s,v20.4s
+	add		v27.4s,v25.4s,v21.4s
+
+	/* Calculate final HMAC */
+	eor		v28.16b, v28.16b, v28.16b
+	eor		v29.16b, v29.16b, v29.16b
+	/* load o_key_pad partial hash */
+	ld1		{v24.16b,v25.16b}, [x6]
+	/* working ABCD <- ABCD */
+	mov		v20.16b,v24.16b
+
+	/* Set padding 1 to the first reg */
+	mov		w11, #0x80		/* that's the 1 of the pad */
+	mov		v27.b[7], w11
+	/* size of o_key_pad + inner hash */
+	mov		x11, #64+20
+	/* move length to the end of the block */
+	lsl		x11, x11, 3
+	mov		v29.s[3], w11
+	lsr		x11, x11, 32
+	mov		v29.s[2], w11		/* and the higher part */
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	rev32		v24.16b, v24.16b
+	rev32		v25.16b, v25.16b
+
+	st1		{v24.16b}, [x3],16
+	st1		{v25.s}[0], [x3]
+
+	mov		x9,sp
+	add		sp,sp,8*16
+	ld1		{v8.16b - v11.16b},[x9],4*16
+	ld1		{v12.16b - v15.16b},[x9]
+
+	ret
+
+/*
+ * These are the short cases (less efficient), here used for 1-11 aes blocks.
+ * x10 = aes_blocks
+ */
+.Lshort_cases:
+	sub		sp,sp,8*16
+	mov		x9,sp			/* copy for address mode */
+	st1		{v8.16b - v11.16b},[x9],4*16
+	st1		{v12.16b - v15.16b},[x9]
+
+	ld1		{v30.16b},[x5]			/* get ivec */
+	ld1		{v8.16b-v11.16b},[x2],64	/* rk[0-3] */
+	ld1		{v12.16b-v15.16b},[x2],64	/* rk[4-7] */
+	ld1		{v16.16b-v18.16b},[x2]		/* rk[8-10] */
+	adr		x8,.Lrcon			/* rcon */
+	lsl		x11,x10,4		/* len = aes_blocks*16 */
+	mov		x4,x0				/* sha_ptr_in = in */
+
+	mov		x9,x8				/* top of rcon */
+
+	ld1		{v4.16b},[x9],16		/* key0 */
+	ld1		{v5.16b},[x9],16		/* key1 */
+	ld1		{v6.16b},[x9],16		/* key2 */
+	ld1		{v7.16b},[x9],16		/* key3 */
+
+/*
+ * This loop does 4 at a time, so that at the end there is a final sha block
+ * and 0-3 aes blocks. Note that everything is done serially
+ * to avoid complication.
+ */
+.Lshort_loop:
+	cmp		x10,4			/* check if 4 or more */
+	/* if less, bail to last block */
+	blt		.Llast_sha_block
+
+	ld1		{v31.16b},[x4]		/* next w no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v0.16b},[x4],16
+	rev32		v26.16b,v0.16b		/* endian swap for sha */
+	add		x0,x0,64
+
+/* aes xform 0 */
+	aesd		v0.16b,v8.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v9.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v10.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v11.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v12.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v13.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v14.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v15.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v16.16b
+	aesimc		v0.16b,v0.16b
+	aesd		v0.16b,v17.16b
+	eor		v0.16b,v0.16b,v18.16b
+	eor		v0.16b,v0.16b,v30.16b	/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v1.16b},[x4],16
+	rev32		v27.16b,v1.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v0.16b},[x1],16
+
+/* aes xform 1 */
+	aesd		v1.16b,v8.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v9.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v10.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v11.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v12.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v13.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v14.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v15.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v16.16b
+	aesimc		v1.16b,v1.16b
+	aesd		v1.16b,v17.16b
+	eor		v1.16b,v1.16b,v18.16b
+	eor		v1.16b,v1.16b,v31.16b	/* xor w/ prev value */
+
+	ld1		{v31.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v2.16b},[x4],16
+	rev32		v28.16b,v2.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v1.16b},[x1],16
+
+/* aes xform 2 */
+	aesd		v2.16b,v8.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v9.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v10.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v11.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v12.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v13.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v14.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v15.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v16.16b
+	aesimc		v2.16b,v2.16b
+	aesd		v2.16b,v17.16b
+	eor		v2.16b,v2.16b,v18.16b
+	eor		v2.16b,v2.16b,v30.16b	/* xor w/ prev value */
+
+	ld1		{v30.16b},[x4]		/* read no update */
+	/* read next aes block, update aes_ptr_in */
+	ld1		{v3.16b},[x4],16
+	rev32		v29.16b,v3.16b		/* endian swap for sha */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v2.16b},[x1],16
+
+/* aes xform 3 */
+	aesd		v3.16b,v8.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v9.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v10.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v11.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v12.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v13.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v14.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v15.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v16.16b
+	aesimc		v3.16b,v3.16b
+	aesd		v3.16b,v17.16b
+	eor		v3.16b,v3.16b,v18.16b
+	eor		v3.16b,v3.16b,v31.16b	/* xor w/ prev value */
+/*
+ * now we have the sha1 to do for these 4 aes blocks. Note that.
+ */
+
+	mov		v20.16b,v24.16b		/* working ABCD <- ABCD */
+	/* save aes res, bump aes_out_ptr */
+	st1		{v3.16b},[x1],16
+/* quad 0 */
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s25,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v4.4s,v27.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v4.4s,v28.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v4.4s,v29.4s
+	sha1h		s21,s24
+	sha1c		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v4.4s,v26.4s
+	sha1h		s22,s24
+	sha1c		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+/* quad 1 */
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v5.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v5.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v5.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v5.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+/* quad 2 */
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v6.4s,v29.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+
+	add		v19.4s,v6.4s,v26.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v26.4s,v27.4s,v28.4s
+	sha1su1		v26.4s,v29.4s
+
+	add		v23.4s,v6.4s,v27.4s
+	sha1h		s21,s24
+	sha1m		q24,s22,v23.4s
+	sha1su0		v27.4s,v28.4s,v29.4s
+	sha1su1		v27.4s,v26.4s
+
+	add		v19.4s,v6.4s,v28.4s
+	sha1h		s22,s24
+	sha1m		q24,s21,v19.4s
+	sha1su0		v28.4s,v29.4s,v26.4s
+	sha1su1		v28.4s,v27.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+	sha1su0		v29.4s,v26.4s,v27.4s
+	sha1su1		v29.4s,v28.4s
+/* quad 3 */
+	add		v19.4s,v7.4s,v26.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v27.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v19.4s,v7.4s,v28.4s
+	sha1h		s22,s24
+	sha1p		q24,s21,v19.4s
+
+	add		v23.4s,v7.4s,v29.4s
+	sha1h		s21,s24
+	sha1p		q24,s22,v23.4s
+
+	add		v25.4s,v25.4s,v21.4s
+	add		v24.4s,v24.4s,v20.4s
+
+	sub		x10,x10,4		/* 4 less */
+	b		.Lshort_loop		/* keep looping */
+/*
+ * this is arranged so that we can join the common unwind code
+ * that does the last sha block and the final 0-3 aes blocks
+ */
+.Llast_sha_block:
+	mov		x13,x10			/* copy aes blocks for common */
+	b		.Ljoin_common		/* join common code */
+
+	.size	sha1_hmac_aes128cbc_dec, .-sha1_hmac_aes128cbc_dec
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (4 preceding siblings ...)
  2016-12-07  2:32   ` [PATCH v2 05/12] crypto/armv8: Add AES+SHA1 " zbigniew.bodek
@ 2016-12-07  2:32   ` zbigniew.bodek
  2016-12-21 14:55     ` De Lara Guarch, Pablo
  2016-12-07  2:33   ` [PATCH v2 07/12] crypto/armv8: generate ASM symbols automatically zbigniew.bodek
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:32 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver
using ARMv8 cryptographic extensions.
CPU compatibility with this driver is detected in
run-time and virtual crypto device will not be
created if CPU doesn't provide:
AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost
for chained crypto operations processing,
such as encryption + HMAC generation,
decryption + HMAC validation. In particular,
cipher only or hash only operations are
not provided.

The driver currently supports AES-128-CBC
in combination with:
SHA256 MAC, SHA256 HMAC and SHA1 HMAC and relies
on the low-level assembly code.

This patch adds driver's code only and does
not include it in the build system.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile                     |  72 ++
 drivers/crypto/armv8/asm/include/rte_armv8_defs.h |  80 ++
 drivers/crypto/armv8/rte_armv8_pmd.c              | 915 ++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c          | 390 +++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h      | 210 +++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map    |   3 +
 6 files changed, 1670 insertions(+)
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/asm/include/rte_armv8_defs.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..2d053a4
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,72 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2016.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(SRCDIR)/asm/include
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+VPATH += $(SRCDIR)/asm
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+# library asm files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha1_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_core.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha1_hmac.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha256.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += aes128cbc_sha256_hmac.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha1_hmac_aes128cbc_dec.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_aes128cbc_dec.S
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += sha256_hmac_aes128cbc_dec.S
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/asm/include/rte_armv8_defs.h b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
new file mode 100644
index 0000000..ea05495
--- /dev/null
+++ b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
@@ -0,0 +1,80 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_DEFS_H_
+#define _RTE_ARMV8_DEFS_H_
+
+struct crypto_arg {
+	struct {
+		uint8_t		*key;
+		uint8_t		*iv;
+	} cipher;
+	struct {
+		struct {
+			uint8_t	*key;
+			uint8_t *i_key_pad;
+			uint8_t *o_key_pad;
+		} hmac;
+	} digest;
+};
+
+typedef struct crypto_arg crypto_arg_t;
+
+void aes128_key_sched_enc(uint8_t *expanded_key, const uint8_t *user_key);
+void aes128_key_sched_dec(uint8_t *expanded_key, const uint8_t *user_key);
+
+void aes128cbc_sha1_hmac(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void aes128cbc_sha256(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void aes128cbc_sha256_hmac(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void aes128cbc_dec_sha256(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void sha1_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void sha256_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void sha256_hmac_aes128cbc_dec(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+void sha256_aes128cbc(uint8_t *csrc, uint8_t *cdst, uint8_t *dsrc,
+			uint8_t *ddst, uint64_t len, crypto_arg_t *arg);
+
+int sha1_block_partial(uint8_t *init, const uint8_t *src, uint8_t *dst,
+			uint64_t len);
+int sha1_block(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+
+int sha256_block_partial(uint8_t *init, const uint8_t *src, uint8_t *dst,
+			uint64_t len);
+int sha256_block(uint8_t *init, const uint8_t *src, uint8_t *dst, uint64_t len);
+
+#endif /* _RTE_ARMV8_DEFS_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..0410bb0
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,915 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "rte_armv8_defs.h"
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be decribed by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t
+crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+/* Evaluate to key length definition */
+#define	KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/* Local aliases for supported ciphers */
+#define	CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
+/* Local aliases for supported hashes */
+#define	AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
+#define	AUTH_SHA256		RTE_CRYPTO_AUTH_SHA256
+#define	AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
+	[CIPH_AES_CBC][AUTH_SHA256][KEYL(128)] = aes128cbc_sha256,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256][KEYL(128)] = sha256_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define	CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t
+crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define	CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * Global static parameter used to create a unique name for each
+ * ARMV8 crypto device.
+ */
+static unsigned int unique_name_id;
+
+static inline int
+create_unique_device_name(char *name, size_t size)
+{
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+			unique_name_id++);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA256:
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
+		break;
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 192);
+		break;
+	case 256:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 256);
+		break;
+	default:
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (sess == NULL)
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	uint8_t *src, *dst;
+	uint8_t *adst, *asrc;
+	uint64_t srclen;
+
+	srclen = op->sym->cipher.data.length;
+	ARMV8_CRYPTO_ASSERT(
+		op->sym->cipher.data.length == op->sym->auth.data.length);
+
+	src = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	dst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		asrc = dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		asrc = src;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(mbuf_dst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(mbuf_src,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	crypto_func(src, dst, asrc, adst, srclen, &arg);
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+	}
+}
+
+/** Process crypto operation for mbuf */
+static int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+	int retval;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (op->status != RTE_CRYPTO_OP_STATUS_ERROR)
+		retval = rte_ring_enqueue(qp->processed_ops, (void *)op);
+	else
+		retval = -1;
+
+	return retval;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	qp->stats.enqueued_count += i;
+	return i;
+
+enqueue_err:
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return i;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(const char *name,
+		struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
+	struct armv8_crypto_private *internals;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* create a unique device name */
+	if (create_unique_device_name(crypto_dev_name,
+			RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
+		return -EINVAL;
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed", name);
+
+	cryptodev_armv8_crypto_uninit(crypto_dev_name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id()
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(name, &init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..0f768f4
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,390 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "rte_armv8_defs.h"
+#include "rte_armv8_pmd_private.h"
+
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA256 */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256,
+					.block_size = 64,
+					.key_size = {
+						.min = 0,
+						.max = 0,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 32,
+						.increment = 8
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..fc1dae4
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,210 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define	ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define	ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define	NBBY		8		/* Number of bits in a byte */
+#define	BYTE_LENGTH(x)	((x) / 8)	/* Number of bytes in x (roun down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define	CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define	CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define	CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define	CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define	CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define	HMAC_IPAD_VALUE			(0x36)
+#define	HMAC_OPAD_VALUE			(0x5C)
+
+#define	SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define	SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define	SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define	SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define	SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define	SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef void (*crypto_func_t)(uint8_t *, uint8_t *, uint8_t *, uint8_t *,
+				uint64_t, crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 07/12] crypto/armv8: generate ASM symbols automatically
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (5 preceding siblings ...)
  2016-12-07  2:32   ` [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2016-12-07  2:33   ` zbigniew.bodek
  2016-12-07  2:33   ` [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

In order to acquire crypto_arg structure fields
from the assembly code it is necessary to generate
macros that will define offsets to those fields
during app build. This will allow for free
crypto_arg structure modifications in the future
without the necessity to make similar changes
for the assembly code.

Introduce genassym.c file that will be used
to generate phony assembly code containing pattern
followed by field offset value. Use awk to create
"define NAME value" sequence in the newly created
assym.s file.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile   | 12 +++++++++
 drivers/crypto/armv8/genassym.c | 55 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 drivers/crypto/armv8/genassym.c

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
index 2d053a4..8fdd374 100644
--- a/drivers/crypto/armv8/Makefile
+++ b/drivers/crypto/armv8/Makefile
@@ -69,4 +69,16 @@ DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
 DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
 DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
 
+# runtime generated assembly symbols
+all: clean assym.s
+
+assym.s: genassym.c
+	@$(CC) $(CFLAGS) -O0 -S $< -o - | \
+		awk '($$1 == "<genassym>") { print "#define " $$2 "\t" $$3 }' > \
+		$(SRCDIR)/asm/$@
+
+.PHONY:	clean
+clean:
+	@rm -f $(SRCDIR)/asm/assym.s
+
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/genassym.c b/drivers/crypto/armv8/genassym.c
new file mode 100644
index 0000000..44604ce
--- /dev/null
+++ b/drivers/crypto/armv8/genassym.c
@@ -0,0 +1,55 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_common.h>
+
+#include "rte_armv8_defs.h"
+
+#define	ASSYM(name, offset)						\
+do {									\
+	asm volatile("----------\n");					\
+	/* Place pattern, name + value in the assembly code */		\
+	asm volatile("\n<genassym> " #name " %0\n" :: "i" (offset));	\
+} while (0)
+
+
+static void __rte_unused
+generate_as_symbols(void)
+{
+
+	ASSYM(CIPHER_KEY, offsetof(struct crypto_arg, cipher.key));
+	ASSYM(CIPHER_IV, offsetof(struct crypto_arg, cipher.iv));
+
+	ASSYM(HMAC_KEY, offsetof(struct crypto_arg, digest.hmac.key));
+	ASSYM(HMAC_IKEYPAD, offsetof(struct crypto_arg, digest.hmac.i_key_pad));
+	ASSYM(HMAC_OKEYPAD, offsetof(struct crypto_arg, digest.hmac.o_key_pad));
+}
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (6 preceding siblings ...)
  2016-12-07  2:33   ` [PATCH v2 07/12] crypto/armv8: generate ASM symbols automatically zbigniew.bodek
@ 2016-12-07  2:33   ` zbigniew.bodek
  2016-12-21 15:01     ` De Lara Guarch, Pablo
  2016-12-07  2:33   ` [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD zbigniew.bodek
                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Build ARMv8 crypto PMD if compiling for ARM64
and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
is enable in the configuration file.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 drivers/crypto/Makefile | 3 +++
 mk/rte.app.mk           | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..a5de944 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
+endif
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..a1d332d 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -145,6 +145,9 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+endif
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (7 preceding siblings ...)
  2016-12-07  2:33   ` [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
@ 2016-12-07  2:33   ` zbigniew.bodek
  2016-12-07 21:13     ` Mcnamara, John
  2016-12-07  2:33   ` [PATCH v2 10/12] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
  2016-12-08 10:24   ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 Bruce Richardson
  10 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add documentation about the driver and update
release notes.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 doc/guides/cryptodevs/armv8.rst        | 82 ++++++++++++++++++++++++++++++++++
 doc/guides/cryptodevs/index.rst        |  1 +
 doc/guides/rel_notes/release_17_02.rst |  5 +++
 3 files changed, 88 insertions(+)
 create mode 100644 doc/guides/cryptodevs/armv8.rst

diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..67d8bc3
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,82 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2016.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+================================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level assembly code specific to all supported cipher and hash
+combinations.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+* ``RTE_CRYPTO_AUTH_SHA1``
+* ``RTE_CRYPTO_AUTH_SHA256``
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+To compile ARMv8 Crypto PMD, it has to be enabled in the config/common_base
+file. No additional packages need to be installed.
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Input data has to be a multiple of 16 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 3b65038..c6c92b0 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -38,6 +38,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 10/12] crypto/armv8: enable ARMv8 PMD in the configuration
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (8 preceding siblings ...)
  2016-12-07  2:33   ` [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD zbigniew.bodek
@ 2016-12-07  2:33   ` zbigniew.bodek
  2016-12-08 10:24   ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 Bruce Richardson
  10 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:33 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option to
the common configuration file and enable it by
default for ARM64.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 config/common_base                         | 6 ++++++
 config/defconfig_arm64-armv8a-linuxapp-gcc | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/config/common_base b/config/common_base
index 4bff83a..b410a3b 100644
--- a/config/common_base
+++ b/config/common_base
@@ -406,6 +406,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc
index 6321884..a99ceb9 100644
--- a/config/defconfig_arm64-armv8a-linuxapp-gcc
+++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
@@ -47,3 +47,5 @@ CONFIG_RTE_EAL_IGB_UIO=n
 CONFIG_RTE_LIBRTE_FM10K_PMD=n
 
 CONFIG_RTE_SCHED_VECTOR=n
+
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 11/12] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                   ` (3 preceding siblings ...)
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2016-12-07  2:36 ` zbigniew.bodek
  2016-12-07  2:37 ` [PATCH v2 12/12] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  5 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:36 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 26d9590..ef1f25b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -445,6 +445,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v2 12/12] app/test: add ARMv8 crypto tests and test vectors
  2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                   ` (4 preceding siblings ...)
  2016-12-07  2:36 ` [PATCH v2 11/12] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
@ 2016-12-07  2:37 ` zbigniew.bodek
  5 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2016-12-07  2:37 UTC (permalink / raw)
  To: pablo.de.lara.guarch, jerin.jacob; +Cc: dev, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.
Add test vectors for cipher + SHA256 MAC generation.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  63 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 211 +++++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 508 +++++++++++++++++++++++++++++
 5 files changed, 779 insertions(+), 8 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 872f8b4..a0540d6 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,27 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1545,6 +1566,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -6504,6 +6541,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -6567,6 +6621,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -6575,3 +6637,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index 1c68f93..470c2d9 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,136 @@
 	}
 };
 
+/** AES-128-CBC SHA256 MAC test vector */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 512
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 512
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256,
+	.digest = {
+		.data = {
+			0xA8, 0xBC, 0xDB, 0x99, 0xAA, 0x45, 0x91, 0xA3,
+			0x2D, 0x75, 0x41, 0x92, 0x28, 0x01, 0x87, 0x5D,
+			0x45, 0xED, 0x49, 0x05, 0xD3, 0xAE, 0x32, 0x57,
+			0xB7, 0x79, 0x65, 0xFC, 0xFA, 0x6C, 0xFA, 0xDF
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_14 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -878,37 +1008,69 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_14,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_14,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -954,7 +1116,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -963,7 +1126,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1006,7 +1170,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1015,7 +1180,37 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Encryption Digest",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Decryption Digest "
+			"Verify",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Encryption Digest "
+			"Sessionless",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC MAC-SHA256 Decryption Digest "
+			"Verify Sessionless",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index 37b10cf..6963241 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -82,6 +82,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -508,6 +509,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index 04ff1ee..bd362c7 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -49,6 +49,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 59a6891..271f026 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,27 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2422,6 +2449,136 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			start_cycles = rte_rdtsc_precise();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				((num_to_submit - num_sent) < burst_size) ?
+				num_to_submit - num_sent : burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			/* Wait until requests have been sent. */
+			rte_delay_ms(1);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc_precise();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2683,6 +2840,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3356,6 +3563,138 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc_precise();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))],
+					sess, pparams->buf_size, digest_length);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc_precise();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3664,6 +4003,153 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4214,6 +4700,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4270,6 +4769,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4279,3 +4786,4 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest, perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-06 23:24               ` Jerin Jacob
@ 2016-12-07 15:00                 ` Thomas Monjalon
  2016-12-07 16:30                   ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-07 15:00 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

2016-12-07 04:54, Jerin Jacob:
> On Tue, Dec 06, 2016 at 02:41:01PM -0800, Thomas Monjalon wrote:
> > 2016-12-07 03:35, Jerin Jacob:
> > > On Tue, Dec 06, 2016 at 10:42:51PM +0100, Thomas Monjalon wrote:
> > > > 2016-12-07 02:48, Jerin Jacob:
> > > > > On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > > > > > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > > > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > > > > 
> > > > > > > This patch adds core low-level crypto operations
> > > > > > > for ARMv8 processors. The assembly code is a base
> > > > > > > for an optimized PMD and is currently excluded
> > > > > > > from the build.
> > > > > > 
> > > > > > It's a bit sad that you cannot achieve the same performance with
> > > > > > C code and a good compiler.
> > > > > > Have you tried it? How much is the difference?
> > > > > 
> > > > > Like AES-NI on IA side(exposed as separate PMD in dpdk),
> > > > > armv8 has special dedicated instructions for crypto operation using SIMD.
> > > > > This patch is using the "dedicated" armv8 crypto instructions and SIMD
> > > > > operation to achieve better performance.
> > > > 
> > > > It does not justify to have all the code in asm.
> > > 
> > > Why ? if we can have separate dpdk pmd for AES-NI on IA . Why not for ARM?
> > 
> > Jerin, you or me is not understanding the other.
> > It is perfectly fine to have a separate PMD.
> > I am just talking about the language C vs ASM.
> 
> Hmm. Both are bit connected topic :-)
> 
> If you check the AES-NI PMD installation guide, We need to download the
> "ASM" optimized AES-NI library and build with yasm.
> We all uses fine grained ASM code such work.
> So AES-NI case those are still ASM code but reside in some other
> library.

Yes

> http://dpdk.org/doc/guides/cryptodevs/aesni_mb.html(Check Installation section)
> https://downloadcenter.intel.com/download/22972
> 
> Even linux kernel use, hardcore ASM for crypto work.
> https://github.com/torvalds/linux/blob/master/arch/arm/crypto/aes-ce-core.S

Yes

> > > > > We had compared with openssl implementation.Here is the performance
> > > > > improvement for chained crypto operations case WRT openssl pmd
> > > > > 
> > > > > Buffer
> > > > > Size(B)   OPS(M)      Throughput(Gbps)
> > > > > 64        729 %        742 %
> > > > > 128       577 %        592 %
> > > > > 256       483 %        476 %
> > > > > 512       336 %        351 %
> > > > > 768       300 %        286 %
> > > > > 1024      263 %        250 %
> > > > > 1280      225 %        229 %
> > > > > 1536      214 %        213 %
> > > > > 1792      186 %        203 %
> > > > > 2048      200 %        193 %
> > > > 
> > > > OK but what is the performance difference between this asm code
> > > > and a C equivalent?
> > > 
> > > Do you you want compare against the scalar version of C code? its not
> > > even worth to think about it. The vector version will use
> > > dedicated armv8 instruction for crypto so its not portable anyway.
> > > We would like to asm code so that we can have better control on what we do
> > > and we cant rely compiler for that.
> > 
> > No I'm talking about comparing a PMD written in C vs this one in ASM.
> 
> Only fast stuff written in ASM. Remaining pmd is written in C.
> Look  "crypto/armv8: add PMD optimized for ARMv8 processors"
> 
> > It"s just harder to read ASM. Most of DPDK code is in C.
> > And only some small functions are written in ASM.
> > The vector instructions use some C intrinsics.
> > Do you mean that the instructions that you are using have no intrinsics
> > equivalent? Nobody made it into GCC?
> 
> There is intrinsic equivalent for crypto but that will work only on
> armv8. If we start using the arch specific intrinsic then it better to
> plain ASM code, it is clean and we all do similar scheme for core crypto
> work(like AES-NI library, linux etc)
> 
> We did a lot of effort to make clean armv8 ASM code _optimized_ for DPDK workload.
> Just because someone doesn't familiar with armv8 Assembly its not fair to
> say write it in C.

I'm just saying it is sad, as it is sad for AES-NI or Linux code.
Please read again my questions:
Have you tried it? How much is the difference?

I'm not saying it should not enter in DPDK, I'm just asking some basic
questions to better understand the motivations and the status of ARM crypto
in general.
You did not answer for comparing with a C implementation, so I guess you
have implemented it in ASM without even trying to do it in C.
The conclusion: we will never know what is the real gain of coding this in ASM.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8
  2016-12-07 15:00                 ` Thomas Monjalon
@ 2016-12-07 16:30                   ` Jerin Jacob
  0 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2016-12-07 16:30 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: zbigniew.bodek, dev, pablo.de.lara.guarch, Emery Davis

On Wed, Dec 07, 2016 at 04:00:07PM +0100, Thomas Monjalon wrote:
> 2016-12-07 04:54, Jerin Jacob:
> > On Tue, Dec 06, 2016 at 02:41:01PM -0800, Thomas Monjalon wrote:
> > > 2016-12-07 03:35, Jerin Jacob:
> > > > On Tue, Dec 06, 2016 at 10:42:51PM +0100, Thomas Monjalon wrote:
> > > > > 2016-12-07 02:48, Jerin Jacob:
> > > > > > On Tue, Dec 06, 2016 at 09:29:25PM +0100, Thomas Monjalon wrote:
> > > > > > > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> > > > > > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > > > > > 
> > > > > > > > This patch adds core low-level crypto operations
> > > > > > > > for ARMv8 processors. The assembly code is a base
> > > > > > > > for an optimized PMD and is currently excluded
> > > > > > > > from the build.
> > > > > > > 
> > > > > > > It's a bit sad that you cannot achieve the same performance with
> > > > > > > C code and a good compiler.
> > > > > > > Have you tried it? How much is the difference?
> > > > > > 
> > > > > > Like AES-NI on IA side(exposed as separate PMD in dpdk),
> > > > > > armv8 has special dedicated instructions for crypto operation using SIMD.
> > > > > > This patch is using the "dedicated" armv8 crypto instructions and SIMD
> > > > > > operation to achieve better performance.
> > > > > 
> > > > > It does not justify to have all the code in asm.
> > > > 
> > > > Why ? if we can have separate dpdk pmd for AES-NI on IA . Why not for ARM?
> > > 
> > > Jerin, you or me is not understanding the other.
> > > It is perfectly fine to have a separate PMD.
> > > I am just talking about the language C vs ASM.
> > 
> > Hmm. Both are bit connected topic :-)
> > 
> > If you check the AES-NI PMD installation guide, We need to download the
> > "ASM" optimized AES-NI library and build with yasm.
> > We all uses fine grained ASM code such work.
> > So AES-NI case those are still ASM code but reside in some other
> > library.
> 
> Yes
> 
> > http://dpdk.org/doc/guides/cryptodevs/aesni_mb.html(Check Installation section)
> > https://downloadcenter.intel.com/download/22972
> > 
> > Even linux kernel use, hardcore ASM for crypto work.
> > https://github.com/torvalds/linux/blob/master/arch/arm/crypto/aes-ce-core.S
> 
> Yes
> 
> > > > > > We had compared with openssl implementation.Here is the performance
> > > > > > improvement for chained crypto operations case WRT openssl pmd
> > > > > > 
> > > > > > Buffer
> > > > > > Size(B)   OPS(M)      Throughput(Gbps)
> > > > > > 64        729 %        742 %
> > > > > > 128       577 %        592 %
> > > > > > 256       483 %        476 %
> > > > > > 512       336 %        351 %
> > > > > > 768       300 %        286 %
> > > > > > 1024      263 %        250 %
> > > > > > 1280      225 %        229 %
> > > > > > 1536      214 %        213 %
> > > > > > 1792      186 %        203 %
> > > > > > 2048      200 %        193 %
> > > > > 
> > > > > OK but what is the performance difference between this asm code
> > > > > and a C equivalent?
> > > > 
> > > > Do you you want compare against the scalar version of C code? its not
> > > > even worth to think about it. The vector version will use
> > > > dedicated armv8 instruction for crypto so its not portable anyway.
> > > > We would like to asm code so that we can have better control on what we do
> > > > and we cant rely compiler for that.
> > > 
> > > No I'm talking about comparing a PMD written in C vs this one in ASM.
> > 
> > Only fast stuff written in ASM. Remaining pmd is written in C.
> > Look  "crypto/armv8: add PMD optimized for ARMv8 processors"
> > 
> > > It"s just harder to read ASM. Most of DPDK code is in C.
> > > And only some small functions are written in ASM.
> > > The vector instructions use some C intrinsics.
> > > Do you mean that the instructions that you are using have no intrinsics
> > > equivalent? Nobody made it into GCC?
> > 
> > There is intrinsic equivalent for crypto but that will work only on
> > armv8. If we start using the arch specific intrinsic then it better to
> > plain ASM code, it is clean and we all do similar scheme for core crypto
> > work(like AES-NI library, linux etc)
> > 
> > We did a lot of effort to make clean armv8 ASM code _optimized_ for DPDK workload.
> > Just because someone doesn't familiar with armv8 Assembly its not fair to
> > say write it in C.
> 
> I'm just saying it is sad, as it is sad for AES-NI or Linux code.
> Please read again my questions:
> Have you tried it? How much is the difference?

We haven't tried due to following reasons,
1) It is a norm in the industry to write such things in ASM.So we have to do it anyway.
2) It really takes a lot of R&D cycles first to write it in C and then ASM. So
skipped the R&D part and moved to ASM directly as we need to write in
ASM anyway.

> I'm not saying it should not enter in DPDK, I'm just asking some basic
> questions to better understand the motivations and the status of ARM crypto
> in general.

OK

> You did not answer for comparing with a C implementation, so I guess you
> have implemented it in ASM without even trying to do it in C.
> The conclusion: we will never know what is the real gain of coding this in ASM.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD
  2016-12-06 20:27     ` Thomas Monjalon
@ 2016-12-07 19:04       ` Zbigniew Bodek
  2016-12-07 20:09         ` Thomas Monjalon
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2016-12-07 19:04 UTC (permalink / raw)
  To: Thomas Monjalon, dev
  Cc: zbigniew.bodek, pablo.de.lara.guarch, jerin.jacob, declan.doherty

On 06.12.2016 21:27, Thomas Monjalon wrote:
> 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Add type and name for ARMv8 crypto PMD
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> [...]
>> --- a/lib/librte_cryptodev/rte_cryptodev.h
>> +++ b/lib/librte_cryptodev/rte_cryptodev.h
>> @@ -66,6 +66,8 @@
>>  /**< KASUMI PMD device name */
>>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
>>  /**< KASUMI PMD device name */
>> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
>> +/**< ARMv8 CM device name */
>>
>>  /** Crypto device type */
>>  enum rte_cryptodev_type {
>> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
>>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
>>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
>> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
>>  };
>
> Can we remove all these types and names in the generic crypto API?
>

Hello Thomas,

I added another PMD type and therefore we need new, unique number for 
it. I'm not sure if I understand correctly what you mean here, so please 
elaborate.

Kind regards
Zbigniew

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD
  2016-12-07 19:04       ` Zbigniew Bodek
@ 2016-12-07 20:09         ` Thomas Monjalon
  2016-12-09 12:06           ` Declan Doherty
  0 siblings, 1 reply; 100+ messages in thread
From: Thomas Monjalon @ 2016-12-07 20:09 UTC (permalink / raw)
  To: Zbigniew Bodek
  Cc: dev, zbigniew.bodek, pablo.de.lara.guarch, jerin.jacob, declan.doherty

2016-12-07 20:04, Zbigniew Bodek:
> On 06.12.2016 21:27, Thomas Monjalon wrote:
> > 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
> >> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >>
> >> Add type and name for ARMv8 crypto PMD
> >>
> >> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > [...]
> >> --- a/lib/librte_cryptodev/rte_cryptodev.h
> >> +++ b/lib/librte_cryptodev/rte_cryptodev.h
> >> @@ -66,6 +66,8 @@
> >>  /**< KASUMI PMD device name */
> >>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
> >>  /**< KASUMI PMD device name */
> >> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
> >> +/**< ARMv8 CM device name */
> >>
> >>  /** Crypto device type */
> >>  enum rte_cryptodev_type {
> >> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
> >>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
> >>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
> >>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
> >> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
> >>  };
> >
> > Can we remove all these types and names in the generic crypto API?
> >
> 
> Hello Thomas,
> 
> I added another PMD type and therefore we need new, unique number for 
> it. I'm not sure if I understand correctly what you mean here, so please 
> elaborate.

My comment is not specific to your PMD.
I think there is something wrong in the design of cryptodev if we need
to update rte_cryptodev.h each time a new driver is added.
There is no such thing in ethdev.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD
  2016-12-07  2:33   ` [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD zbigniew.bodek
@ 2016-12-07 21:13     ` Mcnamara, John
  0 siblings, 0 replies; 100+ messages in thread
From: Mcnamara, John @ 2016-12-07 21:13 UTC (permalink / raw)
  To: zbigniew.bodek, De Lara Guarch, Pablo, jerin.jacob; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
> zbigniew.bodek@caviumnetworks.com
> Sent: Wednesday, December 7, 2016 2:33 AM
> To: De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>;
> jerin.jacob@caviumnetworks.com
> Cc: dev@dpdk.org; Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> Subject: [dpdk-dev] [PATCH v2 09/12] doc/armv8: update documentation about
> crypto PMD
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Add documentation about the driver and update release notes.
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>


Acked-by: John McNamara <john.mcnamara@intel.com>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                     ` (9 preceding siblings ...)
  2016-12-07  2:33   ` [PATCH v2 10/12] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
@ 2016-12-08 10:24   ` Bruce Richardson
  2016-12-08 11:32     ` Zbigniew Bodek
  10 siblings, 1 reply; 100+ messages in thread
From: Bruce Richardson @ 2016-12-08 10:24 UTC (permalink / raw)
  To: zbigniew.bodek; +Cc: pablo.de.lara.guarch, jerin.jacob, dev

On Tue, Dec 06, 2016 at 06:32:53PM -0800, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Introduce crypto poll mode driver using ARMv8
> cryptographic extensions. This PMD is optimized
> to provide performance boost for chained
> crypto operations processing, such as:
> * encryption + HMAC generation
> * decryption + HMAC validation.
> In particular, cipher only or hash only
> operations are not provided. 
> Performance gain can be observed in tests
> against OpenSSL PMD which also uses ARM
> crypto extensions for packets processing.
> 
Hi,

great to see more crypto drivers coming into DPDK, thanks.

Question: do you know if this code would have any export compliance
implications for DPDK - or for those repackaging DPDK? Up till now, all
the crypto code used by DPDK was actually packaged in separate libraries
that were re-used, meaning that DPDK didn't contain any crypto
algorithms itself.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-08 10:24   ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 Bruce Richardson
@ 2016-12-08 11:32     ` Zbigniew Bodek
  2016-12-08 17:45       ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2016-12-08 11:32 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: pablo.de.lara.guarch, jerin.jacob, dev

On 08.12.2016 11:24, Bruce Richardson wrote:
> On Tue, Dec 06, 2016 at 06:32:53PM -0800, zbigniew.bodek@caviumnetworks.com wrote:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Introduce crypto poll mode driver using ARMv8
>> cryptographic extensions. This PMD is optimized
>> to provide performance boost for chained
>> crypto operations processing, such as:
>> * encryption + HMAC generation
>> * decryption + HMAC validation.
>> In particular, cipher only or hash only
>> operations are not provided.
>> Performance gain can be observed in tests
>> against OpenSSL PMD which also uses ARM
>> crypto extensions for packets processing.
>>
> Hi,
>
> great to see more crypto drivers coming into DPDK, thanks.
>
> Question: do you know if this code would have any export compliance
> implications for DPDK - or for those repackaging DPDK? Up till now, all
> the crypto code used by DPDK was actually packaged in separate libraries
> that were re-used, meaning that DPDK didn't contain any crypto
> algorithms itself.
>

Hello Bruce,

I don't know to be honest. I didn't know the reasoning behind not 
including crypto code for Intel for example. I thought it was due to 
licensing and code control rather than export compliance.

Maybe someone from the DPDK community will know what are the constraints 
related to including crypto algorithms to DPDK.

Kind regards
Zbigniew

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-08 11:32     ` Zbigniew Bodek
@ 2016-12-08 17:45       ` Jerin Jacob
  2016-12-21 15:34         ` Declan Doherty
  0 siblings, 1 reply; 100+ messages in thread
From: Jerin Jacob @ 2016-12-08 17:45 UTC (permalink / raw)
  To: Zbigniew Bodek; +Cc: Bruce Richardson, pablo.de.lara.guarch, dev

On Thu, Dec 08, 2016 at 12:32:52PM +0100, Zbigniew Bodek wrote:
> On 08.12.2016 11:24, Bruce Richardson wrote:
> > On Tue, Dec 06, 2016 at 06:32:53PM -0800, zbigniew.bodek@caviumnetworks.com wrote:
> > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > 
> > > Introduce crypto poll mode driver using ARMv8
> > > cryptographic extensions. This PMD is optimized
> > > to provide performance boost for chained
> > > crypto operations processing, such as:
> > > * encryption + HMAC generation
> > > * decryption + HMAC validation.
> > > In particular, cipher only or hash only
> > > operations are not provided.
> > > Performance gain can be observed in tests
> > > against OpenSSL PMD which also uses ARM
> > > crypto extensions for packets processing.
> > > 
> > Hi,
> > 
> > great to see more crypto drivers coming into DPDK, thanks.
> > 
> > Question: do you know if this code would have any export compliance
> > implications for DPDK - or for those repackaging DPDK? Up till now, all
> > the crypto code used by DPDK was actually packaged in separate libraries
> > that were re-used, meaning that DPDK didn't contain any crypto
> > algorithms itself.
> > 
> 
> Hello Bruce,
> 
> I don't know to be honest. I didn't know the reasoning behind not including
> crypto code for Intel for example. I thought it was due to licensing and
> code control rather than export compliance.
> 
> Maybe someone from the DPDK community will know what are the constraints
> related to including crypto algorithms to DPDK.

One of the primary reason why we thought of going with this approach is
for out of the box "distribution" enablement. We thought, if the core crypto
algorithm sits in some git-hub code or public hosted tarball then the
PMD will never be added to standard distributions and which is a setback
for armv8 server ecosystem.

Having said that and as Zbigniew mentioned, We are open for revisiting
the crypto core algorithm and PMD split if there are community concerns
about export compliance. Let us know.

Jerin

> 
> Kind regards
> Zbigniew

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD
  2016-12-07 20:09         ` Thomas Monjalon
@ 2016-12-09 12:06           ` Declan Doherty
  0 siblings, 0 replies; 100+ messages in thread
From: Declan Doherty @ 2016-12-09 12:06 UTC (permalink / raw)
  To: Thomas Monjalon, Zbigniew Bodek
  Cc: dev, zbigniew.bodek, pablo.de.lara.guarch, jerin.jacob

On 07/12/16 20:09, Thomas Monjalon wrote:
> 2016-12-07 20:04, Zbigniew Bodek:
>> On 06.12.2016 21:27, Thomas Monjalon wrote:
>>> 2016-12-06 18:32, zbigniew.bodek@caviumnetworks.com:
>>>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>>>
>>>> Add type and name for ARMv8 crypto PMD
>>>>
>>>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>> [...]
>>>> --- a/lib/librte_cryptodev/rte_cryptodev.h
>>>> +++ b/lib/librte_cryptodev/rte_cryptodev.h
>>>> @@ -66,6 +66,8 @@
>>>>  /**< KASUMI PMD device name */
>>>>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
>>>>  /**< KASUMI PMD device name */
>>>> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
>>>> +/**< ARMv8 CM device name */
>>>>
>>>>  /** Crypto device type */
>>>>  enum rte_cryptodev_type {
>>>> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>>>>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
>>>>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
>>>>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
>>>> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
>>>>  };
>>>
>>> Can we remove all these types and names in the generic crypto API?
>>>
>>
>> Hello Thomas,
>>
>> I added another PMD type and therefore we need new, unique number for
>> it. I'm not sure if I understand correctly what you mean here, so please
>> elaborate.
>
> My comment is not specific to your PMD.
> I think there is something wrong in the design of cryptodev if we need
> to update rte_cryptodev.h each time a new driver is added.
> There is no such thing in ethdev.
>

Hey Thomas, I've been meaning to have a look at removing this enum, I 
just haven't had the time as yet, I think since there is now a standard 
naming convention for all pmds, the use for this is redundant.

This change will require a ABI/API deprecation notice, so I'll put that 
into 17.02 and then do the patches to remove for 17.05

Declan

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 01/12] mk: fix build of assembly files for ARM64
  2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
@ 2016-12-21 14:46     ` De Lara Guarch, Pablo
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  1 sibling, 0 replies; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2016-12-21 14:46 UTC (permalink / raw)
  To: zbigniew.bodek, jerin.jacob; +Cc: dev

Hi Zbigniew,

> -----Original Message-----
> From: zbigniew.bodek@caviumnetworks.com
> [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, December 07, 2016 2:33 AM
> To: De Lara Guarch, Pablo; jerin.jacob@caviumnetworks.com
> Cc: dev@dpdk.org; Zbigniew Bodek
> Subject: [PATCH v2 01/12] mk: fix build of assembly files for ARM64
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Avoid using incorrect assembler (nasm) and unsupported flags
> when building for ARM64.
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

If this is a fix, you should include a "Fixes" line and CC the stable tree list.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors
  2016-12-07  2:32   ` [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2016-12-21 14:55     ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2016-12-21 14:55 UTC (permalink / raw)
  To: zbigniew.bodek, jerin.jacob; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
> zbigniew.bodek@caviumnetworks.com
> Sent: Wednesday, December 07, 2016 2:33 AM
> To: De Lara Guarch, Pablo; jerin.jacob@caviumnetworks.com
> Cc: dev@dpdk.org; Zbigniew Bodek
> Subject: [dpdk-dev] [PATCH v2 06/12] crypto/armv8: add PMD optimized
> for ARMv8 processors
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> This patch introduces crypto poll mode driver
> using ARMv8 cryptographic extensions.
> CPU compatibility with this driver is detected in
> run-time and virtual crypto device will not be
> created if CPU doesn't provide:
> AES, SHA1, SHA2 and NEON.
> 
> This PMD is optimized to provide performance boost
> for chained crypto operations processing,
> such as encryption + HMAC generation,
> decryption + HMAC validation. In particular,
> cipher only or hash only operations are
> not provided.
> 
> The driver currently supports AES-128-CBC
> in combination with:
> SHA256 MAC, SHA256 HMAC and SHA1 HMAC and relies
> on the low-level assembly code.
> 
> This patch adds driver's code only and does
> not include it in the build system.
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  drivers/crypto/armv8/Makefile                     |  72 ++
>  drivers/crypto/armv8/asm/include/rte_armv8_defs.h |  80 ++
>  drivers/crypto/armv8/rte_armv8_pmd.c              | 915
> ++++++++++++++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_ops.c          | 390 +++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_private.h      | 210 +++++
>  drivers/crypto/armv8/rte_armv8_pmd_version.map    |   3 +
>  6 files changed, 1670 insertions(+)
>  create mode 100644 drivers/crypto/armv8/Makefile
>  create mode 100644
> drivers/crypto/armv8/asm/include/rte_armv8_defs.h
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
> 
...

> diff --git a/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
> b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
> new file mode 100644
> index 0000000..ea05495
> --- /dev/null
> +++ b/drivers/crypto/armv8/asm/include/rte_armv8_defs.h
> @@ -0,0 +1,80 @@
...

> +
> +#ifndef _RTE_ARMV8_DEFS_H_
> +#define _RTE_ARMV8_DEFS_H_
> +
> +struct crypto_arg {
> +	struct {
> +		uint8_t		*key;
> +		uint8_t		*iv;
> +	} cipher;

Remove unnecessary tab above.

> +	struct {
> +		struct {
> +			uint8_t	*key;
> +			uint8_t *i_key_pad;
> +			uint8_t *o_key_pad;
> +		} hmac;
> +	} digest;
> +};

...

> diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c
> b/drivers/crypto/armv8/rte_armv8_pmd.c
> new file mode 100644
> index 0000000..0410bb0
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd.c


> + * 3D array type for ARM Combined Mode crypto functions pointers.
> + * CRYPTO_CIPHER_MAX:			max cipher ID number
> + * CRYPTO_AUTH_MAX:			max auth ID number
> + * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
> + */
> +typedef const crypto_func_t
> +crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_
> CIPHER_KEYLEN_MAX];
> +
> +/* Evaluate to key length definition */
> +#define	KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_
> ## keyl)

I don't think a tab is necessary here after define (happens on other parts)

> +
> +/* Local aliases for supported ciphers */
> +#define	CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
> +/* Local aliases for supported hashes */
> +#define	AUTH_SHA1_HMAC
> 	RTE_CRYPTO_AUTH_SHA1_HMAC
> +#define	AUTH_SHA256		RTE_CRYPTO_AUTH_SHA256
> +#define	AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC

...

> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> new file mode 100644
> index 0000000..0f768f4
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c

...

> +	{	/* AES CBC */
> +		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +			{.sym = {
> +				.xform_type =
> RTE_CRYPTO_SYM_XFORM_CIPHER,
> +				{.cipher = {
> +					.algo =
> RTE_CRYPTO_CIPHER_AES_CBC,
> +					.block_size = 16,
> +					.key_size = {
> +						.min = 16,
> +						.max = 32,
> +						.increment = 8

>From what I read, this PMD only supports AES-128-CBC.
If that's right, then key_size should be .min = 16, .max = 16, .increment = 0.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system
  2016-12-07  2:33   ` [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
@ 2016-12-21 15:01     ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2016-12-21 15:01 UTC (permalink / raw)
  To: zbigniew.bodek, jerin.jacob; +Cc: dev



> -----Original Message-----
> From: zbigniew.bodek@caviumnetworks.com
> [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, December 07, 2016 2:33 AM
> To: De Lara Guarch, Pablo; jerin.jacob@caviumnetworks.com
> Cc: dev@dpdk.org; Zbigniew Bodek
> Subject: [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Build ARMv8 crypto PMD if compiling for ARM64
> and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
> is enable in the configuration file.
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  drivers/crypto/Makefile | 3 +++
>  mk/rte.app.mk           | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
> index 745c614..a5de944 100644
> --- a/drivers/crypto/Makefile
> +++ b/drivers/crypto/Makefile
> @@ -33,6 +33,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
> 
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
>  DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
> +ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
> +endif

Is this extra conditional necessary (ARM64)? I would think enabling ARMV8_CRYPTO would be sufficient.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-08 17:45       ` Jerin Jacob
@ 2016-12-21 15:34         ` Declan Doherty
  2016-12-22  4:57           ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Declan Doherty @ 2016-12-21 15:34 UTC (permalink / raw)
  To: Jerin Jacob, Zbigniew Bodek; +Cc: Bruce Richardson, pablo.de.lara.guarch, dev

On 08/12/16 17:45, Jerin Jacob wrote:
> On Thu, Dec 08, 2016 at 12:32:52PM +0100, Zbigniew Bodek wrote:
>> On 08.12.2016 11:24, Bruce Richardson wrote:
>>> On Tue, Dec 06, 2016 at 06:32:53PM -0800, zbigniew.bodek@caviumnetworks.com wrote:
>>>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>>>
>>>> Introduce crypto poll mode driver using ARMv8
>>>> cryptographic extensions. This PMD is optimized
>>>> to provide performance boost for chained
>>>> crypto operations processing, such as:
>>>> * encryption + HMAC generation
>>>> * decryption + HMAC validation.
>>>> In particular, cipher only or hash only
>>>> operations are not provided.
>>>> Performance gain can be observed in tests
>>>> against OpenSSL PMD which also uses ARM
>>>> crypto extensions for packets processing.
>>>>
>>> Hi,
>>>
>>> great to see more crypto drivers coming into DPDK, thanks.
>>>
>>> Question: do you know if this code would have any export compliance
>>> implications for DPDK - or for those repackaging DPDK? Up till now, all
>>> the crypto code used by DPDK was actually packaged in separate libraries
>>> that were re-used, meaning that DPDK didn't contain any crypto
>>> algorithms itself.
>>>
>>
>> Hello Bruce,
>>
>> I don't know to be honest. I didn't know the reasoning behind not including
>> crypto code for Intel for example. I thought it was due to licensing and
>> code control rather than export compliance.
>>
>> Maybe someone from the DPDK community will know what are the constraints
>> related to including crypto algorithms to DPDK.
>
> One of the primary reason why we thought of going with this approach is
> for out of the box "distribution" enablement. We thought, if the core crypto
> algorithm sits in some git-hub code or public hosted tarball then the
> PMD will never be added to standard distributions and which is a setback
> for armv8 server ecosystem.
>
> Having said that and as Zbigniew mentioned, We are open for revisiting
> the crypto core algorithm and PMD split if there are community concerns
> about export compliance. Let us know.
>
> Jerin
>
>>
>> Kind regards
>> Zbigniew

Hey Jerin/Zbigniew,


as Bruce said it's great to see you contributing to the crypto ecosystem 
in DPDK. I don't know if the export compliance with the core crypto code 
is an issue or not, that's definitely not my area of expertise, but I do 
have some concern which I think it relates somewhat to Thomas questions 
regarding implementing the core crypto algorithms in C rather than assembly.

I wonder is there the expertise within the DPDK community to 
review/maintain the core crypto code in terms of both the assembly code 
itself and also the details of the crypto algorithm's implementations 
themselves. I know I wouldn't feel I have the knowledge/expertise to be 
able to review the core crypto algorithm's implementations and the 
assembly code itself and sign-off on them.

I understand the advantage of having the code integrated directly into 
DPDK for packaging etc but this also puts the ownest on the DPDK 
community for the correctness of the underlying implementation of a 
particular algorithm. I think the approach of a separate library removes 
this responsibility from the community and places it on the distributor 
of the core crypto library.

Declan

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v2 00/12] Add crypto PMD optimized for ARMv8
  2016-12-21 15:34         ` Declan Doherty
@ 2016-12-22  4:57           ` Jerin Jacob
  0 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2016-12-22  4:57 UTC (permalink / raw)
  To: Declan Doherty
  Cc: Zbigniew Bodek, Bruce Richardson, pablo.de.lara.guarch, dev

On Wed, Dec 21, 2016 at 03:34:14PM +0000, Declan Doherty wrote:
> On 08/12/16 17:45, Jerin Jacob wrote:
> > On Thu, Dec 08, 2016 at 12:32:52PM +0100, Zbigniew Bodek wrote:
> > > On 08.12.2016 11:24, Bruce Richardson wrote:
> > > > On Tue, Dec 06, 2016 at 06:32:53PM -0800, zbigniew.bodek@caviumnetworks.com wrote:
> > > > > From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > > > 
> > > 
> > > Hello Bruce,
> > > 
> > > I don't know to be honest. I didn't know the reasoning behind not including
> > > crypto code for Intel for example. I thought it was due to licensing and
> > > code control rather than export compliance.
> > > 
> > > Maybe someone from the DPDK community will know what are the constraints
> > > related to including crypto algorithms to DPDK.
> > 
> > One of the primary reason why we thought of going with this approach is
> > for out of the box "distribution" enablement. We thought, if the core crypto
> > algorithm sits in some git-hub code or public hosted tarball then the
> > PMD will never be added to standard distributions and which is a setback
> > for armv8 server ecosystem.
> > 
> > Having said that and as Zbigniew mentioned, We are open for revisiting
> > the crypto core algorithm and PMD split if there are community concerns
> > about export compliance. Let us know.
> > 
> > Jerin
> > 
> > > 
> > > Kind regards
> > > Zbigniew
> 
> Hey Jerin/Zbigniew,
> 
> 
> as Bruce said it's great to see you contributing to the crypto ecosystem in
> DPDK. I don't know if the export compliance with the core crypto code is an
> issue or not, that's definitely not my area of expertise, but I do have some
> concern which I think it relates somewhat to Thomas questions regarding
> implementing the core crypto algorithms in C rather than assembly.
> 
> I wonder is there the expertise within the DPDK community to review/maintain
> the core crypto code in terms of both the assembly code itself and also the
> details of the crypto algorithm's implementations themselves. I know I
> wouldn't feel I have the knowledge/expertise to be able to review the core
> crypto algorithm's implementations and the assembly code itself and sign-off
> on them.
> 
> I understand the advantage of having the code integrated directly into DPDK
> for packaging etc but this also puts the ownest on the DPDK community for
> the correctness of the underlying implementation of a particular algorithm.
> I think the approach of a separate library removes this responsibility from
> the community and places it on the distributor of the core crypto library.

OK. Taking Thomas and your feedback into account, We will move the core
crypto ARMv8 ASM code to separate library.

Jerin
> 
> Declan
> 
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
  2016-12-21 14:46     ` De Lara Guarch, Pablo
@ 2017-01-04 17:33     ` zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 1/8] mk: fix build of assembly files for ARM64 zbigniew.bodek
                         ` (9 more replies)
  1 sibling, 10 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided.
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC.
The core crypto functionality of this driver is
provided by the external armv8_crypto library
that can be downloaded from the Cavium repository:
https://github.com/caviumnetworks/armv8_crypto

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA1 and
SHA256 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

---
v3:
* Addressed review remarks
* Moved low-level assembly code to the external library
* Removed SHA256 MAC cases
* Various fixes: interface to the library, digest destination
  and source address interpreting, missing mbuf manipulations.

v2:
* Fixed checkpatch warnings
* Divide patches into smaller logical parts

Zbigniew Bodek (8):
  mk: fix build of assembly files for ARM64
  lib: add cryptodev type for the upcoming ARMv8 PMD
  crypto/armv8: add PMD optimized for ARMv8 processors
  mk/crypto/armv8: add PMD to the build system
  doc/armv8: update documentation about crypto PMD
  crypto/armv8: enable ARMv8 PMD in the configuration
  crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                    |   6 +
 app/test/test_cryptodev.c                      |  63 ++
 app/test/test_cryptodev_aes_test_vectors.h     | 144 +++-
 app/test/test_cryptodev_blockcipher.c          |   4 +
 app/test/test_cryptodev_blockcipher.h          |   1 +
 app/test/test_cryptodev_perf.c                 | 480 +++++++++++++
 config/common_base                             |   6 +
 doc/guides/cryptodevs/armv8.rst                |  96 +++
 doc/guides/cryptodevs/index.rst                |   1 +
 doc/guides/rel_notes/release_17_02.rst         |   5 +
 drivers/crypto/Makefile                        |   1 +
 drivers/crypto/armv8/Makefile                  |  73 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 lib/librte_cryptodev/rte_cryptodev.h           |   3 +
 mk/arch/arm64/rte.vars.mk                      |   1 -
 mk/rte.app.mk                                  |   2 +
 mk/toolchain/gcc/rte.vars.mk                   |   6 +-
 20 files changed, 2390 insertions(+), 11 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v3 1/8] mk: fix build of assembly files for ARM64
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-13  8:13         ` Hemant Agrawal
  2017-01-04 17:33       ` [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
                         ` (8 subsequent siblings)
  9 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Avoid using incorrect assembler (nasm) and unsupported flags
when building for ARM64.

Fixes:	af75078fece3 ("first public release")
	b3ce00e5fe36 ("mk: introduce ARMv8 architecture")

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 mk/arch/arm64/rte.vars.mk    | 1 -
 mk/toolchain/gcc/rte.vars.mk | 6 ++++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mk/arch/arm64/rte.vars.mk b/mk/arch/arm64/rte.vars.mk
index c168426..3b1178a 100644
--- a/mk/arch/arm64/rte.vars.mk
+++ b/mk/arch/arm64/rte.vars.mk
@@ -53,7 +53,6 @@ CROSS ?=
 
 CPU_CFLAGS  ?=
 CPU_LDFLAGS ?=
-CPU_ASFLAGS ?= -felf
 
 export ARCH CROSS CPU_CFLAGS CPU_LDFLAGS CPU_ASFLAGS
 
diff --git a/mk/toolchain/gcc/rte.vars.mk b/mk/toolchain/gcc/rte.vars.mk
index ff70f3d..94f6412 100644
--- a/mk/toolchain/gcc/rte.vars.mk
+++ b/mk/toolchain/gcc/rte.vars.mk
@@ -41,9 +41,11 @@
 CC        = $(CROSS)gcc
 KERNELCC  = $(CROSS)gcc
 CPP       = $(CROSS)cpp
-# for now, we don't use as but nasm.
-# AS      = $(CROSS)as
+ifeq ($(CONFIG_RTE_ARCH_X86),y)
 AS        = nasm
+else
+AS        = $(CROSS)as
+endif
 AR        = $(CROSS)ar
 LD        = $(CROSS)ld
 OBJCOPY   = $(CROSS)objcopy
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 1/8] mk: fix build of assembly files for ARM64 zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-13  8:16         ` Hemant Agrawal
  2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                         ` (7 subsequent siblings)
  9 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add type and name for ARMv8 crypto PMD

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 lib/librte_cryptodev/rte_cryptodev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index 8f63e8f..6f34f22 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 Crypto PMD device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 1/8] mk: fix build of assembly files for ARM64 zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-06  2:45         ` Jianbo Liu
                           ` (2 more replies)
  2017-01-04 17:33       ` [PATCH v3 4/8] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
                         ` (6 subsequent siblings)
  9 siblings, 3 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver
using ARMv8 cryptographic extensions.
CPU compatibility with this driver is detected in
run-time and virtual crypto device will not be
created if CPU doesn't provide:
AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost
for chained crypto operations processing,
such as encryption + HMAC generation,
decryption + HMAC validation. In particular,
cipher only or hash only operations are
not provided.

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC
and relies on the external armv8_crypto library:
https://github.com/caviumnetworks/armv8_crypto

This patch adds driver's code only and does
not include it in the build system.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile                  |  73 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 5 files changed, 1582 insertions(+)
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..dc5ea02
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,73 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2017.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(MAKECMDGOALS),clean)
+ifneq ($(MAKECMDGOALS),config)
+ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
+$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
+endif
+endif
+endif
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -L$(RTE_SDK)/../openssl -I$(RTE_SDK)/../openssl/include
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+# external library dependencies
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
+LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..39433bb
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,926 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be decribed by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t
+crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+/* Evaluate to key length definition */
+#define KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/* Local aliases for supported ciphers */
+#define CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
+/* Local aliases for supported hashes */
+#define AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
+#define AUTH_SHA256		RTE_CRYPTO_AUTH_SHA256
+#define AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t
+crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * Global static parameter used to create a unique name for each
+ * ARMV8 crypto device.
+ */
+static unsigned int unique_name_id;
+
+static inline int
+create_unique_device_name(char *name, size_t size)
+{
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+			unique_name_id++);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA256:
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
+		break;
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 192);
+		break;
+	case 256:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 256);
+		break;
+	default:
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (sess == NULL)
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	struct rte_mbuf *m_asrc, *m_adst;
+	uint8_t *csrc, *cdst;
+	uint8_t *adst, *asrc;
+	uint64_t clen, alen __rte_unused;
+	int error;
+
+	clen = op->sym->cipher.data.length;
+	alen = op->sym->auth.data.length;
+
+	csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		m_asrc = m_adst = mbuf_dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		m_asrc = mbuf_src;
+		m_adst = mbuf_dst;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+	asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
+				op->sym->auth.data.offset);
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(m_adst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
+	if (error != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+		/* Trim area used for digest from mbuf. */
+		rte_pktmbuf_trim(m_asrc,
+				op->sym->auth.digest.length);
+	}
+}
+
+/** Process crypto operation for mbuf */
+static int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+	int retval;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (op->status != RTE_CRYPTO_OP_STATUS_ERROR)
+		retval = rte_ring_enqueue(qp->processed_ops, (void *)op);
+	else
+		retval = -1;
+
+	return retval;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	qp->stats.enqueued_count += i;
+	return i;
+
+enqueue_err:
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return i;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(const char *name,
+		struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
+	struct armv8_crypto_private *internals;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* create a unique device name */
+	if (create_unique_device_name(crypto_dev_name,
+			RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
+		return -EINVAL;
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed", name);
+
+	cryptodev_armv8_crypto_uninit(crypto_dev_name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id()
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(name, &init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..2bf6475
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,369 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..fe46cde
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,211 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define NBBY		8		/* Number of bits in a byte */
+#define BYTE_LENGTH(x)	((x) / 8)	/* Number of bytes in x (roun down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define HMAC_IPAD_VALUE			(0x36)
+#define HMAC_OPAD_VALUE			(0x5C)
+
+#define SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
+				uint8_t *, uint8_t *, uint64_t,
+				crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 4/8] mk/crypto/armv8: add PMD to the build system
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (2 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 5/8] doc/armv8: update documentation about crypto PMD zbigniew.bodek
                         ` (5 subsequent siblings)
  9 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Build ARMv8 crypto PMD if compiling for ARM64
and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
is enable in the configuration file.
ARMV8_CRYPTO_LIB_PATH environment variable will
point to the appropriate library directory.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 drivers/crypto/Makefile | 1 +
 mk/rte.app.mk           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..77b02cf 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..bbb5265 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -145,6 +145,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 5/8] doc/armv8: update documentation about crypto PMD
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (3 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 4/8] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
                         ` (4 subsequent siblings)
  9 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add documentation about the driver and update
release notes.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 doc/guides/cryptodevs/armv8.rst        | 96 ++++++++++++++++++++++++++++++++++
 doc/guides/cryptodevs/index.rst        |  1 +
 doc/guides/rel_notes/release_17_02.rst |  5 ++
 3 files changed, 102 insertions(+)
 create mode 100644 doc/guides/cryptodevs/armv8.rst

diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..ca8781e
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,96 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2017.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+================================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level library, written in the assembly code.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+In order to enable this virtual crypto PMD, user must:
+
+* Download ARMv8 crypto library source code from
+  `here <https://github.com/caviumnetworks/armv8_crypto>`_
+
+* Export the environmental variable ARMV8_CRYPTO_LIB_PATH with
+  the path where the ``armv8_crypto`` library was downloaded
+  or cloned.
+
+* Build the library by invoking:
+
+.. code-block:: console
+
+	make -C $ARMV8_CRYPTO_LIB_PATH/
+
+* Set CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y in
+  config/defconfig_arm64-armv8a-linuxapp-gcc
+
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Cipher input data has to be a multiple of 16 bytes.
+* Digest input data has to be a multiple of 8 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 180af82..c3e1f56 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -52,6 +52,11 @@ New Features
   See the :ref:`Generic flow API <Generic_flow_API>` documentation for more
   information.
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 6/8] crypto/armv8: enable ARMv8 PMD in the configuration
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (4 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 5/8] doc/armv8: update documentation about crypto PMD zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 7/8] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option to
the common configuration file. Don't enable it by
default for ARM64 as it requires external library
to build.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 config/common_base | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/config/common_base b/config/common_base
index edb6a54..e0c0c0a 100644
--- a/config/common_base
+++ b/config/common_base
@@ -407,6 +407,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 7/8] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (5 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-04 17:33       ` [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ebc97b8..89b5179 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -447,6 +447,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (6 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 7/8] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
@ 2017-01-04 17:33       ` zbigniew.bodek
  2017-01-12 10:48         ` De Lara Guarch, Pablo
  2017-01-13  9:28         ` Hemant Agrawal
  2017-01-10 17:11       ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
  2017-01-13  8:07       ` Hemant Agrawal
  9 siblings, 2 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-04 17:33 UTC (permalink / raw)
  To: dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  63 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 144 ++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 480 +++++++++++++++++++++++++++++
 5 files changed, 684 insertions(+), 8 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 872f8b4..a0540d6 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,27 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1545,6 +1566,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -6504,6 +6541,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -6567,6 +6621,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -6575,3 +6637,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index 1c68f93..5683406 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,98 @@
 	}
 };
 
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -878,37 +970,69 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -954,7 +1078,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -963,7 +1088,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1006,7 +1132,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1015,7 +1142,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index 37b10cf..6963241 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -82,6 +82,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -508,6 +509,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index 04ff1ee..bd362c7 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -49,6 +49,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 59a6891..827cf4d 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,27 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2422,6 +2449,136 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			start_cycles = rte_rdtsc_precise();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				((num_to_submit - num_sent) < burst_size) ?
+				num_to_submit - num_sent : burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			/* Wait until requests have been sent. */
+			rte_delay_ms(1);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc_precise();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc_precise();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc_precise();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2683,6 +2840,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3356,6 +3563,138 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc_precise();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))],
+					sess, pparams->buf_size, digest_length);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc_precise();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3664,6 +4003,125 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4214,6 +4672,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4270,6 +4741,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4279,3 +4758,4 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest, perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-06  2:45         ` Jianbo Liu
  2017-01-12 13:12           ` Zbigniew Bodek
  2017-01-13  7:57         ` Hemant Agrawal
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2 siblings, 1 reply; 100+ messages in thread
From: Jianbo Liu @ 2017-01-06  2:45 UTC (permalink / raw)
  To: zbigniew.bodek; +Cc: dev, pablo.de.lara.guarch, Declan Doherty, Jerin Jacob

On 5 January 2017 at 01:33,  <zbigniew.bodek@caviumnetworks.com> wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> This patch introduces crypto poll mode driver
> using ARMv8 cryptographic extensions.
> CPU compatibility with this driver is detected in
> run-time and virtual crypto device will not be
> created if CPU doesn't provide:
> AES, SHA1, SHA2 and NEON.
>
> This PMD is optimized to provide performance boost
> for chained crypto operations processing,
> such as encryption + HMAC generation,
> decryption + HMAC validation. In particular,
> cipher only or hash only operations are
> not provided.
>
> The driver currently supports AES-128-CBC
> in combination with: SHA256 HMAC and SHA1 HMAC
> and relies on the external armv8_crypto library:
> https://github.com/caviumnetworks/armv8_crypto
>

It's standalone lib. I think you should change the following line in
its Makefile, so not depend on DPDK.
"include $(RTE_SDK)/mk/rte.lib.mk"

> This patch adds driver's code only and does
> not include it in the build system.
>
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  drivers/crypto/armv8/Makefile                  |  73 ++
>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>  5 files changed, 1582 insertions(+)
>  create mode 100644 drivers/crypto/armv8/Makefile
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>
> diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
> new file mode 100644
> index 0000000..dc5ea02
> --- /dev/null
> +++ b/drivers/crypto/armv8/Makefile
> @@ -0,0 +1,73 @@
> +#
> +#   BSD LICENSE
> +#
> +#   Copyright (C) Cavium networks Ltd. 2017.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Cavium networks nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +#
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +ifneq ($(MAKECMDGOALS),clean)
> +ifneq ($(MAKECMDGOALS),config)
> +ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
> +$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
> +endif
> +endif
> +endif
> +
> +# library name
> +LIB = librte_pmd_armv8.a
> +
> +# build flags
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +CFLAGS += -L$(RTE_SDK)/../openssl -I$(RTE_SDK)/../openssl/include

Is it really needed?

> +
> +# library version
> +LIBABIVER := 1
> +
> +# versioning export map
> +EXPORT_MAP := rte_armv8_pmd_version.map
> +
> +# external library dependencies
> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
> +LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
> +
> +# library source files
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
> +
> +# library dependencies
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
> new file mode 100644
> index 0000000..39433bb
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd.c
> @@ -0,0 +1,926 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright (C) Cavium networks Ltd. 2017.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Cavium networks nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_hexdump.h>
> +#include <rte_cryptodev.h>
> +#include <rte_cryptodev_pmd.h>
> +#include <rte_vdev.h>
> +#include <rte_malloc.h>
> +#include <rte_cpuflags.h>
> +
> +#include "armv8_crypto_defs.h"
> +
> +#include "rte_armv8_pmd_private.h"
> +
> +static int cryptodev_armv8_crypto_uninit(const char *name);
> +
> +/**
> + * Pointers to the supported combined mode crypto functions are stored
> + * in the static tables. Each combined (chained) cryptographic operation
> + * can be decribed by a set of numbers:
> + * - order:    order of operations (cipher, auth) or (auth, cipher)
> + * - direction:        encryption or decryption
> + * - calg:     cipher algorithm such as AES_CBC, AES_CTR, etc.
> + * - aalg:     authentication algorithm such as SHA1, SHA256, etc.
> + * - keyl:     cipher key length, for example 128, 192, 256 bits
> + *
> + * In order to quickly acquire each function pointer based on those numbers,
> + * a hierarchy of arrays is maintained. The final level, 3D array is indexed
> + * by the combined mode function parameters only (cipher algorithm,
> + * authentication algorithm and key length).
> + *
> + * This gives 3 memory accesses to obtain a function pointer instead of
> + * traversing the array manually and comparing function parameters on each loop.
> + *
> + *                   +--+CRYPTO_FUNC
> + *            +--+ENC|
> + *      +--+CA|
> + *      |     +--+DEC
> + * ORDER|
> + *      |     +--+ENC
> + *      +--+AC|
> + *            +--+DEC
> + *
> + */
> +
> +/**
> + * 3D array type for ARM Combined Mode crypto functions pointers.
> + * CRYPTO_CIPHER_MAX:                  max cipher ID number
> + * CRYPTO_AUTH_MAX:                    max auth ID number
> + * CRYPTO_CIPHER_KEYLEN_MAX:           max key length ID number
> + */
> +typedef const crypto_func_t
> +crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
> +
> +/* Evaluate to key length definition */
> +#define KEYL(keyl)             (ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
> +
> +/* Local aliases for supported ciphers */
> +#define CIPH_AES_CBC           RTE_CRYPTO_CIPHER_AES_CBC
> +/* Local aliases for supported hashes */
> +#define AUTH_SHA1_HMAC         RTE_CRYPTO_AUTH_SHA1_HMAC
> +#define AUTH_SHA256            RTE_CRYPTO_AUTH_SHA256
> +#define AUTH_SHA256_HMAC       RTE_CRYPTO_AUTH_SHA256_HMAC
> +
> +/**
> + * Arrays containing pointers to particular cryptographic,
> + * combined mode functions.
> + * crypto_op_ca_encrypt:       cipher (encrypt), authenticate
> + * crypto_op_ca_decrypt:       cipher (decrypt), authenticate
> + * crypto_op_ac_encrypt:       authenticate, cipher (encrypt)
> + * crypto_op_ac_decrypt:       authenticate, cipher (decrypt)
> + */
> +static const crypto_func_tbl_t
> +crypto_op_ca_encrypt = {
> +       /* [cipher alg][auth alg][key length] = crypto_function, */
> +       [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
> +       [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
> +};
> +
> +static const crypto_func_tbl_t
> +crypto_op_ca_decrypt = {
> +       NULL
> +};
> +
> +static const crypto_func_tbl_t
> +crypto_op_ac_encrypt = {
> +       NULL
> +};
> +
> +static const crypto_func_tbl_t
> +crypto_op_ac_decrypt = {
> +       /* [cipher alg][auth alg][key length] = crypto_function, */
> +       [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
> +       [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
> +};
> +
> +/**
> + * Arrays containing pointers to particular cryptographic function sets,
> + * covering given cipher operation directions (encrypt, decrypt)
> + * for each order of cipher and authentication pairs.
> + */
> +static const crypto_func_tbl_t *
> +crypto_cipher_auth[] = {
> +       &crypto_op_ca_encrypt,
> +       &crypto_op_ca_decrypt,
> +       NULL
> +};
> +
> +static const crypto_func_tbl_t *
> +crypto_auth_cipher[] = {
> +       &crypto_op_ac_encrypt,
> +       &crypto_op_ac_decrypt,
> +       NULL
> +};
> +
> +/**
> + * Top level array containing pointers to particular cryptographic
> + * function sets, covering given order of chained operations.
> + * crypto_cipher_auth: cipher first, authenticate after
> + * crypto_auth_cipher: authenticate first, cipher after
> + */
> +static const crypto_func_tbl_t **
> +crypto_chain_order[] = {
> +       crypto_cipher_auth,
> +       crypto_auth_cipher,
> +       NULL
> +};
> +
> +/**
> + * Extract particular combined mode crypto function from the 3D array.
> + */
> +#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)                  \
> +({                                                                     \
> +       crypto_func_tbl_t *func_tbl =                                   \
> +                               (crypto_chain_order[(order)])[(cop)];   \
> +                                                                       \
> +       ((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);              \
> +})
> +
> +/*----------------------------------------------------------------------------*/
> +
> +/**
> + * 2D array type for ARM key schedule functions pointers.
> + * CRYPTO_CIPHER_MAX:                  max cipher ID number
> + * CRYPTO_CIPHER_KEYLEN_MAX:           max key length ID number
> + */
> +typedef const crypto_key_sched_t
> +crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
> +
> +static const crypto_key_sched_tbl_t
> +crypto_key_sched_encrypt = {
> +       /* [cipher alg][key length] = key_expand_func, */
> +       [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
> +};
> +
> +static const crypto_key_sched_tbl_t
> +crypto_key_sched_decrypt = {
> +       /* [cipher alg][key length] = key_expand_func, */
> +       [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
> +};
> +
> +/**
> + * Top level array containing pointers to particular key generation
> + * function sets, covering given operation direction.
> + * crypto_key_sched_encrypt:   keys for encryption
> + * crypto_key_sched_decrypt:   keys for decryption
> + */
> +static const crypto_key_sched_tbl_t *
> +crypto_key_sched_dir[] = {
> +       &crypto_key_sched_encrypt,
> +       &crypto_key_sched_decrypt,
> +       NULL
> +};
> +
> +/**
> + * Extract particular combined mode crypto function from the 3D array.
> + */
> +#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)                          \
> +({                                                                     \
> +       crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];   \
> +                                                                       \
> +       ((*ks_tbl)[(calg)][KEYL(keyl)]);                                \
> +})
> +
> +/*----------------------------------------------------------------------------*/
> +
> +/**
> + * Global static parameter used to create a unique name for each
> + * ARMV8 crypto device.
> + */
> +static unsigned int unique_name_id;
> +
> +static inline int
> +create_unique_device_name(char *name, size_t size)
> +{
> +       int ret;
> +
> +       if (name == NULL)
> +               return -EINVAL;
> +
> +       ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
> +                       unique_name_id++);
> +       if (ret < 0)
> +               return ret;
> +       return 0;
> +}
> +
> +/*
> + *------------------------------------------------------------------------------
> + * Session Prepare
> + *------------------------------------------------------------------------------
> + */
> +
> +/** Get xform chain order */
> +static enum armv8_crypto_chain_order
> +armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
> +{
> +
> +       /*
> +        * This driver currently covers only chained operations.
> +        * Ignore only cipher or only authentication operations
> +        * or chains longer than 2 xform structures.
> +        */
> +       if (xform->next == NULL || xform->next->next != NULL)
> +               return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
> +
> +       if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
> +               if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
> +                       return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
> +       }
> +
> +       if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
> +               if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
> +                       return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
> +       }
> +
> +       return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
> +}
> +
> +static inline void
> +auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
> +                               const struct rte_crypto_sym_xform *xform)
> +{
> +       size_t i;
> +
> +       /* Generate i_key_pad and o_key_pad */
> +       memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
> +       rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
> +                                                       xform->auth.key.length);
> +       memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
> +       rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
> +                                                       xform->auth.key.length);
> +       /*
> +        * XOR key with IPAD/OPAD values to obtain i_key_pad
> +        * and o_key_pad.
> +        * Byte-by-byte operation may seem to be the less efficient
> +        * here but in fact it's the opposite.
> +        * The result ASM code is likely operate on NEON registers
> +        * (load auth key to Qx, load IPAD/OPAD to multiple
> +        * elements of Qy, eor 128 bits at once).
> +        */
> +       for (i = 0; i < SHA_BLOCK_MAX; i++) {
> +               sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
> +               sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
> +       }
> +}
> +
> +static inline int
> +auth_set_prerequisites(struct armv8_crypto_session *sess,
> +                       const struct rte_crypto_sym_xform *xform)
> +{
> +       uint8_t partial[64] = { 0 };
> +       int error;
> +
> +       switch (xform->auth.algo) {
> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
> +               /*
> +                * Generate authentication key, i_key_pad and o_key_pad.
> +                */
> +               /* Zero memory under key */
> +               memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
> +
> +               if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
> +                       /*
> +                        * In case the key is longer than 160 bits
> +                        * the algorithm will use SHA1(key) instead.
> +                        */
> +                       error = sha1_block(NULL, xform->auth.key.data,
> +                               sess->auth.hmac.key, xform->auth.key.length);
> +                       if (error != 0)
> +                               return -1;
> +               } else {
> +                       /*
> +                        * Now copy the given authentication key to the session
> +                        * key assuming that the session key is zeroed there is
> +                        * no need for additional zero padding if the key is
> +                        * shorter than SHA1_AUTH_KEY_LENGTH.
> +                        */
> +                       rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
> +                                                       xform->auth.key.length);
> +               }
> +
> +               /* Prepare HMAC padding: key|pattern */
> +               auth_hmac_pad_prepare(sess, xform);
> +               /*
> +                * Calculate partial hash values for i_key_pad and o_key_pad.
> +                * Will be used as initialization state for final HMAC.
> +                */
> +               error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
> +                   partial, SHA1_BLOCK_SIZE);
> +               if (error != 0)
> +                       return -1;
> +               memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
> +
> +               error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
> +                   partial, SHA1_BLOCK_SIZE);
> +               if (error != 0)
> +                       return -1;
> +               memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
> +
> +               break;
> +       case RTE_CRYPTO_AUTH_SHA256_HMAC:
> +               /*
> +                * Generate authentication key, i_key_pad and o_key_pad.
> +                */
> +               /* Zero memory under key */
> +               memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
> +
> +               if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
> +                       /*
> +                        * In case the key is longer than 256 bits
> +                        * the algorithm will use SHA256(key) instead.
> +                        */
> +                       error = sha256_block(NULL, xform->auth.key.data,
> +                               sess->auth.hmac.key, xform->auth.key.length);
> +                       if (error != 0)
> +                               return -1;
> +               } else {
> +                       /*
> +                        * Now copy the given authentication key to the session
> +                        * key assuming that the session key is zeroed there is
> +                        * no need for additional zero padding if the key is
> +                        * shorter than SHA256_AUTH_KEY_LENGTH.
> +                        */
> +                       rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
> +                                                       xform->auth.key.length);
> +               }
> +
> +               /* Prepare HMAC padding: key|pattern */
> +               auth_hmac_pad_prepare(sess, xform);
> +               /*
> +                * Calculate partial hash values for i_key_pad and o_key_pad.
> +                * Will be used as initialization state for final HMAC.
> +                */
> +               error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
> +                   partial, SHA256_BLOCK_SIZE);
> +               if (error != 0)
> +                       return -1;
> +               memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
> +
> +               error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
> +                   partial, SHA256_BLOCK_SIZE);
> +               if (error != 0)
> +                       return -1;
> +               memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
> +
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int
> +cipher_set_prerequisites(struct armv8_crypto_session *sess,
> +                       const struct rte_crypto_sym_xform *xform)
> +{
> +       crypto_key_sched_t cipher_key_sched;
> +
> +       cipher_key_sched = sess->cipher.key_sched;
> +       if (likely(cipher_key_sched != NULL)) {
> +               /* Set up cipher session key */
> +               cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
> +       }
> +
> +       return 0;
> +}
> +
> +static int
> +armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
> +               const struct rte_crypto_sym_xform *cipher_xform,
> +               const struct rte_crypto_sym_xform *auth_xform)
> +{
> +       enum armv8_crypto_chain_order order;
> +       enum armv8_crypto_cipher_operation cop;
> +       enum rte_crypto_cipher_algorithm calg;
> +       enum rte_crypto_auth_algorithm aalg;
> +
> +       /* Validate and prepare scratch order of combined operations */
> +       switch (sess->chain_order) {
> +       case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
> +       case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
> +               order = sess->chain_order;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +       /* Select cipher direction */
> +       sess->cipher.direction = cipher_xform->cipher.op;
> +       /* Select cipher key */
> +       sess->cipher.key.length = cipher_xform->cipher.key.length;
> +       /* Set cipher direction */
> +       cop = sess->cipher.direction;
> +       /* Set cipher algorithm */
> +       calg = cipher_xform->cipher.algo;
> +
> +       /* Select cipher algo */
> +       switch (calg) {
> +       /* Cover supported cipher algorithms */
> +       case RTE_CRYPTO_CIPHER_AES_CBC:
> +               sess->cipher.algo = calg;
> +               /* IV len is always 16 bytes (block size) for AES CBC */
> +               sess->cipher.iv_len = 16;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +       /* Select auth generate/verify */
> +       sess->auth.operation = auth_xform->auth.op;
> +
> +       /* Select auth algo */
> +       switch (auth_xform->auth.algo) {
> +       /* Cover supported hash algorithms */
> +       case RTE_CRYPTO_AUTH_SHA256:
> +               aalg = auth_xform->auth.algo;
> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
> +               break;
> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
> +       case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
> +               aalg = auth_xform->auth.algo;
> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       /* Verify supported key lengths and extract proper algorithm */
> +       switch (cipher_xform->cipher.key.length << 3) {
> +       case 128:
> +               sess->crypto_func =
> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
> +               sess->cipher.key_sched =
> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 128);
> +               break;
> +       case 192:
> +               sess->crypto_func =
> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
> +               sess->cipher.key_sched =
> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 192);
> +               break;
> +       case 256:
> +               sess->crypto_func =
> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
> +               sess->cipher.key_sched =
> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 256);
> +               break;
> +       default:
> +               sess->crypto_func = NULL;
> +               sess->cipher.key_sched = NULL;
> +               return -EINVAL;
> +       }
> +
> +       if (unlikely(sess->crypto_func == NULL)) {
> +               /*
> +                * If we got here that means that there must be a bug

Since AES-128-CBC is only supported in your patch. It means that
crypto_func could be NULL according to the switch above if
cipher.key.length > 128?

> +                * in the algorithms selection above. Nevertheless keep
> +                * it here to catch bug immediately and avoid NULL pointer
> +                * dereference in OPs processing.
> +                */
> +               ARMV8_CRYPTO_LOG_ERR(
> +                       "No appropriate crypto function for given parameters");
> +               return -EINVAL;
> +       }
> +
> +       /* Set up cipher session prerequisites */
> +       if (cipher_set_prerequisites(sess, cipher_xform) != 0)
> +               return -EINVAL;
> +
> +       /* Set up authentication session prerequisites */
> +       if (auth_set_prerequisites(sess, auth_xform) != 0)
> +               return -EINVAL;
> +
> +       return 0;
> +}
> +

....

> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> new file mode 100644
> index 0000000..2bf6475
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> @@ -0,0 +1,369 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright (C) Cavium networks Ltd. 2017.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Cavium networks nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_malloc.h>
> +#include <rte_cryptodev_pmd.h>
> +
> +#include "armv8_crypto_defs.h"
> +
> +#include "rte_armv8_pmd_private.h"
> +
> +static const struct rte_cryptodev_capabilities
> +       armv8_crypto_pmd_capabilities[] = {
> +       {       /* SHA1 HMAC */
> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +                       {.sym = {
> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
> +                               {.auth = {
> +                                       .algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
> +                                       .block_size = 64,
> +                                       .key_size = {
> +                                               .min = 16,
> +                                               .max = 128,
> +                                               .increment = 0
> +                                       },
> +                                       .digest_size = {
> +                                               .min = 20,
> +                                               .max = 20,
> +                                               .increment = 0
> +                                       },
> +                                       .aad_size = { 0 }
> +                               }, }
> +                       }, }
> +       },
> +       {       /* SHA256 HMAC */
> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +                       {.sym = {
> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
> +                               {.auth = {
> +                                       .algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
> +                                       .block_size = 64,
> +                                       .key_size = {
> +                                               .min = 16,
> +                                               .max = 128,
> +                                               .increment = 0
> +                                       },
> +                                       .digest_size = {
> +                                               .min = 32,
> +                                               .max = 32,
> +                                               .increment = 0
> +                                       },
> +                                       .aad_size = { 0 }
> +                               }, }
> +                       }, }
> +       },
> +       {       /* AES CBC */
> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +                       {.sym = {
> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
> +                               {.cipher = {
> +                                       .algo = RTE_CRYPTO_CIPHER_AES_CBC,
> +                                       .block_size = 16,
> +                                       .key_size = {
> +                                               .min = 16,
> +                                               .max = 16,
> +                                               .increment = 0
> +                                       },
> +                                       .iv_size = {
> +                                               .min = 16,
> +                                               .max = 16,
> +                                               .increment = 0
> +                                       }
> +                               }, }
> +                       }, }
> +       },
> +

It's strange that you defined aes and hmac here, but not implemented
them, though their combinations are implemented.
Will you add later?

> +       RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
> +};
> +
> +

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (7 preceding siblings ...)
  2017-01-04 17:33       ` [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2017-01-10 17:11       ` De Lara Guarch, Pablo
  2017-01-10 17:50         ` Zbigniew Bodek
  2017-01-13  8:07       ` Hemant Agrawal
  9 siblings, 1 reply; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-10 17:11 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: Doherty, Declan, jerin.jacob

Hi Zbigniew,


> -----Original Message-----
> From: zbigniew.bodek@caviumnetworks.com
> [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, January 04, 2017 5:33 PM
> To: dev@dpdk.org
> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> jerin.jacob@caviumnetworks.com; Zbigniew Bodek
> Subject: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

...

> 
> Zbigniew Bodek (8):
>   mk: fix build of assembly files for ARM64
>   lib: add cryptodev type for the upcoming ARMv8 PMD
>   crypto/armv8: add PMD optimized for ARMv8 processors
>   mk/crypto/armv8: add PMD to the build system
>   doc/armv8: update documentation about crypto PMD
>   crypto/armv8: enable ARMv8 PMD in the configuration
>   crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
>   app/test: add ARMv8 crypto tests and test vectors

Thanks for this patchset.

Could you change the titles of some of these patches?
The prefix should be "mk:" and not "mk/crypto/armv8", for instance.
The other ones that should be changed are "doc/armv8" -> "doc" and "crypto/armv8: update MAINTAINERS" to "MAINTAINERS:".

I can do this for you, if you are OK with these changes.

Apart from this, can anyone review these changes? I do not have access to an ARM board,
so it is a bit difficult for me to review it.

Thanks,
Pablo 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-10 17:11       ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
@ 2017-01-10 17:50         ` Zbigniew Bodek
  0 siblings, 0 replies; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-10 17:50 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, dev; +Cc: Doherty, Declan, jerin.jacob

Hello Pablo,

Please check my answers in-line below.

Kind regards
Zbigniew

On 10.01.2017 18:11, De Lara Guarch, Pablo wrote:
> Hi Zbigniew,
>
>
>> -----Original Message-----
>> From: zbigniew.bodek@caviumnetworks.com
>> [mailto:zbigniew.bodek@caviumnetworks.com]
>> Sent: Wednesday, January 04, 2017 5:33 PM
>> To: dev@dpdk.org
>> Cc: De Lara Guarch, Pablo; Doherty, Declan;
>> jerin.jacob@caviumnetworks.com; Zbigniew Bodek
>> Subject: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
>>
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> ...
>
>>
>> Zbigniew Bodek (8):
>>   mk: fix build of assembly files for ARM64
>>   lib: add cryptodev type for the upcoming ARMv8 PMD
>>   crypto/armv8: add PMD optimized for ARMv8 processors
>>   mk/crypto/armv8: add PMD to the build system
>>   doc/armv8: update documentation about crypto PMD
>>   crypto/armv8: enable ARMv8 PMD in the configuration
>>   crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
>>   app/test: add ARMv8 crypto tests and test vectors
>
> Thanks for this patchset.
>
> Could you change the titles of some of these patches?
> The prefix should be "mk:" and not "mk/crypto/armv8", for instance.
> The other ones that should be changed are "doc/armv8" -> "doc" and "crypto/armv8: update MAINTAINERS" to "MAINTAINERS:".
>
> I can do this for you, if you are OK with these changes.

I'm OK with the changes and I will appreciate changing those names if 
this is not an inconvenience for you.

>
> Apart from this, can anyone review these changes? I do not have access to an ARM board,
> so it is a bit difficult for me to review it.

I would like to add that I can help with the build and installation in 
case the documentation is not sufficient (this would also mean changing 
the documentation).

>
> Thanks,
> Pablo
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-04 17:33       ` [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2017-01-12 10:48         ` De Lara Guarch, Pablo
  2017-01-12 11:50           ` Zbigniew Bodek
  2017-01-13  9:28         ` Hemant Agrawal
  1 sibling, 1 reply; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-12 10:48 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: Doherty, Declan, jerin.jacob

Hi Bodek,

> -----Original Message-----
> From: zbigniew.bodek@caviumnetworks.com
> [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, January 04, 2017 5:33 PM
> To: dev@dpdk.org
> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> jerin.jacob@caviumnetworks.com; Zbigniew Bodek
> Subject: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Introduce unit tests for ARMv8 crypto PMD.
> Add test vectors for short cases such as 160 bytes.
> These test cases are ARMv8 specific since the code provides
> different processing paths for different input data sizes.
> 
> User can validate correctness of algorithms' implementation using:
> * cryptodev_sw_armv8_autotest
> For performance test one can use:
> * cryptodev_sw_armv8_perftest
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Could you rebase this patchset with the dpdk-next-crypto tree?
There is a compilation error due to a missing parameter in a function that has recently changed.

Thanks,
Pablo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-12 10:48         ` De Lara Guarch, Pablo
@ 2017-01-12 11:50           ` Zbigniew Bodek
  2017-01-12 12:07             ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-12 11:50 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, dev; +Cc: Doherty, Declan, jerin.jacob

Hello Pablo,

On 12.01.2017 11:48, De Lara Guarch, Pablo wrote:
> Hi Bodek,
>
>> -----Original Message-----
>> From: zbigniew.bodek@caviumnetworks.com
>> [mailto:zbigniew.bodek@caviumnetworks.com]
>> Sent: Wednesday, January 04, 2017 5:33 PM
>> To: dev@dpdk.org
>> Cc: De Lara Guarch, Pablo; Doherty, Declan;
>> jerin.jacob@caviumnetworks.com; Zbigniew Bodek
>> Subject: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
>>
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Introduce unit tests for ARMv8 crypto PMD.
>> Add test vectors for short cases such as 160 bytes.
>> These test cases are ARMv8 specific since the code provides
>> different processing paths for different input data sizes.
>>
>> User can validate correctness of algorithms' implementation using:
>> * cryptodev_sw_armv8_autotest
>> For performance test one can use:
>> * cryptodev_sw_armv8_perftest
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Could you rebase this patchset with the dpdk-next-crypto tree?
> There is a compilation error due to a missing parameter in a function that has recently changed.

I see. The rebase is done. Should I send full v4 patchset now?

Kind regards
Zbigniew

>
> Thanks,
> Pablo
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-12 11:50           ` Zbigniew Bodek
@ 2017-01-12 12:07             ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-12 12:07 UTC (permalink / raw)
  To: Zbigniew Bodek, dev; +Cc: Doherty, Declan, jerin.jacob



> -----Original Message-----
> From: Zbigniew Bodek [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Thursday, January 12, 2017 11:51 AM
> To: De Lara Guarch, Pablo; dev@dpdk.org
> Cc: Doherty, Declan; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test
> vectors
> 
> Hello Pablo,
> 
> On 12.01.2017 11:48, De Lara Guarch, Pablo wrote:
> > Hi Bodek,
> >
> >> -----Original Message-----
> >> From: zbigniew.bodek@caviumnetworks.com
> >> [mailto:zbigniew.bodek@caviumnetworks.com]
> >> Sent: Wednesday, January 04, 2017 5:33 PM
> >> To: dev@dpdk.org
> >> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> >> jerin.jacob@caviumnetworks.com; Zbigniew Bodek
> >> Subject: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test
> vectors
> >>
> >> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >>
> >> Introduce unit tests for ARMv8 crypto PMD.
> >> Add test vectors for short cases such as 160 bytes.
> >> These test cases are ARMv8 specific since the code provides
> >> different processing paths for different input data sizes.
> >>
> >> User can validate correctness of algorithms' implementation using:
> >> * cryptodev_sw_armv8_autotest
> >> For performance test one can use:
> >> * cryptodev_sw_armv8_perftest
> >>
> >> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >
> > Could you rebase this patchset with the dpdk-next-crypto tree?
> > There is a compilation error due to a missing parameter in a function that
> has recently changed.
> 
> I see. The rebase is done. Should I send full v4 patchset now?
> 

There are some comments from Jianbo Liu. Take a look at them in case
you have something to change there.

Also, since you are sending a v4 patchset, make the commit name changes too, please.

Thanks,
Pablo

> Kind regards
> Zbigniew
> 
> >
> > Thanks,
> > Pablo
> >

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-06  2:45         ` Jianbo Liu
@ 2017-01-12 13:12           ` Zbigniew Bodek
  2017-01-13  7:41             ` Jianbo Liu
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-12 13:12 UTC (permalink / raw)
  To: Jianbo Liu; +Cc: dev, pablo.de.lara.guarch, Declan Doherty, Jerin Jacob

Hello  Jianbo Liu,

Thanks for the review. Please check my answers in-line.

Kind regards
Zbigniew

On 06.01.2017 03:45, Jianbo Liu wrote:
> On 5 January 2017 at 01:33,  <zbigniew.bodek@caviumnetworks.com> wrote:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> This patch introduces crypto poll mode driver
>> using ARMv8 cryptographic extensions.
>> CPU compatibility with this driver is detected in
>> run-time and virtual crypto device will not be
>> created if CPU doesn't provide:
>> AES, SHA1, SHA2 and NEON.
>>
>> This PMD is optimized to provide performance boost
>> for chained crypto operations processing,
>> such as encryption + HMAC generation,
>> decryption + HMAC validation. In particular,
>> cipher only or hash only operations are
>> not provided.
>>
>> The driver currently supports AES-128-CBC
>> in combination with: SHA256 HMAC and SHA1 HMAC
>> and relies on the external armv8_crypto library:
>> https://github.com/caviumnetworks/armv8_crypto
>>
>
> It's standalone lib. I think you should change the following line in
> its Makefile, so not depend on DPDK.
> "include $(RTE_SDK)/mk/rte.lib.mk"
>
>> This patch adds driver's code only and does
>> not include it in the build system.
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>> ---
>>  drivers/crypto/armv8/Makefile                  |  73 ++
>>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>>  5 files changed, 1582 insertions(+)
>>  create mode 100644 drivers/crypto/armv8/Makefile
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>>
>> diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
>> new file mode 100644
>> index 0000000..dc5ea02
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/Makefile
>> @@ -0,0 +1,73 @@
>> +#
>> +#   BSD LICENSE
>> +#
>> +#   Copyright (C) Cavium networks Ltd. 2017.
>> +#
>> +#   Redistribution and use in source and binary forms, with or without
>> +#   modification, are permitted provided that the following conditions
>> +#   are met:
>> +#
>> +#     * Redistributions of source code must retain the above copyright
>> +#       notice, this list of conditions and the following disclaimer.
>> +#     * Redistributions in binary form must reproduce the above copyright
>> +#       notice, this list of conditions and the following disclaimer in
>> +#       the documentation and/or other materials provided with the
>> +#       distribution.
>> +#     * Neither the name of Cavium networks nor the names of its
>> +#       contributors may be used to endorse or promote products derived
>> +#       from this software without specific prior written permission.
>> +#
>> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> +#
>> +
>> +include $(RTE_SDK)/mk/rte.vars.mk
>> +
>> +ifneq ($(MAKECMDGOALS),clean)
>> +ifneq ($(MAKECMDGOALS),config)
>> +ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
>> +$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
>> +endif
>> +endif
>> +endif
>> +
>> +# library name
>> +LIB = librte_pmd_armv8.a
>> +
>> +# build flags
>> +CFLAGS += -O3
>> +CFLAGS += $(WERROR_FLAGS)
>> +CFLAGS += -L$(RTE_SDK)/../openssl -I$(RTE_SDK)/../openssl/include
>
> Is it really needed?

No. It is removed now.

>
>> +
>> +# library version
>> +LIBABIVER := 1
>> +
>> +# versioning export map
>> +EXPORT_MAP := rte_armv8_pmd_version.map
>> +
>> +# external library dependencies
>> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
>> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
>> +LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
>> +
>> +# library source files
>> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
>> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
>> +
>> +# library dependencies
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
>> +
>> +include $(RTE_SDK)/mk/rte.lib.mk
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
>> new file mode 100644
>> index 0000000..39433bb
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd.c
>> @@ -0,0 +1,926 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Cavium networks nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> + */
>> +
>> +#include <stdbool.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_hexdump.h>
>> +#include <rte_cryptodev.h>
>> +#include <rte_cryptodev_pmd.h>
>> +#include <rte_vdev.h>
>> +#include <rte_malloc.h>
>> +#include <rte_cpuflags.h>
>> +
>> +#include "armv8_crypto_defs.h"
>> +
>> +#include "rte_armv8_pmd_private.h"
>> +
>> +static int cryptodev_armv8_crypto_uninit(const char *name);
>> +
>> +/**
>> + * Pointers to the supported combined mode crypto functions are stored
>> + * in the static tables. Each combined (chained) cryptographic operation
>> + * can be decribed by a set of numbers:
>> + * - order:    order of operations (cipher, auth) or (auth, cipher)
>> + * - direction:        encryption or decryption
>> + * - calg:     cipher algorithm such as AES_CBC, AES_CTR, etc.
>> + * - aalg:     authentication algorithm such as SHA1, SHA256, etc.
>> + * - keyl:     cipher key length, for example 128, 192, 256 bits
>> + *
>> + * In order to quickly acquire each function pointer based on those numbers,
>> + * a hierarchy of arrays is maintained. The final level, 3D array is indexed
>> + * by the combined mode function parameters only (cipher algorithm,
>> + * authentication algorithm and key length).
>> + *
>> + * This gives 3 memory accesses to obtain a function pointer instead of
>> + * traversing the array manually and comparing function parameters on each loop.
>> + *
>> + *                   +--+CRYPTO_FUNC
>> + *            +--+ENC|
>> + *      +--+CA|
>> + *      |     +--+DEC
>> + * ORDER|
>> + *      |     +--+ENC
>> + *      +--+AC|
>> + *            +--+DEC
>> + *
>> + */
>> +
>> +/**
>> + * 3D array type for ARM Combined Mode crypto functions pointers.
>> + * CRYPTO_CIPHER_MAX:                  max cipher ID number
>> + * CRYPTO_AUTH_MAX:                    max auth ID number
>> + * CRYPTO_CIPHER_KEYLEN_MAX:           max key length ID number
>> + */
>> +typedef const crypto_func_t
>> +crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
>> +
>> +/* Evaluate to key length definition */
>> +#define KEYL(keyl)             (ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
>> +
>> +/* Local aliases for supported ciphers */
>> +#define CIPH_AES_CBC           RTE_CRYPTO_CIPHER_AES_CBC
>> +/* Local aliases for supported hashes */
>> +#define AUTH_SHA1_HMAC         RTE_CRYPTO_AUTH_SHA1_HMAC
>> +#define AUTH_SHA256            RTE_CRYPTO_AUTH_SHA256
>> +#define AUTH_SHA256_HMAC       RTE_CRYPTO_AUTH_SHA256_HMAC
>> +
>> +/**
>> + * Arrays containing pointers to particular cryptographic,
>> + * combined mode functions.
>> + * crypto_op_ca_encrypt:       cipher (encrypt), authenticate
>> + * crypto_op_ca_decrypt:       cipher (decrypt), authenticate
>> + * crypto_op_ac_encrypt:       authenticate, cipher (encrypt)
>> + * crypto_op_ac_decrypt:       authenticate, cipher (decrypt)
>> + */
>> +static const crypto_func_tbl_t
>> +crypto_op_ca_encrypt = {
>> +       /* [cipher alg][auth alg][key length] = crypto_function, */
>> +       [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
>> +       [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
>> +};
>> +
>> +static const crypto_func_tbl_t
>> +crypto_op_ca_decrypt = {
>> +       NULL
>> +};
>> +
>> +static const crypto_func_tbl_t
>> +crypto_op_ac_encrypt = {
>> +       NULL
>> +};
>> +
>> +static const crypto_func_tbl_t
>> +crypto_op_ac_decrypt = {
>> +       /* [cipher alg][auth alg][key length] = crypto_function, */
>> +       [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
>> +       [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
>> +};
>> +
>> +/**
>> + * Arrays containing pointers to particular cryptographic function sets,
>> + * covering given cipher operation directions (encrypt, decrypt)
>> + * for each order of cipher and authentication pairs.
>> + */
>> +static const crypto_func_tbl_t *
>> +crypto_cipher_auth[] = {
>> +       &crypto_op_ca_encrypt,
>> +       &crypto_op_ca_decrypt,
>> +       NULL
>> +};
>> +
>> +static const crypto_func_tbl_t *
>> +crypto_auth_cipher[] = {
>> +       &crypto_op_ac_encrypt,
>> +       &crypto_op_ac_decrypt,
>> +       NULL
>> +};
>> +
>> +/**
>> + * Top level array containing pointers to particular cryptographic
>> + * function sets, covering given order of chained operations.
>> + * crypto_cipher_auth: cipher first, authenticate after
>> + * crypto_auth_cipher: authenticate first, cipher after
>> + */
>> +static const crypto_func_tbl_t **
>> +crypto_chain_order[] = {
>> +       crypto_cipher_auth,
>> +       crypto_auth_cipher,
>> +       NULL
>> +};
>> +
>> +/**
>> + * Extract particular combined mode crypto function from the 3D array.
>> + */
>> +#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)                  \
>> +({                                                                     \
>> +       crypto_func_tbl_t *func_tbl =                                   \
>> +                               (crypto_chain_order[(order)])[(cop)];   \
>> +                                                                       \
>> +       ((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);              \
>> +})
>> +
>> +/*----------------------------------------------------------------------------*/
>> +
>> +/**
>> + * 2D array type for ARM key schedule functions pointers.
>> + * CRYPTO_CIPHER_MAX:                  max cipher ID number
>> + * CRYPTO_CIPHER_KEYLEN_MAX:           max key length ID number
>> + */
>> +typedef const crypto_key_sched_t
>> +crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
>> +
>> +static const crypto_key_sched_tbl_t
>> +crypto_key_sched_encrypt = {
>> +       /* [cipher alg][key length] = key_expand_func, */
>> +       [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
>> +};
>> +
>> +static const crypto_key_sched_tbl_t
>> +crypto_key_sched_decrypt = {
>> +       /* [cipher alg][key length] = key_expand_func, */
>> +       [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
>> +};
>> +
>> +/**
>> + * Top level array containing pointers to particular key generation
>> + * function sets, covering given operation direction.
>> + * crypto_key_sched_encrypt:   keys for encryption
>> + * crypto_key_sched_decrypt:   keys for decryption
>> + */
>> +static const crypto_key_sched_tbl_t *
>> +crypto_key_sched_dir[] = {
>> +       &crypto_key_sched_encrypt,
>> +       &crypto_key_sched_decrypt,
>> +       NULL
>> +};
>> +
>> +/**
>> + * Extract particular combined mode crypto function from the 3D array.
>> + */
>> +#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)                          \
>> +({                                                                     \
>> +       crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];   \
>> +                                                                       \
>> +       ((*ks_tbl)[(calg)][KEYL(keyl)]);                                \
>> +})
>> +
>> +/*----------------------------------------------------------------------------*/
>> +
>> +/**
>> + * Global static parameter used to create a unique name for each
>> + * ARMV8 crypto device.
>> + */
>> +static unsigned int unique_name_id;
>> +
>> +static inline int
>> +create_unique_device_name(char *name, size_t size)
>> +{
>> +       int ret;
>> +
>> +       if (name == NULL)
>> +               return -EINVAL;
>> +
>> +       ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
>> +                       unique_name_id++);
>> +       if (ret < 0)
>> +               return ret;
>> +       return 0;
>> +}
>> +
>> +/*
>> + *------------------------------------------------------------------------------
>> + * Session Prepare
>> + *------------------------------------------------------------------------------
>> + */
>> +
>> +/** Get xform chain order */
>> +static enum armv8_crypto_chain_order
>> +armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
>> +{
>> +
>> +       /*
>> +        * This driver currently covers only chained operations.
>> +        * Ignore only cipher or only authentication operations
>> +        * or chains longer than 2 xform structures.
>> +        */
>> +       if (xform->next == NULL || xform->next->next != NULL)
>> +               return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
>> +
>> +       if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
>> +               if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
>> +                       return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
>> +       }
>> +
>> +       if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
>> +               if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
>> +                       return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
>> +       }
>> +
>> +       return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
>> +}
>> +
>> +static inline void
>> +auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
>> +                               const struct rte_crypto_sym_xform *xform)
>> +{
>> +       size_t i;
>> +
>> +       /* Generate i_key_pad and o_key_pad */
>> +       memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
>> +       rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
>> +                                                       xform->auth.key.length);
>> +       memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
>> +       rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
>> +                                                       xform->auth.key.length);
>> +       /*
>> +        * XOR key with IPAD/OPAD values to obtain i_key_pad
>> +        * and o_key_pad.
>> +        * Byte-by-byte operation may seem to be the less efficient
>> +        * here but in fact it's the opposite.
>> +        * The result ASM code is likely operate on NEON registers
>> +        * (load auth key to Qx, load IPAD/OPAD to multiple
>> +        * elements of Qy, eor 128 bits at once).
>> +        */
>> +       for (i = 0; i < SHA_BLOCK_MAX; i++) {
>> +               sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
>> +               sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
>> +       }
>> +}
>> +
>> +static inline int
>> +auth_set_prerequisites(struct armv8_crypto_session *sess,
>> +                       const struct rte_crypto_sym_xform *xform)
>> +{
>> +       uint8_t partial[64] = { 0 };
>> +       int error;
>> +
>> +       switch (xform->auth.algo) {
>> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
>> +               /*
>> +                * Generate authentication key, i_key_pad and o_key_pad.
>> +                */
>> +               /* Zero memory under key */
>> +               memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
>> +
>> +               if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
>> +                       /*
>> +                        * In case the key is longer than 160 bits
>> +                        * the algorithm will use SHA1(key) instead.
>> +                        */
>> +                       error = sha1_block(NULL, xform->auth.key.data,
>> +                               sess->auth.hmac.key, xform->auth.key.length);
>> +                       if (error != 0)
>> +                               return -1;
>> +               } else {
>> +                       /*
>> +                        * Now copy the given authentication key to the session
>> +                        * key assuming that the session key is zeroed there is
>> +                        * no need for additional zero padding if the key is
>> +                        * shorter than SHA1_AUTH_KEY_LENGTH.
>> +                        */
>> +                       rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
>> +                                                       xform->auth.key.length);
>> +               }
>> +
>> +               /* Prepare HMAC padding: key|pattern */
>> +               auth_hmac_pad_prepare(sess, xform);
>> +               /*
>> +                * Calculate partial hash values for i_key_pad and o_key_pad.
>> +                * Will be used as initialization state for final HMAC.
>> +                */
>> +               error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
>> +                   partial, SHA1_BLOCK_SIZE);
>> +               if (error != 0)
>> +                       return -1;
>> +               memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
>> +
>> +               error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
>> +                   partial, SHA1_BLOCK_SIZE);
>> +               if (error != 0)
>> +                       return -1;
>> +               memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
>> +
>> +               break;
>> +       case RTE_CRYPTO_AUTH_SHA256_HMAC:
>> +               /*
>> +                * Generate authentication key, i_key_pad and o_key_pad.
>> +                */
>> +               /* Zero memory under key */
>> +               memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
>> +
>> +               if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
>> +                       /*
>> +                        * In case the key is longer than 256 bits
>> +                        * the algorithm will use SHA256(key) instead.
>> +                        */
>> +                       error = sha256_block(NULL, xform->auth.key.data,
>> +                               sess->auth.hmac.key, xform->auth.key.length);
>> +                       if (error != 0)
>> +                               return -1;
>> +               } else {
>> +                       /*
>> +                        * Now copy the given authentication key to the session
>> +                        * key assuming that the session key is zeroed there is
>> +                        * no need for additional zero padding if the key is
>> +                        * shorter than SHA256_AUTH_KEY_LENGTH.
>> +                        */
>> +                       rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
>> +                                                       xform->auth.key.length);
>> +               }
>> +
>> +               /* Prepare HMAC padding: key|pattern */
>> +               auth_hmac_pad_prepare(sess, xform);
>> +               /*
>> +                * Calculate partial hash values for i_key_pad and o_key_pad.
>> +                * Will be used as initialization state for final HMAC.
>> +                */
>> +               error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
>> +                   partial, SHA256_BLOCK_SIZE);
>> +               if (error != 0)
>> +                       return -1;
>> +               memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
>> +
>> +               error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
>> +                   partial, SHA256_BLOCK_SIZE);
>> +               if (error != 0)
>> +                       return -1;
>> +               memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
>> +
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int
>> +cipher_set_prerequisites(struct armv8_crypto_session *sess,
>> +                       const struct rte_crypto_sym_xform *xform)
>> +{
>> +       crypto_key_sched_t cipher_key_sched;
>> +
>> +       cipher_key_sched = sess->cipher.key_sched;
>> +       if (likely(cipher_key_sched != NULL)) {
>> +               /* Set up cipher session key */
>> +               cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int
>> +armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
>> +               const struct rte_crypto_sym_xform *cipher_xform,
>> +               const struct rte_crypto_sym_xform *auth_xform)
>> +{
>> +       enum armv8_crypto_chain_order order;
>> +       enum armv8_crypto_cipher_operation cop;
>> +       enum rte_crypto_cipher_algorithm calg;
>> +       enum rte_crypto_auth_algorithm aalg;
>> +
>> +       /* Validate and prepare scratch order of combined operations */
>> +       switch (sess->chain_order) {
>> +       case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
>> +       case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
>> +               order = sess->chain_order;
>> +               break;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +       /* Select cipher direction */
>> +       sess->cipher.direction = cipher_xform->cipher.op;
>> +       /* Select cipher key */
>> +       sess->cipher.key.length = cipher_xform->cipher.key.length;
>> +       /* Set cipher direction */
>> +       cop = sess->cipher.direction;
>> +       /* Set cipher algorithm */
>> +       calg = cipher_xform->cipher.algo;
>> +
>> +       /* Select cipher algo */
>> +       switch (calg) {
>> +       /* Cover supported cipher algorithms */
>> +       case RTE_CRYPTO_CIPHER_AES_CBC:
>> +               sess->cipher.algo = calg;
>> +               /* IV len is always 16 bytes (block size) for AES CBC */
>> +               sess->cipher.iv_len = 16;
>> +               break;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +       /* Select auth generate/verify */
>> +       sess->auth.operation = auth_xform->auth.op;
>> +
>> +       /* Select auth algo */
>> +       switch (auth_xform->auth.algo) {
>> +       /* Cover supported hash algorithms */
>> +       case RTE_CRYPTO_AUTH_SHA256:
>> +               aalg = auth_xform->auth.algo;
>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
>> +               break;
>> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
>> +       case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
>> +               aalg = auth_xform->auth.algo;
>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
>> +               break;
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +
>> +       /* Verify supported key lengths and extract proper algorithm */
>> +       switch (cipher_xform->cipher.key.length << 3) {
>> +       case 128:
>> +               sess->crypto_func =
>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
>> +               sess->cipher.key_sched =
>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 128);
>> +               break;
>> +       case 192:
>> +               sess->crypto_func =
>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
>> +               sess->cipher.key_sched =
>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 192);
>> +               break;
>> +       case 256:
>> +               sess->crypto_func =
>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
>> +               sess->cipher.key_sched =
>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 256);
>> +               break;
>> +       default:
>> +               sess->crypto_func = NULL;
>> +               sess->cipher.key_sched = NULL;
>> +               return -EINVAL;
>> +       }
>> +
>> +       if (unlikely(sess->crypto_func == NULL)) {
>> +               /*
>> +                * If we got here that means that there must be a bug
>
> Since AES-128-CBC is only supported in your patch. It means that
> crypto_func could be NULL according to the switch above if
> cipher.key.length > 128?

Yes. Instead of checking for key lengths in a similar way that we check 
for algorithms, etc. we just fail when we don't find appropriate 
function. Do you suggest that this should be changed?

>
>> +                * in the algorithms selection above. Nevertheless keep
>> +                * it here to catch bug immediately and avoid NULL pointer
>> +                * dereference in OPs processing.
>> +                */
>> +               ARMV8_CRYPTO_LOG_ERR(
>> +                       "No appropriate crypto function for given parameters");
>> +               return -EINVAL;
>> +       }
>> +
>> +       /* Set up cipher session prerequisites */
>> +       if (cipher_set_prerequisites(sess, cipher_xform) != 0)
>> +               return -EINVAL;
>> +
>> +       /* Set up authentication session prerequisites */
>> +       if (auth_set_prerequisites(sess, auth_xform) != 0)
>> +               return -EINVAL;
>> +
>> +       return 0;
>> +}
>> +
>
> ....
>
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>> new file mode 100644
>> index 0000000..2bf6475
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>> @@ -0,0 +1,369 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Cavium networks nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> + */
>> +
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_malloc.h>
>> +#include <rte_cryptodev_pmd.h>
>> +
>> +#include "armv8_crypto_defs.h"
>> +
>> +#include "rte_armv8_pmd_private.h"
>> +
>> +static const struct rte_cryptodev_capabilities
>> +       armv8_crypto_pmd_capabilities[] = {
>> +       {       /* SHA1 HMAC */
>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +                       {.sym = {
>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>> +                               {.auth = {
>> +                                       .algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
>> +                                       .block_size = 64,
>> +                                       .key_size = {
>> +                                               .min = 16,
>> +                                               .max = 128,
>> +                                               .increment = 0
>> +                                       },
>> +                                       .digest_size = {
>> +                                               .min = 20,
>> +                                               .max = 20,
>> +                                               .increment = 0
>> +                                       },
>> +                                       .aad_size = { 0 }
>> +                               }, }
>> +                       }, }
>> +       },
>> +       {       /* SHA256 HMAC */
>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +                       {.sym = {
>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>> +                               {.auth = {
>> +                                       .algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
>> +                                       .block_size = 64,
>> +                                       .key_size = {
>> +                                               .min = 16,
>> +                                               .max = 128,
>> +                                               .increment = 0
>> +                                       },
>> +                                       .digest_size = {
>> +                                               .min = 32,
>> +                                               .max = 32,
>> +                                               .increment = 0
>> +                                       },
>> +                                       .aad_size = { 0 }
>> +                               }, }
>> +                       }, }
>> +       },
>> +       {       /* AES CBC */
>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +                       {.sym = {
>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
>> +                               {.cipher = {
>> +                                       .algo = RTE_CRYPTO_CIPHER_AES_CBC,
>> +                                       .block_size = 16,
>> +                                       .key_size = {
>> +                                               .min = 16,
>> +                                               .max = 16,
>> +                                               .increment = 0
>> +                                       },
>> +                                       .iv_size = {
>> +                                               .min = 16,
>> +                                               .max = 16,
>> +                                               .increment = 0
>> +                                       }
>> +                               }, }
>> +                       }, }
>> +       },
>> +
>
> It's strange that you defined aes and hmac here, but not implemented
> them, though their combinations are implemented.
> Will you add later?

We may add standalone algorithms in the future but those ops here are 
not for that purpose. I thought that since there is no chained 
operations capability we should export what we can do even though that 
it will work (mean not return error) only if the operations are chained.
Do you have some other suggestion?

>
>> +       RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
>> +};
>> +
>> +

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-12 13:12           ` Zbigniew Bodek
@ 2017-01-13  7:41             ` Jianbo Liu
  2017-01-13 19:09               ` Zbigniew Bodek
  0 siblings, 1 reply; 100+ messages in thread
From: Jianbo Liu @ 2017-01-13  7:41 UTC (permalink / raw)
  To: Zbigniew Bodek; +Cc: dev, pablo.de.lara.guarch, Declan Doherty, Jerin Jacob

On 12 January 2017 at 21:12, Zbigniew Bodek
<zbigniew.bodek@caviumnetworks.com> wrote:
> Hello  Jianbo Liu,
>
> Thanks for the review. Please check my answers in-line.
>
> Kind regards
> Zbigniew
>
>
> On 06.01.2017 03:45, Jianbo Liu wrote:
>>
>> On 5 January 2017 at 01:33,  <zbigniew.bodek@caviumnetworks.com> wrote:
>>>
>>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>>
>>> This patch introduces crypto poll mode driver
>>> using ARMv8 cryptographic extensions.
>>> CPU compatibility with this driver is detected in
>>> run-time and virtual crypto device will not be
>>> created if CPU doesn't provide:
>>> AES, SHA1, SHA2 and NEON.
>>>
>>> This PMD is optimized to provide performance boost
>>> for chained crypto operations processing,
>>> such as encryption + HMAC generation,
>>> decryption + HMAC validation. In particular,
>>> cipher only or hash only operations are
>>> not provided.
>>>
>>> The driver currently supports AES-128-CBC
>>> in combination with: SHA256 HMAC and SHA1 HMAC
>>> and relies on the external armv8_crypto library:
>>> https://github.com/caviumnetworks/armv8_crypto
>>>
>>
>> It's standalone lib. I think you should change the following line in
>> its Makefile, so not depend on DPDK.
>> "include $(RTE_SDK)/mk/rte.lib.mk"
>>
>>> This patch adds driver's code only and does
>>> not include it in the build system.
>>>
>>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>> ---
>>>  drivers/crypto/armv8/Makefile                  |  73 ++
>>>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926
>>> +++++++++++++++++++++++++
>>>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>>>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>>>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>>>  5 files changed, 1582 insertions(+)
>>>  create mode 100644 drivers/crypto/armv8/Makefile
>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>>>
.....

>>> +       /* Select auth algo */
>>> +       switch (auth_xform->auth.algo) {
>>> +       /* Cover supported hash algorithms */
>>> +       case RTE_CRYPTO_AUTH_SHA256:
>>> +               aalg = auth_xform->auth.algo;
>>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
>>> +               break;
>>> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
>>> +       case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
>>> +               aalg = auth_xform->auth.algo;
>>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
>>> +               break;
>>> +       default:
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       /* Verify supported key lengths and extract proper algorithm */
>>> +       switch (cipher_xform->cipher.key.length << 3) {
>>> +       case 128:
>>> +               sess->crypto_func =
>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>> 128);
>>> +               sess->cipher.key_sched =
>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 128);
>>> +               break;
>>> +       case 192:
>>> +               sess->crypto_func =
>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>> 192);
>>> +               sess->cipher.key_sched =
>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 192);
>>> +               break;
>>> +       case 256:
>>> +               sess->crypto_func =
>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>> 256);
>>> +               sess->cipher.key_sched =
>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 256);
>>> +               break;
>>> +       default:
>>> +               sess->crypto_func = NULL;
>>> +               sess->cipher.key_sched = NULL;
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       if (unlikely(sess->crypto_func == NULL)) {
>>> +               /*
>>> +                * If we got here that means that there must be a bug
>>
>>
>> Since AES-128-CBC is only supported in your patch. It means that
>> crypto_func could be NULL according to the switch above if
>> cipher.key.length > 128?
>
>
> Yes. Instead of checking for key lengths in a similar way that we check for
> algorithms, etc. we just fail when we don't find appropriate function. Do
> you suggest that this should be changed?
>

I mean to return error directly if length is not 128 in the above
switch, so this "if" is no necessary.

>
>>
>>> +                * in the algorithms selection above. Nevertheless keep
>>> +                * it here to catch bug immediately and avoid NULL
>>> pointer
>>> +                * dereference in OPs processing.
>>> +                */
>>> +               ARMV8_CRYPTO_LOG_ERR(
>>> +                       "No appropriate crypto function for given
>>> parameters");
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       /* Set up cipher session prerequisites */
>>> +       if (cipher_set_prerequisites(sess, cipher_xform) != 0)
>>> +               return -EINVAL;
>>> +
>>> +       /* Set up authentication session prerequisites */
>>> +       if (auth_set_prerequisites(sess, auth_xform) != 0)
>>> +               return -EINVAL;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>
>>
>> ....
>>
>>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>> b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>> new file mode 100644
>>> index 0000000..2bf6475
>>> --- /dev/null
>>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>> @@ -0,0 +1,369 @@
>>> +/*
>>> + *   BSD LICENSE
>>> + *
>>> + *   Copyright (C) Cavium networks Ltd. 2017.
>>> + *
>>> + *   Redistribution and use in source and binary forms, with or without
>>> + *   modification, are permitted provided that the following conditions
>>> + *   are met:
>>> + *
>>> + *     * Redistributions of source code must retain the above copyright
>>> + *       notice, this list of conditions and the following disclaimer.
>>> + *     * Redistributions in binary form must reproduce the above
>>> copyright
>>> + *       notice, this list of conditions and the following disclaimer in
>>> + *       the documentation and/or other materials provided with the
>>> + *       distribution.
>>> + *     * Neither the name of Cavium networks nor the names of its
>>> + *       contributors may be used to endorse or promote products derived
>>> + *       from this software without specific prior written permission.
>>> + *
>>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>>> FOR
>>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>>> COPYRIGHT
>>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>>> INCIDENTAL,
>>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>>> USE,
>>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
>>> ANY
>>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
>>> USE
>>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>>> DAMAGE.
>>> + */
>>> +
>>> +#include <string.h>
>>> +
>>> +#include <rte_common.h>
>>> +#include <rte_malloc.h>
>>> +#include <rte_cryptodev_pmd.h>
>>> +
>>> +#include "armv8_crypto_defs.h"
>>> +
>>> +#include "rte_armv8_pmd_private.h"
>>> +
>>> +static const struct rte_cryptodev_capabilities
>>> +       armv8_crypto_pmd_capabilities[] = {
>>> +       {       /* SHA1 HMAC */
>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>> +                       {.sym = {
>>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>>> +                               {.auth = {
>>> +                                       .algo =
>>> RTE_CRYPTO_AUTH_SHA1_HMAC,
>>> +                                       .block_size = 64,
>>> +                                       .key_size = {
>>> +                                               .min = 16,
>>> +                                               .max = 128,
>>> +                                               .increment = 0
>>> +                                       },
>>> +                                       .digest_size = {
>>> +                                               .min = 20,
>>> +                                               .max = 20,
>>> +                                               .increment = 0
>>> +                                       },
>>> +                                       .aad_size = { 0 }
>>> +                               }, }
>>> +                       }, }
>>> +       },
>>> +       {       /* SHA256 HMAC */
>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>> +                       {.sym = {
>>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>>> +                               {.auth = {
>>> +                                       .algo =
>>> RTE_CRYPTO_AUTH_SHA256_HMAC,
>>> +                                       .block_size = 64,
>>> +                                       .key_size = {
>>> +                                               .min = 16,
>>> +                                               .max = 128,
>>> +                                               .increment = 0
>>> +                                       },
>>> +                                       .digest_size = {
>>> +                                               .min = 32,
>>> +                                               .max = 32,
>>> +                                               .increment = 0
>>> +                                       },
>>> +                                       .aad_size = { 0 }
>>> +                               }, }
>>> +                       }, }
>>> +       },
>>> +       {       /* AES CBC */
>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>> +                       {.sym = {
>>> +                               .xform_type =
>>> RTE_CRYPTO_SYM_XFORM_CIPHER,
>>> +                               {.cipher = {
>>> +                                       .algo =
>>> RTE_CRYPTO_CIPHER_AES_CBC,
>>> +                                       .block_size = 16,
>>> +                                       .key_size = {
>>> +                                               .min = 16,
>>> +                                               .max = 16,
>>> +                                               .increment = 0
>>> +                                       },
>>> +                                       .iv_size = {
>>> +                                               .min = 16,
>>> +                                               .max = 16,
>>> +                                               .increment = 0
>>> +                                       }
>>> +                               }, }
>>> +                       }, }
>>> +       },
>>> +
>>
>>
>> It's strange that you defined aes and hmac here, but not implemented
>> them, though their combinations are implemented.
>> Will you add later?
>
>
> We may add standalone algorithms in the future but those ops here are not
> for that purpose. I thought that since there is no chained operations
> capability we should export what we can do even though that it will work
> (mean not return error) only if the operations are chained.
> Do you have some other suggestion?
>

Nothing special. Either implement them later, or add new chained ops
(is that possible?)
BTW, can you explain what optimization you have done, so I can better
understand your asm code, thanks!

>
>>
>>> +       RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
>>> +};
>>> +
>>> +

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
  2017-01-06  2:45         ` Jianbo Liu
@ 2017-01-13  7:57         ` Hemant Agrawal
  2017-01-13 19:15           ` Zbigniew Bodek
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2 siblings, 1 reply; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-13  7:57 UTC (permalink / raw)
  To: zbigniew.bodek; +Cc: dev, pablo.de.lara.guarch, Jerin Jacob

On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> This patch introduces crypto poll mode driver
> using ARMv8 cryptographic extensions.
> CPU compatibility with this driver is detected in
> run-time and virtual crypto device will not be
> created if CPU doesn't provide:
> AES, SHA1, SHA2 and NEON.
>
> This PMD is optimized to provide performance boost
> for chained crypto operations processing,
> such as encryption + HMAC generation,
> decryption + HMAC validation. In particular,
> cipher only or hash only operations are
> not provided.
>
> The driver currently supports AES-128-CBC
> in combination with: SHA256 HMAC and SHA1 HMAC
> and relies on the external armv8_crypto library:
> https://github.com/caviumnetworks/armv8_crypto
>
> This patch adds driver's code only and does
> not include it in the build system.
>
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  drivers/crypto/armv8/Makefile                  |  73 ++
>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>  5 files changed, 1582 insertions(+)
>  create mode 100644 drivers/crypto/armv8/Makefile
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>
> diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
> new file mode 100644
> index 0000000..dc5ea02
> --- /dev/null
> +++ b/drivers/crypto/armv8/Makefile
> @@ -0,0 +1,73 @@
> +#
> +#   BSD LICENSE
> +#
> +#   Copyright (C) Cavium networks Ltd. 2017.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Cavium networks nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +#
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +ifneq ($(MAKECMDGOALS),clean)
> +ifneq ($(MAKECMDGOALS),config)
> +ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
> +$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
> +endif
> +endif
> +endif
> +
> +# library name
> +LIB = librte_pmd_armv8.a
> +
> +# build flags
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +CFLAGS += -L$(RTE_SDK)/../openssl -I$(RTE_SDK)/../openssl/include
> +
> +# library version
> +LIBABIVER := 1
> +
> +# versioning export map
> +EXPORT_MAP := rte_armv8_pmd_version.map
> +
> +# external library dependencies
> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
> +LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
> +
> +# library source files
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
> +
> +# library dependencies
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
> new file mode 100644
> index 0000000..39433bb
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd.c
> @@ -0,0 +1,926 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright (C) Cavium networks Ltd. 2017.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Cavium networks nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_hexdump.h>
> +#include <rte_cryptodev.h>
> +#include <rte_cryptodev_pmd.h>
> +#include <rte_vdev.h>
> +#include <rte_malloc.h>
> +#include <rte_cpuflags.h>
> +
> +#include "armv8_crypto_defs.h"
> +
> +#include "rte_armv8_pmd_private.h"
> +
> +static int cryptodev_armv8_crypto_uninit(const char *name);
> +
> +/**
> + * Pointers to the supported combined mode crypto functions are stored
> + * in the static tables. Each combined (chained) cryptographic operation
> + * can be decribed by a set of numbers:

replace "decribed" with "described"

> + * - order:	order of operations (cipher, auth) or (auth, cipher)
> + * - direction:	encryption or decryption
> + * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
> + * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
> + * - keyl:	cipher key length, for example 128, 192, 256 bits
> + *
> + * In order to quickly acquire each function pointer based on those numbers,
> + * a hierarchy of arrays is maintained. The final level, 3D array is indexed
> + * by the combined mode function parameters only (cipher algorithm,
> + * authentication algorithm and key length).
> + *
> + * This gives 3 memory accesses to obtain a function pointer instead of
> + * traversing the array manually and comparing function parameters on each loop.
> + *
> + *                   +--+CRYPTO_FUNC
> + *            +--+ENC|
> + *      +--+CA|
> + *      |     +--+DEC
> + * ORDER|
> + *      |     +--+ENC
> + *      +--+AC|
> + *            +--+DEC
> + *
> + */
> +
> +/**
> + * 3D array type for ARM Combined Mode crypto functions pointers.
> + * CRYPTO_CIPHER_MAX:			max cipher ID number
> + * CRYPTO_AUTH_MAX:			max auth ID number
> + * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
> + */
> +typedef const crypto_func_t
> +crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
> +
> +/* Evaluate to key length definition */
> +#define KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
> +
> +/* Local aliases for supported ciphers */
> +#define CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
> +/* Local aliases for supported hashes */
> +#define AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
> +#define AUTH_SHA256		RTE_CRYPTO_AUTH_SHA256

SHA256 you are defining both AUTH and HMAC, however for SHA1 only HMAC.
In your implementation, you seems to be only supporting HMAC.

> +#define AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
> +
> +/**
> + * Arrays containing pointers to particular cryptographic,
> + * combined mode functions.
> + * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
> + * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
> + * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
> + * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
> + */
> +static const crypto_func_tbl_t
> +crypto_op_ca_encrypt = {
> +	/* [cipher alg][auth alg][key length] = crypto_function, */
> +	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
> +	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
> +};
> +
do you plan to support aes192 and aes256 as well?

> +static const crypto_func_tbl_t
> +crypto_op_ca_decrypt = {
> +	NULL
> +};
> +
> +static const crypto_func_tbl_t
> +crypto_op_ac_encrypt = {
> +	NULL
> +};
> +
> +static const crypto_func_tbl_t
> +crypto_op_ac_decrypt = {
> +	/* [cipher alg][auth alg][key length] = crypto_function, */
> +	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
> +	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
> +};
> +
> +/**
> + * Arrays containing pointers to particular cryptographic function sets,
> + * covering given cipher operation directions (encrypt, decrypt)
> + * for each order of cipher and authentication pairs.
> + */
> +static const crypto_func_tbl_t *
> +crypto_cipher_auth[] = {
> +	&crypto_op_ca_encrypt,
> +	&crypto_op_ca_decrypt,
> +	NULL
> +};
> +
> +static const crypto_func_tbl_t *
> +crypto_auth_cipher[] = {
> +	&crypto_op_ac_encrypt,
> +	&crypto_op_ac_decrypt,
> +	NULL
> +};
> +
> +/**
> + * Top level array containing pointers to particular cryptographic
> + * function sets, covering given order of chained operations.
> + * crypto_cipher_auth:	cipher first, authenticate after
> + * crypto_auth_cipher:	authenticate first, cipher after
> + */
> +static const crypto_func_tbl_t **
> +crypto_chain_order[] = {
> +	crypto_cipher_auth,
> +	crypto_auth_cipher,
> +	NULL
> +};
> +
> +/**
> + * Extract particular combined mode crypto function from the 3D array.
> + */
> +#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
> +({									\
> +	crypto_func_tbl_t *func_tbl =					\
> +				(crypto_chain_order[(order)])[(cop)];	\
> +									\
> +	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
> +})
> +
> +/*----------------------------------------------------------------------------*/
> +
> +/**
> + * 2D array type for ARM key schedule functions pointers.
> + * CRYPTO_CIPHER_MAX:			max cipher ID number
> + * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
> + */
> +typedef const crypto_key_sched_t
> +crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
> +
> +static const crypto_key_sched_tbl_t
> +crypto_key_sched_encrypt = {
> +	/* [cipher alg][key length] = key_expand_func, */
> +	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
> +};
> +
> +static const crypto_key_sched_tbl_t
> +crypto_key_sched_decrypt = {
> +	/* [cipher alg][key length] = key_expand_func, */
> +	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
> +};
> +
> +/**
> + * Top level array containing pointers to particular key generation
> + * function sets, covering given operation direction.
> + * crypto_key_sched_encrypt:	keys for encryption
> + * crypto_key_sched_decrypt:	keys for decryption
> + */
> +static const crypto_key_sched_tbl_t *
> +crypto_key_sched_dir[] = {
> +	&crypto_key_sched_encrypt,
> +	&crypto_key_sched_decrypt,
> +	NULL
> +};
> +
> +/**
> + * Extract particular combined mode crypto function from the 3D array.
> + */
> +#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
> +({									\
> +	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
> +									\
> +	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
> +})
> +
> +/*----------------------------------------------------------------------------*/
> +
> +/**
> + * Global static parameter used to create a unique name for each
> + * ARMV8 crypto device.
> + */
> +static unsigned int unique_name_id;
> +
> +static inline int
> +create_unique_device_name(char *name, size_t size)
> +{
> +	int ret;
> +
> +	if (name == NULL)
> +		return -EINVAL;
> +
> +	ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
> +			unique_name_id++);
> +	if (ret < 0)
> +		return ret;
> +	return 0;
> +}
> +
> +/*
> + *------------------------------------------------------------------------------
> + * Session Prepare
> + *------------------------------------------------------------------------------
> + */
> +
> +/** Get xform chain order */
> +static enum armv8_crypto_chain_order
> +armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
> +{
> +
> +	/*
> +	 * This driver currently covers only chained operations.
> +	 * Ignore only cipher or only authentication operations
> +	 * or chains longer than 2 xform structures.
> +	 */
> +	if (xform->next == NULL || xform->next->next != NULL)
> +		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
> +
> +	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
> +		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
> +			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
> +	}
> +
> +	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
> +		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
> +			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
> +	}
> +
> +	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
> +}
> +
> +static inline void
> +auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
> +				const struct rte_crypto_sym_xform *xform)
> +{
> +	size_t i;
> +
> +	/* Generate i_key_pad and o_key_pad */
> +	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
> +	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
> +							xform->auth.key.length);
> +	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
> +	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
> +							xform->auth.key.length);
> +	/*
> +	 * XOR key with IPAD/OPAD values to obtain i_key_pad
> +	 * and o_key_pad.
> +	 * Byte-by-byte operation may seem to be the less efficient
> +	 * here but in fact it's the opposite.
> +	 * The result ASM code is likely operate on NEON registers
> +	 * (load auth key to Qx, load IPAD/OPAD to multiple
> +	 * elements of Qy, eor 128 bits at once).
> +	 */
> +	for (i = 0; i < SHA_BLOCK_MAX; i++) {
> +		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
> +		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
> +	}
> +}
> +
> +static inline int
> +auth_set_prerequisites(struct armv8_crypto_session *sess,
> +			const struct rte_crypto_sym_xform *xform)
> +{
> +	uint8_t partial[64] = { 0 };
> +	int error;
> +
> +	switch (xform->auth.algo) {
> +	case RTE_CRYPTO_AUTH_SHA1_HMAC:
> +		/*
> +		 * Generate authentication key, i_key_pad and o_key_pad.
> +		 */
> +		/* Zero memory under key */
> +		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
> +
> +		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
> +			/*
> +			 * In case the key is longer than 160 bits
> +			 * the algorithm will use SHA1(key) instead.
> +			 */
> +			error = sha1_block(NULL, xform->auth.key.data,
> +				sess->auth.hmac.key, xform->auth.key.length);
> +			if (error != 0)
> +				return -1;
> +		} else {
> +			/*
> +			 * Now copy the given authentication key to the session
> +			 * key assuming that the session key is zeroed there is
> +			 * no need for additional zero padding if the key is
> +			 * shorter than SHA1_AUTH_KEY_LENGTH.
> +			 */
> +			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
> +							xform->auth.key.length);
> +		}
> +
> +		/* Prepare HMAC padding: key|pattern */
> +		auth_hmac_pad_prepare(sess, xform);
> +		/*
> +		 * Calculate partial hash values for i_key_pad and o_key_pad.
> +		 * Will be used as initialization state for final HMAC.
> +		 */
> +		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
> +		    partial, SHA1_BLOCK_SIZE);
> +		if (error != 0)
> +			return -1;
> +		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
> +
> +		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
> +		    partial, SHA1_BLOCK_SIZE);
> +		if (error != 0)
> +			return -1;
> +		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
> +
> +		break;
> +	case RTE_CRYPTO_AUTH_SHA256_HMAC:
> +		/*
> +		 * Generate authentication key, i_key_pad and o_key_pad.
> +		 */
> +		/* Zero memory under key */
> +		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
> +
> +		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
> +			/*
> +			 * In case the key is longer than 256 bits
> +			 * the algorithm will use SHA256(key) instead.
> +			 */
> +			error = sha256_block(NULL, xform->auth.key.data,
> +				sess->auth.hmac.key, xform->auth.key.length);
> +			if (error != 0)
> +				return -1;
> +		} else {
> +			/*
> +			 * Now copy the given authentication key to the session
> +			 * key assuming that the session key is zeroed there is
> +			 * no need for additional zero padding if the key is
> +			 * shorter than SHA256_AUTH_KEY_LENGTH.
> +			 */
> +			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
> +							xform->auth.key.length);
> +		}
> +
> +		/* Prepare HMAC padding: key|pattern */
> +		auth_hmac_pad_prepare(sess, xform);
> +		/*
> +		 * Calculate partial hash values for i_key_pad and o_key_pad.
> +		 * Will be used as initialization state for final HMAC.
> +		 */
> +		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
> +		    partial, SHA256_BLOCK_SIZE);
> +		if (error != 0)
> +			return -1;
> +		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
> +
> +		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
> +		    partial, SHA256_BLOCK_SIZE);
> +		if (error != 0)
> +			return -1;
> +		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
> +
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline int
> +cipher_set_prerequisites(struct armv8_crypto_session *sess,
> +			const struct rte_crypto_sym_xform *xform)
> +{
> +	crypto_key_sched_t cipher_key_sched;
> +
> +	cipher_key_sched = sess->cipher.key_sched;
> +	if (likely(cipher_key_sched != NULL)) {
> +		/* Set up cipher session key */
> +		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
> +		const struct rte_crypto_sym_xform *cipher_xform,
> +		const struct rte_crypto_sym_xform *auth_xform)
> +{
> +	enum armv8_crypto_chain_order order;
> +	enum armv8_crypto_cipher_operation cop;
> +	enum rte_crypto_cipher_algorithm calg;
> +	enum rte_crypto_auth_algorithm aalg;
> +
> +	/* Validate and prepare scratch order of combined operations */
> +	switch (sess->chain_order) {
> +	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
> +	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
> +		order = sess->chain_order;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	/* Select cipher direction */
> +	sess->cipher.direction = cipher_xform->cipher.op;
> +	/* Select cipher key */
> +	sess->cipher.key.length = cipher_xform->cipher.key.length;
> +	/* Set cipher direction */
> +	cop = sess->cipher.direction;
> +	/* Set cipher algorithm */
> +	calg = cipher_xform->cipher.algo;
> +
> +	/* Select cipher algo */
> +	switch (calg) {
> +	/* Cover supported cipher algorithms */
> +	case RTE_CRYPTO_CIPHER_AES_CBC:
> +		sess->cipher.algo = calg;
> +		/* IV len is always 16 bytes (block size) for AES CBC */
> +		sess->cipher.iv_len = 16;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	/* Select auth generate/verify */
> +	sess->auth.operation = auth_xform->auth.op;
> +
> +	/* Select auth algo */
> +	switch (auth_xform->auth.algo) {
> +	/* Cover supported hash algorithms */
> +	case RTE_CRYPTO_AUTH_SHA256:
> +		aalg = auth_xform->auth.algo;
> +		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
> +		break;

as previously stated, are you supporting AUTH types?


> +	case RTE_CRYPTO_AUTH_SHA1_HMAC:
> +	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
> +		aalg = auth_xform->auth.algo;
> +		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	/* Verify supported key lengths and extract proper algorithm */
> +	switch (cipher_xform->cipher.key.length << 3) {
> +	case 128:
> +		sess->crypto_func =
> +				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
> +		sess->cipher.key_sched =
> +				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
> +		break;
> +	case 192:

aes192 and aes256?

> +		sess->crypto_func =
> +				CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
> +		sess->cipher.key_sched =
> +				CRYPTO_GET_KEY_SCHED(cop, calg, 192);
> +		break;
> +	case 256:
> +		sess->crypto_func =
> +				CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
> +		sess->cipher.key_sched =
> +				CRYPTO_GET_KEY_SCHED(cop, calg, 256);
> +		break;
> +	default:
> +		sess->crypto_func = NULL;
> +		sess->cipher.key_sched = NULL;
> +		return -EINVAL;
> +	}
> +
> +	if (unlikely(sess->crypto_func == NULL)) {
> +		/*
> +		 * If we got here that means that there must be a bug
> +		 * in the algorithms selection above. Nevertheless keep
> +		 * it here to catch bug immediately and avoid NULL pointer
> +		 * dereference in OPs processing.
> +		 */
> +		ARMV8_CRYPTO_LOG_ERR(
> +			"No appropriate crypto function for given parameters");
> +		return -EINVAL;
> +	}
> +
> +	/* Set up cipher session prerequisites */
> +	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
> +		return -EINVAL;
> +
> +	/* Set up authentication session prerequisites */
> +	if (auth_set_prerequisites(sess, auth_xform) != 0)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +/** Parse crypto xform chain and set private session parameters */
> +int
> +armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
> +		const struct rte_crypto_sym_xform *xform)
> +{
> +	const struct rte_crypto_sym_xform *cipher_xform = NULL;
> +	const struct rte_crypto_sym_xform *auth_xform = NULL;
> +	bool is_chained_op;
> +	int ret;
> +
> +	/* Filter out spurious/broken requests */
> +	if (xform == NULL)
> +		return -EINVAL;
> +
> +	sess->chain_order = armv8_crypto_get_chain_order(xform);
> +	switch (sess->chain_order) {
> +	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
> +		cipher_xform = xform;
> +		auth_xform = xform->next;
> +		is_chained_op = true;
> +		break;
> +	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
> +		auth_xform = xform;
> +		cipher_xform = xform->next;
> +		is_chained_op = true;
> +		break;
> +	default:
> +		is_chained_op = false;
> +		return -EINVAL;
> +	}
> +
> +	if (is_chained_op) {
> +		ret = armv8_crypto_set_session_chained_parameters(sess,
> +						cipher_xform, auth_xform);
> +		if (unlikely(ret != 0)) {
> +			ARMV8_CRYPTO_LOG_ERR(
> +			"Invalid/unsupported chained (cipher/auth) parameters");
> +			return -EINVAL;
> +		}
> +	} else {
> +		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/** Provide session for operation */
> +static struct armv8_crypto_session *
> +get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
> +{
> +	struct armv8_crypto_session *sess = NULL;
> +
> +	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
> +		/* get existing session */
> +		if (likely(op->sym->session != NULL &&
> +				op->sym->session->dev_type ==
> +				RTE_CRYPTODEV_ARMV8_PMD)) {
> +			sess = (struct armv8_crypto_session *)
> +				op->sym->session->_private;
> +		}
> +	} else {
> +		/* provide internal session */
> +		void *_sess = NULL;
> +
> +		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
> +			sess = (struct armv8_crypto_session *)
> +				((struct rte_cryptodev_sym_session *)_sess)
> +				->_private;
> +
> +			if (unlikely(armv8_crypto_set_session_parameters(
> +					sess, op->sym->xform) != 0)) {
> +				rte_mempool_put(qp->sess_mp, _sess);
> +				sess = NULL;
> +			} else
> +				op->sym->session = _sess;
> +		}
> +	}
> +
> +	if (sess == NULL)
> +		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
> +
> +	return sess;
> +}
> +
> +/*
> + *------------------------------------------------------------------------------
> + * Process Operations
> + *------------------------------------------------------------------------------
> + */
> +
> +/*----------------------------------------------------------------------------*/
> +
> +/** Process cipher operation */
> +static void
> +process_armv8_chained_op
> +		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
> +		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
> +{
> +	crypto_func_t crypto_func;
> +	crypto_arg_t arg;
> +	struct rte_mbuf *m_asrc, *m_adst;
> +	uint8_t *csrc, *cdst;
> +	uint8_t *adst, *asrc;
> +	uint64_t clen, alen __rte_unused;
> +	int error;
> +
> +	clen = op->sym->cipher.data.length;
> +	alen = op->sym->auth.data.length;
> +
> +	csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
> +			op->sym->cipher.data.offset);
> +	cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
> +			op->sym->cipher.data.offset);
> +
> +	switch (sess->chain_order) {
> +	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
> +		m_asrc = m_adst = mbuf_dst;
> +		break;
> +	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
> +		m_asrc = mbuf_src;
> +		m_adst = mbuf_dst;
> +		break;
> +	default:
> +		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
> +		return;
> +	}
> +	asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
> +				op->sym->auth.data.offset);
> +
> +	switch (sess->auth.mode) {
> +	case ARMV8_CRYPTO_AUTH_AS_AUTH:
> +		/* Nothing to do here, just verify correct option */
> +		break;
> +	case ARMV8_CRYPTO_AUTH_AS_HMAC:
> +		arg.digest.hmac.key = sess->auth.hmac.key;
> +		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
> +		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
> +		break;
> +	default:
> +		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
> +		return;
> +	}
> +
> +	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
> +		adst = op->sym->auth.digest.data;
> +		if (adst == NULL) {
> +			adst = rte_pktmbuf_mtod_offset(m_adst,
> +					uint8_t *,
> +					op->sym->auth.data.offset +
> +					op->sym->auth.data.length);
> +		}
> +	} else {
> +		adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
> +				op->sym->auth.digest.length);
> +	}
> +
> +	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
> +		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
> +		return;
> +	}
> +
> +	arg.cipher.iv = op->sym->cipher.iv.data;
> +	arg.cipher.key = sess->cipher.key.data;
> +	/* Acquire combined mode function */
> +	crypto_func = sess->crypto_func;
> +	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
> +	error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
> +	if (error != 0) {
> +		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
> +		return;
> +	}
> +
> +	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
> +	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
> +		if (memcmp(adst, op->sym->auth.digest.data,
> +				op->sym->auth.digest.length) != 0) {
> +			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
> +		}
> +		/* Trim area used for digest from mbuf. */
> +		rte_pktmbuf_trim(m_asrc,
> +				op->sym->auth.digest.length);
> +	}
> +}
> +
> +/** Process crypto operation for mbuf */
> +static int
> +process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
> +		struct armv8_crypto_session *sess)
> +{
> +	struct rte_mbuf *msrc, *mdst;
> +	int retval;
> +
> +	msrc = op->sym->m_src;
> +	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
> +
> +	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
> +
> +	switch (sess->chain_order) {
> +	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
> +	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
> +		process_armv8_chained_op(op, sess, msrc, mdst);
> +		break;
> +	default:
> +		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
> +		break;
> +	}
> +
> +	/* Free session if a session-less crypto op */
> +	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
> +		memset(sess, 0, sizeof(struct armv8_crypto_session));
> +		rte_mempool_put(qp->sess_mp, op->sym->session);
> +		op->sym->session = NULL;
> +	}
> +
> +	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
> +		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
> +
> +	if (op->status != RTE_CRYPTO_OP_STATUS_ERROR)
> +		retval = rte_ring_enqueue(qp->processed_ops, (void *)op);
> +	else
> +		retval = -1;
> +
> +	return retval;
> +}
> +
> +/*
> + *------------------------------------------------------------------------------
> + * PMD Framework
> + *------------------------------------------------------------------------------
> + */
> +
> +/** Enqueue burst */
> +static uint16_t
> +armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
> +		uint16_t nb_ops)
> +{
> +	struct armv8_crypto_session *sess;
> +	struct armv8_crypto_qp *qp = queue_pair;
> +	int i, retval;
> +
> +	for (i = 0; i < nb_ops; i++) {
> +		sess = get_session(qp, ops[i]);
> +		if (unlikely(sess == NULL))
> +			goto enqueue_err;
> +
> +		retval = process_op(qp, ops[i], sess);
> +		if (unlikely(retval < 0))
> +			goto enqueue_err;
> +	}
> +
> +	qp->stats.enqueued_count += i;
> +	return i;
> +
> +enqueue_err:
> +	if (ops[i] != NULL)
> +		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
> +
> +	qp->stats.enqueue_err_count++;
> +	return i;
> +}
> +
> +/** Dequeue burst */
> +static uint16_t
> +armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
> +		uint16_t nb_ops)
> +{
> +	struct armv8_crypto_qp *qp = queue_pair;
> +
> +	unsigned int nb_dequeued = 0;
> +
> +	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
> +			(void **)ops, nb_ops);
> +	qp->stats.dequeued_count += nb_dequeued;
> +
> +	return nb_dequeued;
> +}
> +
> +/** Create ARMv8 crypto device */
> +static int
> +cryptodev_armv8_crypto_create(const char *name,
> +		struct rte_crypto_vdev_init_params *init_params)
> +{
> +	struct rte_cryptodev *dev;
> +	char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
> +	struct armv8_crypto_private *internals;
> +
> +	/* Check CPU for support for AES instruction set */
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
> +		ARMV8_CRYPTO_LOG_ERR(
> +			"AES instructions not supported by CPU");
> +		return -EFAULT;
> +	}
> +
> +	/* Check CPU for support for SHA instruction set */
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
> +	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
> +		ARMV8_CRYPTO_LOG_ERR(
> +			"SHA1/SHA2 instructions not supported by CPU");
> +		return -EFAULT;
> +	}
> +
> +	/* Check CPU for support for Advance SIMD instruction set */
> +	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
> +		ARMV8_CRYPTO_LOG_ERR(
> +			"Advanced SIMD instructions not supported by CPU");
> +		return -EFAULT;
> +	}
> +
> +	/* create a unique device name */
> +	if (create_unique_device_name(crypto_dev_name,
> +			RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
> +		ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
> +		return -EINVAL;
> +	}
> +
> +	dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
> +				sizeof(struct armv8_crypto_private),
> +				init_params->socket_id);
> +	if (dev == NULL) {
> +		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
> +		goto init_error;
> +	}
> +
> +	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
> +	dev->dev_ops = rte_armv8_crypto_pmd_ops;
> +
> +	/* register rx/tx burst functions for data path */
> +	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
> +	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
> +
> +	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
> +			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
> +
> +	/* Set vector instructions mode supported */
> +	internals = dev->data->dev_private;
> +
> +	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
> +	internals->max_nb_sessions = init_params->max_nb_sessions;
> +
> +	return 0;
> +
> +init_error:
> +	ARMV8_CRYPTO_LOG_ERR(
> +		"driver %s: cryptodev_armv8_crypto_create failed", name);
> +
> +	cryptodev_armv8_crypto_uninit(crypto_dev_name);
> +	return -EFAULT;
> +}
> +
> +/** Initialise ARMv8 crypto device */
> +static int
> +cryptodev_armv8_crypto_init(const char *name,
> +		const char *input_args)
> +{
> +	struct rte_crypto_vdev_init_params init_params = {
> +		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
> +		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
> +		rte_socket_id()
> +	};
> +
> +	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
> +
> +	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
> +			init_params.socket_id);
> +	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
> +			init_params.max_nb_queue_pairs);
> +	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
> +			init_params.max_nb_sessions);
> +
> +	return cryptodev_armv8_crypto_create(name, &init_params);
> +}
> +
> +/** Uninitialise ARMv8 crypto device */
> +static int
> +cryptodev_armv8_crypto_uninit(const char *name)
> +{
> +	if (name == NULL)
> +		return -EINVAL;
> +
> +	RTE_LOG(INFO, PMD,
> +		"Closing ARMv8 crypto device %s on numa socket %u\n",
> +		name, rte_socket_id());
> +
> +	return 0;
> +}
> +
> +static struct rte_vdev_driver armv8_crypto_drv = {
> +	.probe = cryptodev_armv8_crypto_init,
> +	.remove = cryptodev_armv8_crypto_uninit
> +};
> +
> +RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
> +RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
> +RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
> +	"max_nb_queue_pairs=<int> "
> +	"max_nb_sessions=<int> "
> +	"socket_id=<int>");
> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> new file mode 100644
> index 0000000..2bf6475
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
> @@ -0,0 +1,369 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright (C) Cavium networks Ltd. 2017.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Cavium networks nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_malloc.h>
> +#include <rte_cryptodev_pmd.h>
> +
> +#include "armv8_crypto_defs.h"
> +
> +#include "rte_armv8_pmd_private.h"
> +
> +static const struct rte_cryptodev_capabilities
> +	armv8_crypto_pmd_capabilities[] = {
> +	{	/* SHA1 HMAC */
> +		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +			{.sym = {
> +				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
> +				{.auth = {
> +					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
> +					.block_size = 64,
> +					.key_size = {
> +						.min = 16,
> +						.max = 128,
> +						.increment = 0
> +					},
> +					.digest_size = {
> +						.min = 20,
> +						.max = 20,
> +						.increment = 0
> +					},
> +					.aad_size = { 0 }
> +				}, }
> +			}, }
> +	},
> +	{	/* SHA256 HMAC */
> +		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +			{.sym = {
> +				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
> +				{.auth = {
> +					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
> +					.block_size = 64,
> +					.key_size = {
> +						.min = 16,
> +						.max = 128,
> +						.increment = 0
> +					},
> +					.digest_size = {
> +						.min = 32,
> +						.max = 32,
> +						.increment = 0
> +					},
> +					.aad_size = { 0 }
> +				}, }
> +			}, }
> +	},
> +	{	/* AES CBC */
> +		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
> +			{.sym = {
> +				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
> +				{.cipher = {
> +					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
> +					.block_size = 16,
> +					.key_size = {
> +						.min = 16,
> +						.max = 16,

do you plan max = 32 ?

> +						.increment = 0
> +					},
> +					.iv_size = {
> +						.min = 16,
> +						.max = 16,
> +						.increment = 0
> +					}
> +				}, }
> +			}, }
> +	},
> +
> +	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
> +};
> +
> +
> +/** Configure device */
> +static int
> +armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
> +{
> +	return 0;
> +}
> +
> +/** Start device */
> +static int
> +armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
> +{
> +	return 0;
> +}
> +
> +/** Stop device */
> +static void
> +armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
> +{
> +}
> +
> +/** Close device */
> +static int
> +armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
> +{
> +	return 0;
> +}
> +
> +
> +/** Get device statistics */
> +static void
> +armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
> +		struct rte_cryptodev_stats *stats)
> +{
> +	int qp_id;
> +
> +	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
> +		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
> +
> +		stats->enqueued_count += qp->stats.enqueued_count;
> +		stats->dequeued_count += qp->stats.dequeued_count;
> +
> +		stats->enqueue_err_count += qp->stats.enqueue_err_count;
> +		stats->dequeue_err_count += qp->stats.dequeue_err_count;
> +	}
> +}
> +
> +/** Reset device statistics */
> +static void
> +armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
> +{
> +	int qp_id;
> +
> +	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
> +		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
> +
> +		memset(&qp->stats, 0, sizeof(qp->stats));
> +	}
> +}
> +
> +
> +/** Get device info */
> +static void
> +armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
> +		struct rte_cryptodev_info *dev_info)
> +{
> +	struct armv8_crypto_private *internals = dev->data->dev_private;
> +
> +	if (dev_info != NULL) {
> +		dev_info->dev_type = dev->dev_type;
> +		dev_info->feature_flags = dev->feature_flags;
> +		dev_info->capabilities = armv8_crypto_pmd_capabilities;
> +		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
> +		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
> +	}
> +}
> +
> +/** Release queue pair */
> +static int
> +armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
> +{
> +
> +	if (dev->data->queue_pairs[qp_id] != NULL) {
> +		rte_free(dev->data->queue_pairs[qp_id]);
> +		dev->data->queue_pairs[qp_id] = NULL;
> +	}
> +
> +	return 0;
> +}
> +
> +/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
> +static int
> +armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
> +		struct armv8_crypto_qp *qp)
> +{
> +	unsigned int n;
> +
> +	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
> +			dev->data->dev_id, qp->id);
> +
> +	if (n > sizeof(qp->name))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +
> +/** Create a ring to place processed operations on */
> +static struct rte_ring *
> +armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
> +		unsigned int ring_size, int socket_id)
> +{
> +	struct rte_ring *r;
> +
> +	r = rte_ring_lookup(qp->name);
> +	if (r) {
> +		if (r->prod.size >= ring_size) {
> +			ARMV8_CRYPTO_LOG_INFO(
> +				"Reusing existing ring %s for processed ops",
> +				 qp->name);
> +			return r;
> +		}
> +
> +		ARMV8_CRYPTO_LOG_ERR(
> +			"Unable to reuse existing ring %s for processed ops",
> +			 qp->name);
> +		return NULL;
> +	}
> +
> +	return rte_ring_create(qp->name, ring_size, socket_id,
> +			RING_F_SP_ENQ | RING_F_SC_DEQ);
> +}
> +
> +
> +/** Setup a queue pair */
> +static int
> +armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
> +		const struct rte_cryptodev_qp_conf *qp_conf,
> +		 int socket_id)
> +{
> +	struct armv8_crypto_qp *qp = NULL;
> +
> +	/* Free memory prior to re-allocation if needed. */
> +	if (dev->data->queue_pairs[qp_id] != NULL)
> +		armv8_crypto_pmd_qp_release(dev, qp_id);
> +
> +	/* Allocate the queue pair data structure. */
> +	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
> +					RTE_CACHE_LINE_SIZE, socket_id);
> +	if (qp == NULL)
> +		return -ENOMEM;
> +
> +	qp->id = qp_id;
> +	dev->data->queue_pairs[qp_id] = qp;
> +
> +	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
> +		goto qp_setup_cleanup;
> +
> +	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
> +			qp_conf->nb_descriptors, socket_id);
> +	if (qp->processed_ops == NULL)
> +		goto qp_setup_cleanup;
> +
> +	qp->sess_mp = dev->data->session_pool;
> +
> +	memset(&qp->stats, 0, sizeof(qp->stats));
> +
> +	return 0;
> +
> +qp_setup_cleanup:
> +	if (qp)
> +		rte_free(qp);
> +
> +	return -1;
> +}
> +
> +/** Start queue pair */
> +static int
> +armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
> +		__rte_unused uint16_t queue_pair_id)
> +{
> +	return -ENOTSUP;
> +}
> +
> +/** Stop queue pair */
> +static int
> +armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
> +		__rte_unused uint16_t queue_pair_id)
> +{
> +	return -ENOTSUP;
> +}
> +
> +/** Return the number of allocated queue pairs */
> +static uint32_t
> +armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
> +{
> +	return dev->data->nb_queue_pairs;
> +}
> +
> +/** Returns the size of the session structure */
> +static unsigned
> +armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
> +{
> +	return sizeof(struct armv8_crypto_session);
> +}
> +
> +/** Configure the session from a crypto xform chain */
> +static void *
> +armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
> +		struct rte_crypto_sym_xform *xform, void *sess)
> +{
> +	if (unlikely(sess == NULL)) {
> +		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
> +		return NULL;
> +	}
> +
> +	if (armv8_crypto_set_session_parameters(
> +			sess, xform) != 0) {
> +		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
> +		return NULL;
> +	}
> +
> +	return sess;
> +}
> +
> +/** Clear the memory of session so it doesn't leave key material behind */
> +static void
> +armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
> +				void *sess)
> +{
> +
> +	/* Zero out the whole structure */
> +	if (sess)
> +		memset(sess, 0, sizeof(struct armv8_crypto_session));
> +}
> +
> +struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
> +		.dev_configure		= armv8_crypto_pmd_config,
> +		.dev_start		= armv8_crypto_pmd_start,
> +		.dev_stop		= armv8_crypto_pmd_stop,
> +		.dev_close		= armv8_crypto_pmd_close,
> +
> +		.stats_get		= armv8_crypto_pmd_stats_get,
> +		.stats_reset		= armv8_crypto_pmd_stats_reset,
> +
> +		.dev_infos_get		= armv8_crypto_pmd_info_get,
> +
> +		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
> +		.queue_pair_release	= armv8_crypto_pmd_qp_release,
> +		.queue_pair_start	= armv8_crypto_pmd_qp_start,
> +		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
> +		.queue_pair_count	= armv8_crypto_pmd_qp_count,
> +
> +		.session_get_size	= armv8_crypto_pmd_session_get_size,
> +		.session_configure	= armv8_crypto_pmd_session_configure,
> +		.session_clear		= armv8_crypto_pmd_session_clear
> +};
> +
> +struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
> new file mode 100644
> index 0000000..fe46cde
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
> @@ -0,0 +1,211 @@
> +/*
> + *   BSD LICENSE
> + *
> + *   Copyright (C) Cavium networks Ltd. 2017.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Cavium networks nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
> +#define _RTE_ARMV8_PMD_PRIVATE_H_
> +
> +#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
> +	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
> +			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
> +			__func__, __LINE__, ## args)
> +
> +#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
> +#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
> +	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
> +			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
> +			__func__, __LINE__, ## args)
> +
> +#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
> +	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
> +			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
> +			__func__, __LINE__, ## args)
> +
> +#define ARMV8_CRYPTO_ASSERT(con)				\
> +do {								\
> +	if (!(con)) {						\
> +		rte_panic("%s(): "				\
> +		    con "condition failed, line %u", __func__);	\
> +	}							\
> +} while (0)
> +
> +#else
> +#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
> +#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
> +#define ARMV8_CRYPTO_ASSERT(con)
> +#endif
> +
> +#define NBBY		8		/* Number of bits in a byte */

is it being used somewhere?

> +#define BYTE_LENGTH(x)	((x) / 8)	/* Number of bytes in x (roun down) */

"round down"  instead of "roun down"

> +
> +/** ARMv8 operation order mode enumerator */
> +enum armv8_crypto_chain_order {
> +	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
> +	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
> +	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
> +	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
> +};
> +
> +/** ARMv8 cipher operation enumerator */
> +enum armv8_crypto_cipher_operation {
> +	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
> +	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
> +	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
> +	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
> +};
> +
> +enum armv8_crypto_cipher_keylen {
> +	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
> +	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
> +	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
> +	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
> +	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
> +		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
> +};
> +
> +/** ARMv8 auth mode enumerator */
> +enum armv8_crypto_auth_mode {
> +	ARMV8_CRYPTO_AUTH_AS_AUTH,
> +	ARMV8_CRYPTO_AUTH_AS_HMAC,
> +	ARMV8_CRYPTO_AUTH_AS_CIPHER,
> +	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
> +	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
> +};
> +
> +#define CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
> +#define CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
> +#define CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
> +#define CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
> +#define CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
> +
> +#define HMAC_IPAD_VALUE			(0x36)
> +#define HMAC_OPAD_VALUE			(0x5C)
> +
> +#define SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
> +#define SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
> +
> +#define SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
> +#define SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
> +
> +#define SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
> +#define SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
> +
> +typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
> +				uint8_t *, uint8_t *, uint64_t,
> +				crypto_arg_t *);
> +
> +typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
> +
> +/** private data structure for each ARMv8 crypto device */
> +struct armv8_crypto_private {
> +	unsigned int max_nb_qpairs;
> +	/**< Max number of queue pairs */
> +	unsigned int max_nb_sessions;
> +	/**< Max number of sessions */
> +};
> +
> +/** ARMv8 crypto queue pair */
> +struct armv8_crypto_qp {
> +	uint16_t id;
> +	/**< Queue Pair Identifier */
> +	char name[RTE_CRYPTODEV_NAME_LEN];
> +	/**< Unique Queue Pair Name */
> +	struct rte_ring *processed_ops;
> +	/**< Ring for placing process packets */
> +	struct rte_mempool *sess_mp;
> +	/**< Session Mempool */
> +	struct rte_cryptodev_stats stats;
> +	/**< Queue pair statistics */
> +} __rte_cache_aligned;
> +
> +/** ARMv8 crypto private session structure */
> +struct armv8_crypto_session {
> +	enum armv8_crypto_chain_order chain_order;
> +	/**< chain order mode */
> +	crypto_func_t crypto_func;
> +	/**< cryptographic function to use for this session */
> +
> +	/** Cipher Parameters */
> +	struct {
> +		enum rte_crypto_cipher_operation direction;
> +		/**< cipher operation direction */
> +		enum rte_crypto_cipher_algorithm algo;
> +		/**< cipher algorithm */
> +		int iv_len;
> +		/**< IV length */
> +
> +		struct {
> +			uint8_t data[256];
> +			/**< key data */
> +			size_t length;
> +			/**< key length in bytes */
> +		} key;
> +
> +		crypto_key_sched_t key_sched;
> +		/**< Key schedule function */
> +	} cipher;
> +
> +	/** Authentication Parameters */
> +	struct {
> +		enum rte_crypto_auth_operation operation;
> +		/**< auth operation generate or verify */
> +		enum armv8_crypto_auth_mode mode;
> +		/**< auth operation mode */
> +
> +		union {
> +			struct {
> +				/* Add data if needed */
> +			} auth;
> +
> +			struct {
> +				uint8_t i_key_pad[SHA_BLOCK_MAX]
> +							__rte_cache_aligned;
> +				/**< inner pad (max supported block length) */
> +				uint8_t o_key_pad[SHA_BLOCK_MAX]
> +							__rte_cache_aligned;
> +				/**< outer pad (max supported block length) */
> +				uint8_t key[SHA_AUTH_KEY_MAX];
> +				/**< HMAC key (max supported length)*/
> +			} hmac;
> +		};
> +	} auth;
> +
> +} __rte_cache_aligned;
> +
> +/** Set and validate ARMv8 crypto session parameters */
> +extern int armv8_crypto_set_session_parameters(
> +		struct armv8_crypto_session *sess,
> +		const struct rte_crypto_sym_xform *xform);
> +
> +/** device specific operations function pointer structure */
> +extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
> +
> +#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
> new file mode 100644
> index 0000000..1f84b68
> --- /dev/null
> +++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
> @@ -0,0 +1,3 @@
> +DPDK_17.02 {
> +	local: *;
> +};
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                         ` (8 preceding siblings ...)
  2017-01-10 17:11       ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
@ 2017-01-13  8:07       ` Hemant Agrawal
  2017-01-13 18:59         ` Zbigniew Bodek
  9 siblings, 1 reply; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-13  8:07 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Introduce crypto poll mode driver using ARMv8
> cryptographic extensions. This PMD is optimized
> to provide performance boost for chained
> crypto operations processing, such as:
> * encryption + HMAC generation
> * decryption + HMAC validation.
> In particular, cipher only or hash only
> operations are not provided.

Do you have a plan to add the crypto only, auth/hash only support into 
this driver?
Also, do you plan to add additional cases w.r.t supported by other 
crypto driver?

> Performance gain can be observed in tests
> against OpenSSL PMD which also uses ARM
> crypto extensions for packets processing.
>
> Exemplary crypto performance tests comparison:
>
> cipher_hash. cipher algo: AES_CBC
> auth algo: SHA1_HMAC cipher key size=16.
> burst_size: 64 ops
>
> ARMv8 PMD improvement over OpenSSL PMD
> (Optimized for ARMv8 cipher only and hash
> only cases):
>
> Buffer
> Size(B)   OPS(M)      Throughput(Gbps)
> 64        729 %        742 %
> 128       577 %        592 %
> 256       483 %        476 %
> 512       336 %        351 %
> 768       300 %        286 %
> 1024      263 %        250 %
> 1280      225 %        229 %
> 1536      214 %        213 %
> 1792      186 %        203 %
> 2048      200 %        193 %
>
> The driver currently supports AES-128-CBC
> in combination with: SHA256 HMAC and SHA1 HMAC.
> The core crypto functionality of this driver is
> provided by the external armv8_crypto library
> that can be downloaded from the Cavium repository:
> https://github.com/caviumnetworks/armv8_crypto
>
> CPU compatibility with this virtual device
> is detected in run-time and virtual crypto
> device will not be created if CPU doesn't
> provide AES, SHA1, SHA2 and NEON.
>
> The functionality and performance of this
> code can be tested using generic test application
> with the following commands:
> * cryptodev_sw_armv8_autotest
> * cryptodev_sw_armv8_perftest
> New test vectors and cases have been added
> to the general pool. In particular SHA1 and
> SHA256 HMAC for short cases were introduced.
> This is because low-level ARM assembly code
> is using different code paths for long and
> short data sets, so in order to test the
> mentioned driver correctly, two different
> data sets need to be provided.
>
> ---
> v3:
> * Addressed review remarks
> * Moved low-level assembly code to the external library
> * Removed SHA256 MAC cases
> * Various fixes: interface to the library, digest destination
>   and source address interpreting, missing mbuf manipulations.
>
> v2:
> * Fixed checkpatch warnings
> * Divide patches into smaller logical parts
>
> Zbigniew Bodek (8):
>   mk: fix build of assembly files for ARM64
>   lib: add cryptodev type for the upcoming ARMv8 PMD
>   crypto/armv8: add PMD optimized for ARMv8 processors
>   mk/crypto/armv8: add PMD to the build system
>   doc/armv8: update documentation about crypto PMD
>   crypto/armv8: enable ARMv8 PMD in the configuration
>   crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
>   app/test: add ARMv8 crypto tests and test vectors
>
>  MAINTAINERS                                    |   6 +
>  app/test/test_cryptodev.c                      |  63 ++
>  app/test/test_cryptodev_aes_test_vectors.h     | 144 +++-
>  app/test/test_cryptodev_blockcipher.c          |   4 +
>  app/test/test_cryptodev_blockcipher.h          |   1 +
>  app/test/test_cryptodev_perf.c                 | 480 +++++++++++++
>  config/common_base                             |   6 +
>  doc/guides/cryptodevs/armv8.rst                |  96 +++
>  doc/guides/cryptodevs/index.rst                |   1 +
>  doc/guides/rel_notes/release_17_02.rst         |   5 +
>  drivers/crypto/Makefile                        |   1 +
>  drivers/crypto/armv8/Makefile                  |  73 ++
>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926 +++++++++++++++++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>  lib/librte_cryptodev/rte_cryptodev.h           |   3 +
>  mk/arch/arm64/rte.vars.mk                      |   1 -
>  mk/rte.app.mk                                  |   2 +
>  mk/toolchain/gcc/rte.vars.mk                   |   6 +-
>  20 files changed, 2390 insertions(+), 11 deletions(-)
>  create mode 100644 doc/guides/cryptodevs/armv8.rst
>  create mode 100644 drivers/crypto/armv8/Makefile
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 1/8] mk: fix build of assembly files for ARM64
  2017-01-04 17:33       ` [PATCH v3 1/8] mk: fix build of assembly files for ARM64 zbigniew.bodek
@ 2017-01-13  8:13         ` Hemant Agrawal
  0 siblings, 0 replies; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-13  8:13 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Avoid using incorrect assembler (nasm) and unsupported flags
> when building for ARM64.
>
> Fixes:	af75078fece3 ("first public release")
> 	b3ce00e5fe36 ("mk: introduce ARMv8 architecture")
>
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  mk/arch/arm64/rte.vars.mk    | 1 -
>  mk/toolchain/gcc/rte.vars.mk | 6 ++++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mk/arch/arm64/rte.vars.mk b/mk/arch/arm64/rte.vars.mk
> index c168426..3b1178a 100644
> --- a/mk/arch/arm64/rte.vars.mk
> +++ b/mk/arch/arm64/rte.vars.mk
> @@ -53,7 +53,6 @@ CROSS ?=
>
>  CPU_CFLAGS  ?=
>  CPU_LDFLAGS ?=
> -CPU_ASFLAGS ?= -felf
>
>  export ARCH CROSS CPU_CFLAGS CPU_LDFLAGS CPU_ASFLAGS
>
> diff --git a/mk/toolchain/gcc/rte.vars.mk b/mk/toolchain/gcc/rte.vars.mk
> index ff70f3d..94f6412 100644
> --- a/mk/toolchain/gcc/rte.vars.mk
> +++ b/mk/toolchain/gcc/rte.vars.mk
> @@ -41,9 +41,11 @@
>  CC        = $(CROSS)gcc
>  KERNELCC  = $(CROSS)gcc
>  CPP       = $(CROSS)cpp
> -# for now, we don't use as but nasm.
> -# AS      = $(CROSS)as
> +ifeq ($(CONFIG_RTE_ARCH_X86),y)
>  AS        = nasm
> +else
> +AS        = $(CROSS)as
> +endif
>  AR        = $(CROSS)ar
>  LD        = $(CROSS)ld
>  OBJCOPY   = $(CROSS)objcopy
>

you may add:
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>

on a side note=> This patch is not related to this patch series anymore.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-04 17:33       ` [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2017-01-13  8:16         ` Hemant Agrawal
  2017-01-13 15:50           ` Zbigniew Bodek
  2017-01-16  5:57           ` Jianbo Liu
  0 siblings, 2 replies; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-13  8:16 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Add type and name for ARMv8 crypto PMD
>
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  lib/librte_cryptodev/rte_cryptodev.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
> index 8f63e8f..6f34f22 100644
> --- a/lib/librte_cryptodev/rte_cryptodev.h
> +++ b/lib/librte_cryptodev/rte_cryptodev.h
> @@ -66,6 +66,8 @@
>  /**< KASUMI PMD device name */
>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
>  /**< KASUMI PMD device name */
> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
> +/**< ARMv8 Crypto PMD device name */
>
I will suggest the name as armv8ce or armv8_ce for this driver.
Do you agree?

>  /** Crypto device type */
>  enum rte_cryptodev_type {
> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
>  };
>
>  extern const char **rte_cyptodev_names;
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-04 17:33       ` [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  2017-01-12 10:48         ` De Lara Guarch, Pablo
@ 2017-01-13  9:28         ` Hemant Agrawal
  1 sibling, 0 replies; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-13  9:28 UTC (permalink / raw)
  To: zbigniew.bodek, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Introduce unit tests for ARMv8 crypto PMD.
> Add test vectors for short cases such as 160 bytes.
> These test cases are ARMv8 specific since the code provides
> different processing paths for different input data sizes.
>
> User can validate correctness of algorithms' implementation using:
> * cryptodev_sw_armv8_autotest
> For performance test one can use:
> * cryptodev_sw_armv8_perftest
>
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> ---
>  app/test/test_cryptodev.c                  |  63 ++++
>  app/test/test_cryptodev_aes_test_vectors.h | 144 ++++++++-
>  app/test/test_cryptodev_blockcipher.c      |   4 +
>  app/test/test_cryptodev_blockcipher.h      |   1 +
>  app/test/test_cryptodev_perf.c             | 480 +++++++++++++++++++++++++++++
>  5 files changed, 684 insertions(+), 8 deletions(-)
>
> diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
> index 872f8b4..a0540d6 100644
> --- a/app/test/test_cryptodev.c
> +++ b/app/test/test_cryptodev.c
............
> @@ -2422,6 +2449,136 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
>  	return TEST_SUCCESS;
>  }
>
> +static int
> +test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
> +{
> +	uint32_t num_to_submit = pparams->total_operations;
> +	struct rte_crypto_op *c_ops[num_to_submit];
> +	struct rte_crypto_op *proc_ops[num_to_submit];
> +	uint64_t failed_polls, retries, start_cycles, end_cycles,
> +		 total_cycles = 0;
> +	uint32_t burst_sent = 0, burst_received = 0;
> +	uint32_t i, burst_size, num_sent, num_ops_received;
> +
> +	struct crypto_testsuite_params *ts_params = &testsuite_params;
> +
> +	static struct rte_cryptodev_sym_session *sess;
> +
> +	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
> +
> +	if (rte_cryptodev_count() == 0) {
> +		printf("\nNo crypto devices found. Is PMD build configured?\n");
> +		return TEST_FAILED;
> +	}
> +
> +	/* Create Crypto session*/
> +	sess = test_perf_create_armv8_session(ts_params->dev_id,
> +			pparams->chain, pparams->cipher_algo,
> +			pparams->cipher_key_length, pparams->auth_algo);
> +	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
> +
> +	/* Generate Crypto op data structure(s)*/
> +	for (i = 0; i < num_to_submit ; i++) {
> +		struct rte_mbuf *m = test_perf_create_pktmbuf(
> +						ts_params->mbuf_mp,
> +						pparams->buf_size);
> +		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
> +
> +		struct rte_crypto_op *op =
> +				rte_crypto_op_alloc(ts_params->op_mpool,
> +						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
> +		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
> +
> +		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
> +				digest_length);
> +		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
> +
> +		c_ops[i] = op;
> +	}
> +
> +	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
> +			"auth_algo:%s, Packet Size %u bytes",
> +			pmd_name(gbl_cryptodev_perftest_devtype),
> +			ts_params->dev_id, 0,
> +			chain_mode_name(pparams->chain),
> +			cipher_algo_name(pparams->cipher_algo),
> +			pparams->cipher_key_length,
> +			auth_algo_name(pparams->auth_algo),
> +			pparams->buf_size);
> +	printf("\nOps Tx\tOps Rx\tOps/burst  ");
> +	printf("Retries  "
> +		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
> +
> +	for (i = 2; i <= 128 ; i *= 2) {
> +		num_sent = 0;
> +		num_ops_received = 0;
> +		retries = 0;
> +		failed_polls = 0;
> +		burst_size = i;
> +		total_cycles = 0;
> +		while (num_sent < num_to_submit) {
> +			start_cycles = rte_rdtsc_precise();
> +			burst_sent = rte_cryptodev_enqueue_burst(
> +				ts_params->dev_id,
> +				0, &c_ops[num_sent],
> +				((num_to_submit - num_sent) < burst_size) ?
> +				num_to_submit - num_sent : burst_size);
> +			end_cycles = rte_rdtsc_precise();
> +			if (burst_sent == 0)
> +				retries++;
> +			num_sent += burst_sent;
> +			total_cycles += (end_cycles - start_cycles);
> +
> +			/* Wait until requests have been sent. */
> +			rte_delay_ms(1);
> +
you may remove this delay.

> +			start_cycles = rte_rdtsc_precise();
> +			burst_received = rte_cryptodev_dequeue_burst(
> +					ts_params->dev_id, 0, proc_ops,
> +					burst_size);
> +			end_cycles = rte_rdtsc_precise();
> +			if (burst_received < burst_sent)
> +				failed_polls++;
> +			num_ops_received += burst_received;
> +
> +			total_cycles += end_cycles - start_cycles;
> +		}
> +
> +		while (num_ops_received != num_to_submit) {
> +			/* Sending 0 length burst to flush sw crypto device */
> +			rte_cryptodev_enqueue_burst(
> +						ts_params->dev_id, 0, NULL, 0);
> +
> +			start_cycles = rte_rdtsc_precise();
> +			burst_received = rte_cryptodev_dequeue_burst(
> +				ts_params->dev_id, 0, proc_ops, burst_size);
> +			end_cycles = rte_rdtsc_precise();
> +
> +			total_cycles += end_cycles - start_cycles;
> +			if (burst_received == 0)
> +				failed_polls++;
> +			num_ops_received += burst_received;
> +		}
> +
> +		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
> +		printf("\t\t%"PRIu64, retries);
> +		printf("\t%"PRIu64, failed_polls);
> +		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
> +		printf("\t\t%"PRIu64,
> +			(total_cycles/num_ops_received)*burst_size);
> +		printf("\t\t%"PRIu64,
> +			total_cycles/(num_ops_received*pparams->buf_size));
> +	}
> +	printf("\n");
> +
> +	for (i = 0; i < num_to_submit ; i++) {
> +		rte_pktmbuf_free(c_ops[i]->sym->m_src);
> +		rte_crypto_op_free(c_ops[i]);
> +	}
> +
> +	return TEST_SUCCESS;
> +}
> +
>  static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
>  {
>  	switch (algo) {
> @@ -2683,6 +2840,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
>  	}
>  }
>
> +static struct rte_cryptodev_sym_session *
> +test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
> +		enum rte_crypto_cipher_algorithm cipher_algo,
> +		unsigned int cipher_key_len,
> +		enum rte_crypto_auth_algorithm auth_algo)
> +{
> +	struct rte_crypto_sym_xform cipher_xform = { 0 };
> +	struct rte_crypto_sym_xform auth_xform = { 0 };
> +
> +	/* Setup Cipher Parameters */
> +	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
> +	cipher_xform.cipher.algo = cipher_algo;
> +
> +	switch (cipher_algo) {
> +	case RTE_CRYPTO_CIPHER_AES_CBC:
> +		cipher_xform.cipher.key.data = aes_cbc_128_key;
> +		break;
> +	default:
> +		return NULL;
> +	}
> +
> +	cipher_xform.cipher.key.length = cipher_key_len;
> +
> +	/* Setup Auth Parameters */
> +	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
> +	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
> +	auth_xform.auth.algo = auth_algo;
> +
> +	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
> +
> +	switch (chain) {
> +	case CIPHER_HASH:
> +		cipher_xform.next = &auth_xform;
> +		auth_xform.next = NULL;
> +		/* Encrypt and hash the result */
> +		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
> +		/* Create Crypto session*/
> +		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
> +	case HASH_CIPHER:
> +		auth_xform.next = &cipher_xform;
> +		cipher_xform.next = NULL;
> +		/* Hash encrypted message and decrypt */
> +		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
> +		/* Create Crypto session*/
> +		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
> +	default:
> +		return NULL;
> +	}
> +}
> +
>  #define AES_BLOCK_SIZE 16
>  #define AES_CIPHER_IV_LENGTH 16
>
> @@ -3356,6 +3563,138 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
>  	return TEST_SUCCESS;
>  }
>
> +static int
> +test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
> +		struct perf_test_params *pparams)
> +{
> +	uint16_t i, k, l, m;
> +	uint16_t j = 0;
> +	uint16_t ops_unused = 0;
> +	uint16_t burst_size;
> +	uint16_t ops_needed;
> +
> +	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
> +	uint64_t processed = 0, failed_polls = 0, retries = 0;
> +	uint64_t tsc_start = 0, tsc_end = 0;
> +
> +	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
> +
> +	struct rte_crypto_op *ops[pparams->burst_size];
> +	struct rte_crypto_op *proc_ops[pparams->burst_size];
> +
> +	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
> +
> +	struct crypto_testsuite_params *ts_params = &testsuite_params;
> +
> +	static struct rte_cryptodev_sym_session *sess;
> +
> +	if (rte_cryptodev_count() == 0) {
> +		printf("\nNo crypto devices found. Is PMD build configured?\n");
> +		return TEST_FAILED;
> +	}
> +
> +	/* Create Crypto session*/
> +	sess = test_perf_create_armv8_session(ts_params->dev_id,
> +			pparams->chain, pparams->cipher_algo,
> +			pparams->cipher_key_length, pparams->auth_algo);
> +	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
> +
> +	/* Generate a burst of crypto operations */
> +	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
> +		mbufs[i] = test_perf_create_pktmbuf(
> +				ts_params->mbuf_mp,
> +				pparams->buf_size);
> +
> +		if (mbufs[i] == NULL) {
> +			printf("\nFailed to get mbuf - freeing the rest.\n");
> +			for (k = 0; k < i; k++)
> +				rte_pktmbuf_free(mbufs[k]);
> +			return -1;
> +		}
> +	}
> +
> +	tsc_start = rte_rdtsc_precise();
> +
> +	while (total_enqueued < pparams->total_operations) {
> +		if ((total_enqueued + pparams->burst_size) <=
> +					pparams->total_operations)
> +			burst_size = pparams->burst_size;
> +		else
> +			burst_size = pparams->total_operations - total_enqueued;
> +
> +		ops_needed = burst_size - ops_unused;
> +
> +		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
> +				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
> +			printf("\nFailed to alloc enough ops, finish dequeuing "
> +				"and free ops below.");
> +		} else {
> +			for (i = 0; i < ops_needed; i++)
> +				ops[i] = test_perf_set_crypto_op_aes(ops[i],
> +					mbufs[i + (pparams->burst_size *
> +						(j % NUM_MBUF_SETS))],
> +					sess, pparams->buf_size, digest_length);
> +
> +			/* enqueue burst */
> +			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
> +					queue_id, ops, burst_size);
> +
> +			if (burst_enqueued < burst_size)
> +				retries++;
> +
> +			ops_unused = burst_size - burst_enqueued;
> +			total_enqueued += burst_enqueued;
> +		}
> +
> +		/* dequeue burst */
> +		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
> +				proc_ops, pparams->burst_size);
> +		if (burst_dequeued == 0)
> +			failed_polls++;
> +		else {
> +			processed += burst_dequeued;
> +
> +			for (l = 0; l < burst_dequeued; l++)
> +				rte_crypto_op_free(proc_ops[l]);
> +		}
> +		j++;
> +	}
> +
> +	/* Dequeue any operations still in the crypto device */
> +	while (processed < pparams->total_operations) {
> +		/* Sending 0 length burst to flush sw crypto device */
> +		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
> +
> +		/* dequeue burst */
> +		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
> +				proc_ops, pparams->burst_size);
> +		if (burst_dequeued == 0)
> +			failed_polls++;
> +		else {
> +			processed += burst_dequeued;
> +
> +			for (m = 0; m < burst_dequeued; m++)
> +				rte_crypto_op_free(proc_ops[m]);
> +		}
> +	}
> +
> +	tsc_end = rte_rdtsc_precise();
> +
> +	double ops_s = ((double)processed / (tsc_end - tsc_start))
> +					* rte_get_tsc_hz();
> +	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
> +					/ 1000000000;
> +
> +	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
> +			ops_s / 1000000, throughput, retries, failed_polls);
> +
> +	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
> +		rte_pktmbuf_free(mbufs[i]);
> +
> +	printf("\n");
> +	return TEST_SUCCESS;
> +}
> +
>  /*
>
>      perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
> @@ -3664,6 +4003,125 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
>  }
>
>  static int
> +test_perf_armv8_vary_pkt_size(void)
> +{
> +	unsigned int total_operations = 100000;
> +	unsigned int burst_size = { 64 };
> +	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
> +			1792, 2048 };
> +	uint8_t i, j;
> +
> +	struct perf_test_params params_set[] = {
> +		{
> +			.chain = CIPHER_HASH,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
> +		},
> +		{
> +			.chain = HASH_CIPHER,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
> +		},
> +		{
> +			.chain = CIPHER_HASH,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
> +		},
> +		{
> +			.chain = HASH_CIPHER,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
> +		},
> +	};
> +
> +	for (i = 0; i < RTE_DIM(params_set); i++) {
> +		params_set[i].total_operations = total_operations;
> +		params_set[i].burst_size = burst_size;
> +		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
> +				" burst_size: %d ops\n",
> +				chain_mode_name(params_set[i].chain),
> +				cipher_algo_name(params_set[i].cipher_algo),
> +				auth_algo_name(params_set[i].auth_algo),
> +				params_set[i].cipher_key_length,
> +				burst_size);
> +		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
> +				"EmptyPolls\n");
> +		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
> +			params_set[i].buf_size = buf_lengths[j];
> +			test_perf_armv8(testsuite_params.dev_id, 0,
> +							&params_set[i]);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +test_perf_armv8_vary_burst_size(void)
> +{
> +	unsigned int total_operations = 4096;
> +	uint16_t buf_lengths[] = { 64 };
> +	uint8_t i, j;
> +
> +	struct perf_test_params params_set[] = {
> +		{
> +			.chain = CIPHER_HASH,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
> +		},
> +		{
> +			.chain = HASH_CIPHER,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
> +		},
> +		{
> +			.chain = CIPHER_HASH,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
> +		},
> +		{
> +			.chain = HASH_CIPHER,
> +
> +			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
> +			.cipher_key_length = 16,
> +			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
> +		},
> +	};
> +
> +	printf("\n\nStart %s.", __func__);
> +	printf("\nThis Test measures the average IA cycle cost using a "
> +			"constant request(packet) size. ");
> +	printf("Cycle cost is only valid when indicators show device is "
> +			"not busy, i.e. Retries and EmptyPolls = 0");
> +
> +	for (i = 0; i < RTE_DIM(params_set); i++) {
> +		printf("\n");
> +		params_set[i].total_operations = total_operations;
> +
> +		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
> +			params_set[i].buf_size = buf_lengths[j];
> +			test_perf_armv8_optimise_cyclecount(&params_set[i]);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int
>  test_perf_aes_cbc_vary_burst_size(void)
>  {
>  	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
> @@ -4214,6 +4672,19 @@ static int test_continual_perf_AES_GCM(void)
>  	}
>  };
>
> +static struct unit_test_suite cryptodev_armv8_testsuite  = {
> +	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
> +	.setup = testsuite_setup,
> +	.teardown = testsuite_teardown,
> +	.unit_test_cases = {
> +		TEST_CASE_ST(ut_setup, ut_teardown,
> +				test_perf_armv8_vary_pkt_size),
> +		TEST_CASE_ST(ut_setup, ut_teardown,
> +				test_perf_armv8_vary_burst_size),
> +		TEST_CASES_END() /**< NULL terminate unit test array */
> +	}
> +};
> +
>  static int
>  perftest_aesni_gcm_cryptodev(void)
>  {
> @@ -4270,6 +4741,14 @@ static int test_continual_perf_AES_GCM(void)
>  	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
>  }
>
> +static int
> +perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
> +{
> +	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
> +
> +	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
> +}
> +
>  REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
>  REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
>  REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
> @@ -4279,3 +4758,4 @@ static int test_continual_perf_AES_GCM(void)
>  		perftest_openssl_cryptodev);
>  REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
>  		perftest_qat_continual_cryptodev);
> +REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest, perftest_sw_armv8_cryptodev);
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-13  8:16         ` Hemant Agrawal
@ 2017-01-13 15:50           ` Zbigniew Bodek
  2017-01-16  5:57           ` Jianbo Liu
  1 sibling, 0 replies; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-13 15:50 UTC (permalink / raw)
  To: Hemant Agrawal, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob



On 13.01.2017 09:16, Hemant Agrawal wrote:
> On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Add type and name for ARMv8 crypto PMD
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>> ---
>>  lib/librte_cryptodev/rte_cryptodev.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/lib/librte_cryptodev/rte_cryptodev.h
>> b/lib/librte_cryptodev/rte_cryptodev.h
>> index 8f63e8f..6f34f22 100644
>> --- a/lib/librte_cryptodev/rte_cryptodev.h
>> +++ b/lib/librte_cryptodev/rte_cryptodev.h
>> @@ -66,6 +66,8 @@
>>  /**< KASUMI PMD device name */
>>  #define CRYPTODEV_NAME_ZUC_PMD        crypto_zuc
>>  /**< KASUMI PMD device name */
>> +#define CRYPTODEV_NAME_ARMV8_PMD    crypto_armv8
>> +/**< ARMv8 Crypto PMD device name */
>>
> I will suggest the name as armv8ce or armv8_ce for this driver.
> Do you agree?

CE for "crypto extensions"?
Agreed.

>
>>  /** Crypto device type */
>>  enum rte_cryptodev_type {
>> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>>      RTE_CRYPTODEV_KASUMI_PMD,    /**< KASUMI PMD */
>>      RTE_CRYPTODEV_ZUC_PMD,        /**< ZUC PMD */
>>      RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
>> +    RTE_CRYPTODEV_ARMV8_PMD,    /**< ARMv8 crypto PMD */
>>  };
>>
>>  extern const char **rte_cyptodev_names;
>>
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-13  8:07       ` Hemant Agrawal
@ 2017-01-13 18:59         ` Zbigniew Bodek
  2017-01-16  6:57           ` Hemant Agrawal
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-13 18:59 UTC (permalink / raw)
  To: Hemant Agrawal, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

Hello Hemant,

Thank you for your remarks and comments. Please check my answer below.

Kind regards
Zbigniew

On 13.01.2017 09:07, Hemant Agrawal wrote:
> On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Introduce crypto poll mode driver using ARMv8
>> cryptographic extensions. This PMD is optimized
>> to provide performance boost for chained
>> crypto operations processing, such as:
>> * encryption + HMAC generation
>> * decryption + HMAC validation.
>> In particular, cipher only or hash only
>> operations are not provided.
>
> Do you have a plan to add the crypto only, auth/hash only support into
> this driver?

OpenSSL driver is already implementing that and it is optimized for ARMv8.

> Also, do you plan to add additional cases w.r.t supported by other
> crypto driver?

We may do it in the future but this depends on our resource availability.

>
>> Performance gain can be observed in tests
>> against OpenSSL PMD which also uses ARM
>> crypto extensions for packets processing.
>>
>> Exemplary crypto performance tests comparison:
>>
>> cipher_hash. cipher algo: AES_CBC
>> auth algo: SHA1_HMAC cipher key size=16.
>> burst_size: 64 ops
>>
>> ARMv8 PMD improvement over OpenSSL PMD
>> (Optimized for ARMv8 cipher only and hash
>> only cases):
>>
>> Buffer
>> Size(B)   OPS(M)      Throughput(Gbps)
>> 64        729 %        742 %
>> 128       577 %        592 %
>> 256       483 %        476 %
>> 512       336 %        351 %
>> 768       300 %        286 %
>> 1024      263 %        250 %
>> 1280      225 %        229 %
>> 1536      214 %        213 %
>> 1792      186 %        203 %
>> 2048      200 %        193 %
>>
>> The driver currently supports AES-128-CBC
>> in combination with: SHA256 HMAC and SHA1 HMAC.
>> The core crypto functionality of this driver is
>> provided by the external armv8_crypto library
>> that can be downloaded from the Cavium repository:
>> https://github.com/caviumnetworks/armv8_crypto
>>
>> CPU compatibility with this virtual device
>> is detected in run-time and virtual crypto
>> device will not be created if CPU doesn't
>> provide AES, SHA1, SHA2 and NEON.
>>
>> The functionality and performance of this
>> code can be tested using generic test application
>> with the following commands:
>> * cryptodev_sw_armv8_autotest
>> * cryptodev_sw_armv8_perftest
>> New test vectors and cases have been added
>> to the general pool. In particular SHA1 and
>> SHA256 HMAC for short cases were introduced.
>> This is because low-level ARM assembly code
>> is using different code paths for long and
>> short data sets, so in order to test the
>> mentioned driver correctly, two different
>> data sets need to be provided.
>>
>> ---
>> v3:
>> * Addressed review remarks
>> * Moved low-level assembly code to the external library
>> * Removed SHA256 MAC cases
>> * Various fixes: interface to the library, digest destination
>>   and source address interpreting, missing mbuf manipulations.
>>
>> v2:
>> * Fixed checkpatch warnings
>> * Divide patches into smaller logical parts
>>
>> Zbigniew Bodek (8):
>>   mk: fix build of assembly files for ARM64
>>   lib: add cryptodev type for the upcoming ARMv8 PMD
>>   crypto/armv8: add PMD optimized for ARMv8 processors
>>   mk/crypto/armv8: add PMD to the build system
>>   doc/armv8: update documentation about crypto PMD
>>   crypto/armv8: enable ARMv8 PMD in the configuration
>>   crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
>>   app/test: add ARMv8 crypto tests and test vectors
>>
>>  MAINTAINERS                                    |   6 +
>>  app/test/test_cryptodev.c                      |  63 ++
>>  app/test/test_cryptodev_aes_test_vectors.h     | 144 +++-
>>  app/test/test_cryptodev_blockcipher.c          |   4 +
>>  app/test/test_cryptodev_blockcipher.h          |   1 +
>>  app/test/test_cryptodev_perf.c                 | 480 +++++++++++++
>>  config/common_base                             |   6 +
>>  doc/guides/cryptodevs/armv8.rst                |  96 +++
>>  doc/guides/cryptodevs/index.rst                |   1 +
>>  doc/guides/rel_notes/release_17_02.rst         |   5 +
>>  drivers/crypto/Makefile                        |   1 +
>>  drivers/crypto/armv8/Makefile                  |  73 ++
>>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926
>> +++++++++++++++++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>>  lib/librte_cryptodev/rte_cryptodev.h           |   3 +
>>  mk/arch/arm64/rte.vars.mk                      |   1 -
>>  mk/rte.app.mk                                  |   2 +
>>  mk/toolchain/gcc/rte.vars.mk                   |   6 +-
>>  20 files changed, 2390 insertions(+), 11 deletions(-)
>>  create mode 100644 doc/guides/cryptodevs/armv8.rst
>>  create mode 100644 drivers/crypto/armv8/Makefile
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>>
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-13  7:41             ` Jianbo Liu
@ 2017-01-13 19:09               ` Zbigniew Bodek
  0 siblings, 0 replies; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-13 19:09 UTC (permalink / raw)
  To: Jianbo Liu; +Cc: dev, pablo.de.lara.guarch, Declan Doherty, Jerin Jacob



On 13.01.2017 08:41, Jianbo Liu wrote:
> On 12 January 2017 at 21:12, Zbigniew Bodek
> <zbigniew.bodek@caviumnetworks.com> wrote:
>> Hello  Jianbo Liu,
>>
>> Thanks for the review. Please check my answers in-line.
>>
>> Kind regards
>> Zbigniew
>>
>>
>> On 06.01.2017 03:45, Jianbo Liu wrote:
>>>
>>> On 5 January 2017 at 01:33,  <zbigniew.bodek@caviumnetworks.com> wrote:
>>>>
>>>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>>>
>>>> This patch introduces crypto poll mode driver
>>>> using ARMv8 cryptographic extensions.
>>>> CPU compatibility with this driver is detected in
>>>> run-time and virtual crypto device will not be
>>>> created if CPU doesn't provide:
>>>> AES, SHA1, SHA2 and NEON.
>>>>
>>>> This PMD is optimized to provide performance boost
>>>> for chained crypto operations processing,
>>>> such as encryption + HMAC generation,
>>>> decryption + HMAC validation. In particular,
>>>> cipher only or hash only operations are
>>>> not provided.
>>>>
>>>> The driver currently supports AES-128-CBC
>>>> in combination with: SHA256 HMAC and SHA1 HMAC
>>>> and relies on the external armv8_crypto library:
>>>> https://github.com/caviumnetworks/armv8_crypto
>>>>
>>>
>>> It's standalone lib. I think you should change the following line in
>>> its Makefile, so not depend on DPDK.
>>> "include $(RTE_SDK)/mk/rte.lib.mk"
>>>
>>>> This patch adds driver's code only and does
>>>> not include it in the build system.
>>>>
>>>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>>> ---
>>>>  drivers/crypto/armv8/Makefile                  |  73 ++
>>>>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926
>>>> +++++++++++++++++++++++++
>>>>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>>>>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>>>>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>>>>  5 files changed, 1582 insertions(+)
>>>>  create mode 100644 drivers/crypto/armv8/Makefile
>>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>>>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>>>>
> .....
>
>>>> +       /* Select auth algo */
>>>> +       switch (auth_xform->auth.algo) {
>>>> +       /* Cover supported hash algorithms */
>>>> +       case RTE_CRYPTO_AUTH_SHA256:
>>>> +               aalg = auth_xform->auth.algo;
>>>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
>>>> +               break;
>>>> +       case RTE_CRYPTO_AUTH_SHA1_HMAC:
>>>> +       case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
>>>> +               aalg = auth_xform->auth.algo;
>>>> +               sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
>>>> +               break;
>>>> +       default:
>>>> +               return -EINVAL;
>>>> +       }
>>>> +
>>>> +       /* Verify supported key lengths and extract proper algorithm */
>>>> +       switch (cipher_xform->cipher.key.length << 3) {
>>>> +       case 128:
>>>> +               sess->crypto_func =
>>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>>> 128);
>>>> +               sess->cipher.key_sched =
>>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 128);
>>>> +               break;
>>>> +       case 192:
>>>> +               sess->crypto_func =
>>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>>> 192);
>>>> +               sess->cipher.key_sched =
>>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 192);
>>>> +               break;
>>>> +       case 256:
>>>> +               sess->crypto_func =
>>>> +                               CRYPTO_GET_ALGO(order, cop, calg, aalg,
>>>> 256);
>>>> +               sess->cipher.key_sched =
>>>> +                               CRYPTO_GET_KEY_SCHED(cop, calg, 256);
>>>> +               break;
>>>> +       default:
>>>> +               sess->crypto_func = NULL;
>>>> +               sess->cipher.key_sched = NULL;
>>>> +               return -EINVAL;
>>>> +       }
>>>> +
>>>> +       if (unlikely(sess->crypto_func == NULL)) {
>>>> +               /*
>>>> +                * If we got here that means that there must be a bug
>>>
>>>
>>> Since AES-128-CBC is only supported in your patch. It means that
>>> crypto_func could be NULL according to the switch above if
>>> cipher.key.length > 128?
>>
>>
>> Yes. Instead of checking for key lengths in a similar way that we check for
>> algorithms, etc. we just fail when we don't find appropriate function. Do
>> you suggest that this should be changed?
>>
>
> I mean to return error directly if length is not 128 in the above
> switch, so this "if" is no necessary.

OK. Done. Will resend patches soon.

>
>>
>>>
>>>> +                * in the algorithms selection above. Nevertheless keep
>>>> +                * it here to catch bug immediately and avoid NULL
>>>> pointer
>>>> +                * dereference in OPs processing.
>>>> +                */
>>>> +               ARMV8_CRYPTO_LOG_ERR(
>>>> +                       "No appropriate crypto function for given
>>>> parameters");
>>>> +               return -EINVAL;
>>>> +       }
>>>> +
>>>> +       /* Set up cipher session prerequisites */
>>>> +       if (cipher_set_prerequisites(sess, cipher_xform) != 0)
>>>> +               return -EINVAL;
>>>> +
>>>> +       /* Set up authentication session prerequisites */
>>>> +       if (auth_set_prerequisites(sess, auth_xform) != 0)
>>>> +               return -EINVAL;
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>
>>>
>>> ....
>>>
>>>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>>> b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>>> new file mode 100644
>>>> index 0000000..2bf6475
>>>> --- /dev/null
>>>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>>> @@ -0,0 +1,369 @@
>>>> +/*
>>>> + *   BSD LICENSE
>>>> + *
>>>> + *   Copyright (C) Cavium networks Ltd. 2017.
>>>> + *
>>>> + *   Redistribution and use in source and binary forms, with or without
>>>> + *   modification, are permitted provided that the following conditions
>>>> + *   are met:
>>>> + *
>>>> + *     * Redistributions of source code must retain the above copyright
>>>> + *       notice, this list of conditions and the following disclaimer.
>>>> + *     * Redistributions in binary form must reproduce the above
>>>> copyright
>>>> + *       notice, this list of conditions and the following disclaimer in
>>>> + *       the documentation and/or other materials provided with the
>>>> + *       distribution.
>>>> + *     * Neither the name of Cavium networks nor the names of its
>>>> + *       contributors may be used to endorse or promote products derived
>>>> + *       from this software without specific prior written permission.
>>>> + *
>>>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>>>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>>>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>>>> FOR
>>>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>>>> COPYRIGHT
>>>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>>>> INCIDENTAL,
>>>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>>>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>>>> USE,
>>>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
>>>> ANY
>>>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>>>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
>>>> USE
>>>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>>>> DAMAGE.
>>>> + */
>>>> +
>>>> +#include <string.h>
>>>> +
>>>> +#include <rte_common.h>
>>>> +#include <rte_malloc.h>
>>>> +#include <rte_cryptodev_pmd.h>
>>>> +
>>>> +#include "armv8_crypto_defs.h"
>>>> +
>>>> +#include "rte_armv8_pmd_private.h"
>>>> +
>>>> +static const struct rte_cryptodev_capabilities
>>>> +       armv8_crypto_pmd_capabilities[] = {
>>>> +       {       /* SHA1 HMAC */
>>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>>> +                       {.sym = {
>>>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>>>> +                               {.auth = {
>>>> +                                       .algo =
>>>> RTE_CRYPTO_AUTH_SHA1_HMAC,
>>>> +                                       .block_size = 64,
>>>> +                                       .key_size = {
>>>> +                                               .min = 16,
>>>> +                                               .max = 128,
>>>> +                                               .increment = 0
>>>> +                                       },
>>>> +                                       .digest_size = {
>>>> +                                               .min = 20,
>>>> +                                               .max = 20,
>>>> +                                               .increment = 0
>>>> +                                       },
>>>> +                                       .aad_size = { 0 }
>>>> +                               }, }
>>>> +                       }, }
>>>> +       },
>>>> +       {       /* SHA256 HMAC */
>>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>>> +                       {.sym = {
>>>> +                               .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>>>> +                               {.auth = {
>>>> +                                       .algo =
>>>> RTE_CRYPTO_AUTH_SHA256_HMAC,
>>>> +                                       .block_size = 64,
>>>> +                                       .key_size = {
>>>> +                                               .min = 16,
>>>> +                                               .max = 128,
>>>> +                                               .increment = 0
>>>> +                                       },
>>>> +                                       .digest_size = {
>>>> +                                               .min = 32,
>>>> +                                               .max = 32,
>>>> +                                               .increment = 0
>>>> +                                       },
>>>> +                                       .aad_size = { 0 }
>>>> +                               }, }
>>>> +                       }, }
>>>> +       },
>>>> +       {       /* AES CBC */
>>>> +               .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>>>> +                       {.sym = {
>>>> +                               .xform_type =
>>>> RTE_CRYPTO_SYM_XFORM_CIPHER,
>>>> +                               {.cipher = {
>>>> +                                       .algo =
>>>> RTE_CRYPTO_CIPHER_AES_CBC,
>>>> +                                       .block_size = 16,
>>>> +                                       .key_size = {
>>>> +                                               .min = 16,
>>>> +                                               .max = 16,
>>>> +                                               .increment = 0
>>>> +                                       },
>>>> +                                       .iv_size = {
>>>> +                                               .min = 16,
>>>> +                                               .max = 16,
>>>> +                                               .increment = 0
>>>> +                                       }
>>>> +                               }, }
>>>> +                       }, }
>>>> +       },
>>>> +
>>>
>>>
>>> It's strange that you defined aes and hmac here, but not implemented
>>> them, though their combinations are implemented.
>>> Will you add later?
>>
>>
>> We may add standalone algorithms in the future but those ops here are not
>> for that purpose. I thought that since there is no chained operations
>> capability we should export what we can do even though that it will work
>> (mean not return error) only if the operations are chained.
>> Do you have some other suggestion?
>>
>
> Nothing special. Either implement them later, or add new chained ops
> (is that possible?)
> BTW, can you explain what optimization you have done, so I can better
> understand your asm code, thanks!

Yes. The optimized assembly code is utilizing locality of reference 
while doing encryption/decryption as well as hash at the same time 
rather than one after another. Also, significant parts of the code are 
arranged for best instructions pipelining.
Some parts of the implementation such as key schedule are written in a 
way that uses NEON and crypto instructions to speed up operations needed 
for key expansion.

>
>>
>>>
>>>> +       RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
>>>> +};
>>>> +
>>>> +

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-13  7:57         ` Hemant Agrawal
@ 2017-01-13 19:15           ` Zbigniew Bodek
  0 siblings, 0 replies; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-13 19:15 UTC (permalink / raw)
  To: Hemant Agrawal; +Cc: dev, pablo.de.lara.guarch, Jerin Jacob



On 13.01.2017 08:57, Hemant Agrawal wrote:
> On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> This patch introduces crypto poll mode driver
>> using ARMv8 cryptographic extensions.
>> CPU compatibility with this driver is detected in
>> run-time and virtual crypto device will not be
>> created if CPU doesn't provide:
>> AES, SHA1, SHA2 and NEON.
>>
>> This PMD is optimized to provide performance boost
>> for chained crypto operations processing,
>> such as encryption + HMAC generation,
>> decryption + HMAC validation. In particular,
>> cipher only or hash only operations are
>> not provided.
>>
>> The driver currently supports AES-128-CBC
>> in combination with: SHA256 HMAC and SHA1 HMAC
>> and relies on the external armv8_crypto library:
>> https://github.com/caviumnetworks/armv8_crypto
>>
>> This patch adds driver's code only and does
>> not include it in the build system.
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>> ---
>>  drivers/crypto/armv8/Makefile                  |  73 ++
>>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926
>> +++++++++++++++++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
>>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
>>  5 files changed, 1582 insertions(+)
>>  create mode 100644 drivers/crypto/armv8/Makefile
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
>>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
>>
>> diff --git a/drivers/crypto/armv8/Makefile
>> b/drivers/crypto/armv8/Makefile
>> new file mode 100644
>> index 0000000..dc5ea02
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/Makefile
>> @@ -0,0 +1,73 @@
>> +#
>> +#   BSD LICENSE
>> +#
>> +#   Copyright (C) Cavium networks Ltd. 2017.
>> +#
>> +#   Redistribution and use in source and binary forms, with or without
>> +#   modification, are permitted provided that the following conditions
>> +#   are met:
>> +#
>> +#     * Redistributions of source code must retain the above copyright
>> +#       notice, this list of conditions and the following disclaimer.
>> +#     * Redistributions in binary form must reproduce the above
>> copyright
>> +#       notice, this list of conditions and the following disclaimer in
>> +#       the documentation and/or other materials provided with the
>> +#       distribution.
>> +#     * Neither the name of Cavium networks nor the names of its
>> +#       contributors may be used to endorse or promote products derived
>> +#       from this software without specific prior written permission.
>> +#
>> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>> FOR
>> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>> USE,
>> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
>> ANY
>> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
>> USE
>> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>> +#
>> +
>> +include $(RTE_SDK)/mk/rte.vars.mk
>> +
>> +ifneq ($(MAKECMDGOALS),clean)
>> +ifneq ($(MAKECMDGOALS),config)
>> +ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
>> +$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
>> +endif
>> +endif
>> +endif
>> +
>> +# library name
>> +LIB = librte_pmd_armv8.a
>> +
>> +# build flags
>> +CFLAGS += -O3
>> +CFLAGS += $(WERROR_FLAGS)
>> +CFLAGS += -L$(RTE_SDK)/../openssl -I$(RTE_SDK)/../openssl/include
>> +
>> +# library version
>> +LIBABIVER := 1
>> +
>> +# versioning export map
>> +EXPORT_MAP := rte_armv8_pmd_version.map
>> +
>> +# external library dependencies
>> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
>> +CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
>> +LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
>> +
>> +# library source files
>> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
>> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
>> +
>> +# library dependencies
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
>> +
>> +include $(RTE_SDK)/mk/rte.lib.mk
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c
>> b/drivers/crypto/armv8/rte_armv8_pmd.c
>> new file mode 100644
>> index 0000000..39433bb
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd.c
>> @@ -0,0 +1,926 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above
>> copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Cavium networks nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>> USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
>> THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> DAMAGE.
>> + */
>> +
>> +#include <stdbool.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_hexdump.h>
>> +#include <rte_cryptodev.h>
>> +#include <rte_cryptodev_pmd.h>
>> +#include <rte_vdev.h>
>> +#include <rte_malloc.h>
>> +#include <rte_cpuflags.h>
>> +
>> +#include "armv8_crypto_defs.h"
>> +
>> +#include "rte_armv8_pmd_private.h"
>> +
>> +static int cryptodev_armv8_crypto_uninit(const char *name);
>> +
>> +/**
>> + * Pointers to the supported combined mode crypto functions are stored
>> + * in the static tables. Each combined (chained) cryptographic operation
>> + * can be decribed by a set of numbers:
>
> replace "decribed" with "described"

Thanks. Done.
>
>> + * - order:    order of operations (cipher, auth) or (auth, cipher)
>> + * - direction:    encryption or decryption
>> + * - calg:    cipher algorithm such as AES_CBC, AES_CTR, etc.
>> + * - aalg:    authentication algorithm such as SHA1, SHA256, etc.
>> + * - keyl:    cipher key length, for example 128, 192, 256 bits
>> + *
>> + * In order to quickly acquire each function pointer based on those
>> numbers,
>> + * a hierarchy of arrays is maintained. The final level, 3D array is
>> indexed
>> + * by the combined mode function parameters only (cipher algorithm,
>> + * authentication algorithm and key length).
>> + *
>> + * This gives 3 memory accesses to obtain a function pointer instead of
>> + * traversing the array manually and comparing function parameters on
>> each loop.
>> + *
>> + *                   +--+CRYPTO_FUNC
>> + *            +--+ENC|
>> + *      +--+CA|
>> + *      |     +--+DEC
>> + * ORDER|
>> + *      |     +--+ENC
>> + *      +--+AC|
>> + *            +--+DEC
>> + *
>> + */
>> +
>> +/**
>> + * 3D array type for ARM Combined Mode crypto functions pointers.
>> + * CRYPTO_CIPHER_MAX:            max cipher ID number
>> + * CRYPTO_AUTH_MAX:            max auth ID number
>> + * CRYPTO_CIPHER_KEYLEN_MAX:        max key length ID number
>> + */
>> +typedef const crypto_func_t
>> +crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
>>
>> +
>> +/* Evaluate to key length definition */
>> +#define KEYL(keyl)        (ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
>> +
>> +/* Local aliases for supported ciphers */
>> +#define CIPH_AES_CBC        RTE_CRYPTO_CIPHER_AES_CBC
>> +/* Local aliases for supported hashes */
>> +#define AUTH_SHA1_HMAC        RTE_CRYPTO_AUTH_SHA1_HMAC
>> +#define AUTH_SHA256        RTE_CRYPTO_AUTH_SHA256
>
> SHA256 you are defining both AUTH and HMAC, however for SHA1 only HMAC.
> In your implementation, you seems to be only supporting HMAC.

Yes. This is removed now. The MAC implementation will not be included in 
this patchset.

>
>> +#define AUTH_SHA256_HMAC    RTE_CRYPTO_AUTH_SHA256_HMAC
>> +
>> +/**
>> + * Arrays containing pointers to particular cryptographic,
>> + * combined mode functions.
>> + * crypto_op_ca_encrypt:    cipher (encrypt), authenticate
>> + * crypto_op_ca_decrypt:    cipher (decrypt), authenticate
>> + * crypto_op_ac_encrypt:    authenticate, cipher (encrypt)
>> + * crypto_op_ac_decrypt:    authenticate, cipher (decrypt)
>> + */
>> +static const crypto_func_tbl_t
>> +crypto_op_ca_encrypt = {
>> +    /* [cipher alg][auth alg][key length] = crypto_function, */
>> +    [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
>> +    [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
>> +};
>> +
> do you plan to support aes192 and aes256 as well?

Yes, in the future. Based on our resource availability. This patchset 
will contain only AES128CBC

>
>> +static const crypto_func_tbl_t
>> +crypto_op_ca_decrypt = {
>> +    NULL
>> +};
>> +
>> +static const crypto_func_tbl_t
>> +crypto_op_ac_encrypt = {
>> +    NULL
>> +};
>> +
>> +static const crypto_func_tbl_t
>> +crypto_op_ac_decrypt = {
>> +    /* [cipher alg][auth alg][key length] = crypto_function, */
>> +    [CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
>> +    [CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] =
>> sha256_hmac_aes128cbc_dec,
>> +};
>> +
>> +/**
>> + * Arrays containing pointers to particular cryptographic function sets,
>> + * covering given cipher operation directions (encrypt, decrypt)
>> + * for each order of cipher and authentication pairs.
>> + */
>> +static const crypto_func_tbl_t *
>> +crypto_cipher_auth[] = {
>> +    &crypto_op_ca_encrypt,
>> +    &crypto_op_ca_decrypt,
>> +    NULL
>> +};
>> +
>> +static const crypto_func_tbl_t *
>> +crypto_auth_cipher[] = {
>> +    &crypto_op_ac_encrypt,
>> +    &crypto_op_ac_decrypt,
>> +    NULL
>> +};
>> +
>> +/**
>> + * Top level array containing pointers to particular cryptographic
>> + * function sets, covering given order of chained operations.
>> + * crypto_cipher_auth:    cipher first, authenticate after
>> + * crypto_auth_cipher:    authenticate first, cipher after
>> + */
>> +static const crypto_func_tbl_t **
>> +crypto_chain_order[] = {
>> +    crypto_cipher_auth,
>> +    crypto_auth_cipher,
>> +    NULL
>> +};
>> +
>> +/**
>> + * Extract particular combined mode crypto function from the 3D array.
>> + */
>> +#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)            \
>> +({                                    \
>> +    crypto_func_tbl_t *func_tbl =                    \
>> +                (crypto_chain_order[(order)])[(cop)];    \
>> +                                    \
>> +    ((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);        \
>> +})
>> +
>> +/*----------------------------------------------------------------------------*/
>>
>> +
>> +/**
>> + * 2D array type for ARM key schedule functions pointers.
>> + * CRYPTO_CIPHER_MAX:            max cipher ID number
>> + * CRYPTO_CIPHER_KEYLEN_MAX:        max key length ID number
>> + */
>> +typedef const crypto_key_sched_t
>> +crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
>> +
>> +static const crypto_key_sched_tbl_t
>> +crypto_key_sched_encrypt = {
>> +    /* [cipher alg][key length] = key_expand_func, */
>> +    [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
>> +};
>> +
>> +static const crypto_key_sched_tbl_t
>> +crypto_key_sched_decrypt = {
>> +    /* [cipher alg][key length] = key_expand_func, */
>> +    [CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
>> +};
>> +
>> +/**
>> + * Top level array containing pointers to particular key generation
>> + * function sets, covering given operation direction.
>> + * crypto_key_sched_encrypt:    keys for encryption
>> + * crypto_key_sched_decrypt:    keys for decryption
>> + */
>> +static const crypto_key_sched_tbl_t *
>> +crypto_key_sched_dir[] = {
>> +    &crypto_key_sched_encrypt,
>> +    &crypto_key_sched_decrypt,
>> +    NULL
>> +};
>> +
>> +/**
>> + * Extract particular combined mode crypto function from the 3D array.
>> + */
>> +#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)                \
>> +({                                    \
>> +    crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];    \
>> +                                    \
>> +    ((*ks_tbl)[(calg)][KEYL(keyl)]);                \
>> +})
>> +
>> +/*----------------------------------------------------------------------------*/
>>
>> +
>> +/**
>> + * Global static parameter used to create a unique name for each
>> + * ARMV8 crypto device.
>> + */
>> +static unsigned int unique_name_id;
>> +
>> +static inline int
>> +create_unique_device_name(char *name, size_t size)
>> +{
>> +    int ret;
>> +
>> +    if (name == NULL)
>> +        return -EINVAL;
>> +
>> +    ret = snprintf(name, size, "%s_%u",
>> RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
>> +            unique_name_id++);
>> +    if (ret < 0)
>> +        return ret;
>> +    return 0;
>> +}
>> +
>> +/*
>> +
>> *------------------------------------------------------------------------------
>>
>> + * Session Prepare
>> +
>> *------------------------------------------------------------------------------
>>
>> + */
>> +
>> +/** Get xform chain order */
>> +static enum armv8_crypto_chain_order
>> +armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
>> +{
>> +
>> +    /*
>> +     * This driver currently covers only chained operations.
>> +     * Ignore only cipher or only authentication operations
>> +     * or chains longer than 2 xform structures.
>> +     */
>> +    if (xform->next == NULL || xform->next->next != NULL)
>> +        return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
>> +
>> +    if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
>> +        if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
>> +            return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
>> +    }
>> +
>> +    if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
>> +        if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
>> +            return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
>> +    }
>> +
>> +    return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
>> +}
>> +
>> +static inline void
>> +auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
>> +                const struct rte_crypto_sym_xform *xform)
>> +{
>> +    size_t i;
>> +
>> +    /* Generate i_key_pad and o_key_pad */
>> +    memset(sess->auth.hmac.i_key_pad, 0,
>> sizeof(sess->auth.hmac.i_key_pad));
>> +    rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
>> +                            xform->auth.key.length);
>> +    memset(sess->auth.hmac.o_key_pad, 0,
>> sizeof(sess->auth.hmac.o_key_pad));
>> +    rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
>> +                            xform->auth.key.length);
>> +    /*
>> +     * XOR key with IPAD/OPAD values to obtain i_key_pad
>> +     * and o_key_pad.
>> +     * Byte-by-byte operation may seem to be the less efficient
>> +     * here but in fact it's the opposite.
>> +     * The result ASM code is likely operate on NEON registers
>> +     * (load auth key to Qx, load IPAD/OPAD to multiple
>> +     * elements of Qy, eor 128 bits at once).
>> +     */
>> +    for (i = 0; i < SHA_BLOCK_MAX; i++) {
>> +        sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
>> +        sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
>> +    }
>> +}
>> +
>> +static inline int
>> +auth_set_prerequisites(struct armv8_crypto_session *sess,
>> +            const struct rte_crypto_sym_xform *xform)
>> +{
>> +    uint8_t partial[64] = { 0 };
>> +    int error;
>> +
>> +    switch (xform->auth.algo) {
>> +    case RTE_CRYPTO_AUTH_SHA1_HMAC:
>> +        /*
>> +         * Generate authentication key, i_key_pad and o_key_pad.
>> +         */
>> +        /* Zero memory under key */
>> +        memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
>> +
>> +        if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
>> +            /*
>> +             * In case the key is longer than 160 bits
>> +             * the algorithm will use SHA1(key) instead.
>> +             */
>> +            error = sha1_block(NULL, xform->auth.key.data,
>> +                sess->auth.hmac.key, xform->auth.key.length);
>> +            if (error != 0)
>> +                return -1;
>> +        } else {
>> +            /*
>> +             * Now copy the given authentication key to the session
>> +             * key assuming that the session key is zeroed there is
>> +             * no need for additional zero padding if the key is
>> +             * shorter than SHA1_AUTH_KEY_LENGTH.
>> +             */
>> +            rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
>> +                            xform->auth.key.length);
>> +        }
>> +
>> +        /* Prepare HMAC padding: key|pattern */
>> +        auth_hmac_pad_prepare(sess, xform);
>> +        /*
>> +         * Calculate partial hash values for i_key_pad and o_key_pad.
>> +         * Will be used as initialization state for final HMAC.
>> +         */
>> +        error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
>> +            partial, SHA1_BLOCK_SIZE);
>> +        if (error != 0)
>> +            return -1;
>> +        memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
>> +
>> +        error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
>> +            partial, SHA1_BLOCK_SIZE);
>> +        if (error != 0)
>> +            return -1;
>> +        memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
>> +
>> +        break;
>> +    case RTE_CRYPTO_AUTH_SHA256_HMAC:
>> +        /*
>> +         * Generate authentication key, i_key_pad and o_key_pad.
>> +         */
>> +        /* Zero memory under key */
>> +        memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
>> +
>> +        if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
>> +            /*
>> +             * In case the key is longer than 256 bits
>> +             * the algorithm will use SHA256(key) instead.
>> +             */
>> +            error = sha256_block(NULL, xform->auth.key.data,
>> +                sess->auth.hmac.key, xform->auth.key.length);
>> +            if (error != 0)
>> +                return -1;
>> +        } else {
>> +            /*
>> +             * Now copy the given authentication key to the session
>> +             * key assuming that the session key is zeroed there is
>> +             * no need for additional zero padding if the key is
>> +             * shorter than SHA256_AUTH_KEY_LENGTH.
>> +             */
>> +            rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
>> +                            xform->auth.key.length);
>> +        }
>> +
>> +        /* Prepare HMAC padding: key|pattern */
>> +        auth_hmac_pad_prepare(sess, xform);
>> +        /*
>> +         * Calculate partial hash values for i_key_pad and o_key_pad.
>> +         * Will be used as initialization state for final HMAC.
>> +         */
>> +        error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
>> +            partial, SHA256_BLOCK_SIZE);
>> +        if (error != 0)
>> +            return -1;
>> +        memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
>> +
>> +        error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
>> +            partial, SHA256_BLOCK_SIZE);
>> +        if (error != 0)
>> +            return -1;
>> +        memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
>> +
>> +        break;
>> +    default:
>> +        break;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static inline int
>> +cipher_set_prerequisites(struct armv8_crypto_session *sess,
>> +            const struct rte_crypto_sym_xform *xform)
>> +{
>> +    crypto_key_sched_t cipher_key_sched;
>> +
>> +    cipher_key_sched = sess->cipher.key_sched;
>> +    if (likely(cipher_key_sched != NULL)) {
>> +        /* Set up cipher session key */
>> +        cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>> +armv8_crypto_set_session_chained_parameters(struct
>> armv8_crypto_session *sess,
>> +        const struct rte_crypto_sym_xform *cipher_xform,
>> +        const struct rte_crypto_sym_xform *auth_xform)
>> +{
>> +    enum armv8_crypto_chain_order order;
>> +    enum armv8_crypto_cipher_operation cop;
>> +    enum rte_crypto_cipher_algorithm calg;
>> +    enum rte_crypto_auth_algorithm aalg;
>> +
>> +    /* Validate and prepare scratch order of combined operations */
>> +    switch (sess->chain_order) {
>> +    case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
>> +    case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
>> +        order = sess->chain_order;
>> +        break;
>> +    default:
>> +        return -EINVAL;
>> +    }
>> +    /* Select cipher direction */
>> +    sess->cipher.direction = cipher_xform->cipher.op;
>> +    /* Select cipher key */
>> +    sess->cipher.key.length = cipher_xform->cipher.key.length;
>> +    /* Set cipher direction */
>> +    cop = sess->cipher.direction;
>> +    /* Set cipher algorithm */
>> +    calg = cipher_xform->cipher.algo;
>> +
>> +    /* Select cipher algo */
>> +    switch (calg) {
>> +    /* Cover supported cipher algorithms */
>> +    case RTE_CRYPTO_CIPHER_AES_CBC:
>> +        sess->cipher.algo = calg;
>> +        /* IV len is always 16 bytes (block size) for AES CBC */
>> +        sess->cipher.iv_len = 16;
>> +        break;
>> +    default:
>> +        return -EINVAL;
>> +    }
>> +    /* Select auth generate/verify */
>> +    sess->auth.operation = auth_xform->auth.op;
>> +
>> +    /* Select auth algo */
>> +    switch (auth_xform->auth.algo) {
>> +    /* Cover supported hash algorithms */
>> +    case RTE_CRYPTO_AUTH_SHA256:
>> +        aalg = auth_xform->auth.algo;
>> +        sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_AUTH;
>> +        break;
>
> as previously stated, are you supporting AUTH types?

MAC is not supported in this patchset. Removed.
>
>
>> +    case RTE_CRYPTO_AUTH_SHA1_HMAC:
>> +    case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
>> +        aalg = auth_xform->auth.algo;
>> +        sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
>> +        break;
>> +    default:
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* Verify supported key lengths and extract proper algorithm */
>> +    switch (cipher_xform->cipher.key.length << 3) {
>> +    case 128:
>> +        sess->crypto_func =
>> +                CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
>> +        sess->cipher.key_sched =
>> +                CRYPTO_GET_KEY_SCHED(cop, calg, 128);
>> +        break;
>> +    case 192:
>
> aes192 and aes256?

Set as default - unsupported in the new patchset.

>
>> +        sess->crypto_func =
>> +                CRYPTO_GET_ALGO(order, cop, calg, aalg, 192);
>> +        sess->cipher.key_sched =
>> +                CRYPTO_GET_KEY_SCHED(cop, calg, 192);
>> +        break;
>> +    case 256:
>> +        sess->crypto_func =
>> +                CRYPTO_GET_ALGO(order, cop, calg, aalg, 256);
>> +        sess->cipher.key_sched =
>> +                CRYPTO_GET_KEY_SCHED(cop, calg, 256);
>> +        break;
>> +    default:
>> +        sess->crypto_func = NULL;
>> +        sess->cipher.key_sched = NULL;
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (unlikely(sess->crypto_func == NULL)) {
>> +        /*
>> +         * If we got here that means that there must be a bug
>> +         * in the algorithms selection above. Nevertheless keep
>> +         * it here to catch bug immediately and avoid NULL pointer
>> +         * dereference in OPs processing.
>> +         */
>> +        ARMV8_CRYPTO_LOG_ERR(
>> +            "No appropriate crypto function for given parameters");
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* Set up cipher session prerequisites */
>> +    if (cipher_set_prerequisites(sess, cipher_xform) != 0)
>> +        return -EINVAL;
>> +
>> +    /* Set up authentication session prerequisites */
>> +    if (auth_set_prerequisites(sess, auth_xform) != 0)
>> +        return -EINVAL;
>> +
>> +    return 0;
>> +}
>> +
>> +/** Parse crypto xform chain and set private session parameters */
>> +int
>> +armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
>> +        const struct rte_crypto_sym_xform *xform)
>> +{
>> +    const struct rte_crypto_sym_xform *cipher_xform = NULL;
>> +    const struct rte_crypto_sym_xform *auth_xform = NULL;
>> +    bool is_chained_op;
>> +    int ret;
>> +
>> +    /* Filter out spurious/broken requests */
>> +    if (xform == NULL)
>> +        return -EINVAL;
>> +
>> +    sess->chain_order = armv8_crypto_get_chain_order(xform);
>> +    switch (sess->chain_order) {
>> +    case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
>> +        cipher_xform = xform;
>> +        auth_xform = xform->next;
>> +        is_chained_op = true;
>> +        break;
>> +    case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
>> +        auth_xform = xform;
>> +        cipher_xform = xform->next;
>> +        is_chained_op = true;
>> +        break;
>> +    default:
>> +        is_chained_op = false;
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (is_chained_op) {
>> +        ret = armv8_crypto_set_session_chained_parameters(sess,
>> +                        cipher_xform, auth_xform);
>> +        if (unlikely(ret != 0)) {
>> +            ARMV8_CRYPTO_LOG_ERR(
>> +            "Invalid/unsupported chained (cipher/auth) parameters");
>> +            return -EINVAL;
>> +        }
>> +    } else {
>> +        ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/** Provide session for operation */
>> +static struct armv8_crypto_session *
>> +get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
>> +{
>> +    struct armv8_crypto_session *sess = NULL;
>> +
>> +    if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
>> +        /* get existing session */
>> +        if (likely(op->sym->session != NULL &&
>> +                op->sym->session->dev_type ==
>> +                RTE_CRYPTODEV_ARMV8_PMD)) {
>> +            sess = (struct armv8_crypto_session *)
>> +                op->sym->session->_private;
>> +        }
>> +    } else {
>> +        /* provide internal session */
>> +        void *_sess = NULL;
>> +
>> +        if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
>> +            sess = (struct armv8_crypto_session *)
>> +                ((struct rte_cryptodev_sym_session *)_sess)
>> +                ->_private;
>> +
>> +            if (unlikely(armv8_crypto_set_session_parameters(
>> +                    sess, op->sym->xform) != 0)) {
>> +                rte_mempool_put(qp->sess_mp, _sess);
>> +                sess = NULL;
>> +            } else
>> +                op->sym->session = _sess;
>> +        }
>> +    }
>> +
>> +    if (sess == NULL)
>> +        op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
>> +
>> +    return sess;
>> +}
>> +
>> +/*
>> +
>> *------------------------------------------------------------------------------
>>
>> + * Process Operations
>> +
>> *------------------------------------------------------------------------------
>>
>> + */
>> +
>> +/*----------------------------------------------------------------------------*/
>>
>> +
>> +/** Process cipher operation */
>> +static void
>> +process_armv8_chained_op
>> +        (struct rte_crypto_op *op, struct armv8_crypto_session *sess,
>> +        struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
>> +{
>> +    crypto_func_t crypto_func;
>> +    crypto_arg_t arg;
>> +    struct rte_mbuf *m_asrc, *m_adst;
>> +    uint8_t *csrc, *cdst;
>> +    uint8_t *adst, *asrc;
>> +    uint64_t clen, alen __rte_unused;
>> +    int error;
>> +
>> +    clen = op->sym->cipher.data.length;
>> +    alen = op->sym->auth.data.length;
>> +
>> +    csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
>> +            op->sym->cipher.data.offset);
>> +    cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
>> +            op->sym->cipher.data.offset);
>> +
>> +    switch (sess->chain_order) {
>> +    case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
>> +        m_asrc = m_adst = mbuf_dst;
>> +        break;
>> +    case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
>> +        m_asrc = mbuf_src;
>> +        m_adst = mbuf_dst;
>> +        break;
>> +    default:
>> +        op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
>> +        return;
>> +    }
>> +    asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
>> +                op->sym->auth.data.offset);
>> +
>> +    switch (sess->auth.mode) {
>> +    case ARMV8_CRYPTO_AUTH_AS_AUTH:
>> +        /* Nothing to do here, just verify correct option */
>> +        break;
>> +    case ARMV8_CRYPTO_AUTH_AS_HMAC:
>> +        arg.digest.hmac.key = sess->auth.hmac.key;
>> +        arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
>> +        arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
>> +        break;
>> +    default:
>> +        op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
>> +        return;
>> +    }
>> +
>> +    if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
>> +        adst = op->sym->auth.digest.data;
>> +        if (adst == NULL) {
>> +            adst = rte_pktmbuf_mtod_offset(m_adst,
>> +                    uint8_t *,
>> +                    op->sym->auth.data.offset +
>> +                    op->sym->auth.data.length);
>> +        }
>> +    } else {
>> +        adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
>> +                op->sym->auth.digest.length);
>> +    }
>> +
>> +    if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
>> +        op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
>> +        return;
>> +    }
>> +
>> +    arg.cipher.iv = op->sym->cipher.iv.data;
>> +    arg.cipher.key = sess->cipher.key.data;
>> +    /* Acquire combined mode function */
>> +    crypto_func = sess->crypto_func;
>> +    ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
>> +    error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
>> +    if (error != 0) {
>> +        op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
>> +        return;
>> +    }
>> +
>> +    op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
>> +    if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
>> +        if (memcmp(adst, op->sym->auth.digest.data,
>> +                op->sym->auth.digest.length) != 0) {
>> +            op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
>> +        }
>> +        /* Trim area used for digest from mbuf. */
>> +        rte_pktmbuf_trim(m_asrc,
>> +                op->sym->auth.digest.length);
>> +    }
>> +}
>> +
>> +/** Process crypto operation for mbuf */
>> +static int
>> +process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
>> +        struct armv8_crypto_session *sess)
>> +{
>> +    struct rte_mbuf *msrc, *mdst;
>> +    int retval;
>> +
>> +    msrc = op->sym->m_src;
>> +    mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
>> +
>> +    op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
>> +
>> +    switch (sess->chain_order) {
>> +    case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
>> +    case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
>> +        process_armv8_chained_op(op, sess, msrc, mdst);
>> +        break;
>> +    default:
>> +        op->status = RTE_CRYPTO_OP_STATUS_ERROR;
>> +        break;
>> +    }
>> +
>> +    /* Free session if a session-less crypto op */
>> +    if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
>> +        memset(sess, 0, sizeof(struct armv8_crypto_session));
>> +        rte_mempool_put(qp->sess_mp, op->sym->session);
>> +        op->sym->session = NULL;
>> +    }
>> +
>> +    if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
>> +        op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
>> +
>> +    if (op->status != RTE_CRYPTO_OP_STATUS_ERROR)
>> +        retval = rte_ring_enqueue(qp->processed_ops, (void *)op);
>> +    else
>> +        retval = -1;
>> +
>> +    return retval;
>> +}
>> +
>> +/*
>> +
>> *------------------------------------------------------------------------------
>>
>> + * PMD Framework
>> +
>> *------------------------------------------------------------------------------
>>
>> + */
>> +
>> +/** Enqueue burst */
>> +static uint16_t
>> +armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op
>> **ops,
>> +        uint16_t nb_ops)
>> +{
>> +    struct armv8_crypto_session *sess;
>> +    struct armv8_crypto_qp *qp = queue_pair;
>> +    int i, retval;
>> +
>> +    for (i = 0; i < nb_ops; i++) {
>> +        sess = get_session(qp, ops[i]);
>> +        if (unlikely(sess == NULL))
>> +            goto enqueue_err;
>> +
>> +        retval = process_op(qp, ops[i], sess);
>> +        if (unlikely(retval < 0))
>> +            goto enqueue_err;
>> +    }
>> +
>> +    qp->stats.enqueued_count += i;
>> +    return i;
>> +
>> +enqueue_err:
>> +    if (ops[i] != NULL)
>> +        ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
>> +
>> +    qp->stats.enqueue_err_count++;
>> +    return i;
>> +}
>> +
>> +/** Dequeue burst */
>> +static uint16_t
>> +armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op
>> **ops,
>> +        uint16_t nb_ops)
>> +{
>> +    struct armv8_crypto_qp *qp = queue_pair;
>> +
>> +    unsigned int nb_dequeued = 0;
>> +
>> +    nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
>> +            (void **)ops, nb_ops);
>> +    qp->stats.dequeued_count += nb_dequeued;
>> +
>> +    return nb_dequeued;
>> +}
>> +
>> +/** Create ARMv8 crypto device */
>> +static int
>> +cryptodev_armv8_crypto_create(const char *name,
>> +        struct rte_crypto_vdev_init_params *init_params)
>> +{
>> +    struct rte_cryptodev *dev;
>> +    char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
>> +    struct armv8_crypto_private *internals;
>> +
>> +    /* Check CPU for support for AES instruction set */
>> +    if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
>> +        ARMV8_CRYPTO_LOG_ERR(
>> +            "AES instructions not supported by CPU");
>> +        return -EFAULT;
>> +    }
>> +
>> +    /* Check CPU for support for SHA instruction set */
>> +    if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
>> +        !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
>> +        ARMV8_CRYPTO_LOG_ERR(
>> +            "SHA1/SHA2 instructions not supported by CPU");
>> +        return -EFAULT;
>> +    }
>> +
>> +    /* Check CPU for support for Advance SIMD instruction set */
>> +    if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
>> +        ARMV8_CRYPTO_LOG_ERR(
>> +            "Advanced SIMD instructions not supported by CPU");
>> +        return -EFAULT;
>> +    }
>> +
>> +    /* create a unique device name */
>> +    if (create_unique_device_name(crypto_dev_name,
>> +            RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
>> +        ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
>> +        return -EINVAL;
>> +    }
>> +
>> +    dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
>> +                sizeof(struct armv8_crypto_private),
>> +                init_params->socket_id);
>> +    if (dev == NULL) {
>> +        ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
>> +        goto init_error;
>> +    }
>> +
>> +    dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
>> +    dev->dev_ops = rte_armv8_crypto_pmd_ops;
>> +
>> +    /* register rx/tx burst functions for data path */
>> +    dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
>> +    dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
>> +
>> +    dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
>> +            RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
>> +
>> +    /* Set vector instructions mode supported */
>> +    internals = dev->data->dev_private;
>> +
>> +    internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
>> +    internals->max_nb_sessions = init_params->max_nb_sessions;
>> +
>> +    return 0;
>> +
>> +init_error:
>> +    ARMV8_CRYPTO_LOG_ERR(
>> +        "driver %s: cryptodev_armv8_crypto_create failed", name);
>> +
>> +    cryptodev_armv8_crypto_uninit(crypto_dev_name);
>> +    return -EFAULT;
>> +}
>> +
>> +/** Initialise ARMv8 crypto device */
>> +static int
>> +cryptodev_armv8_crypto_init(const char *name,
>> +        const char *input_args)
>> +{
>> +    struct rte_crypto_vdev_init_params init_params = {
>> +        RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
>> +        RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
>> +        rte_socket_id()
>> +    };
>> +
>> +    rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
>> +
>> +    RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
>> +            init_params.socket_id);
>> +    RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
>> +            init_params.max_nb_queue_pairs);
>> +    RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
>> +            init_params.max_nb_sessions);
>> +
>> +    return cryptodev_armv8_crypto_create(name, &init_params);
>> +}
>> +
>> +/** Uninitialise ARMv8 crypto device */
>> +static int
>> +cryptodev_armv8_crypto_uninit(const char *name)
>> +{
>> +    if (name == NULL)
>> +        return -EINVAL;
>> +
>> +    RTE_LOG(INFO, PMD,
>> +        "Closing ARMv8 crypto device %s on numa socket %u\n",
>> +        name, rte_socket_id());
>> +
>> +    return 0;
>> +}
>> +
>> +static struct rte_vdev_driver armv8_crypto_drv = {
>> +    .probe = cryptodev_armv8_crypto_init,
>> +    .remove = cryptodev_armv8_crypto_uninit
>> +};
>> +
>> +RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
>> +RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
>> +RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
>> +    "max_nb_queue_pairs=<int> "
>> +    "max_nb_sessions=<int> "
>> +    "socket_id=<int>");
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>> b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>> new file mode 100644
>> index 0000000..2bf6475
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
>> @@ -0,0 +1,369 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above
>> copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Cavium networks nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>> USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
>> THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> DAMAGE.
>> + */
>> +
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_malloc.h>
>> +#include <rte_cryptodev_pmd.h>
>> +
>> +#include "armv8_crypto_defs.h"
>> +
>> +#include "rte_armv8_pmd_private.h"
>> +
>> +static const struct rte_cryptodev_capabilities
>> +    armv8_crypto_pmd_capabilities[] = {
>> +    {    /* SHA1 HMAC */
>> +        .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +            {.sym = {
>> +                .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>> +                {.auth = {
>> +                    .algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
>> +                    .block_size = 64,
>> +                    .key_size = {
>> +                        .min = 16,
>> +                        .max = 128,
>> +                        .increment = 0
>> +                    },
>> +                    .digest_size = {
>> +                        .min = 20,
>> +                        .max = 20,
>> +                        .increment = 0
>> +                    },
>> +                    .aad_size = { 0 }
>> +                }, }
>> +            }, }
>> +    },
>> +    {    /* SHA256 HMAC */
>> +        .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +            {.sym = {
>> +                .xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
>> +                {.auth = {
>> +                    .algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
>> +                    .block_size = 64,
>> +                    .key_size = {
>> +                        .min = 16,
>> +                        .max = 128,
>> +                        .increment = 0
>> +                    },
>> +                    .digest_size = {
>> +                        .min = 32,
>> +                        .max = 32,
>> +                        .increment = 0
>> +                    },
>> +                    .aad_size = { 0 }
>> +                }, }
>> +            }, }
>> +    },
>> +    {    /* AES CBC */
>> +        .op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
>> +            {.sym = {
>> +                .xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
>> +                {.cipher = {
>> +                    .algo = RTE_CRYPTO_CIPHER_AES_CBC,
>> +                    .block_size = 16,
>> +                    .key_size = {
>> +                        .min = 16,
>> +                        .max = 16,
>
> do you plan max = 32 ?
>
>> +                        .increment = 0
>> +                    },
>> +                    .iv_size = {
>> +                        .min = 16,
>> +                        .max = 16,
>> +                        .increment = 0
>> +                    }
>> +                }, }
>> +            }, }
>> +    },
>> +
>> +    RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
>> +};
>> +
>> +
>> +/** Configure device */
>> +static int
>> +armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
>> +{
>> +    return 0;
>> +}
>> +
>> +/** Start device */
>> +static int
>> +armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
>> +{
>> +    return 0;
>> +}
>> +
>> +/** Stop device */
>> +static void
>> +armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
>> +{
>> +}
>> +
>> +/** Close device */
>> +static int
>> +armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
>> +{
>> +    return 0;
>> +}
>> +
>> +
>> +/** Get device statistics */
>> +static void
>> +armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
>> +        struct rte_cryptodev_stats *stats)
>> +{
>> +    int qp_id;
>> +
>> +    for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
>> +        struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
>> +
>> +        stats->enqueued_count += qp->stats.enqueued_count;
>> +        stats->dequeued_count += qp->stats.dequeued_count;
>> +
>> +        stats->enqueue_err_count += qp->stats.enqueue_err_count;
>> +        stats->dequeue_err_count += qp->stats.dequeue_err_count;
>> +    }
>> +}
>> +
>> +/** Reset device statistics */
>> +static void
>> +armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
>> +{
>> +    int qp_id;
>> +
>> +    for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
>> +        struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
>> +
>> +        memset(&qp->stats, 0, sizeof(qp->stats));
>> +    }
>> +}
>> +
>> +
>> +/** Get device info */
>> +static void
>> +armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
>> +        struct rte_cryptodev_info *dev_info)
>> +{
>> +    struct armv8_crypto_private *internals = dev->data->dev_private;
>> +
>> +    if (dev_info != NULL) {
>> +        dev_info->dev_type = dev->dev_type;
>> +        dev_info->feature_flags = dev->feature_flags;
>> +        dev_info->capabilities = armv8_crypto_pmd_capabilities;
>> +        dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
>> +        dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
>> +    }
>> +}
>> +
>> +/** Release queue pair */
>> +static int
>> +armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
>> +{
>> +
>> +    if (dev->data->queue_pairs[qp_id] != NULL) {
>> +        rte_free(dev->data->queue_pairs[qp_id]);
>> +        dev->data->queue_pairs[qp_id] = NULL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/** set a unique name for the queue pair based on it's name, dev_id
>> and qp_id */
>> +static int
>
>> +armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
>> +        struct armv8_crypto_qp *qp)
>> +{
>> +    unsigned int n;
>> +
>> +    n = snprintf(qp->name, sizeof(qp->name),
>> "armv8_crypto_pmd_%u_qp_%u",
>> +            dev->data->dev_id, qp->id);
>> +
>> +    if (n > sizeof(qp->name))
>> +        return -1;
>> +
>> +    return 0;
>> +}
>> +
>> +
>> +/** Create a ring to place processed operations on */
>> +static struct rte_ring *
>> +armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp
>> *qp,
>> +        unsigned int ring_size, int socket_id)
>> +{
>> +    struct rte_ring *r;
>> +
>> +    r = rte_ring_lookup(qp->name);
>> +    if (r) {
>> +        if (r->prod.size >= ring_size) {
>> +            ARMV8_CRYPTO_LOG_INFO(
>> +                "Reusing existing ring %s for processed ops",
>> +                 qp->name);
>> +            return r;
>> +        }
>> +
>> +        ARMV8_CRYPTO_LOG_ERR(
>> +            "Unable to reuse existing ring %s for processed ops",
>> +             qp->name);
>> +        return NULL;
>> +    }
>> +
>> +    return rte_ring_create(qp->name, ring_size, socket_id,
>> +            RING_F_SP_ENQ | RING_F_SC_DEQ);
>> +}
>> +
>> +
>> +/** Setup a queue pair */
>> +static int
>> +armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
>> +        const struct rte_cryptodev_qp_conf *qp_conf,
>> +         int socket_id)
>> +{
>> +    struct armv8_crypto_qp *qp = NULL;
>> +
>> +    /* Free memory prior to re-allocation if needed. */
>> +    if (dev->data->queue_pairs[qp_id] != NULL)
>> +        armv8_crypto_pmd_qp_release(dev, qp_id);
>> +
>> +    /* Allocate the queue pair data structure. */
>> +    qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
>> +                    RTE_CACHE_LINE_SIZE, socket_id);
>> +    if (qp == NULL)
>> +        return -ENOMEM;
>> +
>> +    qp->id = qp_id;
>> +    dev->data->queue_pairs[qp_id] = qp;
>> +
>> +    if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
>> +        goto qp_setup_cleanup;
>> +
>> +    qp->processed_ops =
>> armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
>> +            qp_conf->nb_descriptors, socket_id);
>> +    if (qp->processed_ops == NULL)
>> +        goto qp_setup_cleanup;
>> +
>> +    qp->sess_mp = dev->data->session_pool;
>> +
>> +    memset(&qp->stats, 0, sizeof(qp->stats));
>> +
>> +    return 0;
>> +
>> +qp_setup_cleanup:
>> +    if (qp)
>> +        rte_free(qp);
>> +
>> +    return -1;
>> +}
>> +
>> +/** Start queue pair */
>> +static int
>> +armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
>> +        __rte_unused uint16_t queue_pair_id)
>> +{
>> +    return -ENOTSUP;
>> +}
>> +
>> +/** Stop queue pair */
>> +static int
>> +armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
>> +        __rte_unused uint16_t queue_pair_id)
>> +{
>> +    return -ENOTSUP;
>> +}
>> +
>> +/** Return the number of allocated queue pairs */
>> +static uint32_t
>> +armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
>> +{
>> +    return dev->data->nb_queue_pairs;
>> +}
>> +
>> +/** Returns the size of the session structure */
>> +static unsigned
>> +armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev
>> __rte_unused)
>> +{
>> +    return sizeof(struct armv8_crypto_session);
>> +}
>> +
>> +/** Configure the session from a crypto xform chain */
>> +static void *
>> +armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev
>> __rte_unused,
>> +        struct rte_crypto_sym_xform *xform, void *sess)
>> +{
>> +    if (unlikely(sess == NULL)) {
>> +        ARMV8_CRYPTO_LOG_ERR("invalid session struct");
>> +        return NULL;
>> +    }
>> +
>> +    if (armv8_crypto_set_session_parameters(
>> +            sess, xform) != 0) {
>> +        ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
>> +        return NULL;
>> +    }
>> +
>> +    return sess;
>> +}
>> +
>> +/** Clear the memory of session so it doesn't leave key material
>> behind */
>> +static void
>> +armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
>> +                void *sess)
>> +{
>> +
>> +    /* Zero out the whole structure */
>> +    if (sess)
>> +        memset(sess, 0, sizeof(struct armv8_crypto_session));
>> +}
>> +
>> +struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
>> +        .dev_configure        = armv8_crypto_pmd_config,
>> +        .dev_start        = armv8_crypto_pmd_start,
>> +        .dev_stop        = armv8_crypto_pmd_stop,
>> +        .dev_close        = armv8_crypto_pmd_close,
>> +
>> +        .stats_get        = armv8_crypto_pmd_stats_get,
>> +        .stats_reset        = armv8_crypto_pmd_stats_reset,
>> +
>> +        .dev_infos_get        = armv8_crypto_pmd_info_get,
>> +
>> +        .queue_pair_setup    = armv8_crypto_pmd_qp_setup,
>> +        .queue_pair_release    = armv8_crypto_pmd_qp_release,
>> +        .queue_pair_start    = armv8_crypto_pmd_qp_start,
>> +        .queue_pair_stop    = armv8_crypto_pmd_qp_stop,
>> +        .queue_pair_count    = armv8_crypto_pmd_qp_count,
>> +
>> +        .session_get_size    = armv8_crypto_pmd_session_get_size,
>> +        .session_configure    = armv8_crypto_pmd_session_configure,
>> +        .session_clear        = armv8_crypto_pmd_session_clear
>> +};
>> +
>> +struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops =
>> &armv8_crypto_pmd_ops;
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h
>> b/drivers/crypto/armv8/rte_armv8_pmd_private.h
>> new file mode 100644
>> index 0000000..fe46cde
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
>> @@ -0,0 +1,211 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above
>> copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Cavium networks nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>> USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
>> THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> DAMAGE.
>> + */
>> +
>> +#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
>> +#define _RTE_ARMV8_PMD_PRIVATE_H_
>> +
>> +#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
>> +    RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
>> +            RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
>> +            __func__, __LINE__, ## args)
>> +
>> +#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
>> +#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
>> +    RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
>> +            RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
>> +            __func__, __LINE__, ## args)
>> +
>> +#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
>> +    RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
>> +            RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
>> +            __func__, __LINE__, ## args)
>> +
>> +#define ARMV8_CRYPTO_ASSERT(con)                \
>> +do {                                \
>> +    if (!(con)) {                        \
>> +        rte_panic("%s(): "                \
>> +            con "condition failed, line %u", __func__);    \
>> +    }                            \
>> +} while (0)
>> +
>> +#else
>> +#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
>> +#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
>> +#define ARMV8_CRYPTO_ASSERT(con)
>> +#endif
>> +
>> +#define NBBY        8        /* Number of bits in a byte */
>
> is it being used somewhere?

The intention was to use in in the line below. Fixed.

>
>> +#define BYTE_LENGTH(x)    ((x) / 8)    /* Number of bytes in x (roun
>> down) */
>
> "round down"  instead of "roun down"

Thanks. Fixed.

>
>> +
>> +/** ARMv8 operation order mode enumerator */
>> +enum armv8_crypto_chain_order {
>> +    ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
>> +    ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
>> +    ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
>> +    ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
>> +};
>> +
>> +/** ARMv8 cipher operation enumerator */
>> +enum armv8_crypto_cipher_operation {
>> +    ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
>> +    ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
>> +    ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
>> +    ARMV8_CRYPTO_CIPHER_OP_LIST_END =
>> ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
>> +};
>> +
>> +enum armv8_crypto_cipher_keylen {
>> +    ARMV8_CRYPTO_CIPHER_KEYLEN_128,
>> +    ARMV8_CRYPTO_CIPHER_KEYLEN_192,
>> +    ARMV8_CRYPTO_CIPHER_KEYLEN_256,
>> +    ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
>> +    ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
>> +        ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
>> +};
>> +
>> +/** ARMv8 auth mode enumerator */
>> +enum armv8_crypto_auth_mode {
>> +    ARMV8_CRYPTO_AUTH_AS_AUTH,
>> +    ARMV8_CRYPTO_AUTH_AS_HMAC,
>> +    ARMV8_CRYPTO_AUTH_AS_CIPHER,
>> +    ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
>> +    ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
>> +};
>> +
>> +#define CRYPTO_ORDER_MAX        ARMV8_CRYPTO_CHAIN_LIST_END
>> +#define CRYPTO_CIPHER_OP_MAX        ARMV8_CRYPTO_CIPHER_OP_LIST_END
>> +#define CRYPTO_CIPHER_KEYLEN_MAX    ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
>> +#define CRYPTO_CIPHER_MAX        RTE_CRYPTO_CIPHER_LIST_END
>> +#define CRYPTO_AUTH_MAX            RTE_CRYPTO_AUTH_LIST_END
>> +
>> +#define HMAC_IPAD_VALUE            (0x36)
>> +#define HMAC_OPAD_VALUE            (0x5C)
>> +
>> +#define SHA256_AUTH_KEY_LENGTH        (BYTE_LENGTH(256))
>> +#define SHA256_BLOCK_SIZE        (BYTE_LENGTH(512))
>> +
>> +#define SHA1_AUTH_KEY_LENGTH        (BYTE_LENGTH(160))
>> +#define SHA1_BLOCK_SIZE            (BYTE_LENGTH(512))
>> +
>> +#define SHA_AUTH_KEY_MAX        SHA256_AUTH_KEY_LENGTH
>> +#define SHA_BLOCK_MAX            SHA256_BLOCK_SIZE
>> +
>> +typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
>> +                uint8_t *, uint8_t *, uint64_t,
>> +                crypto_arg_t *);
>> +
>> +typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
>> +
>> +/** private data structure for each ARMv8 crypto device */
>> +struct armv8_crypto_private {
>> +    unsigned int max_nb_qpairs;
>> +    /**< Max number of queue pairs */
>> +    unsigned int max_nb_sessions;
>> +    /**< Max number of sessions */
>> +};
>> +
>> +/** ARMv8 crypto queue pair */
>> +struct armv8_crypto_qp {
>> +    uint16_t id;
>> +    /**< Queue Pair Identifier */
>> +    char name[RTE_CRYPTODEV_NAME_LEN];
>> +    /**< Unique Queue Pair Name */
>> +    struct rte_ring *processed_ops;
>> +    /**< Ring for placing process packets */
>> +    struct rte_mempool *sess_mp;
>> +    /**< Session Mempool */
>> +    struct rte_cryptodev_stats stats;
>> +    /**< Queue pair statistics */
>> +} __rte_cache_aligned;
>> +
>> +/** ARMv8 crypto private session structure */
>> +struct armv8_crypto_session {
>> +    enum armv8_crypto_chain_order chain_order;
>> +    /**< chain order mode */
>> +    crypto_func_t crypto_func;
>> +    /**< cryptographic function to use for this session */
>> +
>> +    /** Cipher Parameters */
>> +    struct {
>> +        enum rte_crypto_cipher_operation direction;
>> +        /**< cipher operation direction */
>> +        enum rte_crypto_cipher_algorithm algo;
>> +        /**< cipher algorithm */
>> +        int iv_len;
>> +        /**< IV length */
>> +
>> +        struct {
>> +            uint8_t data[256];
>> +            /**< key data */
>> +            size_t length;
>> +            /**< key length in bytes */
>> +        } key;
>> +
>> +        crypto_key_sched_t key_sched;
>> +        /**< Key schedule function */
>> +    } cipher;
>> +
>> +    /** Authentication Parameters */
>> +    struct {
>> +        enum rte_crypto_auth_operation operation;
>> +        /**< auth operation generate or verify */
>> +        enum armv8_crypto_auth_mode mode;
>> +        /**< auth operation mode */
>> +
>> +        union {
>> +            struct {
>> +                /* Add data if needed */
>> +            } auth;
>> +
>> +            struct {
>> +                uint8_t i_key_pad[SHA_BLOCK_MAX]
>> +                            __rte_cache_aligned;
>> +                /**< inner pad (max supported block length) */
>> +                uint8_t o_key_pad[SHA_BLOCK_MAX]
>> +                            __rte_cache_aligned;
>> +                /**< outer pad (max supported block length) */
>> +                uint8_t key[SHA_AUTH_KEY_MAX];
>> +                /**< HMAC key (max supported length)*/
>> +            } hmac;
>> +        };
>> +    } auth;
>> +
>> +} __rte_cache_aligned;
>> +
>> +/** Set and validate ARMv8 crypto session parameters */
>> +extern int armv8_crypto_set_session_parameters(
>> +        struct armv8_crypto_session *sess,
>> +        const struct rte_crypto_sym_xform *xform);
>> +
>> +/** device specific operations function pointer structure */
>> +extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
>> +
>> +#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
>> diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map
>> b/drivers/crypto/armv8/rte_armv8_pmd_version.map
>> new file mode 100644
>> index 0000000..1f84b68
>> --- /dev/null
>> +++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
>> @@ -0,0 +1,3 @@
>> +DPDK_17.02 {
>> +    local: *;
>> +};
>>
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-13  8:16         ` Hemant Agrawal
  2017-01-13 15:50           ` Zbigniew Bodek
@ 2017-01-16  5:57           ` Jianbo Liu
  1 sibling, 0 replies; 100+ messages in thread
From: Jianbo Liu @ 2017-01-16  5:57 UTC (permalink / raw)
  To: Hemant Agrawal
  Cc: Zbigniew Bodek, dev, pablo.de.lara.guarch, Declan Doherty, Jerin Jacob

On 13 January 2017 at 16:16, Hemant Agrawal <hemant.agrawal@nxp.com> wrote:
> On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
>>
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Add type and name for ARMv8 crypto PMD
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>> ---
>>  lib/librte_cryptodev/rte_cryptodev.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/lib/librte_cryptodev/rte_cryptodev.h
>> b/lib/librte_cryptodev/rte_cryptodev.h
>> index 8f63e8f..6f34f22 100644
>> --- a/lib/librte_cryptodev/rte_cryptodev.h
>> +++ b/lib/librte_cryptodev/rte_cryptodev.h
>> @@ -66,6 +66,8 @@
>>  /**< KASUMI PMD device name */
>>  #define CRYPTODEV_NAME_ZUC_PMD         crypto_zuc
>>  /**< KASUMI PMD device name */
>> +#define CRYPTODEV_NAME_ARMV8_PMD       crypto_armv8
>> +/**< ARMv8 Crypto PMD device name */
>>
> I will suggest the name as armv8ce or armv8_ce for this driver.
> Do you agree?
>

I don't because it's a lib only optimized for chained crypto and hash.

>
>>  /** Crypto device type */
>>  enum rte_cryptodev_type {
>> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>>         RTE_CRYPTODEV_KASUMI_PMD,       /**< KASUMI PMD */
>>         RTE_CRYPTODEV_ZUC_PMD,          /**< ZUC PMD */
>>         RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
>> +       RTE_CRYPTODEV_ARMV8_PMD,        /**< ARMv8 crypto PMD */
>>  };
>>
>>  extern const char **rte_cyptodev_names;
>>
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-13 18:59         ` Zbigniew Bodek
@ 2017-01-16  6:57           ` Hemant Agrawal
  2017-01-16  8:02             ` Jerin Jacob
  0 siblings, 1 reply; 100+ messages in thread
From: Hemant Agrawal @ 2017-01-16  6:57 UTC (permalink / raw)
  To: Zbigniew Bodek, dev; +Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob

Hi Zbigniew,


> -----Original Message-----
> From: Zbigniew Bodek [mailto:zbigniew.bodek@caviumnetworks.com]
> Subject: Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
> On 13.01.2017 09:07, Hemant Agrawal wrote:
> > On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> >> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >>
> >> Introduce crypto poll mode driver using ARMv8 cryptographic
> >> extensions. This PMD is optimized to provide performance boost for
> >> chained crypto operations processing, such as:
> >> * encryption + HMAC generation
> >> * decryption + HMAC validation.
> >> In particular, cipher only or hash only operations are not provided.
> >
> > Do you have a plan to add the crypto only, auth/hash only support into
> > this driver?
> 
> OpenSSL driver is already implementing that and it is optimized for ARMv8.
> 
[Hemant]  Agreed that openSSL driver implement it, however it will make the application little complex to initiates both driver instances and then choose the driver based on the algorithm.

> > Also, do you plan to add additional cases w.r.t supported by other
> > crypto driver?
> 
> We may do it in the future but this depends on our resource availability.
> >
> >> Performance gain can be observed in tests against OpenSSL PMD which
> >> also uses ARM crypto extensions for packets processing.
> >>
> >> Exemplary crypto performance tests comparison:
> >>
> >> cipher_hash. cipher algo: AES_CBC
> >> auth algo: SHA1_HMAC cipher key size=16.
> >> burst_size: 64 ops
> >>
> >> ARMv8 PMD improvement over OpenSSL PMD (Optimized for ARMv8 cipher
> >> only and hash only cases):
> >>
> >> Buffer
> >> Size(B)   OPS(M)      Throughput(Gbps)
> >> 64        729 %        742 %
> >> 128       577 %        592 %
> >> 256       483 %        476 %
> >> 512       336 %        351 %
> >> 768       300 %        286 %
> >> 1024      263 %        250 %
> >> 1280      225 %        229 %
> >> 1536      214 %        213 %
> >> 1792      186 %        203 %
> >> 2048      200 %        193 %
> >>
> >> The driver currently supports AES-128-CBC in combination with: SHA256
> >> HMAC and SHA1 HMAC.
> >> The core crypto functionality of this driver is provided by the
> >> external armv8_crypto library that can be downloaded from the Cavium
> >> repository:
> >> https://github.com/caviumnetworks/armv8_crypto
> >>
[Hemant] Thanks for the good work. 
Is it possible to get it hosted on a standard and neutral place? E.g. Linaro
It will make it easier for other ARM vendors to contribute. 

> >> CPU compatibility with this virtual device is detected in run-time
> >> and virtual crypto device will not be created if CPU doesn't provide
> >> AES, SHA1, SHA2 and NEON.
> >>
> >> The functionality and performance of this code can be tested using
> >> generic test application with the following commands:
> >> * cryptodev_sw_armv8_autotest
> >> * cryptodev_sw_armv8_perftest
> >> New test vectors and cases have been added to the general pool. In
> >> particular SHA1 and
> >> SHA256 HMAC for short cases were introduced.
> >> This is because low-level ARM assembly code is using different code
> >> paths for long and short data sets, so in order to test the mentioned
> >> driver correctly, two different data sets need to be provided.
> >>
> >> ---
> >> v3:
> >> * Addressed review remarks
> >> * Moved low-level assembly code to the external library
> >> * Removed SHA256 MAC cases
> >> * Various fixes: interface to the library, digest destination
> >>   and source address interpreting, missing mbuf manipulations.
> >>
> >> v2:
> >> * Fixed checkpatch warnings
> >> * Divide patches into smaller logical parts
> >>
> >> Zbigniew Bodek (8):
> >>   mk: fix build of assembly files for ARM64
> >>   lib: add cryptodev type for the upcoming ARMv8 PMD
> >>   crypto/armv8: add PMD optimized for ARMv8 processors
> >>   mk/crypto/armv8: add PMD to the build system
> >>   doc/armv8: update documentation about crypto PMD
> >>   crypto/armv8: enable ARMv8 PMD in the configuration
> >>   crypto/armv8: update MAINTAINERS entry for ARMv8 crypto
> >>   app/test: add ARMv8 crypto tests and test vectors
> >>
> >>  MAINTAINERS                                    |   6 +
> >>  app/test/test_cryptodev.c                      |  63 ++
> >>  app/test/test_cryptodev_aes_test_vectors.h     | 144 +++-
> >>  app/test/test_cryptodev_blockcipher.c          |   4 +
> >>  app/test/test_cryptodev_blockcipher.h          |   1 +
> >>  app/test/test_cryptodev_perf.c                 | 480 +++++++++++++
> >>  config/common_base                             |   6 +
> >>  doc/guides/cryptodevs/armv8.rst                |  96 +++
> >>  doc/guides/cryptodevs/index.rst                |   1 +
> >>  doc/guides/rel_notes/release_17_02.rst         |   5 +
> >>  drivers/crypto/Makefile                        |   1 +
> >>  drivers/crypto/armv8/Makefile                  |  73 ++
> >>  drivers/crypto/armv8/rte_armv8_pmd.c           | 926
> >> +++++++++++++++++++++++++
> >>  drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
> >>  drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
> >>  drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
> >>  lib/librte_cryptodev/rte_cryptodev.h           |   3 +
> >>  mk/arch/arm64/rte.vars.mk                      |   1 -
> >>  mk/rte.app.mk                                  |   2 +
> >>  mk/toolchain/gcc/rte.vars.mk                   |   6 +-
> >>  20 files changed, 2390 insertions(+), 11 deletions(-)  create mode
> >> 100644 doc/guides/cryptodevs/armv8.rst  create mode 100644
> >> drivers/crypto/armv8/Makefile  create mode 100644
> >> drivers/crypto/armv8/rte_armv8_pmd.c
> >>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
> >>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
> >>  create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map
> >>
> >
> >

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
  2017-01-16  6:57           ` Hemant Agrawal
@ 2017-01-16  8:02             ` Jerin Jacob
  0 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2017-01-16  8:02 UTC (permalink / raw)
  To: Hemant Agrawal; +Cc: Zbigniew Bodek, dev, pablo.de.lara.guarch, declan.doherty

On Mon, Jan 16, 2017 at 06:57:12AM +0000, Hemant Agrawal wrote:
> Hi Zbigniew,
> 
> 
> > -----Original Message-----
> > From: Zbigniew Bodek [mailto:zbigniew.bodek@caviumnetworks.com]
> > Subject: Re: [PATCH v3 0/8] Add crypto PMD optimized for ARMv8
> > On 13.01.2017 09:07, Hemant Agrawal wrote:
> > > On 1/4/2017 11:03 PM, zbigniew.bodek@caviumnetworks.com wrote:
> > >> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> > >>
> > >> Introduce crypto poll mode driver using ARMv8 cryptographic
> > >> extensions. This PMD is optimized to provide performance boost for
> > >> chained crypto operations processing, such as:
> > >> * encryption + HMAC generation
> > >> * decryption + HMAC validation.
> > >> In particular, cipher only or hash only operations are not provided.
> > >
> > > Do you have a plan to add the crypto only, auth/hash only support into
> > > this driver?
> > 
> > OpenSSL driver is already implementing that and it is optimized for ARMv8.
> > 
> [Hemant]  Agreed that openSSL driver implement it, however it will make the application little complex to initiates both driver instances and then choose the driver based on the algorithm.

We started with chained crypto as primary data-plane use-case like IPSec
need the chained operation. Agreed on single driver for both chained and non
chained. Feel free to contribute.

> 
> > > Also, do you plan to add additional cases w.r.t supported by other
> > > crypto driver?
> > 
> > We may do it in the future but this depends on our resource availability.
> > >
> > >> Performance gain can be observed in tests against OpenSSL PMD which
> > >> also uses ARM crypto extensions for packets processing.
> > >>
> > >> Exemplary crypto performance tests comparison:
> > >>
> > >> cipher_hash. cipher algo: AES_CBC
> > >> auth algo: SHA1_HMAC cipher key size=16.
> > >> burst_size: 64 ops
> > >>
> > >> ARMv8 PMD improvement over OpenSSL PMD (Optimized for ARMv8 cipher
> > >> only and hash only cases):
> > >>
> > >> Buffer
> > >> Size(B)   OPS(M)      Throughput(Gbps)
> > >> 64        729 %        742 %
> > >> 128       577 %        592 %
> > >> 256       483 %        476 %
> > >> 512       336 %        351 %
> > >> 768       300 %        286 %
> > >> 1024      263 %        250 %
> > >> 1280      225 %        229 %
> > >> 1536      214 %        213 %
> > >> 1792      186 %        203 %
> > >> 2048      200 %        193 %
> > >>
> > >> The driver currently supports AES-128-CBC in combination with: SHA256
> > >> HMAC and SHA1 HMAC.
> > >> The core crypto functionality of this driver is provided by the
> > >> external armv8_crypto library that can be downloaded from the Cavium
> > >> repository:
> > >> https://github.com/caviumnetworks/armv8_crypto
> > >>
> [Hemant] Thanks for the good work. 
> Is it possible to get it hosted on a standard and neutral place? E.g. Linaro
> It will make it easier for other ARM vendors to contribute. 
>

Sure. We are OK to host any place you suggest.
This was one of the reasons why I thought of having asm code in
driver/crypto/armv8 itself. But maintainers had a different view on it.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v4 0/7] Add crypto PMD optimized for ARMv8
  2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
  2017-01-06  2:45         ` Jianbo Liu
  2017-01-13  7:57         ` Hemant Agrawal
@ 2017-01-17 15:48         ` zbigniew.bodek
  2017-01-17 15:48           ` [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
                             ` (6 more replies)
  2 siblings, 7 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:48 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided.
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC.
The core crypto functionality of this driver is
provided by the external armv8_crypto library
that can be downloaded from the Cavium repository:
https://github.com/caviumnetworks/armv8_crypto

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA1 and
SHA256 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

---

v4:
* Address new review remarks (keep ARMv8 naming though)
* Fix spelling and change commit logs
* Removed unused code for currently unsupported algorithms
* Enqueue processed crypto ops in bursts
* Add micro-optimizations to the PMD code
* Send build system fixes in a separate patch

v3:
* Addressed review remarks
* Moved low-level assembly code to the external library
* Removed SHA256 MAC cases
* Various fixes: interface to the library, digest destination
  and source address interpreting, missing mbuf manipulations.

v2:
* Fixed checkpatch warnings
* Divide patches into smaller logical parts

Zbigniew Bodek (7):
  lib: add cryptodev type for the upcoming ARMv8 PMD
  crypto/armv8: add PMD optimized for ARMv8 processors
  mk: add PMD to the build system
  doc: update documentation about ARMv8 crypto PMD
  crypto/armv8: enable ARMv8 PMD in the configuration
  MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                    |   6 +
 app/test/test_cryptodev.c                      |  64 ++
 app/test/test_cryptodev_aes_test_vectors.h     | 144 +++-
 app/test/test_cryptodev_blockcipher.c          |   4 +
 app/test/test_cryptodev_blockcipher.h          |   1 +
 app/test/test_cryptodev_perf.c                 | 486 +++++++++++++
 config/common_base                             |   6 +
 doc/guides/cryptodevs/armv8.rst                |  96 +++
 doc/guides/cryptodevs/index.rst                |   1 +
 doc/guides/rel_notes/release_17_02.rst         |   5 +
 drivers/crypto/Makefile                        |   1 +
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 912 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 lib/librte_cryptodev/rte_cryptodev.h           |   3 +
 mk/rte.app.mk                                  |   2 +
 18 files changed, 2378 insertions(+), 8 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2017-01-17 15:48           ` zbigniew.bodek
  2017-01-18  2:24             ` Jerin Jacob
  2017-01-17 15:48           ` [PATCH v4 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                             ` (5 subsequent siblings)
  6 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:48 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add type and name for ARMv8 crypto PMD

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_cryptodev/rte_cryptodev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index 29d8eec..b370c2f 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 Crypto PMD device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 2/7] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-17 15:48           ` [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2017-01-17 15:48           ` zbigniew.bodek
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-17 15:48           ` [PATCH v4 3/7] mk: add PMD to the build system zbigniew.bodek
                             ` (4 subsequent siblings)
  6 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:48 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver
using ARMv8 cryptographic extensions.
CPU compatibility with this driver is detected in
run-time and virtual crypto device will not be
created if CPU doesn't provide:
AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost
for chained crypto operations processing,
such as encryption + HMAC generation,
decryption + HMAC validation. In particular,
cipher only or hash only operations are
not provided.

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC
and relies on the external armv8_crypto library:
https://github.com/caviumnetworks/armv8_crypto

This patch adds driver's code only and does
not include it in the build system.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 912 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 5 files changed, 1567 insertions(+)
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..2003ec4
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,72 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2017.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(MAKECMDGOALS),clean)
+ifneq ($(MAKECMDGOALS),config)
+ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
+$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
+endif
+endif
+endif
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+# external library dependencies
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
+LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..569b8d1
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,912 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be described by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t
+crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+/* Evaluate to key length definition */
+#define KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/* Local aliases for supported ciphers */
+#define CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
+/* Local aliases for supported hashes */
+#define AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
+#define AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t
+crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * Global static parameter used to create a unique name for each
+ * ARMV8 crypto device.
+ */
+static unsigned int unique_name_id;
+
+static inline int
+create_unique_device_name(char *name, size_t size)
+{
+	int ret;
+
+	if (name == NULL)
+		return -EINVAL;
+
+	ret = snprintf(name, size, "%s_%u", RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+			unique_name_id++);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+	case 256:
+		/* These key lengths are not supported yet */
+	default: /* Fall through */
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static inline struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (unlikely(sess == NULL))
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static inline void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	struct rte_mbuf *m_asrc, *m_adst;
+	uint8_t *csrc, *cdst;
+	uint8_t *adst, *asrc;
+	uint64_t clen, alen;
+	int error;
+
+	clen = op->sym->cipher.data.length;
+	alen = op->sym->auth.data.length;
+
+	csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		m_asrc = m_adst = mbuf_dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		m_asrc = mbuf_src;
+		m_adst = mbuf_dst;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+	asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
+				op->sym->auth.data.offset);
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(m_adst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
+	if (error != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+		/* Trim area used for digest from mbuf. */
+		rte_pktmbuf_trim(m_asrc,
+				op->sym->auth.digest.length);
+	}
+}
+
+/** Process crypto operation for mbuf */
+static inline int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (unlikely(op->status == RTE_CRYPTO_OP_STATUS_ERROR))
+		return -1;
+
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	qp->stats.enqueued_count += retval;
+
+	return retval;
+
+enqueue_err:
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return retval;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(const char *name,
+		struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	char crypto_dev_name[RTE_CRYPTODEV_NAME_MAX_LEN];
+	struct armv8_crypto_private *internals;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* create a unique device name */
+	if (create_unique_device_name(crypto_dev_name,
+			RTE_CRYPTODEV_NAME_MAX_LEN) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create unique cryptodev name");
+		return -EINVAL;
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(crypto_dev_name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed", name);
+
+	cryptodev_armv8_crypto_uninit(crypto_dev_name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id()
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(name, &init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..2bf6475
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,369 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..b75107f
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,211 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define NBBY		8		/* Number of bits in a byte */
+#define BYTE_LENGTH(x)	((x) / NBBY)	/* Number of bytes in x (round down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define HMAC_IPAD_VALUE			(0x36)
+#define HMAC_OPAD_VALUE			(0x5C)
+
+#define SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
+				uint8_t *, uint8_t *, uint64_t,
+				crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 3/7] mk: add PMD to the build system
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-17 15:48           ` [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
  2017-01-17 15:48           ` [PATCH v4 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-17 15:48           ` zbigniew.bodek
  2017-01-17 15:49           ` [PATCH v4 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:48 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Build ARMv8 crypto PMD if compiling for ARM64
and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
is enable in the configuration file.
ARMV8_CRYPTO_LIB_PATH environment variable will
point to the appropriate library directory.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/Makefile | 1 +
 mk/rte.app.mk           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..77b02cf 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..bbb5265 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -145,6 +145,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 4/7] doc: update documentation about ARMv8 crypto PMD
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                             ` (2 preceding siblings ...)
  2017-01-17 15:48           ` [PATCH v4 3/7] mk: add PMD to the build system zbigniew.bodek
@ 2017-01-17 15:49           ` zbigniew.bodek
  2017-01-17 15:49           ` [PATCH v4 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:49 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add documentation about the driver and update
release notes.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/cryptodevs/armv8.rst        | 96 ++++++++++++++++++++++++++++++++++
 doc/guides/cryptodevs/index.rst        |  1 +
 doc/guides/rel_notes/release_17_02.rst |  5 ++
 3 files changed, 102 insertions(+)
 create mode 100644 doc/guides/cryptodevs/armv8.rst

diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..ca8781e
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,96 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2017.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+================================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level library, written in the assembly code.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+In order to enable this virtual crypto PMD, user must:
+
+* Download ARMv8 crypto library source code from
+  `here <https://github.com/caviumnetworks/armv8_crypto>`_
+
+* Export the environmental variable ARMV8_CRYPTO_LIB_PATH with
+  the path where the ``armv8_crypto`` library was downloaded
+  or cloned.
+
+* Build the library by invoking:
+
+.. code-block:: console
+
+	make -C $ARMV8_CRYPTO_LIB_PATH/
+
+* Set CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y in
+  config/defconfig_arm64-armv8a-linuxapp-gcc
+
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Cipher input data has to be a multiple of 16 bytes.
+* Digest input data has to be a multiple of 8 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 5ab7019..872b288 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -67,6 +67,11 @@ New Features
 
   * Support for single operations (cipher only and authentication only).
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 5/7] crypto/armv8: enable ARMv8 PMD in the configuration
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                             ` (3 preceding siblings ...)
  2017-01-17 15:49           ` [PATCH v4 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
@ 2017-01-17 15:49           ` zbigniew.bodek
  2017-01-17 15:49           ` [PATCH v4 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
  2017-01-17 15:49           ` [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  6 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:49 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option to
the common configuration file. Don't enable it by
default for ARM64 as it requires external library
to build.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 config/common_base | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/config/common_base b/config/common_base
index 8e9dcfa..f6779ee 100644
--- a/config/common_base
+++ b/config/common_base
@@ -415,6 +415,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                             ` (4 preceding siblings ...)
  2017-01-17 15:49           ` [PATCH v4 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
@ 2017-01-17 15:49           ` zbigniew.bodek
  2017-01-17 15:49           ` [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  6 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:49 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9645c9b..00c7adc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -447,6 +447,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors
  2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                             ` (5 preceding siblings ...)
  2017-01-17 15:49           ` [PATCH v4 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
@ 2017-01-17 15:49           ` zbigniew.bodek
  2017-01-18  2:26             ` Jerin Jacob
  6 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-17 15:49 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  64 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 144 ++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 486 +++++++++++++++++++++++++++++
 5 files changed, 691 insertions(+), 8 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 3eaf1b7..2093d5a 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,28 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1593,6 +1615,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -6928,6 +6966,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -6991,6 +7046,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -6999,3 +7062,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index 898aae1..6b15a40 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,98 @@
 	}
 };
 
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -878,37 +970,69 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -954,7 +1078,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -963,7 +1088,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1006,7 +1132,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1015,7 +1142,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index f1fe624..94bdb0b 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -87,6 +87,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -644,6 +645,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index fe97e4c..ab81a7f 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -49,6 +49,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 7751ff2..7973723 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,28 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2423,6 +2451,139 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+	uint32_t nb_ops;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length, pparams->chain);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			if ((num_to_submit - num_sent) < burst_size)
+				nb_ops = num_to_submit - num_sent;
+			else
+				nb_ops = burst_size;
+
+			start_cycles = rte_rdtsc();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				nb_ops);
+			end_cycles = rte_rdtsc();
+
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2688,6 +2849,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3375,6 +3586,139 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))], sess,
+					pparams->buf_size, digest_length,
+					pparams->chain);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3688,6 +4032,125 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4238,6 +4701,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4294,6 +4770,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4303,3 +4787,5 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest,
+		perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD
  2017-01-17 15:48           ` [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
@ 2017-01-18  2:24             ` Jerin Jacob
  0 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2017-01-18  2:24 UTC (permalink / raw)
  To: zbigniew.bodek
  Cc: dev, pablo.de.lara.guarch, declan.doherty, jianbo.liu, hemant.agrawal

On Tue, Jan 17, 2017 at 04:48:57PM +0100, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Please change the commit subject to

cryptodev: add cryptodev type for the ARMv8 PMD

from

lib: add cryptodev type for the upcoming ARMv8 PMD
> 
> Add type and name for ARMv8 crypto PMD
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  lib/librte_cryptodev/rte_cryptodev.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
> index 29d8eec..b370c2f 100644
> --- a/lib/librte_cryptodev/rte_cryptodev.h
> +++ b/lib/librte_cryptodev/rte_cryptodev.h
> @@ -66,6 +66,8 @@
>  /**< KASUMI PMD device name */
>  #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
>  /**< KASUMI PMD device name */
> +#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
> +/**< ARMv8 Crypto PMD device name */
>  
>  /** Crypto device type */
>  enum rte_cryptodev_type {
> @@ -77,6 +79,7 @@ enum rte_cryptodev_type {
>  	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
>  	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
>  	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
> +	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
>  };
>  
>  extern const char **rte_cyptodev_names;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors
  2017-01-17 15:49           ` [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2017-01-18  2:26             ` Jerin Jacob
  0 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2017-01-18  2:26 UTC (permalink / raw)
  To: zbigniew.bodek
  Cc: dev, pablo.de.lara.guarch, declan.doherty, jianbo.liu, hemant.agrawal

On Tue, Jan 17, 2017 at 04:49:03PM +0100, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Introduce unit tests for ARMv8 crypto PMD.
> Add test vectors for short cases such as 160 bytes.
> These test cases are ARMv8 specific since the code provides
> different processing paths for different input data sizes.
> 
> User can validate correctness of algorithms' implementation using:
> * cryptodev_sw_armv8_autotest
> For performance test one can use:
> * cryptodev_sw_armv8_perftest

Could you please rebase to latest dpdk-next-crypto HEAD.

➜ [master][dpdk-next-crypto] $ git am -3 dpdk-dev-v4-7-7-app-test-add-ARMv8-crypto-tests-and-test-vectors.patch
Applying: app/test: add ARMv8 crypto tests and test vectors
Using index info to reconstruct a base tree...
M       app/test/test_cryptodev.c
M       app/test/test_cryptodev_aes_test_vectors.h
M       app/test/test_cryptodev_blockcipher.c
M       app/test/test_cryptodev_blockcipher.h
M       app/test/test_cryptodev_perf.c
Falling back to patching base and 3-way merge...
Auto-merging app/test/test_cryptodev_perf.c
Auto-merging app/test/test_cryptodev_blockcipher.h
Auto-merging app/test/test_cryptodev_blockcipher.c
Auto-merging app/test/test_cryptodev_aes_test_vectors.h
CONFLICT (content): Merge conflict in
app/test/test_cryptodev_aes_test_vectors.h
Auto-merging app/test/test_cryptodev.c
Failed to merge in the changes.
Patch failed at 0001 app/test: add ARMv8 crypto tests and test vectors
The copy of the patch that failed is found in:
   /home/jerin/dpdk-next-crypto/.git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v5 0/7] Add crypto PMD optimized for ARMv8
  2017-01-17 15:48           ` [PATCH v4 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-18 14:27             ` zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 1/7] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
                                 ` (7 more replies)
  0 siblings, 8 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided.
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC.
The core crypto functionality of this driver is
provided by the external armv8_crypto library
that can be downloaded from the Cavium repository:
https://github.com/caviumnetworks/armv8_crypto

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA1 and
SHA256 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

---

v5:
* Add user defined name initializing parameter
  (according to b8a661f15eb8)
* Align with the current next-crypto master branch
* Another changes to commit logs

v4:
* Address new review remarks (keep ARMv8 naming though)
* Fix spelling and change commit logs
* Removed unused code for currently unsupported algorithms
* Enqueue processed crypto ops in bursts
* Add micro-optimizations to the PMD code
* Send build system fixes in a separate patch

v3:
* Addressed review remarks
* Moved low-level assembly code to the external library
* Removed SHA256 MAC cases
* Various fixes: interface to the library, digest destination
  and source address interpreting, missing mbuf manipulations.

v2:
* Fixed checkpatch warnings
* Divide patches into smaller logical parts

Zbigniew Bodek (7):
  cryptodev: add cryptodev type for the ARMv8 PMD
  crypto/armv8: add PMD optimized for ARMv8 processors
  mk: add PMD to the build system
  doc: update documentation about ARMv8 crypto PMD
  crypto/armv8: enable ARMv8 PMD in the configuration
  MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                    |   6 +
 app/test/test_cryptodev.c                      |  64 ++
 app/test/test_cryptodev_aes_test_vectors.h     | 145 +++-
 app/test/test_cryptodev_blockcipher.c          |   4 +
 app/test/test_cryptodev_blockcipher.h          |   1 +
 app/test/test_cryptodev_perf.c                 | 486 +++++++++++++
 config/common_base                             |   6 +
 doc/guides/cryptodevs/armv8.rst                |  96 +++
 doc/guides/cryptodevs/index.rst                |   1 +
 doc/guides/rel_notes/release_17_02.rst         |   5 +
 drivers/crypto/Makefile                        |   1 +
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 900 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 lib/librte_cryptodev/rte_cryptodev.h           |   3 +
 mk/rte.app.mk                                  |   2 +
 18 files changed, 2366 insertions(+), 9 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v5 1/7] cryptodev: add cryptodev type for the ARMv8 PMD
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                                 ` (6 subsequent siblings)
  7 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add type and name for ARMv8 crypto PMD

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_cryptodev/rte_cryptodev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index f4e66e6..452b174 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 Crypto PMD device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 2/7] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 1/7] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 3/7] mk: add PMD to the build system zbigniew.bodek
                                 ` (5 subsequent siblings)
  7 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver
using ARMv8 cryptographic extensions.
CPU compatibility with this driver is detected in
run-time and virtual crypto device will not be
created if CPU doesn't provide:
AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost
for chained crypto operations processing,
such as encryption + HMAC generation,
decryption + HMAC validation. In particular,
cipher only or hash only operations are
not provided.

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC
and relies on the external armv8_crypto library:
https://github.com/caviumnetworks/armv8_crypto

This patch adds driver's code only and does
not include it in the build system.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 900 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 5 files changed, 1555 insertions(+)
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..2003ec4
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,72 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2017.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(MAKECMDGOALS),clean)
+ifneq ($(MAKECMDGOALS),config)
+ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
+$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
+endif
+endif
+endif
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+# external library dependencies
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
+LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..1bf0f9d
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,900 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be described by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t
+crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+/* Evaluate to key length definition */
+#define KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/* Local aliases for supported ciphers */
+#define CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
+/* Local aliases for supported hashes */
+#define AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
+#define AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t
+crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+	case 256:
+		/* These key lengths are not supported yet */
+	default: /* Fall through */
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static inline struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (unlikely(sess == NULL))
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static inline void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	struct rte_mbuf *m_asrc, *m_adst;
+	uint8_t *csrc, *cdst;
+	uint8_t *adst, *asrc;
+	uint64_t clen, alen;
+	int error;
+
+	clen = op->sym->cipher.data.length;
+	alen = op->sym->auth.data.length;
+
+	csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		m_asrc = m_adst = mbuf_dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		m_asrc = mbuf_src;
+		m_adst = mbuf_dst;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+	asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
+				op->sym->auth.data.offset);
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(m_adst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
+	if (error != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+		/* Trim area used for digest from mbuf. */
+		rte_pktmbuf_trim(m_asrc,
+				op->sym->auth.digest.length);
+	}
+}
+
+/** Process crypto operation for mbuf */
+static inline int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (unlikely(op->status == RTE_CRYPTO_OP_STATUS_ERROR))
+		return -1;
+
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	qp->stats.enqueued_count += retval;
+
+	return retval;
+
+enqueue_err:
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return retval;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	struct armv8_crypto_private *internals;
+	int ret;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	if (init_params->name[0] == '\0') {
+		ret = rte_cryptodev_pmd_create_dev_name(
+				init_params->name,
+				RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+
+		if (ret < 0) {
+			ARMV8_CRYPTO_LOG_ERR("failed to create unique name");
+			return ret;
+		}
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(init_params->name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed",
+		init_params->name);
+
+	cryptodev_armv8_crypto_uninit(init_params->name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id(),
+		{0}
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	if (init_params.name[0] != '\0') {
+		RTE_LOG(INFO, PMD, "  User defined name = %s\n",
+			init_params.name);
+	}
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(&init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..2bf6475
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,369 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..b75107f
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,211 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define NBBY		8		/* Number of bits in a byte */
+#define BYTE_LENGTH(x)	((x) / NBBY)	/* Number of bytes in x (round down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define HMAC_IPAD_VALUE			(0x36)
+#define HMAC_OPAD_VALUE			(0x5C)
+
+#define SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
+				uint8_t *, uint8_t *, uint64_t,
+				crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 3/7] mk: add PMD to the build system
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 1/7] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
                                 ` (4 subsequent siblings)
  7 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Build ARMv8 crypto PMD if compiling for ARM64
and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
is enable in the configuration file.
ARMV8_CRYPTO_LIB_PATH environment variable will
point to the appropriate library directory.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/Makefile | 1 +
 mk/rte.app.mk           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..77b02cf 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..bbb5265 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -145,6 +145,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                 ` (2 preceding siblings ...)
  2017-01-18 14:27               ` [PATCH v5 3/7] mk: add PMD to the build system zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 17:05                 ` De Lara Guarch, Pablo
  2017-01-18 14:27               ` [PATCH v5 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
                                 ` (3 subsequent siblings)
  7 siblings, 1 reply; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add documentation about the driver and update
release notes.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/cryptodevs/armv8.rst        | 96 ++++++++++++++++++++++++++++++++++
 doc/guides/cryptodevs/index.rst        |  1 +
 doc/guides/rel_notes/release_17_02.rst |  5 ++
 3 files changed, 102 insertions(+)
 create mode 100644 doc/guides/cryptodevs/armv8.rst

diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..ca8781e
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,96 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2017.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+================================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level library, written in the assembly code.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+In order to enable this virtual crypto PMD, user must:
+
+* Download ARMv8 crypto library source code from
+  `here <https://github.com/caviumnetworks/armv8_crypto>`_
+
+* Export the environmental variable ARMV8_CRYPTO_LIB_PATH with
+  the path where the ``armv8_crypto`` library was downloaded
+  or cloned.
+
+* Build the library by invoking:
+
+.. code-block:: console
+
+	make -C $ARMV8_CRYPTO_LIB_PATH/
+
+* Set CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y in
+  config/defconfig_arm64-armv8a-linuxapp-gcc
+
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Cipher input data has to be a multiple of 16 bytes.
+* Digest input data has to be a multiple of 8 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index d59e386..e9c6c00 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -111,6 +111,11 @@ New Features
 
   * Support for single operations (cipher only and authentication only).
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 5/7] crypto/armv8: enable ARMv8 PMD in the configuration
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                 ` (3 preceding siblings ...)
  2017-01-18 14:27               ` [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
                                 ` (2 subsequent siblings)
  7 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option to
the common configuration file. Don't enable it by
default for ARM64 as it requires external library
to build.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 config/common_base | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/config/common_base b/config/common_base
index 8e9dcfa..f6779ee 100644
--- a/config/common_base
+++ b/config/common_base
@@ -415,6 +415,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                 ` (4 preceding siblings ...)
  2017-01-18 14:27               ` [PATCH v5 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 14:27               ` [PATCH v5 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  2017-01-18 15:23               ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 Jerin Jacob
  7 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9645c9b..00c7adc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -447,6 +447,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v5 7/7] app/test: add ARMv8 crypto tests and test vectors
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                 ` (5 preceding siblings ...)
  2017-01-18 14:27               ` [PATCH v5 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
@ 2017-01-18 14:27               ` zbigniew.bodek
  2017-01-18 15:23               ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 Jerin Jacob
  7 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 14:27 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  64 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 145 ++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 486 +++++++++++++++++++++++++++++
 5 files changed, 691 insertions(+), 9 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index 5786fde..1c23f85 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,28 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1593,6 +1615,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -7302,6 +7340,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -7365,6 +7420,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -7373,3 +7436,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index e566548..f0f37ed 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,98 @@
 	}
 };
 
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -888,12 +980,20 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
 				"Scatter Gather",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -902,35 +1002,58 @@
 		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
-
 	},
 	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -998,7 +1121,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1007,7 +1131,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1050,7 +1175,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1059,7 +1185,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index 01aef3b..a48540c 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -102,6 +102,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -645,6 +646,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index 7256f6b..91e9858 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -50,6 +50,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 9b26fc1..7f1adf8 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,28 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2425,6 +2453,139 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+	uint32_t nb_ops;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length, pparams->chain);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			if ((num_to_submit - num_sent) < burst_size)
+				nb_ops = num_to_submit - num_sent;
+			else
+				nb_ops = burst_size;
+
+			start_cycles = rte_rdtsc();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				nb_ops);
+			end_cycles = rte_rdtsc();
+
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2690,6 +2851,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3380,6 +3591,139 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))], sess,
+					pparams->buf_size, digest_length,
+					pparams->chain);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3693,6 +4037,125 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4244,6 +4707,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4300,6 +4776,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4309,3 +4793,5 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest,
+		perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v5 0/7] Add crypto PMD optimized for ARMv8
  2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                 ` (6 preceding siblings ...)
  2017-01-18 14:27               ` [PATCH v5 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2017-01-18 15:23               ` Jerin Jacob
  7 siblings, 0 replies; 100+ messages in thread
From: Jerin Jacob @ 2017-01-18 15:23 UTC (permalink / raw)
  To: zbigniew.bodek
  Cc: dev, pablo.de.lara.guarch, declan.doherty, jianbo.liu, hemant.agrawal

On Wed, Jan 18, 2017 at 03:27:23PM +0100, zbigniew.bodek@caviumnetworks.com wrote:
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Introduce crypto poll mode driver using ARMv8
> cryptographic extensions. This PMD is optimized
> to provide performance boost for chained
> crypto operations processing, such as:
> * encryption + HMAC generation
> * decryption + HMAC validation.
> In particular, cipher only or hash only
> operations are not provided.
> Performance gain can be observed in tests
> against OpenSSL PMD which also uses ARM
> crypto extensions for packets processing.
> 
> Exemplary crypto performance tests comparison:
> 
> cipher_hash. cipher algo: AES_CBC
> auth algo: SHA1_HMAC cipher key size=16.
> burst_size: 64 ops
> 
> ARMv8 PMD improvement over OpenSSL PMD
> (Optimized for ARMv8 cipher only and hash
> only cases):
> 
> Buffer
> Size(B)   OPS(M)      Throughput(Gbps)
> 64        729 %        742 %
> 128       577 %        592 %
> 256       483 %        476 %
> 512       336 %        351 %
> 768       300 %        286 %
> 1024      263 %        250 %
> 1280      225 %        229 %
> 1536      214 %        213 %
> 1792      186 %        203 %
> 2048      200 %        193 %
> 
> The driver currently supports AES-128-CBC
> in combination with: SHA256 HMAC and SHA1 HMAC.
> The core crypto functionality of this driver is
> provided by the external armv8_crypto library
> that can be downloaded from the Cavium repository:
> https://github.com/caviumnetworks/armv8_crypto
> 
> CPU compatibility with this virtual device
> is detected in run-time and virtual crypto
> device will not be created if CPU doesn't
> provide AES, SHA1, SHA2 and NEON.
> 
> The functionality and performance of this
> code can be tested using generic test application
> with the following commands:
> * cryptodev_sw_armv8_autotest
> * cryptodev_sw_armv8_perftest
> New test vectors and cases have been added
> to the general pool. In particular SHA1 and
> SHA256 HMAC for short cases were introduced.
> This is because low-level ARM assembly code
> is using different code paths for long and
> short data sets, so in order to test the
> mentioned driver correctly, two different
> data sets need to be provided.
> 
> ---
> 
> v5:
> * Add user defined name initializing parameter
>   (according to b8a661f15eb8)
> * Align with the current next-crypto master branch
> * Another changes to commit logs


Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

dpdk-crypto-next:
changeset: 0f0099d86e9b0b0865837b70a09018b0e4bd8411

https://github.com/caviumnetworks/armv8_crypto.git
changeset: 71258fb9fe100d411a53a247040e675fbae45e63

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD
  2017-01-18 14:27               ` [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
@ 2017-01-18 17:05                 ` De Lara Guarch, Pablo
  2017-01-18 19:52                   ` Zbigniew Bodek
  0 siblings, 1 reply; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-18 17:05 UTC (permalink / raw)
  To: zbigniew.bodek, dev
  Cc: Doherty, Declan, jerin.jacob, jianbo.liu, hemant.agrawal

Hi Bodek,

> -----Original Message-----
> From: zbigniew.bodek@caviumnetworks.com
> [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, January 18, 2017 2:27 PM
> To: dev@dpdk.org
> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> jerin.jacob@caviumnetworks.com; jianbo.liu@linaro.org;
> hemant.agrawal@nxp.com; Zbigniew Bodek
> Subject: [PATCH v5 4/7] doc: update documentation about ARMv8 crypto
> PMD
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> 
> Add documentation about the driver and update
> release notes.
> 
> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  doc/guides/cryptodevs/armv8.rst        | 96
> ++++++++++++++++++++++++++++++++++
>  doc/guides/cryptodevs/index.rst        |  1 +
>  doc/guides/rel_notes/release_17_02.rst |  5 ++
>  3 files changed, 102 insertions(+)
>  create mode 100644 doc/guides/cryptodevs/armv8.rst
> 
> diff --git a/doc/guides/cryptodevs/armv8.rst
> b/doc/guides/cryptodevs/armv8.rst
> new file mode 100644
> index 0000000..ca8781e
> --- /dev/null
> +++ b/doc/guides/cryptodevs/armv8.rst

...

> +
> +ARMv8 Crypto Poll Mode Driver
> +================================

Extra "===" here.

> +
> +This code provides the initial implementation of the ARMv8 crypto PMD.
> +The driver uses ARMv8 cryptographic extensions to process chained
> crypto
> +operations in an optimized way. The core functionality is provided by
> +a low-level library, written in the assembly code.
> +
> +Features
> +--------
> +
> +ARMv8 Crypto PMD has support for the following algorithm pairs:
> +
> +Supported cipher algorithms:
> +* ``RTE_CRYPTO_CIPHER_AES_CBC``

Add a blank like before starting a list (same below).

> +
> +Supported authentication algorithms:
> +* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
> +* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
> +

Could you add an entry on the "Crypto Device Supported Functionality Matrices",
to show supported algorithms and feature flags? It is in doc/guides/cryptodevs/overview.rst.

There should be a column per crypto device
(I just realized that I missed one for ZUC PMD, so I will send a patch shortly,
and then you can rebase it on top of it).

The rest of the patchset looks good to me, so once you send another version, I will merge it.

Thanks,
Pablo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD
  2017-01-18 17:05                 ` De Lara Guarch, Pablo
@ 2017-01-18 19:52                   ` Zbigniew Bodek
  2017-01-18 19:54                     ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-18 19:52 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, dev
  Cc: Doherty, Declan, jerin.jacob, jianbo.liu, hemant.agrawal

Hello Pablo,

Thanks for the remarks. Please check my answers in-line below.

Kind regards
Zbigniew

On 18.01.2017 18:05, De Lara Guarch, Pablo wrote:
> Hi Bodek,
>
>> -----Original Message-----
>> From: zbigniew.bodek@caviumnetworks.com
>> [mailto:zbigniew.bodek@caviumnetworks.com]
>> Sent: Wednesday, January 18, 2017 2:27 PM
>> To: dev@dpdk.org
>> Cc: De Lara Guarch, Pablo; Doherty, Declan;
>> jerin.jacob@caviumnetworks.com; jianbo.liu@linaro.org;
>> hemant.agrawal@nxp.com; Zbigniew Bodek
>> Subject: [PATCH v5 4/7] doc: update documentation about ARMv8 crypto
>> PMD
>>
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>>
>> Add documentation about the driver and update
>> release notes.
>>
>> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>> Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> ---
>>  doc/guides/cryptodevs/armv8.rst        | 96
>> ++++++++++++++++++++++++++++++++++
>>  doc/guides/cryptodevs/index.rst        |  1 +
>>  doc/guides/rel_notes/release_17_02.rst |  5 ++
>>  3 files changed, 102 insertions(+)
>>  create mode 100644 doc/guides/cryptodevs/armv8.rst
>>
>> diff --git a/doc/guides/cryptodevs/armv8.rst
>> b/doc/guides/cryptodevs/armv8.rst
>> new file mode 100644
>> index 0000000..ca8781e
>> --- /dev/null
>> +++ b/doc/guides/cryptodevs/armv8.rst
>
> ...
>
>> +
>> +ARMv8 Crypto Poll Mode Driver
>> +================================
>
> Extra "===" here.

Fixed in the upcoming patchset.

>
>> +
>> +This code provides the initial implementation of the ARMv8 crypto PMD.
>> +The driver uses ARMv8 cryptographic extensions to process chained
>> crypto
>> +operations in an optimized way. The core functionality is provided by
>> +a low-level library, written in the assembly code.
>> +
>> +Features
>> +--------
>> +
>> +ARMv8 Crypto PMD has support for the following algorithm pairs:
>> +
>> +Supported cipher algorithms:
>> +* ``RTE_CRYPTO_CIPHER_AES_CBC``
>
> Add a blank like before starting a list (same below).

Also fixed.

>
>> +
>> +Supported authentication algorithms:
>> +* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
>> +* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
>> +
>
> Could you add an entry on the "Crypto Device Supported Functionality Matrices",
> to show supported algorithms and feature flags? It is in doc/guides/cryptodevs/overview.rst.

Yes, looking at that file I realized that we also could add "crypto 
device supported feature flags" for ARM. I created another commit in the 
patchset (preceding the one with the documentation update for PMD).
The method of adding this flags is similar to what has been done earlier 
for other PMDs and their features.
I used two names:
* NEON - which is an ARM component so we can use this name as a unique name.
* ARM_CE - for ARM cryptographic extensions. AFAIK there is no other 
name for that.

>
> There should be a column per crypto device
> (I just realized that I missed one for ZUC PMD, so I will send a patch shortly,
> and then you can rebase it on top of it).

Done with the new flags mentioned above as well. I'm sending another 
patchset now and if you have some remarks to the new commits then let's 
do another round :-).

>
> The rest of the patchset looks good to me, so once you send another version, I will merge it.
>
> Thanks,
> Pablo
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD
  2017-01-18 19:52                   ` Zbigniew Bodek
@ 2017-01-18 19:54                     ` De Lara Guarch, Pablo
  0 siblings, 0 replies; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-18 19:54 UTC (permalink / raw)
  To: Zbigniew Bodek, dev
  Cc: Doherty, Declan, jerin.jacob, jianbo.liu, hemant.agrawal

Hi Zbigniew,

> -----Original Message-----
> From: Zbigniew Bodek [mailto:zbigniew.bodek@caviumnetworks.com]
> Sent: Wednesday, January 18, 2017 7:52 PM
> To: De Lara Guarch, Pablo; dev@dpdk.org
> Cc: Doherty, Declan; jerin.jacob@caviumnetworks.com;
> jianbo.liu@linaro.org; hemant.agrawal@nxp.com
> Subject: Re: [PATCH v5 4/7] doc: update documentation about ARMv8
> crypto PMD
> 
> Hello Pablo,
> 
> Thanks for the remarks. Please check my answers in-line below.
> 
> Kind regards
> Zbigniew
> 
> On 18.01.2017 18:05, De Lara Guarch, Pablo wrote:
> > Hi Bodek,
> >
> >> -----Original Message-----
> >> From: zbigniew.bodek@caviumnetworks.com
> >> [mailto:zbigniew.bodek@caviumnetworks.com]
> >> Sent: Wednesday, January 18, 2017 2:27 PM
> >> To: dev@dpdk.org
> >> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> >> jerin.jacob@caviumnetworks.com; jianbo.liu@linaro.org;
> >> hemant.agrawal@nxp.com; Zbigniew Bodek
> >> Subject: [PATCH v5 4/7] doc: update documentation about ARMv8
> crypto
> >> PMD
> >>
> >> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >>
> >> Add documentation about the driver and update
> >> release notes.
> >>
> >> Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
> >> Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> >> ---
> >>  doc/guides/cryptodevs/armv8.rst        | 96
> >> ++++++++++++++++++++++++++++++++++
> >>  doc/guides/cryptodevs/index.rst        |  1 +
> >>  doc/guides/rel_notes/release_17_02.rst |  5 ++
> >>  3 files changed, 102 insertions(+)
> >>  create mode 100644 doc/guides/cryptodevs/armv8.rst
> >>
> >> diff --git a/doc/guides/cryptodevs/armv8.rst
> >> b/doc/guides/cryptodevs/armv8.rst
> >> new file mode 100644
> >> index 0000000..ca8781e
> >> --- /dev/null
> >> +++ b/doc/guides/cryptodevs/armv8.rst
> >
> > ...
> >
> >> +
> >> +ARMv8 Crypto Poll Mode Driver
> >> +================================
> >
> > Extra "===" here.
> 
> Fixed in the upcoming patchset.
> 
> >
> >> +
> >> +This code provides the initial implementation of the ARMv8 crypto
> PMD.
> >> +The driver uses ARMv8 cryptographic extensions to process chained
> >> crypto
> >> +operations in an optimized way. The core functionality is provided by
> >> +a low-level library, written in the assembly code.
> >> +
> >> +Features
> >> +--------
> >> +
> >> +ARMv8 Crypto PMD has support for the following algorithm pairs:
> >> +
> >> +Supported cipher algorithms:
> >> +* ``RTE_CRYPTO_CIPHER_AES_CBC``
> >
> > Add a blank like before starting a list (same below).
> 
> Also fixed.
> 
> >
> >> +
> >> +Supported authentication algorithms:
> >> +* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
> >> +* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
> >> +
> >
> > Could you add an entry on the "Crypto Device Supported Functionality
> Matrices",
> > to show supported algorithms and feature flags? It is in
> doc/guides/cryptodevs/overview.rst.
> 
> Yes, looking at that file I realized that we also could add "crypto
> device supported feature flags" for ARM. I created another commit in the
> patchset (preceding the one with the documentation update for PMD).
> The method of adding this flags is similar to what has been done earlier
> for other PMDs and their features.
> I used two names:
> * NEON - which is an ARM component so we can use this name as a unique
> name.
> * ARM_CE - for ARM cryptographic extensions. AFAIK there is no other
> name for that.

Looks good to me.
> 
> >
> > There should be a column per crypto device
> > (I just realized that I missed one for ZUC PMD, so I will send a patch
> shortly,
> > and then you can rebase it on top of it).
> 
> Done with the new flags mentioned above as well. I'm sending another
> patchset now and if you have some remarks to the new commits then let's
> do another round :-).

Thanks!

Pablo

> 
> >
> > The rest of the patchset looks good to me, so once you send another
> version, I will merge it.
> >
> > Thanks,
> > Pablo
> >

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v6 0/8] Add crypto PMD optimized for ARMv8
  2017-01-18 14:27               ` [PATCH v5 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-18 20:01                 ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 1/8] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
                                     ` (8 more replies)
  0 siblings, 9 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce crypto poll mode driver using ARMv8
cryptographic extensions. This PMD is optimized
to provide performance boost for chained
crypto operations processing, such as:
* encryption + HMAC generation
* decryption + HMAC validation.
In particular, cipher only or hash only
operations are not provided.
Performance gain can be observed in tests
against OpenSSL PMD which also uses ARM
crypto extensions for packets processing.

Exemplary crypto performance tests comparison:

cipher_hash. cipher algo: AES_CBC
auth algo: SHA1_HMAC cipher key size=16.
burst_size: 64 ops

ARMv8 PMD improvement over OpenSSL PMD
(Optimized for ARMv8 cipher only and hash
only cases):

Buffer
Size(B)   OPS(M)      Throughput(Gbps)
64        729 %        742 %
128       577 %        592 %
256       483 %        476 %
512       336 %        351 %
768       300 %        286 %
1024      263 %        250 %
1280      225 %        229 %
1536      214 %        213 %
1792      186 %        203 %
2048      200 %        193 %

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC.
The core crypto functionality of this driver is
provided by the external armv8_crypto library
that can be downloaded from the Cavium repository:
https://github.com/caviumnetworks/armv8_crypto

CPU compatibility with this virtual device
is detected in run-time and virtual crypto
device will not be created if CPU doesn't
provide AES, SHA1, SHA2 and NEON.

The functionality and performance of this
code can be tested using generic test application
with the following commands:
* cryptodev_sw_armv8_autotest
* cryptodev_sw_armv8_perftest
New test vectors and cases have been added
to the general pool. In particular SHA1 and
SHA256 HMAC for short cases were introduced.
This is because low-level ARM assembly code
is using different code paths for long and
short data sets, so in order to test the
mentioned driver correctly, two different
data sets need to be provided.

---

v6:
* Add minor fixes to the documentation
* Introduce ARM-specific feature flags
* Add information about the supported feature flags
  to doc/guides/cryptodevs/overview.rst

v5:
* Add user defined name initializing parameter
  (according to b8a661f15eb8)
* Align with the current next-crypto master branch
* Another changes to commit logs

v4:
* Address new review remarks (keep ARMv8 naming though)
* Fix spelling and change commit logs
* Removed unused code for currently unsupported algorithms
* Enqueue processed crypto ops in bursts
* Add micro-optimizations to the PMD code
* Send build system fixes in a separate patch

v3:
* Addressed review remarks
* Moved low-level assembly code to the external library
* Removed SHA256 MAC cases
* Various fixes: interface to the library, digest destination
  and source address interpreting, missing mbuf manipulations.

v2:
* Fixed checkpatch warnings
* Divide patches into smaller logical parts

Zbigniew Bodek (8):
  cryptodev: add cryptodev type for the ARMv8 PMD
  crypto/armv8: add PMD optimized for ARMv8 processors
  mk: add PMD to the build system
  cryptodev/armv8: introduce ARM-specific feature flags
  doc: update documentation about ARMv8 crypto PMD
  crypto/armv8: enable ARMv8 PMD in the configuration
  MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  app/test: add ARMv8 crypto tests and test vectors

 MAINTAINERS                                    |   6 +
 app/test/test_cryptodev.c                      |  64 ++
 app/test/test_cryptodev_aes_test_vectors.h     | 145 +++-
 app/test/test_cryptodev_blockcipher.c          |   4 +
 app/test/test_cryptodev_blockcipher.h          |   1 +
 app/test/test_cryptodev_perf.c                 | 486 +++++++++++++
 config/common_base                             |   6 +
 doc/guides/cryptodevs/armv8.rst                |  98 +++
 doc/guides/cryptodevs/index.rst                |   1 +
 doc/guides/cryptodevs/overview.rst             |  92 +--
 doc/guides/rel_notes/release_17_02.rst         |   5 +
 drivers/crypto/Makefile                        |   1 +
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 902 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 lib/librte_cryptodev/rte_cryptodev.c           |   4 +
 lib/librte_cryptodev/rte_cryptodev.h           |   8 +
 mk/rte.app.mk                                  |   2 +
 20 files changed, 2426 insertions(+), 54 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

-- 
1.9.1

^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v6 1/8] cryptodev: add cryptodev type for the ARMv8 PMD
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 2/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
                                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add type and name for ARMv8 crypto PMD

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_cryptodev/rte_cryptodev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index f4e66e6..452b174 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -66,6 +66,8 @@
 /**< KASUMI PMD device name */
 #define CRYPTODEV_NAME_ZUC_PMD		crypto_zuc
 /**< KASUMI PMD device name */
+#define CRYPTODEV_NAME_ARMV8_PMD	crypto_armv8
+/**< ARMv8 Crypto PMD device name */
 
 /** Crypto device type */
 enum rte_cryptodev_type {
@@ -77,6 +79,7 @@ enum rte_cryptodev_type {
 	RTE_CRYPTODEV_KASUMI_PMD,	/**< KASUMI PMD */
 	RTE_CRYPTODEV_ZUC_PMD,		/**< ZUC PMD */
 	RTE_CRYPTODEV_OPENSSL_PMD,    /**<  OpenSSL PMD */
+	RTE_CRYPTODEV_ARMV8_PMD,	/**< ARMv8 crypto PMD */
 };
 
 extern const char **rte_cyptodev_names;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 2/8] crypto/armv8: add PMD optimized for ARMv8 processors
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 1/8] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 3/8] mk: add PMD to the build system zbigniew.bodek
                                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

This patch introduces crypto poll mode driver
using ARMv8 cryptographic extensions.
CPU compatibility with this driver is detected in
run-time and virtual crypto device will not be
created if CPU doesn't provide:
AES, SHA1, SHA2 and NEON.

This PMD is optimized to provide performance boost
for chained crypto operations processing,
such as encryption + HMAC generation,
decryption + HMAC validation. In particular,
cipher only or hash only operations are
not provided.

The driver currently supports AES-128-CBC
in combination with: SHA256 HMAC and SHA1 HMAC
and relies on the external armv8_crypto library:
https://github.com/caviumnetworks/armv8_crypto

This patch adds driver's code only and does
not include it in the build system.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/armv8/Makefile                  |  72 ++
 drivers/crypto/armv8/rte_armv8_pmd.c           | 900 +++++++++++++++++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_ops.c       | 369 ++++++++++
 drivers/crypto/armv8/rte_armv8_pmd_private.h   | 211 ++++++
 drivers/crypto/armv8/rte_armv8_pmd_version.map |   3 +
 5 files changed, 1555 insertions(+)
 create mode 100644 drivers/crypto/armv8/Makefile
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_ops.c
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_private.h
 create mode 100644 drivers/crypto/armv8/rte_armv8_pmd_version.map

diff --git a/drivers/crypto/armv8/Makefile b/drivers/crypto/armv8/Makefile
new file mode 100644
index 0000000..2003ec4
--- /dev/null
+++ b/drivers/crypto/armv8/Makefile
@@ -0,0 +1,72 @@
+#
+#   BSD LICENSE
+#
+#   Copyright (C) Cavium networks Ltd. 2017.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Cavium networks nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(MAKECMDGOALS),clean)
+ifneq ($(MAKECMDGOALS),config)
+ifeq ($(ARMV8_CRYPTO_LIB_PATH),)
+$(error "Please define ARMV8_CRYPTO_LIB_PATH environment variable")
+endif
+endif
+endif
+
+# library name
+LIB = librte_pmd_armv8.a
+
+# build flags
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# library version
+LIBABIVER := 1
+
+# versioning export map
+EXPORT_MAP := rte_armv8_pmd_version.map
+
+# external library dependencies
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)
+CFLAGS += -I$(ARMV8_CRYPTO_LIB_PATH)/asm/include
+LDLIBS += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
+
+# library source files
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += rte_armv8_pmd_ops.c
+
+# library dependencies
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_eal
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mbuf
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_mempool
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_ring
+DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += lib/librte_cryptodev
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
new file mode 100644
index 0000000..1bf0f9d
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -0,0 +1,900 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_hexdump.h>
+#include <rte_cryptodev.h>
+#include <rte_cryptodev_pmd.h>
+#include <rte_vdev.h>
+#include <rte_malloc.h>
+#include <rte_cpuflags.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static int cryptodev_armv8_crypto_uninit(const char *name);
+
+/**
+ * Pointers to the supported combined mode crypto functions are stored
+ * in the static tables. Each combined (chained) cryptographic operation
+ * can be described by a set of numbers:
+ * - order:	order of operations (cipher, auth) or (auth, cipher)
+ * - direction:	encryption or decryption
+ * - calg:	cipher algorithm such as AES_CBC, AES_CTR, etc.
+ * - aalg:	authentication algorithm such as SHA1, SHA256, etc.
+ * - keyl:	cipher key length, for example 128, 192, 256 bits
+ *
+ * In order to quickly acquire each function pointer based on those numbers,
+ * a hierarchy of arrays is maintained. The final level, 3D array is indexed
+ * by the combined mode function parameters only (cipher algorithm,
+ * authentication algorithm and key length).
+ *
+ * This gives 3 memory accesses to obtain a function pointer instead of
+ * traversing the array manually and comparing function parameters on each loop.
+ *
+ *                   +--+CRYPTO_FUNC
+ *            +--+ENC|
+ *      +--+CA|
+ *      |     +--+DEC
+ * ORDER|
+ *      |     +--+ENC
+ *      +--+AC|
+ *            +--+DEC
+ *
+ */
+
+/**
+ * 3D array type for ARM Combined Mode crypto functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_AUTH_MAX:			max auth ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_func_t
+crypto_func_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_AUTH_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+/* Evaluate to key length definition */
+#define KEYL(keyl)		(ARMV8_CRYPTO_CIPHER_KEYLEN_ ## keyl)
+
+/* Local aliases for supported ciphers */
+#define CIPH_AES_CBC		RTE_CRYPTO_CIPHER_AES_CBC
+/* Local aliases for supported hashes */
+#define AUTH_SHA1_HMAC		RTE_CRYPTO_AUTH_SHA1_HMAC
+#define AUTH_SHA256_HMAC	RTE_CRYPTO_AUTH_SHA256_HMAC
+
+/**
+ * Arrays containing pointers to particular cryptographic,
+ * combined mode functions.
+ * crypto_op_ca_encrypt:	cipher (encrypt), authenticate
+ * crypto_op_ca_decrypt:	cipher (decrypt), authenticate
+ * crypto_op_ac_encrypt:	authenticate, cipher (encrypt)
+ * crypto_op_ac_decrypt:	authenticate, cipher (decrypt)
+ */
+static const crypto_func_tbl_t
+crypto_op_ca_encrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = aes128cbc_sha1_hmac,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = aes128cbc_sha256_hmac,
+};
+
+static const crypto_func_tbl_t
+crypto_op_ca_decrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_encrypt = {
+	NULL
+};
+
+static const crypto_func_tbl_t
+crypto_op_ac_decrypt = {
+	/* [cipher alg][auth alg][key length] = crypto_function, */
+	[CIPH_AES_CBC][AUTH_SHA1_HMAC][KEYL(128)] = sha1_hmac_aes128cbc_dec,
+	[CIPH_AES_CBC][AUTH_SHA256_HMAC][KEYL(128)] = sha256_hmac_aes128cbc_dec,
+};
+
+/**
+ * Arrays containing pointers to particular cryptographic function sets,
+ * covering given cipher operation directions (encrypt, decrypt)
+ * for each order of cipher and authentication pairs.
+ */
+static const crypto_func_tbl_t *
+crypto_cipher_auth[] = {
+	&crypto_op_ca_encrypt,
+	&crypto_op_ca_decrypt,
+	NULL
+};
+
+static const crypto_func_tbl_t *
+crypto_auth_cipher[] = {
+	&crypto_op_ac_encrypt,
+	&crypto_op_ac_decrypt,
+	NULL
+};
+
+/**
+ * Top level array containing pointers to particular cryptographic
+ * function sets, covering given order of chained operations.
+ * crypto_cipher_auth:	cipher first, authenticate after
+ * crypto_auth_cipher:	authenticate first, cipher after
+ */
+static const crypto_func_tbl_t **
+crypto_chain_order[] = {
+	crypto_cipher_auth,
+	crypto_auth_cipher,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_ALGO(order, cop, calg, aalg, keyl)			\
+({									\
+	crypto_func_tbl_t *func_tbl =					\
+				(crypto_chain_order[(order)])[(cop)];	\
+									\
+	((*func_tbl)[(calg)][(aalg)][KEYL(keyl)]);		\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/**
+ * 2D array type for ARM key schedule functions pointers.
+ * CRYPTO_CIPHER_MAX:			max cipher ID number
+ * CRYPTO_CIPHER_KEYLEN_MAX:		max key length ID number
+ */
+typedef const crypto_key_sched_t
+crypto_key_sched_tbl_t[CRYPTO_CIPHER_MAX][CRYPTO_CIPHER_KEYLEN_MAX];
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_encrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_enc,
+};
+
+static const crypto_key_sched_tbl_t
+crypto_key_sched_decrypt = {
+	/* [cipher alg][key length] = key_expand_func, */
+	[CIPH_AES_CBC][KEYL(128)] = aes128_key_sched_dec,
+};
+
+/**
+ * Top level array containing pointers to particular key generation
+ * function sets, covering given operation direction.
+ * crypto_key_sched_encrypt:	keys for encryption
+ * crypto_key_sched_decrypt:	keys for decryption
+ */
+static const crypto_key_sched_tbl_t *
+crypto_key_sched_dir[] = {
+	&crypto_key_sched_encrypt,
+	&crypto_key_sched_decrypt,
+	NULL
+};
+
+/**
+ * Extract particular combined mode crypto function from the 3D array.
+ */
+#define CRYPTO_GET_KEY_SCHED(cop, calg, keyl)				\
+({									\
+	crypto_key_sched_tbl_t *ks_tbl = crypto_key_sched_dir[(cop)];	\
+									\
+	((*ks_tbl)[(calg)][KEYL(keyl)]);				\
+})
+
+/*----------------------------------------------------------------------------*/
+
+/*
+ *------------------------------------------------------------------------------
+ * Session Prepare
+ *------------------------------------------------------------------------------
+ */
+
+/** Get xform chain order */
+static enum armv8_crypto_chain_order
+armv8_crypto_get_chain_order(const struct rte_crypto_sym_xform *xform)
+{
+
+	/*
+	 * This driver currently covers only chained operations.
+	 * Ignore only cipher or only authentication operations
+	 * or chains longer than 2 xform structures.
+	 */
+	if (xform->next == NULL || xform->next->next != NULL)
+		return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_AUTH) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_CIPHER)
+			return ARMV8_CRYPTO_CHAIN_AUTH_CIPHER;
+	}
+
+	if (xform->type == RTE_CRYPTO_SYM_XFORM_CIPHER) {
+		if (xform->next->type == RTE_CRYPTO_SYM_XFORM_AUTH)
+			return ARMV8_CRYPTO_CHAIN_CIPHER_AUTH;
+	}
+
+	return ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED;
+}
+
+static inline void
+auth_hmac_pad_prepare(struct armv8_crypto_session *sess,
+				const struct rte_crypto_sym_xform *xform)
+{
+	size_t i;
+
+	/* Generate i_key_pad and o_key_pad */
+	memset(sess->auth.hmac.i_key_pad, 0, sizeof(sess->auth.hmac.i_key_pad));
+	rte_memcpy(sess->auth.hmac.i_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	memset(sess->auth.hmac.o_key_pad, 0, sizeof(sess->auth.hmac.o_key_pad));
+	rte_memcpy(sess->auth.hmac.o_key_pad, sess->auth.hmac.key,
+							xform->auth.key.length);
+	/*
+	 * XOR key with IPAD/OPAD values to obtain i_key_pad
+	 * and o_key_pad.
+	 * Byte-by-byte operation may seem to be the less efficient
+	 * here but in fact it's the opposite.
+	 * The result ASM code is likely operate on NEON registers
+	 * (load auth key to Qx, load IPAD/OPAD to multiple
+	 * elements of Qy, eor 128 bits at once).
+	 */
+	for (i = 0; i < SHA_BLOCK_MAX; i++) {
+		sess->auth.hmac.i_key_pad[i] ^= HMAC_IPAD_VALUE;
+		sess->auth.hmac.o_key_pad[i] ^= HMAC_OPAD_VALUE;
+	}
+}
+
+static inline int
+auth_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	uint8_t partial[64] = { 0 };
+	int error;
+
+	switch (xform->auth.algo) {
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA1_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA1_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 160 bits
+			 * the algorithm will use SHA1(key) instead.
+			 */
+			error = sha1_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA1_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha1_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		error = sha1_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA1_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA1_BLOCK_SIZE);
+
+		break;
+	case RTE_CRYPTO_AUTH_SHA256_HMAC:
+		/*
+		 * Generate authentication key, i_key_pad and o_key_pad.
+		 */
+		/* Zero memory under key */
+		memset(sess->auth.hmac.key, 0, SHA256_AUTH_KEY_LENGTH);
+
+		if (xform->auth.key.length > SHA256_AUTH_KEY_LENGTH) {
+			/*
+			 * In case the key is longer than 256 bits
+			 * the algorithm will use SHA256(key) instead.
+			 */
+			error = sha256_block(NULL, xform->auth.key.data,
+				sess->auth.hmac.key, xform->auth.key.length);
+			if (error != 0)
+				return -1;
+		} else {
+			/*
+			 * Now copy the given authentication key to the session
+			 * key assuming that the session key is zeroed there is
+			 * no need for additional zero padding if the key is
+			 * shorter than SHA256_AUTH_KEY_LENGTH.
+			 */
+			rte_memcpy(sess->auth.hmac.key, xform->auth.key.data,
+							xform->auth.key.length);
+		}
+
+		/* Prepare HMAC padding: key|pattern */
+		auth_hmac_pad_prepare(sess, xform);
+		/*
+		 * Calculate partial hash values for i_key_pad and o_key_pad.
+		 * Will be used as initialization state for final HMAC.
+		 */
+		error = sha256_block_partial(NULL, sess->auth.hmac.i_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.i_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		error = sha256_block_partial(NULL, sess->auth.hmac.o_key_pad,
+		    partial, SHA256_BLOCK_SIZE);
+		if (error != 0)
+			return -1;
+		memcpy(sess->auth.hmac.o_key_pad, partial, SHA256_BLOCK_SIZE);
+
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static inline int
+cipher_set_prerequisites(struct armv8_crypto_session *sess,
+			const struct rte_crypto_sym_xform *xform)
+{
+	crypto_key_sched_t cipher_key_sched;
+
+	cipher_key_sched = sess->cipher.key_sched;
+	if (likely(cipher_key_sched != NULL)) {
+		/* Set up cipher session key */
+		cipher_key_sched(sess->cipher.key.data, xform->cipher.key.data);
+	}
+
+	return 0;
+}
+
+static int
+armv8_crypto_set_session_chained_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *cipher_xform,
+		const struct rte_crypto_sym_xform *auth_xform)
+{
+	enum armv8_crypto_chain_order order;
+	enum armv8_crypto_cipher_operation cop;
+	enum rte_crypto_cipher_algorithm calg;
+	enum rte_crypto_auth_algorithm aalg;
+
+	/* Validate and prepare scratch order of combined operations */
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		order = sess->chain_order;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select cipher direction */
+	sess->cipher.direction = cipher_xform->cipher.op;
+	/* Select cipher key */
+	sess->cipher.key.length = cipher_xform->cipher.key.length;
+	/* Set cipher direction */
+	cop = sess->cipher.direction;
+	/* Set cipher algorithm */
+	calg = cipher_xform->cipher.algo;
+
+	/* Select cipher algo */
+	switch (calg) {
+	/* Cover supported cipher algorithms */
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		sess->cipher.algo = calg;
+		/* IV len is always 16 bytes (block size) for AES CBC */
+		sess->cipher.iv_len = 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* Select auth generate/verify */
+	sess->auth.operation = auth_xform->auth.op;
+
+	/* Select auth algo */
+	switch (auth_xform->auth.algo) {
+	/* Cover supported hash algorithms */
+	case RTE_CRYPTO_AUTH_SHA1_HMAC:
+	case RTE_CRYPTO_AUTH_SHA256_HMAC: /* Fall through */
+		aalg = auth_xform->auth.algo;
+		sess->auth.mode = ARMV8_CRYPTO_AUTH_AS_HMAC;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Verify supported key lengths and extract proper algorithm */
+	switch (cipher_xform->cipher.key.length << 3) {
+	case 128:
+		sess->crypto_func =
+				CRYPTO_GET_ALGO(order, cop, calg, aalg, 128);
+		sess->cipher.key_sched =
+				CRYPTO_GET_KEY_SCHED(cop, calg, 128);
+		break;
+	case 192:
+	case 256:
+		/* These key lengths are not supported yet */
+	default: /* Fall through */
+		sess->crypto_func = NULL;
+		sess->cipher.key_sched = NULL;
+		return -EINVAL;
+	}
+
+	if (unlikely(sess->crypto_func == NULL)) {
+		/*
+		 * If we got here that means that there must be a bug
+		 * in the algorithms selection above. Nevertheless keep
+		 * it here to catch bug immediately and avoid NULL pointer
+		 * dereference in OPs processing.
+		 */
+		ARMV8_CRYPTO_LOG_ERR(
+			"No appropriate crypto function for given parameters");
+		return -EINVAL;
+	}
+
+	/* Set up cipher session prerequisites */
+	if (cipher_set_prerequisites(sess, cipher_xform) != 0)
+		return -EINVAL;
+
+	/* Set up authentication session prerequisites */
+	if (auth_set_prerequisites(sess, auth_xform) != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/** Parse crypto xform chain and set private session parameters */
+int
+armv8_crypto_set_session_parameters(struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform)
+{
+	const struct rte_crypto_sym_xform *cipher_xform = NULL;
+	const struct rte_crypto_sym_xform *auth_xform = NULL;
+	bool is_chained_op;
+	int ret;
+
+	/* Filter out spurious/broken requests */
+	if (xform == NULL)
+		return -EINVAL;
+
+	sess->chain_order = armv8_crypto_get_chain_order(xform);
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		cipher_xform = xform;
+		auth_xform = xform->next;
+		is_chained_op = true;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		auth_xform = xform;
+		cipher_xform = xform->next;
+		is_chained_op = true;
+		break;
+	default:
+		is_chained_op = false;
+		return -EINVAL;
+	}
+
+	if (is_chained_op) {
+		ret = armv8_crypto_set_session_chained_parameters(sess,
+						cipher_xform, auth_xform);
+		if (unlikely(ret != 0)) {
+			ARMV8_CRYPTO_LOG_ERR(
+			"Invalid/unsupported chained (cipher/auth) parameters");
+			return -EINVAL;
+		}
+	} else {
+		ARMV8_CRYPTO_LOG_ERR("Invalid/unsupported operation");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/** Provide session for operation */
+static inline struct armv8_crypto_session *
+get_session(struct armv8_crypto_qp *qp, struct rte_crypto_op *op)
+{
+	struct armv8_crypto_session *sess = NULL;
+
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_WITH_SESSION) {
+		/* get existing session */
+		if (likely(op->sym->session != NULL &&
+				op->sym->session->dev_type ==
+				RTE_CRYPTODEV_ARMV8_PMD)) {
+			sess = (struct armv8_crypto_session *)
+				op->sym->session->_private;
+		}
+	} else {
+		/* provide internal session */
+		void *_sess = NULL;
+
+		if (!rte_mempool_get(qp->sess_mp, (void **)&_sess)) {
+			sess = (struct armv8_crypto_session *)
+				((struct rte_cryptodev_sym_session *)_sess)
+				->_private;
+
+			if (unlikely(armv8_crypto_set_session_parameters(
+					sess, op->sym->xform) != 0)) {
+				rte_mempool_put(qp->sess_mp, _sess);
+				sess = NULL;
+			} else
+				op->sym->session = _sess;
+		}
+	}
+
+	if (unlikely(sess == NULL))
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_SESSION;
+
+	return sess;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * Process Operations
+ *------------------------------------------------------------------------------
+ */
+
+/*----------------------------------------------------------------------------*/
+
+/** Process cipher operation */
+static inline void
+process_armv8_chained_op
+		(struct rte_crypto_op *op, struct armv8_crypto_session *sess,
+		struct rte_mbuf *mbuf_src, struct rte_mbuf *mbuf_dst)
+{
+	crypto_func_t crypto_func;
+	crypto_arg_t arg;
+	struct rte_mbuf *m_asrc, *m_adst;
+	uint8_t *csrc, *cdst;
+	uint8_t *adst, *asrc;
+	uint64_t clen, alen;
+	int error;
+
+	clen = op->sym->cipher.data.length;
+	alen = op->sym->auth.data.length;
+
+	csrc = rte_pktmbuf_mtod_offset(mbuf_src, uint8_t *,
+			op->sym->cipher.data.offset);
+	cdst = rte_pktmbuf_mtod_offset(mbuf_dst, uint8_t *,
+			op->sym->cipher.data.offset);
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+		m_asrc = m_adst = mbuf_dst;
+		break;
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER:
+		m_asrc = mbuf_src;
+		m_adst = mbuf_dst;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+	asrc = rte_pktmbuf_mtod_offset(m_asrc, uint8_t *,
+				op->sym->auth.data.offset);
+
+	switch (sess->auth.mode) {
+	case ARMV8_CRYPTO_AUTH_AS_AUTH:
+		/* Nothing to do here, just verify correct option */
+		break;
+	case ARMV8_CRYPTO_AUTH_AS_HMAC:
+		arg.digest.hmac.key = sess->auth.hmac.key;
+		arg.digest.hmac.i_key_pad = sess->auth.hmac.i_key_pad;
+		arg.digest.hmac.o_key_pad = sess->auth.hmac.o_key_pad;
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_GENERATE) {
+		adst = op->sym->auth.digest.data;
+		if (adst == NULL) {
+			adst = rte_pktmbuf_mtod_offset(m_adst,
+					uint8_t *,
+					op->sym->auth.data.offset +
+					op->sym->auth.data.length);
+		}
+	} else {
+		adst = (uint8_t *)rte_pktmbuf_append(m_asrc,
+				op->sym->auth.digest.length);
+	}
+
+	if (unlikely(op->sym->cipher.iv.length != sess->cipher.iv_len)) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	arg.cipher.iv = op->sym->cipher.iv.data;
+	arg.cipher.key = sess->cipher.key.data;
+	/* Acquire combined mode function */
+	crypto_func = sess->crypto_func;
+	ARMV8_CRYPTO_ASSERT(crypto_func != NULL);
+	error = crypto_func(csrc, cdst, clen, asrc, adst, alen, &arg);
+	if (error != 0) {
+		op->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+		return;
+	}
+
+	op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+	if (sess->auth.operation == RTE_CRYPTO_AUTH_OP_VERIFY) {
+		if (memcmp(adst, op->sym->auth.digest.data,
+				op->sym->auth.digest.length) != 0) {
+			op->status = RTE_CRYPTO_OP_STATUS_AUTH_FAILED;
+		}
+		/* Trim area used for digest from mbuf. */
+		rte_pktmbuf_trim(m_asrc,
+				op->sym->auth.digest.length);
+	}
+}
+
+/** Process crypto operation for mbuf */
+static inline int
+process_op(const struct armv8_crypto_qp *qp, struct rte_crypto_op *op,
+		struct armv8_crypto_session *sess)
+{
+	struct rte_mbuf *msrc, *mdst;
+
+	msrc = op->sym->m_src;
+	mdst = op->sym->m_dst ? op->sym->m_dst : op->sym->m_src;
+
+	op->status = RTE_CRYPTO_OP_STATUS_NOT_PROCESSED;
+
+	switch (sess->chain_order) {
+	case ARMV8_CRYPTO_CHAIN_CIPHER_AUTH:
+	case ARMV8_CRYPTO_CHAIN_AUTH_CIPHER: /* Fall through */
+		process_armv8_chained_op(op, sess, msrc, mdst);
+		break;
+	default:
+		op->status = RTE_CRYPTO_OP_STATUS_ERROR;
+		break;
+	}
+
+	/* Free session if a session-less crypto op */
+	if (op->sym->sess_type == RTE_CRYPTO_SYM_OP_SESSIONLESS) {
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+		rte_mempool_put(qp->sess_mp, op->sym->session);
+		op->sym->session = NULL;
+	}
+
+	if (op->status == RTE_CRYPTO_OP_STATUS_NOT_PROCESSED)
+		op->status = RTE_CRYPTO_OP_STATUS_SUCCESS;
+
+	if (unlikely(op->status == RTE_CRYPTO_OP_STATUS_ERROR))
+		return -1;
+
+	return 0;
+}
+
+/*
+ *------------------------------------------------------------------------------
+ * PMD Framework
+ *------------------------------------------------------------------------------
+ */
+
+/** Enqueue burst */
+static uint16_t
+armv8_crypto_pmd_enqueue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_session *sess;
+	struct armv8_crypto_qp *qp = queue_pair;
+	int i, retval;
+
+	for (i = 0; i < nb_ops; i++) {
+		sess = get_session(qp, ops[i]);
+		if (unlikely(sess == NULL))
+			goto enqueue_err;
+
+		retval = process_op(qp, ops[i], sess);
+		if (unlikely(retval < 0))
+			goto enqueue_err;
+	}
+
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	qp->stats.enqueued_count += retval;
+
+	return retval;
+
+enqueue_err:
+	retval = rte_ring_enqueue_burst(qp->processed_ops, (void *)ops, i);
+	if (ops[i] != NULL)
+		ops[i]->status = RTE_CRYPTO_OP_STATUS_INVALID_ARGS;
+
+	qp->stats.enqueue_err_count++;
+	return retval;
+}
+
+/** Dequeue burst */
+static uint16_t
+armv8_crypto_pmd_dequeue_burst(void *queue_pair, struct rte_crypto_op **ops,
+		uint16_t nb_ops)
+{
+	struct armv8_crypto_qp *qp = queue_pair;
+
+	unsigned int nb_dequeued = 0;
+
+	nb_dequeued = rte_ring_dequeue_burst(qp->processed_ops,
+			(void **)ops, nb_ops);
+	qp->stats.dequeued_count += nb_dequeued;
+
+	return nb_dequeued;
+}
+
+/** Create ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_create(struct rte_crypto_vdev_init_params *init_params)
+{
+	struct rte_cryptodev *dev;
+	struct armv8_crypto_private *internals;
+	int ret;
+
+	/* Check CPU for support for AES instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_AES)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"AES instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for SHA instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA1) ||
+	    !rte_cpu_get_flag_enabled(RTE_CPUFLAG_SHA2)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"SHA1/SHA2 instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	/* Check CPU for support for Advance SIMD instruction set */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
+		ARMV8_CRYPTO_LOG_ERR(
+			"Advanced SIMD instructions not supported by CPU");
+		return -EFAULT;
+	}
+
+	if (init_params->name[0] == '\0') {
+		ret = rte_cryptodev_pmd_create_dev_name(
+				init_params->name,
+				RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+
+		if (ret < 0) {
+			ARMV8_CRYPTO_LOG_ERR("failed to create unique name");
+			return ret;
+		}
+	}
+
+	dev = rte_cryptodev_pmd_virtual_dev_init(init_params->name,
+				sizeof(struct armv8_crypto_private),
+				init_params->socket_id);
+	if (dev == NULL) {
+		ARMV8_CRYPTO_LOG_ERR("failed to create cryptodev vdev");
+		goto init_error;
+	}
+
+	dev->dev_type = RTE_CRYPTODEV_ARMV8_PMD;
+	dev->dev_ops = rte_armv8_crypto_pmd_ops;
+
+	/* register rx/tx burst functions for data path */
+	dev->dequeue_burst = armv8_crypto_pmd_dequeue_burst;
+	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
+
+	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+
+	/* Set vector instructions mode supported */
+	internals = dev->data->dev_private;
+
+	internals->max_nb_qpairs = init_params->max_nb_queue_pairs;
+	internals->max_nb_sessions = init_params->max_nb_sessions;
+
+	return 0;
+
+init_error:
+	ARMV8_CRYPTO_LOG_ERR(
+		"driver %s: cryptodev_armv8_crypto_create failed",
+		init_params->name);
+
+	cryptodev_armv8_crypto_uninit(init_params->name);
+	return -EFAULT;
+}
+
+/** Initialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_init(const char *name,
+		const char *input_args)
+{
+	struct rte_crypto_vdev_init_params init_params = {
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_QUEUE_PAIRS,
+		RTE_CRYPTODEV_VDEV_DEFAULT_MAX_NB_SESSIONS,
+		rte_socket_id(),
+		{0}
+	};
+
+	rte_cryptodev_parse_vdev_init_params(&init_params, input_args);
+
+	RTE_LOG(INFO, PMD, "Initialising %s on NUMA node %d\n", name,
+			init_params.socket_id);
+	if (init_params.name[0] != '\0') {
+		RTE_LOG(INFO, PMD, "  User defined name = %s\n",
+			init_params.name);
+	}
+	RTE_LOG(INFO, PMD, "  Max number of queue pairs = %d\n",
+			init_params.max_nb_queue_pairs);
+	RTE_LOG(INFO, PMD, "  Max number of sessions = %d\n",
+			init_params.max_nb_sessions);
+
+	return cryptodev_armv8_crypto_create(&init_params);
+}
+
+/** Uninitialise ARMv8 crypto device */
+static int
+cryptodev_armv8_crypto_uninit(const char *name)
+{
+	if (name == NULL)
+		return -EINVAL;
+
+	RTE_LOG(INFO, PMD,
+		"Closing ARMv8 crypto device %s on numa socket %u\n",
+		name, rte_socket_id());
+
+	return 0;
+}
+
+static struct rte_vdev_driver armv8_crypto_drv = {
+	.probe = cryptodev_armv8_crypto_init,
+	.remove = cryptodev_armv8_crypto_uninit
+};
+
+RTE_PMD_REGISTER_VDEV(CRYPTODEV_NAME_ARMV8_PMD, armv8_crypto_drv);
+RTE_PMD_REGISTER_ALIAS(CRYPTODEV_NAME_ARMV8_PMD, cryptodev_armv8_pmd);
+RTE_PMD_REGISTER_PARAM_STRING(CRYPTODEV_NAME_ARMV8_PMD,
+	"max_nb_queue_pairs=<int> "
+	"max_nb_sessions=<int> "
+	"socket_id=<int>");
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_ops.c b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
new file mode 100644
index 0000000..2bf6475
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_ops.c
@@ -0,0 +1,369 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cryptodev_pmd.h>
+
+#include "armv8_crypto_defs.h"
+
+#include "rte_armv8_pmd_private.h"
+
+static const struct rte_cryptodev_capabilities
+	armv8_crypto_pmd_capabilities[] = {
+	{	/* SHA1 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 20,
+						.max = 20,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* SHA256 HMAC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
+				{.auth = {
+					.algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+					.block_size = 64,
+					.key_size = {
+						.min = 16,
+						.max = 128,
+						.increment = 0
+					},
+					.digest_size = {
+						.min = 32,
+						.max = 32,
+						.increment = 0
+					},
+					.aad_size = { 0 }
+				}, }
+			}, }
+	},
+	{	/* AES CBC */
+		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
+			{.sym = {
+				.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
+				{.cipher = {
+					.algo = RTE_CRYPTO_CIPHER_AES_CBC,
+					.block_size = 16,
+					.key_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					},
+					.iv_size = {
+						.min = 16,
+						.max = 16,
+						.increment = 0
+					}
+				}, }
+			}, }
+	},
+
+	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
+};
+
+
+/** Configure device */
+static int
+armv8_crypto_pmd_config(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Start device */
+static int
+armv8_crypto_pmd_start(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+/** Stop device */
+static void
+armv8_crypto_pmd_stop(__rte_unused struct rte_cryptodev *dev)
+{
+}
+
+/** Close device */
+static int
+armv8_crypto_pmd_close(__rte_unused struct rte_cryptodev *dev)
+{
+	return 0;
+}
+
+
+/** Get device statistics */
+static void
+armv8_crypto_pmd_stats_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_stats *stats)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		stats->enqueued_count += qp->stats.enqueued_count;
+		stats->dequeued_count += qp->stats.dequeued_count;
+
+		stats->enqueue_err_count += qp->stats.enqueue_err_count;
+		stats->dequeue_err_count += qp->stats.dequeue_err_count;
+	}
+}
+
+/** Reset device statistics */
+static void
+armv8_crypto_pmd_stats_reset(struct rte_cryptodev *dev)
+{
+	int qp_id;
+
+	for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+		struct armv8_crypto_qp *qp = dev->data->queue_pairs[qp_id];
+
+		memset(&qp->stats, 0, sizeof(qp->stats));
+	}
+}
+
+
+/** Get device info */
+static void
+armv8_crypto_pmd_info_get(struct rte_cryptodev *dev,
+		struct rte_cryptodev_info *dev_info)
+{
+	struct armv8_crypto_private *internals = dev->data->dev_private;
+
+	if (dev_info != NULL) {
+		dev_info->dev_type = dev->dev_type;
+		dev_info->feature_flags = dev->feature_flags;
+		dev_info->capabilities = armv8_crypto_pmd_capabilities;
+		dev_info->max_nb_queue_pairs = internals->max_nb_qpairs;
+		dev_info->sym.max_nb_sessions = internals->max_nb_sessions;
+	}
+}
+
+/** Release queue pair */
+static int
+armv8_crypto_pmd_qp_release(struct rte_cryptodev *dev, uint16_t qp_id)
+{
+
+	if (dev->data->queue_pairs[qp_id] != NULL) {
+		rte_free(dev->data->queue_pairs[qp_id]);
+		dev->data->queue_pairs[qp_id] = NULL;
+	}
+
+	return 0;
+}
+
+/** set a unique name for the queue pair based on it's name, dev_id and qp_id */
+static int
+armv8_crypto_pmd_qp_set_unique_name(struct rte_cryptodev *dev,
+		struct armv8_crypto_qp *qp)
+{
+	unsigned int n;
+
+	n = snprintf(qp->name, sizeof(qp->name), "armv8_crypto_pmd_%u_qp_%u",
+			dev->data->dev_id, qp->id);
+
+	if (n > sizeof(qp->name))
+		return -1;
+
+	return 0;
+}
+
+
+/** Create a ring to place processed operations on */
+static struct rte_ring *
+armv8_crypto_pmd_qp_create_processed_ops_ring(struct armv8_crypto_qp *qp,
+		unsigned int ring_size, int socket_id)
+{
+	struct rte_ring *r;
+
+	r = rte_ring_lookup(qp->name);
+	if (r) {
+		if (r->prod.size >= ring_size) {
+			ARMV8_CRYPTO_LOG_INFO(
+				"Reusing existing ring %s for processed ops",
+				 qp->name);
+			return r;
+		}
+
+		ARMV8_CRYPTO_LOG_ERR(
+			"Unable to reuse existing ring %s for processed ops",
+			 qp->name);
+		return NULL;
+	}
+
+	return rte_ring_create(qp->name, ring_size, socket_id,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+}
+
+
+/** Setup a queue pair */
+static int
+armv8_crypto_pmd_qp_setup(struct rte_cryptodev *dev, uint16_t qp_id,
+		const struct rte_cryptodev_qp_conf *qp_conf,
+		 int socket_id)
+{
+	struct armv8_crypto_qp *qp = NULL;
+
+	/* Free memory prior to re-allocation if needed. */
+	if (dev->data->queue_pairs[qp_id] != NULL)
+		armv8_crypto_pmd_qp_release(dev, qp_id);
+
+	/* Allocate the queue pair data structure. */
+	qp = rte_zmalloc_socket("ARMv8 PMD Queue Pair", sizeof(*qp),
+					RTE_CACHE_LINE_SIZE, socket_id);
+	if (qp == NULL)
+		return -ENOMEM;
+
+	qp->id = qp_id;
+	dev->data->queue_pairs[qp_id] = qp;
+
+	if (armv8_crypto_pmd_qp_set_unique_name(dev, qp) != 0)
+		goto qp_setup_cleanup;
+
+	qp->processed_ops = armv8_crypto_pmd_qp_create_processed_ops_ring(qp,
+			qp_conf->nb_descriptors, socket_id);
+	if (qp->processed_ops == NULL)
+		goto qp_setup_cleanup;
+
+	qp->sess_mp = dev->data->session_pool;
+
+	memset(&qp->stats, 0, sizeof(qp->stats));
+
+	return 0;
+
+qp_setup_cleanup:
+	if (qp)
+		rte_free(qp);
+
+	return -1;
+}
+
+/** Start queue pair */
+static int
+armv8_crypto_pmd_qp_start(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Stop queue pair */
+static int
+armv8_crypto_pmd_qp_stop(__rte_unused struct rte_cryptodev *dev,
+		__rte_unused uint16_t queue_pair_id)
+{
+	return -ENOTSUP;
+}
+
+/** Return the number of allocated queue pairs */
+static uint32_t
+armv8_crypto_pmd_qp_count(struct rte_cryptodev *dev)
+{
+	return dev->data->nb_queue_pairs;
+}
+
+/** Returns the size of the session structure */
+static unsigned
+armv8_crypto_pmd_session_get_size(struct rte_cryptodev *dev __rte_unused)
+{
+	return sizeof(struct armv8_crypto_session);
+}
+
+/** Configure the session from a crypto xform chain */
+static void *
+armv8_crypto_pmd_session_configure(struct rte_cryptodev *dev __rte_unused,
+		struct rte_crypto_sym_xform *xform, void *sess)
+{
+	if (unlikely(sess == NULL)) {
+		ARMV8_CRYPTO_LOG_ERR("invalid session struct");
+		return NULL;
+	}
+
+	if (armv8_crypto_set_session_parameters(
+			sess, xform) != 0) {
+		ARMV8_CRYPTO_LOG_ERR("failed configure session parameters");
+		return NULL;
+	}
+
+	return sess;
+}
+
+/** Clear the memory of session so it doesn't leave key material behind */
+static void
+armv8_crypto_pmd_session_clear(struct rte_cryptodev *dev __rte_unused,
+				void *sess)
+{
+
+	/* Zero out the whole structure */
+	if (sess)
+		memset(sess, 0, sizeof(struct armv8_crypto_session));
+}
+
+struct rte_cryptodev_ops armv8_crypto_pmd_ops = {
+		.dev_configure		= armv8_crypto_pmd_config,
+		.dev_start		= armv8_crypto_pmd_start,
+		.dev_stop		= armv8_crypto_pmd_stop,
+		.dev_close		= armv8_crypto_pmd_close,
+
+		.stats_get		= armv8_crypto_pmd_stats_get,
+		.stats_reset		= armv8_crypto_pmd_stats_reset,
+
+		.dev_infos_get		= armv8_crypto_pmd_info_get,
+
+		.queue_pair_setup	= armv8_crypto_pmd_qp_setup,
+		.queue_pair_release	= armv8_crypto_pmd_qp_release,
+		.queue_pair_start	= armv8_crypto_pmd_qp_start,
+		.queue_pair_stop	= armv8_crypto_pmd_qp_stop,
+		.queue_pair_count	= armv8_crypto_pmd_qp_count,
+
+		.session_get_size	= armv8_crypto_pmd_session_get_size,
+		.session_configure	= armv8_crypto_pmd_session_configure,
+		.session_clear		= armv8_crypto_pmd_session_clear
+};
+
+struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops = &armv8_crypto_pmd_ops;
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_private.h b/drivers/crypto/armv8/rte_armv8_pmd_private.h
new file mode 100644
index 0000000..b75107f
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_private.h
@@ -0,0 +1,211 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) Cavium networks Ltd. 2017.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Cavium networks nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_ARMV8_PMD_PRIVATE_H_
+#define _RTE_ARMV8_PMD_PRIVATE_H_
+
+#define ARMV8_CRYPTO_LOG_ERR(fmt, args...) \
+	RTE_LOG(ERR, CRYPTODEV, "[%s] %s() line %u: " fmt "\n",  \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#ifdef RTE_LIBRTE_ARMV8_CRYPTO_DEBUG
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...) \
+	RTE_LOG(INFO, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...) \
+	RTE_LOG(DEBUG, CRYPTODEV, "[%s] %s() line %u: " fmt "\n", \
+			RTE_STR(CRYPTODEV_NAME_ARMV8_CRYPTO_PMD), \
+			__func__, __LINE__, ## args)
+
+#define ARMV8_CRYPTO_ASSERT(con)				\
+do {								\
+	if (!(con)) {						\
+		rte_panic("%s(): "				\
+		    con "condition failed, line %u", __func__);	\
+	}							\
+} while (0)
+
+#else
+#define ARMV8_CRYPTO_LOG_INFO(fmt, args...)
+#define ARMV8_CRYPTO_LOG_DBG(fmt, args...)
+#define ARMV8_CRYPTO_ASSERT(con)
+#endif
+
+#define NBBY		8		/* Number of bits in a byte */
+#define BYTE_LENGTH(x)	((x) / NBBY)	/* Number of bytes in x (round down) */
+
+/** ARMv8 operation order mode enumerator */
+enum armv8_crypto_chain_order {
+	ARMV8_CRYPTO_CHAIN_CIPHER_AUTH,
+	ARMV8_CRYPTO_CHAIN_AUTH_CIPHER,
+	ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CHAIN_LIST_END = ARMV8_CRYPTO_CHAIN_NOT_SUPPORTED
+};
+
+/** ARMv8 cipher operation enumerator */
+enum armv8_crypto_cipher_operation {
+	ARMV8_CRYPTO_CIPHER_OP_ENCRYPT = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_DECRYPT = RTE_CRYPTO_CIPHER_OP_DECRYPT,
+	ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_OP_LIST_END = ARMV8_CRYPTO_CIPHER_OP_NOT_SUPPORTED
+};
+
+enum armv8_crypto_cipher_keylen {
+	ARMV8_CRYPTO_CIPHER_KEYLEN_128,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_192,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_256,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED,
+	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END =
+		ARMV8_CRYPTO_CIPHER_KEYLEN_NOT_SUPPORTED
+};
+
+/** ARMv8 auth mode enumerator */
+enum armv8_crypto_auth_mode {
+	ARMV8_CRYPTO_AUTH_AS_AUTH,
+	ARMV8_CRYPTO_AUTH_AS_HMAC,
+	ARMV8_CRYPTO_AUTH_AS_CIPHER,
+	ARMV8_CRYPTO_AUTH_NOT_SUPPORTED,
+	ARMV8_CRYPTO_AUTH_LIST_END = ARMV8_CRYPTO_AUTH_NOT_SUPPORTED
+};
+
+#define CRYPTO_ORDER_MAX		ARMV8_CRYPTO_CHAIN_LIST_END
+#define CRYPTO_CIPHER_OP_MAX		ARMV8_CRYPTO_CIPHER_OP_LIST_END
+#define CRYPTO_CIPHER_KEYLEN_MAX	ARMV8_CRYPTO_CIPHER_KEYLEN_LIST_END
+#define CRYPTO_CIPHER_MAX		RTE_CRYPTO_CIPHER_LIST_END
+#define CRYPTO_AUTH_MAX			RTE_CRYPTO_AUTH_LIST_END
+
+#define HMAC_IPAD_VALUE			(0x36)
+#define HMAC_OPAD_VALUE			(0x5C)
+
+#define SHA256_AUTH_KEY_LENGTH		(BYTE_LENGTH(256))
+#define SHA256_BLOCK_SIZE		(BYTE_LENGTH(512))
+
+#define SHA1_AUTH_KEY_LENGTH		(BYTE_LENGTH(160))
+#define SHA1_BLOCK_SIZE			(BYTE_LENGTH(512))
+
+#define SHA_AUTH_KEY_MAX		SHA256_AUTH_KEY_LENGTH
+#define SHA_BLOCK_MAX			SHA256_BLOCK_SIZE
+
+typedef int (*crypto_func_t)(uint8_t *, uint8_t *, uint64_t,
+				uint8_t *, uint8_t *, uint64_t,
+				crypto_arg_t *);
+
+typedef void (*crypto_key_sched_t)(uint8_t *, const uint8_t *);
+
+/** private data structure for each ARMv8 crypto device */
+struct armv8_crypto_private {
+	unsigned int max_nb_qpairs;
+	/**< Max number of queue pairs */
+	unsigned int max_nb_sessions;
+	/**< Max number of sessions */
+};
+
+/** ARMv8 crypto queue pair */
+struct armv8_crypto_qp {
+	uint16_t id;
+	/**< Queue Pair Identifier */
+	struct rte_ring *processed_ops;
+	/**< Ring for placing process packets */
+	struct rte_mempool *sess_mp;
+	/**< Session Mempool */
+	struct rte_cryptodev_stats stats;
+	/**< Queue pair statistics */
+	char name[RTE_CRYPTODEV_NAME_LEN];
+	/**< Unique Queue Pair Name */
+} __rte_cache_aligned;
+
+/** ARMv8 crypto private session structure */
+struct armv8_crypto_session {
+	enum armv8_crypto_chain_order chain_order;
+	/**< chain order mode */
+	crypto_func_t crypto_func;
+	/**< cryptographic function to use for this session */
+
+	/** Cipher Parameters */
+	struct {
+		enum rte_crypto_cipher_operation direction;
+		/**< cipher operation direction */
+		enum rte_crypto_cipher_algorithm algo;
+		/**< cipher algorithm */
+		int iv_len;
+		/**< IV length */
+
+		struct {
+			uint8_t data[256];
+			/**< key data */
+			size_t length;
+			/**< key length in bytes */
+		} key;
+
+		crypto_key_sched_t key_sched;
+		/**< Key schedule function */
+	} cipher;
+
+	/** Authentication Parameters */
+	struct {
+		enum rte_crypto_auth_operation operation;
+		/**< auth operation generate or verify */
+		enum armv8_crypto_auth_mode mode;
+		/**< auth operation mode */
+
+		union {
+			struct {
+				/* Add data if needed */
+			} auth;
+
+			struct {
+				uint8_t i_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< inner pad (max supported block length) */
+				uint8_t o_key_pad[SHA_BLOCK_MAX]
+							__rte_cache_aligned;
+				/**< outer pad (max supported block length) */
+				uint8_t key[SHA_AUTH_KEY_MAX];
+				/**< HMAC key (max supported length)*/
+			} hmac;
+		};
+	} auth;
+
+} __rte_cache_aligned;
+
+/** Set and validate ARMv8 crypto session parameters */
+extern int armv8_crypto_set_session_parameters(
+		struct armv8_crypto_session *sess,
+		const struct rte_crypto_sym_xform *xform);
+
+/** device specific operations function pointer structure */
+extern struct rte_cryptodev_ops *rte_armv8_crypto_pmd_ops;
+
+#endif /* _RTE_ARMV8_PMD_PRIVATE_H_ */
diff --git a/drivers/crypto/armv8/rte_armv8_pmd_version.map b/drivers/crypto/armv8/rte_armv8_pmd_version.map
new file mode 100644
index 0000000..1f84b68
--- /dev/null
+++ b/drivers/crypto/armv8/rte_armv8_pmd_version.map
@@ -0,0 +1,3 @@
+DPDK_17.02 {
+	local: *;
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 3/8] mk: add PMD to the build system
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 1/8] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 2/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 4/8] cryptodev/armv8: introduce ARM-specific feature flags zbigniew.bodek
                                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Build ARMv8 crypto PMD if compiling for ARM64
and CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option
is enable in the configuration file.
ARMV8_CRYPTO_LIB_PATH environment variable will
point to the appropriate library directory.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 drivers/crypto/Makefile | 1 +
 mk/rte.app.mk           | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index 745c614..77b02cf 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -33,6 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_GCM) += aesni_gcm
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB) += aesni_mb
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO) += armv8
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_OPENSSL) += openssl
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_QAT) += qat
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_SNOW3G) += snow3g
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 9f4d057..b607014 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -146,6 +146,8 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -lrte_pmd_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KASUMI)      += -L$(LIBSSO_KASUMI_PATH)/build -lsso_kasumi
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -lrte_pmd_zuc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ZUC)         += -L$(LIBSSO_ZUC_PATH)/build -lsso_zuc
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -lrte_pmd_armv8
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO)    += -L$(ARMV8_CRYPTO_LIB_PATH) -larmv8_crypto
 endif # CONFIG_RTE_LIBRTE_CRYPTODEV
 
 endif # !CONFIG_RTE_BUILD_SHARED_LIBS
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 4/8] cryptodev/armv8: introduce ARM-specific feature flags
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (2 preceding siblings ...)
  2017-01-18 20:01                   ` [PATCH v6 3/8] mk: add PMD to the build system zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 5/8] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
                                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add two new feature flags:
* RTE_CRYPTODEV_FF_CPU_NEON
  represents ARM NEON (TM) instructions
* RTE_CRYPTODEV_FF_CPU_ARM_CE
  represents ARM crypto extensions

Add them to both cryptodev library, documentation and relevant
PMD driver for ARMv8.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
---
 doc/guides/cryptodevs/overview.rst   | 2 ++
 drivers/crypto/armv8/rte_armv8_pmd.c | 4 +++-
 lib/librte_cryptodev/rte_cryptodev.c | 4 ++++
 lib/librte_cryptodev/rte_cryptodev.h | 5 +++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/doc/guides/cryptodevs/overview.rst b/doc/guides/cryptodevs/overview.rst
index bd5f0ad..9ec32f1 100644
--- a/doc/guides/cryptodevs/overview.rst
+++ b/doc/guides/cryptodevs/overview.rst
@@ -45,6 +45,8 @@ Supported Feature Flags
    "RTE_CRYPTODEV_FF_CPU_AVX512",,,x,,,,
    "RTE_CRYPTODEV_FF_CPU_AESNI",,,x,x,,,
    "RTE_CRYPTODEV_FF_HW_ACCELERATED",x,,,,,,
+   "RTE_CRYPTODEV_FF_CPU_NEON",,,,,,,
+   "RTE_CRYPTODEV_FF_CPU_ARM_CE",,,,,,,
 
 Supported Cipher Algorithms
 
diff --git a/drivers/crypto/armv8/rte_armv8_pmd.c b/drivers/crypto/armv8/rte_armv8_pmd.c
index 1bf0f9d..d2b88a3 100644
--- a/drivers/crypto/armv8/rte_armv8_pmd.c
+++ b/drivers/crypto/armv8/rte_armv8_pmd.c
@@ -826,7 +826,9 @@
 	dev->enqueue_burst = armv8_crypto_pmd_enqueue_burst;
 
 	dev->feature_flags = RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO |
-			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING;
+			RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING |
+			RTE_CRYPTODEV_FF_CPU_NEON |
+			RTE_CRYPTODEV_FF_CPU_ARM_CE;
 
 	/* Set vector instructions mode supported */
 	internals = dev->data->dev_private;
diff --git a/lib/librte_cryptodev/rte_cryptodev.c b/lib/librte_cryptodev/rte_cryptodev.c
index f2ceb9b..6a51eec 100644
--- a/lib/librte_cryptodev/rte_cryptodev.c
+++ b/lib/librte_cryptodev/rte_cryptodev.c
@@ -240,6 +240,10 @@ struct rte_cryptodev_callback {
 		return "HW_ACCELERATED";
 	case RTE_CRYPTODEV_FF_MBUF_SCATTER_GATHER:
 		return "MBUF_SCATTER_GATHER";
+	case RTE_CRYPTODEV_FF_CPU_NEON:
+		return "CPU_NEON";
+	case RTE_CRYPTODEV_FF_CPU_ARM_CE:
+		return "CPU_ARM_CE";
 	default:
 		return NULL;
 	}
diff --git a/lib/librte_cryptodev/rte_cryptodev.h b/lib/librte_cryptodev/rte_cryptodev.h
index 452b174..f284668 100644
--- a/lib/librte_cryptodev/rte_cryptodev.h
+++ b/lib/librte_cryptodev/rte_cryptodev.h
@@ -232,6 +232,11 @@ struct rte_cryptodev_capabilities {
 /**< Utilises CPU SIMD AVX512 instructions */
 #define	RTE_CRYPTODEV_FF_MBUF_SCATTER_GATHER	(1ULL << 9)
 /**< Scatter-gather mbufs are supported */
+#define	RTE_CRYPTODEV_FF_CPU_NEON		(1ULL << 10)
+/**< Utilises CPU NEON instructions */
+#define	RTE_CRYPTODEV_FF_CPU_ARM_CE		(1ULL << 11)
+/**< Utilises ARM CPU Cryptographic Extensions */
+
 
 
 /**
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 5/8] doc: update documentation about ARMv8 crypto PMD
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (3 preceding siblings ...)
  2017-01-18 20:01                   ` [PATCH v6 4/8] cryptodev/armv8: introduce ARM-specific feature flags zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:01                   ` [PATCH v6 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
                                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add documentation about the driver and update
release notes.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/cryptodevs/armv8.rst        | 98 ++++++++++++++++++++++++++++++++++
 doc/guides/cryptodevs/index.rst        |  1 +
 doc/guides/cryptodevs/overview.rst     | 94 ++++++++++++++++----------------
 doc/guides/rel_notes/release_17_02.rst |  5 ++
 4 files changed, 151 insertions(+), 47 deletions(-)
 create mode 100644 doc/guides/cryptodevs/armv8.rst

diff --git a/doc/guides/cryptodevs/armv8.rst b/doc/guides/cryptodevs/armv8.rst
new file mode 100644
index 0000000..de63793
--- /dev/null
+++ b/doc/guides/cryptodevs/armv8.rst
@@ -0,0 +1,98 @@
+..  BSD LICENSE
+    Copyright (C) Cavium networks Ltd. 2017.
+
+    Redistribution and use in source and binary forms, with or without
+    modification, are permitted provided that the following conditions
+    are met:
+
+      * Redistributions of source code must retain the above copyright
+        notice, this list of conditions and the following disclaimer.
+      * Redistributions in binary form must reproduce the above copyright
+        notice, this list of conditions and the following disclaimer in
+        the documentation and/or other materials provided with the
+        distribution.
+      * Neither the name of Cavium networks nor the names of its
+        contributors may be used to endorse or promote products derived
+        from this software without specific prior written permission.
+
+    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+ARMv8 Crypto Poll Mode Driver
+=============================
+
+This code provides the initial implementation of the ARMv8 crypto PMD.
+The driver uses ARMv8 cryptographic extensions to process chained crypto
+operations in an optimized way. The core functionality is provided by
+a low-level library, written in the assembly code.
+
+Features
+--------
+
+ARMv8 Crypto PMD has support for the following algorithm pairs:
+
+Supported cipher algorithms:
+
+* ``RTE_CRYPTO_CIPHER_AES_CBC``
+
+Supported authentication algorithms:
+
+* ``RTE_CRYPTO_AUTH_SHA1_HMAC``
+* ``RTE_CRYPTO_AUTH_SHA256_HMAC``
+
+Installation
+------------
+
+In order to enable this virtual crypto PMD, user must:
+
+* Download ARMv8 crypto library source code from
+  `here <https://github.com/caviumnetworks/armv8_crypto>`_
+
+* Export the environmental variable ARMV8_CRYPTO_LIB_PATH with
+  the path where the ``armv8_crypto`` library was downloaded
+  or cloned.
+
+* Build the library by invoking:
+
+.. code-block:: console
+
+	make -C $ARMV8_CRYPTO_LIB_PATH/
+
+* Set CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=y in
+  config/defconfig_arm64-armv8a-linuxapp-gcc
+
+The corresponding device can be created only if the following features
+are supported by the CPU:
+
+* ``RTE_CPUFLAG_AES``
+* ``RTE_CPUFLAG_SHA1``
+* ``RTE_CPUFLAG_SHA2``
+* ``RTE_CPUFLAG_NEON``
+
+Initialization
+--------------
+
+User can use app/test application to check how to use this PMD and to verify
+crypto processing.
+
+Test name is cryptodev_sw_armv8_autotest.
+For performance test cryptodev_sw_armv8_perftest can be used.
+
+Limitations
+-----------
+
+* Maximum number of sessions is 2048.
+* Only chained operations are supported.
+* AES-128-CBC is the only supported cipher variant.
+* Cipher input data has to be a multiple of 16 bytes.
+* Digest input data has to be a multiple of 8 bytes.
diff --git a/doc/guides/cryptodevs/index.rst b/doc/guides/cryptodevs/index.rst
index a6a9f23..06c3f6e 100644
--- a/doc/guides/cryptodevs/index.rst
+++ b/doc/guides/cryptodevs/index.rst
@@ -38,6 +38,7 @@ Crypto Device Drivers
     overview
     aesni_mb
     aesni_gcm
+    armv8
     kasumi
     openssl
     null
diff --git a/doc/guides/cryptodevs/overview.rst b/doc/guides/cryptodevs/overview.rst
index 9ec32f1..4bbfadb 100644
--- a/doc/guides/cryptodevs/overview.rst
+++ b/doc/guides/cryptodevs/overview.rst
@@ -33,70 +33,70 @@ Crypto Device Supported Functionality Matrices
 Supported Feature Flags
 
 .. csv-table::
-   :header: "Feature Flags", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc"
+   :header: "Feature Flags", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc", "armv8"
    :stub-columns: 1
 
-   "RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO",x,x,x,x,x,x,x
-   "RTE_CRYPTODEV_FF_ASYMMETRIC_CRYPTO",,,,,,,
-   "RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING",x,x,x,x,x,x,x
-   "RTE_CRYPTODEV_FF_CPU_SSE",,,x,,x,x,
-   "RTE_CRYPTODEV_FF_CPU_AVX",,,x,,x,x,
-   "RTE_CRYPTODEV_FF_CPU_AVX2",,,x,,,,
-   "RTE_CRYPTODEV_FF_CPU_AVX512",,,x,,,,
-   "RTE_CRYPTODEV_FF_CPU_AESNI",,,x,x,,,
-   "RTE_CRYPTODEV_FF_HW_ACCELERATED",x,,,,,,
-   "RTE_CRYPTODEV_FF_CPU_NEON",,,,,,,
-   "RTE_CRYPTODEV_FF_CPU_ARM_CE",,,,,,,
+   "RTE_CRYPTODEV_FF_SYMMETRIC_CRYPTO",x,x,x,x,x,x,x,x
+   "RTE_CRYPTODEV_FF_ASYMMETRIC_CRYPTO",,,,,,,,
+   "RTE_CRYPTODEV_FF_SYM_OPERATION_CHAINING",x,x,x,x,x,x,x,x
+   "RTE_CRYPTODEV_FF_CPU_SSE",,,x,,x,x,,
+   "RTE_CRYPTODEV_FF_CPU_AVX",,,x,,x,x,,
+   "RTE_CRYPTODEV_FF_CPU_AVX2",,,x,,,,,
+   "RTE_CRYPTODEV_FF_CPU_AVX512",,,x,,,,,
+   "RTE_CRYPTODEV_FF_CPU_AESNI",,,x,x,,,,
+   "RTE_CRYPTODEV_FF_HW_ACCELERATED",x,,,,,,,
+   "RTE_CRYPTODEV_FF_CPU_NEON",,,,,,,,x
+   "RTE_CRYPTODEV_FF_CPU_ARM_CE",,,,,,,,x
 
 Supported Cipher Algorithms
 
 .. csv-table::
-   :header: "Cipher Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc"
+   :header: "Cipher Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc", "armv8"
    :stub-columns: 1
 
-   "NULL",,x,,,,,
-   "AES_CBC_128",x,,x,,,,
-   "AES_CBC_192",x,,x,,,,
-   "AES_CBC_256",x,,x,,,,
-   "AES_CTR_128",x,,x,,,,
-   "AES_CTR_192",x,,x,,,,
-   "AES_CTR_256",x,,x,,,,
-   "DES_CBC",x,,,,,,
-   "SNOW3G_UEA2",x,,,,x,,
-   "KASUMI_F8",,,,,,x,
-   "ZUC_EEA3",,,,,,,x
+   "NULL",,x,,,,,,
+   "AES_CBC_128",x,,x,,,,,x
+   "AES_CBC_192",x,,x,,,,,
+   "AES_CBC_256",x,,x,,,,,
+   "AES_CTR_128",x,,x,,,,,
+   "AES_CTR_192",x,,x,,,,,
+   "AES_CTR_256",x,,x,,,,,
+   "DES_CBC",x,,,,,,,
+   "SNOW3G_UEA2",x,,,,x,,,
+   "KASUMI_F8",,,,,,x,,
+   "ZUC_EEA3",,,,,,,x,
 
 Supported Authentication Algorithms
 
 .. csv-table::
-   :header: "Cipher Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc"
+   :header: "Cipher Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc", "armv8"
    :stub-columns: 1
 
-   "NONE",,x,,,,,
-   "MD5",,,,,,,
-   "MD5_HMAC",,,x,,,,
-   "SHA1",,,,,,,
-   "SHA1_HMAC",x,,x,,,,
-   "SHA224",,,,,,,
-   "SHA224_HMAC",,,x,,,,
-   "SHA256",,,,,,,
-   "SHA256_HMAC",x,,x,,,,
-   "SHA384",,,,,,,
-   "SHA384_HMAC",,,x,,,,
-   "SHA512",,,,,,,
-   "SHA512_HMAC",x,,x,,,,
-   "AES_XCBC",x,,x,,,,
-   "AES_GMAC",,,,x,,,
-   "SNOW3G_UIA2",x,,,,x,,
-   "KASUMI_F9",,,,,,x,
-   "ZUC_EIA3",,,,,,,x
+   "NONE",,x,,,,,,
+   "MD5",,,,,,,,
+   "MD5_HMAC",,,x,,,,,
+   "SHA1",,,,,,,,
+   "SHA1_HMAC",x,,x,,,,,x
+   "SHA224",,,,,,,,
+   "SHA224_HMAC",,,x,,,,,
+   "SHA256",,,,,,,,
+   "SHA256_HMAC",x,,x,,,,,x
+   "SHA384",,,,,,,,
+   "SHA384_HMAC",,,x,,,,,
+   "SHA512",,,,,,,,
+   "SHA512_HMAC",x,,x,,,,,
+   "AES_XCBC",x,,x,,,,,
+   "AES_GMAC",,,,x,,,,
+   "SNOW3G_UIA2",x,,,,x,,,
+   "KASUMI_F9",,,,,,x,,
+   "ZUC_EIA3",,,,,,,x,
 
 Supported AEAD Algorithms
 
 .. csv-table::
-   :header: "AEAD Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc"
+   :header: "AEAD Algorithms", "qat", "null", "aesni_mb", "aesni_gcm", "snow3g", "kasumi", "zuc", "armv8"
    :stub-columns: 1
 
-   "AES_GCM_128",x,,,x,,,
-   "AES_GCM_192",x,,,,,,
-   "AES_GCM_256",x,,,x,,,
+   "AES_GCM_128",x,,,x,,,,
+   "AES_GCM_192",x,,,,,,,
+   "AES_GCM_256",x,,,x,,,,
diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 670aaf4..0b96bed 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -175,6 +175,11 @@ New Features
   * Scatter-gatter support for chained mbufs (only out-of place and destination
     mbuf must be contiguous)
 
+* **Added armv8 crypto PMD.**
+
+  A new crypto PMD has been added, which provides combined mode cryptografic
+  operations optimized for ARMv8 processors. The driver can be used to enhance
+  performance in processing chained operations such as cipher + HMAC.
 
 Resolved Issues
 ---------------
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 6/8] crypto/armv8: enable ARMv8 PMD in the configuration
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (4 preceding siblings ...)
  2017-01-18 20:01                   ` [PATCH v6 5/8] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
@ 2017-01-18 20:01                   ` zbigniew.bodek
  2017-01-18 20:02                   ` [PATCH v6 7/8] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
                                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:01 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Add CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO option to
the common configuration file. Don't enable it by
default for ARM64 as it requires external library
to build.

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 config/common_base | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/config/common_base b/config/common_base
index f2e030c..3afa8cb 100644
--- a/config/common_base
+++ b/config/common_base
@@ -428,6 +428,12 @@ CONFIG_RTE_LIBRTE_PMD_ZUC=n
 CONFIG_RTE_LIBRTE_PMD_ZUC_DEBUG=n
 
 #
+# Compile PMD for ARMv8 Crypto device
+#
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO=n
+CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO_DEBUG=n
+
+#
 # Compile PMD for NULL Crypto device
 #
 CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 7/8] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (5 preceding siblings ...)
  2017-01-18 20:01                   ` [PATCH v6 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
@ 2017-01-18 20:02                   ` zbigniew.bodek
  2017-01-18 20:02                   ` [PATCH v6 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
  2017-01-18 21:14                   ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:02 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8d0fe40..0a1c889 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -458,6 +458,12 @@ M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/openssl/
 F: doc/guides/cryptodevs/openssl.rst
 
+ARMv8 Crypto PMD
+M: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
+M: Jerin Jacob <jerin.jacob@caviumnetworks.com>
+F: drivers/crypto/armv8/
+F: doc/guides/cryptodevs/armv8.rst
+
 Null Crypto PMD
 M: Declan Doherty <declan.doherty@intel.com>
 F: drivers/crypto/null/
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 8/8] app/test: add ARMv8 crypto tests and test vectors
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (6 preceding siblings ...)
  2017-01-18 20:02                   ` [PATCH v6 7/8] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
@ 2017-01-18 20:02                   ` zbigniew.bodek
  2017-01-18 21:14                   ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
  8 siblings, 0 replies; 100+ messages in thread
From: zbigniew.bodek @ 2017-01-18 20:02 UTC (permalink / raw)
  To: dev
  Cc: pablo.de.lara.guarch, declan.doherty, jerin.jacob, jianbo.liu,
	hemant.agrawal, Zbigniew Bodek

From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Introduce unit tests for ARMv8 crypto PMD.
Add test vectors for short cases such as 160 bytes.
These test cases are ARMv8 specific since the code provides
different processing paths for different input data sizes.

User can validate correctness of algorithms' implementation using:
* cryptodev_sw_armv8_autotest
For performance test one can use:
* cryptodev_sw_armv8_perftest

Signed-off-by: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
Reviewed-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 app/test/test_cryptodev.c                  |  64 ++++
 app/test/test_cryptodev_aes_test_vectors.h | 145 ++++++++-
 app/test/test_cryptodev_blockcipher.c      |   4 +
 app/test/test_cryptodev_blockcipher.h      |   1 +
 app/test/test_cryptodev_perf.c             | 486 +++++++++++++++++++++++++++++
 5 files changed, 691 insertions(+), 9 deletions(-)

diff --git a/app/test/test_cryptodev.c b/app/test/test_cryptodev.c
index e8d1eae..0f0cf4d 100644
--- a/app/test/test_cryptodev.c
+++ b/app/test/test_cryptodev.c
@@ -348,6 +348,28 @@ struct crypto_unittest_params {
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_type == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_type == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -1593,6 +1615,22 @@ struct crypto_unittest_params {
 	return TEST_SUCCESS;
 }
 
+static int
+test_AES_chain_armv8_all(void)
+{
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+	int status;
+
+	status = test_blockcipher_all_tests(ts_params->mbuf_pool,
+		ts_params->op_mpool, ts_params->valid_devs[0],
+		RTE_CRYPTODEV_ARMV8_PMD,
+		BLKCIPHER_AES_CHAIN_TYPE);
+
+	TEST_ASSERT_EQUAL(status, 0, "Test failed");
+
+	return TEST_SUCCESS;
+}
+
 /* ***** SNOW 3G Tests ***** */
 static int
 create_wireless_algo_hash_session(uint8_t dev_id,
@@ -7847,6 +7885,23 @@ struct test_crypto_vector {
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, test_AES_chain_armv8_all),
+
+		/** Negative tests */
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_data_corrupt),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+			auth_decryption_AES128CBC_HMAC_SHA1_fail_tag_corrupt),
+
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 test_cryptodev_qat(void /*argv __rte_unused, int argc __rte_unused*/)
 {
@@ -7910,6 +7965,14 @@ struct test_crypto_vector {
 	return unit_test_suite_runner(&cryptodev_sw_zuc_testsuite);
 }
 
+static int
+test_cryptodev_armv8(void)
+{
+	gbl_cryptodev_type = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_qat_autotest, test_cryptodev_qat);
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_autotest, test_cryptodev_aesni_mb);
 REGISTER_TEST_COMMAND(cryptodev_openssl_autotest, test_cryptodev_openssl);
@@ -7918,3 +7981,4 @@ struct test_crypto_vector {
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_autotest, test_cryptodev_sw_snow3g);
 REGISTER_TEST_COMMAND(cryptodev_sw_kasumi_autotest, test_cryptodev_sw_kasumi);
 REGISTER_TEST_COMMAND(cryptodev_sw_zuc_autotest, test_cryptodev_sw_zuc);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_autotest, test_cryptodev_armv8);
diff --git a/app/test/test_cryptodev_aes_test_vectors.h b/app/test/test_cryptodev_aes_test_vectors.h
index e566548..f0f37ed 100644
--- a/app/test/test_cryptodev_aes_test_vectors.h
+++ b/app/test/test_cryptodev_aes_test_vectors.h
@@ -825,6 +825,98 @@
 	}
 };
 
+/** AES-128-CBC SHA256 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_12 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC,
+	.auth_key = {
+		.data = {
+			0x42, 0x1A, 0x7D, 0x3D, 0xF5, 0x82, 0x80, 0xF1,
+			0xF1, 0x35, 0x5C, 0x3B, 0xDD, 0x9A, 0x65, 0xBA,
+			0x58, 0x34, 0x85, 0x61, 0x1C, 0x42, 0x10, 0x76,
+			0x9A, 0x4F, 0x88, 0x1B, 0xB6, 0x8F, 0xD8, 0x60
+		},
+		.len = 32
+	},
+	.digest = {
+		.data = {
+			0x92, 0xEC, 0x65, 0x9A, 0x52, 0xCC, 0x50, 0xA5,
+			0xEE, 0x0E, 0xDF, 0x1E, 0xA4, 0xC9, 0xC1, 0x04,
+			0xD5, 0xDC, 0x78, 0x90, 0xF4, 0xE3, 0x35, 0x62,
+			0xAD, 0x95, 0x45, 0x28, 0x5C, 0xF8, 0x8C, 0x0B
+		},
+		.len = 32,
+		.truncated_len = 16
+	}
+};
+
+/** AES-128-CBC SHA1 HMAC test vector (160 bytes) */
+static const struct blockcipher_test_data aes_test_data_13 = {
+	.crypto_algo = RTE_CRYPTO_CIPHER_AES_CBC,
+	.cipher_key = {
+		.data = {
+			0xE4, 0x23, 0x33, 0x8A, 0x35, 0x64, 0x61, 0xE2,
+			0x49, 0x03, 0xDD, 0xC6, 0xB8, 0xCA, 0x55, 0x7A
+		},
+		.len = 16
+	},
+	.iv = {
+		.data = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
+		},
+		.len = 16
+	},
+	.plaintext = {
+		.data = plaintext_aes_common,
+		.len = 160
+	},
+	.ciphertext = {
+		.data = ciphertext512_aes128cbc,
+		.len = 160
+	},
+	.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
+	.auth_key = {
+		.data = {
+			0xF8, 0x2A, 0xC7, 0x54, 0xDB, 0x96, 0x18, 0xAA,
+			0xC3, 0xA1, 0x53, 0xF6, 0x1F, 0x17, 0x60, 0xBD,
+			0xDE, 0xF4, 0xDE, 0xAD
+		},
+		.len = 20
+	},
+	.digest = {
+		.data = {
+			0x4F, 0x16, 0xEA, 0xF7, 0x4A, 0x88, 0xD3, 0xE0,
+			0x0E, 0x12, 0x8B, 0xE7, 0x05, 0xD0, 0x86, 0x48,
+			0x22, 0x43, 0x30, 0xA7
+		},
+		.len = 20,
+		.truncated_len = 12
+	}
+};
+
 static const struct blockcipher_test_case aes_chain_test_cases[] = {
 	{
 		.test_descr = "AES-128-CTR HMAC-SHA1 Encryption Digest",
@@ -888,12 +980,20 @@
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Encryption Digest "
 				"Scatter Gather",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -902,35 +1002,58 @@
 		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
-
 	},
 	{
 		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA1 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_13,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Encryption Digest "
+			"(short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
 			"Verify",
 		.test_data = &aes_test_data_5,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_MB |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_MB |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL |
 			BLOCKCIPHER_TEST_TARGET_PMD_QAT
 	},
 	{
+		.test_descr = "AES-128-CBC HMAC-SHA256 Decryption Digest "
+			"Verify (short buffers)",
+		.test_data = &aes_test_data_12,
+		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8
+	},
+	{
 		.test_descr = "AES-128-CBC HMAC-SHA512 Encryption Digest",
 		.test_data = &aes_test_data_6,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
@@ -998,7 +1121,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1007,7 +1131,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_OOP,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_QAT |
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_QAT |
 			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
@@ -1050,7 +1175,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_ENC_AUTH_GEN,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 	{
 		.test_descr =
@@ -1059,7 +1185,8 @@
 		.test_data = &aes_test_data_4,
 		.op_mask = BLOCKCIPHER_TEST_OP_AUTH_VERIFY_DEC,
 		.feature_mask = BLOCKCIPHER_TEST_FEATURE_SESSIONLESS,
-		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
+		.pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8 |
+			BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL
 	},
 };
 
diff --git a/app/test/test_cryptodev_blockcipher.c b/app/test/test_cryptodev_blockcipher.c
index 01aef3b..a48540c 100644
--- a/app/test/test_cryptodev_blockcipher.c
+++ b/app/test/test_cryptodev_blockcipher.c
@@ -102,6 +102,7 @@
 	switch (cryptodev_type) {
 	case RTE_CRYPTODEV_QAT_SYM_PMD:
 	case RTE_CRYPTODEV_OPENSSL_PMD:
+	case RTE_CRYPTODEV_ARMV8_PMD: /* Fall through */
 		digest_len = tdata->digest.len;
 		break;
 	case RTE_CRYPTODEV_AESNI_MB_PMD:
@@ -645,6 +646,9 @@
 	case RTE_CRYPTODEV_OPENSSL_PMD:
 		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL;
 		break;
+	case RTE_CRYPTODEV_ARMV8_PMD:
+		target_pmd_mask = BLOCKCIPHER_TEST_TARGET_PMD_ARMV8;
+		break;
 	default:
 		TEST_ASSERT(0, "Unrecognized cryptodev type");
 		break;
diff --git a/app/test/test_cryptodev_blockcipher.h b/app/test/test_cryptodev_blockcipher.h
index 7256f6b..91e9858 100644
--- a/app/test/test_cryptodev_blockcipher.h
+++ b/app/test/test_cryptodev_blockcipher.h
@@ -50,6 +50,7 @@
 #define BLOCKCIPHER_TEST_TARGET_PMD_MB		0x0001 /* Multi-buffer flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_QAT			0x0002 /* QAT flag */
 #define BLOCKCIPHER_TEST_TARGET_PMD_OPENSSL	0x0004 /* SW OPENSSL flag */
+#define BLOCKCIPHER_TEST_TARGET_PMD_ARMV8	0x0008 /* ARMv8 flag */
 
 #define BLOCKCIPHER_TEST_OP_CIPHER	(BLOCKCIPHER_TEST_OP_ENCRYPT | \
 					BLOCKCIPHER_TEST_OP_DECRYPT)
diff --git a/app/test/test_cryptodev_perf.c b/app/test/test_cryptodev_perf.c
index 9b26fc1..7f1adf8 100644
--- a/app/test/test_cryptodev_perf.c
+++ b/app/test/test_cryptodev_perf.c
@@ -157,6 +157,12 @@ struct crypto_unittest_params {
 		enum rte_crypto_cipher_algorithm cipher_algo,
 		unsigned int cipher_key_len,
 		enum rte_crypto_auth_algorithm auth_algo);
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo);
+
 static struct rte_mbuf *
 test_perf_create_pktmbuf(struct rte_mempool *mpool, unsigned buf_sz);
 static inline struct rte_crypto_op *
@@ -397,6 +403,28 @@ static const char *auth_algo_name(enum rte_crypto_auth_algorithm auth_algo)
 		}
 	}
 
+	/* Create 2 ARMv8 devices if required */
+	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_ARMV8_PMD) {
+#ifndef RTE_LIBRTE_PMD_ARMV8_CRYPTO
+		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_ARMV8_CRYPTO must be"
+			" enabled in config file to run this testsuite.\n");
+		return TEST_FAILED;
+#endif
+		nb_devs = rte_cryptodev_count_devtype(
+				RTE_CRYPTODEV_ARMV8_PMD);
+		if (nb_devs < 2) {
+			for (i = nb_devs; i < 2; i++) {
+				ret = rte_eal_vdev_init(
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD),
+					NULL);
+
+				TEST_ASSERT(ret == 0, "Failed to create "
+					"instance %u of pmd : %s", i,
+					RTE_STR(CRYPTODEV_NAME_ARMV8_PMD));
+			}
+		}
+	}
+
 #ifndef RTE_LIBRTE_PMD_QAT
 	if (gbl_cryptodev_perftest_devtype == RTE_CRYPTODEV_QAT_SYM_PMD) {
 		RTE_LOG(ERR, USER1, "CONFIG_RTE_LIBRTE_PMD_QAT must be enabled "
@@ -2425,6 +2453,139 @@ struct crypto_data_params aes_cbc_hmac_sha256_output[MAX_PACKET_SIZE_INDEX] = {
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8_optimise_cyclecount(struct perf_test_params *pparams)
+{
+	uint32_t num_to_submit = pparams->total_operations;
+	struct rte_crypto_op *c_ops[num_to_submit];
+	struct rte_crypto_op *proc_ops[num_to_submit];
+	uint64_t failed_polls, retries, start_cycles, end_cycles,
+		 total_cycles = 0;
+	uint32_t burst_sent = 0, burst_received = 0;
+	uint32_t i, burst_size, num_sent, num_ops_received;
+	uint32_t nb_ops;
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate Crypto op data structure(s)*/
+	for (i = 0; i < num_to_submit ; i++) {
+		struct rte_mbuf *m = test_perf_create_pktmbuf(
+						ts_params->mbuf_mp,
+						pparams->buf_size);
+		TEST_ASSERT_NOT_NULL(m, "Failed to allocate tx_buf");
+
+		struct rte_crypto_op *op =
+				rte_crypto_op_alloc(ts_params->op_mpool,
+						RTE_CRYPTO_OP_TYPE_SYMMETRIC);
+		TEST_ASSERT_NOT_NULL(op, "Failed to allocate op");
+
+		op = test_perf_set_crypto_op_aes(op, m, sess, pparams->buf_size,
+				digest_length, pparams->chain);
+		TEST_ASSERT_NOT_NULL(op, "Failed to attach op to session");
+
+		c_ops[i] = op;
+	}
+
+	printf("\nOn %s dev%u qp%u, %s, cipher algo:%s, cipher key length:%u, "
+			"auth_algo:%s, Packet Size %u bytes",
+			pmd_name(gbl_cryptodev_perftest_devtype),
+			ts_params->dev_id, 0,
+			chain_mode_name(pparams->chain),
+			cipher_algo_name(pparams->cipher_algo),
+			pparams->cipher_key_length,
+			auth_algo_name(pparams->auth_algo),
+			pparams->buf_size);
+	printf("\nOps Tx\tOps Rx\tOps/burst  ");
+	printf("Retries  "
+		"EmptyPolls\tIACycles/CyOp\tIACycles/Burst\tIACycles/Byte");
+
+	for (i = 2; i <= 128 ; i *= 2) {
+		num_sent = 0;
+		num_ops_received = 0;
+		retries = 0;
+		failed_polls = 0;
+		burst_size = i;
+		total_cycles = 0;
+		while (num_sent < num_to_submit) {
+			if ((num_to_submit - num_sent) < burst_size)
+				nb_ops = num_to_submit - num_sent;
+			else
+				nb_ops = burst_size;
+
+			start_cycles = rte_rdtsc();
+			burst_sent = rte_cryptodev_enqueue_burst(
+				ts_params->dev_id,
+				0, &c_ops[num_sent],
+				nb_ops);
+			end_cycles = rte_rdtsc();
+
+			if (burst_sent == 0)
+				retries++;
+			num_sent += burst_sent;
+			total_cycles += (end_cycles - start_cycles);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+					ts_params->dev_id, 0, proc_ops,
+					burst_size);
+			end_cycles = rte_rdtsc();
+			if (burst_received < burst_sent)
+				failed_polls++;
+			num_ops_received += burst_received;
+
+			total_cycles += end_cycles - start_cycles;
+		}
+
+		while (num_ops_received != num_to_submit) {
+			/* Sending 0 length burst to flush sw crypto device */
+			rte_cryptodev_enqueue_burst(
+						ts_params->dev_id, 0, NULL, 0);
+
+			start_cycles = rte_rdtsc();
+			burst_received = rte_cryptodev_dequeue_burst(
+				ts_params->dev_id, 0, proc_ops, burst_size);
+			end_cycles = rte_rdtsc();
+
+			total_cycles += end_cycles - start_cycles;
+			if (burst_received == 0)
+				failed_polls++;
+			num_ops_received += burst_received;
+		}
+
+		printf("\n%u\t%u\t%u", num_sent, num_ops_received, burst_size);
+		printf("\t\t%"PRIu64, retries);
+		printf("\t%"PRIu64, failed_polls);
+		printf("\t\t%"PRIu64, total_cycles/num_ops_received);
+		printf("\t\t%"PRIu64,
+			(total_cycles/num_ops_received)*burst_size);
+		printf("\t\t%"PRIu64,
+			total_cycles/(num_ops_received*pparams->buf_size));
+	}
+	printf("\n");
+
+	for (i = 0; i < num_to_submit ; i++) {
+		rte_pktmbuf_free(c_ops[i]->sym->m_src);
+		rte_crypto_op_free(c_ops[i]);
+	}
+
+	return TEST_SUCCESS;
+}
+
 static uint32_t get_auth_key_max_length(enum rte_crypto_auth_algorithm algo)
 {
 	switch (algo) {
@@ -2690,6 +2851,56 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	}
 }
 
+static struct rte_cryptodev_sym_session *
+test_perf_create_armv8_session(uint8_t dev_id, enum chain_mode chain,
+		enum rte_crypto_cipher_algorithm cipher_algo,
+		unsigned int cipher_key_len,
+		enum rte_crypto_auth_algorithm auth_algo)
+{
+	struct rte_crypto_sym_xform cipher_xform = { 0 };
+	struct rte_crypto_sym_xform auth_xform = { 0 };
+
+	/* Setup Cipher Parameters */
+	cipher_xform.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
+	cipher_xform.cipher.algo = cipher_algo;
+
+	switch (cipher_algo) {
+	case RTE_CRYPTO_CIPHER_AES_CBC:
+		cipher_xform.cipher.key.data = aes_cbc_128_key;
+		break;
+	default:
+		return NULL;
+	}
+
+	cipher_xform.cipher.key.length = cipher_key_len;
+
+	/* Setup Auth Parameters */
+	auth_xform.type = RTE_CRYPTO_SYM_XFORM_AUTH;
+	auth_xform.auth.op = RTE_CRYPTO_AUTH_OP_GENERATE;
+	auth_xform.auth.algo = auth_algo;
+
+	auth_xform.auth.digest_length = get_auth_digest_length(auth_algo);
+
+	switch (chain) {
+	case CIPHER_HASH:
+		cipher_xform.next = &auth_xform;
+		auth_xform.next = NULL;
+		/* Encrypt and hash the result */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&cipher_xform);
+	case HASH_CIPHER:
+		auth_xform.next = &cipher_xform;
+		cipher_xform.next = NULL;
+		/* Hash encrypted message and decrypt */
+		cipher_xform.cipher.op = RTE_CRYPTO_CIPHER_OP_DECRYPT;
+		/* Create Crypto session*/
+		return rte_cryptodev_sym_session_create(dev_id,	&auth_xform);
+	default:
+		return NULL;
+	}
+}
+
 #define AES_BLOCK_SIZE 16
 #define AES_CIPHER_IV_LENGTH 16
 
@@ -3380,6 +3591,139 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 	return TEST_SUCCESS;
 }
 
+static int
+test_perf_armv8(uint8_t dev_id, uint16_t queue_id,
+		struct perf_test_params *pparams)
+{
+	uint16_t i, k, l, m;
+	uint16_t j = 0;
+	uint16_t ops_unused = 0;
+	uint16_t burst_size;
+	uint16_t ops_needed;
+
+	uint64_t burst_enqueued = 0, total_enqueued = 0, burst_dequeued = 0;
+	uint64_t processed = 0, failed_polls = 0, retries = 0;
+	uint64_t tsc_start = 0, tsc_end = 0;
+
+	unsigned int digest_length = get_auth_digest_length(pparams->auth_algo);
+
+	struct rte_crypto_op *ops[pparams->burst_size];
+	struct rte_crypto_op *proc_ops[pparams->burst_size];
+
+	struct rte_mbuf *mbufs[pparams->burst_size * NUM_MBUF_SETS];
+
+	struct crypto_testsuite_params *ts_params = &testsuite_params;
+
+	static struct rte_cryptodev_sym_session *sess;
+
+	if (rte_cryptodev_count() == 0) {
+		printf("\nNo crypto devices found. Is PMD build configured?\n");
+		return TEST_FAILED;
+	}
+
+	/* Create Crypto session*/
+	sess = test_perf_create_armv8_session(ts_params->dev_id,
+			pparams->chain, pparams->cipher_algo,
+			pparams->cipher_key_length, pparams->auth_algo);
+	TEST_ASSERT_NOT_NULL(sess, "Session creation failed");
+
+	/* Generate a burst of crypto operations */
+	for (i = 0; i < (pparams->burst_size * NUM_MBUF_SETS); i++) {
+		mbufs[i] = test_perf_create_pktmbuf(
+				ts_params->mbuf_mp,
+				pparams->buf_size);
+
+		if (mbufs[i] == NULL) {
+			printf("\nFailed to get mbuf - freeing the rest.\n");
+			for (k = 0; k < i; k++)
+				rte_pktmbuf_free(mbufs[k]);
+			return -1;
+		}
+	}
+
+	tsc_start = rte_rdtsc();
+
+	while (total_enqueued < pparams->total_operations) {
+		if ((total_enqueued + pparams->burst_size) <=
+					pparams->total_operations)
+			burst_size = pparams->burst_size;
+		else
+			burst_size = pparams->total_operations - total_enqueued;
+
+		ops_needed = burst_size - ops_unused;
+
+		if (ops_needed != rte_crypto_op_bulk_alloc(ts_params->op_mpool,
+				RTE_CRYPTO_OP_TYPE_SYMMETRIC, ops, ops_needed)){
+			printf("\nFailed to alloc enough ops, finish dequeuing "
+				"and free ops below.");
+		} else {
+			for (i = 0; i < ops_needed; i++)
+				ops[i] = test_perf_set_crypto_op_aes(ops[i],
+					mbufs[i + (pparams->burst_size *
+						(j % NUM_MBUF_SETS))], sess,
+					pparams->buf_size, digest_length,
+					pparams->chain);
+
+			/* enqueue burst */
+			burst_enqueued = rte_cryptodev_enqueue_burst(dev_id,
+					queue_id, ops, burst_size);
+
+			if (burst_enqueued < burst_size)
+				retries++;
+
+			ops_unused = burst_size - burst_enqueued;
+			total_enqueued += burst_enqueued;
+		}
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (l = 0; l < burst_dequeued; l++)
+				rte_crypto_op_free(proc_ops[l]);
+		}
+		j++;
+	}
+
+	/* Dequeue any operations still in the crypto device */
+	while (processed < pparams->total_operations) {
+		/* Sending 0 length burst to flush sw crypto device */
+		rte_cryptodev_enqueue_burst(dev_id, queue_id, NULL, 0);
+
+		/* dequeue burst */
+		burst_dequeued = rte_cryptodev_dequeue_burst(dev_id, queue_id,
+				proc_ops, pparams->burst_size);
+		if (burst_dequeued == 0)
+			failed_polls++;
+		else {
+			processed += burst_dequeued;
+
+			for (m = 0; m < burst_dequeued; m++)
+				rte_crypto_op_free(proc_ops[m]);
+		}
+	}
+
+	tsc_end = rte_rdtsc();
+
+	double ops_s = ((double)processed / (tsc_end - tsc_start))
+					* rte_get_tsc_hz();
+	double throughput = (ops_s * pparams->buf_size * NUM_MBUF_SETS)
+					/ 1000000000;
+
+	printf("\t%u\t%6.2f\t%10.2f\t%8"PRIu64"\t%8"PRIu64, pparams->buf_size,
+			ops_s / 1000000, throughput, retries, failed_polls);
+
+	for (i = 0; i < pparams->burst_size * NUM_MBUF_SETS; i++)
+		rte_pktmbuf_free(mbufs[i]);
+
+	printf("\n");
+	return TEST_SUCCESS;
+}
+
 /*
 
     perf_test_aes_sha("avx2", HASH_CIPHER, 16, CBC, SHA1);
@@ -3693,6 +4037,125 @@ static uint32_t get_auth_digest_length(enum rte_crypto_auth_algorithm algo)
 }
 
 static int
+test_perf_armv8_vary_pkt_size(void)
+{
+	unsigned int total_operations = 100000;
+	unsigned int burst_size = { 64 };
+	unsigned int buf_lengths[] = { 64, 128, 256, 512, 768, 1024, 1280, 1536,
+			1792, 2048 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		params_set[i].total_operations = total_operations;
+		params_set[i].burst_size = burst_size;
+		printf("\n%s. cipher algo: %s auth algo: %s cipher key size=%u."
+				" burst_size: %d ops\n",
+				chain_mode_name(params_set[i].chain),
+				cipher_algo_name(params_set[i].cipher_algo),
+				auth_algo_name(params_set[i].auth_algo),
+				params_set[i].cipher_key_length,
+				burst_size);
+		printf("\nBuffer Size(B)\tOPS(M)\tThroughput(Gbps)\tRetries\t"
+				"EmptyPolls\n");
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8(testsuite_params.dev_id, 0,
+							&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
+test_perf_armv8_vary_burst_size(void)
+{
+	unsigned int total_operations = 4096;
+	uint16_t buf_lengths[] = { 64 };
+	uint8_t i, j;
+
+	struct perf_test_params params_set[] = {
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA1_HMAC
+		},
+		{
+			.chain = CIPHER_HASH,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+		{
+			.chain = HASH_CIPHER,
+
+			.cipher_algo  = RTE_CRYPTO_CIPHER_AES_CBC,
+			.cipher_key_length = 16,
+			.auth_algo = RTE_CRYPTO_AUTH_SHA256_HMAC
+		},
+	};
+
+	printf("\n\nStart %s.", __func__);
+	printf("\nThis Test measures the average IA cycle cost using a "
+			"constant request(packet) size. ");
+	printf("Cycle cost is only valid when indicators show device is "
+			"not busy, i.e. Retries and EmptyPolls = 0");
+
+	for (i = 0; i < RTE_DIM(params_set); i++) {
+		printf("\n");
+		params_set[i].total_operations = total_operations;
+
+		for (j = 0; j < RTE_DIM(buf_lengths); j++) {
+			params_set[i].buf_size = buf_lengths[j];
+			test_perf_armv8_optimise_cyclecount(&params_set[i]);
+		}
+	}
+
+	return 0;
+}
+
+static int
 test_perf_aes_cbc_vary_burst_size(void)
 {
 	return test_perf_crypto_qp_vary_burst_size(testsuite_params.dev_id);
@@ -4244,6 +4707,19 @@ static int test_continual_perf_AES_GCM(void)
 	}
 };
 
+static struct unit_test_suite cryptodev_armv8_testsuite  = {
+	.suite_name = "Crypto Device ARMv8 Unit Test Suite",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_pkt_size),
+		TEST_CASE_ST(ut_setup, ut_teardown,
+				test_perf_armv8_vary_burst_size),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
 static int
 perftest_aesni_gcm_cryptodev(void)
 {
@@ -4300,6 +4776,14 @@ static int test_continual_perf_AES_GCM(void)
 	return unit_test_suite_runner(&cryptodev_qat_continual_testsuite);
 }
 
+static int
+perftest_sw_armv8_cryptodev(void /*argv __rte_unused, int argc __rte_unused*/)
+{
+	gbl_cryptodev_perftest_devtype = RTE_CRYPTODEV_ARMV8_PMD;
+
+	return unit_test_suite_runner(&cryptodev_armv8_testsuite);
+}
+
 REGISTER_TEST_COMMAND(cryptodev_aesni_mb_perftest, perftest_aesni_mb_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_perftest, perftest_qat_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_sw_snow3g_perftest, perftest_sw_snow3g_cryptodev);
@@ -4309,3 +4793,5 @@ static int test_continual_perf_AES_GCM(void)
 		perftest_openssl_cryptodev);
 REGISTER_TEST_COMMAND(cryptodev_qat_continual_perftest,
 		perftest_qat_continual_cryptodev);
+REGISTER_TEST_COMMAND(cryptodev_sw_armv8_perftest,
+		perftest_sw_armv8_cryptodev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 0/8] Add crypto PMD optimized for ARMv8
  2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
                                     ` (7 preceding siblings ...)
  2017-01-18 20:02                   ` [PATCH v6 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
@ 2017-01-18 21:14                   ` De Lara Guarch, Pablo
  2017-01-19 10:36                     ` Zbigniew Bodek
  8 siblings, 1 reply; 100+ messages in thread
From: De Lara Guarch, Pablo @ 2017-01-18 21:14 UTC (permalink / raw)
  To: zbigniew.bodek, dev
  Cc: Doherty, Declan, jerin.jacob, jianbo.liu, hemant.agrawal



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
> zbigniew.bodek@caviumnetworks.com
> Sent: Wednesday, January 18, 2017 8:02 PM
> To: dev@dpdk.org
> Cc: De Lara Guarch, Pablo; Doherty, Declan;
> jerin.jacob@caviumnetworks.com; jianbo.liu@linaro.org;
> hemant.agrawal@nxp.com; Zbigniew Bodek
> Subject: [dpdk-dev] [PATCH v6 0/8] Add crypto PMD optimized for ARMv8
> 
> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>

Applied to dpdk-next-crypto.
Thanks,

Pablo

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 0/8] Add crypto PMD optimized for ARMv8
  2017-01-18 21:14                   ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
@ 2017-01-19 10:36                     ` Zbigniew Bodek
  0 siblings, 0 replies; 100+ messages in thread
From: Zbigniew Bodek @ 2017-01-19 10:36 UTC (permalink / raw)
  To: De Lara Guarch, Pablo, dev
  Cc: Doherty, Declan, jerin.jacob, jianbo.liu, hemant.agrawal

Thanks a lot!

Kind regards
Zbigniew

On 18.01.2017 22:14, De Lara Guarch, Pablo wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
>> zbigniew.bodek@caviumnetworks.com
>> Sent: Wednesday, January 18, 2017 8:02 PM
>> To: dev@dpdk.org
>> Cc: De Lara Guarch, Pablo; Doherty, Declan;
>> jerin.jacob@caviumnetworks.com; jianbo.liu@linaro.org;
>> hemant.agrawal@nxp.com; Zbigniew Bodek
>> Subject: [dpdk-dev] [PATCH v6 0/8] Add crypto PMD optimized for ARMv8
>>
>> From: Zbigniew Bodek <zbigniew.bodek@caviumnetworks.com>
>
> Applied to dpdk-next-crypto.
> Thanks,
>
> Pablo
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2017-01-19 10:36 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-04 11:33 [PATCH] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2016-12-04 11:33 ` [PATCH 1/3] mk: fix build of assembly files for ARM64 zbigniew.bodek
2016-12-04 11:33 ` [PATCH 2/3] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2016-12-04 11:33 ` [PATCH 3/3] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
2016-12-07  2:32 ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2016-12-07  2:32   ` [PATCH v2 01/12] mk: fix build of assembly files for ARM64 zbigniew.bodek
2016-12-21 14:46     ` De Lara Guarch, Pablo
2017-01-04 17:33     ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2017-01-04 17:33       ` [PATCH v3 1/8] mk: fix build of assembly files for ARM64 zbigniew.bodek
2017-01-13  8:13         ` Hemant Agrawal
2017-01-04 17:33       ` [PATCH v3 2/8] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
2017-01-13  8:16         ` Hemant Agrawal
2017-01-13 15:50           ` Zbigniew Bodek
2017-01-16  5:57           ` Jianbo Liu
2017-01-04 17:33       ` [PATCH v3 3/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2017-01-06  2:45         ` Jianbo Liu
2017-01-12 13:12           ` Zbigniew Bodek
2017-01-13  7:41             ` Jianbo Liu
2017-01-13 19:09               ` Zbigniew Bodek
2017-01-13  7:57         ` Hemant Agrawal
2017-01-13 19:15           ` Zbigniew Bodek
2017-01-17 15:48         ` [PATCH v4 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2017-01-17 15:48           ` [PATCH v4 1/7] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
2017-01-18  2:24             ` Jerin Jacob
2017-01-17 15:48           ` [PATCH v4 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2017-01-18 14:27             ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2017-01-18 14:27               ` [PATCH v5 1/7] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
2017-01-18 14:27               ` [PATCH v5 2/7] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2017-01-18 20:01                 ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 1/8] cryptodev: add cryptodev type for the ARMv8 PMD zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 2/8] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 3/8] mk: add PMD to the build system zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 4/8] cryptodev/armv8: introduce ARM-specific feature flags zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 5/8] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
2017-01-18 20:01                   ` [PATCH v6 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
2017-01-18 20:02                   ` [PATCH v6 7/8] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
2017-01-18 20:02                   ` [PATCH v6 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
2017-01-18 21:14                   ` [PATCH v6 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
2017-01-19 10:36                     ` Zbigniew Bodek
2017-01-18 14:27               ` [PATCH v5 3/7] mk: add PMD to the build system zbigniew.bodek
2017-01-18 14:27               ` [PATCH v5 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
2017-01-18 17:05                 ` De Lara Guarch, Pablo
2017-01-18 19:52                   ` Zbigniew Bodek
2017-01-18 19:54                     ` De Lara Guarch, Pablo
2017-01-18 14:27               ` [PATCH v5 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
2017-01-18 14:27               ` [PATCH v5 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
2017-01-18 14:27               ` [PATCH v5 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
2017-01-18 15:23               ` [PATCH v5 0/7] Add crypto PMD optimized for ARMv8 Jerin Jacob
2017-01-17 15:48           ` [PATCH v4 3/7] mk: add PMD to the build system zbigniew.bodek
2017-01-17 15:49           ` [PATCH v4 4/7] doc: update documentation about ARMv8 crypto PMD zbigniew.bodek
2017-01-17 15:49           ` [PATCH v4 5/7] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
2017-01-17 15:49           ` [PATCH v4 6/7] MAINTAINERS: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
2017-01-17 15:49           ` [PATCH v4 7/7] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
2017-01-18  2:26             ` Jerin Jacob
2017-01-04 17:33       ` [PATCH v3 4/8] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
2017-01-04 17:33       ` [PATCH v3 5/8] doc/armv8: update documentation about crypto PMD zbigniew.bodek
2017-01-04 17:33       ` [PATCH v3 6/8] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
2017-01-04 17:33       ` [PATCH v3 7/8] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
2017-01-04 17:33       ` [PATCH v3 8/8] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek
2017-01-12 10:48         ` De Lara Guarch, Pablo
2017-01-12 11:50           ` Zbigniew Bodek
2017-01-12 12:07             ` De Lara Guarch, Pablo
2017-01-13  9:28         ` Hemant Agrawal
2017-01-10 17:11       ` [PATCH v3 0/8] Add crypto PMD optimized for ARMv8 De Lara Guarch, Pablo
2017-01-10 17:50         ` Zbigniew Bodek
2017-01-13  8:07       ` Hemant Agrawal
2017-01-13 18:59         ` Zbigniew Bodek
2017-01-16  6:57           ` Hemant Agrawal
2017-01-16  8:02             ` Jerin Jacob
2016-12-07  2:32   ` [PATCH v2 02/12] lib: add cryptodev type for the upcoming ARMv8 PMD zbigniew.bodek
2016-12-06 20:27     ` Thomas Monjalon
2016-12-07 19:04       ` Zbigniew Bodek
2016-12-07 20:09         ` Thomas Monjalon
2016-12-09 12:06           ` Declan Doherty
2016-12-07  2:32   ` [PATCH v2 03/12] crypto/armv8: Add core crypto operations for ARMv8 zbigniew.bodek
2016-12-06 20:29     ` Thomas Monjalon
2016-12-06 21:18       ` Jerin Jacob
2016-12-06 21:42         ` Thomas Monjalon
2016-12-06 22:05           ` Jerin Jacob
2016-12-06 22:41             ` Thomas Monjalon
2016-12-06 23:24               ` Jerin Jacob
2016-12-07 15:00                 ` Thomas Monjalon
2016-12-07 16:30                   ` Jerin Jacob
2016-12-07  2:32   ` [PATCH v2 04/12] crypto/armv8: Add AES+SHA256 " zbigniew.bodek
2016-12-07  2:32   ` [PATCH v2 05/12] crypto/armv8: Add AES+SHA1 " zbigniew.bodek
2016-12-07  2:32   ` [PATCH v2 06/12] crypto/armv8: add PMD optimized for ARMv8 processors zbigniew.bodek
2016-12-21 14:55     ` De Lara Guarch, Pablo
2016-12-07  2:33   ` [PATCH v2 07/12] crypto/armv8: generate ASM symbols automatically zbigniew.bodek
2016-12-07  2:33   ` [PATCH v2 08/12] mk/crypto/armv8: add PMD to the build system zbigniew.bodek
2016-12-21 15:01     ` De Lara Guarch, Pablo
2016-12-07  2:33   ` [PATCH v2 09/12] doc/armv8: update documentation about crypto PMD zbigniew.bodek
2016-12-07 21:13     ` Mcnamara, John
2016-12-07  2:33   ` [PATCH v2 10/12] crypto/armv8: enable ARMv8 PMD in the configuration zbigniew.bodek
2016-12-08 10:24   ` [PATCH v2 00/12] Add crypto PMD optimized for ARMv8 Bruce Richardson
2016-12-08 11:32     ` Zbigniew Bodek
2016-12-08 17:45       ` Jerin Jacob
2016-12-21 15:34         ` Declan Doherty
2016-12-22  4:57           ` Jerin Jacob
2016-12-07  2:36 ` [PATCH v2 11/12] crypto/armv8: update MAINTAINERS entry for ARMv8 crypto zbigniew.bodek
2016-12-07  2:37 ` [PATCH v2 12/12] app/test: add ARMv8 crypto tests and test vectors zbigniew.bodek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.