From: "Ananyev, Konstantin" <konstantin.ananyev-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>,
	"dev-VfR2kkLFssw@public.gmane.org"
	<dev-VfR2kkLFssw@public.gmane.org>
Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
Date: Mon, 25 Aug 2014 16:30:05 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772582135D369@IRSMSX105.ger.corp.intel.com>
In-Reply-To: <1408652100-29217-1-git-send-email-nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

Hi Neil,

> -----Original Message-----
> From: Neil Horman [mailto:nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org]
> Sent: Thursday, August 21, 2014 9:15 PM
> To: dev-VfR2kkLFssw@public.gmane.org
> Cc: Ananyev, Konstantin; thomas.monjalon-pdR9zngts4EAvxtiuMwx3w@public.gmane.org; Neil Horman
> Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> Make the ACL library build/work on the 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>   (make sure it doesn't use sse4 intrinsics through resolve_priority()).
> - Provide two versions of the rte_acl_classify code path:
>   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
>   and above; returns -ENOTSUP on a lower arch.
>   rte_acl_classify_scalar() - a slower version, but can be built and used
>   on all systems.
> - keep common code shared between these two code paths.
> 
> v2 changes:
>  run-time selection of the most appropriate code path for the given ISA.
>  By default the highest supported one is selected.
>  The user can still override that selection by manually assigning a new
>  value to the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 

I see you decided not to wait for me and fix everything by yourself :)

> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI

As I said in my previous mail, it generates an extra jump...
Though from the numbers I got, the performance impact is negligible: < 1%.
So I suppose I don't have a good enough reason to object :)
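
Just to spell out the difference as I understand it (a rough sketch from my
side - the v3 wrapper body is my guess at the shape, not copied from the patch):

/* v2: rte_acl_classify() was a macro - a plain indirect call */
#define rte_acl_classify(ctx, data, results, num, categories)	\
	rte_acl_default_classify(ctx, data, results, num, categories)

/* v3: a real exported function wrapping the pointer - keeps the symbol
 * stable for the ABI, at the cost of one extra jump per call. */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_default_classify(ctx, data, results, num, categories);
}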

Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
First of all, rte_acl_classify_scalar() is already part of our public API.
Also, as I remember, one of the customers explicitly asked for the scalar version and planned to call it directly.
Plus, using rte_acl_select_classify() to switch between implementations is not always handy:
- it is global, which means we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
- to properly support such switching, we would then need something like (see app/test/test_acl.c below):
  old_alg = rte_acl_get_classify();
  rte_acl_select_classify(new_alg);
  ...
  rte_acl_select_classify(old_alg); 
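
Note that rte_acl_get_classify() doesn't exist yet - it would have to be
added next to rte_acl_select_classify(). A rough sketch of what I mean
(the accessor name, the plain int algorithm id, and the internal variable
are just my assumptions, not in the patch):

extern int rte_acl_default_classify_alg;	/* assumed internal state */

/* hypothetical accessor: return the algorithm id currently installed
 * by rte_acl_select_classify(), so the caller can restore it later. */
int
rte_acl_get_classify(void)
{
	return rte_acl_default_classify_alg;
}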
  
>  Removed macro definitions for the match check functions to make them static inline

More comments inlined below.

Thanks
Konstantin

> 
> Signed-off-by: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
> ---
>  app/test-acl/main.c              |  13 +-
>  app/test/test_acl.c              |  12 +-
>  lib/librte_acl/Makefile          |   5 +-
>  lib/librte_acl/acl_bld.c         |   5 +-
>  lib/librte_acl/acl_match_check.h |  83 ++++
>  lib/librte_acl/acl_run.c         | 944 ---------------------------------------
>  lib/librte_acl/acl_run.h         | 220 +++++++++
>  lib/librte_acl/acl_run_scalar.c  | 198 ++++++++
>  lib/librte_acl/acl_run_sse.c     | 627 ++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c         |  46 ++
>  lib/librte_acl/rte_acl.h         |  26 +-
>  11 files changed, 1216 insertions(+), 963 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.h
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..a77f47d 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
>  	/* perform build. */
>  	ret = rte_acl_build(config.acx, &cfg);
> 
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +
>  	dump_verbose(DUMP_NONE, stdout,
>  		"rte_acl_build(%u) finished with %d\n",
>  		config.bld_categories, ret);
> @@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
>  			v += config.trace_sz;
>  		}
> 
> -		if (scalar != 0)
> -			ret = rte_acl_classify_scalar(config.acx, data,
> -				results, n, categories);
> -
> -		else
> -			ret = rte_acl_classify(config.acx, data,
> -				results, n, categories);
> +		ret = rte_acl_classify(config.acx, data, results,
> +			n, categories);
> 
>  		if (ret != 0)
>  			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 869f6d3..2fcef6e 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
>  	}
> 
>  	/* make a quick check for scalar */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);


As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it to the original value.
To support this properly, we would need:
old_alg = rte_acl_get_classify();
rte_acl_select_classify(new_alg);
...
rte_acl_select_classify(old_alg);

Doing all this just to keep the UT valid seems like a big hassle to me.
So, as I said above, it's probably better to just let the test call rte_acl_classify_scalar() directly.
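
I.e. for the UT, just keep the pre-patch call:

	/* make a quick check for scalar */
	ret = rte_acl_classify_scalar(acx, data, results,
			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);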

>  	if (ret != 0) {
>  		printf("Line %i: SSE classify failed!\n", __LINE__);
> @@ -362,7 +363,8 @@ test_invalid_layout(void)
>  	}
> 
>  	/* classify tuples (scalar) */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(results), 1);
>  	if (ret != 0) {
>  		printf("Line %i: Scalar classify failed!\n", __LINE__);
> @@ -850,7 +852,8 @@ test_invalid_parameters(void)
>  	/* scalar classify test */
> 
>  	/* cover zero categories in classify (should not fail) */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 0);
>  	if (result != 0) {
>  		printf("Line %i: Scalar classify with zero categories "
>  				"failed!\n", __LINE__);
> @@ -859,7 +862,8 @@ test_invalid_parameters(void)
>  	}
> 
>  	/* cover invalid but positive categories in classify */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
>  	if (result == 0) {
>  		printf("Line %i: Scalar classify with 3 categories "
>  				"should have failed!\n", __LINE__);
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 4fe4593..65e566d 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> -SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> +
> +CFLAGS_acl_run_sse.o += -msse4.1
> 
>  # install this header file
>  SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
> diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> index 873447b..09d58ea 100644
> --- a/lib/librte_acl/acl_bld.c
> +++ b/lib/librte_acl/acl_bld.c
> @@ -31,7 +31,6 @@
>   *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>   */
> 
> -#include <nmmintrin.h>
>  #include <rte_acl.h>
>  #include "tb_mem.h"
>  #include "acl.h"
> @@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
> 
>  			switch (rule->config->defs[n].type) {
>  			case RTE_ACL_FIELD_TYPE_BITMASK:
> -				wild = (size -
> -					_mm_popcnt_u32(fld->mask_range.u8)) /
> +				wild = (size - __builtin_popcount(
> +					fld->mask_range.u8)) /
>  					size;
>  				break;
> 
> diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> new file mode 100644
> index 0000000..4dc1982
> --- /dev/null
> +++ b/lib/librte_acl/acl_match_check.h

As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.

> @@ -0,0 +1,83 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _ACL_MATCH_CHECK_H_
> +#define _ACL_MATCH_CHECK_H_
> +
> +/*
> + * Detect matches. If a match node transition is found, then this trie
> + * traversal is complete and fill the slot with the next trie
> + * to be processed.
> + */
> +static inline uint64_t
> +acl_match_check(uint64_t transition, int slot,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, void (*resolve_priority)(
> +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories))

Ugh, that's really hard to read.
Can we create a typedef for the resolve_priority function type:
typedef void (*resolve_priority_t)(uint64_t, int,
        const struct rte_acl_ctx *ctx, struct parms *,
        const struct rte_acl_match_results *, uint32_t);
And use it here?
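
I.e. something like (signature sketch only, the function body stays as in
the patch):

typedef void (*resolve_priority_t)(uint64_t transition, int n,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	const struct rte_acl_match_results *p, uint32_t categories);

static inline uint64_t
acl_match_check(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows, resolve_priority_t resolve_priority);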

> +{
> +	const struct rte_acl_match_results *p;
> +
> +	p = (const struct rte_acl_match_results *)
> +		(flows->trans + ctx->match_index);
> +
> +	if (transition & RTE_ACL_NODE_MATCH) {
> +
> +		/* Remove flags from index and decrement active traversals */
> +		transition &= RTE_ACL_NODE_INDEX;
> +		flows->started--;
> +
> +		/* Resolve priorities for this trie and running results */
> +		if (flows->categories == 1)
> +			resolve_single_priority(transition, slot, ctx,
> +				parms, p);
> +		else
> +			resolve_priority(transition, slot, ctx, parms,
> +				p, flows->categories);
> +
> +		/* Count down completed tries for this search request */
> +		parms[slot].cmplt->count--;
> +
> +		/* Fill the slot with the next trie or idle trie */
> +		transition = acl_start_next_trie(flows, parms, slot, ctx);
> +
> +	} else if (transition == ctx->idle) {
> +		/* reset indirection table for idle slots */
> +		parms[slot].data_index = idle;
> +	}
> +
> +	return transition;
> +}
> +
> +#endif
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
> - *   are met:
> - *
> - *     * Redistributions of source code must retain the above copyright
> - *       notice, this list of conditions and the following disclaimer.
> - *     * Redistributions in binary form must reproduce the above copyright
> - *       notice, this list of conditions and the following disclaimer in
> - *       the documentation and/or other materials provided with the
> - *       distribution.
> - *     * Neither the name of Intel Corporation nor the names of its
> - *       contributors may be used to endorse or promote products derived
> - *       from this software without specific prior written permission.
> - *
> - *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> - *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> - *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> - *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> - *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> - *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> - *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> - *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> - *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> - *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> - *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> - */
> -
> -#include <rte_acl.h>
> -#include "acl_vect.h"
> -#include "acl.h"
> -
> -#define MAX_SEARCHES_SSE8	8
> -#define MAX_SEARCHES_SSE4	4
> -#define MAX_SEARCHES_SSE2	2
> -#define MAX_SEARCHES_SCALAR	2
> -
> -#define GET_NEXT_4BYTES(prm, idx)	\
> -	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> -
> -
> -#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> -
> -#define	SCALAR_QRANGE_MULT	0x01010101
> -#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> -#define	SCALAR_QRANGE_MIN	0x80808080
> -
> -enum {
> -	SHUFFLE32_SLOT1 = 0xe5,
> -	SHUFFLE32_SLOT2 = 0xe6,
> -	SHUFFLE32_SLOT3 = 0xe7,
> -	SHUFFLE32_SWAP64 = 0x4e,
> -};
> -
> -/*
> - * Structure to manage N parallel trie traversals.
> - * The runtime trie traversal routines can process 8, 4, or 2 tries
> - * in parallel. Each packet may require multiple trie traversals (up to 4).
> - * This structure is used to fill the slots (0 to n-1) for parallel processing
> - * with the trie traversals needed for each packet.
> - */
> -struct acl_flow_data {
> -	uint32_t            num_packets;
> -	/* number of packets processed */
> -	uint32_t            started;
> -	/* number of trie traversals in progress */
> -	uint32_t            trie;
> -	/* current trie index (0 to N-1) */
> -	uint32_t            cmplt_size;
> -	uint32_t            total_packets;
> -	uint32_t            categories;
> -	/* number of result categories per packet. */
> -	/* maximum number of packets to process */
> -	const uint64_t     *trans;
> -	const uint8_t     **data;
> -	uint32_t           *results;
> -	struct completion  *last_cmplt;
> -	struct completion  *cmplt_array;
> -};
> -
> -/*
> - * Structure to maintain running results for
> - * a single packet (up to 4 tries).
> - */
> -struct completion {
> -	uint32_t *results;                          /* running results. */
> -	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> -	uint32_t  count;                            /* num of remaining tries */
> -	/* true for allocated struct */
> -} __attribute__((aligned(XMM_SIZE)));
> -
> -/*
> - * One parms structure for each slot in the search engine.
> - */
> -struct parms {
> -	const uint8_t              *data;
> -	/* input data for this packet */
> -	const uint32_t             *data_index;
> -	/* data indirection for this trie */
> -	struct completion          *cmplt;
> -	/* completion data for this packet */
> -};
> -
> -/*
> - * Define an global idle node for unused engine slots
> - */
> -static const uint32_t idle[UINT8_MAX + 1];
> -
> -static const rte_xmm_t mm_type_quad_range = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -	},
> -};
> -
> -static const rte_xmm_t mm_type_quad_range64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		0,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input = {
> -	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input64 = {
> -	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> -};
> -
> -static const rte_xmm_t mm_ones_16 = {
> -	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> -};
> -
> -static const rte_xmm_t mm_bytes = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> -};
> -
> -static const rte_xmm_t mm_bytes64 = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> -};
> -
> -static const rte_xmm_t mm_match_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -	},
> -};
> -
> -static const rte_xmm_t mm_match_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		0,
> -		0,
> -	},
> -};
> -
> -/*
> - * Allocate a completion structure to manage the tries for a packet.
> - */
> -static inline struct completion *
> -alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> -	uint32_t *results)
> -{
> -	uint32_t n;
> -
> -	for (n = 0; n < size; n++) {
> -
> -		if (p[n].count == 0) {
> -
> -			/* mark as allocated and set number of tries. */
> -			p[n].count = tries;
> -			p[n].results = results;
> -			return &(p[n]);
> -		}
> -	}
> -
> -	/* should never get here */
> -	return NULL;
> -}
> -
> -/*
> - * Resolve priority for a single result trie.
> - */
> -static inline void
> -resolve_single_priority(uint64_t transition, int n,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	const struct rte_acl_match_results *p)
> -{
> -	if (parms[n].cmplt->count == ctx->num_tries ||
> -			parms[n].cmplt->priority[0] <=
> -			p[transition].priority[0]) {
> -
> -		parms[n].cmplt->priority[0] = p[transition].priority[0];
> -		parms[n].cmplt->results[0] = p[transition].results[0];
> -	}
> -
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Resolve priority for multiple results. This consists comparing
> - * the priority of the current traversal with the running set of
> - * results for the packet. For each result, keep a running array of
> - * the result (rule number) and its priority for each category.
> - */
> -static inline void
> -resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, const struct rte_acl_match_results *p,
> -	uint32_t categories)
> -{
> -	uint32_t x;
> -	xmm_t results, priority, results1, priority1, selector;
> -	xmm_t *saved_results, *saved_priority;
> -
> -	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> -
> -		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> -		saved_priority =
> -			(xmm_t *)(&parms[n].cmplt->priority[x]);
> -
> -		/* get results and priorities for completed trie */
> -		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> -		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> -
> -		/* if this is not the first completed trie */
> -		if (parms[n].cmplt->count != ctx->num_tries) {
> -
> -			/* get running best results and their priorities */
> -			results1 = MM_LOADU(saved_results);
> -			priority1 = MM_LOADU(saved_priority);
> -
> -			/* select results that are highest priority */
> -			selector = MM_CMPGT32(priority1, priority);
> -			results = MM_BLENDV8(results, results1, selector);
> -			priority = MM_BLENDV8(priority, priority1, selector);
> -		}
> -
> -		/* save running best results and their priorities */
> -		MM_STOREU(saved_results, results);
> -		MM_STOREU(saved_priority, priority);
> -	}
> -
> -	/* Count down completed tries for this search request */
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Routine to fill a slot in the parallel trie traversal array (parms) from
> - * the list of packets (flows).
> - */
> -static inline uint64_t
> -acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> -	const struct rte_acl_ctx *ctx)
> -{
> -	uint64_t transition;
> -
> -	/* if there are any more packets to process */
> -	if (flows->num_packets < flows->total_packets) {
> -		parms[n].data = flows->data[flows->num_packets];
> -		parms[n].data_index = ctx->trie[flows->trie].data_index;
> -
> -		/* if this is the first trie for this packet */
> -		if (flows->trie == 0) {
> -			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> -				flows->cmplt_size, ctx->num_tries,
> -				flows->results +
> -				flows->num_packets * flows->categories);
> -		}
> -
> -		/* set completion parameters and starting index for this slot */
> -		parms[n].cmplt = flows->last_cmplt;
> -		transition =
> -			flows->trans[parms[n].data[*parms[n].data_index++] +
> -			ctx->trie[flows->trie].root_index];
> -
> -		/*
> -		 * if this is the last trie for this packet,
> -		 * then setup next packet.
> -		 */
> -		flows->trie++;
> -		if (flows->trie >= ctx->num_tries) {
> -			flows->trie = 0;
> -			flows->num_packets++;
> -		}
> -
> -		/* keep track of number of active trie traversals */
> -		flows->started++;
> -
> -	/* no more tries to process, set slot to an idle position */
> -	} else {
> -		transition = ctx->idle;
> -		parms[n].data = (const uint8_t *)idle;
> -		parms[n].data_index = idle;
> -	}
> -	return transition;
> -}
> -
> -/*
> - * Detect matches. If a match node transition is found, then this trie
> - * traversal is complete and fill the slot with the next trie
> - * to be processed.
> - */
> -static inline uint64_t
> -acl_match_check_transition(uint64_t transition, int slot,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows)
> -{
> -	const struct rte_acl_match_results *p;
> -
> -	p = (const struct rte_acl_match_results *)
> -		(flows->trans + ctx->match_index);
> -
> -	if (transition & RTE_ACL_NODE_MATCH) {
> -
> -		/* Remove flags from index and decrement active traversals */
> -		transition &= RTE_ACL_NODE_INDEX;
> -		flows->started--;
> -
> -		/* Resolve priorities for this trie and running results */
> -		if (flows->categories == 1)
> -			resolve_single_priority(transition, slot, ctx,
> -				parms, p);
> -		else
> -			resolve_priority(transition, slot, ctx, parms, p,
> -				flows->categories);
> -
> -		/* Fill the slot with the next trie or idle trie */
> -		transition = acl_start_next_trie(flows, parms, slot, ctx);
> -
> -	} else if (transition == ctx->idle) {
> -		/* reset indirection table for idle slots */
> -		parms[slot].data_index = idle;
> -	}
> -
> -	return transition;
> -}
> -
> -/*
> - * Extract transitions from an XMM register and check for any matches
> - */
> -static void
> -acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, struct acl_flow_data *flows)
> -{
> -	uint64_t transition1, transition2;
> -
> -	/* extract transition from low 64 bits. */
> -	transition1 = MM_CVT64(*indicies);
> -
> -	/* extract transition from high 64 bits. */
> -	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> -	transition2 = MM_CVT64(*indicies);
> -
> -	transition1 = acl_match_check_transition(transition1, slot, ctx,
> -		parms, flows);
> -	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
> -		parms, flows);
> -
> -	/* update indicies with new transitions. */
> -	*indicies = MM_SET64(transition2, transition1);
> -}
> -
> -/*
> - * Check for a match in 2 transitions (contained in SSE register)
> - */
> -static inline void
> -acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	temp = MM_AND(match_mask, *indicies);
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies, slot, ctx, parms, flows);
> -		temp = MM_AND(match_mask, *indicies);
> -	}
> -}
> -
> -/*
> - * Check for any match in 4 transitions (contained in 2 SSE registers)
> - */
> -static inline void
> -acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> -	xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	/* put low 32 bits of each transition into one register */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	/* test for match node */
> -	temp = MM_AND(match_mask, temp);
> -
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies1, slot, ctx, parms, flows);
> -		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> -
> -		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -					(__m128)*indicies2,
> -					0x88);
> -		temp = MM_AND(match_mask, temp);
> -	}
> -}
> -
> -/*
> - * Calculate the address of the next transition for
> - * all types of nodes. Note that only DFA nodes and range
> - * nodes actually transition to another node. Match
> - * nodes don't move.
> - */
> -static inline xmm_t
> -acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr, node_types, temp;
> -
> -	/*
> -	 * Note that no transition is done for a match
> -	 * node and therefore a stream freezes when
> -	 * it reaches a match.
> -	 */
> -
> -	/* Shuffle low 32 into temp and high 32 into indicies2 */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -		(__m128)*indicies2, 0xdd);
> -
> -	/* Calc node type and node addr */
> -	node_types = MM_ANDNOT(index_mask, temp);
> -	addr = MM_AND(index_mask, temp);
> -
> -	/*
> -	 * Calc addr for DFAs - addr = dfa_index + input_byte
> -	 */
> -
> -	/* mask for DFA type (0) nodes */
> -	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> -
> -	/* add input byte to DFA position */
> -	temp = MM_AND(temp, bytes);
> -	temp = MM_AND(temp, next_input);
> -	addr = MM_ADD32(addr, temp);
> -
> -	/*
> -	 * Calc addr for Range nodes -> range_index + range(input)
> -	 */
> -	node_types = MM_CMPEQ32(node_types, type_quad_range);
> -
> -	/*
> -	 * Calculate number of range boundaries that are less than the
> -	 * input value. Range boundaries for each node are in signed 8 bit,
> -	 * ordered from -128 to 127 in the indicies2 register.
> -	 * This is effectively a popcnt of bytes that are greater than the
> -	 * input byte.
> -	 */
> -
> -	/* shuffle input byte to all 4 positions of 32 bit value */
> -	temp = MM_SHUFFLE8(next_input, shuffle_input);
> -
> -	/* check ranges */
> -	temp = MM_CMPGT8(temp, *indicies2);
> -
> -	/* convert -1 to 1 (bytes greater than input byte */
> -	temp = MM_SIGN8(temp, temp);
> -
> -	/* horizontal add pairs of bytes into words */
> -	temp = MM_MADD8(temp, temp);
> -
> -	/* horizontal add pairs of words into dwords */
> -	temp = MM_MADD16(temp, ones_16);
> -
> -	/* mask to range type nodes */
> -	temp = MM_AND(temp, node_types);
> -
> -	/* add index into node position */
> -	return MM_ADD32(addr, temp);
> -}
> -
> -/*
> - * Process 4 transitions (in 2 SIMD registers) in parallel
> - */
> -static inline xmm_t
> -transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr;
> -	uint64_t trans0, trans2;
> -
> -	 /* Calculate the address (array index) for all 4 transitions. */
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, indicies2);
> -
> -	 /* Gather 64 bit transitions and pack back into 2 registers. */
> -
> -	trans0 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 2 */
> -
> -	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> -	trans2 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -
> -	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> -
> -	/* get slot 3 */
> -
> -	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> -	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -static inline void
> -acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> -	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> -	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> -{
> -	flows->num_packets = 0;
> -	flows->started = 0;
> -	flows->trie = 0;
> -	flows->last_cmplt = NULL;
> -	flows->cmplt_array = cmplt;
> -	flows->total_packets = data_num;
> -	flows->categories = categories;
> -	flows->cmplt_size = cmplt_size;
> -	flows->data = data;
> -	flows->results = results;
> -	flows->trans = trans;
> -}
> -
> -/*
> - * Execute trie traversal with 8 traversals in parallel
> - */
> -static inline void
> -search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE8];
> -	struct completion cmplt[MAX_SEARCHES_SSE8];
> -	struct parms parms[MAX_SEARCHES_SSE8];
> -	xmm_t input0, input1;
> -	xmm_t indicies1, indicies2, indicies3, indicies4;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	/*
> -	 * indicies1 contains index_array[0,1]
> -	 * indicies2 contains index_array[2,3]
> -	 * indicies3 contains index_array[4,5]
> -	 * indicies4 contains index_array[6,7]
> -	 */
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> -	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> -
> -	 /* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -	acl_match_check_x4(4, ctx, parms, &flows,
> -		&indicies3, &indicies4, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> -			0);
> -		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> -			0);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> -
> -		 /* Process the 4 bytes of input on each stream. */
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		 /* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -		acl_match_check_x4(4, ctx, parms, &flows,
> -			&indicies3, &indicies4, mm_match_mask.m);
> -	}
> -}
> -
> -/*
> - * Execute trie traversal with 4 traversals in parallel
> - */
> -static inline void
> -search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	 uint32_t *results, int total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE4];
> -	struct completion cmplt[MAX_SEARCHES_SSE4];
> -	struct parms parms[MAX_SEARCHES_SSE4];
> -	xmm_t input, indicies1, indicies2;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -		input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -	}
> -}
> -
> -static inline xmm_t
> -transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1)
> -{
> -	uint64_t t;
> -	xmm_t addr, indicies2;
> -
> -	indicies2 = MM_XOR(ones_16, ones_16);
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, &indicies2);
> -
> -	/* Gather 64 bit transitions and pack 2 per register. */
> -
> -	t = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -/*
> - * Execute trie traversal with 2 traversals in parallel.
> - */
> -static inline void
> -search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE2];
> -	struct completion cmplt[MAX_SEARCHES_SSE2];
> -	struct parms parms[MAX_SEARCHES_SSE2];
> -	xmm_t input, indicies;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> -			mm_match_mask64.m);
> -	}
> -}
> -
> -/*
> - * When processing the transition, rather than using if/else
> - * construct, the offset is calculated for DFA and QRANGE and
> - * then conditionally added to the address based on node type.
> - * This is done to avoid branch mis-predictions. Since the
> - * offset is rather simple calculation it is more efficient
> - * to do the calculation and do a condition move rather than
> - * a conditional branch to determine which calculation to do.
> - */
> -static inline uint32_t
> -scan_forward(uint32_t input, uint32_t max)
> -{
> -	return (input == 0) ? max : rte_bsf32(input);
> -}
> -
> -static inline uint64_t
> -scalar_transition(const uint64_t *trans_table, uint64_t transition,
> -	uint8_t input)
> -{
> -	uint32_t addr, index, ranges, x, a, b, c;
> -
> -	/* break transition into component parts */
> -	ranges = transition >> (sizeof(index) * CHAR_BIT);
> -
> -	/* calc address for a QRANGE node */
> -	c = input * SCALAR_QRANGE_MULT;
> -	a = ranges | SCALAR_QRANGE_MIN;
> -	index = transition & ~RTE_ACL_NODE_INDEX;
> -	a -= (c & SCALAR_QRANGE_MASK);
> -	b = c & SCALAR_QRANGE_MIN;
> -	addr = transition ^ index;
> -	a &= SCALAR_QRANGE_MIN;
> -	a ^= (ranges ^ b) & (a ^ b);
> -	x = scan_forward(a, 32) >> 3;
> -	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> -
> -	/* pickup next transition */
> -	transition = *(trans_table + addr);
> -	return transition;
> -}
> -
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	int n;
> -	uint64_t transition0, transition1;
> -	uint32_t input0, input1;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SCALAR];
> -	struct completion cmplt[MAX_SEARCHES_SCALAR];
> -	struct parms parms[MAX_SEARCHES_SCALAR];
> -
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> -		categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	transition0 = index_array[0];
> -	transition1 = index_array[1];
> -
> -	while (flows.started > 0) {
> -
> -		input0 = GET_NEXT_4BYTES(parms, 0);
> -		input1 = GET_NEXT_4BYTES(parms, 1);
> -
> -		for (n = 0; n < 4; n++) {
> -			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> -				transition0 = scalar_transition(flows.trans,
> -					transition0, (uint8_t)input0);
> -
> -			input0 >>= CHAR_BIT;
> -
> -			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> -				transition1 = scalar_transition(flows.trans,
> -					transition1, (uint8_t)input1);
> -
> -			input1 >>= CHAR_BIT;
> -
> -		}
> -		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> -			transition0 = acl_match_check_transition(transition0,
> -				0, ctx, parms, &flows);
> -			transition1 = acl_match_check_transition(transition1,
> -				1, ctx, parms, &flows);
> -
> -		}
> -	}
> -	return 0;
> -}
> -
> -int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	if (likely(num >= MAX_SEARCHES_SSE8))
> -		search_sse_8(ctx, data, results, num, categories);
> -	else if (num >= MAX_SEARCHES_SSE4)
> -		search_sse_4(ctx, data, results, num, categories);
> -	else
> -		search_sse_2(ctx, data, results, num, categories);
> -
> -	return 0;
> -}
> diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
> new file mode 100644
> index 0000000..c39650e
> --- /dev/null
> +++ b/lib/librte_acl/acl_run.h
> @@ -0,0 +1,220 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef	_ACL_RUN_H_
> +#define	_ACL_RUN_H_
> +
> +#include <rte_acl.h>
> +#include "acl_vect.h"
> +#include "acl.h"
> +
> +#define MAX_SEARCHES_SSE8	8
> +#define MAX_SEARCHES_SSE4	4
> +#define MAX_SEARCHES_SSE2	2
> +#define MAX_SEARCHES_SCALAR	2
> +
> +#define GET_NEXT_4BYTES(prm, idx)	\
> +	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> +
> +
> +#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> +
> +#define	SCALAR_QRANGE_MULT	0x01010101
> +#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> +#define	SCALAR_QRANGE_MIN	0x80808080
> +
> +/*
> + * Structure to manage N parallel trie traversals.
> + * The runtime trie traversal routines can process 8, 4, or 2 tries
> + * in parallel. Each packet may require multiple trie traversals (up to 4).
> + * This structure is used to fill the slots (0 to n-1) for parallel processing
> + * with the trie traversals needed for each packet.
> + */
> +struct acl_flow_data {
> +	uint32_t            num_packets;
> +	/* number of packets processed */
> +	uint32_t            started;
> +	/* number of trie traversals in progress */
> +	uint32_t            trie;
> +	/* current trie index (0 to N-1) */
> +	uint32_t            cmplt_size;
> +	uint32_t            total_packets;
> +	uint32_t            categories;
> +	/* number of result categories per packet. */
> +	/* maximum number of packets to process */
> +	const uint64_t     *trans;
> +	const uint8_t     **data;
> +	uint32_t           *results;
> +	struct completion  *last_cmplt;
> +	struct completion  *cmplt_array;
> +};
> +
> +/*
> + * Structure to maintain running results for
> + * a single packet (up to 4 tries).
> + */
> +struct completion {
> +	uint32_t *results;                          /* running results. */
> +	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> +	uint32_t  count;                            /* num of remaining tries */
> +	/* true for allocated struct */
> +} __attribute__((aligned(XMM_SIZE)));
> +
> +/*
> + * One parms structure for each slot in the search engine.
> + */
> +struct parms {
> +	const uint8_t              *data;
> +	/* input data for this packet */
> +	const uint32_t             *data_index;
> +	/* data indirection for this trie */
> +	struct completion          *cmplt;
> +	/* completion data for this packet */
> +};
> +
> +/*
> + * Define an global idle node for unused engine slots
> + */
> +static const uint32_t idle[UINT8_MAX + 1];
> +
> +/*
> + * Allocate a completion structure to manage the tries for a packet.
> + */
> +static inline struct completion *
> +alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> +	uint32_t *results)
> +{
> +	uint32_t n;
> +
> +	for (n = 0; n < size; n++) {
> +
> +		if (p[n].count == 0) {
> +
> +			/* mark as allocated and set number of tries. */
> +			p[n].count = tries;
> +			p[n].results = results;
> +			return &(p[n]);
> +		}
> +	}
> +
> +	/* should never get here */
> +	return NULL;
> +}
> +
> +/*
> + * Resolve priority for a single result trie.
> + */
> +static inline void
> +resolve_single_priority(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p)
> +{
> +	if (parms[n].cmplt->count == ctx->num_tries ||
> +			parms[n].cmplt->priority[0] <=
> +			p[transition].priority[0]) {
> +
> +		parms[n].cmplt->priority[0] = p[transition].priority[0];
> +		parms[n].cmplt->results[0] = p[transition].results[0];
> +	}
> +}
> +
> +/*
> + * Routine to fill a slot in the parallel trie traversal array (parms) from
> + * the list of packets (flows).
> + */
> +static inline uint64_t
> +acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> +	const struct rte_acl_ctx *ctx)
> +{
> +	uint64_t transition;
> +
> +	/* if there are any more packets to process */
> +	if (flows->num_packets < flows->total_packets) {
> +		parms[n].data = flows->data[flows->num_packets];
> +		parms[n].data_index = ctx->trie[flows->trie].data_index;
> +
> +		/* if this is the first trie for this packet */
> +		if (flows->trie == 0) {
> +			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> +				flows->cmplt_size, ctx->num_tries,
> +				flows->results +
> +				flows->num_packets * flows->categories);
> +		}
> +
> +		/* set completion parameters and starting index for this slot */
> +		parms[n].cmplt = flows->last_cmplt;
> +		transition =
> +			flows->trans[parms[n].data[*parms[n].data_index++] +
> +			ctx->trie[flows->trie].root_index];
> +
> +		/*
> +		 * if this is the last trie for this packet,
> +		 * then setup next packet.
> +		 */
> +		flows->trie++;
> +		if (flows->trie >= ctx->num_tries) {
> +			flows->trie = 0;
> +			flows->num_packets++;
> +		}
> +
> +		/* keep track of number of active trie traversals */
> +		flows->started++;
> +
> +	/* no more tries to process, set slot to an idle position */
> +	} else {
> +		transition = ctx->idle;
> +		parms[n].data = (const uint8_t *)idle;
> +		parms[n].data_index = idle;
> +	}
> +	return transition;
> +}
> +
> +static inline void
> +acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> +	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> +	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> +{
> +	flows->num_packets = 0;
> +	flows->started = 0;
> +	flows->trie = 0;
> +	flows->last_cmplt = NULL;
> +	flows->cmplt_array = cmplt;
> +	flows->total_packets = data_num;
> +	flows->categories = categories;
> +	flows->cmplt_size = cmplt_size;
> +	flows->data = data;
> +	flows->results = results;
> +	flows->trans = trans;
> +}
> +
> +#endif /* _ACL_RUN_H_ */
> diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
> new file mode 100644
> index 0000000..a59ff17
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_scalar.c
> @@ -0,0 +1,198 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/*
> + * Resolve priority for multiple results (scalar version).
> + * This consists comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_scalar(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p, uint32_t categories)
> +{
> +	uint32_t i;
> +	int32_t *saved_priority;
> +	uint32_t *saved_results;
> +	const int32_t *priority;
> +	const uint32_t *results;
> +
> +	saved_results = parms[n].cmplt->results;
> +	saved_priority = parms[n].cmplt->priority;
> +
> +	/* results and priorities for completed trie */
> +	results = p[transition].results;
> +	priority = p[transition].priority;
> +
> +	/* if this is not the first completed trie */
> +	if (parms[n].cmplt->count != ctx->num_tries) {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +			if (saved_priority[i] <= priority[i]) {
> +				saved_priority[i] = priority[i];
> +				saved_results[i] = results[i];
> +			}
> +			if (saved_priority[i + 1] <= priority[i + 1]) {
> +				saved_priority[i + 1] = priority[i + 1];
> +				saved_results[i + 1] = results[i + 1];
> +			}
> +			if (saved_priority[i + 2] <= priority[i + 2]) {
> +				saved_priority[i + 2] = priority[i + 2];
> +				saved_results[i + 2] = results[i + 2];
> +			}
> +			if (saved_priority[i + 3] <= priority[i + 3]) {
> +				saved_priority[i + 3] = priority[i + 3];
> +				saved_results[i + 3] = results[i + 3];
> +			}
> +		}
> +	} else {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +			saved_priority[i] = priority[i];
> +			saved_priority[i + 1] = priority[i + 1];
> +			saved_priority[i + 2] = priority[i + 2];
> +			saved_priority[i + 3] = priority[i + 3];
> +
> +			saved_results[i] = results[i];
> +			saved_results[i + 1] = results[i + 1];
> +			saved_results[i + 2] = results[i + 2];
> +			saved_results[i + 3] = results[i + 3];
> +		}
> +	}
> +}
> +
> +/*
> + * When processing the transition, rather than using if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is rather simple calculation it is more efficient
> + * to do the calculation and do a condition move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> +
> +static inline uint64_t
> +scalar_transition(const uint64_t *trans_table, uint64_t transition,
> +	uint8_t input)
> +{
> +	uint32_t addr, index, ranges, x, a, b, c;
> +
> +	/* break transition into component parts */
> +	ranges = transition >> (sizeof(index) * CHAR_BIT);
> +
> +	/* calc address for a QRANGE node */
> +	c = input * SCALAR_QRANGE_MULT;
> +	a = ranges | SCALAR_QRANGE_MIN;
> +	index = transition & ~RTE_ACL_NODE_INDEX;
> +	a -= (c & SCALAR_QRANGE_MASK);
> +	b = c & SCALAR_QRANGE_MIN;
> +	addr = transition ^ index;
> +	a &= SCALAR_QRANGE_MIN;
> +	a ^= (ranges ^ b) & (a ^ b);
> +	x = scan_forward(a, 32) >> 3;
> +	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> +
> +	/* pickup next transition */
> +	transition = *(trans_table + addr);
> +	return transition;
> +}
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	int n;
> +	uint64_t transition0, transition1;
> +	uint32_t input0, input1;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SCALAR];
> +	struct completion cmplt[MAX_SEARCHES_SCALAR];
> +	struct parms parms[MAX_SEARCHES_SCALAR];
> +
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> +		categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	transition0 = index_array[0];
> +	transition1 = index_array[1];
> +
> +	while (flows.started > 0) {
> +
> +		input0 = GET_NEXT_4BYTES(parms, 0);
> +		input1 = GET_NEXT_4BYTES(parms, 1);
> +
> +		for (n = 0; n < 4; n++) {
> +			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> +				transition0 = scalar_transition(flows.trans,
> +					transition0, (uint8_t)input0);
> +
> +			input0 >>= CHAR_BIT;
> +
> +			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> +				transition1 = scalar_transition(flows.trans,
> +					transition1, (uint8_t)input1);
> +
> +			input1 >>= CHAR_BIT;
> +
> +		}
> +		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> +			transition0 = acl_match_check(transition0,
> +				0, ctx, parms, &flows, resolve_priority_scalar);
> +			transition1 = acl_match_check(transition1,
> +				1, ctx, parms, &flows, resolve_priority_scalar);
> +
> +		}
> +	}
> +	return 0;
> +}
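
For reference, the calling convention here is the same as for
rte_acl_classify(); a minimal caller could look like this (hypothetical
buffers, error handling elided):

	const uint8_t *data[2] = {buf0, buf1};	/* hypothetical inputs */
	uint32_t results[2];
	int ret;

	/* 2 input buffers, 1 result category per buffer */
	ret = rte_acl_classify_scalar(ctx, data, results, 2, 1);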
> diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
> new file mode 100644
> index 0000000..3f5c721
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_sse.c
> @@ -0,0 +1,627 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +enum {
> +	SHUFFLE32_SLOT1 = 0xe5,
> +	SHUFFLE32_SLOT2 = 0xe6,
> +	SHUFFLE32_SLOT3 = 0xe7,
> +	SHUFFLE32_SWAP64 = 0x4e,
> +};
> +
> +static const rte_xmm_t mm_type_quad_range = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +	},
> +};
> +
> +static const rte_xmm_t mm_type_quad_range64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		0,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input = {
> +	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input64 = {
> +	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> +};
> +
> +static const rte_xmm_t mm_ones_16 = {
> +	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> +};
> +
> +static const rte_xmm_t mm_bytes = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> +};
> +
> +static const rte_xmm_t mm_bytes64 = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> +};
> +
> +static const rte_xmm_t mm_match_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +	},
> +};
> +
> +static const rte_xmm_t mm_match_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		0,
> +		0,
> +	},
> +};
> +
> +
> +/*
> + * Resolve priority for multiple results (sse version).
> + * This consists of comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories)
> +{
> +	uint32_t x;
> +	xmm_t results, priority, results1, priority1, selector;
> +	xmm_t *saved_results, *saved_priority;
> +
> +	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> +		saved_priority =
> +			(xmm_t *)(&parms[n].cmplt->priority[x]);
> +
> +		/* get results and priorities for completed trie */
> +		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> +		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> +
> +		/* if this is not the first completed trie */
> +		if (parms[n].cmplt->count != ctx->num_tries) {
> +
> +			/* get running best results and their priorities */
> +			results1 = MM_LOADU(saved_results);
> +			priority1 = MM_LOADU(saved_priority);
> +
> +			/* select results that are highest priority */
> +			selector = MM_CMPGT32(priority1, priority);
> +			results = MM_BLENDV8(results, results1, selector);
> +			priority = MM_BLENDV8(priority, priority1, selector);
> +		}
> +
> +		/* save running best results and their priorities */
> +		MM_STOREU(saved_results, results);
> +		MM_STOREU(saved_priority, priority);
> +	}
> +}
> +
> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check(transition1, slot, ctx,
> +		parms, flows, resolve_priority_sse);
> +	transition2 = acl_match_check(transition2, slot + 1, ctx,
> +		parms, flows, resolve_priority_sse);
> +
> +	/* update indicies with new transitions. */
> +	*indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	temp = MM_AND(match_mask, *indicies);
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies, slot, ctx, parms, flows);
> +		temp = MM_AND(match_mask, *indicies);
> +	}
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> +	xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	/* put low 32 bits of each transition into one register */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	/* test for match node */
> +	temp = MM_AND(match_mask, temp);
> +
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +					(__m128)*indicies2,
> +					0x88);
> +		temp = MM_AND(match_mask, temp);
> +	}
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr, node_types, temp;
> +
> +	/*
> +	 * Note that no transition is done for a match
> +	 * node and therefore a stream freezes when
> +	 * it reaches a match.
> +	 */
> +
> +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +		(__m128)*indicies2, 0xdd);
> +
> +	/* Calc node type and node addr */
> +	node_types = MM_ANDNOT(index_mask, temp);
> +	addr = MM_AND(index_mask, temp);
> +
> +	/*
> +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> +	 */
> +
> +	/* mask for DFA type (0) nodes */
> +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> +	/* add input byte to DFA position */
> +	temp = MM_AND(temp, bytes);
> +	temp = MM_AND(temp, next_input);
> +	addr = MM_ADD32(addr, temp);
> +
> +	/*
> +	 * Calc addr for Range nodes -> range_index + range(input)
> +	 */
> +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> +	/*
> +	 * Calculate number of range boundaries that are less than the
> +	 * input value. Range boundaries for each node are in signed 8 bit,
> +	 * ordered from -128 to 127 in the indicies2 register.
> +	 * This is effectively a popcnt of bytes that are greater than the
> +	 * input byte.
> +	 */
> +
> +	/* shuffle input byte to all 4 positions of 32 bit value */
> +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> +	/* check ranges */
> +	temp = MM_CMPGT8(temp, *indicies2);
> +
> +	/* convert -1 to 1 (bytes greater than input byte) */
> +	temp = MM_SIGN8(temp, temp);
> +
> +	/* horizontal add pairs of bytes into words */
> +	temp = MM_MADD8(temp, temp);
> +
> +	/* horizontal add pairs of words into dwords */
> +	temp = MM_MADD16(temp, ones_16);
> +
> +	/* mask to range type nodes */
> +	temp = MM_AND(temp, node_types);
> +
> +	/* add index into node position */
> +	return MM_ADD32(addr, temp);
> +}
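
The range-node comment above is worth restating in scalar terms: for a
QRANGE node, the lookup index is the count of range boundaries (signed
8-bit values, ordered from -128 to 127) that are less than the input byte.
A naive one-lane equivalent of the CMPGT8/SIGN8/MADD8/MADD16 sequence
would be (illustrative sketch only):

	/* sketch: count range boundaries below the input byte */
	static inline int
	qrange_index_naive(const int8_t bounds[4], int8_t input)
	{
		int i, cnt = 0;

		for (i = 0; i < 4; i++)
			if (input > bounds[i])
				cnt++;
		return cnt;
	}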
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr;
> +	uint64_t trans0, trans2;
> +
> +	 /* Calculate the address (array index) for all 4 transitions. */
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, indicies2);
> +
> +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> +	trans0 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 2 */
> +
> +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> +	trans2 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +
> +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> +	/* get slot 3 */
> +
> +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE8];
> +	struct completion cmplt[MAX_SEARCHES_SSE8];
> +	struct parms parms[MAX_SEARCHES_SSE8];
> +	xmm_t input0, input1;
> +	xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	/*
> +	 * indicies1 contains index_array[0,1]
> +	 * indicies2 contains index_array[2,3]
> +	 * indicies3 contains index_array[4,5]
> +	 * indicies4 contains index_array[6,7]
> +	 */
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> +	 /* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +	acl_match_check_x4(4, ctx, parms, &flows,
> +		&indicies3, &indicies4, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> +			0);
> +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> +			0);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> +		 /* Process the 4 bytes of input on each stream. */
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		 /* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +		acl_match_check_x4(4, ctx, parms, &flows,
> +			&indicies3, &indicies4, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	 uint32_t *results, int total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE4];
> +	struct completion cmplt[MAX_SEARCHES_SSE4];
> +	struct parms parms[MAX_SEARCHES_SSE4];
> +	xmm_t input, indicies1, indicies2;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +		input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1)
> +{
> +	uint64_t t;
> +	xmm_t addr, indicies2;
> +
> +	indicies2 = MM_XOR(ones_16, ones_16);
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, &indicies2);
> +
> +	/* Gather 64 bit transitions and pack 2 per register. */
> +
> +	t = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE2];
> +	struct completion cmplt[MAX_SEARCHES_SSE2];
> +	struct parms parms[MAX_SEARCHES_SSE2];
> +	xmm_t input, indicies;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +			mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..b9173c1 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,52 @@
> 
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> 
> +typedef int (*rte_acl_classify_t)
> +(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
> +
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/* By default, use the always-available scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;

Why not 'static'?
I thought you'd like to hide it from the external world.

> +
> +void rte_acl_select_classify(enum acl_classify_alg alg)
> +{
> +
> +	switch(alg)
> +	{
> +		case ACL_CLASSIFY_DEFAULT:
> +		case ACL_CLASSIFY_SCALAR:
> +			rte_acl_default_classify = rte_acl_classify_scalar;
> +			break;
> +		case ACL_CLASSIFY_SSE:
> +			rte_acl_default_classify = rte_acl_classify_sse;
> +			break;
> +	}
> +
> +}

As this is an init-phase function, I suppose we can add a check that alg has a valid (supported) value, and return an error if it does not.
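Something along these lines, perhaps (untested sketch):

	int
	rte_acl_select_classify(enum acl_classify_alg alg)
	{
		switch (alg) {
		case ACL_CLASSIFY_DEFAULT:
		case ACL_CLASSIFY_SCALAR:
			rte_acl_default_classify = rte_acl_classify_scalar;
			return 0;
		case ACL_CLASSIFY_SSE:
			if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
				return -ENOTSUP;
			rte_acl_default_classify = rte_acl_classify_sse;
			return 0;
		default:
			return -EINVAL;
		}
	}

rte_acl_init() would then have to check the return value, of course.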

> +
> +static void __attribute__((constructor))
> +rte_acl_init(void)
> +{
> +	enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +		alg = ACL_CLASSIFY_SSE;
> +
> +	rte_acl_select_classify(alg);
> +}
> +
> +inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +                            const uint8_t **data,
> +                            uint32_t *results, uint32_t num,
> +                            uint32_t categories)
> +{
> +	return rte_acl_default_classify(ctx, data, results, num, categories);
> +}
> +
> +
>  struct rte_acl_ctx *
>  rte_acl_find_existing(const char *name)
>  {
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index afc0f69..650b306 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
>   * If more than one rule is applicable for given input buffer and
>   * given category, then rule with highest priority will be returned as a match.
> + * Note that this function can be run only on CPUs with SSE4.1 support.
> + * It is up to the caller to make sure that it is only invoked on
> + * a machine that supports the SSE4.1 ISA.
>   * Note, that it is a caller responsibility to ensure that input parameters
>   * are valid and point to correct memory locations.
>   *
> @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * @return
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
> + *   -ENOTSUP for unsupported platforms.

Please remove the line above: the current implementation doesn't return -ENOTSUP
(I think that was left over from v1).

>   */
>  int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
>  	uint32_t *results, uint32_t num, uint32_t categories);
> 
>  /**
> @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
>   */
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);


As I said above, we'd better keep it.

> +
> +enum acl_classify_alg {
> +	ACL_CLASSIFY_DEFAULT = 0,
> +	ACL_CLASSIFY_SCALAR = 1,
> +	ACL_CLASSIFY_SSE = 2,
> +};

As a nit: as this enum is part of the public API, I think it is better to add an rte_ prefix: enum rte_acl_classify_alg.

> +
> +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +				   const uint8_t **data,
> +				   uint32_t *results, uint32_t num,
> +				   uint32_t categories);

Again as a nit: here and everywhere, can we keep the same style used throughout DPDK, with the function name starting on a new line:
extern int
rte_acl_classify(...);

> +/**
> + * Analyze the ISA of the current CPU and point rte_acl_default_classify
> + * to the highest applicable version of the classify function.
> + */
> +extern void
> +rte_acl_select_classify(enum acl_classify_alg alg);
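
It might also help to show the intended call pattern in the docs, e.g.
(sketch, using the names as they appear in this patch):

	/* force the scalar code path, e.g. for testing */
	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);

	/* subsequent lookups go through the selected implementation */
	ret = rte_acl_classify(ctx, data, results, num, categories);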
> 
>  /**
>   * Dump an ACL context structure to the console.
> --
> 1.9.3
