All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] lib/rib: Add Routing Information Base library
@ 2018-02-21 21:44 Medvedkin Vladimir
  2018-02-21 21:44 ` [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
  2018-02-21 21:44 ` [PATCH v2 2/2] Add autotests for " Medvedkin Vladimir
  0 siblings, 2 replies; 11+ messages in thread
From: Medvedkin Vladimir @ 2018-02-21 21:44 UTC (permalink / raw)
  To: dev; +Cc: Medvedkin Vladimir

 This patch series introduces new library librte_rib which potentially could
 replace librte_lpm.

Medvedkin Vladimir (2):
  Add RIB library
  Add autotests for RIB library

 config/common_base                 |   6 +
 doc/api/doxy-api.conf              |   1 +
 lib/Makefile                       |   2 +
 lib/librte_rib/Makefile            |  22 ++
 lib/librte_rib/rte_dir24_8.c       | 482 +++++++++++++++++++++++++++++++++
 lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
 lib/librte_rib/rte_rib.c           | 526 +++++++++++++++++++++++++++++++++++++
 lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
 lib/librte_rib/rte_rib_version.map |  18 ++
 mk/rte.app.mk                      |   1 +
 test/test/Makefile                 |   5 +
 test/test/test_rib.c               | 330 +++++++++++++++++++++++
 test/test/test_rib_generate_rt.c   | 297 +++++++++++++++++++++
 test/test/test_rib_generate_rt.h   |  38 +++
 test/test/test_rib_lpm_comp.c      | 187 +++++++++++++
 test/test/test_rib_perf.c          | 163 ++++++++++++
 16 files changed, 2516 insertions(+)
 create mode 100644 lib/librte_rib/Makefile
 create mode 100644 lib/librte_rib/rte_dir24_8.c
 create mode 100644 lib/librte_rib/rte_dir24_8.h
 create mode 100644 lib/librte_rib/rte_rib.c
 create mode 100644 lib/librte_rib/rte_rib.h
 create mode 100644 lib/librte_rib/rte_rib_version.map
 create mode 100644 test/test/test_rib.c
 create mode 100644 test/test/test_rib_generate_rt.c
 create mode 100644 test/test/test_rib_generate_rt.h
 create mode 100644 test/test/test_rib_lpm_comp.c
 create mode 100644 test/test/test_rib_perf.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] Add RIB library
  2018-02-21 21:44 [PATCH v2 0/2] lib/rib: Add Routing Information Base library Medvedkin Vladimir
@ 2018-02-21 21:44 ` Medvedkin Vladimir
  2018-03-14 11:09   ` Bruce Richardson
  2018-03-29 10:27   ` Bruce Richardson
  2018-02-21 21:44 ` [PATCH v2 2/2] Add autotests for " Medvedkin Vladimir
  1 sibling, 2 replies; 11+ messages in thread
From: Medvedkin Vladimir @ 2018-02-21 21:44 UTC (permalink / raw)
  To: dev; +Cc: Medvedkin Vladimir

RIB is an alternative to current LPM library.
It solves the following problems
 - Increases the speed of control plane operations against lpm such as
   adding/deleting routes
 - Adds abstraction from dataplane algorithms, so it is possible to add
   different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
   in addition to current dir24_8
 - It is possible to keep user defined application specific additional
   information in struct rte_rib_node which represents route entry.
   It can be next hop/set of next hops (i.e. active and feasible),
   pointers to link rte_rib_node based on some criteria (i.e. next_hop),
   plenty of additional control plane information.
 - For dir24_8 implementation it is possible to remove rte_lpm_tbl_entry.depth
   field that helps to save 6 bits.
 - Also new dir24_8 implementation supports different next_hop sizes
   (1/2/4/8 bytes per next hop)
 - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary operator.
   Instead it returns special default value if there is no route.

Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
---
 config/common_base                 |   6 +
 doc/api/doxy-api.conf              |   1 +
 lib/Makefile                       |   2 +
 lib/librte_rib/Makefile            |  22 ++
 lib/librte_rib/rte_dir24_8.c       | 482 +++++++++++++++++++++++++++++++++
 lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
 lib/librte_rib/rte_rib.c           | 526 +++++++++++++++++++++++++++++++++++++
 lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
 lib/librte_rib/rte_rib_version.map |  18 ++
 mk/rte.app.mk                      |   1 +
 10 files changed, 1496 insertions(+)
 create mode 100644 lib/librte_rib/Makefile
 create mode 100644 lib/librte_rib/rte_dir24_8.c
 create mode 100644 lib/librte_rib/rte_dir24_8.h
 create mode 100644 lib/librte_rib/rte_rib.c
 create mode 100644 lib/librte_rib/rte_rib.h
 create mode 100644 lib/librte_rib/rte_rib_version.map

diff --git a/config/common_base b/config/common_base
index ad03cf4..aceeff5 100644
--- a/config/common_base
+++ b/config/common_base
@@ -679,6 +679,12 @@ CONFIG_RTE_LIBRTE_LPM=y
 CONFIG_RTE_LIBRTE_LPM_DEBUG=n
 
 #
+# Compile librte_rib
+#
+CONFIG_RTE_LIBRTE_RIB=y
+CONFIG_RTE_LIBRTE_RIB_DEBUG=n
+
+#
 # Compile librte_acl
 #
 CONFIG_RTE_LIBRTE_ACL=y
diff --git a/doc/api/doxy-api.conf b/doc/api/doxy-api.conf
index cda52fd..8e4f969 100644
--- a/doc/api/doxy-api.conf
+++ b/doc/api/doxy-api.conf
@@ -60,6 +60,7 @@ INPUT                   = doc/api/doxy-api-index.md \
                           lib/librte_kvargs \
                           lib/librte_latencystats \
                           lib/librte_lpm \
+                          lib/librte_rib \
                           lib/librte_mbuf \
                           lib/librte_member \
                           lib/librte_mempool \
diff --git a/lib/Makefile b/lib/Makefile
index ec965a6..7f2323a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -43,6 +43,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_EFD) += librte_efd
 DEPDIRS-librte_efd := librte_eal librte_ring librte_hash
 DIRS-$(CONFIG_RTE_LIBRTE_LPM) += librte_lpm
 DEPDIRS-librte_lpm := librte_eal
+DIRS-$(CONFIG_RTE_LIBRTE_RIB) += librte_rib
+DEPDIRS-librte_rib := librte_eal
 DIRS-$(CONFIG_RTE_LIBRTE_ACL) += librte_acl
 DEPDIRS-librte_acl := librte_eal
 DIRS-$(CONFIG_RTE_LIBRTE_MEMBER) += librte_member
diff --git a/lib/librte_rib/Makefile b/lib/librte_rib/Makefile
new file mode 100644
index 0000000..cb97f02
--- /dev/null
+++ b/lib/librte_rib/Makefile
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_rib.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
+
+EXPORT_MAP := rte_rib_version.map
+
+LIBABIVER := 1
+
+# all source are stored in SRCS-y
+SRCS-$(CONFIG_RTE_LIBRTE_RIB) := rte_rib.c rte_dir24_8.c
+
+# install this header file
+SYMLINK-$(CONFIG_RTE_LIBRTE_RIB)-include := rte_rib.h rte_dir24_8.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_rib/rte_dir24_8.c b/lib/librte_rib/rte_dir24_8.c
new file mode 100644
index 0000000..a12f882
--- /dev/null
+++ b/lib/librte_rib/rte_dir24_8.c
@@ -0,0 +1,482 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <rte_debug.h>
+#include <rte_malloc.h>
+#include <rte_prefetch.h>
+#include <rte_errno.h>
+
+#include <inttypes.h>
+
+#include <rte_memory.h>
+#include <rte_branch_prediction.h>
+
+#include <rte_rib.h>
+#include <rte_dir24_8.h>
+
+#define BITMAP_SLAB_BIT_SIZE_LOG2	6
+#define BITMAP_SLAB_BIT_SIZE		(1 << BITMAP_SLAB_BIT_SIZE_LOG2)
+#define BITMAP_SLAB_BITMASK		(BITMAP_SLAB_BIT_SIZE - 1)
+
+#define ROUNDUP(x, y)	 RTE_ALIGN_CEIL(x, (1 << (32 - y)))
+
+static __rte_always_inline __attribute__((pure)) void *
+get_tbl24_p(struct rte_dir24_8_tbl *fib, uint32_t ip)
+{
+	return (void *)&((uint8_t *)fib->tbl24)[(ip &
+		RTE_DIR24_8_TBL24_MASK) >> (8 - fib->nh_sz)];
+}
+
+#define LOOKUP_FUNC(suffix, type, bulk_prefetch)			\
+int rte_dir24_8_lookup_bulk_##suffix(void *fib_p, const uint32_t *ips,	\
+	uint64_t *next_hops, const unsigned n)				\
+{									\
+	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;	\
+	uint64_t tmp;							\
+	uint32_t i;							\
+	uint32_t prefetch_offset = RTE_MIN((unsigned)bulk_prefetch, n);	\
+									\
+	RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ips == NULL) ||	\
+		(next_hops == NULL)), -EINVAL);				\
+									\
+	for (i = 0; i < prefetch_offset; i++)				\
+		rte_prefetch0(get_tbl24_p(fib, ips[i]));		\
+	for (i = 0; i < (n - prefetch_offset); i++) {			\
+		rte_prefetch0(get_tbl24_p(fib, ips[i + prefetch_offset])); \
+		tmp = ((type *)fib->tbl24)[ips[i] >> 8];		\
+		if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==	\
+			RTE_DIR24_8_VALID_EXT_ENT)) {			\
+			tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +	\
+				((tmp >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
+		}							\
+		next_hops[i] = tmp >> 1;				\
+	}								\
+	for (; i < n; i++) {						\
+		tmp = ((type *)fib->tbl24)[ips[i] >> 8];		\
+		if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==	\
+			RTE_DIR24_8_VALID_EXT_ENT)) {			\
+			tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +	\
+				((tmp >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
+		}							\
+		next_hops[i] = tmp >> 1;				\
+	}								\
+	return 0;							\
+}									\
+
+static void
+write_to_fib(void *ptr, uint64_t val, enum rte_dir24_8_nh_sz size, int n)
+{
+	int i;
+	uint8_t *ptr8 = (uint8_t *)ptr;
+	uint16_t *ptr16 = (uint16_t *)ptr;
+	uint32_t *ptr32 = (uint32_t *)ptr;
+	uint64_t *ptr64 = (uint64_t *)ptr;
+
+	switch (size) {
+	case RTE_DIR24_8_1B:
+		for (i = 0; i < n; i++)
+			ptr8[i] = (uint8_t)val;
+		break;
+	case RTE_DIR24_8_2B:
+		for (i = 0; i < n; i++)
+			ptr16[i] = (uint16_t)val;
+		break;
+	case RTE_DIR24_8_4B:
+		for (i = 0; i < n; i++)
+			ptr32[i] = (uint32_t)val;
+		break;
+	case RTE_DIR24_8_8B:
+		for (i = 0; i < n; i++)
+			ptr64[i] = (uint64_t)val;
+		break;
+	}
+}
+
+static int
+tbl8_get_idx(struct rte_dir24_8_tbl *fib)
+{
+	uint32_t i;
+	int bit_idx;
+
+	for (i = 0; (i < (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2)) &&
+		(fib->tbl8_idxes[i] == UINT64_MAX); i++)
+		;
+	if (i <= (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2)) {
+		bit_idx = __builtin_ctzll(~fib->tbl8_idxes[i]);
+		fib->tbl8_idxes[i] |= (1ULL << bit_idx);
+		return (i << BITMAP_SLAB_BIT_SIZE_LOG2) + bit_idx;
+	}
+	return -ENOSPC;
+}
+
+static inline void
+tbl8_free_idx(struct rte_dir24_8_tbl *fib, int idx)
+{
+	fib->tbl8_idxes[idx >> BITMAP_SLAB_BIT_SIZE_LOG2] &=
+		~(1ULL << (idx & BITMAP_SLAB_BITMASK));
+}
+
+static int
+tbl8_alloc(struct rte_dir24_8_tbl *fib, uint64_t nh)
+{
+	int	tbl8_idx;
+	uint8_t	*tbl8_ptr;
+
+	tbl8_idx = tbl8_get_idx(fib);
+	if (tbl8_idx < 0)
+		return tbl8_idx;
+	tbl8_ptr = (uint8_t *)fib->tbl8 +
+		((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) <<
+		fib->nh_sz);
+	/*Init tbl8 entries with nexthop from tbl24*/
+	write_to_fib((void *)tbl8_ptr, nh|
+		RTE_DIR24_8_VALID_EXT_ENT, fib->nh_sz,
+		RTE_DIR24_8_TBL8_GRP_NUM_ENT);
+	return tbl8_idx;
+}
+
+static void
+tbl8_recycle(struct rte_dir24_8_tbl *fib, uint32_t ip, uint64_t tbl8_idx)
+{
+	int i;
+	uint64_t nh;
+	uint8_t *ptr8;
+	uint16_t *ptr16;
+	uint32_t *ptr32;
+	uint64_t *ptr64;
+
+	switch (fib->nh_sz) {
+	case RTE_DIR24_8_1B:
+		ptr8 = &((uint8_t *)fib->tbl8)[tbl8_idx *
+				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
+		nh = *ptr8;
+		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
+			if (nh != ptr8[i])
+				return;
+		}
+		((uint8_t *)fib->tbl24)[ip >> 8] =
+			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
+		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
+			ptr8[i] = 0;
+		break;
+	case RTE_DIR24_8_2B:
+		ptr16 = &((uint16_t *)fib->tbl8)[tbl8_idx *
+				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
+		nh = *ptr16;
+		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
+			if (nh != ptr16[i])
+				return;
+		}
+		((uint16_t *)fib->tbl24)[ip >> 8] =
+			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
+		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
+			ptr16[i] = 0;
+		break;
+	case RTE_DIR24_8_4B:
+		ptr32 = &((uint32_t *)fib->tbl8)[tbl8_idx *
+				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
+		nh = *ptr32;
+		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
+			if (nh != ptr32[i])
+				return;
+		}
+		((uint32_t *)fib->tbl24)[ip >> 8] =
+			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
+		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
+			ptr32[i] = 0;
+		break;
+	case RTE_DIR24_8_8B:
+		ptr64 = &((uint64_t *)fib->tbl8)[tbl8_idx *
+				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
+		nh = *ptr64;
+		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
+			if (nh != ptr64[i])
+				return;
+		}
+		((uint64_t *)fib->tbl24)[ip >> 8] =
+			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
+		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
+			ptr64[i] = 0;
+		break;
+	}
+	tbl8_free_idx(fib, tbl8_idx);
+}
+
+static int
+install_to_fib(struct rte_dir24_8_tbl *fib, uint32_t ledge, uint32_t redge,
+	uint64_t next_hop)
+{
+	uint64_t	tbl24_tmp;
+	int	tbl8_idx;
+	int tmp_tbl8_idx;
+	uint8_t	*tbl8_ptr;
+
+	/*case for 0.0.0.0/0*/
+	if (unlikely((ledge == 0) && (redge == 0))) {
+		write_to_fib(fib->tbl24, next_hop << 1, fib->nh_sz, 1 << 24);
+		return 0;
+	}
+	if (ROUNDUP(ledge, 24) <= redge) {
+		if (ledge < ROUNDUP(ledge, 24)) {
+			tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
+			if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
+				RTE_DIR24_8_VALID_EXT_ENT) {
+				tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
+				tmp_tbl8_idx = tbl8_get_idx(fib);
+				if ((tbl8_idx < 0) || (tmp_tbl8_idx < 0))
+					return -ENOSPC;
+				tbl8_free_idx(fib, tmp_tbl8_idx);
+				/*update dir24 entry with tbl8 index*/
+				write_to_fib(get_tbl24_p(fib, ledge),
+					(tbl8_idx << 1)|
+					RTE_DIR24_8_VALID_EXT_ENT,
+					fib->nh_sz, 1);
+			} else
+				tbl8_idx = tbl24_tmp >> 1;
+			tbl8_ptr = (uint8_t *)fib->tbl8 +
+				(((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
+				(ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
+				fib->nh_sz);
+			/*update tbl8 with new next hop*/
+			write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
+				RTE_DIR24_8_VALID_EXT_ENT,
+				fib->nh_sz, ROUNDUP(ledge, 24) - ledge);
+			tbl8_recycle(fib, ledge, tbl8_idx);
+		}
+		if (ROUNDUP(ledge, 24) < (redge & RTE_DIR24_8_TBL24_MASK)) {
+			write_to_fib(get_tbl24_p(fib, ROUNDUP(ledge, 24)),
+				next_hop << 1, fib->nh_sz,
+				((redge & RTE_DIR24_8_TBL24_MASK) -
+				ROUNDUP(ledge, 24)) >> 8);
+		}
+		if (redge & ~RTE_DIR24_8_TBL24_MASK) {
+			tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, redge);
+			if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
+					RTE_DIR24_8_VALID_EXT_ENT) {
+				tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
+				if (tbl8_idx < 0)
+					return -ENOSPC;
+				/*update dir24 entry with tbl8 index*/
+				write_to_fib(get_tbl24_p(fib, redge),
+					(tbl8_idx << 1)|
+					RTE_DIR24_8_VALID_EXT_ENT,
+					fib->nh_sz, 1);
+			} else
+				tbl8_idx = tbl24_tmp >> 1;
+			tbl8_ptr = (uint8_t *)fib->tbl8 +
+				((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) <<
+				fib->nh_sz);
+			/*update tbl8 with new next hop*/
+			write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
+				RTE_DIR24_8_VALID_EXT_ENT,
+				fib->nh_sz, redge & ~RTE_DIR24_8_TBL24_MASK);
+			tbl8_recycle(fib, redge, tbl8_idx);
+		}
+	} else {
+		tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
+		if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
+			RTE_DIR24_8_VALID_EXT_ENT) {
+			tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
+			if (tbl8_idx < 0)
+				return -ENOSPC;
+			/*update dir24 entry with tbl8 index*/
+			write_to_fib(get_tbl24_p(fib, ledge),
+				(tbl8_idx << 1)|
+				RTE_DIR24_8_VALID_EXT_ENT,
+				fib->nh_sz, 1);
+		} else
+			tbl8_idx = tbl24_tmp >> 1;
+		tbl8_ptr = (uint8_t *)fib->tbl8 +
+			(((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
+			(ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
+			fib->nh_sz);
+		/*update tbl8 with new next hop*/
+		write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
+			RTE_DIR24_8_VALID_EXT_ENT,
+			fib->nh_sz, redge - ledge);
+		tbl8_recycle(fib, ledge, tbl8_idx);
+	}
+	return 0;
+}
+
+static int
+modify_fib(struct rte_rib *rib, uint32_t ip, uint8_t depth,
+	uint64_t next_hop)
+{
+	struct rte_rib_node *tmp = NULL;
+	struct rte_dir24_8_tbl *fib;
+	uint32_t ledge, redge;
+	int ret;
+
+	fib = rib->fib;
+
+	if (next_hop > DIR24_8_MAX_NH(fib))
+		return -EINVAL;
+
+	ledge = ip;
+	do {
+		tmp = rte_rib_tree_get_nxt(rib, ip, depth, tmp,
+			RTE_RIB_GET_NXT_COVER);
+		if (tmp != NULL) {
+			if (tmp->depth == depth)
+				continue;
+			redge = tmp->key;
+			if (ledge == redge) {
+				ledge = redge +
+					(uint32_t)(1ULL << (32 - tmp->depth));
+				continue;
+			}
+			ret = install_to_fib(fib, ledge, redge,
+				next_hop);
+			if (ret != 0)
+				return ret;
+			ledge = redge +
+				(uint32_t)(1ULL << (32 - tmp->depth));
+		} else {
+			redge = ip + (uint32_t)(1ULL << (32 - depth));
+			ret = install_to_fib(fib, ledge, redge,
+				next_hop);
+			if (ret != 0)
+				return ret;
+		}
+	} while (tmp);
+
+	return 0;
+}
+
+int
+rte_dir24_8_modify(struct rte_rib *rib, uint32_t ip, uint8_t depth,
+	uint64_t next_hop, enum rte_rib_op op)
+{
+	struct rte_dir24_8_tbl *fib;
+	struct rte_rib_node *tmp = NULL;
+	struct rte_rib_node *node;
+	struct rte_rib_node *parent;
+	int ret = 0;
+
+	if ((rib == NULL) || (depth > RTE_RIB_MAXDEPTH))
+		return -EINVAL;
+
+	fib = rib->fib;
+	RTE_ASSERT(fib);
+
+	ip &= (uint32_t)(UINT64_MAX << (32 - depth));
+
+	node = rte_rib_tree_lookup_exact(rib, ip, depth);
+	switch (op) {
+	case RTE_RIB_ADD:
+		if (node != NULL) {
+			if (node->nh == next_hop)
+				return 0;
+			ret = modify_fib(rib, ip, depth, next_hop);
+			if (ret == 0)
+				node->nh = next_hop;
+			return 0;
+		}
+		if (depth > 24) {
+			tmp = rte_rib_tree_get_nxt(rib, ip, 24, NULL,
+				RTE_RIB_GET_NXT_COVER);
+			if ((tmp == NULL) &&
+				(fib->cur_tbl8s >= fib->number_tbl8s))
+				return -ENOSPC;
+
+		}
+		node = rte_rib_tree_insert(rib, ip, depth);
+		if (node == NULL)
+			return -rte_errno;
+		node->nh = next_hop;
+		parent = rte_rib_tree_lookup_parent(node);
+		if ((parent != NULL) && (parent->nh == next_hop))
+			return 0;
+		ret = modify_fib(rib, ip, depth, next_hop);
+		if (ret) {
+			rte_rib_tree_remove(rib, ip, depth);
+			return ret;
+		}
+		if ((depth > 24) && (tmp == NULL))
+			fib->cur_tbl8s++;
+		return 0;
+	case RTE_RIB_DEL:
+		if (node == NULL)
+			return -ENOENT;
+
+		parent = rte_rib_tree_lookup_parent(node);
+		if (parent != NULL) {
+			if (parent->nh != node->nh)
+				ret = modify_fib(rib, ip, depth, parent->nh);
+		} else
+			ret = modify_fib(rib, ip, depth, fib->def_nh);
+		if (ret == 0) {
+			rte_rib_tree_remove(rib, ip, depth);
+			if (depth > 24) {
+				tmp = rte_rib_tree_get_nxt(rib, ip, 24, NULL,
+					RTE_RIB_GET_NXT_COVER);
+				if (tmp == NULL)
+					fib->cur_tbl8s--;
+			}
+		}
+		return ret;
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int socket_id,
+	enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh)
+{
+	char mem_name[RTE_RIB_NAMESIZE];
+	struct rte_dir24_8_tbl *fib;
+
+	snprintf(mem_name, sizeof(mem_name), "FIB_%s", name);
+	fib = rte_zmalloc_socket(name, sizeof(struct rte_dir24_8_tbl) +
+		RTE_DIR24_8_TBL24_NUM_ENT * (1 << nh_sz), RTE_CACHE_LINE_SIZE,
+		socket_id);
+	if (fib == NULL)
+		return fib;
+
+	snprintf(mem_name, sizeof(mem_name), "TBL8_%s", name);
+	fib->tbl8 = rte_zmalloc_socket(mem_name, RTE_DIR24_8_TBL8_GRP_NUM_ENT *
+			(1 << nh_sz) * RTE_DIR24_8_TBL8_NUM_GROUPS,
+			RTE_CACHE_LINE_SIZE, socket_id);
+	if (fib->tbl8 == NULL) {
+		rte_free(fib);
+		return NULL;
+	}
+	fib->def_nh = def_nh;
+	fib->nh_sz = nh_sz;
+	fib->number_tbl8s = RTE_MIN((uint32_t)RTE_DIR24_8_TBL8_NUM_GROUPS,
+				DIR24_8_MAX_NH(fib));
+
+	snprintf(mem_name, sizeof(mem_name), "TBL8_idxes_%s", name);
+	fib->tbl8_idxes = rte_zmalloc_socket(mem_name,
+			RTE_ALIGN_CEIL(fib->number_tbl8s, 64) >> 3,
+			RTE_CACHE_LINE_SIZE, socket_id);
+	if (fib->tbl8_idxes == NULL) {
+		rte_free(fib->tbl8);
+		rte_free(fib);
+		return NULL;
+	}
+
+	return fib;
+}
+
+void
+rte_dir24_8_free(void *fib_p)
+{
+	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
+
+	rte_free(fib->tbl8_idxes);
+	rte_free(fib->tbl8);
+	rte_free(fib);
+}
+
+LOOKUP_FUNC(1b, uint8_t, 5)
+LOOKUP_FUNC(2b, uint16_t, 6)
+LOOKUP_FUNC(4b, uint32_t, 15)
+LOOKUP_FUNC(8b, uint64_t, 12)
diff --git a/lib/librte_rib/rte_dir24_8.h b/lib/librte_rib/rte_dir24_8.h
new file mode 100644
index 0000000..f779409
--- /dev/null
+++ b/lib/librte_rib/rte_dir24_8.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#ifndef _RTE_DIR24_8_H_
+#define _RTE_DIR24_8_H_
+
+/**
+ * @file
+ * RTE Longest Prefix Match (LPM)
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/** @internal Total number of tbl24 entries. */
+#define RTE_DIR24_8_TBL24_NUM_ENT	(1 << 24)
+
+/** Maximum depth value possible for IPv4 LPM. */
+#define RTE_DIR24_8_MAX_DEPTH		32
+
+/** @internal Number of entries in a tbl8 group. */
+#define RTE_DIR24_8_TBL8_GRP_NUM_ENT	256
+
+/** @internal Total number of tbl8 groups in the tbl8. */
+#define RTE_DIR24_8_TBL8_NUM_GROUPS	65536
+
+/** @internal bitmask with valid and valid_group fields set */
+#define RTE_DIR24_8_VALID_EXT_ENT	0x01
+
+#define RTE_DIR24_8_TBL24_MASK		0xffffff00
+
+/** Size of nexthop (1 << nh_sz) bits */
+enum rte_dir24_8_nh_sz {
+	RTE_DIR24_8_1B,
+	RTE_DIR24_8_2B,
+	RTE_DIR24_8_4B,
+	RTE_DIR24_8_8B
+};
+
+
+#define DIR24_8_BITS_IN_NH(fib)		(8 * (1 << fib->nh_sz))
+#define DIR24_8_MAX_NH(fib)	((1ULL << (DIR24_8_BITS_IN_NH(fib) - 1)) - 1)
+
+#define DIR24_8_TBL_IDX(a, fib)		((a) >> (3 - fib->nh_sz))
+#define DIR24_8_PSD_IDX(a, fib)		((a) & ((1 << (3 - fib->nh_sz)) - 1))
+
+#define DIR24_8_TBL24_VAL(ip)	(ip >> 8)
+#define DIR24_8_TBL8_VAL(res, ip)					\
+	((res >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT + (uint8_t)ip)	\
+
+#define DIR24_8_LOOKUP_MSK						\
+	(((1ULL << ((1 << (fib->nh_sz + 3)) - 1)) << 1) - 1)		\
+
+#define RTE_DIR24_8_GET_TBL24(fib, ip)					\
+	((fib->tbl24[DIR24_8_TBL_IDX(DIR24_8_TBL24_VAL(ip), fib)] >>	\
+	(DIR24_8_PSD_IDX(DIR24_8_TBL24_VAL(ip), fib) *			\
+	DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)			\
+
+#define RTE_DIR24_8_GET_TBL8(fib, res, ip)				\
+	((fib->tbl8[DIR24_8_TBL_IDX(DIR24_8_TBL8_VAL(res, ip), fib)] >>	\
+	(DIR24_8_PSD_IDX(DIR24_8_TBL8_VAL(res, ip), fib) *		\
+	DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)			\
+
+
+struct rte_dir24_8_tbl {
+	uint32_t	number_tbl8s;	/**< Total number of tbl8s. */
+	uint32_t	cur_tbl8s;	/**< Current cumber of tbl8s. */
+	uint64_t	def_nh;
+	enum rte_dir24_8_nh_sz	nh_sz;	/**< Size of nexthop entry */
+	uint64_t	*tbl8;		/**< LPM tbl8 table. */
+	uint64_t	*tbl8_idxes;
+	uint64_t	tbl24[0] __rte_cache_aligned; /**< LPM tbl24 table. */
+};
+
+struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int socket_id,
+	enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh);
+void rte_dir24_8_free(void *fib_p);
+int rte_dir24_8_modify(struct rte_rib *rib, uint32_t key,
+	uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
+int rte_dir24_8_lookup_bulk_1b(void *fib_p, const uint32_t *ips,
+	uint64_t *next_hops, const unsigned n);
+int rte_dir24_8_lookup_bulk_2b(void *fib_p, const uint32_t *ips,
+	uint64_t *next_hops, const unsigned n);
+int rte_dir24_8_lookup_bulk_4b(void *fib_p, const uint32_t *ips,
+	uint64_t *next_hops, const unsigned n);
+int rte_dir24_8_lookup_bulk_8b(void *fib_p, const uint32_t *ips,
+	uint64_t *next_hops, const unsigned n);
+
+
+static inline int
+rte_dir24_8_lookup(void *fib_p, uint32_t ip, uint64_t *next_hop)
+{
+	uint64_t res;
+	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
+
+	RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ip == NULL) ||
+		(next_hop == NULL)), -EINVAL);
+
+	res = RTE_DIR24_8_GET_TBL24(fib, ip);
+	if (unlikely((res & RTE_DIR24_8_VALID_EXT_ENT) ==
+		RTE_DIR24_8_VALID_EXT_ENT)) {
+		res = RTE_DIR24_8_GET_TBL8(fib, res, ip);
+	}
+	*next_hop = res >> 1;
+	return 0;
+}
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_DIR24_8_H_ */
+
diff --git a/lib/librte_rib/rte_rib.c b/lib/librte_rib/rte_rib.c
new file mode 100644
index 0000000..7783b23
--- /dev/null
+++ b/lib/librte_rib/rte_rib.c
@@ -0,0 +1,526 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/queue.h>
+
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_common.h>
+#include <rte_tailq.h>
+#include <rte_errno.h>
+#include <rte_rwlock.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+#include <rte_log.h>
+
+#include <rte_rib.h>
+#include <rte_dir24_8.h>
+
+TAILQ_HEAD(rte_rib_list, rte_tailq_entry);
+static struct rte_tailq_elem rte_rib_tailq = {
+	.name = "RTE_RIB",
+};
+EAL_REGISTER_TAILQ(rte_rib_tailq)
+
+static struct rte_rib_node *
+new_node_malloc(struct rte_rib *rib)
+{
+	struct rte_rib_node *ent;
+
+	ent =  malloc(rib->node_sz);
+	if (unlikely(ent == NULL))
+		return NULL;
+	++rib->cur_nodes;
+	return ent;
+}
+
+static void
+free_node_malloc(__rte_unused struct rte_rib *rib, struct rte_rib_node *ent)
+{
+	--rib->cur_nodes;
+	free(ent);
+}
+
+static struct rte_rib_node *
+new_node_mempool(struct rte_rib *rib)
+{
+	struct rte_rib_node *ent;
+	int ret;
+
+	ret = rte_mempool_get(rib->node_pool, (void *)&ent);
+	if (unlikely(ret != 0))
+		return NULL;
+	++rib->cur_nodes;
+	return ent;
+}
+
+static void
+free_node_mempool(struct rte_rib *rib, struct rte_rib_node *ent)
+{
+	--rib->cur_nodes;
+	rte_mempool_put(rib->node_pool, ent);
+}
+
+struct rte_rib_node *
+rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key)
+{
+	struct rte_rib_node *cur = rib->trie;
+	struct rte_rib_node *prev = NULL;
+
+	while ((cur != NULL) && (((cur->key ^ key) &
+		(uint32_t)(UINT64_MAX << (32 - cur->depth))) == 0)) {
+		if ((cur->flag & RTE_RIB_VALID_NODE) == RTE_RIB_VALID_NODE)
+			prev = cur;
+		cur = RTE_RIB_GET_NXT_NODE(cur, key);
+	}
+	return prev;
+}
+
+struct rte_rib_node *
+rte_rib_tree_lookup_parent(struct rte_rib_node *ent)
+{
+	struct rte_rib_node *tmp;
+
+	if (ent == NULL)
+		return NULL;
+	tmp = ent->parent;
+	while ((tmp != NULL) &&
+		(tmp->flag & RTE_RIB_VALID_NODE) != RTE_RIB_VALID_NODE) {
+		tmp = tmp->parent;
+}
+	return tmp;
+}
+
+struct rte_rib_node *
+rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t depth)
+{
+	struct rte_rib_node *cur = rib->trie;
+
+	key &= (uint32_t)(UINT64_MAX << (32 - depth));
+	while (cur != NULL) {
+		if ((cur->key == key) && (cur->depth == depth) &&
+				(cur->flag & RTE_RIB_VALID_NODE))
+			return cur;
+		if ((cur->depth > depth) ||
+				(((uint64_t)key >> (32 - cur->depth)) !=
+				((uint64_t)cur->key >> (32 - cur->depth))))
+			break;
+		cur = RTE_RIB_GET_NXT_NODE(cur, key);
+	}
+	return NULL;
+}
+
+struct rte_rib_node *
+rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key,
+	uint8_t depth, struct rte_rib_node *cur, int flag)
+{
+	struct rte_rib_node *tmp, *prev = NULL;
+
+	if (cur == NULL) {
+		tmp = rib->trie;
+		while ((tmp) && (tmp->depth < depth))
+			tmp = RTE_RIB_GET_NXT_NODE(tmp, key);
+	} else {
+		tmp = cur;
+		while ((tmp->parent != NULL) && (RTE_RIB_IS_RIGHT_NODE(tmp) ||
+				(tmp->parent->right == NULL))) {
+			tmp = tmp->parent;
+			if ((tmp->flag & RTE_RIB_VALID_NODE) &&
+				(RTE_RIB_IS_COVERED(tmp->key, tmp->depth,
+					key, depth)))
+				return tmp;
+		}
+		tmp = (tmp->parent) ? tmp->parent->right : NULL;
+	}
+	while (tmp) {
+		if ((tmp->flag & RTE_RIB_VALID_NODE) &&
+			(RTE_RIB_IS_COVERED(tmp->key, tmp->depth,
+				key, depth))) {
+			prev = tmp;
+			if (flag == RTE_RIB_GET_NXT_COVER)
+				return prev;
+		}
+		tmp = (tmp->left) ? tmp->left : tmp->right;
+	}
+	return prev;
+}
+
+void
+rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth)
+{
+	struct rte_rib_node *cur, *prev, *child;
+
+	cur = rte_rib_tree_lookup_exact(rib, key, depth);
+	if (cur == NULL)
+		return;
+
+	--rib->cur_routes;
+	cur->flag &= ~RTE_RIB_VALID_NODE;
+	while ((cur->flag & RTE_RIB_VALID_NODE) != RTE_RIB_VALID_NODE) {
+		if ((cur->left != NULL) && (cur->right != NULL))
+			return;
+		child = (cur->left == NULL) ? cur->right : cur->left;
+		if (child != NULL)
+			child->parent = cur->parent;
+		if (cur->parent == NULL) {
+			rib->trie = child;
+			rib->free_node(rib, cur);
+			return;
+		}
+		if (cur->parent->left == cur)
+			cur->parent->left = child;
+		else
+			cur->parent->right = child;
+		prev = cur;
+		cur = cur->parent;
+		rib->free_node(rib, prev);
+	}
+}
+
+struct rte_rib_node *
+rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth)
+{
+	struct rte_rib_node **tmp = &rib->trie;
+	struct rte_rib_node *prev = NULL;
+	struct rte_rib_node *new_node = NULL;
+	struct rte_rib_node *common_node = NULL;
+	int i = 0;
+	uint32_t common_prefix;
+	uint8_t common_depth;
+
+	if (depth > 32) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	key &= (uint32_t)(UINT64_MAX << (32 - depth));
+	new_node = rte_rib_tree_lookup_exact(rib, key, depth);
+	if (new_node != NULL) {
+		rte_errno = EEXIST;
+		return NULL;
+	}
+
+	new_node = rib->alloc_node(rib);
+	if (new_node == NULL) {
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+	new_node->left = NULL;
+	new_node->right = NULL;
+	new_node->parent = NULL;
+	new_node->key = key;
+	new_node->depth = depth;
+	new_node->flag = RTE_RIB_VALID_NODE;
+
+	while (1) {
+		if (*tmp == NULL) {
+			*tmp = new_node;
+			new_node->parent = prev;
+		}
+		if ((key == (*tmp)->key) && (depth == (*tmp)->depth)) {
+			if (new_node != *tmp) {
+				rib->free_node(rib, new_node);
+				(*tmp)->flag |= RTE_RIB_VALID_NODE;
+			}
+			++rib->cur_routes;
+			return *tmp;
+		}
+		i = (*tmp)->depth;
+		if ((i >= depth) || (((uint64_t)key >> (32 - i)) !=
+				((uint64_t)(*tmp)->key >> (32 - i))))
+			break;
+		prev = *tmp;
+		tmp = (key & (1 << (31 - i))) ? &(*tmp)->right : &(*tmp)->left;
+	}
+	common_depth = RTE_MIN(depth, (*tmp)->depth);
+	common_prefix = key ^ (*tmp)->key;
+	i = __builtin_clz(common_prefix);
+
+	common_depth = RTE_MIN(i, common_depth);
+	common_prefix = key & (uint32_t)(UINT64_MAX << (32 - common_depth));
+	if ((common_prefix == key) && (common_depth == depth)) {
+		if ((*tmp)->key & (1 << (31 - depth)))
+			new_node->right = *tmp;
+		else
+			new_node->left = *tmp;
+		new_node->parent = (*tmp)->parent;
+		(*tmp)->parent = new_node;
+		*tmp = new_node;
+	} else {
+		common_node = rib->alloc_node(rib);
+		if (common_node == NULL) {
+			rib->free_node(rib, new_node);
+			rte_errno = ENOMEM;
+			return NULL;
+		}
+		common_node->key = common_prefix;
+		common_node->depth = common_depth;
+		common_node->flag = 0;
+		common_node->parent = (*tmp)->parent;
+		new_node->parent = common_node;
+		(*tmp)->parent = common_node;
+		if ((new_node->key & (1 << (31 - common_depth))) == 0) {
+			common_node->left = new_node;
+			common_node->right = *tmp;
+		} else {
+			common_node->left = *tmp;
+			common_node->right = new_node;
+		}
+		*tmp = common_node;
+	}
+	++rib->cur_routes;
+	return new_node;
+}
+
+struct rte_rib *
+rte_rib_create(const char *name, int socket_id, struct rte_rib_conf *conf)
+{
+	char mem_name[RTE_RIB_NAMESIZE];
+	struct rte_rib *rib = NULL;
+	struct rte_tailq_entry *te;
+	struct rte_rib_list *rib_list;
+	struct rte_mempool *node_pool = NULL;
+	enum rte_dir24_8_nh_sz dir24_8_nh_size;
+
+	/* Check user arguments. */
+	if ((name == NULL) || (conf == NULL) || (socket_id < -1) ||
+			(conf->type >= RTE_RIB_TYPE_MAX) ||
+			(conf->alloc_type >= RTE_RIB_ALLOC_MAX) ||
+			(conf->max_nodes == 0) ||
+			(conf->node_sz < sizeof(struct rte_rib_node))) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	if (conf->alloc_type == RTE_RIB_MEMPOOL) {
+		snprintf(mem_name, sizeof(mem_name), "MP_%s", name);
+		node_pool = rte_mempool_create(mem_name, conf->max_nodes,
+			conf->node_sz, 0, 0, NULL, NULL, NULL, NULL,
+			socket_id, 0);
+
+		if (node_pool == NULL) {
+			RTE_LOG(ERR, LPM,
+				"Can not allocate mempool for RIB %s\n", name);
+			rte_errno = ENOMEM;
+			return NULL;
+		}
+
+	}
+
+	snprintf(mem_name, sizeof(mem_name), "RIB_%s", name);
+
+	rib_list = RTE_TAILQ_CAST(rte_rib_tailq.head, rte_rib_list);
+
+	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+	/* guarantee there's no existing */
+	TAILQ_FOREACH(te, rib_list, next) {
+		rib = (struct rte_rib *)te->data;
+		if (strncmp(name, rib->name, RTE_RIB_NAMESIZE) == 0)
+			break;
+	}
+	rib = NULL;
+	if (te != NULL) {
+		rte_errno = EEXIST;
+		goto exit;
+	}
+
+	/* allocate tailq entry */
+	te = rte_zmalloc("RIB_TAILQ_ENTRY", sizeof(*te), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, LPM,
+			"Can not allocate tailq entry for RIB %s\n", name);
+		rte_errno = ENOMEM;
+		goto exit;
+	}
+
+	/* Allocate memory to store the LPM data structures. */
+	rib = (struct rte_rib *)rte_zmalloc_socket(mem_name,
+		sizeof(struct rte_rib),	RTE_CACHE_LINE_SIZE, socket_id);
+	if (rib == NULL) {
+		RTE_LOG(ERR, LPM, "RIB %s memory allocation failed\n", name);
+		rte_errno = ENOMEM;
+		goto free_te;
+	}
+	snprintf(rib->name, sizeof(rib->name), "%s", name);
+	rib->trie = NULL;
+	rib->max_nodes = conf->max_nodes;
+	rib->node_sz = conf->node_sz;
+	rib->type = conf->type;
+	rib->alloc_type = conf->alloc_type;
+
+	if (conf->type <= RTE_RIB_DIR24_8_8B) {
+		switch (conf->type) {
+		case RTE_RIB_DIR24_8_1B:
+			dir24_8_nh_size = RTE_DIR24_8_1B;
+			rib->lookup = rte_dir24_8_lookup_bulk_1b;
+			break;
+		case RTE_RIB_DIR24_8_2B:
+			dir24_8_nh_size = RTE_DIR24_8_2B;
+			rib->lookup = rte_dir24_8_lookup_bulk_2b;
+			break;
+		case RTE_RIB_DIR24_8_4B:
+			dir24_8_nh_size = RTE_DIR24_8_4B;
+			rib->lookup = rte_dir24_8_lookup_bulk_4b;
+			break;
+		case RTE_RIB_DIR24_8_8B:
+			dir24_8_nh_size = RTE_DIR24_8_8B;
+			rib->lookup = rte_dir24_8_lookup_bulk_8b;
+			break;
+		case RTE_RIB_TYPE_MAX:
+		default:
+			RTE_LOG(ERR, LPM, "Bad RIB %s type\n", name);
+			rte_errno = EINVAL;
+			goto free_rib;
+		}
+		rib->fib = (void *)rte_dir24_8_create(name, socket_id,
+				dir24_8_nh_size, conf->def_nh);
+		if (rib->fib == NULL) {
+			RTE_LOG(ERR, LPM, "Failed to allocate FIB %s\n", name);
+			rte_errno = ENOMEM;
+			goto free_rib;
+		}
+		rib->modify = rte_dir24_8_modify;
+	}
+
+	switch (conf->alloc_type) {
+	case RTE_RIB_MALLOC:
+		rib->alloc_node = new_node_malloc;
+		rib->free_node = free_node_malloc;
+		break;
+	case RTE_RIB_MEMPOOL:
+		rib->node_pool = node_pool;
+		rib->alloc_node = new_node_mempool;
+		rib->free_node = free_node_mempool;
+		break;
+	case RTE_RIB_ALLOC_MAX:
+	default:
+		RTE_LOG(ERR, LPM, "Bad RIB %s alloc type\n", name);
+		rte_errno = EINVAL;
+		goto free_fib;
+	}
+
+	te->data = (void *)rib;
+	TAILQ_INSERT_TAIL(rib_list, te, next);
+
+	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	return rib;
+
+free_fib:
+	switch (conf->type) {
+	case RTE_RIB_DIR24_8_1B:
+	case RTE_RIB_DIR24_8_2B:
+	case RTE_RIB_DIR24_8_4B:
+	case RTE_RIB_DIR24_8_8B:
+		rte_dir24_8_free(rib->fib);
+		break;
+	default:
+		break;
+	}
+free_rib:
+	rte_free(rib);
+free_te:
+	rte_free(te);
+exit:
+	if (conf->alloc_type == RTE_RIB_MEMPOOL)
+		rte_mempool_free(node_pool);
+	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	return NULL;
+}
+
+struct rte_rib *
+rte_rib_find_existing(const char *name)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_tailq_entry *te;
+	struct rte_rib_list *rib_list;
+
+	rib_list = RTE_TAILQ_CAST(rte_rib_tailq.head, rte_rib_list);
+
+	rte_rwlock_read_lock(RTE_EAL_TAILQ_RWLOCK);
+	TAILQ_FOREACH(te, rib_list, next) {
+		rib = (struct rte_rib *) te->data;
+		if (strncmp(name, rib->name, RTE_RIB_NAMESIZE) == 0)
+			break;
+	}
+	rte_rwlock_read_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	if (te == NULL) {
+		rte_errno = ENOENT;
+		return NULL;
+	}
+
+	return rib;
+}
+
+void
+rte_rib_free(struct rte_rib *rib)
+{
+	struct rte_tailq_entry *te;
+	struct rte_rib_list *rib_list;
+	struct rte_rib_node *tmp = NULL;
+
+	if (rib == NULL)
+		return;
+
+	rib_list = RTE_TAILQ_CAST(rte_rib_tailq.head, rte_rib_list);
+
+	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+	/* find our tailq entry */
+	TAILQ_FOREACH(te, rib_list, next) {
+		if (te->data == (void *)rib)
+			break;
+	}
+	if (te != NULL)
+		TAILQ_REMOVE(rib_list, te, next);
+
+	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	while ((tmp = rte_rib_tree_get_nxt(rib, 0, 0, tmp,
+		RTE_RIB_GET_NXT_ALL)) != NULL)
+		rte_rib_tree_remove(rib, tmp->key, tmp->depth);
+
+	if (rib->alloc_type == RTE_RIB_MEMPOOL)
+		rte_mempool_free(rib->node_pool);
+
+	switch (rib->type) {
+	case RTE_RIB_DIR24_8_1B:
+	case RTE_RIB_DIR24_8_2B:
+	case RTE_RIB_DIR24_8_4B:
+	case RTE_RIB_DIR24_8_8B:
+		rte_dir24_8_free(rib->fib);
+		break;
+	default:
+		break;
+	}
+
+	rte_free(rib);
+	rte_free(te);
+}
+
+int
+rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t next_hop)
+{
+	if ((rib == NULL) || (depth > RTE_RIB_MAXDEPTH))
+		return -EINVAL;
+
+	return rib->modify(rib, ip, depth, next_hop, RTE_RIB_ADD);
+}
+
+int
+rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth)
+{
+	if ((rib == NULL) || (depth > RTE_RIB_MAXDEPTH))
+		return -EINVAL;
+
+	return rib->modify(rib, ip, depth, 0, RTE_RIB_DEL);
+}
diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
new file mode 100644
index 0000000..6eac8fb
--- /dev/null
+++ b/lib/librte_rib/rte_rib.h
@@ -0,0 +1,322 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#ifndef _RTE_RIB_H_
+#define _RTE_RIB_H_
+
+/**
+ * @file
+ * Compressed trie implementation for Longest Prefix Match
+ */
+
+/** @internal Macro to enable/disable run-time checks. */
+#if defined(RTE_LIBRTE_RIB_DEBUG)
+#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {	\
+	if (cond)					\
+		return retval;				\
+} while (0)
+#else
+#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
+#endif
+
+#define RTE_RIB_VALID_NODE	1
+#define RTE_RIB_GET_NXT_ALL	0
+#define RTE_RIB_GET_NXT_COVER	1
+
+#define RTE_RIB_INVALID_ROUTE	0
+#define RTE_RIB_VALID_ROUTE	1
+
+/** Max number of characters in RIB name. */
+#define RTE_RIB_NAMESIZE	64
+
+/** Maximum depth value possible for IPv4 RIB. */
+#define RTE_RIB_MAXDEPTH	32
+
+/**
+ * Macro to check if prefix1 {key1/depth1}
+ * is covered by prefix2 {key2/depth2}
+ */
+#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)			\
+	((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
+		&& (depth1 > depth2))
+
+/** @internal Macro to get next node in tree*/
+#define RTE_RIB_GET_NXT_NODE(node, key)					\
+	((key & (1 << (31 - node->depth))) ? node->right : node->left)
+/** @internal Macro to check if node is right child*/
+#define RTE_RIB_IS_RIGHT_NODE(node)	(node->parent->right == node)
+
+
+struct rte_rib_node {
+	struct rte_rib_node *left;
+	struct rte_rib_node *right;
+	struct rte_rib_node *parent;
+	uint32_t	key;
+	uint8_t		depth;
+	uint8_t		flag;
+	uint64_t	nh;
+	uint64_t	ext[0];
+};
+
+struct rte_rib;
+
+/** Type of FIB struct*/
+enum rte_rib_type {
+	RTE_RIB_DIR24_8_1B,
+	RTE_RIB_DIR24_8_2B,
+	RTE_RIB_DIR24_8_4B,
+	RTE_RIB_DIR24_8_8B,
+	RTE_RIB_TYPE_MAX
+};
+
+enum rte_rib_op {
+	RTE_RIB_ADD,
+	RTE_RIB_DEL
+};
+
+/** RIB nodes allocation type */
+enum rte_rib_alloc_type {
+	RTE_RIB_MALLOC,
+	RTE_RIB_MEMPOOL,
+	RTE_RIB_ALLOC_MAX
+};
+
+typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
+	uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
+typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
+	uint64_t *next_hops, const unsigned n);
+typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib *rib);
+typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
+	struct rte_rib_node *node);
+
+struct rte_rib {
+	char name[RTE_RIB_NAMESIZE];
+	/*pointer to rib trie*/
+	struct rte_rib_node	*trie;
+	/*pointer to dataplane struct*/
+	void	*fib;
+	/*prefix modification*/
+	rte_rib_modify_fn_t	modify;
+	/* Bulk lookup fn*/
+	rte_rib_tree_lookup_fn_t	lookup;
+	/*alloc trie element*/
+	rte_rib_alloc_node_fn_t	alloc_node;
+	/*free trie element*/
+	rte_rib_free_node_fn_t	free_node;
+	struct rte_mempool	*node_pool;
+	uint32_t		cur_nodes;
+	uint32_t		cur_routes;
+	int			max_nodes;
+	int			node_sz;
+	enum rte_rib_type	type;
+	enum rte_rib_alloc_type	alloc_type;
+};
+
+/** RIB configuration structure */
+struct rte_rib_conf {
+	enum rte_rib_type	type;
+	enum rte_rib_alloc_type	alloc_type;
+	int	max_nodes;
+	size_t	node_sz;
+	uint64_t def_nh;
+};
+
+/**
+ * Lookup an IP into the RIB structure
+ *
+ * @param rib
+ *  RIB object handle
+ * @param key
+ *  IP to be looked up in the RIB
+ * @return
+ *  pointer to struct rte_rib_node on success,
+ *  NULL otherwise
+ */
+struct rte_rib_node *
+rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
+
+/**
+ * Lookup less specific route into the RIB structure
+ *
+ * @param ent
+ *  Pointer to struct rte_rib_node that represents target route
+ * @return
+ *  pointer to struct rte_rib_node that represents
+ *  less specific route on success,
+ *  NULL otherwise
+ */
+struct rte_rib_node *
+rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
+
+/**
+ * Lookup prefix into the RIB structure
+ *
+ * @param rib
+ *  RIB object handle
+ * @param key
+ *  net to be looked up in the RIB
+ * @param depth
+ *  prefix length
+ * @return
+ *  pointer to struct rte_rib_node on success,
+ *  NULL otherwise
+ */
+struct rte_rib_node *
+rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t depth);
+
+/**
+ * Retrieve next more specific prefix from the RIB
+ * that is covered by key/depth supernet
+ *
+ * @param rib
+ *  RIB object handle
+ * @param key
+ *  net address of supernet prefix that covers returned more specific prefixes
+ * @param depth
+ *  supernet prefix length
+ * @param cur
+ *   pointer to the last returned prefix to get next prefix
+ *   or
+ *   NULL to get first more specific prefix
+ * @param flag
+ *  -RTE_RIB_GET_NXT_ALL
+ *   get all prefixes from subtrie
+ *  -RTE_RIB_GET_NXT_COVER
+ *   get only first more specific prefix even if it have more specifics
+ * @return
+ *  pointer to the next more specific prefix
+ *  or
+ *  NULL if there is no prefixes left
+ */
+struct rte_rib_node *
+rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
+	struct rte_rib_node *cur, int flag);
+
+/**
+ * Remove prefix from the RIB
+ *
+ * @param rib
+ *  RIB object handle
+ * @param key
+ *  net to be removed from the RIB
+ * @param depth
+ *  prefix length
+ */
+void
+rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
+
+/**
+ * Insert prefix into the RIB
+ *
+ * @param rib
+ *  RIB object handle
+ * @param key
+ *  net to be inserted to the RIB
+ * @param depth
+ *  prefix length
+ * @return
+ *  pointer to new rte_rib_node on success
+ *  NULL otherwise
+ */
+struct rte_rib_node *
+rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
+
+/**
+ * Create RIB
+ *
+ * @param name
+ *  RIB name
+ * @param socket_id
+ *  NUMA socket ID for RIB table memory allocation
+ * @param conf
+ *  Structure containing the configuration
+ * @return
+ *  Handle to RIB object on success
+ *  NULL otherwise with rte_errno set to an appropriate values.
+ */
+struct rte_rib *
+rte_rib_create(const char *name, int socket_id, struct rte_rib_conf *conf);
+
+/**
+ * Find an existing RIB object and return a pointer to it.
+ *
+ * @param name
+ *  Name of the rib object as passed to rte_rib_create()
+ * @return
+ *  Pointer to rib object or NULL if object not found with rte_errno
+ *  set appropriately. Possible rte_errno values include:
+ *   - ENOENT - required entry not available to return.
+ */
+struct rte_rib *
+rte_rib_find_existing(const char *name);
+
+/**
+ * Free an RIB object.
+ *
+ * @param rib
+ *   RIB object handle
+ * @return
+ *   None
+ */
+void
+rte_rib_free(struct rte_rib *rib);
+
+/**
+ * Add a rule to the RIB.
+ *
+ * @param rib
+ *   RIB object handle
+ * @param ip
+ *   IP of the rule to be added to the RIB
+ * @param depth
+ *   Depth of the rule to be added to the RIB
+ * @param next_hop
+ *   Next hop of the rule to be added to the RIB
+ * @return
+ *   0 on success, negative value otherwise
+ */
+int
+rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t next_hop);
+
+/**
+ * Delete a rule from the RIB.
+ *
+ * @param rib
+ *   RIB object handle
+ * @param ip
+ *   IP of the rule to be deleted from the RIB
+ * @param depth
+ *   Depth of the rule to be deleted from the RIB
+ * @return
+ *   0 on success, negative value otherwise
+ */
+int
+rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
+
+/**
+ * Lookup multiple IP addresses in an FIB. This may be implemented as a
+ * macro, so the address of the function should not be used.
+ *
+ * @param RIB
+ *   RIB object handle
+ * @param ips
+ *   Array of IPs to be looked up in the FIB
+ * @param next_hops
+ *   Next hop of the most specific rule found for IP.
+ *   This is an array of eight byte values.
+ *   If the lookup for the given IP failed, then corresponding element would
+ *   contain default value, see description of then next parameter.
+ * @param n
+ *   Number of elements in ips (and next_hops) array to lookup. This should be a
+ *   compile time constant, and divisible by 8 for best performance.
+ * @param defv
+ *   Default value to populate into corresponding element of hop[] array,
+ *   if lookup would fail.
+ *  @return
+ *   -EINVAL for incorrect arguments, otherwise 0
+ */
+#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)	\
+	rib->lookup(rib->fib, ips, next_hops, n)
+
+#endif /* _RTE_RIB_H_ */
diff --git a/lib/librte_rib/rte_rib_version.map b/lib/librte_rib/rte_rib_version.map
new file mode 100644
index 0000000..b193d6c
--- /dev/null
+++ b/lib/librte_rib/rte_rib_version.map
@@ -0,0 +1,18 @@
+DPDK_17.08 {
+	global:
+
+	rte_rib_create;
+	rte_rib_free;
+	rte_rib_tree_lookup;
+	rte_rib_tree_lookup_parent;
+	rte_rib_tree_lookup_exact;
+	rte_rib_tree_get_nxt;
+	rte_rib_tree_remove;
+	rte_rib_tree_insert;
+	rte_rib_find_existing;
+	rte_rib_add;
+	rte_rib_delete;
+	rte_rib_delete_all;
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d1..4708aa4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -70,6 +70,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_GRO)            += -lrte_gro
 _LDLIBS-$(CONFIG_RTE_LIBRTE_GSO)            += -lrte_gso
 _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)          += -lrte_meter
 _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)            += -lrte_lpm
+_LDLIBS-$(CONFIG_RTE_LIBRTE_RIB)            += -lrte_rib
 # librte_acl needs --whole-archive because of weak functions
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ACL)            += --whole-archive
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ACL)            += -lrte_acl
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] Add autotests for RIB library
  2018-02-21 21:44 [PATCH v2 0/2] lib/rib: Add Routing Information Base library Medvedkin Vladimir
  2018-02-21 21:44 ` [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
@ 2018-02-21 21:44 ` Medvedkin Vladimir
  1 sibling, 0 replies; 11+ messages in thread
From: Medvedkin Vladimir @ 2018-02-21 21:44 UTC (permalink / raw)
  To: dev; +Cc: Medvedkin Vladimir

Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
---
 test/test/Makefile               |   5 +
 test/test/test_rib.c             | 330 +++++++++++++++++++++++++++++++++++++++
 test/test/test_rib_generate_rt.c | 297 +++++++++++++++++++++++++++++++++++
 test/test/test_rib_generate_rt.h |  38 +++++
 test/test/test_rib_lpm_comp.c    | 187 ++++++++++++++++++++++
 test/test/test_rib_perf.c        | 163 +++++++++++++++++++
 6 files changed, 1020 insertions(+)
 create mode 100644 test/test/test_rib.c
 create mode 100644 test/test/test_rib_generate_rt.c
 create mode 100644 test/test/test_rib_generate_rt.h
 create mode 100644 test/test/test_rib_lpm_comp.c
 create mode 100644 test/test/test_rib_perf.c

diff --git a/test/test/Makefile b/test/test/Makefile
index a88cc38..7644f6d 100644
--- a/test/test/Makefile
+++ b/test/test/Makefile
@@ -119,6 +119,11 @@ SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm_perf.c
 SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm6.c
 SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm6_perf.c
 
+SRCS-$(CONFIG_RTE_LIBRTE_RIB) += test_rib.c
+SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_rib_generate_rt.c
+SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_rib_perf.c
+SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_rib_lpm_comp.c
+
 SRCS-y += test_debug.c
 SRCS-y += test_errno.c
 SRCS-y += test_tailq.c
diff --git a/test/test/test_rib.c b/test/test/test_rib.c
new file mode 100644
index 0000000..305ae34
--- /dev/null
+++ b/test/test/test_rib.c
@@ -0,0 +1,330 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include <rte_ip.h>
+#include <rte_rib.h>
+
+#include "test.h"
+#include "test_xmmt_ops.h"
+#include <rte_dir24_8.h>
+
+
+#define TEST_RIB_ASSERT(cond) do {				\
+	if (!(cond)) {						\
+		printf("Error at line %d:\n", __LINE__);	\
+		return -1;					\
+	}							\
+} while (0)
+
+typedef int32_t (*rte_rib_test)(void);
+
+static int32_t test0(void);
+static int32_t test1(void);
+static int32_t test2(void);
+static int32_t test3(void);
+static int32_t test4(void);
+static int32_t test5(void);
+
+static rte_rib_test tests[] = {
+/* Test Cases */
+	test0,
+	test1,
+	test2,
+	test3,
+	test4,
+	test5
+};
+
+#define NUM_RIB_TESTS (sizeof(tests)/sizeof(tests[0]))
+#define MAX_DEPTH 32
+#define MAX_RULES (1 << 22)
+#define NUMBER_TBL8S 4096
+#define PASS 0
+
+/*
+ * Check that rte_rib_create fails gracefully for incorrect user input
+ * arguments
+ */
+int32_t
+test0(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	/* rte_rib_create: rib name == NULL */
+	rib = rte_rib_create(NULL, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+
+	/* rte_rib_create: config == NULL */
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, NULL);
+	TEST_RIB_ASSERT(rib == NULL);
+
+	/* socket_id < -1 is invalid */
+	rib = rte_rib_create(__func__, -2, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+
+	/* rte_rib_create: max_nodes = 0 */
+	config.max_nodes = 0;
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+	config.max_nodes = MAX_RULES;
+
+	/* rte_rib_create: node_sz = 0 */
+	config.node_sz = 0;
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	/* rte_rib_create: invalid alloc type */
+	config.alloc_type = RTE_RIB_ALLOC_MAX;
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+	config.alloc_type = RTE_RIB_MALLOC;
+
+	/* rte_rib_create: invalid type */
+	config.type = RTE_RIB_TYPE_MAX;
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib == NULL);
+
+	return PASS;
+}
+
+/*
+ * Create rib table then delete rib table 10 times
+ * for every rib type/node alloc type
+ * Use a slightly different rules size each time
+ * */
+int32_t
+test1(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+	int32_t i, j, k;
+
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	for (k = RTE_RIB_MALLOC; k < RTE_RIB_ALLOC_MAX; k++) {
+		config.alloc_type = k;
+		for (j = RTE_RIB_DIR24_8_1B; j < RTE_RIB_TYPE_MAX; j++) {
+			config.type = j;
+			/* rte_rib_free: Free NULL */
+			for (i = 0; i < 2; i++) {
+				config.max_nodes = MAX_RULES - i;
+				rib = rte_rib_create(__func__, SOCKET_ID_ANY,
+					&config);
+				TEST_RIB_ASSERT(rib != NULL);
+				rte_rib_free(rib);
+			}
+		}
+	}
+	/* Can not test free so return success */
+	return PASS;
+}
+
+/*
+ * Call rte_rib_free for NULL pointer user input. Note: free has no return and
+ * therefore it is impossible to check for failure but this test is added to
+ * increase function coverage metrics and to validate that freeing null does
+ * not crash.
+ */
+int32_t
+test2(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	rte_rib_free(rib);
+	rte_rib_free(NULL);
+	return PASS;
+}
+
+/*
+ * Check that rte_rib_add fails gracefully for incorrect user input arguments
+ */
+int32_t
+test3(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+	uint32_t ip = IPv4(0, 0, 0, 0);
+	uint64_t next_hop = 100;
+	uint8_t depth = 24;
+	int32_t status = 0;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	/* rte_rib_add: rib == NULL */
+	status = rte_rib_add(NULL, ip, depth, next_hop);
+	TEST_RIB_ASSERT(status < 0);
+
+	/*Create vaild rib to use in rest of test. */
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	/* rte_rib_add: depth > MAX_DEPTH */
+	status = rte_rib_add(rib, ip, (MAX_DEPTH + 1), next_hop);
+	TEST_RIB_ASSERT(status < 0);
+
+	rte_rib_free(rib);
+
+	return PASS;
+}
+
+/*
+ * Check that rte_rib_delete fails gracefully for incorrect user input
+ * arguments
+ */
+int32_t
+test4(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+	uint32_t ip = IPv4(0, 0, 0, 0);
+	uint8_t depth = 24;
+	int32_t status = 0;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	/* rte_rib_delete: rib == NULL */
+	status = rte_rib_delete(NULL, ip, depth);
+	TEST_RIB_ASSERT(status < 0);
+
+	/*Create vaild rib to use in rest of test. */
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	/* rte_rib_delete: depth > MAX_DEPTH */
+	status = rte_rib_delete(rib, ip, (MAX_DEPTH + 1));
+	TEST_RIB_ASSERT(status < 0);
+
+	rte_rib_free(rib);
+
+	return PASS;
+}
+
+/*
+ * Call add, lookup and delete for a single rule with depth <= 24
+ */
+int32_t
+test5(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+
+	uint32_t ip = IPv4(190, 2, 0, 0);
+	uint64_t next_hop_add = 10;
+	uint64_t next_hop_return = 20;
+	uint64_t next_hop_default = 14;
+	uint8_t depth = 24;
+	uint32_t status = 0;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+	config.def_nh = next_hop_default;
+
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	status = rte_rib_add(rib, ip, depth, next_hop_add);
+	TEST_RIB_ASSERT(status == 0);
+
+	status = rte_rib_fib_lookup_bulk(rib, &ip, &next_hop_return, 1);
+	TEST_RIB_ASSERT((status == 0) && (next_hop_return == next_hop_add));
+
+	status = rte_rib_delete(rib, ip, depth);
+	TEST_RIB_ASSERT(status == 0);
+	status = rte_rib_fib_lookup_bulk(rib, &ip, &next_hop_return, 1);
+	TEST_RIB_ASSERT(next_hop_return == next_hop_default);
+
+	rte_rib_free(rib);
+
+	return PASS;
+}
+
+
+
+/*
+ * Check that rte_rib_tree_* fails gracefully for incorrect user input
+ * arguments
+ */
+/*
+
+int i;
+struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)rib->fib;
+uint8_t *tmp = (uint8_t *)fib->tbl24;
+for (i = 0; i < (1 << 24); i++) {
+	if (tmp[i] != 0) {
+		printf("I = %d; tmp = %d\n", i, tmp[i]);
+	}
+}
+
+
+int32_t
+testX(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf config;
+	uint32_t ip = IPv4(0, 0, 0, 0);
+	uint8_t depth = 24;
+	int32_t status = 0;
+
+	config.type = RTE_RIB_DIR24_8_1B;
+	config.alloc_type = RTE_RIB_MALLOC;
+	config.max_nodes = MAX_RULES;
+	config.node_sz = sizeof(struct rte_rib_node);
+
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(rib != NULL);
+
+
+}
+*/
+
+
+/*
+ * Do all unit tests.
+ */
+static int
+test_rib(void)
+{
+	unsigned i;
+	int status, global_status = 0;
+
+	for (i = 0; i < NUM_RIB_TESTS; i++) {
+		status = tests[i]();
+		if (status < 0) {
+			printf("ERROR: RIB Test %u: FAIL\n", i);
+			global_status = status;
+		}
+	}
+
+	return global_status;
+}
+
+REGISTER_TEST_COMMAND(rib_autotest, test_rib);
diff --git a/test/test/test_rib_generate_rt.c b/test/test/test_rib_generate_rt.c
new file mode 100644
index 0000000..9c30967
--- /dev/null
+++ b/test/test/test_rib_generate_rt.c
@@ -0,0 +1,297 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <math.h>
+
+#include <rte_random.h>
+#include <rte_cycles.h>
+#include <rte_ip.h>
+
+#include "test_rib_generate_rt.h"
+
+static uint32_t max_route_entries;
+static uint32_t num_route_entries;
+
+/* All following numbers of each depth of each common IP class are just
+ * got from previous large constant table in app/test/test_rib_routes.h .
+ * In order to match similar performance, they keep same depth and IP
+ * address coverage as previous constant table. These numbers don't
+ * include any private local IP address. As previous large const rule
+ * table was just dumped from a real router, there are no any IP address
+ * in class C or D.
+ */
+static struct route_rule_count rule_count = {
+	.a = { /* IP class A in which the most significant bit is 0 */
+		    0, /* depth =  1 */
+		    0, /* depth =  2 */
+		    1, /* depth =  3 */
+		    0, /* depth =  4 */
+		    2, /* depth =  5 */
+		    1, /* depth =  6 */
+		    3, /* depth =  7 */
+		  185, /* depth =  8 */
+		   26, /* depth =  9 */
+		   16, /* depth = 10 */
+		   39, /* depth = 11 */
+		  144, /* depth = 12 */
+		  233, /* depth = 13 */
+		  528, /* depth = 14 */
+		  866, /* depth = 15 */
+		 3856, /* depth = 16 */
+		 3268, /* depth = 17 */
+		 5662, /* depth = 18 */
+		17301, /* depth = 19 */
+		22226, /* depth = 20 */
+		11147, /* depth = 21 */
+		16746, /* depth = 22 */
+		17120, /* depth = 23 */
+		77578, /* depth = 24 */
+		  401, /* depth = 25 */
+		  656, /* depth = 26 */
+		 1107, /* depth = 27 */
+		 1121, /* depth = 28 */
+		 2316, /* depth = 29 */
+		  717, /* depth = 30 */
+		   10, /* depth = 31 */
+		   66  /* depth = 32 */
+	},
+	.b = { /* IP class A in which the most 2 significant bits are 10 */
+		    0, /* depth =  1 */
+		    0, /* depth =  2 */
+		    0, /* depth =  3 */
+		    0, /* depth =  4 */
+		    1, /* depth =  5 */
+		    1, /* depth =  6 */
+		    1, /* depth =  7 */
+		    3, /* depth =  8 */
+		    3, /* depth =  9 */
+		   30, /* depth = 10 */
+		   25, /* depth = 11 */
+		  168, /* depth = 12 */
+		  305, /* depth = 13 */
+		  569, /* depth = 14 */
+		 1129, /* depth = 15 */
+		50800, /* depth = 16 */
+		 1645, /* depth = 17 */
+		 1820, /* depth = 18 */
+		 3506, /* depth = 19 */
+		 3258, /* depth = 20 */
+		 3424, /* depth = 21 */
+		 4971, /* depth = 22 */
+		 6885, /* depth = 23 */
+		39771, /* depth = 24 */
+		  424, /* depth = 25 */
+		  170, /* depth = 26 */
+		  443, /* depth = 27 */
+		   92, /* depth = 28 */
+		  366, /* depth = 29 */
+		  377, /* depth = 30 */
+		    2, /* depth = 31 */
+		  200  /* depth = 32 */
+	},
+	.c = { /* IP class A in which the most 3 significant bits are 110 */
+		     0, /* depth =  1 */
+		     0, /* depth =  2 */
+		     0, /* depth =  3 */
+		     0, /* depth =  4 */
+		     0, /* depth =  5 */
+		     0, /* depth =  6 */
+		     0, /* depth =  7 */
+		    12, /* depth =  8 */
+		     8, /* depth =  9 */
+		     9, /* depth = 10 */
+		    33, /* depth = 11 */
+		    69, /* depth = 12 */
+		   237, /* depth = 13 */
+		  1007, /* depth = 14 */
+		  1717, /* depth = 15 */
+		 14663, /* depth = 16 */
+		  8070, /* depth = 17 */
+		 16185, /* depth = 18 */
+		 48261, /* depth = 19 */
+		 36870, /* depth = 20 */
+		 33960, /* depth = 21 */
+		 50638, /* depth = 22 */
+		 61422, /* depth = 23 */
+		466549, /* depth = 24 */
+		  1829, /* depth = 25 */
+		  4824, /* depth = 26 */
+		  4927, /* depth = 27 */
+		  5914, /* depth = 28 */
+		 10254, /* depth = 29 */
+		  4905, /* depth = 30 */
+		     1, /* depth = 31 */
+		   716  /* depth = 32 */
+	}
+};
+
+static void generate_random_rule_prefix(struct route_rule *rt,
+	uint32_t ip_class, uint8_t depth)
+{
+/* IP address class A, the most significant bit is 0 */
+#define IP_HEAD_MASK_A			0x00000000
+#define IP_HEAD_BIT_NUM_A		1
+
+/* IP address class B, the most significant 2 bits are 10 */
+#define IP_HEAD_MASK_B			0x80000000
+#define IP_HEAD_BIT_NUM_B		2
+
+/* IP address class C, the most significant 3 bits are 110 */
+#define IP_HEAD_MASK_C			0xC0000000
+#define IP_HEAD_BIT_NUM_C		3
+
+	uint32_t class_depth;
+	uint32_t range;
+	uint32_t mask;
+	uint32_t step;
+	uint32_t start;
+	uint32_t fixed_bit_num;
+	uint32_t ip_head_mask;
+	uint32_t rule_num;
+	uint32_t k;
+	struct route_rule *ptr_rule;
+
+	if (ip_class == IP_CLASS_A) {        /* IP Address class A */
+		fixed_bit_num = IP_HEAD_BIT_NUM_A;
+		ip_head_mask = IP_HEAD_MASK_A;
+		rule_num = rule_count.a[depth - 1];
+	} else if (ip_class == IP_CLASS_B) { /* IP Address class B */
+		fixed_bit_num = IP_HEAD_BIT_NUM_B;
+		ip_head_mask = IP_HEAD_MASK_B;
+		rule_num = rule_count.b[depth - 1];
+	} else {                             /* IP Address class C */
+		fixed_bit_num = IP_HEAD_BIT_NUM_C;
+		ip_head_mask = IP_HEAD_MASK_C;
+		rule_num = rule_count.c[depth - 1];
+	}
+
+	if ((rule_num == 0) || ((num_route_entries + rule_num) >=
+		max_route_entries))
+		return;
+
+	/* the number of rest bits which don't include the most significant
+	 * fixed bits for this IP address class
+	 */
+	class_depth = depth - fixed_bit_num;
+
+	/* range is the maximum number of rules for this depth and
+	 * this IP address class
+	 */
+	range = 1 << class_depth;
+
+	/* only mask the most depth significant generated bits
+	 * except fixed bits for IP address class
+	 */
+	mask = range - 1;
+
+	/* Widen coverage of IP address in generated rules */
+	if (range <= rule_num)
+		step = 1;
+	else
+		step = round((double)range / rule_num);
+
+	/* Only generate rest bits except the most significant
+	 * fixed bits for IP address class
+	 */
+	start = rte_rand() & mask;
+	ptr_rule = &rt[num_route_entries];
+	for (k = 0; k < rule_num; k++) {
+		ptr_rule->ip = (start << (RTE_RIB_MAXDEPTH - depth))
+			| ip_head_mask;
+		ptr_rule->depth = depth;
+		ptr_rule++;
+		start = (start + step) & mask;
+	}
+	num_route_entries += rule_num;
+}
+
+static void insert_rule_in_random_pos(struct route_rule *rt,
+	uint32_t ip, uint8_t depth)
+{
+	uint32_t pos;
+	int try_count = 0;
+	struct route_rule tmp;
+
+	do {
+		pos = rte_rand();
+		try_count++;
+	} while ((try_count < 10) && (pos > num_route_entries));
+
+	if ((pos > num_route_entries) || (pos > max_route_entries))
+		pos = num_route_entries >> 1;
+
+	tmp = rt[pos];
+	rt[pos].ip = ip;
+	rt[pos].depth = depth;
+	if (num_route_entries < max_route_entries)
+		rt[num_route_entries++] = tmp;
+}
+
+uint32_t
+generate_large_route_rule_table(uint32_t num_routes, struct route_rule *rt)
+{
+	uint32_t ip_class;
+	uint8_t  depth;
+
+	rte_srand(rte_rdtsc());
+	num_route_entries = 0;
+	max_route_entries = num_routes;
+	for (ip_class = IP_CLASS_A; ip_class <= IP_CLASS_C; ip_class++) {
+		for (depth = 1; depth <= RTE_RIB_MAXDEPTH; depth++)
+			generate_random_rule_prefix(rt, ip_class, depth);
+	}
+	/* Add following rules to keep same as previous large constant table,
+	 * they are 4 rules with private local IP address and 1 all-zeros prefix
+	 * with depth = 8.
+	 */
+	insert_rule_in_random_pos(rt, IPv4(0, 0, 0, 0), 8);
+	insert_rule_in_random_pos(rt, IPv4(10, 2, 23, 147), 32);
+	insert_rule_in_random_pos(rt, IPv4(192, 168, 100, 10), 24);
+	insert_rule_in_random_pos(rt, IPv4(192, 168, 25, 100), 24);
+	insert_rule_in_random_pos(rt, IPv4(192, 168, 129, 124), 32);
+
+	return num_route_entries;
+}
+
+void
+print_route_distribution(const struct route_rule *table, uint32_t n)
+{
+	unsigned i, j;
+
+	printf("Route distribution per prefix width:\n");
+	printf("DEPTH    QUANTITY (PERCENT)\n");
+	printf("---------------------------\n");
+
+	/* Count depths. */
+	for (i = 1; i <= 32; i++) {
+		unsigned depth_counter = 0;
+		double percent_hits;
+
+		for (j = 0; j < n; j++)
+			if (table[j].depth == (uint8_t) i)
+				depth_counter++;
+
+		percent_hits = ((double)depth_counter)/((double)n) * 100;
+		printf("%.2u%15u (%.2f)\n", i, depth_counter, percent_hits);
+	}
+	printf("\n");
+}
+
+void
+shuffle_rt(struct route_rule *rt, uint32_t n)
+{
+	uint32_t pos;
+	struct route_rule tmp;
+	uint32_t i;
+
+	for (i = 0; i < n; i++) {
+		pos = rte_rand() % n;
+		tmp = rt[pos];
+		rt[pos] = rt[i];
+		rt[i] = tmp;
+	}
+}
diff --git a/test/test/test_rib_generate_rt.h b/test/test/test_rib_generate_rt.h
new file mode 100644
index 0000000..90573c7
--- /dev/null
+++ b/test/test/test_rib_generate_rt.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#ifndef _TEST_RIB_GENERATE_RT_H_
+#define _TEST_RIB_GENERATE_RT_H_
+
+#define RTE_RIB_MAXDEPTH	32
+
+struct route_rule {
+	uint64_t nh;
+	uint32_t ip;
+	uint8_t depth;
+};
+
+enum {
+	IP_CLASS_A,
+	IP_CLASS_B,
+	IP_CLASS_C
+};
+
+/* struct route_rule_count defines the total number of rules in following a/b/c
+ * each item in a[]/b[]/c[] is the number of common IP address class A/B/C, not
+ * including the ones for private local network.
+ */
+struct route_rule_count {
+	uint32_t a[RTE_RIB_MAXDEPTH];
+	uint32_t b[RTE_RIB_MAXDEPTH];
+	uint32_t c[RTE_RIB_MAXDEPTH];
+};
+
+
+uint32_t generate_large_route_rule_table(uint32_t num_routes,
+	struct route_rule *rt);
+void print_route_distribution(const struct route_rule *table, uint32_t n);
+void shuffle_rt(struct route_rule *rt, uint32_t n);
+
+#endif /* _TEST_RIB_GENERATE_RT_H_ */
diff --git a/test/test/test_rib_lpm_comp.c b/test/test/test_rib_lpm_comp.c
new file mode 100644
index 0000000..d89e91c
--- /dev/null
+++ b/test/test/test_rib_lpm_comp.c
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include <rte_random.h>
+#include <rte_cycles.h>
+#include <rte_branch_prediction.h>
+#include <rte_ip.h>
+#include <rte_malloc.h>
+#include <rte_lpm.h>
+#include <rte_rib.h>
+
+#include "test.h"
+#include "test_xmmt_ops.h"
+#include "test_rib_generate_rt.h"
+
+#define TEST_RIB_ASSERT(cond) do {				\
+	if (!(cond)) {						\
+		printf("Error at line %d:\n", __LINE__);	\
+		return -1;					\
+	}							\
+} while (0)
+
+#define ITERATIONS (1 << 15)
+#define BATCH_SIZE (1 << 7)
+#define BULK_SIZE 32
+#define LPM_NH_MASK	((1 << 24) - 1)
+
+static const uint64_t default_nh;
+
+static int
+test_lookup(struct rte_rib *rib, struct rte_lpm *lpm)
+{
+	static uint32_t ip_batch[BATCH_SIZE];
+	uint64_t rib_next_hops[BULK_SIZE];
+	uint32_t lpm_next_hops[BULK_SIZE];
+	int i, j, k;
+
+	for (i = 0; i < ITERATIONS; i++) {
+		for (j = 0; j < BATCH_SIZE; j++)
+			ip_batch[j] = rte_rand();
+
+		for (j = 0; j < BATCH_SIZE; j += BULK_SIZE) {
+			rte_rib_fib_lookup_bulk(rib, &ip_batch[j],
+				rib_next_hops, BULK_SIZE);
+			rte_lpm_lookup_bulk(lpm, &ip_batch[j],
+				lpm_next_hops, BULK_SIZE);
+			for (k = 0; k < BULK_SIZE; k++) {
+				if (likely(lpm_next_hops[k] &
+					RTE_LPM_LOOKUP_SUCCESS))
+					lpm_next_hops[k] &= LPM_NH_MASK;
+				else
+					lpm_next_hops[k] = default_nh;
+			}
+			for (k = 0; k < BULK_SIZE; k++)
+				TEST_RIB_ASSERT(rib_next_hops[k] ==
+						lpm_next_hops[k]);
+		}
+	}
+	return 0;
+}
+
+static int
+test_rib_lpm_comp(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_lpm *lpm = NULL;
+	struct route_rule *rt = NULL;
+	unsigned i;
+	int rib_add = 0, lpm_add = 0;
+	int ret, nh_bits, nr_tbl8;
+	uint32_t num_routes;
+	struct rte_rib_conf conf;
+	struct rte_lpm_config config;
+
+	rte_srand(rte_rdtsc());
+
+	conf.max_nodes = 3000000;
+	conf.node_sz = sizeof(struct rte_rib_node);
+	conf.type = RTE_RIB_DIR24_8_8B;
+	conf.alloc_type = RTE_RIB_MEMPOOL;
+	conf.def_nh	= default_nh;
+
+	nh_bits = RTE_MIN(((1 << (3 + conf.type)) - 2), 24);
+	nr_tbl8 = RTE_MIN((1 << nh_bits), 65536);
+	config.number_tbl8s = nr_tbl8;
+	config.max_rules = 2000000;
+	config.flags = 0;
+
+	num_routes = 1200000;
+
+	rt = rte_zmalloc("struct route_rule *", sizeof(struct route_rule) *
+		num_routes + 5, 0);
+	TEST_RIB_ASSERT(rt != NULL);
+
+	num_routes = generate_large_route_rule_table(num_routes, rt);
+	TEST_RIB_ASSERT(num_routes != 0);
+	printf("No. routes = %u\n", (unsigned) num_routes);
+
+	shuffle_rt(rt, num_routes);
+
+	print_route_distribution(rt, (uint32_t) num_routes);
+
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &conf);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	lpm = rte_lpm_create(__func__, SOCKET_ID_ANY, &config);
+	TEST_RIB_ASSERT(lpm != NULL);
+
+	for (i = 0; i < num_routes; i++)
+		rt[i].nh = rte_rand() & ((1ULL << nh_bits) - 1);
+
+	for (i = 0; i < num_routes; i++) {
+		ret = rte_rib_add(rib, rt[i].ip, rt[i].depth, rt[i].nh);
+		if (ret == 0)
+			rib_add++;
+		else
+			continue;
+
+		ret = rte_lpm_add(lpm, rt[i].ip, rt[i].depth, rt[i].nh);
+		if (ret == 0)
+			lpm_add++;
+		else {
+			rte_rib_delete(rib, rt[i].ip, rt[i].depth);
+			rib_add--;
+		}
+	}
+	TEST_RIB_ASSERT(rib_add == lpm_add);
+
+	ret = test_lookup(rib, lpm);
+	if (ret != 0)
+		return ret;
+
+	for (i = 0; i < num_routes; i++) {
+		if ((i % 3) == 0) {
+			ret = rte_rib_delete(rib, rt[i].ip, rt[i].depth);
+			if (ret == 0)
+				rib_add--;
+			else
+				continue;
+
+			ret = rte_lpm_delete(lpm, rt[i].ip, rt[i].depth);
+			if (ret == 0)
+				lpm_add--;
+		}
+	}
+	TEST_RIB_ASSERT(rib_add == lpm_add);
+
+	ret = test_lookup(rib, lpm);
+	if (ret != 0)
+		return ret;
+
+	for (i = 0; i < num_routes; i++) {
+		if ((i % 6) == 0) {
+			ret = rte_rib_add(rib, rt[i].ip, rt[i].depth, rt[i].nh);
+			if (ret == 0)
+				rib_add++;
+			else
+				continue;
+
+			ret = rte_lpm_add(lpm, rt[i].ip, rt[i].depth, rt[i].nh);
+			if (ret == 0)
+				lpm_add++;
+			else {
+				rte_rib_delete(rib, rt[i].ip, rt[i].depth);
+				rib_add--;
+			}
+		}
+	}
+	TEST_RIB_ASSERT(rib_add == lpm_add);
+
+	ret = test_lookup(rib, lpm);
+	if (ret != 0)
+		return ret;
+
+	rte_rib_free(rib);
+	rte_lpm_free(lpm);
+	rte_free(rt);
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(rib_lpm_comp_autotest, test_rib_lpm_comp);
diff --git a/test/test/test_rib_perf.c b/test/test/test_rib_perf.c
new file mode 100644
index 0000000..f03475e
--- /dev/null
+++ b/test/test/test_rib_perf.c
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include <rte_cycles.h>
+#include <rte_random.h>
+#include <rte_branch_prediction.h>
+#include <rte_ip.h>
+#include <rte_malloc.h>
+#include <rte_rib.h>
+#include <rte_dir24_8.h>
+
+#include "test.h"
+#include "test_xmmt_ops.h"
+#include "test_rib_generate_rt.h"
+
+#define TEST_RIB_ASSERT(cond) do {				\
+	if (!(cond)) {						\
+		printf("Error at line %d:\n", __LINE__);	\
+		return -1;					\
+	}							\
+} while (0)
+
+#define ITERATIONS (1 << 15)
+#define BATCH_SIZE (1 << 12)
+#define BULK_SIZE 32
+
+static int
+test_rib_perf(void)
+{
+	struct rte_rib *rib = NULL;
+	struct rte_rib_conf conf;
+	struct route_rule *rt;
+	uint64_t begin, total_time;
+	uint64_t next_hop_add;
+	uint64_t default_nh = 0;
+	int64_t count = 0;
+	unsigned i, j;
+	int status = 0;
+	int ret;
+	uint32_t num_routes;
+
+	conf.max_nodes = 3000000;
+	conf.node_sz = sizeof(struct rte_rib_node);
+	conf.type = RTE_RIB_DIR24_8_4B;
+	conf.alloc_type = RTE_RIB_MEMPOOL;
+	conf.def_nh = default_nh;
+	rte_srand(rte_rdtsc());
+
+	num_routes = 1200000;
+
+	rt = rte_zmalloc("struct route_rule *", sizeof(struct route_rule) *
+		num_routes, 0);
+	TEST_RIB_ASSERT(rt != NULL);
+
+	num_routes = generate_large_route_rule_table(num_routes, rt);
+	TEST_RIB_ASSERT(num_routes != 0);
+
+	printf("No. routes = %u\n", (unsigned) num_routes);
+
+	shuffle_rt(rt, num_routes);
+
+	print_route_distribution(rt, (uint32_t) num_routes);
+
+	rib = rte_rib_create(__func__, SOCKET_ID_ANY, &conf);
+	TEST_RIB_ASSERT(rib != NULL);
+
+	/* Measue add. */
+	begin = rte_rdtsc();
+
+	for (i = 0; i < num_routes; i++) {
+		do {
+			next_hop_add = rte_rand() & ((1ULL << 30) - 1);
+		} while (next_hop_add == default_nh);
+
+		ret = rte_rib_add(rib, rt[i].ip, rt[i].depth, next_hop_add);
+		if ((ret == 0))
+			status++;
+	}
+
+	total_time = rte_rdtsc() - begin;
+
+	printf("Unique added entries = %d\n", status);
+	printf("Average RIB Add: %g cycles\n",
+			(double)total_time / num_routes);
+
+	/* Measure Lookup */
+	total_time = 0;
+	count = 0;
+	for (i = 0; i < ITERATIONS; i++) {
+		static uint32_t ip_batch[BATCH_SIZE];
+		uint64_t nh_64;
+		/* Create array of random IP addresses */
+		for (j = 0; j < BATCH_SIZE; j++)
+			ip_batch[j] = rte_rand();
+
+		/* Lookup per batch */
+		begin = rte_rdtsc();
+		for (j = 0; j < BATCH_SIZE; j++) {
+			ret = rte_dir24_8_lookup(rib->fib, ip_batch[j], &nh_64);
+			if (unlikely(nh_64 == default_nh))
+				count++;
+		}
+		total_time += rte_rdtsc() - begin;
+	}
+
+	printf("RIB Lookup: %.1f cycles (fails = %.1f%%)\n",
+			(double)total_time / ((double)ITERATIONS * BATCH_SIZE),
+			(count * 100.0) / (double)(ITERATIONS * BATCH_SIZE));
+
+	/* Measure bulk Lookup */
+	total_time = 0;
+	count = 0;
+	for (i = 0; i < ITERATIONS; i++) {
+		static uint32_t ip_batch[BATCH_SIZE];
+		uint64_t next_hops[BULK_SIZE];
+
+		/* Create array of random IP addresses */
+		for (j = 0; j < BATCH_SIZE; j++)
+			ip_batch[j] = rte_rand();
+
+		/* Lookup per batch */
+		begin = rte_rdtsc();
+		for (j = 0; j < BATCH_SIZE; j += BULK_SIZE)
+			rte_rib_fib_lookup_bulk(rib, &ip_batch[j], next_hops,
+				BULK_SIZE);
+
+		total_time += rte_rdtsc() - begin;
+		for (j = 0; j < BULK_SIZE; j++) {
+			if (next_hops[j] == default_nh)
+				count++;
+		}
+	}
+	printf("BULK RIB Lookup: %.1f cycles (fails = %.1f%%)\n",
+			(double)total_time / ((double)ITERATIONS * BATCH_SIZE),
+			(count * 100.0) / (double)(ITERATIONS * BATCH_SIZE));
+
+	/* Delete */
+	status = 0;
+	begin = rte_rdtsc();
+
+	for (i = 0; i < num_routes; i++) {
+		ret = rte_rib_delete(rib, rt[i].ip, rt[i].depth);
+		if (ret == 0)
+			status++;
+	}
+
+	total_time = rte_rdtsc() - begin;
+
+	printf("Average RIB Delete: %g cycles\n",
+			(double)total_time / num_routes);
+
+	rte_rib_free(rib);
+	rte_free(rt);
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(rib_perf_autotest, test_rib_perf);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-02-21 21:44 ` [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
@ 2018-03-14 11:09   ` Bruce Richardson
  2018-03-14 12:05     ` Richardson, Bruce
  2018-03-25 18:17     ` Vladimir Medvedkin
  2018-03-29 10:27   ` Bruce Richardson
  1 sibling, 2 replies; 11+ messages in thread
From: Bruce Richardson @ 2018-03-14 11:09 UTC (permalink / raw)
  To: Medvedkin Vladimir; +Cc: dev

On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> RIB is an alternative to current LPM library.
> It solves the following problems
>  - Increases the speed of control plane operations against lpm such as
>    adding/deleting routes
>  - Adds abstraction from dataplane algorithms, so it is possible to add
>    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
>    in addition to current dir24_8
>  - It is possible to keep user defined application specific additional
>    information in struct rte_rib_node which represents route entry.
>    It can be next hop/set of next hops (i.e. active and feasible),
>    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
>    plenty of additional control plane information.
>  - For dir24_8 implementation it is possible to remove rte_lpm_tbl_entry.depth
>    field that helps to save 6 bits.
>  - Also new dir24_8 implementation supports different next_hop sizes
>    (1/2/4/8 bytes per next hop)
>  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary operator.
>    Instead it returns special default value if there is no route.
> 
> Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> ---
>  config/common_base                 |   6 +
>  doc/api/doxy-api.conf              |   1 +
>  lib/Makefile                       |   2 +
>  lib/librte_rib/Makefile            |  22 ++
>  lib/librte_rib/rte_dir24_8.c       | 482 +++++++++++++++++++++++++++++++++
>  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
>  lib/librte_rib/rte_rib.c           | 526 +++++++++++++++++++++++++++++++++++++
>  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
>  lib/librte_rib/rte_rib_version.map |  18 ++
>  mk/rte.app.mk                      |   1 +
>  10 files changed, 1496 insertions(+)
>  create mode 100644 lib/librte_rib/Makefile
>  create mode 100644 lib/librte_rib/rte_dir24_8.c
>  create mode 100644 lib/librte_rib/rte_dir24_8.h
>  create mode 100644 lib/librte_rib/rte_rib.c
>  create mode 100644 lib/librte_rib/rte_rib.h
>  create mode 100644 lib/librte_rib/rte_rib_version.map
> 

First pass review comments. For now just reviewed the main public header
file rte_rib.h. Later reviews will cover the other files as best I can.

/Bruce

<snip>
> diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> new file mode 100644
> index 0000000..6eac8fb
> --- /dev/null
> +++ b/lib/librte_rib/rte_rib.h
> @@ -0,0 +1,322 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> + */
> +
> +#ifndef _RTE_RIB_H_
> +#define _RTE_RIB_H_
> +
> +/**
> + * @file
> + * Compressed trie implementation for Longest Prefix Match
> + */
> +
> +/** @internal Macro to enable/disable run-time checks. */
> +#if defined(RTE_LIBRTE_RIB_DEBUG)
> +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {	\
> +	if (cond)					\
> +		return retval;				\
> +} while (0)
> +#else
> +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> +#endif

use RTE_ASSERT?

> +
> +#define RTE_RIB_VALID_NODE	1

should there be an INVALID_NODE macro?

> +#define RTE_RIB_GET_NXT_ALL	0
> +#define RTE_RIB_GET_NXT_COVER	1
> +
> +#define RTE_RIB_INVALID_ROUTE	0
> +#define RTE_RIB_VALID_ROUTE	1
> +
> +/** Max number of characters in RIB name. */
> +#define RTE_RIB_NAMESIZE	64
> +
> +/** Maximum depth value possible for IPv4 RIB. */
> +#define RTE_RIB_MAXDEPTH	32

I think we should have IPv4 in the name here. Will it not be extended to
support IPv6 in future?

> +
> +/**
> + * Macro to check if prefix1 {key1/depth1}
> + * is covered by prefix2 {key2/depth2}
> + */
> +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)			\
> +	((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> +		&& (depth1 > depth2))
Neat check!

Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
I think you can avoid the casting and have a slightly shorter mask by
changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to 
"~(UINT32_MAX >> depth2)"
I'd also suggest for readability putting the second check first, and,
for maintainability, using an inline function rather than a macro.

> +
> +/** @internal Macro to get next node in tree*/
> +#define RTE_RIB_GET_NXT_NODE(node, key)					\
> +	((key & (1 << (31 - node->depth))) ? node->right : node->left)
> +/** @internal Macro to check if node is right child*/
> +#define RTE_RIB_IS_RIGHT_NODE(node)	(node->parent->right == node)

Again, consider inline fns rather than macros.
For the latter macro, rather than doing additional pointer derefs to
parent, can you also get if it's a right node by using:
"(node->key & (1 << (32 - node->depth)))"? 

> +
> +
> +struct rte_rib_node {
> +	struct rte_rib_node *left;
> +	struct rte_rib_node *right;
> +	struct rte_rib_node *parent;
> +	uint32_t	key;
> +	uint8_t		depth;
> +	uint8_t		flag;
> +	uint64_t	nh;
> +	uint64_t	ext[0];
> +};
> +
> +struct rte_rib;
> +
> +/** Type of FIB struct*/
> +enum rte_rib_type {
> +	RTE_RIB_DIR24_8_1B,
> +	RTE_RIB_DIR24_8_2B,
> +	RTE_RIB_DIR24_8_4B,
> +	RTE_RIB_DIR24_8_8B,
> +	RTE_RIB_TYPE_MAX
> +};

If the plan is to support multiple underlying fib types and algorithms
under the rib library, would it not be better to separate out the
algorithm part from the data storage part? So have the type just be
DIR_24_8, and have the 1, 2, 4 or 8 specified separately.

> +
> +enum rte_rib_op {
> +	RTE_RIB_ADD,
> +	RTE_RIB_DEL
> +};
> +
> +/** RIB nodes allocation type */
> +enum rte_rib_alloc_type {
> +	RTE_RIB_MALLOC,
> +	RTE_RIB_MEMPOOL,
> +	RTE_RIB_ALLOC_MAX
> +};

Not sure you need this any more. Malloc allocations and mempool
allocations are now pretty much the same thing.

> +
> +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> +	uint8_t depth, uint64_t next_hop, enum rte_rib_op op);

Do you anticipate more ops in future than just add and delete? If not,
why not just split this function into two and drop the op struct.

> +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> +	uint64_t *next_hops, const unsigned n);
> +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib *rib);
> +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> +	struct rte_rib_node *node);
> +
> +struct rte_rib {
> +	char name[RTE_RIB_NAMESIZE];
> +	/*pointer to rib trie*/
> +	struct rte_rib_node	*trie;
> +	/*pointer to dataplane struct*/
> +	void	*fib;
> +	/*prefix modification*/
> +	rte_rib_modify_fn_t	modify;
> +	/* Bulk lookup fn*/
> +	rte_rib_tree_lookup_fn_t	lookup;
> +	/*alloc trie element*/
> +	rte_rib_alloc_node_fn_t	alloc_node;
> +	/*free trie element*/
> +	rte_rib_free_node_fn_t	free_node;
> +	struct rte_mempool	*node_pool;
> +	uint32_t		cur_nodes;
> +	uint32_t		cur_routes;
> +	int			max_nodes;
> +	int			node_sz;
> +	enum rte_rib_type	type;
> +	enum rte_rib_alloc_type	alloc_type;
> +};
> +
> +/** RIB configuration structure */
> +struct rte_rib_conf {
> +	enum rte_rib_type	type;
> +	enum rte_rib_alloc_type	alloc_type;
> +	int	max_nodes;
> +	size_t	node_sz;
> +	uint64_t def_nh;
> +};
> +
> +/**
> + * Lookup an IP into the RIB structure
> + *
> + * @param rib
> + *  RIB object handle
> + * @param key
> + *  IP to be looked up in the RIB
> + * @return
> + *  pointer to struct rte_rib_node on success,
> + *  NULL otherwise
> + */
> +struct rte_rib_node *
> +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> +
> +/**
> + * Lookup less specific route into the RIB structure
> + *
> + * @param ent
> + *  Pointer to struct rte_rib_node that represents target route
> + * @return
> + *  pointer to struct rte_rib_node that represents
> + *  less specific route on success,
> + *  NULL otherwise
> + */
> +struct rte_rib_node *
> +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> +
> +/**
> + * Lookup prefix into the RIB structure
> + *
> + * @param rib
> + *  RIB object handle
> + * @param key
> + *  net to be looked up in the RIB
> + * @param depth
> + *  prefix length
> + * @return
> + *  pointer to struct rte_rib_node on success,
> + *  NULL otherwise
> + */
> +struct rte_rib_node *
> +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t depth);

Can you explain the difference between this and regular lookup, and how
they would be used. I don't think the names convey the differences
sufficiently, and so we should look to rename one or both to be clearer.

> +
> +/**
> + * Retrieve next more specific prefix from the RIB
s/more/most/

> + * that is covered by key/depth supernet
> + *
> + * @param rib
> + *  RIB object handle
> + * @param key
> + *  net address of supernet prefix that covers returned more specific prefixes
> + * @param depth
> + *  supernet prefix length
> + * @param cur
> + *   pointer to the last returned prefix to get next prefix
> + *   or
> + *   NULL to get first more specific prefix
> + * @param flag
> + *  -RTE_RIB_GET_NXT_ALL
> + *   get all prefixes from subtrie

By all prefixes do you mean more specific, i.e. the final prefix?

> + *  -RTE_RIB_GET_NXT_COVER
> + *   get only first more specific prefix even if it have more specifics
> + * @return
> + *  pointer to the next more specific prefix
> + *  or
> + *  NULL if there is no prefixes left
> + */
> +struct rte_rib_node *
> +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> +	struct rte_rib_node *cur, int flag);
> +
> +/**
> + * Remove prefix from the RIB
> + *
> + * @param rib
> + *  RIB object handle
> + * @param key
> + *  net to be removed from the RIB
> + * @param depth
> + *  prefix length
> + */
> +void
> +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> +
> +/**
> + * Insert prefix into the RIB
> + *
> + * @param rib
> + *  RIB object handle
> + * @param key
> + *  net to be inserted to the RIB
> + * @param depth
> + *  prefix length
> + * @return
> + *  pointer to new rte_rib_node on success
> + *  NULL otherwise
> + */
> +struct rte_rib_node *
> +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> +
> +/**
> + * Create RIB
> + *
> + * @param name
> + *  RIB name
> + * @param socket_id
> + *  NUMA socket ID for RIB table memory allocation
> + * @param conf
> + *  Structure containing the configuration
> + * @return
> + *  Handle to RIB object on success
> + *  NULL otherwise with rte_errno set to an appropriate values.
> + */
> +struct rte_rib *
> +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf *conf);
> +
> +/**
> + * Find an existing RIB object and return a pointer to it.
> + *
> + * @param name
> + *  Name of the rib object as passed to rte_rib_create()
> + * @return
> + *  Pointer to rib object or NULL if object not found with rte_errno
> + *  set appropriately. Possible rte_errno values include:
> + *   - ENOENT - required entry not available to return.
> + */
> +struct rte_rib *
> +rte_rib_find_existing(const char *name);
> +
> +/**
> + * Free an RIB object.
> + *
> + * @param rib
> + *   RIB object handle
> + * @return
> + *   None
> + */
> +void
> +rte_rib_free(struct rte_rib *rib);
> +
> +/**
> + * Add a rule to the RIB.
> + *
> + * @param rib
> + *   RIB object handle
> + * @param ip
> + *   IP of the rule to be added to the RIB
> + * @param depth
> + *   Depth of the rule to be added to the RIB
> + * @param next_hop
> + *   Next hop of the rule to be added to the RIB
> + * @return
> + *   0 on success, negative value otherwise
> + */
> +int
> +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t next_hop);
> +
> +/**
> + * Delete a rule from the RIB.
> + *
> + * @param rib
> + *   RIB object handle
> + * @param ip
> + *   IP of the rule to be deleted from the RIB
> + * @param depth
> + *   Depth of the rule to be deleted from the RIB
> + * @return
> + *   0 on success, negative value otherwise
> + */
> +int
> +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> +
> +/**
> + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> + * macro, so the address of the function should not be used.
> + *
> + * @param RIB
> + *   RIB object handle
> + * @param ips
> + *   Array of IPs to be looked up in the FIB
> + * @param next_hops
> + *   Next hop of the most specific rule found for IP.
> + *   This is an array of eight byte values.
> + *   If the lookup for the given IP failed, then corresponding element would
> + *   contain default value, see description of then next parameter.
> + * @param n
> + *   Number of elements in ips (and next_hops) array to lookup. This should be a
> + *   compile time constant, and divisible by 8 for best performance.
> + * @param defv
> + *   Default value to populate into corresponding element of hop[] array,
> + *   if lookup would fail.
> + *  @return
> + *   -EINVAL for incorrect arguments, otherwise 0
> + */
> +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)	\
> +	rib->lookup(rib->fib, ips, next_hops, n)

My main thought here is whether this needs to be a function at all?
Given that it takes a full burst of addresses in a single go, how much
performance would actually be lost by making this a regular function in
the C file?
IF we do convert this to a regular function, then a lot of the structure
definitions above - most importantly, the rib structure itself - can
probably be moved to a private header file and not exposed to
applications at all. This will make ABI compatibility a *lot* easier, as
the structures can be changed without affecting the public ABI.

/Bruce

> +
> +#endif /* _RTE_RIB_H_ */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-14 11:09   ` Bruce Richardson
@ 2018-03-14 12:05     ` Richardson, Bruce
  2018-03-25 18:17     ` Vladimir Medvedkin
  1 sibling, 0 replies; 11+ messages in thread
From: Richardson, Bruce @ 2018-03-14 12:05 UTC (permalink / raw)
  To: Richardson, Bruce, Medvedkin Vladimir; +Cc: dev

> > +/** RIB nodes allocation type */
> > +enum rte_rib_alloc_type {
> > +	RTE_RIB_MALLOC,
> > +	RTE_RIB_MEMPOOL,
> > +	RTE_RIB_ALLOC_MAX
> > +};
> 
> Not sure you need this any more. Malloc allocations and mempool
> allocations are now pretty much the same thing.

Sorry, please ignore this comment. I mistook mempool for memzone.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-14 11:09   ` Bruce Richardson
  2018-03-14 12:05     ` Richardson, Bruce
@ 2018-03-25 18:17     ` Vladimir Medvedkin
  2018-03-26  9:50       ` Bruce Richardson
  1 sibling, 1 reply; 11+ messages in thread
From: Vladimir Medvedkin @ 2018-03-25 18:17 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi,

2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:

> On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > RIB is an alternative to current LPM library.
> > It solves the following problems
> >  - Increases the speed of control plane operations against lpm such as
> >    adding/deleting routes
> >  - Adds abstraction from dataplane algorithms, so it is possible to add
> >    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> >    in addition to current dir24_8
> >  - It is possible to keep user defined application specific additional
> >    information in struct rte_rib_node which represents route entry.
> >    It can be next hop/set of next hops (i.e. active and feasible),
> >    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> >    plenty of additional control plane information.
> >  - For dir24_8 implementation it is possible to remove
> rte_lpm_tbl_entry.depth
> >    field that helps to save 6 bits.
> >  - Also new dir24_8 implementation supports different next_hop sizes
> >    (1/2/4/8 bytes per next hop)
> >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> operator.
> >    Instead it returns special default value if there is no route.
> >
> > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > ---
> >  config/common_base                 |   6 +
> >  doc/api/doxy-api.conf              |   1 +
> >  lib/Makefile                       |   2 +
> >  lib/librte_rib/Makefile            |  22 ++
> >  lib/librte_rib/rte_dir24_8.c       | 482 ++++++++++++++++++++++++++++++
> +++
> >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> >  lib/librte_rib/rte_rib.c           | 526 ++++++++++++++++++++++++++++++
> +++++++
> >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> >  lib/librte_rib/rte_rib_version.map |  18 ++
> >  mk/rte.app.mk                      |   1 +
> >  10 files changed, 1496 insertions(+)
> >  create mode 100644 lib/librte_rib/Makefile
> >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> >  create mode 100644 lib/librte_rib/rte_rib.c
> >  create mode 100644 lib/librte_rib/rte_rib.h
> >  create mode 100644 lib/librte_rib/rte_rib_version.map
> >
>
> First pass review comments. For now just reviewed the main public header
> file rte_rib.h. Later reviews will cover the other files as best I can.
>
> /Bruce
>
> <snip>
> > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > new file mode 100644
> > index 0000000..6eac8fb
> > --- /dev/null
> > +++ b/lib/librte_rib/rte_rib.h
> > @@ -0,0 +1,322 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > + */
> > +
> > +#ifndef _RTE_RIB_H_
> > +#define _RTE_RIB_H_
> > +
> > +/**
> > + * @file
> > + * Compressed trie implementation for Longest Prefix Match
> > + */
> > +
> > +/** @internal Macro to enable/disable run-time checks. */
> > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > +     if (cond)                                       \
> > +             return retval;                          \
> > +} while (0)
> > +#else
> > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > +#endif
>
> use RTE_ASSERT?
>
it was done just like it was done in the LPM lib. But if you think it
should be RTE_ASSERT so be it.


>
> > +
> > +#define RTE_RIB_VALID_NODE   1
>
> should there be an INVALID_NODE macro?
>
No


>
> > +#define RTE_RIB_GET_NXT_ALL  0
> > +#define RTE_RIB_GET_NXT_COVER        1
> > +
> > +#define RTE_RIB_INVALID_ROUTE        0
> > +#define RTE_RIB_VALID_ROUTE  1
> > +
> > +/** Max number of characters in RIB name. */
> > +#define RTE_RIB_NAMESIZE     64
> > +
> > +/** Maximum depth value possible for IPv4 RIB. */
> > +#define RTE_RIB_MAXDEPTH     32
>
> I think we should have IPv4 in the name here. Will it not be extended to
> support IPv6 in future?
>
I think there should be a separate implementation of the library for ipv6


>
> > +
> > +/**
> > + * Macro to check if prefix1 {key1/depth1}
> > + * is covered by prefix2 {key2/depth2}
> > + */
> > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
>      \
> > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> > +             && (depth1 > depth2))
> Neat check!
>
> Any particular reason for using UINT64_MAX here rather than UINT32_MAX?

in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
UINT32_MAX because shift count will be masked to 5 bits.

I think you can avoid the casting and have a slightly shorter mask by
> changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> "~(UINT32_MAX >> depth2)"
> I'd also suggest for readability putting the second check first, and,
> for maintainability, using an inline function rather than a macro.
>
 Agree, it looks clearer


> > +
> > +/** @internal Macro to get next node in tree*/
> > +#define RTE_RIB_GET_NXT_NODE(node, key)
>       \
> > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > +/** @internal Macro to check if node is right child*/
> > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
>
> Again, consider inline fns rather than macros.
>
Ok

For the latter macro, rather than doing additional pointer derefs to
> parent, can you also get if it's a right node by using:
> "(node->key & (1 << (32 - node->depth)))"?
>
No, it is not possible. Decision whether node be left or right is made
using parent and child common depth.
Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth
will be /8 and 10.128.0.0/24 will be right child.


> > +
> > +
> > +struct rte_rib_node {
> > +     struct rte_rib_node *left;
> > +     struct rte_rib_node *right;
> > +     struct rte_rib_node *parent;
> > +     uint32_t        key;
> > +     uint8_t         depth;
> > +     uint8_t         flag;
> > +     uint64_t        nh;
> > +     uint64_t        ext[0];
> > +};
> > +
> > +struct rte_rib;
> > +
> > +/** Type of FIB struct*/
> > +enum rte_rib_type {
> > +     RTE_RIB_DIR24_8_1B,
> > +     RTE_RIB_DIR24_8_2B,
> > +     RTE_RIB_DIR24_8_4B,
> > +     RTE_RIB_DIR24_8_8B,
> > +     RTE_RIB_TYPE_MAX
> > +};
>
> If the plan is to support multiple underlying fib types and algorithms
> under the rib library, would it not be better to separate out the
> algorithm part from the data storage part? So have the type just be
> DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
>
Yes, we were talk about it in IRC, agree. Now I pass next hop size in
union rte_rib_fib_conf inside rte_rib_conf


>
> > +
> > +enum rte_rib_op {
> > +     RTE_RIB_ADD,
> > +     RTE_RIB_DEL
> > +};
> > +
> > +/** RIB nodes allocation type */
> > +enum rte_rib_alloc_type {
> > +     RTE_RIB_MALLOC,
> > +     RTE_RIB_MEMPOOL,
> > +     RTE_RIB_ALLOC_MAX
> > +};
>
> Not sure you need this any more. Malloc allocations and mempool
> allocations are now pretty much the same thing.
>
Actually I think to remove malloc. On performance tests with
adding/deleting huge amount of routes malloc is slower. Maybe because of
fragmentation.
What do you think?


> > +
> > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
>
> Do you anticipate more ops in future than just add and delete? If not,
> why not just split this function into two and drop the op struct.
>
It is difficult question. I'm not ready to make decision at the moment.


>
> > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib
> *rib);
> > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > +     struct rte_rib_node *node);
> > +
> > +struct rte_rib {
> > +     char name[RTE_RIB_NAMESIZE];
> > +     /*pointer to rib trie*/
> > +     struct rte_rib_node     *trie;
> > +     /*pointer to dataplane struct*/
> > +     void    *fib;
> > +     /*prefix modification*/
> > +     rte_rib_modify_fn_t     modify;
> > +     /* Bulk lookup fn*/
> > +     rte_rib_tree_lookup_fn_t        lookup;
> > +     /*alloc trie element*/
> > +     rte_rib_alloc_node_fn_t alloc_node;
> > +     /*free trie element*/
> > +     rte_rib_free_node_fn_t  free_node;
> > +     struct rte_mempool      *node_pool;
> > +     uint32_t                cur_nodes;
> > +     uint32_t                cur_routes;
> > +     int                     max_nodes;
> > +     int                     node_sz;
> > +     enum rte_rib_type       type;
> > +     enum rte_rib_alloc_type alloc_type;
> > +};
> > +
> > +/** RIB configuration structure */
> > +struct rte_rib_conf {
> > +     enum rte_rib_type       type;
> > +     enum rte_rib_alloc_type alloc_type;
> > +     int     max_nodes;
> > +     size_t  node_sz;
> > +     uint64_t def_nh;
> > +};
> > +
> > +/**
> > + * Lookup an IP into the RIB structure
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  IP to be looked up in the RIB
> > + * @return
> > + *  pointer to struct rte_rib_node on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > +
> > +/**
> > + * Lookup less specific route into the RIB structure
> > + *
> > + * @param ent
> > + *  Pointer to struct rte_rib_node that represents target route
> > + * @return
> > + *  pointer to struct rte_rib_node that represents
> > + *  less specific route on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > +
> > +/**
> > + * Lookup prefix into the RIB structure
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be looked up in the RIB
> > + * @param depth
> > + *  prefix length
> > + * @return
> > + *  pointer to struct rte_rib_node on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
>
> Can you explain the difference between this and regular lookup, and how
> they would be used. I don't think the names convey the differences
> sufficiently, and so we should look to rename one or both to be clearer.
>
Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for
passed key.
rte_rib_tree_lookup_exact will lookup node contained key and depth equal to
passed in args. It used to find exact route.


>
> > +
> > +/**
> > + * Retrieve next more specific prefix from the RIB
> s/more/most/
>

> > + * that is covered by key/depth supernet
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net address of supernet prefix that covers returned more specific
> prefixes
> > + * @param depth
> > + *  supernet prefix length
> > + * @param cur
> > + *   pointer to the last returned prefix to get next prefix
> > + *   or
> > + *   NULL to get first more specific prefix
> > + * @param flag
> > + *  -RTE_RIB_GET_NXT_ALL
> > + *   get all prefixes from subtrie
>
> By all prefixes do you mean more specific, i.e. the final prefix?
>
What do you mean the final prefix?


> > + *  -RTE_RIB_GET_NXT_COVER
> > + *   get only first more specific prefix even if it have more specifics
> > + * @return
> > + *  pointer to the next more specific prefix
> > + *  or
> > + *  NULL if there is no prefixes left
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> > +     struct rte_rib_node *cur, int flag);
> > +
> > +/**
> > + * Remove prefix from the RIB
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be removed from the RIB
> > + * @param depth
> > + *  prefix length
> > + */
> > +void
> > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > +
> > +/**
> > + * Insert prefix into the RIB
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be inserted to the RIB
> > + * @param depth
> > + *  prefix length
> > + * @return
> > + *  pointer to new rte_rib_node on success
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > +
> > +/**
> > + * Create RIB
> > + *
> > + * @param name
> > + *  RIB name
> > + * @param socket_id
> > + *  NUMA socket ID for RIB table memory allocation
> > + * @param conf
> > + *  Structure containing the configuration
> > + * @return
> > + *  Handle to RIB object on success
> > + *  NULL otherwise with rte_errno set to an appropriate values.
> > + */
> > +struct rte_rib *
> > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> *conf);
> > +
> > +/**
> > + * Find an existing RIB object and return a pointer to it.
> > + *
> > + * @param name
> > + *  Name of the rib object as passed to rte_rib_create()
> > + * @return
> > + *  Pointer to rib object or NULL if object not found with rte_errno
> > + *  set appropriately. Possible rte_errno values include:
> > + *   - ENOENT - required entry not available to return.
> > + */
> > +struct rte_rib *
> > +rte_rib_find_existing(const char *name);
> > +
> > +/**
> > + * Free an RIB object.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @return
> > + *   None
> > + */
> > +void
> > +rte_rib_free(struct rte_rib *rib);
> > +
> > +/**
> > + * Add a rule to the RIB.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @param ip
> > + *   IP of the rule to be added to the RIB
> > + * @param depth
> > + *   Depth of the rule to be added to the RIB
> > + * @param next_hop
> > + *   Next hop of the rule to be added to the RIB
> > + * @return
> > + *   0 on success, negative value otherwise
> > + */
> > +int
> > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t
> next_hop);
> > +
> > +/**
> > + * Delete a rule from the RIB.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @param ip
> > + *   IP of the rule to be deleted from the RIB
> > + * @param depth
> > + *   Depth of the rule to be deleted from the RIB
> > + * @return
> > + *   0 on success, negative value otherwise
> > + */
> > +int
> > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > +
> > +/**
> > + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> > + * macro, so the address of the function should not be used.
> > + *
> > + * @param RIB
> > + *   RIB object handle
> > + * @param ips
> > + *   Array of IPs to be looked up in the FIB
> > + * @param next_hops
> > + *   Next hop of the most specific rule found for IP.
> > + *   This is an array of eight byte values.
> > + *   If the lookup for the given IP failed, then corresponding element
> would
> > + *   contain default value, see description of then next parameter.
> > + * @param n
> > + *   Number of elements in ips (and next_hops) array to lookup. This
> should be a
> > + *   compile time constant, and divisible by 8 for best performance.
> > + * @param defv
> > + *   Default value to populate into corresponding element of hop[]
> array,
> > + *   if lookup would fail.
> > + *  @return
> > + *   -EINVAL for incorrect arguments, otherwise 0
> > + */
> > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > +     rib->lookup(rib->fib, ips, next_hops, n)
>
> My main thought here is whether this needs to be a function at all?
> Given that it takes a full burst of addresses in a single go, how much
> performance would actually be lost by making this a regular function in
> the C file?
> IF we do convert this to a regular function, then a lot of the structure
> definitions above - most importantly, the rib structure itself - can
> probably be moved to a private header file and not exposed to
> applications at all. This will make ABI compatibility a *lot* easier, as
> the structures can be changed without affecting the public ABI.
>
I didn't quite understand what you mean.


> /Bruce
>
> > +
> > +#endif /* _RTE_RIB_H_ */
>



-- 
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-25 18:17     ` Vladimir Medvedkin
@ 2018-03-26  9:50       ` Bruce Richardson
  2018-03-29 19:59         ` Vladimir Medvedkin
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2018-03-26  9:50 UTC (permalink / raw)
  To: Vladimir Medvedkin; +Cc: dev

On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote:
> Hi,
> 
> 2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:
> 
> > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > RIB is an alternative to current LPM library.
> > > It solves the following problems
> > >  - Increases the speed of control plane operations against lpm such as
> > >    adding/deleting routes
> > >  - Adds abstraction from dataplane algorithms, so it is possible to add
> > >    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> > >    in addition to current dir24_8
> > >  - It is possible to keep user defined application specific additional
> > >    information in struct rte_rib_node which represents route entry.
> > >    It can be next hop/set of next hops (i.e. active and feasible),
> > >    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> > >    plenty of additional control plane information.
> > >  - For dir24_8 implementation it is possible to remove
> > rte_lpm_tbl_entry.depth
> > >    field that helps to save 6 bits.
> > >  - Also new dir24_8 implementation supports different next_hop sizes
> > >    (1/2/4/8 bytes per next hop)
> > >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> > operator.
> > >    Instead it returns special default value if there is no route.
> > >
> > > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > > ---
> > >  config/common_base                 |   6 +
> > >  doc/api/doxy-api.conf              |   1 +
> > >  lib/Makefile                       |   2 +
> > >  lib/librte_rib/Makefile            |  22 ++
> > >  lib/librte_rib/rte_dir24_8.c       | 482 ++++++++++++++++++++++++++++++
> > +++
> > >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> > >  lib/librte_rib/rte_rib.c           | 526 ++++++++++++++++++++++++++++++
> > +++++++
> > >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> > >  lib/librte_rib/rte_rib_version.map |  18 ++
> > >  mk/rte.app.mk                      |   1 +
> > >  10 files changed, 1496 insertions(+)
> > >  create mode 100644 lib/librte_rib/Makefile
> > >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> > >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> > >  create mode 100644 lib/librte_rib/rte_rib.c
> > >  create mode 100644 lib/librte_rib/rte_rib.h
> > >  create mode 100644 lib/librte_rib/rte_rib_version.map
> > >
> >
> > First pass review comments. For now just reviewed the main public header
> > file rte_rib.h. Later reviews will cover the other files as best I can.
> >
> > /Bruce
> >
> > <snip>
> > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > > new file mode 100644
> > > index 0000000..6eac8fb
> > > --- /dev/null
> > > +++ b/lib/librte_rib/rte_rib.h
> > > @@ -0,0 +1,322 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > > + */
> > > +
> > > +#ifndef _RTE_RIB_H_
> > > +#define _RTE_RIB_H_
> > > +
> > > +/**
> > > + * @file
> > > + * Compressed trie implementation for Longest Prefix Match
> > > + */
> > > +
> > > +/** @internal Macro to enable/disable run-time checks. */
> > > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > > +     if (cond)                                       \
> > > +             return retval;                          \
> > > +} while (0)
> > > +#else
> > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > > +#endif
> >
> > use RTE_ASSERT?
> >
> it was done just like it was done in the LPM lib. But if you think it
> should be RTE_ASSERT so be it.
> 
> 
> >
> > > +
> > > +#define RTE_RIB_VALID_NODE   1
> >
> > should there be an INVALID_NODE macro?
> >
> No
> 
> 
> >
> > > +#define RTE_RIB_GET_NXT_ALL  0
> > > +#define RTE_RIB_GET_NXT_COVER        1
> > > +
> > > +#define RTE_RIB_INVALID_ROUTE        0
> > > +#define RTE_RIB_VALID_ROUTE  1
> > > +
> > > +/** Max number of characters in RIB name. */
> > > +#define RTE_RIB_NAMESIZE     64
> > > +
> > > +/** Maximum depth value possible for IPv4 RIB. */
> > > +#define RTE_RIB_MAXDEPTH     32
> >
> > I think we should have IPv4 in the name here. Will it not be extended to
> > support IPv6 in future?
> >
> I think there should be a separate implementation of the library for ipv6
> 
I can understand the need for a separate LPM implementation, but should
they both not be under the same rib library?

> 
> >
> > > +
> > > +/**
> > > + * Macro to check if prefix1 {key1/depth1}
> > > + * is covered by prefix2 {key2/depth2}
> > > + */
> > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
> >      \
> > > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> > > +             && (depth1 > depth2))
> > Neat check!
> >
> > Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
> 
> in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
> UINT32_MAX because shift count will be masked to 5 bits.
> 
> I think you can avoid the casting and have a slightly shorter mask by
> > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> > "~(UINT32_MAX >> depth2)"
> > I'd also suggest for readability putting the second check first, and,
> > for maintainability, using an inline function rather than a macro.
> >
>  Agree, it looks clearer
> 
> 
> > > +
> > > +/** @internal Macro to get next node in tree*/
> > > +#define RTE_RIB_GET_NXT_NODE(node, key)
> >       \
> > > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > > +/** @internal Macro to check if node is right child*/
> > > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
> >
> > Again, consider inline fns rather than macros.
> >
> Ok
> 
> For the latter macro, rather than doing additional pointer derefs to
> > parent, can you also get if it's a right node by using:
> > "(node->key & (1 << (32 - node->depth)))"?
> >
> No, it is not possible. Decision whether node be left or right is made
> using parent and child common depth.
> Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth
> will be /8 and 10.128.0.0/24 will be right child.
> 
> 
> > > +
> > > +
> > > +struct rte_rib_node {
> > > +     struct rte_rib_node *left;
> > > +     struct rte_rib_node *right;
> > > +     struct rte_rib_node *parent;
> > > +     uint32_t        key;
> > > +     uint8_t         depth;
> > > +     uint8_t         flag;
> > > +     uint64_t        nh;
> > > +     uint64_t        ext[0];
> > > +};
> > > +
> > > +struct rte_rib;
> > > +
> > > +/** Type of FIB struct*/
> > > +enum rte_rib_type {
> > > +     RTE_RIB_DIR24_8_1B,
> > > +     RTE_RIB_DIR24_8_2B,
> > > +     RTE_RIB_DIR24_8_4B,
> > > +     RTE_RIB_DIR24_8_8B,
> > > +     RTE_RIB_TYPE_MAX
> > > +};
> >
> > If the plan is to support multiple underlying fib types and algorithms
> > under the rib library, would it not be better to separate out the
> > algorithm part from the data storage part? So have the type just be
> > DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
> >
> Yes, we were talk about it in IRC, agree. Now I pass next hop size in
> union rte_rib_fib_conf inside rte_rib_conf
> 
> 
> >
> > > +
> > > +enum rte_rib_op {
> > > +     RTE_RIB_ADD,
> > > +     RTE_RIB_DEL
> > > +};
> > > +
> > > +/** RIB nodes allocation type */
> > > +enum rte_rib_alloc_type {
> > > +     RTE_RIB_MALLOC,
> > > +     RTE_RIB_MEMPOOL,
> > > +     RTE_RIB_ALLOC_MAX
> > > +};
> >
> > Not sure you need this any more. Malloc allocations and mempool
> > allocations are now pretty much the same thing.
> >
> Actually I think to remove malloc. On performance tests with
> adding/deleting huge amount of routes malloc is slower. Maybe because of
> fragmentation.
> What do you think?
> 
Yes, definitely mempool allocations are the way to go!

> 
> > > +
> > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> > > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> >
> > Do you anticipate more ops in future than just add and delete? If not,
> > why not just split this function into two and drop the op struct.
> >
> It is difficult question. I'm not ready to make decision at the moment.
> 
> 
> >
> > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> > > +     uint64_t *next_hops, const unsigned n);
> > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib
> > *rib);
> > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > > +     struct rte_rib_node *node);
> > > +
> > > +struct rte_rib {
> > > +     char name[RTE_RIB_NAMESIZE];
> > > +     /*pointer to rib trie*/
> > > +     struct rte_rib_node     *trie;
> > > +     /*pointer to dataplane struct*/
> > > +     void    *fib;
> > > +     /*prefix modification*/
> > > +     rte_rib_modify_fn_t     modify;
> > > +     /* Bulk lookup fn*/
> > > +     rte_rib_tree_lookup_fn_t        lookup;
> > > +     /*alloc trie element*/
> > > +     rte_rib_alloc_node_fn_t alloc_node;
> > > +     /*free trie element*/
> > > +     rte_rib_free_node_fn_t  free_node;
> > > +     struct rte_mempool      *node_pool;
> > > +     uint32_t                cur_nodes;
> > > +     uint32_t                cur_routes;
> > > +     int                     max_nodes;
> > > +     int                     node_sz;
> > > +     enum rte_rib_type       type;
> > > +     enum rte_rib_alloc_type alloc_type;
> > > +};
> > > +
> > > +/** RIB configuration structure */
> > > +struct rte_rib_conf {
> > > +     enum rte_rib_type       type;
> > > +     enum rte_rib_alloc_type alloc_type;
> > > +     int     max_nodes;
> > > +     size_t  node_sz;
> > > +     uint64_t def_nh;
> > > +};
> > > +
> > > +/**
> > > + * Lookup an IP into the RIB structure
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  IP to be looked up in the RIB
> > > + * @return
> > > + *  pointer to struct rte_rib_node on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > > +
> > > +/**
> > > + * Lookup less specific route into the RIB structure
> > > + *
> > > + * @param ent
> > > + *  Pointer to struct rte_rib_node that represents target route
> > > + * @return
> > > + *  pointer to struct rte_rib_node that represents
> > > + *  less specific route on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > > +
> > > +/**
> > > + * Lookup prefix into the RIB structure
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be looked up in the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + * @return
> > > + *  pointer to struct rte_rib_node on success,
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t
> > depth);
> >
> > Can you explain the difference between this and regular lookup, and how
> > they would be used. I don't think the names convey the differences
> > sufficiently, and so we should look to rename one or both to be clearer.
> >
> Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for
> passed key.
> rte_rib_tree_lookup_exact will lookup node contained key and depth equal to
> passed in args. It used to find exact route.
> 
So if there is no node exactly matching the parameters, it the lookup_exact
returns failure? E.g. if you request a /24 node, it won't return a /8 node
that would cover the /24?

> 
> >
> > > +
> > > +/**
> > > + * Retrieve next more specific prefix from the RIB
> > s/more/most/
> >
> 
> > > + * that is covered by key/depth supernet
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net address of supernet prefix that covers returned more specific
> > prefixes
> > > + * @param depth
> > > + *  supernet prefix length
> > > + * @param cur
> > > + *   pointer to the last returned prefix to get next prefix
> > > + *   or
> > > + *   NULL to get first more specific prefix
> > > + * @param flag
> > > + *  -RTE_RIB_GET_NXT_ALL
> > > + *   get all prefixes from subtrie
> >
> > By all prefixes do you mean more specific, i.e. the final prefix?
> >
> What do you mean the final prefix?
> 
The most specific one, or the longest prefix.

> 
> > > + *  -RTE_RIB_GET_NXT_COVER
> > > + *   get only first more specific prefix even if it have more specifics
> > > + * @return
> > > + *  pointer to the next more specific prefix
> > > + *  or
> > > + *  NULL if there is no prefixes left
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> > > +     struct rte_rib_node *cur, int flag);
> > > +
> > > +/**
> > > + * Remove prefix from the RIB
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be removed from the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + */
> > > +void
> > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Insert prefix into the RIB
> > > + *
> > > + * @param rib
> > > + *  RIB object handle
> > > + * @param key
> > > + *  net to be inserted to the RIB
> > > + * @param depth
> > > + *  prefix length
> > > + * @return
> > > + *  pointer to new rte_rib_node on success
> > > + *  NULL otherwise
> > > + */
> > > +struct rte_rib_node *
> > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > > +
> > > +/**
> > > + * Create RIB
> > > + *
> > > + * @param name
> > > + *  RIB name
> > > + * @param socket_id
> > > + *  NUMA socket ID for RIB table memory allocation
> > > + * @param conf
> > > + *  Structure containing the configuration
> > > + * @return
> > > + *  Handle to RIB object on success
> > > + *  NULL otherwise with rte_errno set to an appropriate values.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> > *conf);
> > > +
> > > +/**
> > > + * Find an existing RIB object and return a pointer to it.
> > > + *
> > > + * @param name
> > > + *  Name of the rib object as passed to rte_rib_create()
> > > + * @return
> > > + *  Pointer to rib object or NULL if object not found with rte_errno
> > > + *  set appropriately. Possible rte_errno values include:
> > > + *   - ENOENT - required entry not available to return.
> > > + */
> > > +struct rte_rib *
> > > +rte_rib_find_existing(const char *name);
> > > +
> > > +/**
> > > + * Free an RIB object.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @return
> > > + *   None
> > > + */
> > > +void
> > > +rte_rib_free(struct rte_rib *rib);
> > > +
> > > +/**
> > > + * Add a rule to the RIB.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @param ip
> > > + *   IP of the rule to be added to the RIB
> > > + * @param depth
> > > + *   Depth of the rule to be added to the RIB
> > > + * @param next_hop
> > > + *   Next hop of the rule to be added to the RIB
> > > + * @return
> > > + *   0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t
> > next_hop);
> > > +
> > > +/**
> > > + * Delete a rule from the RIB.
> > > + *
> > > + * @param rib
> > > + *   RIB object handle
> > > + * @param ip
> > > + *   IP of the rule to be deleted from the RIB
> > > + * @param depth
> > > + *   Depth of the rule to be deleted from the RIB
> > > + * @return
> > > + *   0 on success, negative value otherwise
> > > + */
> > > +int
> > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > > +
> > > +/**
> > > + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> > > + * macro, so the address of the function should not be used.
> > > + *
> > > + * @param RIB
> > > + *   RIB object handle
> > > + * @param ips
> > > + *   Array of IPs to be looked up in the FIB
> > > + * @param next_hops
> > > + *   Next hop of the most specific rule found for IP.
> > > + *   This is an array of eight byte values.
> > > + *   If the lookup for the given IP failed, then corresponding element
> > would
> > > + *   contain default value, see description of then next parameter.
> > > + * @param n
> > > + *   Number of elements in ips (and next_hops) array to lookup. This
> > should be a
> > > + *   compile time constant, and divisible by 8 for best performance.
> > > + * @param defv
> > > + *   Default value to populate into corresponding element of hop[]
> > array,
> > > + *   if lookup would fail.
> > > + *  @return
> > > + *   -EINVAL for incorrect arguments, otherwise 0
> > > + */
> > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > > +     rib->lookup(rib->fib, ips, next_hops, n)
> >
> > My main thought here is whether this needs to be a function at all?
> > Given that it takes a full burst of addresses in a single go, how much
> > performance would actually be lost by making this a regular function in
> > the C file?
> > IF we do convert this to a regular function, then a lot of the structure
> > definitions above - most importantly, the rib structure itself - can
> > probably be moved to a private header file and not exposed to
> > applications at all. This will make ABI compatibility a *lot* easier, as
> > the structures can be changed without affecting the public ABI.
> >
> I didn't quite understand what you mean.
> 
Sorry, by "needs to be a function" in first line read "needs to be a
macro". Basically, the point is to not inline anything that doesn't need
it. If a function works on a burst of packets, it probably will be fine
being a regular function than a macro or inline function.

/Bruce

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-02-21 21:44 ` [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
  2018-03-14 11:09   ` Bruce Richardson
@ 2018-03-29 10:27   ` Bruce Richardson
  2018-03-29 20:11     ` Vladimir Medvedkin
  1 sibling, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2018-03-29 10:27 UTC (permalink / raw)
  To: Medvedkin Vladimir; +Cc: dev

On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> RIB is an alternative to current LPM library.
> It solves the following problems
>  - Increases the speed of control plane operations against lpm such as
>    adding/deleting routes
>  - Adds abstraction from dataplane algorithms, so it is possible to add
>    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
>    in addition to current dir24_8
>  - It is possible to keep user defined application specific additional
>    information in struct rte_rib_node which represents route entry.
>    It can be next hop/set of next hops (i.e. active and feasible),
>    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
>    plenty of additional control plane information.
>  - For dir24_8 implementation it is possible to remove rte_lpm_tbl_entry.depth
>    field that helps to save 6 bits.
>  - Also new dir24_8 implementation supports different next_hop sizes
>    (1/2/4/8 bytes per next hop)
>  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary operator.
>    Instead it returns special default value if there is no route.
> 
> Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> ---

Hi again,

some initial comments on the dir24_8 files below.

/Bruce

>  config/common_base                 |   6 +
>  doc/api/doxy-api.conf              |   1 +
>  lib/Makefile                       |   2 +
>  lib/librte_rib/Makefile            |  22 ++
>  lib/librte_rib/rte_dir24_8.c       | 482 +++++++++++++++++++++++++++++++++
>  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
>  lib/librte_rib/rte_rib.c           | 526 +++++++++++++++++++++++++++++++++++++
>  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
>  lib/librte_rib/rte_rib_version.map |  18 ++
>  mk/rte.app.mk                      |   1 +
>  10 files changed, 1496 insertions(+)
>  create mode 100644 lib/librte_rib/Makefile
>  create mode 100644 lib/librte_rib/rte_dir24_8.c
>  create mode 100644 lib/librte_rib/rte_dir24_8.h
>  create mode 100644 lib/librte_rib/rte_rib.c
>  create mode 100644 lib/librte_rib/rte_rib.h
>  create mode 100644 lib/librte_rib/rte_rib_version.map
> 

<snip>

> diff --git a/lib/librte_rib/rte_dir24_8.c b/lib/librte_rib/rte_dir24_8.c
> new file mode 100644
> index 0000000..a12f882
> --- /dev/null
> +++ b/lib/librte_rib/rte_dir24_8.c
> @@ -0,0 +1,482 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> + */
> +
> +#include <stdint.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <rte_debug.h>
> +#include <rte_malloc.h>
> +#include <rte_prefetch.h>
> +#include <rte_errno.h>
> +
> +#include <inttypes.h>
> +
> +#include <rte_memory.h>
> +#include <rte_branch_prediction.h>
> +
> +#include <rte_rib.h>
> +#include <rte_dir24_8.h>
> +
> +#define BITMAP_SLAB_BIT_SIZE_LOG2	6
> +#define BITMAP_SLAB_BIT_SIZE		(1 << BITMAP_SLAB_BIT_SIZE_LOG2)
> +#define BITMAP_SLAB_BITMASK		(BITMAP_SLAB_BIT_SIZE - 1)
> +
> +#define ROUNDUP(x, y)	 RTE_ALIGN_CEIL(x, (1 << (32 - y)))
> +
> +static __rte_always_inline __attribute__((pure)) void *
> +get_tbl24_p(struct rte_dir24_8_tbl *fib, uint32_t ip)
> +{
> +	return (void *)&((uint8_t *)fib->tbl24)[(ip &
> +		RTE_DIR24_8_TBL24_MASK) >> (8 - fib->nh_sz)];
> +}
> +
> +#define LOOKUP_FUNC(suffix, type, bulk_prefetch)			\
> +int rte_dir24_8_lookup_bulk_##suffix(void *fib_p, const uint32_t *ips,	\
> +	uint64_t *next_hops, const unsigned n)				\
> +{									\
> +	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;	\
> +	uint64_t tmp;							\
> +	uint32_t i;							\
> +	uint32_t prefetch_offset = RTE_MIN((unsigned)bulk_prefetch, n);	\
> +									\
> +	RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ips == NULL) ||	\
> +		(next_hops == NULL)), -EINVAL);				\
> +									\
> +	for (i = 0; i < prefetch_offset; i++)				\
> +		rte_prefetch0(get_tbl24_p(fib, ips[i]));		\
> +	for (i = 0; i < (n - prefetch_offset); i++) {			\
> +		rte_prefetch0(get_tbl24_p(fib, ips[i + prefetch_offset])); \
> +		tmp = ((type *)fib->tbl24)[ips[i] >> 8];		\
> +		if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==	\
> +			RTE_DIR24_8_VALID_EXT_ENT)) {			\
> +			tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +	\
> +				((tmp >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> +		}							\
> +		next_hops[i] = tmp >> 1;				\
> +	}								\
> +	for (; i < n; i++) {						\
> +		tmp = ((type *)fib->tbl24)[ips[i] >> 8];		\
> +		if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==	\
> +			RTE_DIR24_8_VALID_EXT_ENT)) {			\
> +			tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +	\
> +				((tmp >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> +		}							\
> +		next_hops[i] = tmp >> 1;				\
> +	}								\
> +	return 0;							\
> +}									\

What is the advantage of doing this as a macro? Unless I'm missing
something "suffix" is never actually used in the function at all, and you
reference the size of the data from fix->nh_sz. Therefore there can be no
performance benefit from having such a lookup function, that I can see.

Therefore, if performance is ok, I suggest just making a single lookup_bulk
function that works with all sizes - as the inlined lookup function does in
the header. 

Alternatively, if you do want specific functions for each
entry size, you still don't need macros. Write a single function that takes
as a final parameter the entry-size and use that in calculations rather
than nh_sz.  Then wrap that function in the set of public ones, passing in
the final size parameter explicitly as "1", "2", "4" or "8". The compiler
will then know that as a compile-time constant and generate the correct
code for each size. However, for this path I suggest you check for any
resulting performance improvement, e.g. with l3fwd, as I think it's not
likely to be significant.

> +
> +static void
> +write_to_fib(void *ptr, uint64_t val, enum rte_dir24_8_nh_sz size, int n)
> +{
> +	int i;
> +	uint8_t *ptr8 = (uint8_t *)ptr;
> +	uint16_t *ptr16 = (uint16_t *)ptr;
> +	uint32_t *ptr32 = (uint32_t *)ptr;
> +	uint64_t *ptr64 = (uint64_t *)ptr;
> +
> +	switch (size) {
> +	case RTE_DIR24_8_1B:
> +		for (i = 0; i < n; i++)
> +			ptr8[i] = (uint8_t)val;
> +		break;
> +	case RTE_DIR24_8_2B:
> +		for (i = 0; i < n; i++)
> +			ptr16[i] = (uint16_t)val;
> +		break;
> +	case RTE_DIR24_8_4B:
> +		for (i = 0; i < n; i++)
> +			ptr32[i] = (uint32_t)val;
> +		break;
> +	case RTE_DIR24_8_8B:
> +		for (i = 0; i < n; i++)
> +			ptr64[i] = (uint64_t)val;
> +		break;
> +	}
> +}
> +
> +static int
> +tbl8_get_idx(struct rte_dir24_8_tbl *fib)
> +{
> +	uint32_t i;
> +	int bit_idx;
> +
> +	for (i = 0; (i < (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2)) &&
> +		(fib->tbl8_idxes[i] == UINT64_MAX); i++)
> +		;
> +	if (i <= (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2)) {
> +		bit_idx = __builtin_ctzll(~fib->tbl8_idxes[i]);
> +		fib->tbl8_idxes[i] |= (1ULL << bit_idx);
> +		return (i << BITMAP_SLAB_BIT_SIZE_LOG2) + bit_idx;
> +	}
> +	return -ENOSPC;
> +}
> +
> +static inline void
> +tbl8_free_idx(struct rte_dir24_8_tbl *fib, int idx)
> +{
> +	fib->tbl8_idxes[idx >> BITMAP_SLAB_BIT_SIZE_LOG2] &=
> +		~(1ULL << (idx & BITMAP_SLAB_BITMASK));
> +}
> +
> +static int
> +tbl8_alloc(struct rte_dir24_8_tbl *fib, uint64_t nh)
> +{
> +	int	tbl8_idx;
> +	uint8_t	*tbl8_ptr;
> +
> +	tbl8_idx = tbl8_get_idx(fib);
> +	if (tbl8_idx < 0)
> +		return tbl8_idx;
> +	tbl8_ptr = (uint8_t *)fib->tbl8 +
> +		((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) <<
> +		fib->nh_sz);
> +	/*Init tbl8 entries with nexthop from tbl24*/
> +	write_to_fib((void *)tbl8_ptr, nh|
> +		RTE_DIR24_8_VALID_EXT_ENT, fib->nh_sz,
> +		RTE_DIR24_8_TBL8_GRP_NUM_ENT);
> +	return tbl8_idx;
> +}
> +
> +static void
> +tbl8_recycle(struct rte_dir24_8_tbl *fib, uint32_t ip, uint64_t tbl8_idx)
> +{
> +	int i;
> +	uint64_t nh;
> +	uint8_t *ptr8;
> +	uint16_t *ptr16;
> +	uint32_t *ptr32;
> +	uint64_t *ptr64;
> +
> +	switch (fib->nh_sz) {
> +	case RTE_DIR24_8_1B:
> +		ptr8 = &((uint8_t *)fib->tbl8)[tbl8_idx *
> +				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> +		nh = *ptr8;
> +		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> +			if (nh != ptr8[i])
> +				return;
> +		}
> +		((uint8_t *)fib->tbl24)[ip >> 8] =
> +			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> +		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> +			ptr8[i] = 0;
> +		break;
> +	case RTE_DIR24_8_2B:
> +		ptr16 = &((uint16_t *)fib->tbl8)[tbl8_idx *
> +				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> +		nh = *ptr16;
> +		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> +			if (nh != ptr16[i])
> +				return;
> +		}
> +		((uint16_t *)fib->tbl24)[ip >> 8] =
> +			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> +		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> +			ptr16[i] = 0;
> +		break;
> +	case RTE_DIR24_8_4B:
> +		ptr32 = &((uint32_t *)fib->tbl8)[tbl8_idx *
> +				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> +		nh = *ptr32;
> +		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> +			if (nh != ptr32[i])
> +				return;
> +		}
> +		((uint32_t *)fib->tbl24)[ip >> 8] =
> +			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> +		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> +			ptr32[i] = 0;
> +		break;
> +	case RTE_DIR24_8_8B:
> +		ptr64 = &((uint64_t *)fib->tbl8)[tbl8_idx *
> +				RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> +		nh = *ptr64;
> +		for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> +			if (nh != ptr64[i])
> +				return;
> +		}
> +		((uint64_t *)fib->tbl24)[ip >> 8] =
> +			nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> +		for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> +			ptr64[i] = 0;
> +		break;
> +	}
> +	tbl8_free_idx(fib, tbl8_idx);
> +}
> +
> +static int
> +install_to_fib(struct rte_dir24_8_tbl *fib, uint32_t ledge, uint32_t redge,
> +	uint64_t next_hop)
> +{
> +	uint64_t	tbl24_tmp;
> +	int	tbl8_idx;
> +	int tmp_tbl8_idx;
> +	uint8_t	*tbl8_ptr;
> +
> +	/*case for 0.0.0.0/0*/
> +	if (unlikely((ledge == 0) && (redge == 0))) {
> +		write_to_fib(fib->tbl24, next_hop << 1, fib->nh_sz, 1 << 24);
> +		return 0;
> +	}
> +	if (ROUNDUP(ledge, 24) <= redge) {
> +		if (ledge < ROUNDUP(ledge, 24)) {
> +			tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
> +			if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> +				RTE_DIR24_8_VALID_EXT_ENT) {
> +				tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> +				tmp_tbl8_idx = tbl8_get_idx(fib);
> +				if ((tbl8_idx < 0) || (tmp_tbl8_idx < 0))
> +					return -ENOSPC;
> +				tbl8_free_idx(fib, tmp_tbl8_idx);
> +				/*update dir24 entry with tbl8 index*/
> +				write_to_fib(get_tbl24_p(fib, ledge),
> +					(tbl8_idx << 1)|
> +					RTE_DIR24_8_VALID_EXT_ENT,
> +					fib->nh_sz, 1);
> +			} else
> +				tbl8_idx = tbl24_tmp >> 1;
> +			tbl8_ptr = (uint8_t *)fib->tbl8 +
> +				(((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
> +				(ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
> +				fib->nh_sz);
> +			/*update tbl8 with new next hop*/
> +			write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> +				RTE_DIR24_8_VALID_EXT_ENT,
> +				fib->nh_sz, ROUNDUP(ledge, 24) - ledge);
> +			tbl8_recycle(fib, ledge, tbl8_idx);
> +		}
> +		if (ROUNDUP(ledge, 24) < (redge & RTE_DIR24_8_TBL24_MASK)) {
> +			write_to_fib(get_tbl24_p(fib, ROUNDUP(ledge, 24)),
> +				next_hop << 1, fib->nh_sz,
> +				((redge & RTE_DIR24_8_TBL24_MASK) -
> +				ROUNDUP(ledge, 24)) >> 8);
> +		}
> +		if (redge & ~RTE_DIR24_8_TBL24_MASK) {
> +			tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, redge);
> +			if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> +					RTE_DIR24_8_VALID_EXT_ENT) {
> +				tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> +				if (tbl8_idx < 0)
> +					return -ENOSPC;
> +				/*update dir24 entry with tbl8 index*/
> +				write_to_fib(get_tbl24_p(fib, redge),
> +					(tbl8_idx << 1)|
> +					RTE_DIR24_8_VALID_EXT_ENT,
> +					fib->nh_sz, 1);
> +			} else
> +				tbl8_idx = tbl24_tmp >> 1;
> +			tbl8_ptr = (uint8_t *)fib->tbl8 +
> +				((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) <<
> +				fib->nh_sz);
> +			/*update tbl8 with new next hop*/
> +			write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> +				RTE_DIR24_8_VALID_EXT_ENT,
> +				fib->nh_sz, redge & ~RTE_DIR24_8_TBL24_MASK);
> +			tbl8_recycle(fib, redge, tbl8_idx);
> +		}
> +	} else {
> +		tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
> +		if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> +			RTE_DIR24_8_VALID_EXT_ENT) {
> +			tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> +			if (tbl8_idx < 0)
> +				return -ENOSPC;
> +			/*update dir24 entry with tbl8 index*/
> +			write_to_fib(get_tbl24_p(fib, ledge),
> +				(tbl8_idx << 1)|
> +				RTE_DIR24_8_VALID_EXT_ENT,
> +				fib->nh_sz, 1);
> +		} else
> +			tbl8_idx = tbl24_tmp >> 1;
> +		tbl8_ptr = (uint8_t *)fib->tbl8 +
> +			(((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
> +			(ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
> +			fib->nh_sz);
> +		/*update tbl8 with new next hop*/
> +		write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> +			RTE_DIR24_8_VALID_EXT_ENT,
> +			fib->nh_sz, redge - ledge);
> +		tbl8_recycle(fib, ledge, tbl8_idx);
> +	}
> +	return 0;
> +}
> +
> +static int
> +modify_fib(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> +	uint64_t next_hop)
> +{
> +	struct rte_rib_node *tmp = NULL;
> +	struct rte_dir24_8_tbl *fib;
> +	uint32_t ledge, redge;
> +	int ret;
> +
> +	fib = rib->fib;
> +
> +	if (next_hop > DIR24_8_MAX_NH(fib))
> +		return -EINVAL;
> +
> +	ledge = ip;
> +	do {
> +		tmp = rte_rib_tree_get_nxt(rib, ip, depth, tmp,
> +			RTE_RIB_GET_NXT_COVER);
> +		if (tmp != NULL) {
> +			if (tmp->depth == depth)
> +				continue;
> +			redge = tmp->key;
> +			if (ledge == redge) {
> +				ledge = redge +
> +					(uint32_t)(1ULL << (32 - tmp->depth));
> +				continue;
> +			}
> +			ret = install_to_fib(fib, ledge, redge,
> +				next_hop);
> +			if (ret != 0)
> +				return ret;
> +			ledge = redge +
> +				(uint32_t)(1ULL << (32 - tmp->depth));
> +		} else {
> +			redge = ip + (uint32_t)(1ULL << (32 - depth));
> +			ret = install_to_fib(fib, ledge, redge,
> +				next_hop);
> +			if (ret != 0)
> +				return ret;
> +		}
> +	} while (tmp);
> +
> +	return 0;
> +}
> +
> +int
> +rte_dir24_8_modify(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> +	uint64_t next_hop, enum rte_rib_op op)
> +{
> +	struct rte_dir24_8_tbl *fib;
> +	struct rte_rib_node *tmp = NULL;
> +	struct rte_rib_node *node;
> +	struct rte_rib_node *parent;
> +	int ret = 0;
> +
> +	if ((rib == NULL) || (depth > RTE_RIB_MAXDEPTH))
> +		return -EINVAL;
> +
> +	fib = rib->fib;
> +	RTE_ASSERT(fib);
> +
> +	ip &= (uint32_t)(UINT64_MAX << (32 - depth));
> +
> +	node = rte_rib_tree_lookup_exact(rib, ip, depth);
> +	switch (op) {
> +	case RTE_RIB_ADD:
> +		if (node != NULL) {
> +			if (node->nh == next_hop)
> +				return 0;
> +			ret = modify_fib(rib, ip, depth, next_hop);
> +			if (ret == 0)
> +				node->nh = next_hop;
> +			return 0;
> +		}
> +		if (depth > 24) {
> +			tmp = rte_rib_tree_get_nxt(rib, ip, 24, NULL,
> +				RTE_RIB_GET_NXT_COVER);
> +			if ((tmp == NULL) &&
> +				(fib->cur_tbl8s >= fib->number_tbl8s))
> +				return -ENOSPC;
> +
> +		}
> +		node = rte_rib_tree_insert(rib, ip, depth);
> +		if (node == NULL)
> +			return -rte_errno;
> +		node->nh = next_hop;
> +		parent = rte_rib_tree_lookup_parent(node);
> +		if ((parent != NULL) && (parent->nh == next_hop))
> +			return 0;
> +		ret = modify_fib(rib, ip, depth, next_hop);
> +		if (ret) {
> +			rte_rib_tree_remove(rib, ip, depth);
> +			return ret;
> +		}
> +		if ((depth > 24) && (tmp == NULL))
> +			fib->cur_tbl8s++;
> +		return 0;
> +	case RTE_RIB_DEL:
> +		if (node == NULL)
> +			return -ENOENT;
> +
> +		parent = rte_rib_tree_lookup_parent(node);
> +		if (parent != NULL) {
> +			if (parent->nh != node->nh)
> +				ret = modify_fib(rib, ip, depth, parent->nh);
> +		} else
> +			ret = modify_fib(rib, ip, depth, fib->def_nh);
> +		if (ret == 0) {
> +			rte_rib_tree_remove(rib, ip, depth);
> +			if (depth > 24) {
> +				tmp = rte_rib_tree_get_nxt(rib, ip, 24, NULL,
> +					RTE_RIB_GET_NXT_COVER);
> +				if (tmp == NULL)
> +					fib->cur_tbl8s--;
> +			}
> +		}
> +		return ret;
> +	default:
> +		break;
> +	}
> +	return -EINVAL;
> +}
> +
> +struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int socket_id,
> +	enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh)
> +{
> +	char mem_name[RTE_RIB_NAMESIZE];
> +	struct rte_dir24_8_tbl *fib;
> +
> +	snprintf(mem_name, sizeof(mem_name), "FIB_%s", name);
> +	fib = rte_zmalloc_socket(name, sizeof(struct rte_dir24_8_tbl) +
> +		RTE_DIR24_8_TBL24_NUM_ENT * (1 << nh_sz), RTE_CACHE_LINE_SIZE,
> +		socket_id);
> +	if (fib == NULL)
> +		return fib;
> +
> +	snprintf(mem_name, sizeof(mem_name), "TBL8_%s", name);
> +	fib->tbl8 = rte_zmalloc_socket(mem_name, RTE_DIR24_8_TBL8_GRP_NUM_ENT *
> +			(1 << nh_sz) * RTE_DIR24_8_TBL8_NUM_GROUPS,
> +			RTE_CACHE_LINE_SIZE, socket_id);
> +	if (fib->tbl8 == NULL) {
> +		rte_free(fib);
> +		return NULL;
> +	}
> +	fib->def_nh = def_nh;
> +	fib->nh_sz = nh_sz;
> +	fib->number_tbl8s = RTE_MIN((uint32_t)RTE_DIR24_8_TBL8_NUM_GROUPS,
> +				DIR24_8_MAX_NH(fib));
> +
> +	snprintf(mem_name, sizeof(mem_name), "TBL8_idxes_%s", name);
> +	fib->tbl8_idxes = rte_zmalloc_socket(mem_name,
> +			RTE_ALIGN_CEIL(fib->number_tbl8s, 64) >> 3,
> +			RTE_CACHE_LINE_SIZE, socket_id);
> +	if (fib->tbl8_idxes == NULL) {
> +		rte_free(fib->tbl8);
> +		rte_free(fib);
> +		return NULL;
> +	}
> +
> +	return fib;
> +}
> +
> +void
> +rte_dir24_8_free(void *fib_p)
> +{
> +	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
> +
> +	rte_free(fib->tbl8_idxes);
> +	rte_free(fib->tbl8);
> +	rte_free(fib);
> +}
> +
> +LOOKUP_FUNC(1b, uint8_t, 5)
> +LOOKUP_FUNC(2b, uint16_t, 6)
> +LOOKUP_FUNC(4b, uint32_t, 15)
> +LOOKUP_FUNC(8b, uint64_t, 12)
> diff --git a/lib/librte_rib/rte_dir24_8.h b/lib/librte_rib/rte_dir24_8.h
> new file mode 100644
> index 0000000..f779409
> --- /dev/null
> +++ b/lib/librte_rib/rte_dir24_8.h
> @@ -0,0 +1,116 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> + */
> +
> +#ifndef _RTE_DIR24_8_H_
> +#define _RTE_DIR24_8_H_
> +
> +/**
> + * @file
> + * RTE Longest Prefix Match (LPM)
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/** @internal Total number of tbl24 entries. */
> +#define RTE_DIR24_8_TBL24_NUM_ENT	(1 << 24)
> +
> +/** Maximum depth value possible for IPv4 LPM. */
> +#define RTE_DIR24_8_MAX_DEPTH		32
> +
> +/** @internal Number of entries in a tbl8 group. */
> +#define RTE_DIR24_8_TBL8_GRP_NUM_ENT	256
> +
> +/** @internal Total number of tbl8 groups in the tbl8. */
> +#define RTE_DIR24_8_TBL8_NUM_GROUPS	65536
> +
> +/** @internal bitmask with valid and valid_group fields set */
> +#define RTE_DIR24_8_VALID_EXT_ENT	0x01
> +
> +#define RTE_DIR24_8_TBL24_MASK		0xffffff00
> +
> +/** Size of nexthop (1 << nh_sz) bits */
> +enum rte_dir24_8_nh_sz {
> +	RTE_DIR24_8_1B,
> +	RTE_DIR24_8_2B,
> +	RTE_DIR24_8_4B,
> +	RTE_DIR24_8_8B
> +};
> +
> +
> +#define DIR24_8_BITS_IN_NH(fib)		(8 * (1 << fib->nh_sz))
> +#define DIR24_8_MAX_NH(fib)	((1ULL << (DIR24_8_BITS_IN_NH(fib) - 1)) - 1)
> +
> +#define DIR24_8_TBL_IDX(a, fib)		((a) >> (3 - fib->nh_sz))
> +#define DIR24_8_PSD_IDX(a, fib)		((a) & ((1 << (3 - fib->nh_sz)) - 1))
> +
> +#define DIR24_8_TBL24_VAL(ip)	(ip >> 8)
> +#define DIR24_8_TBL8_VAL(res, ip)					\
> +	((res >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT + (uint8_t)ip)	\
> +
> +#define DIR24_8_LOOKUP_MSK						\
> +	(((1ULL << ((1 << (fib->nh_sz + 3)) - 1)) << 1) - 1)		\
> +
> +#define RTE_DIR24_8_GET_TBL24(fib, ip)					\
> +	((fib->tbl24[DIR24_8_TBL_IDX(DIR24_8_TBL24_VAL(ip), fib)] >>	\
> +	(DIR24_8_PSD_IDX(DIR24_8_TBL24_VAL(ip), fib) *			\
> +	DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)			\
> +
> +#define RTE_DIR24_8_GET_TBL8(fib, res, ip)				\
> +	((fib->tbl8[DIR24_8_TBL_IDX(DIR24_8_TBL8_VAL(res, ip), fib)] >>	\
> +	(DIR24_8_PSD_IDX(DIR24_8_TBL8_VAL(res, ip), fib) *		\
> +	DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)			\
> 
I would strongly suggest making each of the above macros into inline
functions instead. It would allow easier readability since you have
parameter types and can split things across lines easier.
Also, some comments might be good too.

+
> +
> +struct rte_dir24_8_tbl {
> +	uint32_t	number_tbl8s;	/**< Total number of tbl8s. */
> +	uint32_t	cur_tbl8s;	/**< Current cumber of tbl8s. */
> +	uint64_t	def_nh;
> +	enum rte_dir24_8_nh_sz	nh_sz;	/**< Size of nexthop entry */
> +	uint64_t	*tbl8;		/**< LPM tbl8 table. */
> +	uint64_t	*tbl8_idxes;
> +	uint64_t	tbl24[0] __rte_cache_aligned; /**< LPM tbl24 table. */
> +};
> +
> +struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int socket_id,
> +	enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh);
> +void rte_dir24_8_free(void *fib_p);
> +int rte_dir24_8_modify(struct rte_rib *rib, uint32_t key,
> +	uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> +int rte_dir24_8_lookup_bulk_1b(void *fib_p, const uint32_t *ips,
> +	uint64_t *next_hops, const unsigned n);
> +int rte_dir24_8_lookup_bulk_2b(void *fib_p, const uint32_t *ips,
> +	uint64_t *next_hops, const unsigned n);
> +int rte_dir24_8_lookup_bulk_4b(void *fib_p, const uint32_t *ips,
> +	uint64_t *next_hops, const unsigned n);
> +int rte_dir24_8_lookup_bulk_8b(void *fib_p, const uint32_t *ips,
> +	uint64_t *next_hops, const unsigned n);
> +
> +
> +static inline int
> +rte_dir24_8_lookup(void *fib_p, uint32_t ip, uint64_t *next_hop)

Why use void * as parameter, since the proper type is defined just above?

> +{
> +	uint64_t res;
> +	struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
> +
> +	RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ip == NULL) ||
> +		(next_hop == NULL)), -EINVAL);
> +
> +	res = RTE_DIR24_8_GET_TBL24(fib, ip);
> +	if (unlikely((res & RTE_DIR24_8_VALID_EXT_ENT) ==
> +		RTE_DIR24_8_VALID_EXT_ENT)) {
> +		res = RTE_DIR24_8_GET_TBL8(fib, res, ip);
> +	}
> +	*next_hop = res >> 1;
> +	return 0;
> +}

Do we need this static inline function? Can the bulk functions do on their
own? If we can remove this, we can move the most of the header file
contents, especially the structures, out of the public header. That would
greatly improve the ease with which ABI can be maintained.

> +
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_DIR24_8_H_ */
> +

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-26  9:50       ` Bruce Richardson
@ 2018-03-29 19:59         ` Vladimir Medvedkin
  0 siblings, 0 replies; 11+ messages in thread
From: Vladimir Medvedkin @ 2018-03-29 19:59 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

2018-03-26 12:50 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:

> On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote:
> > Hi,
> >
> > 2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com
> >:
> >
> > > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > > RIB is an alternative to current LPM library.
> > > > It solves the following problems
> > > >  - Increases the speed of control plane operations against lpm such
> as
> > > >    adding/deleting routes
> > > >  - Adds abstraction from dataplane algorithms, so it is possible to
> add
> > > >    different ip route lookup algorythms such as
> DXR/poptrie/lpc-trie/etc
> > > >    in addition to current dir24_8
> > > >  - It is possible to keep user defined application specific
> additional
> > > >    information in struct rte_rib_node which represents route entry.
> > > >    It can be next hop/set of next hops (i.e. active and feasible),
> > > >    pointers to link rte_rib_node based on some criteria (i.e.
> next_hop),
> > > >    plenty of additional control plane information.
> > > >  - For dir24_8 implementation it is possible to remove
> > > rte_lpm_tbl_entry.depth
> > > >    field that helps to save 6 bits.
> > > >  - Also new dir24_8 implementation supports different next_hop sizes
> > > >    (1/2/4/8 bytes per next hop)
> > > >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate
> ternary
> > > operator.
> > > >    Instead it returns special default value if there is no route.
> > > >
> > > > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > > > ---
> > > >  config/common_base                 |   6 +
> > > >  doc/api/doxy-api.conf              |   1 +
> > > >  lib/Makefile                       |   2 +
> > > >  lib/librte_rib/Makefile            |  22 ++
> > > >  lib/librte_rib/rte_dir24_8.c       | 482
> ++++++++++++++++++++++++++++++
> > > +++
> > > >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> > > >  lib/librte_rib/rte_rib.c           | 526
> ++++++++++++++++++++++++++++++
> > > +++++++
> > > >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> > > >  lib/librte_rib/rte_rib_version.map |  18 ++
> > > >  mk/rte.app.mk                      |   1 +
> > > >  10 files changed, 1496 insertions(+)
> > > >  create mode 100644 lib/librte_rib/Makefile
> > > >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> > > >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> > > >  create mode 100644 lib/librte_rib/rte_rib.c
> > > >  create mode 100644 lib/librte_rib/rte_rib.h
> > > >  create mode 100644 lib/librte_rib/rte_rib_version.map
> > > >
> > >
> > > First pass review comments. For now just reviewed the main public
> header
> > > file rte_rib.h. Later reviews will cover the other files as best I can.
> > >
> > > /Bruce
> > >
> > > <snip>
> > > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > > > new file mode 100644
> > > > index 0000000..6eac8fb
> > > > --- /dev/null
> > > > +++ b/lib/librte_rib/rte_rib.h
> > > > @@ -0,0 +1,322 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > > > + */
> > > > +
> > > > +#ifndef _RTE_RIB_H_
> > > > +#define _RTE_RIB_H_
> > > > +
> > > > +/**
> > > > + * @file
> > > > + * Compressed trie implementation for Longest Prefix Match
> > > > + */
> > > > +
> > > > +/** @internal Macro to enable/disable run-time checks. */
> > > > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > > > +     if (cond)                                       \
> > > > +             return retval;                          \
> > > > +} while (0)
> > > > +#else
> > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > > > +#endif
> > >
> > > use RTE_ASSERT?
> > >
> > it was done just like it was done in the LPM lib. But if you think it
> > should be RTE_ASSERT so be it.
> >
> >
> > >
> > > > +
> > > > +#define RTE_RIB_VALID_NODE   1
> > >
> > > should there be an INVALID_NODE macro?
> > >
> > No
> >
> >
> > >
> > > > +#define RTE_RIB_GET_NXT_ALL  0
> > > > +#define RTE_RIB_GET_NXT_COVER        1
> > > > +
> > > > +#define RTE_RIB_INVALID_ROUTE        0
> > > > +#define RTE_RIB_VALID_ROUTE  1
> > > > +
> > > > +/** Max number of characters in RIB name. */
> > > > +#define RTE_RIB_NAMESIZE     64
> > > > +
> > > > +/** Maximum depth value possible for IPv4 RIB. */
> > > > +#define RTE_RIB_MAXDEPTH     32
> > >
> > > I think we should have IPv4 in the name here. Will it not be extended
> to
> > > support IPv6 in future?
> > >
> > I think there should be a separate implementation of the library for ipv6
> >
> I can understand the need for a separate LPM implementation, but should
> they both not be under the same rib library?
>
I planned to have sepatate rib6 for ipv6.
Of course it is possible to make universal abstraction for both v4 and v6.
But in this case there will be universal rte_rib_node with type (v4|v6) and
variable length, keys will become union of uint32_t for v4 and uint8_t [16]
for v6. I think it is overcomplication.


> >
> > >
> > > > +
> > > > +/**
> > > > + * Macro to check if prefix1 {key1/depth1}
> > > > + * is covered by prefix2 {key2/depth2}
> > > > + */
> > > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
> > >      \
> > > > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) ==
> 0)\
> > > > +             && (depth1 > depth2))
> > > Neat check!
> > >
> > > Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
> >
> > in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
> > UINT32_MAX because shift count will be masked to 5 bits.
> >
> > I think you can avoid the casting and have a slightly shorter mask by
> > > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> > > "~(UINT32_MAX >> depth2)"
> > > I'd also suggest for readability putting the second check first, and,
> > > for maintainability, using an inline function rather than a macro.
> > >
> >  Agree, it looks clearer
> >
> >
> > > > +
> > > > +/** @internal Macro to get next node in tree*/
> > > > +#define RTE_RIB_GET_NXT_NODE(node, key)
> > >       \
> > > > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > > > +/** @internal Macro to check if node is right child*/
> > > > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
> > >
> > > Again, consider inline fns rather than macros.
> > >
> > Ok
> >
> > For the latter macro, rather than doing additional pointer derefs to
> > > parent, can you also get if it's a right node by using:
> > > "(node->key & (1 << (32 - node->depth)))"?
> > >
> > No, it is not possible. Decision whether node be left or right is made
> > using parent and child common depth.
> > Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common
> depth
> > will be /8 and 10.128.0.0/24 will be right child.
> >
> >
> > > > +
> > > > +
> > > > +struct rte_rib_node {
> > > > +     struct rte_rib_node *left;
> > > > +     struct rte_rib_node *right;
> > > > +     struct rte_rib_node *parent;
> > > > +     uint32_t        key;
> > > > +     uint8_t         depth;
> > > > +     uint8_t         flag;
> > > > +     uint64_t        nh;
> > > > +     uint64_t        ext[0];
> > > > +};
> > > > +
> > > > +struct rte_rib;
> > > > +
> > > > +/** Type of FIB struct*/
> > > > +enum rte_rib_type {
> > > > +     RTE_RIB_DIR24_8_1B,
> > > > +     RTE_RIB_DIR24_8_2B,
> > > > +     RTE_RIB_DIR24_8_4B,
> > > > +     RTE_RIB_DIR24_8_8B,
> > > > +     RTE_RIB_TYPE_MAX
> > > > +};
> > >
> > > If the plan is to support multiple underlying fib types and algorithms
> > > under the rib library, would it not be better to separate out the
> > > algorithm part from the data storage part? So have the type just be
> > > DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
> > >
> > Yes, we were talk about it in IRC, agree. Now I pass next hop size in
> > union rte_rib_fib_conf inside rte_rib_conf
> >
> >
> > >
> > > > +
> > > > +enum rte_rib_op {
> > > > +     RTE_RIB_ADD,
> > > > +     RTE_RIB_DEL
> > > > +};
> > > > +
> > > > +/** RIB nodes allocation type */
> > > > +enum rte_rib_alloc_type {
> > > > +     RTE_RIB_MALLOC,
> > > > +     RTE_RIB_MEMPOOL,
> > > > +     RTE_RIB_ALLOC_MAX
> > > > +};
> > >
> > > Not sure you need this any more. Malloc allocations and mempool
> > > allocations are now pretty much the same thing.
> > >
> > Actually I think to remove malloc. On performance tests with
> > adding/deleting huge amount of routes malloc is slower. Maybe because of
> > fragmentation.
> > What do you think?
> >
> Yes, definitely mempool allocations are the way to go!
>
> >
> > > > +
> > > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t
> key,
> > > > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> > >
> > > Do you anticipate more ops in future than just add and delete? If not,
> > > why not just split this function into two and drop the op struct.
> > >
> > It is difficult question. I'm not ready to make decision at the moment.
> >
> >
> > >
> > > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t
> *ips,
> > > > +     uint64_t *next_hops, const unsigned n);
> > > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct
> rte_rib
> > > *rib);
> > > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > > > +     struct rte_rib_node *node);
> > > > +
> > > > +struct rte_rib {
> > > > +     char name[RTE_RIB_NAMESIZE];
> > > > +     /*pointer to rib trie*/
> > > > +     struct rte_rib_node     *trie;
> > > > +     /*pointer to dataplane struct*/
> > > > +     void    *fib;
> > > > +     /*prefix modification*/
> > > > +     rte_rib_modify_fn_t     modify;
> > > > +     /* Bulk lookup fn*/
> > > > +     rte_rib_tree_lookup_fn_t        lookup;
> > > > +     /*alloc trie element*/
> > > > +     rte_rib_alloc_node_fn_t alloc_node;
> > > > +     /*free trie element*/
> > > > +     rte_rib_free_node_fn_t  free_node;
> > > > +     struct rte_mempool      *node_pool;
> > > > +     uint32_t                cur_nodes;
> > > > +     uint32_t                cur_routes;
> > > > +     int                     max_nodes;
> > > > +     int                     node_sz;
> > > > +     enum rte_rib_type       type;
> > > > +     enum rte_rib_alloc_type alloc_type;
> > > > +};
> > > > +
> > > > +/** RIB configuration structure */
> > > > +struct rte_rib_conf {
> > > > +     enum rte_rib_type       type;
> > > > +     enum rte_rib_alloc_type alloc_type;
> > > > +     int     max_nodes;
> > > > +     size_t  node_sz;
> > > > +     uint64_t def_nh;
> > > > +};
> > > > +
> > > > +/**
> > > > + * Lookup an IP into the RIB structure
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  IP to be looked up in the RIB
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > > > +
> > > > +/**
> > > > + * Lookup less specific route into the RIB structure
> > > > + *
> > > > + * @param ent
> > > > + *  Pointer to struct rte_rib_node that represents target route
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node that represents
> > > > + *  less specific route on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > > > +
> > > > +/**
> > > > + * Lookup prefix into the RIB structure
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be looked up in the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key,
> uint8_t
> > > depth);
> > >
> > > Can you explain the difference between this and regular lookup, and how
> > > they would be used. I don't think the names convey the differences
> > > sufficiently, and so we should look to rename one or both to be
> clearer.
> > >
> > Regular lookup (rte_rib_tree_lookup) will lookup for most specific node
> for
> > passed key.
> > rte_rib_tree_lookup_exact will lookup node contained key and depth equal
> to
> > passed in args. It used to find exact route.
> >
> So if there is no node exactly matching the parameters, it the lookup_exact
> returns failure? E.g. if you request a /24 node, it won't return a /8 node
> that would cover the /24?
>
yes, it returns failure. Use  rte_rib_tree_lookup(without depth) and if you
want use  rte_rib_tree_lookup_parent after.


> >
> > >
> > > > +
> > > > +/**
> > > > + * Retrieve next more specific prefix from the RIB
> > > s/more/most/
> > >
> >
> > > > + * that is covered by key/depth supernet
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net address of supernet prefix that covers returned more
> specific
> > > prefixes
> > > > + * @param depth
> > > > + *  supernet prefix length
> > > > + * @param cur
> > > > + *   pointer to the last returned prefix to get next prefix
> > > > + *   or
> > > > + *   NULL to get first more specific prefix
> > > > + * @param flag
> > > > + *  -RTE_RIB_GET_NXT_ALL
> > > > + *   get all prefixes from subtrie
> > >
> > > By all prefixes do you mean more specific, i.e. the final prefix?
> > >
> > What do you mean the final prefix?
> >
> The most specific one, or the longest prefix.

This function is created for different task, not for lpm lookup. This
function is for traverse on trie and retrieve routes that falls under the
key/depth so term the longest prefix is irrelevant here.
Let me explain with an example. Imagine there are 10.0.0.0/8, 10.0.0.0/24,
10.0.0.10/32 and 10.64.0.0/10 in routing table.
You have code like:

rte_rib_node *tmp = NULL;
do {
    tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 8,
RTE_RIB_GET_NXT_ALL); /* retrieve all routes that belongs to 10.0.0.0/8 */
    if (node)
          printf("%u/%u\n", tmp->key, tmp->depth);
} while (tmp);

in this case you will see all subprefixes, but without 10.0.0.0/8:
10.0.0.0/24
10.0.0.10/32
10.64.0.0/10

If you want 10.0.0.0/8 do
tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 7, RTE_RIB_GET_NXT_ALL); /*
retrieve all routes that belongs to 10.0.0.0/7 */

And if you call it with RTE_RIB_GET_NXT_COVER like
tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 8, RTE_RIB_GET_NXT_COVER);
you will get
10.0.0.0/24
10.64.0.0/10
without
10.0.0.10/32
This is useful if you want to get gaps for 10.0.0.0/8 that not covered by
presented routes.


>
> >
> > > > + *  -RTE_RIB_GET_NXT_COVER
> > > > + *   get only first more specific prefix even if it have more
> specifics
> > > > + * @return
> > > > + *  pointer to the next more specific prefix
> > > > + *  or
> > > > + *  NULL if there is no prefixes left
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t
> depth,
> > > > +     struct rte_rib_node *cur, int flag);
> > > > +
> > > > +/**
> > > > + * Remove prefix from the RIB
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be removed from the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + */
> > > > +void
> > > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
> > > > +
> > > > +/**
> > > > + * Insert prefix into the RIB
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be inserted to the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + * @return
> > > > + *  pointer to new rte_rib_node on success
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
> > > > +
> > > > +/**
> > > > + * Create RIB
> > > > + *
> > > > + * @param name
> > > > + *  RIB name
> > > > + * @param socket_id
> > > > + *  NUMA socket ID for RIB table memory allocation
> > > > + * @param conf
> > > > + *  Structure containing the configuration
> > > > + * @return
> > > > + *  Handle to RIB object on success
> > > > + *  NULL otherwise with rte_errno set to an appropriate values.
> > > > + */
> > > > +struct rte_rib *
> > > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> > > *conf);
> > > > +
> > > > +/**
> > > > + * Find an existing RIB object and return a pointer to it.
> > > > + *
> > > > + * @param name
> > > > + *  Name of the rib object as passed to rte_rib_create()
> > > > + * @return
> > > > + *  Pointer to rib object or NULL if object not found with rte_errno
> > > > + *  set appropriately. Possible rte_errno values include:
> > > > + *   - ENOENT - required entry not available to return.
> > > > + */
> > > > +struct rte_rib *
> > > > +rte_rib_find_existing(const char *name);
> > > > +
> > > > +/**
> > > > + * Free an RIB object.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @return
> > > > + *   None
> > > > + */
> > > > +void
> > > > +rte_rib_free(struct rte_rib *rib);
> > > > +
> > > > +/**
> > > > + * Add a rule to the RIB.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @param ip
> > > > + *   IP of the rule to be added to the RIB
> > > > + * @param depth
> > > > + *   Depth of the rule to be added to the RIB
> > > > + * @param next_hop
> > > > + *   Next hop of the rule to be added to the RIB
> > > > + * @return
> > > > + *   0 on success, negative value otherwise
> > > > + */
> > > > +int
> > > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> uint64_t
> > > next_hop);
> > > > +
> > > > +/**
> > > > + * Delete a rule from the RIB.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @param ip
> > > > + *   IP of the rule to be deleted from the RIB
> > > > + * @param depth
> > > > + *   Depth of the rule to be deleted from the RIB
> > > > + * @return
> > > > + *   0 on success, negative value otherwise
> > > > + */
> > > > +int
> > > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > > > +
> > > > +/**
> > > > + * Lookup multiple IP addresses in an FIB. This may be implemented
> as a
> > > > + * macro, so the address of the function should not be used.
> > > > + *
> > > > + * @param RIB
> > > > + *   RIB object handle
> > > > + * @param ips
> > > > + *   Array of IPs to be looked up in the FIB
> > > > + * @param next_hops
> > > > + *   Next hop of the most specific rule found for IP.
> > > > + *   This is an array of eight byte values.
> > > > + *   If the lookup for the given IP failed, then corresponding
> element
> > > would
> > > > + *   contain default value, see description of then next parameter.
> > > > + * @param n
> > > > + *   Number of elements in ips (and next_hops) array to lookup. This
> > > should be a
> > > > + *   compile time constant, and divisible by 8 for best performance.
> > > > + * @param defv
> > > > + *   Default value to populate into corresponding element of hop[]
> > > array,
> > > > + *   if lookup would fail.
> > > > + *  @return
> > > > + *   -EINVAL for incorrect arguments, otherwise 0
> > > > + */
> > > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > > > +     rib->lookup(rib->fib, ips, next_hops, n)
> > >
> > > My main thought here is whether this needs to be a function at all?
> > > Given that it takes a full burst of addresses in a single go, how much
> > > performance would actually be lost by making this a regular function in
> > > the C file?
> > > IF we do convert this to a regular function, then a lot of the
> structure
> > > definitions above - most importantly, the rib structure itself - can
> > > probably be moved to a private header file and not exposed to
> > > applications at all. This will make ABI compatibility a *lot* easier,
> as
> > > the structures can be changed without affecting the public ABI.
> > >
> > I didn't quite understand what you mean.
> >
> Sorry, by "needs to be a function" in first line read "needs to be a
> macro". Basically, the point is to not inline anything that doesn't need
> it. If a function works on a burst of packets, it probably will be fine
> being a regular function than a macro or inline function.
>
ok, got it after conversation in IRC

>
> /Bruce
>



-- 
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-29 10:27   ` Bruce Richardson
@ 2018-03-29 20:11     ` Vladimir Medvedkin
  2018-03-29 20:41       ` Bruce Richardson
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Medvedkin @ 2018-03-29 20:11 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

2018-03-29 13:27 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:

> On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > RIB is an alternative to current LPM library.
> > It solves the following problems
> >  - Increases the speed of control plane operations against lpm such as
> >    adding/deleting routes
> >  - Adds abstraction from dataplane algorithms, so it is possible to add
> >    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> >    in addition to current dir24_8
> >  - It is possible to keep user defined application specific additional
> >    information in struct rte_rib_node which represents route entry.
> >    It can be next hop/set of next hops (i.e. active and feasible),
> >    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> >    plenty of additional control plane information.
> >  - For dir24_8 implementation it is possible to remove
> rte_lpm_tbl_entry.depth
> >    field that helps to save 6 bits.
> >  - Also new dir24_8 implementation supports different next_hop sizes
> >    (1/2/4/8 bytes per next hop)
> >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> operator.
> >    Instead it returns special default value if there is no route.
> >
> > Signed-off-by: Medvedkin Vladimir <medvedkinv@gmail.com>
> > ---
>
> Hi again,
>
> some initial comments on the dir24_8 files below.
>
> /Bruce
>
> >  config/common_base                 |   6 +
> >  doc/api/doxy-api.conf              |   1 +
> >  lib/Makefile                       |   2 +
> >  lib/librte_rib/Makefile            |  22 ++
> >  lib/librte_rib/rte_dir24_8.c       | 482 ++++++++++++++++++++++++++++++
> +++
> >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> >  lib/librte_rib/rte_rib.c           | 526 ++++++++++++++++++++++++++++++
> +++++++
> >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> >  lib/librte_rib/rte_rib_version.map |  18 ++
> >  mk/rte.app.mk                      |   1 +
> >  10 files changed, 1496 insertions(+)
> >  create mode 100644 lib/librte_rib/Makefile
> >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> >  create mode 100644 lib/librte_rib/rte_rib.c
> >  create mode 100644 lib/librte_rib/rte_rib.h
> >  create mode 100644 lib/librte_rib/rte_rib_version.map
> >
>
> <snip>
>
> > diff --git a/lib/librte_rib/rte_dir24_8.c b/lib/librte_rib/rte_dir24_8.c
> > new file mode 100644
> > index 0000000..a12f882
> > --- /dev/null
> > +++ b/lib/librte_rib/rte_dir24_8.c
> > @@ -0,0 +1,482 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > + */
> > +
> > +#include <stdint.h>
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <string.h>
> > +#include <rte_debug.h>
> > +#include <rte_malloc.h>
> > +#include <rte_prefetch.h>
> > +#include <rte_errno.h>
> > +
> > +#include <inttypes.h>
> > +
> > +#include <rte_memory.h>
> > +#include <rte_branch_prediction.h>
> > +
> > +#include <rte_rib.h>
> > +#include <rte_dir24_8.h>
> > +
> > +#define BITMAP_SLAB_BIT_SIZE_LOG2    6
> > +#define BITMAP_SLAB_BIT_SIZE         (1 << BITMAP_SLAB_BIT_SIZE_LOG2)
> > +#define BITMAP_SLAB_BITMASK          (BITMAP_SLAB_BIT_SIZE - 1)
> > +
> > +#define ROUNDUP(x, y)         RTE_ALIGN_CEIL(x, (1 << (32 - y)))
> > +
> > +static __rte_always_inline __attribute__((pure)) void *
> > +get_tbl24_p(struct rte_dir24_8_tbl *fib, uint32_t ip)
> > +{
> > +     return (void *)&((uint8_t *)fib->tbl24)[(ip &
> > +             RTE_DIR24_8_TBL24_MASK) >> (8 - fib->nh_sz)];
> > +}
> > +
> > +#define LOOKUP_FUNC(suffix, type, bulk_prefetch)                     \
> > +int rte_dir24_8_lookup_bulk_##suffix(void *fib_p, const uint32_t
> *ips,       \
> > +     uint64_t *next_hops, const unsigned n)                          \
> > +{                                                                    \
> > +     struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;  \
> > +     uint64_t tmp;                                                   \
> > +     uint32_t i;                                                     \
> > +     uint32_t prefetch_offset = RTE_MIN((unsigned)bulk_prefetch, n); \
> > +                                                                     \
> > +     RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ips == NULL) ||       \
> > +             (next_hops == NULL)), -EINVAL);                         \
> > +                                                                     \
> > +     for (i = 0; i < prefetch_offset; i++)                           \
> > +             rte_prefetch0(get_tbl24_p(fib, ips[i]));                \
> > +     for (i = 0; i < (n - prefetch_offset); i++) {                   \
> > +             rte_prefetch0(get_tbl24_p(fib, ips[i + prefetch_offset]));
> \
> > +             tmp = ((type *)fib->tbl24)[ips[i] >> 8];                \
> > +             if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==       \
> > +                     RTE_DIR24_8_VALID_EXT_ENT)) {                   \
> > +                     tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +     \
> > +                             ((tmp >> 1) *
> RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> > +             }                                                       \
> > +             next_hops[i] = tmp >> 1;                                \
> > +     }                                                               \
> > +     for (; i < n; i++) {                                            \
> > +             tmp = ((type *)fib->tbl24)[ips[i] >> 8];                \
> > +             if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==       \
> > +                     RTE_DIR24_8_VALID_EXT_ENT)) {                   \
> > +                     tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +     \
> > +                             ((tmp >> 1) *
> RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> > +             }                                                       \
> > +             next_hops[i] = tmp >> 1;                                \
> > +     }                                                               \
> > +     return 0;                                                       \
> > +}                                                                    \
>
> What is the advantage of doing this as a macro? Unless I'm missing
> something "suffix" is never actually used in the function at all, and you
> reference the size of the data from fix->nh_sz. Therefore there can be no
> performance benefit from having such a lookup function, that I can see.
>
suffix is to create 4 different function names, look at the end of
rte_dir24_8.c, there are
LOOKUP_FUNC(1b, uint8_t, 5)
LOOKUP_FUNC(2b, uint16_t, 6)
LOOKUP_FUNC(4b, uint32_t, 15)
LOOKUP_FUNC(8b, uint64_t, 12)

Now I made single lookup function that references the size of the data from
fib->nh_sz instead of static casting to passed type in macro.
was:
BULK RIB Lookup: 24.2 cycles (fails = 0.0%)
become:
BULK RIB Lookup: 26.1 cycles (fails = 0.0%)
proc E3-1230v1@3.6Ghz


>
> Therefore, if performance is ok, I suggest just making a single lookup_bulk
> function that works with all sizes - as the inlined lookup function does in
> the header.
>
> Alternatively, if you do want specific functions for each
> entry size, you still don't need macros. Write a single function that takes
> as a final parameter the entry-size and use that in calculations rather
> than nh_sz.  Then wrap that function in the set of public ones, passing in
> the final size parameter explicitly as "1", "2", "4" or "8". The compiler
> will then know that as a compile-time constant and generate the correct
> code for each size. However, for this path I suggest you check for any
> resulting performance improvement, e.g. with l3fwd, as I think it's not
> likely to be significant.
>
> > +
> > +static void
> > +write_to_fib(void *ptr, uint64_t val, enum rte_dir24_8_nh_sz size, int
> n)
> > +{
> > +     int i;
> > +     uint8_t *ptr8 = (uint8_t *)ptr;
> > +     uint16_t *ptr16 = (uint16_t *)ptr;
> > +     uint32_t *ptr32 = (uint32_t *)ptr;
> > +     uint64_t *ptr64 = (uint64_t *)ptr;
> > +
> > +     switch (size) {
> > +     case RTE_DIR24_8_1B:
> > +             for (i = 0; i < n; i++)
> > +                     ptr8[i] = (uint8_t)val;
> > +             break;
> > +     case RTE_DIR24_8_2B:
> > +             for (i = 0; i < n; i++)
> > +                     ptr16[i] = (uint16_t)val;
> > +             break;
> > +     case RTE_DIR24_8_4B:
> > +             for (i = 0; i < n; i++)
> > +                     ptr32[i] = (uint32_t)val;
> > +             break;
> > +     case RTE_DIR24_8_8B:
> > +             for (i = 0; i < n; i++)
> > +                     ptr64[i] = (uint64_t)val;
> > +             break;
> > +     }
> > +}
> > +
> > +static int
> > +tbl8_get_idx(struct rte_dir24_8_tbl *fib)
> > +{
> > +     uint32_t i;
> > +     int bit_idx;
> > +
> > +     for (i = 0; (i < (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2))
> &&
> > +             (fib->tbl8_idxes[i] == UINT64_MAX); i++)
> > +             ;
> > +     if (i <= (fib->number_tbl8s >> BITMAP_SLAB_BIT_SIZE_LOG2)) {
> > +             bit_idx = __builtin_ctzll(~fib->tbl8_idxes[i]);
> > +             fib->tbl8_idxes[i] |= (1ULL << bit_idx);
> > +             return (i << BITMAP_SLAB_BIT_SIZE_LOG2) + bit_idx;
> > +     }
> > +     return -ENOSPC;
> > +}
> > +
> > +static inline void
> > +tbl8_free_idx(struct rte_dir24_8_tbl *fib, int idx)
> > +{
> > +     fib->tbl8_idxes[idx >> BITMAP_SLAB_BIT_SIZE_LOG2] &=
> > +             ~(1ULL << (idx & BITMAP_SLAB_BITMASK));
> > +}
> > +
> > +static int
> > +tbl8_alloc(struct rte_dir24_8_tbl *fib, uint64_t nh)
> > +{
> > +     int     tbl8_idx;
> > +     uint8_t *tbl8_ptr;
> > +
> > +     tbl8_idx = tbl8_get_idx(fib);
> > +     if (tbl8_idx < 0)
> > +             return tbl8_idx;
> > +     tbl8_ptr = (uint8_t *)fib->tbl8 +
> > +             ((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) <<
> > +             fib->nh_sz);
> > +     /*Init tbl8 entries with nexthop from tbl24*/
> > +     write_to_fib((void *)tbl8_ptr, nh|
> > +             RTE_DIR24_8_VALID_EXT_ENT, fib->nh_sz,
> > +             RTE_DIR24_8_TBL8_GRP_NUM_ENT);
> > +     return tbl8_idx;
> > +}
> > +
> > +static void
> > +tbl8_recycle(struct rte_dir24_8_tbl *fib, uint32_t ip, uint64_t
> tbl8_idx)
> > +{
> > +     int i;
> > +     uint64_t nh;
> > +     uint8_t *ptr8;
> > +     uint16_t *ptr16;
> > +     uint32_t *ptr32;
> > +     uint64_t *ptr64;
> > +
> > +     switch (fib->nh_sz) {
> > +     case RTE_DIR24_8_1B:
> > +             ptr8 = &((uint8_t *)fib->tbl8)[tbl8_idx *
> > +                             RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> > +             nh = *ptr8;
> > +             for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> > +                     if (nh != ptr8[i])
> > +                             return;
> > +             }
> > +             ((uint8_t *)fib->tbl24)[ip >> 8] =
> > +                     nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> > +             for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> > +                     ptr8[i] = 0;
> > +             break;
> > +     case RTE_DIR24_8_2B:
> > +             ptr16 = &((uint16_t *)fib->tbl8)[tbl8_idx *
> > +                             RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> > +             nh = *ptr16;
> > +             for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> > +                     if (nh != ptr16[i])
> > +                             return;
> > +             }
> > +             ((uint16_t *)fib->tbl24)[ip >> 8] =
> > +                     nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> > +             for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> > +                     ptr16[i] = 0;
> > +             break;
> > +     case RTE_DIR24_8_4B:
> > +             ptr32 = &((uint32_t *)fib->tbl8)[tbl8_idx *
> > +                             RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> > +             nh = *ptr32;
> > +             for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> > +                     if (nh != ptr32[i])
> > +                             return;
> > +             }
> > +             ((uint32_t *)fib->tbl24)[ip >> 8] =
> > +                     nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> > +             for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> > +                     ptr32[i] = 0;
> > +             break;
> > +     case RTE_DIR24_8_8B:
> > +             ptr64 = &((uint64_t *)fib->tbl8)[tbl8_idx *
> > +                             RTE_DIR24_8_TBL8_GRP_NUM_ENT];
> > +             nh = *ptr64;
> > +             for (i = 1; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++) {
> > +                     if (nh != ptr64[i])
> > +                             return;
> > +             }
> > +             ((uint64_t *)fib->tbl24)[ip >> 8] =
> > +                     nh & ~RTE_DIR24_8_VALID_EXT_ENT;
> > +             for (i = 0; i < RTE_DIR24_8_TBL8_GRP_NUM_ENT; i++)
> > +                     ptr64[i] = 0;
> > +             break;
> > +     }
> > +     tbl8_free_idx(fib, tbl8_idx);
> > +}
> > +
> > +static int
> > +install_to_fib(struct rte_dir24_8_tbl *fib, uint32_t ledge, uint32_t
> redge,
> > +     uint64_t next_hop)
> > +{
> > +     uint64_t        tbl24_tmp;
> > +     int     tbl8_idx;
> > +     int tmp_tbl8_idx;
> > +     uint8_t *tbl8_ptr;
> > +
> > +     /*case for 0.0.0.0/0*/
> > +     if (unlikely((ledge == 0) && (redge == 0))) {
> > +             write_to_fib(fib->tbl24, next_hop << 1, fib->nh_sz, 1 <<
> 24);
> > +             return 0;
> > +     }
> > +     if (ROUNDUP(ledge, 24) <= redge) {
> > +             if (ledge < ROUNDUP(ledge, 24)) {
> > +                     tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
> > +                     if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> > +                             RTE_DIR24_8_VALID_EXT_ENT) {
> > +                             tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> > +                             tmp_tbl8_idx = tbl8_get_idx(fib);
> > +                             if ((tbl8_idx < 0) || (tmp_tbl8_idx < 0))
> > +                                     return -ENOSPC;
> > +                             tbl8_free_idx(fib, tmp_tbl8_idx);
> > +                             /*update dir24 entry with tbl8 index*/
> > +                             write_to_fib(get_tbl24_p(fib, ledge),
> > +                                     (tbl8_idx << 1)|
> > +                                     RTE_DIR24_8_VALID_EXT_ENT,
> > +                                     fib->nh_sz, 1);
> > +                     } else
> > +                             tbl8_idx = tbl24_tmp >> 1;
> > +                     tbl8_ptr = (uint8_t *)fib->tbl8 +
> > +                             (((tbl8_idx *
> RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
> > +                             (ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
> > +                             fib->nh_sz);
> > +                     /*update tbl8 with new next hop*/
> > +                     write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> > +                             RTE_DIR24_8_VALID_EXT_ENT,
> > +                             fib->nh_sz, ROUNDUP(ledge, 24) - ledge);
> > +                     tbl8_recycle(fib, ledge, tbl8_idx);
> > +             }
> > +             if (ROUNDUP(ledge, 24) < (redge & RTE_DIR24_8_TBL24_MASK))
> {
> > +                     write_to_fib(get_tbl24_p(fib, ROUNDUP(ledge, 24)),
> > +                             next_hop << 1, fib->nh_sz,
> > +                             ((redge & RTE_DIR24_8_TBL24_MASK) -
> > +                             ROUNDUP(ledge, 24)) >> 8);
> > +             }
> > +             if (redge & ~RTE_DIR24_8_TBL24_MASK) {
> > +                     tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, redge);
> > +                     if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> > +                                     RTE_DIR24_8_VALID_EXT_ENT) {
> > +                             tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> > +                             if (tbl8_idx < 0)
> > +                                     return -ENOSPC;
> > +                             /*update dir24 entry with tbl8 index*/
> > +                             write_to_fib(get_tbl24_p(fib, redge),
> > +                                     (tbl8_idx << 1)|
> > +                                     RTE_DIR24_8_VALID_EXT_ENT,
> > +                                     fib->nh_sz, 1);
> > +                     } else
> > +                             tbl8_idx = tbl24_tmp >> 1;
> > +                     tbl8_ptr = (uint8_t *)fib->tbl8 +
> > +                             ((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT)
> <<
> > +                             fib->nh_sz);
> > +                     /*update tbl8 with new next hop*/
> > +                     write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> > +                             RTE_DIR24_8_VALID_EXT_ENT,
> > +                             fib->nh_sz, redge &
> ~RTE_DIR24_8_TBL24_MASK);
> > +                     tbl8_recycle(fib, redge, tbl8_idx);
> > +             }
> > +     } else {
> > +             tbl24_tmp = RTE_DIR24_8_GET_TBL24(fib, ledge);
> > +             if ((tbl24_tmp & RTE_DIR24_8_VALID_EXT_ENT) !=
> > +                     RTE_DIR24_8_VALID_EXT_ENT) {
> > +                     tbl8_idx = tbl8_alloc(fib, tbl24_tmp);
> > +                     if (tbl8_idx < 0)
> > +                             return -ENOSPC;
> > +                     /*update dir24 entry with tbl8 index*/
> > +                     write_to_fib(get_tbl24_p(fib, ledge),
> > +                             (tbl8_idx << 1)|
> > +                             RTE_DIR24_8_VALID_EXT_ENT,
> > +                             fib->nh_sz, 1);
> > +             } else
> > +                     tbl8_idx = tbl24_tmp >> 1;
> > +             tbl8_ptr = (uint8_t *)fib->tbl8 +
> > +                     (((tbl8_idx * RTE_DIR24_8_TBL8_GRP_NUM_ENT) +
> > +                     (ledge & ~RTE_DIR24_8_TBL24_MASK)) <<
> > +                     fib->nh_sz);
> > +             /*update tbl8 with new next hop*/
> > +             write_to_fib((void *)tbl8_ptr, (next_hop << 1)|
> > +                     RTE_DIR24_8_VALID_EXT_ENT,
> > +                     fib->nh_sz, redge - ledge);
> > +             tbl8_recycle(fib, ledge, tbl8_idx);
> > +     }
> > +     return 0;
> > +}
> > +
> > +static int
> > +modify_fib(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> > +     uint64_t next_hop)
> > +{
> > +     struct rte_rib_node *tmp = NULL;
> > +     struct rte_dir24_8_tbl *fib;
> > +     uint32_t ledge, redge;
> > +     int ret;
> > +
> > +     fib = rib->fib;
> > +
> > +     if (next_hop > DIR24_8_MAX_NH(fib))
> > +             return -EINVAL;
> > +
> > +     ledge = ip;
> > +     do {
> > +             tmp = rte_rib_tree_get_nxt(rib, ip, depth, tmp,
> > +                     RTE_RIB_GET_NXT_COVER);
> > +             if (tmp != NULL) {
> > +                     if (tmp->depth == depth)
> > +                             continue;
> > +                     redge = tmp->key;
> > +                     if (ledge == redge) {
> > +                             ledge = redge +
> > +                                     (uint32_t)(1ULL << (32 -
> tmp->depth));
> > +                             continue;
> > +                     }
> > +                     ret = install_to_fib(fib, ledge, redge,
> > +                             next_hop);
> > +                     if (ret != 0)
> > +                             return ret;
> > +                     ledge = redge +
> > +                             (uint32_t)(1ULL << (32 - tmp->depth));
> > +             } else {
> > +                     redge = ip + (uint32_t)(1ULL << (32 - depth));
> > +                     ret = install_to_fib(fib, ledge, redge,
> > +                             next_hop);
> > +                     if (ret != 0)
> > +                             return ret;
> > +             }
> > +     } while (tmp);
> > +
> > +     return 0;
> > +}
> > +
> > +int
> > +rte_dir24_8_modify(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> > +     uint64_t next_hop, enum rte_rib_op op)
> > +{
> > +     struct rte_dir24_8_tbl *fib;
> > +     struct rte_rib_node *tmp = NULL;
> > +     struct rte_rib_node *node;
> > +     struct rte_rib_node *parent;
> > +     int ret = 0;
> > +
> > +     if ((rib == NULL) || (depth > RTE_RIB_MAXDEPTH))
> > +             return -EINVAL;
> > +
> > +     fib = rib->fib;
> > +     RTE_ASSERT(fib);
> > +
> > +     ip &= (uint32_t)(UINT64_MAX << (32 - depth));
> > +
> > +     node = rte_rib_tree_lookup_exact(rib, ip, depth);
> > +     switch (op) {
> > +     case RTE_RIB_ADD:
> > +             if (node != NULL) {
> > +                     if (node->nh == next_hop)
> > +                             return 0;
> > +                     ret = modify_fib(rib, ip, depth, next_hop);
> > +                     if (ret == 0)
> > +                             node->nh = next_hop;
> > +                     return 0;
> > +             }
> > +             if (depth > 24) {
> > +                     tmp = rte_rib_tree_get_nxt(rib, ip, 24, NULL,
> > +                             RTE_RIB_GET_NXT_COVER);
> > +                     if ((tmp == NULL) &&
> > +                             (fib->cur_tbl8s >= fib->number_tbl8s))
> > +                             return -ENOSPC;
> > +
> > +             }
> > +             node = rte_rib_tree_insert(rib, ip, depth);
> > +             if (node == NULL)
> > +                     return -rte_errno;
> > +             node->nh = next_hop;
> > +             parent = rte_rib_tree_lookup_parent(node);
> > +             if ((parent != NULL) && (parent->nh == next_hop))
> > +                     return 0;
> > +             ret = modify_fib(rib, ip, depth, next_hop);
> > +             if (ret) {
> > +                     rte_rib_tree_remove(rib, ip, depth);
> > +                     return ret;
> > +             }
> > +             if ((depth > 24) && (tmp == NULL))
> > +                     fib->cur_tbl8s++;
> > +             return 0;
> > +     case RTE_RIB_DEL:
> > +             if (node == NULL)
> > +                     return -ENOENT;
> > +
> > +             parent = rte_rib_tree_lookup_parent(node);
> > +             if (parent != NULL) {
> > +                     if (parent->nh != node->nh)
> > +                             ret = modify_fib(rib, ip, depth,
> parent->nh);
> > +             } else
> > +                     ret = modify_fib(rib, ip, depth, fib->def_nh);
> > +             if (ret == 0) {
> > +                     rte_rib_tree_remove(rib, ip, depth);
> > +                     if (depth > 24) {
> > +                             tmp = rte_rib_tree_get_nxt(rib, ip, 24,
> NULL,
> > +                                     RTE_RIB_GET_NXT_COVER);
> > +                             if (tmp == NULL)
> > +                                     fib->cur_tbl8s--;
> > +                     }
> > +             }
> > +             return ret;
> > +     default:
> > +             break;
> > +     }
> > +     return -EINVAL;
> > +}
> > +
> > +struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int
> socket_id,
> > +     enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh)
> > +{
> > +     char mem_name[RTE_RIB_NAMESIZE];
> > +     struct rte_dir24_8_tbl *fib;
> > +
> > +     snprintf(mem_name, sizeof(mem_name), "FIB_%s", name);
> > +     fib = rte_zmalloc_socket(name, sizeof(struct rte_dir24_8_tbl) +
> > +             RTE_DIR24_8_TBL24_NUM_ENT * (1 << nh_sz),
> RTE_CACHE_LINE_SIZE,
> > +             socket_id);
> > +     if (fib == NULL)
> > +             return fib;
> > +
> > +     snprintf(mem_name, sizeof(mem_name), "TBL8_%s", name);
> > +     fib->tbl8 = rte_zmalloc_socket(mem_name,
> RTE_DIR24_8_TBL8_GRP_NUM_ENT *
> > +                     (1 << nh_sz) * RTE_DIR24_8_TBL8_NUM_GROUPS,
> > +                     RTE_CACHE_LINE_SIZE, socket_id);
> > +     if (fib->tbl8 == NULL) {
> > +             rte_free(fib);
> > +             return NULL;
> > +     }
> > +     fib->def_nh = def_nh;
> > +     fib->nh_sz = nh_sz;
> > +     fib->number_tbl8s = RTE_MIN((uint32_t)RTE_DIR24_8_TBL8_NUM_GROUPS,
> > +                             DIR24_8_MAX_NH(fib));
> > +
> > +     snprintf(mem_name, sizeof(mem_name), "TBL8_idxes_%s", name);
> > +     fib->tbl8_idxes = rte_zmalloc_socket(mem_name,
> > +                     RTE_ALIGN_CEIL(fib->number_tbl8s, 64) >> 3,
> > +                     RTE_CACHE_LINE_SIZE, socket_id);
> > +     if (fib->tbl8_idxes == NULL) {
> > +             rte_free(fib->tbl8);
> > +             rte_free(fib);
> > +             return NULL;
> > +     }
> > +
> > +     return fib;
> > +}
> > +
> > +void
> > +rte_dir24_8_free(void *fib_p)
> > +{
> > +     struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
> > +
> > +     rte_free(fib->tbl8_idxes);
> > +     rte_free(fib->tbl8);
> > +     rte_free(fib);
> > +}
> > +
> > +LOOKUP_FUNC(1b, uint8_t, 5)
> > +LOOKUP_FUNC(2b, uint16_t, 6)
> > +LOOKUP_FUNC(4b, uint32_t, 15)
> > +LOOKUP_FUNC(8b, uint64_t, 12)
> > diff --git a/lib/librte_rib/rte_dir24_8.h b/lib/librte_rib/rte_dir24_8.h
> > new file mode 100644
> > index 0000000..f779409
> > --- /dev/null
> > +++ b/lib/librte_rib/rte_dir24_8.h
> > @@ -0,0 +1,116 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv@gmail.com>
> > + */
> > +
> > +#ifndef _RTE_DIR24_8_H_
> > +#define _RTE_DIR24_8_H_
> > +
> > +/**
> > + * @file
> > + * RTE Longest Prefix Match (LPM)
> > + */
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +/** @internal Total number of tbl24 entries. */
> > +#define RTE_DIR24_8_TBL24_NUM_ENT    (1 << 24)
> > +
> > +/** Maximum depth value possible for IPv4 LPM. */
> > +#define RTE_DIR24_8_MAX_DEPTH                32
> > +
> > +/** @internal Number of entries in a tbl8 group. */
> > +#define RTE_DIR24_8_TBL8_GRP_NUM_ENT 256
> > +
> > +/** @internal Total number of tbl8 groups in the tbl8. */
> > +#define RTE_DIR24_8_TBL8_NUM_GROUPS  65536
> > +
> > +/** @internal bitmask with valid and valid_group fields set */
> > +#define RTE_DIR24_8_VALID_EXT_ENT    0x01
> > +
> > +#define RTE_DIR24_8_TBL24_MASK               0xffffff00
> > +
> > +/** Size of nexthop (1 << nh_sz) bits */
> > +enum rte_dir24_8_nh_sz {
> > +     RTE_DIR24_8_1B,
> > +     RTE_DIR24_8_2B,
> > +     RTE_DIR24_8_4B,
> > +     RTE_DIR24_8_8B
> > +};
> > +
> > +
> > +#define DIR24_8_BITS_IN_NH(fib)              (8 * (1 << fib->nh_sz))
> > +#define DIR24_8_MAX_NH(fib)  ((1ULL << (DIR24_8_BITS_IN_NH(fib) - 1)) -
> 1)
> > +
> > +#define DIR24_8_TBL_IDX(a, fib)              ((a) >> (3 - fib->nh_sz))
> > +#define DIR24_8_PSD_IDX(a, fib)              ((a) & ((1 << (3 -
> fib->nh_sz)) - 1))
> > +
> > +#define DIR24_8_TBL24_VAL(ip)        (ip >> 8)
> > +#define DIR24_8_TBL8_VAL(res, ip)                                    \
> > +     ((res >> 1) * RTE_DIR24_8_TBL8_GRP_NUM_ENT + (uint8_t)ip)       \
> > +
> > +#define DIR24_8_LOOKUP_MSK                                           \
> > +     (((1ULL << ((1 << (fib->nh_sz + 3)) - 1)) << 1) - 1)            \
> > +
> > +#define RTE_DIR24_8_GET_TBL24(fib, ip)
>      \
> > +     ((fib->tbl24[DIR24_8_TBL_IDX(DIR24_8_TBL24_VAL(ip), fib)] >>    \
> > +     (DIR24_8_PSD_IDX(DIR24_8_TBL24_VAL(ip), fib) *                  \
> > +     DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)                 \
> > +
> > +#define RTE_DIR24_8_GET_TBL8(fib, res, ip)                           \
> > +     ((fib->tbl8[DIR24_8_TBL_IDX(DIR24_8_TBL8_VAL(res, ip), fib)] >> \
> > +     (DIR24_8_PSD_IDX(DIR24_8_TBL8_VAL(res, ip), fib) *              \
> > +     DIR24_8_BITS_IN_NH(fib))) & DIR24_8_LOOKUP_MSK)                 \
> >
> I would strongly suggest making each of the above macros into inline
> functions instead. It would allow easier readability since you have
> parameter types and can split things across lines easier.
> Also, some comments might be good too.
>
> +
> > +
> > +struct rte_dir24_8_tbl {
> > +     uint32_t        number_tbl8s;   /**< Total number of tbl8s. */
> > +     uint32_t        cur_tbl8s;      /**< Current cumber of tbl8s. */
> > +     uint64_t        def_nh;
> > +     enum rte_dir24_8_nh_sz  nh_sz;  /**< Size of nexthop entry */
> > +     uint64_t        *tbl8;          /**< LPM tbl8 table. */
> > +     uint64_t        *tbl8_idxes;
> > +     uint64_t        tbl24[0] __rte_cache_aligned; /**< LPM tbl24
> table. */
> > +};
> > +
> > +struct rte_dir24_8_tbl *rte_dir24_8_create(const char *name, int
> socket_id,
> > +     enum rte_dir24_8_nh_sz nh_sz, uint64_t def_nh);
> > +void rte_dir24_8_free(void *fib_p);
> > +int rte_dir24_8_modify(struct rte_rib *rib, uint32_t key,
> > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> > +int rte_dir24_8_lookup_bulk_1b(void *fib_p, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +int rte_dir24_8_lookup_bulk_2b(void *fib_p, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +int rte_dir24_8_lookup_bulk_4b(void *fib_p, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +int rte_dir24_8_lookup_bulk_8b(void *fib_p, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +
> > +
> > +static inline int
> > +rte_dir24_8_lookup(void *fib_p, uint32_t ip, uint64_t *next_hop)
>
> Why use void * as parameter, since the proper type is defined just above?

agree, will change to struct rte_dir24_8_tbl *


>
> > +{
> > +     uint64_t res;
> > +     struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
> > +
> > +     RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ip == NULL) ||
> > +             (next_hop == NULL)), -EINVAL);
> > +
> > +     res = RTE_DIR24_8_GET_TBL24(fib, ip);
> > +     if (unlikely((res & RTE_DIR24_8_VALID_EXT_ENT) ==
> > +             RTE_DIR24_8_VALID_EXT_ENT)) {
> > +             res = RTE_DIR24_8_GET_TBL8(fib, res, ip);
> > +     }
> > +     *next_hop = res >> 1;
> > +     return 0;
> > +}
>
> Do we need this static inline function? Can the bulk functions do on their
> own? If we can remove this, we can move the most of the header file
> contents, especially the structures, out of the public header. That would
> greatly improve the ease with which ABI can be maintained.
>
It was done in some manner to LPM. There was separate single lookup and
bulk versions.
Of course it is possible to remove this function at all and use bulk
version to lookup single packet. But I thought maybe somebody could use it.


>
>
> +
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif /* _RTE_DIR24_8_H_ */
> > +
>



-- 
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] Add RIB library
  2018-03-29 20:11     ` Vladimir Medvedkin
@ 2018-03-29 20:41       ` Bruce Richardson
  0 siblings, 0 replies; 11+ messages in thread
From: Bruce Richardson @ 2018-03-29 20:41 UTC (permalink / raw)
  To: Vladimir Medvedkin; +Cc: dev

On Thu, Mar 29, 2018 at 11:11:20PM +0300, Vladimir Medvedkin wrote:
> 2018-03-29 13:27 GMT+03:00 Bruce Richardson <bruce.richardson@intel.com>:
> 
> > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > RIB is an alternative to current LPM library.
<snip>
> > > +#define LOOKUP_FUNC(suffix, type, bulk_prefetch)                     \
> > > +int rte_dir24_8_lookup_bulk_##suffix(void *fib_p, const uint32_t
> > *ips,       \
> > > +     uint64_t *next_hops, const unsigned n)                          \
> > > +{                                                                    \
> > > +     struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;  \
> > > +     uint64_t tmp;                                                   \
> > > +     uint32_t i;                                                     \
> > > +     uint32_t prefetch_offset = RTE_MIN((unsigned)bulk_prefetch, n); \
> > > +                                                                     \
> > > +     RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ips == NULL) ||       \
> > > +             (next_hops == NULL)), -EINVAL);                         \
> > > +                                                                     \
> > > +     for (i = 0; i < prefetch_offset; i++)                           \
> > > +             rte_prefetch0(get_tbl24_p(fib, ips[i]));                \
> > > +     for (i = 0; i < (n - prefetch_offset); i++) {                   \
> > > +             rte_prefetch0(get_tbl24_p(fib, ips[i + prefetch_offset]));
> > \
> > > +             tmp = ((type *)fib->tbl24)[ips[i] >> 8];                \
> > > +             if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==       \
> > > +                     RTE_DIR24_8_VALID_EXT_ENT)) {                   \
> > > +                     tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +     \
> > > +                             ((tmp >> 1) *
> > RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> > > +             }                                                       \
> > > +             next_hops[i] = tmp >> 1;                                \
> > > +     }                                                               \
> > > +     for (; i < n; i++) {                                            \
> > > +             tmp = ((type *)fib->tbl24)[ips[i] >> 8];                \
> > > +             if (unlikely((tmp & RTE_DIR24_8_VALID_EXT_ENT) ==       \
> > > +                     RTE_DIR24_8_VALID_EXT_ENT)) {                   \
> > > +                     tmp = ((type *)fib->tbl8)[(uint8_t)ips[i] +     \
> > > +                             ((tmp >> 1) *
> > RTE_DIR24_8_TBL8_GRP_NUM_ENT)]; \
> > > +             }                                                       \
> > > +             next_hops[i] = tmp >> 1;                                \
> > > +     }                                                               \
> > > +     return 0;                                                       \
> > > +}                                                                    \
> >
> > What is the advantage of doing this as a macro? Unless I'm missing
> > something "suffix" is never actually used in the function at all, and you
> > reference the size of the data from fix->nh_sz. Therefore there can be no
> > performance benefit from having such a lookup function, that I can see.
> >
> suffix is to create 4 different function names, look at the end of
> rte_dir24_8.c, there are
> LOOKUP_FUNC(1b, uint8_t, 5)
> LOOKUP_FUNC(2b, uint16_t, 6)
> LOOKUP_FUNC(4b, uint32_t, 15)
> LOOKUP_FUNC(8b, uint64_t, 12)
> 
> Now I made single lookup function that references the size of the data from
> fib->nh_sz instead of static casting to passed type in macro.
> was:
> BULK RIB Lookup: 24.2 cycles (fails = 0.0%)
> become:
> BULK RIB Lookup: 26.1 cycles (fails = 0.0%)
> proc E3-1230v1@3.6Ghz
> 
> 
So you are saying that it turned out to be faster to do a lookup of the
size rather than hardcoding it. Seems strange, but ok.
I'm still confused why the four functions with four different names. What
is different between the four implementations. Just the amount of
prefetching done? It could still be done by a single call with a
compile-time constant parameter. It's whats used a lot in the ring library
and works well there.

> >
> > Therefore, if performance is ok, I suggest just making a single lookup_bulk
> > function that works with all sizes - as the inlined lookup function does in
> > the header.
> >
> > Alternatively, if you do want specific functions for each
> > entry size, you still don't need macros. Write a single function that takes
> > as a final parameter the entry-size and use that in calculations rather
> > than nh_sz.  Then wrap that function in the set of public ones, passing in
> > the final size parameter explicitly as "1", "2", "4" or "8". The compiler
> > will then know that as a compile-time constant and generate the correct
> > code for each size. However, for this path I suggest you check for any
> > resulting performance improvement, e.g. with l3fwd, as I think it's not
> > likely to be significant.
> >
> > > +
<snip> 
> >
> > > +{
> > > +     uint64_t res;
> > > +     struct rte_dir24_8_tbl *fib = (struct rte_dir24_8_tbl *)fib_p;
> > > +
> > > +     RTE_RIB_RETURN_IF_TRUE(((fib == NULL) || (ip == NULL) ||
> > > +             (next_hop == NULL)), -EINVAL);
> > > +
> > > +     res = RTE_DIR24_8_GET_TBL24(fib, ip);
> > > +     if (unlikely((res & RTE_DIR24_8_VALID_EXT_ENT) ==
> > > +             RTE_DIR24_8_VALID_EXT_ENT)) {
> > > +             res = RTE_DIR24_8_GET_TBL8(fib, res, ip);
> > > +     }
> > > +     *next_hop = res >> 1;
> > > +     return 0;
> > > +}
> >
> > Do we need this static inline function? Can the bulk functions do on their
> > own? If we can remove this, we can move the most of the header file
> > contents, especially the structures, out of the public header. That would
> > greatly improve the ease with which ABI can be maintained.
> >
> It was done in some manner to LPM. There was separate single lookup and
> bulk versions.
> Of course it is possible to remove this function at all and use bulk
> version to lookup single packet. But I thought maybe somebody could use it.
> 

Yes, I understand that. However, if it's unlikely to be used, I would
prioritize having ABI-friendliness over having it.

/Bruce

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-03-29 20:41 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-21 21:44 [PATCH v2 0/2] lib/rib: Add Routing Information Base library Medvedkin Vladimir
2018-02-21 21:44 ` [PATCH v2 1/2] Add RIB library Medvedkin Vladimir
2018-03-14 11:09   ` Bruce Richardson
2018-03-14 12:05     ` Richardson, Bruce
2018-03-25 18:17     ` Vladimir Medvedkin
2018-03-26  9:50       ` Bruce Richardson
2018-03-29 19:59         ` Vladimir Medvedkin
2018-03-29 10:27   ` Bruce Richardson
2018-03-29 20:11     ` Vladimir Medvedkin
2018-03-29 20:41       ` Bruce Richardson
2018-02-21 21:44 ` [PATCH v2 2/2] Add autotests for " Medvedkin Vladimir

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.