linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] XDP Program for Ip forward
@ 2017-10-03  7:37 cjacob
  2017-10-03  7:37 ` [PATCH 1/1] xdp: Sample xdp program implementing ip forward cjacob
  2017-10-04 15:07 ` [PATCH 0/1] XDP Program for Ip forward Jesper Dangaard Brouer
  0 siblings, 2 replies; 6+ messages in thread
From: cjacob @ 2017-10-03  7:37 UTC (permalink / raw)
  To: netdev; +Cc: Christina.Jacob, linux-kernel, linux-arm-kernel


The patch below implements port-to-port forwarding of IPv4 packets through
route table and arp table lookups, using the bpf_redirect helper function and
an lpm_trie map.  This performs better than IP forwarding through the normal
kernel stack.

Implementation details.
-----------------------
The program uses one map each for arp table, route table and packet count.
The number of entries the program can process is limited by the size of the
map used.

In xdp3_user.c:

Initially, the routing table is read and stored in an lpm trie map.
The arp table is read and stored in a hash map. Two netlink sockets
listen for changes to the route table and the arp table.
There are two types of changes to the route table.

	1. New

	New entries are added to the lpm trie with the proper key and prefix
	length. If the route table already has an entry for the same prefix
	with a different metric (only the metric is considered), the metrics
	are compared and the entry with the lowest metric is stored in the
	trie node.

	2. Deletion

	On deletion from the route table, the corresponding node is removed
	and the entire route table is read again to check whether there is
	another entry with the same key and prefix length but a different
	metric. If one exists, it is added to the lpm trie.

This implementation depends on the patch "bpf: Implement map_delete_elem for
BPF_MAP_TYPE_LPM_TRIE", which is not yet upstream.
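
The metric handling described above can be sketched in plain user-space C.
This is only an illustration under stated assumptions: struct route_value,
route_should_replace() and route_pick_lowest() are hypothetical names that do
not appear in the patch, and the array stands in for the re-read route table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the value stored per prefix in the trie. */
struct route_value {
	int gw;
	int ifindex;
	int metric;
};

/*
 * "New" case: decide whether a newly announced route should occupy the
 * trie slot. 'current' is NULL when the prefix has no entry yet.
 * Returns 1 if the candidate should be written, 0 if the existing
 * entry is kept (equal or lower metric already stored).
 */
static int route_should_replace(const struct route_value *current,
				const struct route_value *cand)
{
	assert(cand != NULL);
	if (current == NULL)
		return 1;
	return cand->metric < current->metric;
}

/*
 * "Deletion" case: after the node is removed, rescan the remaining
 * routes for the same prefix and return the index of the one with the
 * lowest metric, or -1 if none remain.
 */
static int route_pick_lowest(const struct route_value *routes, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++)
		if (best < 0 || routes[i].metric < routes[best].metric)
			best = (int)i;
	return best;
}
```

A caller would use route_should_replace() before updating the trie, and
route_pick_lowest() on the freshly read table after a delete.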

There are two types of changes to the arp table.

	1. New

	New arp entries are added to the hash map directly, with the ip
	address as the key and the destination mac address as the value.

	2. Delete

	The entry corresponding to the deleted ip is removed from the arp
	table map.

Another map is maintained for route table entries with a 32-bit mask.
Such an entry can have a corresponding arp entry; storing the two together
in a hash map lets both be fetched with a single O(1) lookup, thereby
eliminating the trie lookup and the separate arp lookup.

In xdp3_kern.c:

The hash map for the 32-bit mask entries is checked for a key that exactly
matches the destination ip. If the entry has a non-zero destination mac,
the xdp data is updated accordingly. Otherwise a full route and arp table
lookup is done using the lpm_trie and the arp table hash map.
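
The lookup order can be illustrated with a small user-space model. This is a
sketch, not the BPF program: plain arrays replace the maps, a linear scan
replaces the trie, addresses are in host byte order, the arp lookup after the
trie match is left out, and all names (exact_entry, route_entry, pick_egress)
are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct exact_entry { uint32_t dst; uint64_t dst_mac; int ifindex; };
struct route_entry { uint32_t prefix; uint8_t len; int ifindex; };

/* Netmask for a prefix length of 0..32 bits. */
static uint32_t prefix_mask(uint8_t len)
{
	assert(len <= 32);
	return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Longest-prefix match by linear scan; NULL when nothing matches. */
static const struct route_entry *lpm_lookup(const struct route_entry *tbl,
					    size_t n, uint32_t dst)
{
	const struct route_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if ((dst & prefix_mask(tbl[i].len)) != tbl[i].prefix)
			continue;
		if (!best || tbl[i].len > best->len)
			best = &tbl[i];
	}
	return best;
}

/*
 * Exact match first (only if its destination mac is known), then fall
 * back to longest-prefix match. Returns the egress ifindex, or -1 when
 * the packet would be dropped.
 */
static int pick_egress(const struct exact_entry *ex, size_t n_ex,
		       const struct route_entry *tbl, size_t n_rt,
		       uint32_t dst)
{
	const struct route_entry *r;

	for (size_t i = 0; i < n_ex; i++)
		if (ex[i].dst == dst && ex[i].dst_mac != 0)
			return ex[i].ifindex;

	r = lpm_lookup(tbl, n_rt, dst);
	return r ? r->ifindex : -1;
}
```

An exact-match entry whose destination mac is still zero falls through to the
prefix lookup, which mirrors the fallback described above.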
	
	Usage: ./xdp3 [-S] <ifindex1...ifindexn> 

	-S to choose generic xdp implementation 
	  [Default is driver xdp implementation]
	ifindex - the index of the interface to which 
	the xdp program has to be attached.
	in 4.14-rc3 kernel.


cjacob (1):
  xdp: Sample xdp program implementing ip forward

 samples/bpf/Makefile    |    4 +
 samples/bpf/xdp3_kern.c |  204 +++++++++++++++
 samples/bpf/xdp3_user.c |  649 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 857 insertions(+), 0 deletions(-)
 create mode 100644 samples/bpf/xdp3_kern.c
 create mode 100644 samples/bpf/xdp3_user.c


* [PATCH 1/1] xdp: Sample xdp program implementing ip forward
  2017-10-03  7:37 [PATCH 0/1] XDP Program for Ip forward cjacob
@ 2017-10-03  7:37 ` cjacob
  2017-10-03 15:54   ` Daniel Borkmann
  2017-10-03 16:24   ` David Ahern
  2017-10-04 15:07 ` [PATCH 0/1] XDP Program for Ip forward Jesper Dangaard Brouer
  1 sibling, 2 replies; 6+ messages in thread
From: cjacob @ 2017-10-03  7:37 UTC (permalink / raw)
  To: netdev; +Cc: Christina.Jacob, linux-kernel, linux-arm-kernel

Implements port to port forwarding with route table and arp table
lookup for ipv4 packets using bpf_redirect helper function and
lpm_trie  map.

Signed-off-by: cjacob <Christina.Jacob@cavium.com>
---
 samples/bpf/Makefile    |    4 +
 samples/bpf/xdp3_kern.c |  204 +++++++++++++++
 samples/bpf/xdp3_user.c |  649 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 857 insertions(+), 0 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index cf17c79..cc9cc0b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -28,6 +28,7 @@ hostprogs-y += test_cgrp2_sock
 hostprogs-y += test_cgrp2_sock2
 hostprogs-y += xdp1
 hostprogs-y += xdp2
+hostprogs-y += xdp3
 hostprogs-y += test_current_task_under_cgroup
 hostprogs-y += trace_event
 hostprogs-y += sampleip
@@ -73,6 +74,7 @@ test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
 xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
+xdp3-objs := bpf_load.o $(LIBBPF) xdp3_user.o
 test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) cgroup_helpers.o \
 				       test_current_task_under_cgroup_user.o
 trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
@@ -114,6 +116,7 @@ always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
+always += xdp3_kern.o
 always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
@@ -160,6 +163,7 @@ HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
+HOSTLOADLIBES_xdp3 += -lelf
 HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
 HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
diff --git a/samples/bpf/xdp3_kern.c b/samples/bpf/xdp3_kern.c
new file mode 100644
index 0000000..62d905d
--- /dev/null
+++ b/samples/bpf/xdp3_kern.c
@@ -0,0 +1,204 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include <linux/slab.h>
+#include <net/ip_fib.h>
+
+struct trie_value {
+	__u8 prefix[4];
+	long value;
+	int gw;
+	int ifindex;
+	int metric;
+};
+
+union key_4 {
+	u32 b32[2];
+	u8 b8[8];
+};
+
+struct arp_entry {
+	int dst;
+	long mac;
+};
+
+struct direct_map {
+	long mac;
+	int ifindex;
+	struct arp_entry arp;
+};
+
+/* Map for trie implementation*/
+struct bpf_map_def SEC("maps") lpm_map = {
+	.type = BPF_MAP_TYPE_LPM_TRIE,
+	.key_size = 8,
+	.value_size =
+		sizeof(struct trie_value),
+	.max_entries = 50,
+	.map_flags = BPF_F_NO_PREALLOC,
+};
+
+/* Map for counter*/
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+/* Map for ARP table*/
+struct bpf_map_def SEC("maps") arp_table = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(long),
+	.max_entries = 50,
+};
+
+/* Map to keep the exact match entries in the route table*/
+struct bpf_map_def SEC("maps") exact_match = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct direct_map),
+	.max_entries = 50,
+};
+
+/**
+ * Function to set source and destination mac of the packet
+ */
+static inline void set_src_dst_mac(void *data, void *src, void *dst)
+{
+	unsigned short *p      = data;
+	unsigned short *dest   = dst;
+	unsigned short *source = src;
+
+	p[3] = source[0];
+	p[4] = source[1];
+	p[5] = source[2];
+	p[0] = dest[0];
+	p[1] = dest[1];
+	p[2] = dest[2];
+}
+
+/**
+ * Parse IPV4 packet to get SRC, DST IP and protocol
+ */
+static inline int parse_ipv4(void *data, u64 nh_off, void *data_end,
+			     unsigned int *src, unsigned int *dest)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	*src = (unsigned int)iph->saddr;
+	*dest = (unsigned int)iph->daddr;
+	return iph->protocol;
+}
+
+SEC("xdp3")
+int xdp_prog3(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP, forward_to;
+	long *value;
+	struct trie_value *prefix_value;
+	long *dest_mac = NULL, *src_mac = NULL;
+	u16 h_proto;
+	u64 nh_off;
+	u32 ipproto;
+	union key_4 key4;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_ARP)) {
+		return XDP_PASS;
+	} else if (h_proto == htons(ETH_P_IP)) {
+		int src_ip = 0, dest_ip = 0;
+		struct direct_map *direct_entry;
+
+		ipproto = parse_ipv4(data, nh_off, data_end, &src_ip, &dest_ip);
+		direct_entry = (struct direct_map *)bpf_map_lookup_elem
+			(&exact_match, &dest_ip);
+		/*check for exact match, this would give a faster lookup*/
+		if (direct_entry && direct_entry->mac &&
+		    direct_entry->arp.mac) {
+			src_mac = &direct_entry->mac;
+			dest_mac = &direct_entry->arp.mac;
+			forward_to = direct_entry->ifindex;
+		} else {
+			/*Look up in the trie for lpm*/
+			// Key for trie
+			key4.b32[0] = 32;
+			key4.b8[4] = dest_ip % 0x100;
+			key4.b8[5] = (dest_ip >> 8) % 0x100;
+			key4.b8[6] = (dest_ip >> 16) % 0x100;
+			key4.b8[7] = (dest_ip >> 24) % 0x100;
+			prefix_value =
+				((struct trie_value *)bpf_map_lookup_elem
+				 (&lpm_map, &key4));
+			if (!prefix_value) {
+				return XDP_DROP;
+			} else {
+				src_mac = &prefix_value->value;
+				if (src_mac) {
+					dest_mac = (long *)bpf_map_lookup_elem
+						(&arp_table, &dest_ip);
+					if (!dest_mac) {
+						if (prefix_value->gw) {
+							dest_ip = *(unsigned int *)(&(prefix_value->gw));
+							dest_mac = (long *)bpf_map_lookup_elem
+								(&arp_table, &dest_ip);
+						} else {
+							return XDP_DROP;
+						}
+					}
+					forward_to = prefix_value->ifindex;
+				} else {
+					return XDP_DROP;
+				}
+			}
+		}
+	} else {
+		ipproto = 0;
+	}
+	if (src_mac && dest_mac) {
+		set_src_dst_mac(data, src_mac,
+				dest_mac);
+		value = bpf_map_lookup_elem
+			(&rxcnt, &ipproto);
+		if (value)
+			*value += 1;
+		return  bpf_redirect(
+				     forward_to,
+				     0);
+	}
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp3_user.c b/samples/bpf/xdp3_user.c
new file mode 100644
index 0000000..451b522
--- /dev/null
+++ b/samples/bpf/xdp3_user.c
@@ -0,0 +1,649 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+#include <arpa/inet.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <net/if.h>
+#include <netdb.h>
+#include <sys/ioctl.h>
+#include "bpf_util.h"
+#include <sys/syscall.h>
+
+int sock, sock_arp, flags = 0;
+char buf[8192];
+static int total_ifindex;
+char **index_list;
+
+static int get_route_table(int rtm_family);
+static void int_exit(int sig)
+{
+	int i = 0, index;
+
+	for (i = 0; i < total_ifindex; i++) {
+		index = strtoul(index_list[i], NULL, 0);
+		set_link_xdp_fd(index, -1, flags);
+	}
+	exit(0);
+}
+
+static void close_and_exit(int sig)
+{
+	int i = 0, index;
+
+	close(sock);
+	close(sock_arp);
+
+	for (i = 0; i < total_ifindex; i++) {
+		index = strtoul(index_list[i], NULL, 0);
+		set_link_xdp_fd(index, -1, flags);
+	}
+	exit(0);
+}
+
+/* Get the mac address of the interface given interface name */
+static long *getmac(char *iface)
+{
+	int fd;
+	struct ifreq ifr;
+	long *mac = NULL;
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	ifr.ifr_addr.sa_family = AF_INET;
+	strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1);
+	ioctl(fd, SIOCGIFHWADDR, &ifr);
+	mac = (long *)ifr.ifr_hwaddr.sa_data;
+	close(fd);
+	return mac;
+}
+
+static int recv_msg(struct sockaddr_nl sock_addr, int sock)
+{
+	char *buf_ptr;
+	struct nlmsghdr *nh;
+	int len, nll = 0;
+
+	buf_ptr = buf;
+	while (1) {
+		len = recv(sock, buf_ptr, sizeof(buf) - nll, 0);
+		if (len < 0)
+			return len;
+
+		nh = (struct nlmsghdr *)buf_ptr;
+
+		if (nh->nlmsg_type == NLMSG_DONE)
+			break;
+		buf_ptr += len;
+		nll += len;
+		if ((sock_addr.nl_groups & RTMGRP_NEIGH) == RTMGRP_NEIGH)
+			break;
+
+		if ((sock_addr.nl_groups & RTMGRP_IPV4_ROUTE) ==
+		    RTMGRP_IPV4_ROUTE)
+			break;
+	}
+	return nll;
+}
+
+/* Function to parse the route entry returned by netlink
+ * Updates the route entry related map entries
+ */
+static void read_route(struct nlmsghdr *nh, int nll)
+{
+	struct route_table {
+		int dst, gw, dst_len, iface, metric;
+		long *mac;
+		char iface_name[IFNAMSIZ];
+	} route;
+	struct arp_table {
+		int dst;
+		long mac;
+	};
+
+	struct direct_map {
+		long mac;
+		int ifindex;
+		struct arp_table arp;
+	} direct_entry;
+	int i;
+	int rtm_family;
+	struct bpf_lpm_trie_key *prefix_key;
+	char dsts[24], gws[24], ifs[16], dsts_len[24], metrics[24];
+	struct rtmsg *rt_msg;
+	int rtl;
+	struct rtattr *rt_attr;
+
+	if (nh->nlmsg_type == RTM_DELROUTE)
+		printf("DELETING Route entry\n");
+	else if (nh->nlmsg_type == RTM_GETROUTE)
+		printf("READING Route entry\n");
+	else if (nh->nlmsg_type == RTM_NEWROUTE)
+		printf("NEW Route entry\n");
+	else
+		printf("%d\n", nh->nlmsg_type);
+
+	bzero(&route, sizeof(route));
+	printf("Destination\tGateway\t\tGenmask\tMetric\tIface\n");
+	for (; NLMSG_OK(nh, nll); nh = NLMSG_NEXT(nh, nll)) {
+		rt_msg = (struct rtmsg *)NLMSG_DATA(nh);
+		rtm_family = rt_msg->rtm_family;
+		if (rtm_family == AF_INET)
+			if (rt_msg->rtm_table != RT_TABLE_MAIN)
+				continue;
+		rt_attr = (struct rtattr *)RTM_RTA(rt_msg);
+		rtl = RTM_PAYLOAD(nh);
+
+		for (; RTA_OK(rt_attr, rtl); rt_attr = RTA_NEXT(rt_attr, rtl)) {
+			switch (rt_attr->rta_type) {
+			case RTA_DST:
+				sprintf(dsts, "%d",
+					*((int *)RTA_DATA(rt_attr)));
+				break;
+			case RTA_GATEWAY:
+				sprintf(gws, "%d", *((int *)RTA_DATA(rt_attr)));
+				break;
+			case RTA_OIF:
+				sprintf(ifs, "%d", *((int *)RTA_DATA(rt_attr)));
+				break;
+			case RTA_METRICS:
+				sprintf(metrics, "%d",
+					*((int *)RTA_DATA(rt_attr)));
+			default:
+				break;
+			}
+		}
+		sprintf(dsts_len, "%d", rt_msg->rtm_dst_len);
+
+		route.dst = atoi(dsts);
+		route.dst_len = atoi(dsts_len);
+		route.gw = atoi(gws);
+		route.iface = atoi(ifs);
+		route.metric = atoi(metrics);
+		if_indextoname(route.iface, route.iface_name);
+		route.mac = getmac(route.iface_name);
+		printf("%x\t\t%x\t\t%d\t%d\t%d\n", route.dst, route.gw,
+		       route.dst_len, route.metric, route.iface);
+		if (rtm_family == AF_INET) {
+			struct trie_value {
+				__u8 prefix[4];
+				long value;
+				int gw;
+				int ifindex;
+				int metric;
+			} *prefix_value;
+
+			prefix_key = alloca(sizeof(*prefix_key) + 3);
+			prefix_value = alloca(sizeof(*prefix_value));
+
+			prefix_key->prefixlen = 32;
+			prefix_key->prefixlen = route.dst_len;
+			direct_entry.mac = *route.mac & 0xffffffffffff;
+			direct_entry.ifindex = route.iface;
+			direct_entry.arp.mac = 0;
+			direct_entry.arp.dst = 0;
+			if (route.dst_len == 32) {
+				if (nh->nlmsg_type == RTM_DELROUTE) {
+					assert(bpf_map_delete_elem(
+								   map_fd[3],
+								   &route.dst
+								   ) == 0);
+				} else {
+					if (bpf_map_lookup_elem(map_fd[2],
+								&route.dst,
+								&direct_entry.arp.mac
+								) == 0)
+						direct_entry.arp.dst = route.dst;
+
+					assert(bpf_map_update_elem(map_fd[3],
+								   &route.dst,
+								   &direct_entry,
+								   0) == 0);
+				}
+			}
+			for (i = 0; i < 4; i++)
+				prefix_key->data[i] =
+					(route.dst >> i * 8) % 0x100;
+			if (bpf_map_lookup_elem(map_fd[0], prefix_key,
+						prefix_value) < 0) {
+				for (i = 0; i < 4; i++)
+					prefix_value->prefix[i] =
+						prefix_key->data[i];
+				prefix_value->value =
+					*route.mac & 0xffffffffffff;
+				prefix_value->ifindex = route.iface;
+				prefix_value->gw = route.gw;
+				prefix_value->metric = route.metric;
+
+				assert(bpf_map_update_elem(map_fd[0],
+							   prefix_key,
+							   prefix_value, 0
+							   ) == 0);
+			} else {
+				if (nh->nlmsg_type == RTM_DELROUTE) {
+					printf("deleting entry\n");
+					printf("prefix key=%d.%d.%d.%d/%d",
+					       prefix_key->data[0],
+					       prefix_key->data[1],
+					       prefix_key->data[2],
+					       prefix_key->data[3],
+					       prefix_key->prefixlen);
+					assert(bpf_map_delete_elem(map_fd[0],
+								   prefix_key
+								   ) == 0);
+					/* Rereading the route table to check if
+					 * there is an entry with the same
+					 * prefix but a different metric as the
+					 * deleted entry.
+					 */
+					get_route_table(AF_INET);
+				} else if (prefix_key->data[0] ==
+					   prefix_value->prefix[0] &&
+					   prefix_key->data[1] ==
+					   prefix_value->prefix[1] &&
+					   prefix_key->data[2] ==
+					   prefix_value->prefix[2] &&
+					   prefix_key->data[3] ==
+					   prefix_value->prefix[3] &&
+					   route.metric >= prefix_value->metric) {
+					continue;
+				} else {
+					for (i = 0; i < 4; i++)
+						prefix_value->prefix[i] =
+							prefix_key->data[i];
+					prefix_value->value =
+						*route.mac & 0xffffffffffff;
+					prefix_value->ifindex = route.iface;
+					prefix_value->gw = route.gw;
+					prefix_value->metric = route.metric;
+					assert(bpf_map_update_elem(
+								   map_fd[0],
+								   prefix_key,
+								   prefix_value,
+								   0) == 0);
+				}
+			}
+		}
+		bzero(&route, sizeof(route));
+		bzero(dsts, sizeof(dsts));
+		bzero(dsts_len, sizeof(dsts_len));
+		bzero(gws, sizeof(gws));
+		bzero(ifs, sizeof(ifs));
+		bzero(&route, sizeof(route));
+	}
+}
+
+/* Function to read the existing route table  when the process is launched*/
+static int get_route_table(int rtm_family)
+{
+	struct {
+		struct nlmsghdr nl;
+		struct rtmsg rt;
+		char buf[8192];
+	} req;
+
+	int sock, seq = 0;
+	struct sockaddr_nl sa;
+	struct msghdr msg;
+	struct iovec iov;
+	int ret = 0;
+	struct nlmsghdr *nh;
+	int nll;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+	bzero(&sa, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	bzero(&req, sizeof(req));
+	req.nl.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
+	req.nl.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+	req.nl.nlmsg_type = RTM_GETROUTE;
+
+	req.rt.rtm_family = rtm_family;
+	req.rt.rtm_table = RT_TABLE_MAIN;
+	req.nl.nlmsg_pid = 0;
+	req.nl.nlmsg_seq = ++seq;
+	bzero(&msg, sizeof(msg));
+	iov.iov_base = (void *)&req.nl;
+	iov.iov_len = req.nl.nlmsg_len;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	ret = sendmsg(sock, &msg, 0);
+	if (ret < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	bzero(buf, sizeof(buf));
+	nll = recv_msg(sa, sock);
+	if (nll < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	nh = (struct nlmsghdr *)buf;
+	read_route(nh, nll);
+cleanup:
+	close(sock);
+	return ret;
+}
+
+/* Function to parse the arp entry returned by netlink
+ * Updates the arp entry related map entries
+ */
+static void read_arp(struct nlmsghdr *nh, int nll)
+{
+	struct arp_table {
+		int dst;
+		long mac;
+	} arp_entry;
+	struct direct_map {
+		long mac;
+		int ifindex;
+		struct arp_table arp;
+	} direct_entry;
+
+	char dsts[24], mac[24];
+	struct ndmsg *rt_msg;
+	int rtl, i = 0, ndm_family;
+	struct rtattr *rt_attr;
+
+	if (nh->nlmsg_type == RTM_GETNEIGH)
+		printf("READING arp entry\n");
+	printf("Address\tHwAddress\n");
+	for (; NLMSG_OK(nh, nll); nh = NLMSG_NEXT(nh, nll)) {
+		i++;
+		rt_msg = (struct ndmsg *)NLMSG_DATA(nh);
+		rt_attr = (struct rtattr *)RTM_RTA(rt_msg);
+		ndm_family = rt_msg->ndm_family;
+		rtl = RTM_PAYLOAD(nh);
+		for (; RTA_OK(rt_attr, rtl); rt_attr = RTA_NEXT(rt_attr, rtl)) {
+			switch (rt_attr->rta_type) {
+			case NDA_DST:
+				sprintf(dsts, "%d",
+					*((int *)RTA_DATA(rt_attr)));
+				break;
+			case NDA_LLADDR:
+				sprintf(mac, "%ld",
+					*((long *)RTA_DATA(rt_attr)));
+				break;
+			default:
+				break;
+			}
+		}
+		arp_entry.dst = atoi(dsts);
+		arp_entry.mac = atol(mac);
+		printf("%x\t\t%lx\n", arp_entry.dst, arp_entry.mac);
+		if (ndm_family == AF_INET) {
+			if (bpf_map_lookup_elem(map_fd[3], &arp_entry.dst,
+						&direct_entry) == 0) {
+				if (nh->nlmsg_type == RTM_DELNEIGH) {
+					direct_entry.arp.dst = 0;
+					direct_entry.arp.mac = 0;
+				} else if (nh->nlmsg_type == RTM_NEWNEIGH) {
+					direct_entry.arp.dst = arp_entry.dst;
+					direct_entry.arp.mac = arp_entry.mac;
+				}
+				assert(bpf_map_update_elem(map_fd[3],
+							   &arp_entry.dst,
+							   &direct_entry, 0
+							   ) == 0);
+				bzero(&direct_entry, sizeof(direct_entry));
+			}
+			if (nh->nlmsg_type == RTM_DELNEIGH) {
+				assert(bpf_map_delete_elem(map_fd[2],
+							   &arp_entry.dst) == 0);
+			} else if (nh->nlmsg_type == RTM_NEWNEIGH) {
+				assert(bpf_map_update_elem(map_fd[2],
+							   &arp_entry.dst,
+							   &arp_entry.mac, 0
+							   ) == 0);
+			}
+		}
+		bzero(&arp_entry, sizeof(arp_entry));
+		bzero(dsts, sizeof(dsts));
+	}
+}
+
+/* Function to read the existing arp table  when the process is launched*/
+static int get_arp_table(int rtm_family)
+{
+	struct {
+		struct nlmsghdr nl;
+		struct ndmsg rt;
+		char buf[8192];
+	} req;
+
+	int sock, seq = 0;
+	struct sockaddr_nl sa;
+	struct msghdr msg;
+	struct iovec iov;
+	int ret = 0;
+	struct nlmsghdr *nh;
+	int nll;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+	bzero(&sa, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	bzero(&req, sizeof(req));
+	req.nl.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
+	req.nl.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+	req.nl.nlmsg_type = RTM_GETNEIGH;
+	req.rt.ndm_state = NUD_REACHABLE;
+	req.rt.ndm_family = rtm_family;
+	req.nl.nlmsg_pid = 0;
+	req.nl.nlmsg_seq = ++seq;
+	bzero(&msg, sizeof(msg));
+	iov.iov_base = (void *)&req.nl;
+	iov.iov_len = req.nl.nlmsg_len;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	ret = sendmsg(sock, &msg, 0);
+	if (ret < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	bzero(buf, sizeof(buf));
+	nll = recv_msg(sa, sock);
+	if (nll < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	nh = (struct nlmsghdr *)buf;
+	read_arp(nh, nll);
+cleanup:
+	close(sock);
+	return ret;
+}
+
+/* Function to keep track and update changes in route and arp table
+ * Give regular statistics of packets forwarded
+ */
+static int monitor_route(void)
+{
+	struct sockaddr_nl la, lr;
+	struct nlmsghdr *nh;
+	int nll, ret = 0;
+	const unsigned int nr_keys = 256;
+	int interval = 5;
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_keys][nr_cpus];
+	__u32 key;
+	int i;
+	struct pollfd fds_route, fds_arp;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	fcntl(sock, F_SETFL, O_NONBLOCK);
+	bzero(&lr, sizeof(lr));
+	lr.nl_family = AF_NETLINK;
+	lr.nl_groups = RTMGRP_IPV6_ROUTE | RTMGRP_IPV4_ROUTE | RTMGRP_NOTIFY;
+	if (bind(sock, (struct sockaddr *)&lr, sizeof(lr)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	fds_route.fd = sock;
+	fds_route.events = POLL_IN;
+
+	sock_arp = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock_arp < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	fcntl(sock_arp, F_SETFL, O_NONBLOCK);
+	bzero(&la, sizeof(la));
+	la.nl_family = AF_NETLINK;
+	la.nl_groups = RTMGRP_NEIGH | RTMGRP_NOTIFY;
+	if (bind(sock_arp, (struct sockaddr *)&la, sizeof(la)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		ret = -1;
+		goto cleanup;
+	}
+	fds_arp.fd = sock_arp;
+	fds_arp.events = POLL_IN;
+
+	memset(prev, 0, sizeof(prev));
+	do {
+		signal(SIGINT, close_and_exit);
+		signal(SIGTERM, close_and_exit);
+
+		sleep(interval);
+		for (key = 0; key < nr_keys; key++) {
+			__u64 sum = 0;
+
+			assert(bpf_map_lookup_elem(map_fd[1], &key, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[key][i]);
+			if (sum)
+				printf("proto %u: %10llu pkt/s\n",
+				       key, sum / interval);
+			memcpy(prev[key], values, sizeof(values));
+		}
+
+		bzero(buf, sizeof(buf));
+		if (poll(&fds_route, 1, 3) == POLL_IN) {
+			nll = recv_msg(lr, sock);
+			if (nll < 0) {
+				printf("recv from netlink: %s\n",
+				       strerror(errno));
+				ret = -1;
+				goto cleanup;
+			}
+
+			nh = (struct nlmsghdr *)buf;
+			printf("Routing table updated.\n");
+			read_route(nh, nll);
+		}
+		bzero(buf, sizeof(buf));
+		if (poll(&fds_arp, 1, 3) == POLL_IN) {
+			nll = recv_msg(la, sock_arp);
+			if (nll < 0) {
+				printf("recv from netlink: %s\n",
+				       strerror(errno));
+				ret = -1;
+				goto cleanup;
+			}
+
+			nh = (struct nlmsghdr *)buf;
+			read_arp(nh, nll);
+		}
+
+	} while (1);
+cleanup:
+	close(sock);
+	return ret;
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	int i = 1, index;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	printf("Entering user program\n");
+	if (ac < 2) {
+		printf("usage: %s [-S] IFINDEX\n", argv[0]);
+		return 1;
+	}
+	if (!strcmp(argv[1], "-S")) {
+		flags = XDP_FLAGS_SKB_MODE;
+		total_ifindex = ac - 2;
+		index_list = (argv + 2);
+	} else {
+		flags = 0;
+		total_ifindex = ac - 1;
+		index_list = (argv + 1);
+	}
+	printf("Loading bpf program\n");
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+	printf("\n**************loading bpf file*********************\n\n\n");
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	for (i = 0; i < total_ifindex; i++) {
+		index = strtoul(index_list[i], NULL, 0);
+		if (set_link_xdp_fd(index, prog_fd[0], flags) < 0) {
+			printf("link set xdp fd failed\n");
+			return 1;
+		}
+		printf("Attached to %d\n", index);
+	}
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	printf("*******************ROUTE TABLE*************************\n\n\n");
+	get_route_table(AF_INET);
+	printf("*******************ARP TABLE***************************\n\n\n");
+	get_arp_table(AF_INET);
+	if (monitor_route() < 0) {
+		printf("Error in receiving route update");
+		return 1;
+	}
+
+	return 0;
+}
-- 
1.7.1


* Re: [PATCH 1/1] xdp: Sample xdp program implementing ip forward
  2017-10-03  7:37 ` [PATCH 1/1] xdp: Sample xdp program implementing ip forward cjacob
@ 2017-10-03 15:54   ` Daniel Borkmann
       [not found]     ` <DM5PR07MB346826EDCF6C5F2B1287D1578A710@DM5PR07MB3468.namprd07.prod.outlook.com>
  2017-10-03 16:24   ` David Ahern
  1 sibling, 1 reply; 6+ messages in thread
From: Daniel Borkmann @ 2017-10-03 15:54 UTC (permalink / raw)
  To: cjacob, netdev; +Cc: linux-kernel, linux-arm-kernel, alexei.starovoitov

On 10/03/2017 09:37 AM, cjacob wrote:
> Implements port to port forwarding with route table and arp table
> lookup for ipv4 packets using bpf_redirect helper function and
> lpm_trie  map.
>
> Signed-off-by: cjacob <Christina.Jacob@cavium.com>

Thanks for the patch, just a few minor comments below!

Note, should be full name, e.g.:

   Signed-off-by: Christina Jacob <Christina.Jacob@cavium.com>

Also, your From: only shows 'cjacob', as can be seen from the cover letter
as well, so perhaps check your git settings to use your full name:

   cjacob (1):
     xdp: Sample xdp program implementing ip forward

If there's one single patch, then a cover letter is not needed, only
for sets of >1 patches.

[...]
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include <linux/in.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_vlan.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include "bpf_helpers.h"
> +#include <linux/slab.h>
> +#include <net/ip_fib.h>
> +
> +struct trie_value {
> +	__u8 prefix[4];
> +	long value;
> +	int gw;
> +	int ifindex;
> +	int metric;
> +};
> +
> +union key_4 {
> +	u32 b32[2];
> +	u8 b8[8];
> +};
> +
> +struct arp_entry {
> +	int dst;
> +	long mac;
> +};
> +
> +struct direct_map {
> +	long mac;
> +	int ifindex;
> +	struct arp_entry arp;
> +};
> +
> +/* Map for trie implementation*/
> +struct bpf_map_def SEC("maps") lpm_map = {
> +	.type = BPF_MAP_TYPE_LPM_TRIE,
> +	.key_size = 8,
> +	.value_size =
> +		sizeof(struct trie_value),

(Nit: there are a couple of such breaks throughout the patch, can we
  just use a single line for such cases where reasonable?)

> +	.max_entries = 50,
> +	.map_flags = BPF_F_NO_PREALLOC,
> +};
> +
> +/* Map for counter*/
> +struct bpf_map_def SEC("maps") rxcnt = {
> +	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(long),
> +	.max_entries = 256,
> +};
> +
> +/* Map for ARP table*/
> +struct bpf_map_def SEC("maps") arp_table = {
> +	.type = BPF_MAP_TYPE_HASH,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(long),

Perhaps these should be proper structs here, such that it
becomes easier to read/handle later on lookup.
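
One possible shape for such structs, as a sketch only: arp_key, arp_value,
MAC_LEN and arp_value_is_zero() are hypothetical names, not taken from the
patch. The map definition would then use sizeof() of these types for
key_size and value_size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6	/* same value as ETH_ALEN */

/* Key: IPv4 address in network byte order. */
struct arp_key {
	uint32_t ip;
};

/* Value: the neighbour's mac address as raw bytes, not packed in a long. */
struct arp_value {
	uint8_t mac[MAC_LEN];
};

static_assert(sizeof(struct arp_value) == MAC_LEN, "no padding expected");

/* Returns 1 when no mac address has been recorded yet. */
static int arp_value_is_zero(const struct arp_value *v)
{
	static const uint8_t zero[MAC_LEN];

	return memcmp(v->mac, zero, MAC_LEN) == 0;
}
```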

> +	.max_entries = 50,
> +};
> +
> +/* Map to keep the exact match entries in the route table*/
> +struct bpf_map_def SEC("maps") exact_match = {
> +	.type = BPF_MAP_TYPE_HASH,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(struct direct_map),
> +	.max_entries = 50,
> +};
> +
> +/**
> + * Function to set source and destination mac of the packet
> + */
> +static inline void set_src_dst_mac(void *data, void *src, void *dst)
> +{
> +	unsigned short *p      = data;
> +	unsigned short *dest   = dst;
> +	unsigned short *source = src;
> +
> +	p[3] = source[0];
> +	p[4] = source[1];
> +	p[5] = source[2];
> +	p[0] = dest[0];
> +	p[1] = dest[1];
> +	p[2] = dest[2];

You could just use __builtin_memcpy() given length is
constant anyway, so LLVM will do the inlining.
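
A minimal sketch of that suggestion, written so it also compiles in user
space with GCC or clang. The eth_addrs overlay and MAC_LEN are assumptions
added for the example; the function name is kept from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAC_LEN 6	/* same value as ETH_ALEN */

/* First 12 bytes of an Ethernet frame: destination mac, then source mac. */
struct eth_addrs {
	uint8_t dst[MAC_LEN];
	uint8_t src[MAC_LEN];
};

static_assert(sizeof(struct eth_addrs) == 2 * MAC_LEN, "unexpected padding");

/* Overwrite both mac addresses at the start of the frame. */
static inline void set_src_dst_mac(void *data, const void *src,
				   const void *dst)
{
	struct eth_addrs *hdr = data;

	/* Constant length, so the compiler can expand these inline. */
	__builtin_memcpy(hdr->dst, dst, MAC_LEN);
	__builtin_memcpy(hdr->src, src, MAC_LEN);
}
```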

> +}
> +
> +/**
> + * Parse IPV4 packet to get SRC, DST IP and protocol
> + */
> +static inline int parse_ipv4(void *data, u64 nh_off, void *data_end,
> +			     unsigned int *src, unsigned int *dest)
> +{
> +	struct iphdr *iph = data + nh_off;
> +
> +	if (iph + 1 > data_end)
> +		return 0;
> +	*src = (unsigned int)iph->saddr;
> +	*dest = (unsigned int)iph->daddr;

Why not stay with __be32 types?

> +	return iph->protocol;
> +}
> +
> +SEC("xdp3")
> +int xdp_prog3(struct xdp_md *ctx)
> +{
> +	void *data_end = (void *)(long)ctx->data_end;
> +	void *data = (void *)(long)ctx->data;
> +	struct ethhdr *eth = data;
> +	int rc = XDP_DROP, forward_to;
> +	long *value;
> +	struct trie_value *prefix_value;
> +	long *dest_mac = NULL, *src_mac = NULL;
> +	u16 h_proto;
> +	u64 nh_off;
> +	u32 ipproto;
> +	union key_4 key4;
> +
> +	nh_off = sizeof(*eth);
> +	if (data + nh_off > data_end)
> +		return rc;
> +
> +	h_proto = eth->h_proto;
> +
> +	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> +		struct vlan_hdr *vhdr;
> +
> +		vhdr = data + nh_off;
> +		nh_off += sizeof(struct vlan_hdr);
> +		if (data + nh_off > data_end)
> +			return rc;
> +		h_proto = vhdr->h_vlan_encapsulated_proto;
> +	}
> +	if (h_proto == htons(ETH_P_ARP)) {
> +		return XDP_PASS;
> +	} else if (h_proto == htons(ETH_P_IP)) {
> +		int src_ip = 0, dest_ip = 0;
> +		struct direct_map *direct_entry;
> +
> +		ipproto = parse_ipv4(data, nh_off, data_end, &src_ip, &dest_ip);
> +		direct_entry = (struct direct_map *)bpf_map_lookup_elem
> +			(&exact_match, &dest_ip);
> +		/*check for exact match, this would give a faster lookup*/
> +		if (direct_entry && direct_entry->mac &&
> +		    direct_entry->arp.mac) {
> +			src_mac = &direct_entry->mac;
> +			dest_mac = &direct_entry->arp.mac;
> +			forward_to = direct_entry->ifindex;
> +		} else {
> +			/*Look up in the trie for lpm*/
> +			// Key for trie

Nit: please check style throughout the patch.

> +			key4.b32[0] = 32;
> +			key4.b8[4] = dest_ip % 0x100;
> +			key4.b8[5] = (dest_ip >> 8) % 0x100;
> +			key4.b8[6] = (dest_ip >> 16) % 0x100;
> +			key4.b8[7] = (dest_ip >> 24) % 0x100;
> +			prefix_value =
> +				((struct trie_value *)bpf_map_lookup_elem
> +				 (&lpm_map, &key4));

For key, please use proper struct bpf_lpm_trie_key, see also
usage example in tools/testing/selftests/bpf/test_lpm_map.c
for LPM handling.

> +			if (!prefix_value) {
> +				return XDP_DROP;
> +			} else {
> +				src_mac = &prefix_value->value;
> +				if (src_mac) {
> +					dest_mac = (long *)bpf_map_lookup_elem
> +						(&arp_table, &dest_ip);
> +					if (!dest_mac) {
> +						if (prefix_value->gw) {
> +							dest_ip = *(unsigned int *)(&(prefix_value->gw));
> +							dest_mac = (long *)bpf_map_lookup_elem
> +								(&arp_table, &dest_ip);
> +						} else {
> +							return XDP_DROP;
> +						}
> +					}
> +					forward_to = prefix_value->ifindex;
> +				} else {
> +					return XDP_DROP;
> +				}
> +			}
> +		}
> +	} else {
> +		ipproto = 0;
> +	}
> +	if (src_mac && dest_mac) {
> +		set_src_dst_mac(data, src_mac,
> +				dest_mac);
> +		value = bpf_map_lookup_elem
> +			(&rxcnt, &ipproto);
> +		if (value)
> +			*value += 1;
> +		return  bpf_redirect(
> +				     forward_to,
> +				     0);
> +	}
> +	return rc;


* Re: [PATCH 1/1] xdp: Sample xdp program implementing ip forward
  2017-10-03  7:37 ` [PATCH 1/1] xdp: Sample xdp program implementing ip forward cjacob
  2017-10-03 15:54   ` Daniel Borkmann
@ 2017-10-03 16:24   ` David Ahern
  1 sibling, 0 replies; 6+ messages in thread
From: David Ahern @ 2017-10-03 16:24 UTC (permalink / raw)
  To: cjacob, netdev; +Cc: linux-kernel, linux-arm-kernel

On 10/3/17 12:37 AM, cjacob wrote:
> diff --git a/samples/bpf/xdp3_kern.c b/samples/bpf/xdp3_kern.c
> new file mode 100644
> index 0000000..62d905d
> --- /dev/null
> +++ b/samples/bpf/xdp3_kern.c
> @@ -0,0 +1,204 @@
> +/* Copyright (c) 2016 PLUMgrid

2016 PLUMgrid?


> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#define KBUILD_MODNAME "


* Re: [PATCH 0/1] XDP Program for Ip forward
  2017-10-03  7:37 [PATCH 0/1] XDP Program for Ip forward cjacob
  2017-10-03  7:37 ` [PATCH 1/1] xdp: Sample xdp program implementing ip forward cjacob
@ 2017-10-04 15:07 ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 6+ messages in thread
From: Jesper Dangaard Brouer @ 2017-10-04 15:07 UTC (permalink / raw)
  To: cjacob; +Cc: brouer, netdev, linux-kernel, linux-arm-kernel


First of all thank you for working on this.

On Tue,  3 Oct 2017 13:07:04 +0530 cjacob <Christina.Jacob@cavium.com> wrote:

> 	Usage: ./xdp3 [-S] <ifindex1...ifindexn> 
> 
> 	-S to choose generic xdp implementation 
> 	  [Default is driver xdp implementation]
> 	ifindex - the index of the interface to which 
> 	the xdp program has to be attached.
> 	in 4.14-rc3 kernel.

I would prefer if we can name the program something more descriptive
than "xdp3".  What about "xdp_redirect_router" or "xdp_router_ipv4" ?

I would also appreciate it if we could stop using ifindexes and use
normal device ifnames instead.  The program can simply look up the
ifindex via if_nametoindex(ifname); see how in [1] and [2].

To accept several ifnames you can use the same trick as with the
multiple --cpu options in [1] and [2].

[1] http://lkml.kernel.org/r/150711864538.9499.11712573036995600273.stgit@firesoul
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
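
A rough sketch of the ifname lookup described above, assuming the names arrive as an array of strings (for example from argv). The helper name `resolve_ifnames` is hypothetical; `if_nametoindex()` is the standard call from <net/if.h> and returns 0 when the name is unknown:

```c
#include <net/if.h>
#include <stdio.h>

/* Resolve interface names (e.g. "eth0") to ifindexes.
 * Returns the number of names resolved, or -1 on the first unknown name.
 */
static int resolve_ifnames(char *const names[], int n, unsigned int *ifindex_out)
{
	for (int i = 0; i < n; i++) {
		ifindex_out[i] = if_nametoindex(names[i]);
		if (ifindex_out[i] == 0) {
			fprintf(stderr, "unknown interface: %s\n", names[i]);
			return -1;
		}
	}
	return n;
}
```

The loader would then attach the XDP program to each resolved ifindex exactly as it does today with the numeric arguments.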

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 1/1] xdp: Sample xdp program implementing ip forward
       [not found]     ` <DM5PR07MB346826EDCF6C5F2B1287D1578A710@DM5PR07MB3468.namprd07.prod.outlook.com>
@ 2017-10-10  2:24       ` Jacob, Christina
  0 siblings, 0 replies; 6+ messages in thread
From: Jacob, Christina @ 2017-10-10  2:24 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: linux-kernel, linux-arm-kernel, alexei.starovoitov

Sorry for the late reply. I will include the suggested changes in the next revision of the patch.

Please see inline for clarifications and questions.


Thanks,

Christina


________________________________
From: Daniel Borkmann <daniel@iogearbox.net>
Sent: Tuesday, October 3, 2017 9:24 PM
To: Jacob, Christina; netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org; alexei.starovoitov@gmail.com
Subject: Re: [PATCH 1/1] xdp: Sample xdp program implementing ip forward

>On 10/03/2017 09:37 AM, cjacob wrote:
>> Implements port to port forwarding with route table and arp table
>> lookup for ipv4 packets using bpf_redirect helper function and
>> lpm_trie  map.
>>
>> Signed-off-by: cjacob <Christina.Jacob@cavium.com>
>
>Thanks for the patch, just a few minor comments below!
>
>Note, should be full name, e.g.:
>
>   Signed-off-by: Christina Jacob <Christina.Jacob@cavium.com>
>
>Also your From: only shows 'cjacob', as can be seen from the cover letter
>as well, so perhaps check your git settings to make that the full name:
>
>   cjacob (1):
>     xdp: Sample xdp program implementing ip forward
>
>If there is only a single patch, a cover letter is not needed; it is
>only for sets of more than one patch.
>
>[...]
>> +#define KBUILD_MODNAME "foo"
>> +#include <uapi/linux/bpf.h>
>> +#include <linux/in.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/if_packet.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/ip.h>
>> +#include <linux/ipv6.h>
>> +#include "bpf_helpers.h"
>> +#include <linux/slab.h>
>> +#include <net/ip_fib.h>
>> +
>> +struct trie_value {
>> +     __u8 prefix[4];
>> +     long value;
>> +     int gw;
>> +     int ifindex;
>> +     int metric;
>> +};
>> +
>> +union key_4 {
>> +     u32 b32[2];
>> +     u8 b8[8];
>> +};
>> +
>> +struct arp_entry {
>> +     int dst;
>> +     long mac;
>> +};
>> +
>> +struct direct_map {
>> +     long mac;
>> +     int ifindex;
>> +     struct arp_entry arp;
>> +};
>> +
>> +/* Map for trie implementation*/
>> +struct bpf_map_def SEC("maps") lpm_map = {
>> +     .type = BPF_MAP_TYPE_LPM_TRIE,
>> +     .key_size = 8,
>> +     .value_size =
>> +             sizeof(struct trie_value),
>
>(Nit: there are a couple of such breaks throughout the patch, can we
>  just use a single line for such cases where reasonable?)
>
>> +     .max_entries = 50,
>> +     .map_flags = BPF_F_NO_PREALLOC,
>> +};
>> +
>> +/* Map for counter*/
>> +struct bpf_map_def SEC("maps") rxcnt = {
>> +     .type = BPF_MAP_TYPE_PERCPU_ARRAY,
>> +     .key_size = sizeof(u32),
>> +     .value_size = sizeof(long),
>> +     .max_entries = 256,
>> +};
>> +
>> +/* Map for ARP table*/
>> +struct bpf_map_def SEC("maps") arp_table = {
>> +     .type = BPF_MAP_TYPE_HASH,
>> +     .key_size = sizeof(int),
>> +     .value_size = sizeof(long),
>
>Perhaps these should be proper structs here, such that it
>becomes easier to read/handle later on lookup.
>

I am not clear about this. I am defining an eBPF map here.
I did not understand which structure you are referring to.
Am I missing something?
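
One possible reading of the review comment is that the map's key and value could be described by named structs rather than a bare int and long. The sketch below illustrates that reading in standalone C; `struct arp_key`, `struct arp_val` and `arp_val_is_set()` are hypothetical names, not taken from the patch or from any kernel header:

```c
#include <stdint.h>

#ifndef ETH_ALEN
#define ETH_ALEN 6	/* length of an Ethernet MAC address in bytes */
#endif

/* Hypothetical typed key and value for the ARP map. */
struct arp_key {
	uint32_t ip;		/* IPv4 address, network byte order */
};

struct arp_val {
	uint8_t mac[ETH_ALEN];	/* MAC address of the neighbour */
};

/* The map definition would then size itself from the types:
 *	.key_size   = sizeof(struct arp_key),
 *	.value_size = sizeof(struct arp_val),
 */

/* A lookup result can be handled by field name, not by casting a long. */
static inline int arp_val_is_set(const struct arp_val *v)
{
	for (int i = 0; i < ETH_ALEN; i++)
		if (v->mac[i])
			return 1;
	return 0;
}
```

With this shape the value is exactly the six bytes of the MAC, so user space and the XDP program agree on the layout regardless of sizeof(long).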

>> +     .max_entries = 50,
>> +};
>> +
>> +/* Map to keep the exact match entries in the route table*/
>> +struct bpf_map_def SEC("maps") exact_match = {
>> +     .type = BPF_MAP_TYPE_HASH,
>> +     .key_size = sizeof(int),
>> +     .value_size = sizeof(struct direct_map),
>> +     .max_entries = 50,
>> +};
>> +
>> +/**
>> + * Function to set source and destination mac of the packet
>> + */
>> +static inline void set_src_dst_mac(void *data, void *src, void *dst)
>> +{
>> +     unsigned short *p      = data;
>> +     unsigned short *dest   = dst;
>> +     unsigned short *source = src;
>> +
>> +     p[3] = source[0];
>> +     p[4] = source[1];
>> +     p[5] = source[2];
>> +     p[0] = dest[0];
>> +     p[1] = dest[1];
>> +     p[2] = dest[2];
>
>You could just use __builtin_memcpy() given length is
>constant anyway, so LLVM will do the inlining.
>
>> +}
>> +
>> +/**
>> + * Parse IPV4 packet to get SRC, DST IP and protocol
>> + */
>> +static inline int parse_ipv4(void *data, u64 nh_off, void *data_end,
>> +                          unsigned int *src, unsigned int *dest)
>> +{
>> +     struct iphdr *iph = data + nh_off;
>> +
>> +     if (iph + 1 > data_end)
>> +             return 0;
>> +     *src = (unsigned int)iph->saddr;
>> +     *dest = (unsigned int)iph->daddr;
>
>Why not stay with __be32 types?
>
>> +     return iph->protocol;
>> +}
>> +
>> +SEC("xdp3")
>> +int xdp_prog3(struct xdp_md *ctx)
>> +{
>> +     void *data_end = (void *)(long)ctx->data_end;
>> +     void *data = (void *)(long)ctx->data;
>> +     struct ethhdr *eth = data;
>> +     int rc = XDP_DROP, forward_to;
>> +     long *value;
>> +     struct trie_value *prefix_value;
>> +     long *dest_mac = NULL, *src_mac = NULL;
>> +     u16 h_proto;
>> +     u64 nh_off;
>> +     u32 ipproto;
>> +     union key_4 key4;
>> +
>> +     nh_off = sizeof(*eth);
>> +     if (data + nh_off > data_end)
>> +             return rc;
>> +
>> +     h_proto = eth->h_proto;
>> +
>> +     if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
>> +             struct vlan_hdr *vhdr;
>> +
>> +             vhdr = data + nh_off;
>> +             nh_off += sizeof(struct vlan_hdr);
>> +             if (data + nh_off > data_end)
>> +                     return rc;
>> +             h_proto = vhdr->h_vlan_encapsulated_proto;
>> +     }
>> +     if (h_proto == htons(ETH_P_ARP)) {
>> +             return XDP_PASS;
>> +     } else if (h_proto == htons(ETH_P_IP)) {
>> +             int src_ip = 0, dest_ip = 0;
>> +             struct direct_map *direct_entry;
>> +
>> +             ipproto = parse_ipv4(data, nh_off, data_end, &src_ip, &dest_ip);
>> +             direct_entry = (struct direct_map *)bpf_map_lookup_elem
>> +                     (&exact_match, &dest_ip);
>> +             /*check for exact match, this would give a faster lookup*/
>> +             if (direct_entry && direct_entry->mac &&
>> +                 direct_entry->arp.mac) {
>> +                     src_mac = &direct_entry->mac;
>> +                     dest_mac = &direct_entry->arp.mac;
>> +                     forward_to = direct_entry->ifindex;
>> +             } else {
>> +                     /*Look up in the trie for lpm*/
>> +                     // Key for trie
>
>Nit: please check style throughout the patch.
>
>> +                     key4.b32[0] = 32;
>> +                     key4.b8[4] = dest_ip % 0x100;
>> +                     key4.b8[5] = (dest_ip >> 8) % 0x100;
>> +                     key4.b8[6] = (dest_ip >> 16) % 0x100;
>> +                     key4.b8[7] = (dest_ip >> 24) % 0x100;
>> +                     prefix_value =
>> +                             ((struct trie_value *)bpf_map_lookup_elem
>> +                              (&lpm_map, &key4));
>
>For key, please use proper struct bpf_lpm_trie_key, see also
>usage example in tools/testing/selftests/bpf/test_lpm_map.c
>for LPM handling.
>

I am following the way it is done in the kernel side of the other sample programs.
Can we do dynamic memory allocation in an eBPF kernel program? I am getting invalid instruction errors at runtime.
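
For reference, an LPM key for IPv4 has a fixed size, so it can live on the stack and no allocation is needed. The sketch below assumes a local struct, `struct lpm_key4` (a hypothetical name), that mirrors the layout of struct bpf_lpm_trie_key: a 32-bit prefix length followed by the address bytes. It is plain C so it can be compiled outside the kernel tree:

```c
#include <stdint.h>
#include <string.h>

/* Fixed-size IPv4 key laid out like struct bpf_lpm_trie_key:
 * prefix length first, then the address bytes in network order.
 */
struct lpm_key4 {
	uint32_t prefixlen;
	uint8_t  data[4];
};

/* Build a /32 lookup key from a destination address that is already in
 * network byte order (as read from the IP header).
 */
static inline void lpm_key4_init(struct lpm_key4 *key, uint32_t daddr_be)
{
	key->prefixlen = 32;
	/* Copy bytes in memory order, so this is endian-independent. */
	memcpy(key->data, &daddr_be, sizeof(key->data));
}
```

The struct is 8 bytes, matching the `.key_size = 8` used for lpm_map in the patch, and it replaces the four shift-and-modulo stores into `union key_4`.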

>> +                     if (!prefix_value) {
>> +                             return XDP_DROP;
>> +                     } else {
>> +                             src_mac = &prefix_value->value;
>> +                             if (src_mac) {
>> +                                     dest_mac = (long *)bpf_map_lookup_elem
>> +                                             (&arp_table, &dest_ip);
>> +                                     if (!dest_mac) {
>> +                                             if (prefix_value->gw) {
>> +                                                     dest_ip = *(unsigned int *)(&(prefix_value->gw));
>> +                                                     dest_mac = (long *)bpf_map_lookup_elem
>> +                                                             (&arp_table, &dest_ip);
>> +                                             } else {
>> +                                                     return XDP_DROP;
>> +                                             }
>> +                                     }
>> +                                     forward_to = prefix_value->ifindex;
>> +                             } else {
>> +                                     return XDP_DROP;
>> +                             }
>> +                     }
>> +             }
>> +     } else {
>> +             ipproto = 0;
>> +     }
>> +     if (src_mac && dest_mac) {
>> +             set_src_dst_mac(data, src_mac,
>> +                             dest_mac);
>> +             value = bpf_map_lookup_elem
>> +                     (&rxcnt, &ipproto);
>> +             if (value)
>> +                     *value += 1;
>> +             return  bpf_redirect(
>> +                                  forward_to,
>> +                                  0);
>> +     }
>> +     return rc;


Thread overview: 6+ messages
2017-10-03  7:37 [PATCH 0/1] XDP Program for Ip forward cjacob
2017-10-03  7:37 ` [PATCH 1/1] xdp: Sample xdp program implementing ip forward cjacob
2017-10-03 15:54   ` Daniel Borkmann
     [not found]     ` <DM5PR07MB346826EDCF6C5F2B1287D1578A710@DM5PR07MB3468.namprd07.prod.outlook.com>
2017-10-10  2:24       ` Jacob, Christina
2017-10-03 16:24   ` David Ahern
2017-10-04 15:07 ` [PATCH 0/1] XDP Program for Ip forward Jesper Dangaard Brouer
