* [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure
@ 2017-12-07 12:44 Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit Pablo Neira Ayuso
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:44 UTC
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

Hi,

This patchset is a new iteration of the flow offload infrastructure [1].
This round adds a netlink control plane to configure flow tables, so
there is no longer a single flow table that gets registered
unconditionally, as in the previous patchset.

The following example shows how to create a flow table whose name is 'w',
and a rule that specifies what flows are offloaded into this flow table.

	table ip x {
	        flowtable w {
	                hook ingress priority -100 devices = { eth0, eth1 };
	        }

	        chain y {
	                type filter hook forward priority 0; policy accept;
	                ip protocol tcp flow offload @w
	        }
	}

The flow table control plane is useful to set up specific flow table
configurations, including which devices you want to bind the flow table
to, the priority in the netfilter pipeline at the ingress hooks, a custom
timeout for the flow table, and anything else that needs a toggle to be
enabled/disabled through this control plane.

* Patch 1/6 adds the IPS_OFFLOAD status bit for conntrack. The conntrack
  garbage collector does not expire entries that have been offloaded.
  Conntrack entries that have been offloaded in the conntrack table look
  like this:

  ipv4     2 tcp      6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] use=3

* Patch 2/6 adds a netlink control plane that allows you to create, list
  and delete flow tables. This patch also introduces the nf_flow_table
  object, which uses a rhashtable, a garbage collector to remove entries
  that have expired, i.e. those that have seen no traffic for a while,
  and the flow table type, which allows registering IPv4 and IPv6 flow
  tables. It's basically boilerplate netlink code that integrates into
  nf_tables.

* Patch 3/6 adds the generic flow table representation. This includes
  the flow table API to create, remove and look up entries in the
  flow table, and the generic garbage collector to expire entries. This
  is basically the code common to all flow table types.

* Patch 4/6 provides the IPv4 flow table flavour, which is the only type
  so far. This provides the ingress hook for IPv4, basically to look up
  an entry in the flow table, then in case of a hit, decrement the TTL
  and pass the packet on to the neighbour layer for transmission on a
  given device, otherwise fall back to the classic forwarding path.

* Patch 5/6 introduces the "flow offload" action. This allocates the
  flow entry and adds it to the flow table. This allows you to decide
  at what stage you want to offload flows through policy.

* Patch 6/6 adds the net_device ndo to offload flows to hardware, if
  the driver implements this feature. This adds a new workqueue to
  configure hardware flow offload from user context. No driver uses
  this so far, but I've been approached by several hardware driver
  developers, from different companies, willing to implement this, so
  I'm inclined to keep this in a branch in my nf-next tree until we
  have the first client of this.
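
The hit/miss decision that the patch 4/6 summary describes can be
modelled in a few lines of user-space C. This is only a rough sketch
under stated assumptions, not the code in nf_flow_table_ipv4.c: every
name below (flow_key, flow_table, flow_fast_path, ...) is hypothetical,
a linear scan stands in for the rhashtable lookup, the IP checksum
update is left out, and sending packets whose TTL is about to run out
to the slow path is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 5-tuple key; the field names are illustrative only. */
struct flow_key {
	uint32_t src, dst;
	uint16_t sport, dport;
	uint8_t proto;
};

struct flow_entry {
	struct flow_key key;
	int out_ifindex;	/* device to transmit on when the flow hits */
	bool in_use;
};

enum verdict { VERDICT_SLOW_PATH, VERDICT_FAST_PATH };

#define FLOW_TABLE_SIZE 16

struct flow_table {
	struct flow_entry slots[FLOW_TABLE_SIZE];
};

bool flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->src == b->src && a->dst == b->dst &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* Linear scan; a stand-in for the hash table lookup. */
struct flow_entry *flow_lookup(struct flow_table *t, const struct flow_key *k)
{
	for (size_t i = 0; i < FLOW_TABLE_SIZE; i++) {
		if (t->slots[i].in_use && flow_key_equal(&t->slots[i].key, k))
			return &t->slots[i];
	}
	return NULL;
}

/* Returns 0 on success, -1 when no free slot is left. */
int flow_add(struct flow_table *t, const struct flow_key *k, int out_ifindex)
{
	for (size_t i = 0; i < FLOW_TABLE_SIZE; i++) {
		if (!t->slots[i].in_use) {
			t->slots[i].key = *k;
			t->slots[i].out_ifindex = out_ifindex;
			t->slots[i].in_use = true;
			return 0;
		}
	}
	return -1;
}

/* Hit: decrement the TTL and report the output device.
 * Miss (or TTL about to expire): leave the packet untouched and let
 * the classic forwarding path handle it.
 */
enum verdict flow_fast_path(struct flow_table *t, const struct flow_key *k,
			    uint8_t *ttl, int *out_ifindex)
{
	struct flow_entry *e;

	assert(t && k && ttl && out_ifindex);

	e = flow_lookup(t, k);
	if (!e || *ttl <= 1)
		return VERDICT_SLOW_PATH;

	(*ttl)--;
	*out_ifindex = e->out_ifindex;
	return VERDICT_FAST_PATH;
}
```

In this model, flow_add() plays the role of the "flow offload" action
from patch 5/6: until a rule adds the flow, every packet of that flow
takes the slow path.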

This is my TODO list, things I would like to finish:

* netns support.
* IPv6 support.
* Port address translation, so far only layer 3 NATs.
* PMTU interactions.
* stateful flow tracking.

Plus other things that I would like to polish, mostly fine-grained
details.

Cc'ing everyone that has provided feedback privately or publicly since
the last time. If I forgot to Cc anyone, please accept my apologies.

Comments welcome, thanks.

[1] https://lwn.net/Articles/738214/

Pablo Neira Ayuso (6):
  netfilter: nf_conntrack: add IPS_OFFLOAD status bit
  netfilter: nf_tables: add flow table netlink frontend
  netfilter: add generic flow table infrastructure
  netfilter: flow table support for IPv4
  netfilter: nf_tables: flow offload expression
  netfilter: nft_flow_offload: add ndo hooks for hardware offload

 include/linux/netdevice.h                          |   9 +
 include/net/netfilter/nf_flow_table.h              |  96 +++
 include/net/netfilter/nf_tables.h                  |  51 ++
 include/uapi/linux/netfilter/nf_conntrack_common.h |   4 +
 include/uapi/linux/netfilter/nf_tables.h           |  64 ++
 net/ipv4/netfilter/Kconfig                         |   8 +
 net/ipv4/netfilter/Makefile                        |   3 +
 net/ipv4/netfilter/nf_flow_table_ipv4.c            | 316 +++++++++
 net/netfilter/Kconfig                              |  14 +
 net/netfilter/Makefile                             |   4 +
 net/netfilter/nf_conntrack_core.c                  |  19 +
 net/netfilter/nf_conntrack_netlink.c               |  15 +-
 net/netfilter/nf_conntrack_proto_tcp.c             |   3 +
 net/netfilter/nf_conntrack_standalone.c            |  12 +-
 net/netfilter/nf_flow_table.c                      | 295 ++++++++
 net/netfilter/nf_tables_api.c                      | 749 ++++++++++++++++++++-
 net/netfilter/nft_flow_offload.c                   | 353 ++++++++++
 17 files changed, 2009 insertions(+), 6 deletions(-)
 create mode 100644 include/net/netfilter/nf_flow_table.h
 create mode 100644 net/ipv4/netfilter/nf_flow_table_ipv4.c
 create mode 100644 net/netfilter/nf_flow_table.c
 create mode 100644 net/netfilter/nft_flow_offload.c

-- 
2.11.0



* [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
@ 2017-12-07 12:44 ` Pablo Neira Ayuso
  2017-12-08  6:47   ` Florian Westphal
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 2/6] netfilter: nf_tables: add flow table netlink frontend Pablo Neira Ayuso
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:44 UTC
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

This new bit tells us that the conntrack entry is owned by the flow
table offload infrastructure.

 # cat /proc/net/nf_conntrack
 ipv4     2 tcp      6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2

Note the [OFFLOAD] tag in the listing.

The timer of such conntrack entries looks stopped from userspace.
In practice, to make sure the conntrack entry does not go away, the
conntrack timer is periodically set to an arbitrarily large value that
gets refreshed on every iteration of the garbage collector, so it
never expires. This allows us to save a bit check in the packet path
via nf_ct_is_expired(). These entries also display no internal state
in the case of TCP flows.
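
As a rough illustration of the refresh trick, here is a small
user-space model. It is a sketch under assumptions, not the kernel
code: ct_entry, ct_is_expired(), ct_gc_pass() and TICKS_PER_SEC are
hypothetical stand-ins for struct nf_conn, nf_ct_is_expired(),
gc_worker() and HZ, and expired entries are only counted rather than
released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TICKS_PER_SEC	1000u			/* stand-in for HZ */
#define DAY		(86400u * TICKS_PER_SEC)
#define ST_OFFLOAD	(1u << 14)		/* mirrors IPS_OFFLOAD_BIT = 14 */

struct ct_entry {
	uint32_t status;
	uint32_t timeout;	/* absolute expiry time, in ticks */
};

/* Packet-path check: looks only at the timer, never at the status
 * bits. The signed difference keeps the comparison correct when the
 * tick counter wraps around.
 */
bool ct_is_expired(const struct ct_entry *ct, uint32_t now)
{
	return (int32_t)(ct->timeout - now) <= 0;
}

/* One garbage collector pass: push the timeout of offloaded entries
 * one day into the future and count the remaining expired entries.
 */
size_t ct_gc_pass(struct ct_entry *tbl, size_t n, uint32_t now)
{
	size_t expired = 0;

	assert(tbl || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].status & ST_OFFLOAD) {
			tbl[i].timeout = now + DAY;
			continue;
		}
		if (ct_is_expired(&tbl[i], now))
			expired++;
	}
	return expired;
}
```

The model only holds as long as the collector visits every offloaded
entry more often than once per DAY; otherwise ct_is_expired() would
start reporting such entries as expired.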

Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/uapi/linux/netfilter/nf_conntrack_common.h |  4 ++++
 net/netfilter/nf_conntrack_core.c                  | 19 +++++++++++++++++++
 net/netfilter/nf_conntrack_netlink.c               | 15 ++++++++++++++-
 net/netfilter/nf_conntrack_proto_tcp.c             |  3 +++
 net/netfilter/nf_conntrack_standalone.c            | 12 ++++++++----
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
index dc947e59d03a..6b463b88182d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -100,6 +100,10 @@ enum ip_conntrack_status {
 	IPS_HELPER_BIT = 13,
 	IPS_HELPER = (1 << IPS_HELPER_BIT),
 
+	/* Conntrack has been offloaded to flow table. */
+	IPS_OFFLOAD_BIT = 14,
+	IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT),
+
 	/* Be careful here, modifying these bits can make things messy,
 	 * so don't let users modify them directly.
 	 */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 01130392b7c0..02e195accd47 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net,
 	hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
 		tmp = nf_ct_tuplehash_to_ctrack(h);
 
+		if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
+			continue;
+
 		if (nf_ct_is_expired(tmp)) {
 			nf_ct_gc_expired(tmp);
 			continue;
@@ -975,6 +978,17 @@ static bool gc_worker_can_early_drop(const struct nf_conn *ct)
 	return false;
 }
 
+#define	DAY	(86400 * HZ)
+
+/* Set an arbitrary timeout large enough not to ever expire, this saves
+ * us a check for the IPS_OFFLOAD_BIT from the packet path via
+ * nf_ct_is_expired().
+ */
+static void nf_ct_offload_timeout(struct nf_conn *ct)
+{
+	ct->timeout = nfct_time_stamp + DAY;
+}
+
 static void gc_worker(struct work_struct *work)
 {
 	unsigned int min_interval = max(HZ / GC_MAX_BUCKETS_DIV, 1u);
@@ -1011,6 +1025,11 @@ static void gc_worker(struct work_struct *work)
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 
 			scanned++;
+			if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
+				nf_ct_offload_timeout(tmp);
+				continue;
+			}
+
 			if (nf_ct_is_expired(tmp)) {
 				nf_ct_gc_expired(tmp);
 				expired_count++;
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index de4053d84364..79a74aec7c1e 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1105,6 +1105,14 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = {
 				    .len = NF_CT_LABELS_MAX_SIZE },
 };
 
+static int ctnetlink_flush_iterate(struct nf_conn *ct, void *data)
+{
+	if (test_bit(IPS_OFFLOAD_BIT, &ct->status))
+		return 0;
+
+	return ctnetlink_filter_match(ct, data);
+}
+
 static int ctnetlink_flush_conntrack(struct net *net,
 				     const struct nlattr * const cda[],
 				     u32 portid, int report)
@@ -1117,7 +1125,7 @@ static int ctnetlink_flush_conntrack(struct net *net,
 			return PTR_ERR(filter);
 	}
 
-	nf_ct_iterate_cleanup_net(net, ctnetlink_filter_match, filter,
+	nf_ct_iterate_cleanup_net(net, ctnetlink_flush_iterate, filter,
 				  portid, report);
 	kfree(filter);
 
@@ -1163,6 +1171,11 @@ static int ctnetlink_del_conntrack(struct net *net, struct sock *ctnl,
 
 	ct = nf_ct_tuplehash_to_ctrack(h);
 
+	if (test_bit(IPS_OFFLOAD_BIT, &ct->status)) {
+		nf_ct_put(ct);
+		return -EBUSY;
+	}
+
 	if (cda[CTA_ID]) {
 		u_int32_t id = ntohl(nla_get_be32(cda[CTA_ID]));
 		if (id != (u32)(unsigned long)ct) {
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index cba1c6ffe51a..156f529d1668 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -305,6 +305,9 @@ static bool tcp_invert_tuple(struct nf_conntrack_tuple *tuple,
 /* Print out the private part of the conntrack. */
 static void tcp_print_conntrack(struct seq_file *s, struct nf_conn *ct)
 {
+	if (test_bit(IPS_OFFLOAD_BIT, &ct->status))
+		return;
+
 	seq_printf(s, "%s ", tcp_conntrack_names[ct->proto.tcp.state]);
 }
 #endif
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 5a101caa3e12..46d32baad095 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -309,10 +309,12 @@ static int ct_seq_show(struct seq_file *s, void *v)
 	WARN_ON(!l4proto);
 
 	ret = -ENOSPC;
-	seq_printf(s, "%-8s %u %-8s %u %ld ",
+	seq_printf(s, "%-8s %u %-8s %u ",
 		   l3proto_name(l3proto->l3proto), nf_ct_l3num(ct),
-		   l4proto_name(l4proto->l4proto), nf_ct_protonum(ct),
-		   nf_ct_expires(ct)  / HZ);
+		   l4proto_name(l4proto->l4proto), nf_ct_protonum(ct));
+
+	if (!test_bit(IPS_OFFLOAD_BIT, &ct->status))
+		seq_printf(s, "%ld ", nf_ct_expires(ct)  / HZ);
 
 	if (l4proto->print_conntrack)
 		l4proto->print_conntrack(s, ct);
@@ -339,7 +341,9 @@ static int ct_seq_show(struct seq_file *s, void *v)
 	if (seq_print_acct(s, ct, IP_CT_DIR_REPLY))
 		goto release;
 
-	if (test_bit(IPS_ASSURED_BIT, &ct->status))
+	if (test_bit(IPS_OFFLOAD_BIT, &ct->status))
+		seq_puts(s, "[OFFLOAD] ");
+	else if (test_bit(IPS_ASSURED_BIT, &ct->status))
 		seq_puts(s, "[ASSURED] ");
 
 	if (seq_has_overflowed(s))
-- 
2.11.0



* [PATCH nf-next RFC,v2 2/6] netfilter: nf_tables: add flow table netlink frontend
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit Pablo Neira Ayuso
@ 2017-12-07 12:44 ` Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 3/6] netfilter: add generic flow table infrastructure Pablo Neira Ayuso
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:44 UTC
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name; this name is used from
rules to refer to a specific flow table. Flow tables use the rhashtable
class and a generic garbage collector to remove expired entries.

This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.
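
The per-family type registration can be pictured with the following
user-space sketch. It is a simplified model under assumptions, not the
code in this patch: ft_type, ft_type_register(), ft_type_unregister()
and ft_type_get() are hypothetical names, a singly linked list replaces
the kernel list_head, and there is no locking, RCU, module refcounting
or module autoloading.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flow table type descriptor, one per layer 3 family. */
struct ft_type {
	struct ft_type *next;
	int family;
	const char *name;
};

/* Global list of registered types, in registration order. */
static struct ft_type *ft_types;

/* Append the type at the tail of the list. */
void ft_type_register(struct ft_type *type)
{
	struct ft_type **pp = &ft_types;

	assert(type);

	while (*pp)
		pp = &(*pp)->next;
	type->next = NULL;
	*pp = type;
}

/* Unlink the type; does nothing if it was never registered. */
void ft_type_unregister(struct ft_type *type)
{
	struct ft_type **pp;

	for (pp = &ft_types; *pp; pp = &(*pp)->next) {
		if (*pp == type) {
			*pp = type->next;
			type->next = NULL;
			return;
		}
	}
}

/* Return the first type registered for this family, or NULL. */
const struct ft_type *ft_type_get(int family)
{
	const struct ft_type *type;

	for (type = ft_types; type; type = type->next) {
		if (type->family == family)
			return type;
	}
	return NULL;
}
```

Creating a flow table would then start with ft_type_get() for the
family of the enclosing table and fail if no type is registered for
that family.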

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_flow_table.h    |  24 +
 include/net/netfilter/nf_tables.h        |  51 +++
 include/uapi/linux/netfilter/nf_tables.h |  53 +++
 net/netfilter/nf_tables_api.c            | 749 ++++++++++++++++++++++++++++++-
 4 files changed, 876 insertions(+), 1 deletion(-)
 create mode 100644 include/net/netfilter/nf_flow_table.h

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
new file mode 100644
index 000000000000..429833dba2d9
--- /dev/null
+++ b/include/net/netfilter/nf_flow_table.h
@@ -0,0 +1,24 @@
+#ifndef _NF_FLOW_TABLE_H
+#define _NF_FLOW_TABLE_H
+
+#include <linux/rhashtable.h>
+
+struct nf_flowtable;
+
+struct nf_flowtable_type {
+	struct list_head		list;
+	int				family;
+	int 				(*init)(struct nf_flowtable *);
+	void 				(*destroy)(struct nf_flowtable *);
+	const struct rhashtable_params	*params;
+	nf_hookfn			*hook;
+	struct module			*owner;
+};
+
+struct nf_flowtable {
+	struct rhashtable		rhashtable;
+	const struct nf_flowtable_type	*type;
+	struct delayed_work		gc_work;
+};
+
+#endif /* _NF_FLOW_TABLE_H */
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 0f5b12a4ad09..c9050b87a57a 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -8,6 +8,7 @@
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter/nf_tables.h>
 #include <linux/u64_stats_sync.h>
+#include <net/netfilter/nf_flow_table.h>
 #include <net/netlink.h>
 
 #define NFT_JUMP_STACK_SIZE	16
@@ -942,6 +943,7 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv);
  *	@chains: chains in the table
  *	@sets: sets in the table
  *	@objects: stateful objects in the table
+ *	@flowtables: flow tables in the table
  *	@hgenerator: handle generator state
  *	@use: number of chain references to this table
  *	@flags: table flag (see enum nft_table_flags)
@@ -953,6 +955,7 @@ struct nft_table {
 	struct list_head		chains;
 	struct list_head		sets;
 	struct list_head		objects;
+	struct list_head		flowtables;
 	u64				hgenerator;
 	u32				use;
 	u16				flags:14,
@@ -1091,6 +1094,44 @@ int nft_register_obj(struct nft_object_type *obj_type);
 void nft_unregister_obj(struct nft_object_type *obj_type);
 
 /**
+ *	struct nft_flowtable - nf_tables flow table
+ *
+ *	@list: flow table list node in table list
+ * 	@table: the table the flow table is contained in
+ *	@name: name of this flow table
+ *	@hooknum: hook number
+ *	@priority: hook priority
+ *	@ops_len: number of hooks in array
+ *	@genmask: generation mask
+ *	@use: number of references to this flow table
+ *	@data: rhashtable and garbage collector
+ * 	@ops: array of hooks
+ */
+struct nft_flowtable {
+	struct list_head		list;
+	struct nft_table		*table;
+	char				*name;
+	int				hooknum;
+	int				priority;
+	int				ops_len;
+	u32				genmask:2,
+					use:30;
+	/* runtime data below here */
+	struct nf_hook_ops		*ops ____cacheline_aligned;
+	struct nf_flowtable		data;
+};
+
+struct nft_flowtable *nf_tables_flowtable_lookup(const struct nft_table *table,
+						 const struct nlattr *nla,
+						 u8 genmask);
+void nft_flow_table_iterate(struct net *net,
+			    void (*iter)(struct nf_flowtable *flowtable, void *data),
+			    void *data);
+
+void nft_register_flowtable_type(struct nf_flowtable_type *type);
+void nft_unregister_flowtable_type(struct nf_flowtable_type *type);
+
+/**
  *	struct nft_traceinfo - nft tracing information and state
  *
  *	@pkt: pktinfo currently processed
@@ -1140,6 +1181,9 @@ void nft_trace_notify(struct nft_traceinfo *info);
 #define MODULE_ALIAS_NFT_OBJ(type) \
 	MODULE_ALIAS("nft-obj-" __stringify(type))
 
+#define MODULE_ALIAS_NFT_FLOWTABLE(family) \
+	MODULE_ALIAS("nft-flowtable-" __stringify(family))
+
 /*
  * The gencursor defines two generations, the currently active and the
  * next one. Objects contain a bitmask of 2 bits specifying the generations
@@ -1326,4 +1370,11 @@ struct nft_trans_obj {
 #define nft_trans_obj(trans)	\
 	(((struct nft_trans_obj *)trans->data)->obj)
 
+struct nft_trans_flowtable {
+	struct nft_flowtable		*flowtable;
+};
+
+#define nft_trans_flowtable(trans)	\
+	(((struct nft_trans_flowtable *)trans->data)->flowtable)
+
 #endif /* _NET_NF_TABLES_H */
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 871afa4871bf..9ba0f4c13de6 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -91,6 +91,9 @@ enum nft_verdicts {
  * @NFT_MSG_GETOBJ: get a stateful object (enum nft_obj_attributes)
  * @NFT_MSG_DELOBJ: delete a stateful object (enum nft_obj_attributes)
  * @NFT_MSG_GETOBJ_RESET: get and reset a stateful object (enum nft_obj_attributes)
+ * @NFT_MSG_NEWFLOWTABLE: add new flow table (enum nft_flowtable_attributes)
+ * @NFT_MSG_GETFLOWTABLE: get flow table (enum nft_flowtable_attributes)
+ * @NFT_MSG_DELFLOWTABLE: delete flow table (enum nft_flowtable_attributes)
  */
 enum nf_tables_msg_types {
 	NFT_MSG_NEWTABLE,
@@ -115,6 +118,9 @@ enum nf_tables_msg_types {
 	NFT_MSG_GETOBJ,
 	NFT_MSG_DELOBJ,
 	NFT_MSG_GETOBJ_RESET,
+	NFT_MSG_NEWFLOWTABLE,
+	NFT_MSG_GETFLOWTABLE,
+	NFT_MSG_DELFLOWTABLE,
 	NFT_MSG_MAX,
 };
 
@@ -1307,6 +1313,53 @@ enum nft_object_attributes {
 #define NFTA_OBJ_MAX		(__NFTA_OBJ_MAX - 1)
 
 /**
+ * enum nft_flowtable_attributes - nf_tables flow table netlink attributes
+ *
+ * @NFTA_FLOWTABLE_TABLE: name of the table containing the flow table (NLA_STRING)
+ * @NFTA_FLOWTABLE_NAME: name of this flow table (NLA_STRING)
+ * @NFTA_FLOWTABLE_HOOK: netfilter hook configuration (NLA_NESTED)
+ * @NFTA_FLOWTABLE_USE: number of references to this flow table (NLA_U32)
+ */
+enum nft_flowtable_attributes {
+	NFTA_FLOWTABLE_UNSPEC,
+	NFTA_FLOWTABLE_TABLE,
+	NFTA_FLOWTABLE_NAME,
+	NFTA_FLOWTABLE_HOOK,
+	NFTA_FLOWTABLE_USE,
+	__NFTA_FLOWTABLE_MAX
+};
+#define NFTA_FLOWTABLE_MAX	(__NFTA_FLOWTABLE_MAX - 1)
+
+/**
+ * enum nft_flowtable_hook_attributes - nf_tables flow table hook netlink attributes
+ *
+ * @NFTA_FLOWTABLE_HOOK_NUM: netfilter hook number (NLA_U32)
+ * @NFTA_FLOWTABLE_HOOK_PRIORITY: netfilter hook priority (NLA_U32)
+ * @NFTA_FLOWTABLE_HOOK_DEVS: input devices this flow table is bound to (NLA_NESTED)
+ */
+enum nft_flowtable_hook_attributes {
+	NFTA_FLOWTABLE_HOOK_UNSPEC,
+	NFTA_FLOWTABLE_HOOK_NUM,
+	NFTA_FLOWTABLE_HOOK_PRIORITY,
+	NFTA_FLOWTABLE_HOOK_DEVS,
+	__NFTA_FLOWTABLE_HOOK_MAX
+};
+#define NFTA_FLOWTABLE_HOOK_MAX	(__NFTA_FLOWTABLE_HOOK_MAX - 1)
+
+/**
+ * enum nft_devices_attributes - nf_tables device netlink attributes
+ *
+ * @NFTA_DEVICE_NAME: name of this device (NLA_STRING)
+ */
+enum nft_devices_attributes {
+	NFTA_DEVICE_UNSPEC,
+	NFTA_DEVICE_NAME,
+	__NFTA_DEVICE_MAX
+};
+#define NFTA_DEVICE_MAX		(__NFTA_DEVICE_MAX - 1)
+
+
+/**
  * enum nft_trace_attributes - nf_tables trace netlink attributes
  *
  * @NFTA_TRACE_TABLE: name of the table (NLA_STRING)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 64e1ee091225..d91d4f2b1ac5 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -17,6 +17,7 @@
 #include <linux/netfilter.h>
 #include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_flow_table.h>
 #include <net/netfilter/nf_tables_core.h>
 #include <net/netfilter/nf_tables.h>
 #include <net/net_namespace.h>
@@ -24,6 +25,7 @@
 
 static LIST_HEAD(nf_tables_expressions);
 static LIST_HEAD(nf_tables_objects);
+static LIST_HEAD(nf_tables_flowtables);
 
 /**
  *	nft_register_afinfo - register nf_tables address family info
@@ -348,6 +350,40 @@ static int nft_delobj(struct nft_ctx *ctx, struct nft_object *obj)
 	return err;
 }
 
+static int nft_trans_flowtable_add(struct nft_ctx *ctx, int msg_type,
+				   struct nft_flowtable *flowtable)
+{
+	struct nft_trans *trans;
+
+	trans = nft_trans_alloc(ctx, msg_type,
+				sizeof(struct nft_trans_flowtable));
+	if (trans == NULL)
+		return -ENOMEM;
+
+	if (msg_type == NFT_MSG_NEWFLOWTABLE)
+		nft_activate_next(ctx->net, flowtable);
+
+	nft_trans_flowtable(trans) = flowtable;
+	list_add_tail(&trans->list, &ctx->net->nft.commit_list);
+
+	return 0;
+}
+
+static int nft_delflowtable(struct nft_ctx *ctx,
+			    struct nft_flowtable *flowtable)
+{
+	int err;
+
+	err = nft_trans_flowtable_add(ctx, NFT_MSG_DELFLOWTABLE, flowtable);
+	if (err < 0)
+		return err;
+
+	nft_deactivate_next(ctx->net, flowtable);
+	ctx->table->use--;
+
+	return err;
+}
+
 /*
  * Tables
  */
@@ -733,6 +769,7 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
 	INIT_LIST_HEAD(&table->chains);
 	INIT_LIST_HEAD(&table->sets);
 	INIT_LIST_HEAD(&table->objects);
+	INIT_LIST_HEAD(&table->flowtables);
 	table->flags = flags;
 
 	nft_ctx_init(&ctx, net, skb, nlh, afi, table, NULL, nla);
@@ -754,10 +791,11 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
 
 static int nft_flush_table(struct nft_ctx *ctx)
 {
-	int err;
+	struct nft_flowtable *flowtable, *nft;
 	struct nft_chain *chain, *nc;
 	struct nft_object *obj, *ne;
 	struct nft_set *set, *ns;
+	int err;
 
 	list_for_each_entry(chain, &ctx->table->chains, list) {
 		if (!nft_is_active_next(ctx->net, chain))
@@ -783,6 +821,12 @@ static int nft_flush_table(struct nft_ctx *ctx)
 			goto out;
 	}
 
+	list_for_each_entry_safe(flowtable, nft, &ctx->table->flowtables, list) {
+		err = nft_delflowtable(ctx, flowtable);
+		if (err < 0)
+			goto out;
+	}
+
 	list_for_each_entry_safe(obj, ne, &ctx->table->objects, list) {
 		err = nft_delobj(ctx, obj);
 		if (err < 0)
@@ -4779,6 +4823,607 @@ static void nf_tables_obj_notify(const struct nft_ctx *ctx,
 		       ctx->afi->family, ctx->report, GFP_KERNEL);
 }
 
+/*
+ * Flow tables
+ */
+void nft_register_flowtable_type(struct nf_flowtable_type *type)
+{
+	nfnl_lock(NFNL_SUBSYS_NFTABLES);
+	list_add_tail_rcu(&type->list, &nf_tables_flowtables);
+	nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+}
+EXPORT_SYMBOL_GPL(nft_register_flowtable_type);
+
+void nft_unregister_flowtable_type(struct nf_flowtable_type *type)
+{
+	nfnl_lock(NFNL_SUBSYS_NFTABLES);
+	list_del_rcu(&type->list);
+	nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+}
+EXPORT_SYMBOL_GPL(nft_unregister_flowtable_type);
+
+static const struct nla_policy nft_flowtable_policy[NFTA_FLOWTABLE_MAX + 1] = {
+	[NFTA_FLOWTABLE_TABLE]		= { .type = NLA_STRING,
+					    .len = NFT_NAME_MAXLEN - 1 },
+	[NFTA_FLOWTABLE_NAME]		= { .type = NLA_STRING,
+					    .len = NFT_NAME_MAXLEN - 1 },
+	[NFTA_FLOWTABLE_HOOK]		= { .type = NLA_NESTED },
+};
+
+struct nft_flowtable *nf_tables_flowtable_lookup(const struct nft_table *table,
+						 const struct nlattr *nla,
+						 u8 genmask)
+{
+	struct nft_flowtable *flowtable;
+
+	list_for_each_entry(flowtable, &table->flowtables, list) {
+		if (!nla_strcmp(nla, flowtable->name) &&
+		    nft_active_genmask(flowtable, genmask))
+			return flowtable;
+	}
+	return ERR_PTR(-ENOENT);
+}
+EXPORT_SYMBOL_GPL(nf_tables_flowtable_lookup);
+
+#define NFT_FLOWTABLE_DEVICE_MAX	8
+
+static int nf_tables_parse_devices(const struct nft_ctx *ctx,
+				   const struct nlattr *attr,
+				   struct net_device *dev_array[], int *len)
+{
+	const struct nlattr *tmp;
+	struct net_device *dev;
+	char ifname[IFNAMSIZ];
+	int rem, n = 0, err;
+
+	nla_for_each_nested(tmp, attr, rem) {
+		if (nla_type(tmp) != NFTA_DEVICE_NAME) {
+			err = -EINVAL;
+			goto err1;
+		}
+
+		nla_strlcpy(ifname, tmp, IFNAMSIZ);
+		dev = dev_get_by_name(ctx->net, ifname);
+		if (!dev) {
+			err = -ENOENT;
+			goto err1;
+		}
+
+		dev_array[n++] = dev;
+		if (n == NFT_FLOWTABLE_DEVICE_MAX) {
+			err = -EFBIG;
+			goto err1;
+		}
+	}
+	if (!len)
+		return -EINVAL;
+
+	err = 0;
+err1:
+	*len = n;
+	return err;
+}
+
+static const struct nla_policy nft_flowtable_hook_policy[NFTA_FLOWTABLE_HOOK_MAX + 1] = {
+	[NFTA_FLOWTABLE_HOOK_NUM]	= { .type = NLA_U32 },
+	[NFTA_FLOWTABLE_HOOK_PRIORITY]	= { .type = NLA_U32 },
+	[NFTA_FLOWTABLE_HOOK_DEVS]	= { .type = NLA_NESTED },
+};
+
+static int nf_tables_flowtable_parse_hook(const struct nft_ctx *ctx,
+					  const struct nlattr *attr,
+					  struct nft_flowtable *flowtable)
+{
+	struct net_device *dev_array[NFT_FLOWTABLE_DEVICE_MAX];
+	struct nlattr *tb[NFTA_FLOWTABLE_HOOK_MAX + 1];
+	struct nf_hook_ops *ops;
+	int hooknum, priority;
+	int err, n = 0, i;
+
+	err = nla_parse_nested(tb, NFTA_FLOWTABLE_HOOK_MAX, attr,
+			       nft_flowtable_hook_policy, NULL);
+	if (err < 0)
+		return err;
+
+	if (!tb[NFTA_FLOWTABLE_HOOK_NUM] ||
+	    !tb[NFTA_FLOWTABLE_HOOK_PRIORITY] ||
+	    !tb[NFTA_FLOWTABLE_HOOK_DEVS])
+		return -EINVAL;
+
+	hooknum = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_NUM]));
+	if (hooknum >= ctx->afi->nhooks)
+		return -EINVAL;
+
+	priority = ntohl(nla_get_be32(tb[NFTA_FLOWTABLE_HOOK_PRIORITY]));
+
+	err = nf_tables_parse_devices(ctx, tb[NFTA_FLOWTABLE_HOOK_DEVS],
+				      dev_array, &n);
+	if (err < 0)
+		goto err1;
+
+	ops = kmalloc(sizeof(struct nf_hook_ops) * n, GFP_KERNEL);
+	if (!ops) {
+		err = -ENOMEM;
+		goto err1;
+	}
+
+	flowtable->ops		= ops;
+	flowtable->ops_len	= n;
+
+	for (i = 0; i < n; i++) {
+		flowtable->ops[i].pf		= NFPROTO_NETDEV;
+		flowtable->ops[i].hooknum	= hooknum;
+		flowtable->ops[i].priority	= priority;
+		flowtable->ops[i].priv		= &flowtable->data.rhashtable;
+		flowtable->ops[i].hook		= flowtable->data.type->hook;
+		flowtable->ops[i].dev		= dev_array[i];
+	}
+
+	err = 0;
+err1:
+	for (i = 0; i < n; i++)
+		dev_put(dev_array[i]);
+
+	return err;
+}
+
+static const struct nf_flowtable_type *
+__nft_flowtable_type_get(const struct nft_af_info *afi)
+{
+	const struct nf_flowtable_type *type;
+
+	list_for_each_entry(type, &nf_tables_flowtables, list) {
+		if (afi->family == type->family)
+			return type;
+	}
+	return NULL;
+}
+
+static const struct nf_flowtable_type *
+nft_flowtable_type_get(const struct nft_af_info *afi)
+{
+	const struct nf_flowtable_type *type;
+
+	type = __nft_flowtable_type_get(afi);
+	if (type != NULL && try_module_get(type->owner))
+		return type;
+
+#ifdef CONFIG_MODULES
+	if (type == NULL) {
+		nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+		request_module("nft-flowtable-%u", afi->family);
+		nfnl_lock(NFNL_SUBSYS_NFTABLES);
+		if (__nft_flowtable_type_get(afi))
+			return ERR_PTR(-EAGAIN);
+	}
+#endif
+	return ERR_PTR(-ENOENT);
+}
+
+void nft_flow_table_iterate(struct net *net,
+			    void (*iter)(struct nf_flowtable *flowtable, void *data),
+			    void *data)
+{
+	struct nft_flowtable *flowtable;
+	const struct nft_af_info *afi;
+	const struct nft_table *table;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(afi, &net->nft.af_info, list) {
+		list_for_each_entry_rcu(table, &afi->tables, list) {
+			list_for_each_entry_rcu(flowtable, &table->flowtables, list) {
+				iter(&flowtable->data, data);
+			}
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(nft_flow_table_iterate);
+
+static void nft_unregister_flowtable_net_hooks(struct net *net,
+					       struct nft_flowtable *flowtable)
+{
+	int i;
+
+	for (i = 0; i < flowtable->ops_len; i++) {
+		if (!flowtable->ops[i].dev)
+			continue;
+
+		nf_unregister_net_hook(net, &flowtable->ops[i]);
+	}
+}
+
+static int nf_tables_newflowtable(struct net *net, struct sock *nlsk,
+				  struct sk_buff *skb,
+				  const struct nlmsghdr *nlh,
+				  const struct nlattr * const nla[],
+				  struct netlink_ext_ack *extack)
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
+	const struct nf_flowtable_type *type;
+	u8 genmask = nft_genmask_next(net);
+	int family = nfmsg->nfgen_family;
+	struct nft_flowtable *flowtable;
+	struct nft_af_info *afi;
+	struct nft_table *table;
+	struct nft_ctx ctx;
+	int err, i, k;
+
+	if (!nla[NFTA_FLOWTABLE_TABLE] ||
+	    !nla[NFTA_FLOWTABLE_NAME] ||
+	    !nla[NFTA_FLOWTABLE_HOOK])
+		return -EINVAL;
+
+	afi = nf_tables_afinfo_lookup(net, family, true);
+	if (IS_ERR(afi))
+		return PTR_ERR(afi);
+
+	table = nf_tables_table_lookup(afi, nla[NFTA_FLOWTABLE_TABLE], genmask);
+	if (IS_ERR(table))
+		return PTR_ERR(table);
+
+	flowtable = nf_tables_flowtable_lookup(table, nla[NFTA_FLOWTABLE_NAME],
+					       genmask);
+	if (IS_ERR(flowtable)) {
+		err = PTR_ERR(flowtable);
+		if (err != -ENOENT)
+			return err;
+	} else {
+		if (nlh->nlmsg_flags & NLM_F_EXCL)
+			return -EEXIST;
+
+		return 0;
+	}
+
+	nft_ctx_init(&ctx, net, skb, nlh, afi, table, NULL, nla);
+
+	flowtable = kzalloc(sizeof(*flowtable), GFP_KERNEL);
+	if (!flowtable)
+		return -ENOMEM;
+
+	flowtable->table = table;
+	flowtable->name = nla_strdup(nla[NFTA_FLOWTABLE_NAME], GFP_KERNEL);
+	if (!flowtable->name) {
+		err = -ENOMEM;
+		goto err1;
+	}
+
+	type = nft_flowtable_type_get(afi);
+	if (IS_ERR(type)) {
+		err = PTR_ERR(type);
+		goto err2;
+	}
+
+	flowtable->data.type = type;
+	err = rhashtable_init(&flowtable->data.rhashtable, type->params);
+	if (err < 0)
+		goto err3;
+
+	err = type->init(&flowtable->data);
+	if (err < 0)
+		goto err3;
+
+	err = nf_tables_flowtable_parse_hook(&ctx, nla[NFTA_FLOWTABLE_HOOK],
+					     flowtable);
+	if (err < 0)
+		goto err4;
+
+	for (i = 0; i < flowtable->ops_len; i++) {
+		err = nf_register_net_hook(net, &flowtable->ops[i]);
+		if (err < 0)
+			goto err5;
+	}
+
+	err = nft_trans_flowtable_add(&ctx, NFT_MSG_NEWFLOWTABLE, flowtable);
+	if (err < 0)
+		goto err6;
+
+	list_add_tail_rcu(&flowtable->list, &table->flowtables);
+	table->use++;
+
+	return 0;
+err6:
+	i = flowtable->ops_len;
+err5:
+	for (k = i - 1; k >= 0; k--)
+		nf_unregister_net_hook(net, &flowtable->ops[k]);
+
+	kfree(flowtable->ops);
+err4:
+	type->destroy(&flowtable->data);
+err3:
+	module_put(type->owner);
+err2:
+	kfree(flowtable->name);
+err1:
+	kfree(flowtable);
+	return err;
+}
+
+static int nf_tables_delflowtable(struct net *net, struct sock *nlsk,
+				  struct sk_buff *skb,
+				  const struct nlmsghdr *nlh,
+				  const struct nlattr * const nla[],
+				  struct netlink_ext_ack *extack)
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
+	u8 genmask = nft_genmask_next(net);
+	int family = nfmsg->nfgen_family;
+	struct nft_flowtable *flowtable;
+	struct nft_af_info *afi;
+	struct nft_table *table;
+	struct nft_ctx ctx;
+
+	afi = nf_tables_afinfo_lookup(net, family, true);
+	if (IS_ERR(afi))
+		return PTR_ERR(afi);
+
+	table = nf_tables_table_lookup(afi, nla[NFTA_FLOWTABLE_TABLE], genmask);
+	if (IS_ERR(table))
+		return PTR_ERR(table);
+
+	flowtable = nf_tables_flowtable_lookup(table, nla[NFTA_FLOWTABLE_NAME],
+					       genmask);
+	if (IS_ERR(flowtable))
+		return PTR_ERR(flowtable);
+	if (flowtable->use > 0)
+		return -EBUSY;
+
+	nft_ctx_init(&ctx, net, skb, nlh, afi, table, NULL, nla);
+
+	return nft_delflowtable(&ctx, flowtable);
+}
+
+static int nf_tables_fill_flowtable_info(struct sk_buff *skb, struct net *net,
+					 u32 portid, u32 seq, int event,
+					 u32 flags, int family,
+					 struct nft_flowtable *flowtable)
+{
+	struct nlattr *nest, *nest_devs;
+	struct nfgenmsg *nfmsg;
+	struct nlmsghdr *nlh;
+	int i;
+
+	event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
+	nlh = nlmsg_put(skb, portid, seq, event, sizeof(struct nfgenmsg), flags);
+	if (nlh == NULL)
+		goto nla_put_failure;
+
+	nfmsg = nlmsg_data(nlh);
+	nfmsg->nfgen_family	= family;
+	nfmsg->version		= NFNETLINK_V0;
+	nfmsg->res_id		= htons(net->nft.base_seq & 0xffff);
+
+	if (nla_put_string(skb, NFTA_FLOWTABLE_TABLE, flowtable->table->name) ||
+	    nla_put_string(skb, NFTA_FLOWTABLE_NAME, flowtable->name) ||
+	    nla_put_be32(skb, NFTA_FLOWTABLE_USE, htonl(flowtable->use)))
+		goto nla_put_failure;
+
+	nest = nla_nest_start(skb, NFTA_FLOWTABLE_HOOK);
+	if (nla_put_be32(skb, NFTA_FLOWTABLE_HOOK_NUM, htonl(flowtable->hooknum)) ||
+	    nla_put_be32(skb, NFTA_FLOWTABLE_HOOK_PRIORITY, htonl(flowtable->priority)))
+		goto nla_put_failure;
+
+	nest_devs = nla_nest_start(skb, NFTA_FLOWTABLE_HOOK_DEVS);
+	if (!nest_devs)
+		goto nla_put_failure;
+
+	for (i = 0; i < flowtable->ops_len; i++) {
+		if (flowtable->ops[i].dev &&
+		    nla_put_string(skb, NFTA_DEVICE_NAME,
+				   flowtable->ops[i].dev->name))
+			goto nla_put_failure;
+	}
+	nla_nest_end(skb, nest_devs);
+	nla_nest_end(skb, nest);
+
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_trim(skb, nlh);
+	return -1;
+}
+
+struct nft_flowtable_filter {
+	char		*table;
+};
+
+static int nf_tables_dump_flowtable(struct sk_buff *skb,
+				    struct netlink_callback *cb)
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
+	struct nft_flowtable_filter *filter = cb->data;
+	unsigned int idx = 0, s_idx = cb->args[0];
+	struct net *net = sock_net(skb->sk);
+	int family = nfmsg->nfgen_family;
+	struct nft_flowtable *flowtable;
+	const struct nft_af_info *afi;
+	const struct nft_table *table;
+
+	rcu_read_lock();
+	cb->seq = net->nft.base_seq;
+
+	list_for_each_entry_rcu(afi, &net->nft.af_info, list) {
+		if (family != NFPROTO_UNSPEC && family != afi->family)
+			continue;
+
+		list_for_each_entry_rcu(table, &afi->tables, list) {
+			list_for_each_entry_rcu(flowtable, &table->flowtables, list) {
+				if (!nft_is_active(net, flowtable))
+					goto cont;
+				if (idx < s_idx)
+					goto cont;
+				if (idx > s_idx)
+					memset(&cb->args[1], 0,
+					       sizeof(cb->args) - sizeof(cb->args[0]));
+				if (filter && filter->table[0] &&
+				    strcmp(filter->table, table->name))
+					goto cont;
+
+				if (nf_tables_fill_flowtable_info(skb, net, NETLINK_CB(cb->skb).portid,
+								  cb->nlh->nlmsg_seq,
+								  NFT_MSG_NEWFLOWTABLE,
+								  NLM_F_MULTI | NLM_F_APPEND,
+								  afi->family, flowtable) < 0)
+					goto done;
+
+				nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+cont:
+				idx++;
+			}
+		}
+	}
+done:
+	rcu_read_unlock();
+
+	cb->args[0] = idx;
+	return skb->len;
+}
+
+static int nf_tables_dump_flowtable_done(struct netlink_callback *cb)
+{
+	struct nft_flowtable_filter *filter = cb->data;
+
+	if (!filter)
+		return 0;
+
+	kfree(filter->table);
+	kfree(filter);
+
+	return 0;
+}
+
+static struct nft_flowtable_filter *
+nft_flowtable_filter_alloc(const struct nlattr * const nla[])
+{
+	struct nft_flowtable_filter *filter;
+
+	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
+	if (!filter)
+		return ERR_PTR(-ENOMEM);
+
+	if (nla[NFTA_FLOWTABLE_TABLE]) {
+		filter->table = nla_strdup(nla[NFTA_FLOWTABLE_TABLE],
+					   GFP_KERNEL);
+		if (!filter->table) {
+			kfree(filter);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+	return filter;
+}
+
+static int nf_tables_getflowtable(struct net *net, struct sock *nlsk,
+				  struct sk_buff *skb,
+				  const struct nlmsghdr *nlh,
+				  const struct nlattr * const nla[],
+				  struct netlink_ext_ack *extack)
+{
+	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
+	u8 genmask = nft_genmask_cur(net);
+	int family = nfmsg->nfgen_family;
+	struct nft_flowtable *flowtable;
+	const struct nft_af_info *afi;
+	const struct nft_table *table;
+	struct sk_buff *skb2;
+	int err;
+
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = nf_tables_dump_flowtable,
+			.done = nf_tables_dump_flowtable_done,
+		};
+
+		if (nla[NFTA_FLOWTABLE_TABLE]) {
+			struct nft_flowtable_filter *filter;
+
+			filter = nft_flowtable_filter_alloc(nla);
+			if (IS_ERR(filter))
+				return -ENOMEM;
+
+			c.data = filter;
+		}
+		return netlink_dump_start(nlsk, skb, nlh, &c);
+	}
+
+	if (!nla[NFTA_FLOWTABLE_NAME])
+		return -EINVAL;
+
+	afi = nf_tables_afinfo_lookup(net, family, false);
+	if (IS_ERR(afi))
+		return PTR_ERR(afi);
+
+	table = nf_tables_table_lookup(afi, nla[NFTA_FLOWTABLE_TABLE], genmask);
+	if (IS_ERR(table))
+		return PTR_ERR(table);
+
+	flowtable = nf_tables_flowtable_lookup(table, nla[NFTA_FLOWTABLE_NAME],
+					       genmask);
+	if (IS_ERR(flowtable))
+		return PTR_ERR(flowtable);
+
+	skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb2)
+		return -ENOMEM;
+
+	err = nf_tables_fill_flowtable_info(skb2, net, NETLINK_CB(skb).portid,
+					    nlh->nlmsg_seq,
+					    NFT_MSG_NEWFLOWTABLE, 0, family,
+					    flowtable);
+	if (err < 0)
+		goto err;
+
+	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+err:
+	kfree_skb(skb2);
+	return err;
+}
+
+static void nf_tables_flowtable_notify(struct nft_ctx *ctx,
+				       struct nft_flowtable *flowtable,
+				       int event)
+{
+	struct sk_buff *skb;
+	int err;
+
+	if (!ctx->report &&
+	    !nfnetlink_has_listeners(ctx->net, NFNLGRP_NFTABLES))
+		return;
+
+	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (skb == NULL)
+		goto err;
+
+	err = nf_tables_fill_flowtable_info(skb, ctx->net, ctx->portid,
+					    ctx->seq, event, 0,
+					    ctx->afi->family, flowtable);
+	if (err < 0) {
+		kfree_skb(skb);
+		goto err;
+	}
+
+	nfnetlink_send(skb, ctx->net, ctx->portid, NFNLGRP_NFTABLES,
+		       ctx->report, GFP_KERNEL);
+	return;
+err:
+	nfnetlink_set_err(ctx->net, ctx->portid, NFNLGRP_NFTABLES, -ENOBUFS);
+}
+
+static void nft_flowtable_destroy(void *ptr, void *arg)
+{
+	kfree(ptr);
+}
+
+static void nf_tables_flowtable_destroy(struct nft_flowtable *flowtable)
+{
+	flowtable->data.type->destroy(&flowtable->data);
+	kfree(flowtable->name);
+	rhashtable_free_and_destroy(&flowtable->data.rhashtable,
+				    nft_flowtable_destroy, NULL);
+	module_put(flowtable->data.type->owner);
+}
+
 static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
 				   u32 portid, u32 seq)
 {
@@ -4809,6 +5454,49 @@ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
 	return -EMSGSIZE;
 }
 
+static void nft_flowtable_event(unsigned long event, struct net_device *dev,
+				struct nft_flowtable *flowtable)
+{
+	int i;
+
+	for (i = 0; i < flowtable->ops_len; i++) {
+		if (flowtable->ops[i].dev != dev)
+			continue;
+
+		nf_unregister_net_hook(dev_net(dev), &flowtable->ops[i]);
+		flowtable->ops[i].dev = NULL;
+		break;
+	}
+}
+
+static int nf_tables_flowtable_event(struct notifier_block *this,
+				     unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct nft_flowtable *flowtable;
+	struct nft_table *table;
+	struct nft_af_info *afi;
+
+	if (event != NETDEV_UNREGISTER)
+		return 0;
+
+	nfnl_lock(NFNL_SUBSYS_NFTABLES);
+	list_for_each_entry(afi, &dev_net(dev)->nft.af_info, list) {
+		list_for_each_entry(table, &afi->tables, list) {
+			list_for_each_entry(flowtable, &table->flowtables, list) {
+				nft_flowtable_event(event, dev, flowtable);
+			}
+		}
+	}
+	nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block nf_tables_flowtable_notifier = {
+	.notifier_call	= nf_tables_flowtable_event,
+};
+
 static void nf_tables_gen_notify(struct net *net, struct sk_buff *skb,
 				 int event)
 {
@@ -4961,6 +5649,21 @@ static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
 		.attr_count	= NFTA_OBJ_MAX,
 		.policy		= nft_obj_policy,
 	},
+	[NFT_MSG_NEWFLOWTABLE] = {
+		.call_batch	= nf_tables_newflowtable,
+		.attr_count	= NFTA_FLOWTABLE_MAX,
+		.policy		= nft_flowtable_policy,
+	},
+	[NFT_MSG_GETFLOWTABLE] = {
+		.call		= nf_tables_getflowtable,
+		.attr_count	= NFTA_FLOWTABLE_MAX,
+		.policy		= nft_flowtable_policy,
+	},
+	[NFT_MSG_DELFLOWTABLE] = {
+		.call_batch	= nf_tables_delflowtable,
+		.attr_count	= NFTA_FLOWTABLE_MAX,
+		.policy		= nft_flowtable_policy,
+	},
 };
 
 static void nft_chain_commit_update(struct nft_trans *trans)
@@ -5006,6 +5709,9 @@ static void nf_tables_commit_release(struct nft_trans *trans)
 	case NFT_MSG_DELOBJ:
 		nft_obj_destroy(nft_trans_obj(trans));
 		break;
+	case NFT_MSG_DELFLOWTABLE:
+		nf_tables_flowtable_destroy(nft_trans_flowtable(trans));
+		break;
 	}
 	kfree(trans);
 }
@@ -5124,6 +5830,21 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 			nf_tables_obj_notify(&trans->ctx, nft_trans_obj(trans),
 					     NFT_MSG_DELOBJ);
 			break;
+		case NFT_MSG_NEWFLOWTABLE:
+			nft_clear(net, nft_trans_flowtable(trans));
+			nf_tables_flowtable_notify(&trans->ctx,
+						   nft_trans_flowtable(trans),
+						   NFT_MSG_NEWFLOWTABLE);
+			nft_trans_destroy(trans);
+			break;
+		case NFT_MSG_DELFLOWTABLE:
+			list_del_rcu(&nft_trans_flowtable(trans)->list);
+			nf_tables_flowtable_notify(&trans->ctx,
+						   nft_trans_flowtable(trans),
+						   NFT_MSG_DELFLOWTABLE);
+			nft_unregister_flowtable_net_hooks(net,
+					nft_trans_flowtable(trans));
+			break;
 		}
 	}
 
@@ -5161,6 +5882,9 @@ static void nf_tables_abort_release(struct nft_trans *trans)
 	case NFT_MSG_NEWOBJ:
 		nft_obj_destroy(nft_trans_obj(trans));
 		break;
+	case NFT_MSG_NEWFLOWTABLE:
+		nf_tables_flowtable_destroy(nft_trans_flowtable(trans));
+		break;
 	}
 	kfree(trans);
 }
@@ -5251,6 +5975,17 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb)
 			nft_clear(trans->ctx.net, nft_trans_obj(trans));
 			nft_trans_destroy(trans);
 			break;
+		case NFT_MSG_NEWFLOWTABLE:
+			trans->ctx.table->use--;
+			list_del_rcu(&nft_trans_flowtable(trans)->list);
+			nft_unregister_flowtable_net_hooks(net,
+					nft_trans_flowtable(trans));
+			break;
+		case NFT_MSG_DELFLOWTABLE:
+			trans->ctx.table->use++;
+			nft_clear(trans->ctx.net, nft_trans_flowtable(trans));
+			nft_trans_destroy(trans);
+			break;
 		}
 	}
 
@@ -5802,6 +6537,7 @@ EXPORT_SYMBOL_GPL(__nft_release_basechain);
 /* Called by nft_unregister_afinfo() from __net_exit path, nfnl_lock is held. */
 static void __nft_release_afinfo(struct net *net, struct nft_af_info *afi)
 {
+	struct nft_flowtable *flowtable, *nf;
 	struct nft_table *table, *nt;
 	struct nft_chain *chain, *nc;
 	struct nft_object *obj, *ne;
@@ -5816,6 +6552,9 @@ static void __nft_release_afinfo(struct net *net, struct nft_af_info *afi)
 		list_for_each_entry(chain, &table->chains, list)
 			nf_tables_unregister_hooks(net, table, chain,
 						   afi->nops);
+		list_for_each_entry(flowtable, &table->flowtables, list)
+			nf_unregister_net_hooks(net, flowtable->ops,
+						flowtable->ops_len);
 		/* No packets are walking on these chains anymore. */
 		ctx.table = table;
 		list_for_each_entry(chain, &table->chains, list) {
@@ -5826,6 +6565,11 @@ static void __nft_release_afinfo(struct net *net, struct nft_af_info *afi)
 				nf_tables_rule_destroy(&ctx, rule);
 			}
 		}
+		list_for_each_entry_safe(flowtable, nf, &table->flowtables, list) {
+			list_del(&flowtable->list);
+			table->use--;
+			nf_tables_flowtable_destroy(flowtable);
+		}
 		list_for_each_entry_safe(set, ns, &table->sets, list) {
 			list_del(&set->list);
 			table->use--;
@@ -5869,6 +6613,8 @@ static int __init nf_tables_module_init(void)
 	if (err < 0)
 		goto err3;
 
+	register_netdevice_notifier(&nf_tables_flowtable_notifier);
+
 	pr_info("nf_tables: (c) 2007-2009 Patrick McHardy <kaber@trash.net>\n");
 	return register_pernet_subsys(&nf_tables_net_ops);
 err3:
@@ -5883,6 +6629,7 @@ static void __exit nf_tables_module_exit(void)
 {
 	unregister_pernet_subsys(&nf_tables_net_ops);
 	nfnetlink_subsys_unregister(&nf_tables_subsys);
+	unregister_netdevice_notifier(&nf_tables_flowtable_notifier);
 	rcu_barrier();
 	nf_tables_core_module_exit();
 	kfree(info);
-- 
2.11.0



* [PATCH nf-next RFC,v2 3/6] netfilter: add generic flow table infrastructure
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 2/6] netfilter: nf_tables: add flow table netlink frontend Pablo Neira Ayuso
@ 2017-12-07 12:44 ` Pablo Neira Ayuso
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4 Pablo Neira Ayuso
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:44 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

This patch defines the API to interact with flow tables. It allows
adding, deleting and looking up entries in the flow table. It also adds
the generic garbage collection code that removes entries that have
expired, i.e. no traffic has been seen for a while.

Users of the flow table infrastructure can delete entries via
flow_offload_dead(), which sets the dying bit. This signals the garbage
collector to release the entry from user context.
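
The release decision described above can be modelled in userspace roughly
as follows. This is only a sketch: struct demo_flow, DEMO_FLOW_DYING and
the demo_* helpers are hypothetical stand-ins for struct flow_offload,
FLOW_OFFLOAD_DYING, nf_flow_has_expired() and flow_offload_dead(), and
"now" replaces jiffies. It shows why the deadline test is done on the
signed difference, so that it keeps working when the 32-bit tick counter
wraps around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for FLOW_OFFLOAD_DYING. */
#define DEMO_FLOW_DYING	0x4

/* Hypothetical, reduced version of struct flow_offload. */
struct demo_flow {
	uint32_t flags;
	uint32_t timeout;	/* absolute deadline, in ticks */
};

/* Wraparound-safe deadline test: compare the signed difference with zero. */
static bool demo_flow_has_expired(const struct demo_flow *flow, uint32_t now)
{
	return (int32_t)(flow->timeout - now) <= 0;
}

/* Mark the entry so that the next garbage collection run releases it. */
static void demo_flow_dead(struct demo_flow *flow)
{
	flow->flags |= DEMO_FLOW_DYING;
}

/* An entry is released once it has expired or has been marked as dying. */
static bool demo_flow_should_release(const struct demo_flow *flow, uint32_t now)
{
	return demo_flow_has_expired(flow, now) ||
	       (flow->flags & DEMO_FLOW_DYING);
}
```

With a deadline of 5 and a current tick value of 0xfffffff0, the entry is
still alive because the deadline lies 21 ticks in the future once the
counter wraps.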

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_flow_table.h |  71 +++++++++
 net/netfilter/Kconfig                 |   7 +
 net/netfilter/Makefile                |   3 +
 net/netfilter/nf_flow_table.c         | 269 ++++++++++++++++++++++++++++++++++
 4 files changed, 350 insertions(+)
 create mode 100644 net/netfilter/nf_flow_table.c

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 429833dba2d9..1a2598b4a58f 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -1,7 +1,11 @@
 #ifndef _NF_FLOW_TABLE_H
 #define _NF_FLOW_TABLE_H
 
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/netdevice.h>
 #include <linux/rhashtable.h>
+#include <linux/rcupdate.h>
 
 struct nf_flowtable;
 
@@ -21,4 +25,71 @@ struct nf_flowtable {
 	struct delayed_work		gc_work;
 };
 
+enum flow_offload_tuple_dir {
+	FLOW_OFFLOAD_DIR_ORIGINAL,
+	FLOW_OFFLOAD_DIR_REPLY,
+	__FLOW_OFFLOAD_DIR_MAX		= FLOW_OFFLOAD_DIR_REPLY,
+};
+#define FLOW_OFFLOAD_DIR_MAX	(__FLOW_OFFLOAD_DIR_MAX + 1)
+
+struct flow_offload_tuple {
+	union {
+		struct in_addr		src_v4;
+		struct in6_addr		src_v6;
+	};
+	union {
+		struct in_addr		dst_v4;
+		struct in6_addr		dst_v6;
+	};
+	struct {
+		__be16			src_port;
+		__be16			dst_port;
+	};
+
+	u8				l3proto;
+	u8				l4proto;
+	u8				dir;
+
+	int				iifidx;
+	int				oifidx;
+
+	union {
+		__be32			gateway;
+		struct in6_addr		gateway6;
+	};
+};
+
+struct flow_offload_tuple_rhash {
+	struct rhash_head		node;
+	struct flow_offload_tuple	tuple;
+};
+
+#define	FLOW_OFFLOAD_SNAT	0x1
+#define	FLOW_OFFLOAD_DNAT	0x2
+#define	FLOW_OFFLOAD_DYING	0x4
+
+struct flow_offload {
+	struct flow_offload_tuple_rhash		tuplehash[FLOW_OFFLOAD_DIR_MAX];
+	u32					flags;
+	union {
+		/* Your private driver data here. */
+		u32		timeout;
+	};
+};
+
+struct flow_offload *flow_offload_alloc(struct nf_conn *ct, int *ifindex,
+					union nf_inet_addr *gw);
+void flow_offload_free(const struct flow_offload *flow);
+
+int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow);
+void flow_offload_del(struct nf_flowtable *flow_table, struct flow_offload *flow);
+struct flow_offload_tuple_rhash *flow_offload_lookup(struct nf_flowtable *flow_table,
+						     struct flow_offload_tuple *tuple);
+int nf_flow_table_iterate(struct nf_flowtable *flow_table,
+			  void (*iter)(struct flow_offload *flow, void *data),
+			  void *data);
+void nf_flow_offload_work_gc(struct work_struct *work);
+
+void flow_offload_dead(struct flow_offload *flow);
+
 #endif /* _FLOW_OFFLOAD_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e4a13cc8a2e7..af0f58322515 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -649,6 +649,13 @@ endif # NF_TABLES_NETDEV
 
 endif # NF_TABLES
 
+config NF_FLOW_TABLE
+	tristate "Netfilter flow table module"
+	help
+	  This option adds the flow table core infrastructure.
+
+	  To compile it as a module, choose M here.
+
 config NETFILTER_XTABLES
 	tristate "Netfilter Xtables support (required for ip_tables)"
 	default m if NETFILTER_ADVANCED=n
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index d3891c93edd6..1f7d92bd571a 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -106,6 +106,9 @@ obj-$(CONFIG_NFT_FIB_NETDEV)	+= nft_fib_netdev.o
 obj-$(CONFIG_NFT_DUP_NETDEV)	+= nft_dup_netdev.o
 obj-$(CONFIG_NFT_FWD_NETDEV)	+= nft_fwd_netdev.o
 
+# flow table infrastructure
+obj-$(CONFIG_NF_FLOW_TABLE)	+= nf_flow_table.o
+
 # generic X tables 
 obj-$(CONFIG_NETFILTER_XTABLES) += x_tables.o xt_tcpudp.o
 
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
new file mode 100644
index 000000000000..ff27dad268c3
--- /dev/null
+++ b/net/netfilter/nf_flow_table.c
@@ -0,0 +1,269 @@
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netfilter.h>
+#include <linux/rhashtable.h>
+#include <linux/netdevice.h>
+#include <net/netfilter/nf_flow_table.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_tuple.h>
+
+struct flow_offload_entry {
+	struct flow_offload	flow;
+	struct nf_conn		*ct;
+	struct rcu_head		rcu_head;
+};
+
+struct flow_offload *
+flow_offload_alloc(struct nf_conn *ct, int *iifindex, union nf_inet_addr *gw)
+{
+	struct flow_offload_entry *entry;
+	struct flow_offload *flow;
+
+	entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+	if (!entry)
+		return NULL;
+
+	if (unlikely(nf_ct_is_dying(ct) ||
+	    !atomic_inc_not_zero(&ct->ct_general.use))) {
+		kfree(entry);
+		return NULL;
+	}
+	entry->ct = ct;
+
+	flow = &entry->flow;
+	switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num) {
+	case NFPROTO_IPV4:
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v4 =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.in;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v4 =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.in;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v4 =
+			ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3.in;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4 =
+			ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.in;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l3proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l4proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l3proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l4proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.gateway =
+			gw[FLOW_OFFLOAD_DIR_ORIGINAL].ip;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.gateway =
+			gw[FLOW_OFFLOAD_DIR_REPLY].ip;
+		break;
+	case NFPROTO_IPV6:
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v6 =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.in6;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v6 =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.in6;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v6 =
+			ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u3.in6;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v6 =
+			ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.in6;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l3proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.l4proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l3proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.l4proto =
+			ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.gateway6 =
+			gw[FLOW_OFFLOAD_DIR_ORIGINAL].in6;
+		flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.gateway6 =
+			gw[FLOW_OFFLOAD_DIR_REPLY].in6;
+		break;
+	}
+	flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_port =
+		ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u.tcp.port;
+	flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_port =
+		ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u.tcp.port;
+	flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_port =
+		ct->tuplehash[IP_CT_DIR_REPLY].tuple.src.u.tcp.port;
+	flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_port =
+		ct->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u.tcp.port;
+
+	flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dir =
+						FLOW_OFFLOAD_DIR_ORIGINAL;
+	flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dir =
+						FLOW_OFFLOAD_DIR_REPLY;
+
+	flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx =
+		iifindex[FLOW_OFFLOAD_DIR_ORIGINAL];
+	flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.oifidx =
+		iifindex[FLOW_OFFLOAD_DIR_REPLY];
+	flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.iifidx =
+		iifindex[FLOW_OFFLOAD_DIR_REPLY];
+	flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.oifidx =
+		iifindex[FLOW_OFFLOAD_DIR_ORIGINAL];
+
+	if (ct->status & IPS_SRC_NAT)
+		flow->flags |= FLOW_OFFLOAD_SNAT;
+	if (ct->status & IPS_DST_NAT)
+		flow->flags |= FLOW_OFFLOAD_DNAT;
+
+	return flow;
+}
+EXPORT_SYMBOL_GPL(flow_offload_alloc);
+
+void flow_offload_free(const struct flow_offload *flow)
+{
+	struct flow_offload_entry *e;
+
+	e = container_of(flow, struct flow_offload_entry, flow);
+	kfree(e);
+}
+EXPORT_SYMBOL_GPL(flow_offload_free);
+
+void flow_offload_dead(struct flow_offload *flow)
+{
+	flow->flags |= FLOW_OFFLOAD_DYING;
+}
+EXPORT_SYMBOL_GPL(flow_offload_dead);
+
+int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
+{
+	flow->timeout = (u32)jiffies;
+
+	rhashtable_insert_fast(&flow_table->rhashtable,
+			       &flow->tuplehash[0].node,
+			       *flow_table->type->params);
+	rhashtable_insert_fast(&flow_table->rhashtable,
+			       &flow->tuplehash[1].node,
+			       *flow_table->type->params);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(flow_offload_add);
+
+void flow_offload_del(struct nf_flowtable *flow_table,
+		      struct flow_offload *flow)
+{
+	struct flow_offload_entry *e;
+
+	rhashtable_remove_fast(&flow_table->rhashtable,
+			       &flow->tuplehash[0].node,
+			       *flow_table->type->params);
+	rhashtable_remove_fast(&flow_table->rhashtable,
+			       &flow->tuplehash[1].node,
+			       *flow_table->type->params);
+
+	e = container_of(flow, struct flow_offload_entry, flow);
+	kfree_rcu(e, rcu_head);
+}
+EXPORT_SYMBOL_GPL(flow_offload_del);
+
+struct flow_offload_tuple_rhash *
+flow_offload_lookup(struct nf_flowtable *flow_table,
+		    struct flow_offload_tuple *tuple)
+{
+	return rhashtable_lookup_fast(&flow_table->rhashtable, tuple,
+				      *flow_table->type->params);
+}
+EXPORT_SYMBOL_GPL(flow_offload_lookup);
+
+static void nf_flow_release_ct(const struct flow_offload *flow)
+{
+	struct flow_offload_entry *e;
+
+	e = container_of(flow, struct flow_offload_entry, flow);
+	nf_ct_delete(e->ct, 0, 0);
+	nf_ct_put(e->ct);
+}
+
+int nf_flow_table_iterate(struct nf_flowtable *flow_table,
+			  void (*iter)(struct flow_offload *flow, void *data),
+			  void *data)
+{
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct rhashtable_iter hti;
+	struct flow_offload *flow;
+	int err;
+
+	rhashtable_walk_init(&flow_table->rhashtable, &hti, GFP_KERNEL);
+	err = rhashtable_walk_start(&hti);
+	if (err && err != -EAGAIN)
+		goto out;
+
+	while ((tuplehash = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(tuplehash)) {
+			err = PTR_ERR(tuplehash);
+			if (err != -EAGAIN)
+				goto out;
+
+			continue;
+		}
+		if (tuplehash->tuple.dir)
+			continue;
+
+		flow = container_of(tuplehash, struct flow_offload, tuplehash[0]);
+
+		iter(flow, data);
+	}
+out:
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
+
+static inline bool nf_flow_has_expired(const struct flow_offload *flow)
+{
+	return (__s32)(flow->timeout - (u32)jiffies) <= 0;
+}
+
+static inline bool nf_flow_is_dying(const struct flow_offload *flow)
+{
+	return flow->flags & FLOW_OFFLOAD_DYING;
+}
+
+void nf_flow_offload_work_gc(struct work_struct *work)
+{
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct nf_flowtable *flow_table;
+	struct rhashtable_iter hti;
+	struct flow_offload *flow;
+	int err, counter = 0;
+
+	flow_table = container_of(work, struct nf_flowtable, gc_work.work);
+
+	rhashtable_walk_init(&flow_table->rhashtable, &hti, GFP_KERNEL);
+	err = rhashtable_walk_start(&hti);
+	if (err && err != -EAGAIN)
+		goto out;
+
+	while ((tuplehash = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(tuplehash)) {
+			err = PTR_ERR(tuplehash);
+			if (err != -EAGAIN)
+				goto out;
+
+			continue;
+		}
+		if (tuplehash->tuple.dir)
+			continue;
+
+		flow = container_of(tuplehash, struct flow_offload, tuplehash[0]);
+
+		if (nf_flow_has_expired(flow) ||
+		    nf_flow_is_dying(flow)) {
+			flow_offload_del(flow_table, flow);
+			nf_flow_release_ct(flow);
+		}
+		counter++;
+	}
+
+out:
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
+	queue_delayed_work(system_power_efficient_wq, &flow_table->gc_work, HZ);
+}
+EXPORT_SYMBOL_GPL(nf_flow_offload_work_gc);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>");
-- 
2.11.0


* [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
                   ` (2 preceding siblings ...)
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 3/6] netfilter: add generic flow table infrastructure Pablo Neira Ayuso
@ 2017-12-07 12:44 ` Pablo Neira Ayuso
  2017-12-08 10:04   ` Florian Westphal
  2017-12-07 12:45 ` [PATCH nf-next RFC,v2 5/6] netfilter: nf_tables: flow offload expression Pablo Neira Ayuso
  2017-12-07 12:45 ` [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload Pablo Neira Ayuso
  5 siblings, 1 reply; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:44 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

This patch adds the IPv4 flow table type, which implements the datapath
flow table to forward IPv4 traffic. The rationale is:

1) Look up the packet in the flow table from the ingress hook.
2) If there's a hit, decrement the TTL and pass the packet on to the
   neighbour layer for transmission.
3) If there's a miss, the packet is passed up to the classic forwarding
   path.
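
The three steps above can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the code in this patch: the
flow table lookup is reduced to a flow_hit argument, the header is a raw
byte buffer, and all demo_* names are hypothetical. The TTL <= 1 check is
also an assumption of the sketch. It shows the TTL decrement together
with an incremental header checksum update in the style of RFC 1624.

```c
#include <stddef.h>
#include <stdint.h>

enum demo_verdict { DEMO_PASS_UP, DEMO_FORWARD };

/* Fold the carries of a 32-bit accumulator into 16 bits. */
static uint16_t demo_csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's complement checksum over len bytes (len must be even). */
static uint16_t demo_ip_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
	return (uint16_t)~demo_csum_fold(sum);
}

/*
 * hdr points to an IPv4 header: TTL at byte 8, protocol at byte 9,
 * checksum at bytes 10-11. flow_hit models the flow table lookup result.
 */
static enum demo_verdict demo_fast_path(uint8_t *hdr, int flow_hit)
{
	uint16_t old_word, new_word, check;
	uint32_t sum;

	/* Miss, or TTL about to expire (assumption): use the classic path. */
	if (!flow_hit || hdr[8] <= 1)
		return DEMO_PASS_UP;

	old_word = (uint16_t)(hdr[8] << 8 | hdr[9]);
	hdr[8]--;
	new_word = (uint16_t)(hdr[8] << 8 | hdr[9]);

	/* RFC 1624: HC' = ~(~HC + ~m + m') */
	check = (uint16_t)(hdr[10] << 8 | hdr[11]);
	sum = (uint16_t)~check;
	sum += (uint16_t)~old_word;
	sum += new_word;
	check = (uint16_t)~demo_csum_fold(sum);
	hdr[10] = (uint8_t)(check >> 8);
	hdr[11] = (uint8_t)(check & 0xff);

	return DEMO_FORWARD;
}
```

A header with a valid checksum still verifies after demo_fast_path()
returns DEMO_FORWARD, that is, demo_ip_csum() over the whole header
yields 0.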

This patch also supports layer 3 source and destination NAT.
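
As a rough model of the address rewriting, the sketch below mirrors the
direction handling of nf_flow_snat_ip() and nf_flow_dnat_ip() from this
patch. The demo_* types and DEMO_* macros are hypothetical userspace
stand-ins, addresses are plain host-order integers, and the checksum
updates are left out on purpose.

```c
#include <stdint.h>

enum demo_dir { DEMO_DIR_ORIGINAL, DEMO_DIR_REPLY };

#define DEMO_SNAT	0x1
#define DEMO_DNAT	0x2

/* Build a host-order IPv4 address from its four octets. */
#define DEMO_IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Reduced tuple: only the layer 3 addresses of one direction. */
struct demo_tuple {
	uint32_t src, dst;
};

struct demo_pkt {
	uint32_t saddr, daddr;
};

/* tuple[] is indexed by direction, like flow->tuplehash[] in the patch. */
static void demo_nat(const struct demo_tuple tuple[2], unsigned int flags,
		     enum demo_dir dir, struct demo_pkt *pkt)
{
	if (flags & DEMO_SNAT) {
		if (dir == DEMO_DIR_ORIGINAL)
			pkt->saddr = tuple[DEMO_DIR_REPLY].dst;
		else
			pkt->daddr = tuple[DEMO_DIR_ORIGINAL].src;
	}
	if (flags & DEMO_DNAT) {
		if (dir == DEMO_DIR_ORIGINAL)
			pkt->daddr = tuple[DEMO_DIR_REPLY].src;
		else
			pkt->saddr = tuple[DEMO_DIR_ORIGINAL].dst;
	}
}
```

Taking the conntrack entry from the cover letter as input (original
10.141.10.2 -> 147.75.205.195, reply 147.75.205.195 -> 192.168.2.195),
source NAT rewrites the source of an original-direction packet to
192.168.2.195 and the destination of a reply packet back to 10.141.10.2.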

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/Kconfig              |   8 +
 net/ipv4/netfilter/Makefile             |   3 +
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 316 ++++++++++++++++++++++++++++++++
 3 files changed, 327 insertions(+)
 create mode 100644 net/ipv4/netfilter/nf_flow_table_ipv4.c

diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index c11eb1744ab1..8b430c1744c4 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -177,6 +177,14 @@ config NF_NAT_H323
 
 endif # NF_NAT_IPV4
 
+config NF_FLOW_TABLE_IPV4
+	select NF_FLOW_TABLE
+	tristate "Netfilter flow table IPv4 module"
+	help
+	  This option adds the flow table IPv4 support.
+
+	  To compile it as a module, choose M here.
+
 config IP_NF_IPTABLES
 	tristate "IP tables support (required for filtering/masq/NAT)"
 	default m if NETFILTER_ADVANCED=n
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index f462fee66ac8..ae39e1c569a8 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -52,6 +52,9 @@ obj-$(CONFIG_IP_NF_NAT) += iptable_nat.o
 obj-$(CONFIG_IP_NF_RAW) += iptable_raw.o
 obj-$(CONFIG_IP_NF_SECURITY) += iptable_security.o
 
+# flow table support
+obj-$(CONFIG_NF_FLOW_TABLE_IPV4) += nf_flow_table_ipv4.o
+
 # matches
 obj-$(CONFIG_IP_NF_MATCH_AH) += ipt_ah.o
 obj-$(CONFIG_IP_NF_MATCH_RPFILTER) += ipt_rpfilter.o
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
new file mode 100644
index 000000000000..090a3fbcf211
--- /dev/null
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -0,0 +1,316 @@
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netfilter.h>
+#include <linux/rhashtable.h>
+#include <linux/ip.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/neighbour.h>
+#include <net/netfilter/nf_flow_table.h>
+#include <net/netfilter/nf_tables.h>
+/* For layer 4 checksum field offset. */
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+static int nf_flow_nat_tcp(struct sk_buff *skb, unsigned int thoff,
+			   __be32 addr, __be32 new_addr)
+{
+	struct tcphdr *tcph;
+
+	if (!pskb_may_pull(skb, thoff + sizeof(*tcph)) ||
+	    skb_try_make_writable(skb, thoff + sizeof(*tcph)))
+		return -1;
+
+	tcph = (void *)(skb_network_header(skb) + thoff);
+	inet_proto_csum_replace4(&tcph->check, skb, addr, new_addr, true);
+
+	return 0;
+}
+
+static int nf_flow_nat_udp(struct sk_buff *skb, unsigned int thoff,
+			   __be32 addr, __be32 new_addr)
+{
+	struct udphdr *udph;
+
+	if (!pskb_may_pull(skb, thoff + sizeof(*udph)) ||
+	    skb_try_make_writable(skb, thoff + sizeof(*udph)))
+		return -1;
+
+	udph = (void *)(skb_network_header(skb) + thoff);
+	if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) {
+		inet_proto_csum_replace4(&udph->check, skb, addr,
+					 new_addr, true);
+		if (!udph->check)
+			udph->check = CSUM_MANGLED_0;
+	}
+
+	return 0;
+}
+
+static int nf_flow_nat_l4proto(struct sk_buff *skb, struct iphdr *iph,
+			       unsigned int thoff, __be32 addr, __be32 new_addr)
+{
+	csum_replace4(&iph->check, addr, new_addr);
+
+	switch (iph->protocol) {
+	case IPPROTO_TCP:
+		if (nf_flow_nat_tcp(skb, thoff, addr, new_addr) < 0)
+			return -1;
+		break;
+	case IPPROTO_UDP:
+		if (nf_flow_nat_udp(skb, thoff, addr, new_addr) < 0)
+			return -1;
+		break;
+	}
+
+	return 0;
+}
+
+static int nf_flow_snat_ip(const struct flow_offload *flow, struct sk_buff *skb,
+			   struct iphdr *iph, unsigned int thoff,
+			   enum flow_offload_tuple_dir dir)
+{
+	__be32 addr, new_addr;
+
+	switch (dir) {
+	case FLOW_OFFLOAD_DIR_ORIGINAL:
+		addr = iph->saddr;
+		new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4.s_addr;
+		iph->saddr = new_addr;
+		break;
+	case FLOW_OFFLOAD_DIR_REPLY:
+		addr = iph->daddr;
+		new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v4.s_addr;
+		iph->daddr = new_addr;
+		break;
+	default:
+		return -1;
+	}
+
+	return nf_flow_nat_l4proto(skb, iph, thoff, addr, new_addr);
+}
+
+static int nf_flow_dnat_ip(const struct flow_offload *flow, struct sk_buff *skb,
+			   struct iphdr *iph, unsigned int thoff,
+			   enum flow_offload_tuple_dir dir)
+{
+	__be32 addr, new_addr;
+
+	switch (dir) {
+	case FLOW_OFFLOAD_DIR_ORIGINAL:
+		addr = iph->daddr;
+		new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.src_v4.s_addr;
+		iph->daddr = new_addr;
+		break;
+	case FLOW_OFFLOAD_DIR_REPLY:
+		addr = iph->saddr;
+		new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.dst_v4.s_addr;
+		iph->saddr = new_addr;
+		break;
+	default:
+		return -1;
+	}
+
+	return nf_flow_nat_l4proto(skb, iph, thoff, addr, new_addr);
+}
+
+static int nf_flow_nat_ip(const struct flow_offload *flow, struct sk_buff *skb,
+			  enum flow_offload_tuple_dir dir)
+{
+	unsigned int thoff;
+	struct iphdr *iph;
+
+	if (skb_try_make_writable(skb, sizeof(*iph)))
+		return -1;
+
+	iph = ip_hdr(skb);
+	thoff = iph->ihl * 4;
+
+	if (flow->flags & FLOW_OFFLOAD_SNAT &&
+	    nf_flow_snat_ip(flow, skb, iph, thoff, dir) < 0)
+		return -1;
+	if (flow->flags & FLOW_OFFLOAD_DNAT &&
+	    nf_flow_dnat_ip(flow, skb, iph, thoff, dir) < 0)
+		return -1;
+
+	return 0;
+}
+
+/* Similar to rt_nexthop(). */
+static inline void
+nf_flow_nexthop(const struct flow_offload *flow,
+		union nf_inet_addr *nexthop, enum flow_offload_tuple_dir dir)
+{
+	if (flow->tuplehash[dir].tuple.gateway) {
+		nexthop->ip = flow->tuplehash[dir].tuple.gateway;
+		return;
+	}
+
+	nexthop->ip = flow->tuplehash[!dir].tuple.src_v4.s_addr;
+}
+
+struct flow_ports {
+	__be16 src, dst;
+};
+
+static bool ip_has_options(unsigned int thoff)
+{
+	return thoff > sizeof(struct iphdr);
+}
+
+static int nf_flow_tuple_ip(struct sk_buff *skb,
+			    struct flow_offload_tuple *tuple)
+{
+	struct flow_ports *ports;
+	unsigned int thoff;
+	struct iphdr *iph;
+
+	if (!pskb_may_pull(skb, sizeof(*iph)))
+		return -1;
+
+	iph = ip_hdr(skb);
+	thoff = iph->ihl * 4;
+
+	if (ip_is_fragment(iph) ||
+	    unlikely(ip_has_options(thoff)))
+		return -1;
+
+	if (iph->protocol != IPPROTO_TCP &&
+	    iph->protocol != IPPROTO_UDP)
+		return -1;
+
+	thoff = iph->ihl * 4;
+	if (!pskb_may_pull(skb, thoff + sizeof(*ports)))
+		return -1;
+
+	ports = (struct flow_ports *)(skb_network_header(skb) + thoff);
+
+	tuple->src_v4.s_addr	= iph->saddr;
+	tuple->dst_v4.s_addr	= iph->daddr;
+	tuple->src_port		= ports->src;
+	tuple->dst_port		= ports->dst;
+	tuple->l3proto		= AF_INET;
+	tuple->l4proto		= iph->protocol;
+
+	return 0;
+}
+
+#define NF_FLOW_TIMEOUT	(30 * HZ)
+
+static unsigned int
+nf_flow_offload_hook(void *priv, struct sk_buff *skb,
+		     const struct nf_hook_state *state)
+{
+	struct flow_offload_tuple_rhash *tuplehash;
+	struct nf_flowtable *flow_table = priv;
+	struct flow_offload_tuple tuple = {};
+	union nf_inet_addr nexthop;
+	struct flow_offload *flow;
+	struct net_device *outdev;
+	struct iphdr *iph;
+
+	if (nf_flow_tuple_ip(skb, &tuple) < 0)
+		return NF_ACCEPT;
+
+	tuplehash = flow_offload_lookup(flow_table, &tuple);
+	if (tuplehash == NULL)
+		return NF_ACCEPT;
+
+	outdev = dev_get_by_index_rcu(&init_net, tuplehash->tuple.oifidx);
+	if (!outdev)
+		return NF_ACCEPT;
+
+	flow = container_of(tuplehash, struct flow_offload,
+			    tuplehash[tuplehash->tuple.dir]);
+
+	flow->timeout = (u32)jiffies + NF_FLOW_TIMEOUT;
+
+	if (flow->flags & (FLOW_OFFLOAD_SNAT | FLOW_OFFLOAD_DNAT) &&
+	    nf_flow_nat_ip(flow, skb, tuplehash->tuple.dir) < 0)
+		return NF_DROP;
+
+	iph = ip_hdr(skb);
+	ip_decrease_ttl(iph);
+
+	skb->dev = outdev;
+	nf_flow_nexthop(flow, &nexthop, tuplehash->tuple.dir);
+
+	neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
+
+	return NF_STOLEN;
+}
+
+static u32 flow_offload_hash(const void *data, u32 len, u32 seed)
+{
+	const struct flow_offload_tuple *tuple = data;
+
+	return jhash(tuple, offsetof(struct flow_offload_tuple, l4proto), seed);
+}
+
+static u32 flow_offload_hash_obj(const void *data, u32 len, u32 seed)
+{
+	const struct flow_offload_tuple_rhash *tuplehash = data;
+
+	return jhash(&tuplehash->tuple, offsetof(struct flow_offload_tuple, l4proto), seed);
+}
+
+static int flow_offload_hash_cmp(struct rhashtable_compare_arg *arg,
+					const void *ptr)
+{
+	const struct flow_offload_tuple *tuple = arg->key;
+	const struct flow_offload_tuple_rhash *x = ptr;
+
+	if (memcmp(&x->tuple, tuple, offsetof(struct flow_offload_tuple, l4proto)))
+		return 1;
+
+	return 0;
+}
+
+static const struct rhashtable_params flow_offload_rhash_params = {
+	.head_offset		= offsetof(struct flow_offload_tuple_rhash, node),
+	.hashfn			= flow_offload_hash,
+	.obj_hashfn		= flow_offload_hash_obj,
+	.obj_cmpfn		= flow_offload_hash_cmp,
+	.automatic_shrinking	= true,
+};
+
+static int nf_flow_table_ipv4_init(struct nf_flowtable *flow_table)
+{
+	INIT_DEFERRABLE_WORK(&flow_table->gc_work, nf_flow_offload_work_gc);
+	queue_delayed_work(system_power_efficient_wq, &flow_table->gc_work, HZ);
+	return 0;
+}
+
+static void nf_flow_table_ipv4_destroy(struct nf_flowtable *flow_table)
+{
+	cancel_delayed_work_sync(&flow_table->gc_work);
+}
+
+static struct nf_flowtable_type flowtable_ipv4 = {
+	.family		= NFPROTO_IPV4,
+	.init		= nf_flow_table_ipv4_init,
+	.destroy	= nf_flow_table_ipv4_destroy,
+	.params		= &flow_offload_rhash_params,
+	.hook		= nf_flow_offload_hook,
+	.owner		= THIS_MODULE,
+};
+
+static int __init nf_flow_ipv4_module_init(void)
+{
+	nft_register_flowtable_type(&flowtable_ipv4);
+
+	return 0;
+}
+
+static void __exit nf_flow_ipv4_module_exit(void)
+{
+	nft_unregister_flowtable_type(&flowtable_ipv4);
+}
+
+module_init(nf_flow_ipv4_module_init);
+module_exit(nf_flow_ipv4_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>");
+MODULE_ALIAS_NFT_FLOWTABLE(AF_INET);
-- 
2.11.0


* [PATCH nf-next RFC,v2 5/6] netfilter: nf_tables: flow offload expression
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
                   ` (3 preceding siblings ...)
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4 Pablo Neira Ayuso
@ 2017-12-07 12:45 ` Pablo Neira Ayuso
  2017-12-07 12:45 ` [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload Pablo Neira Ayuso
  5 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:45 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

Add a new instruction for the nf_tables VM that allows us to specify
which flows are offloaded into a given flow table, referenced by name.
This new instruction creates the flow entry and adds it to the flow table.

Only established flows, i.e. flows for which we have seen traffic in both
directions, are added to the flow table. You can still decide to offload
entries at a later stage via packet counting or by checking the ct status
in case you want to offload assured conntracks.

This has an explicit dependency on the conntrack subsystem.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/uapi/linux/netfilter/nf_tables.h |  11 ++
 net/netfilter/Kconfig                    |   7 +
 net/netfilter/Makefile                   |   1 +
 net/netfilter/nft_flow_offload.c         | 274 +++++++++++++++++++++++++++++++
 4 files changed, 293 insertions(+)
 create mode 100644 net/netfilter/nft_flow_offload.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 9ba0f4c13de6..528d832fefb4 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -954,6 +954,17 @@ enum nft_ct_attributes {
 };
 #define NFTA_CT_MAX		(__NFTA_CT_MAX - 1)
 
+/**
+ * enum nft_offload_attributes - ct offload expression attributes
+ * @NFTA_FLOW_TABLE_NAME: flow table name (NLA_STRING)
+ */
+enum nft_offload_attributes {
+	NFTA_FLOW_UNSPEC,
+	NFTA_FLOW_TABLE_NAME,
+	__NFTA_FLOW_MAX,
+};
+#define NFTA_FLOW_MAX		(__NFTA_FLOW_MAX - 1)
+
 enum nft_limit_type {
 	NFT_LIMIT_PKTS,
 	NFT_LIMIT_PKT_BYTES
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index af0f58322515..e998cc45cdcc 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -497,6 +497,13 @@ config NFT_CT
 	  This option adds the "ct" expression that you can use to match
 	  connection tracking information such as the flow state.
 
+config NFT_FLOW_OFFLOAD
+	depends on NF_CONNTRACK
+	tristate "Netfilter nf_tables hardware flow offload module"
+	help
+	  This option adds the "flow_offload" expression that you can use to
+	  choose what flows are placed into the hardware.
+
 config NFT_SET_RBTREE
 	tristate "Netfilter nf_tables rbtree set module"
 	help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 1f7d92bd571a..2c1b8de922f2 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_NFT_META)		+= nft_meta.o
 obj-$(CONFIG_NFT_RT)		+= nft_rt.o
 obj-$(CONFIG_NFT_NUMGEN)	+= nft_numgen.o
 obj-$(CONFIG_NFT_CT)		+= nft_ct.o
+obj-$(CONFIG_NFT_FLOW_OFFLOAD)	+= nft_flow_offload.o
 obj-$(CONFIG_NFT_LIMIT)		+= nft_limit.o
 obj-$(CONFIG_NFT_NAT)		+= nft_nat.o
 obj-$(CONFIG_NFT_OBJREF)	+= nft_objref.o
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
new file mode 100644
index 000000000000..f1d98a03175f
--- /dev/null
+++ b/net/netfilter/nft_flow_offload.c
@@ -0,0 +1,274 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/ip.h> /* for ipv4 options. */
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <linux/netfilter/nf_conntrack_common.h>
+#include <net/netfilter/nf_flow_table.h>
+
+struct nft_flow_offload {
+	struct nft_flowtable	*flowtable;
+};
+
+static int nft_flow_route(const struct nft_pktinfo *pkt,
+			  const struct nf_conn *ct, union nf_inet_addr *gw,
+			  enum ip_conntrack_dir dir)
+{
+	const struct dst_entry *this_dst = skb_dst(pkt->skb);
+	struct dst_entry *other_dst;
+	const struct nf_afinfo *ai;
+	struct flowi fl;
+
+	memset(&fl, 0, sizeof(fl));
+	switch (nft_pf(pkt)) {
+	case NFPROTO_IPV4:
+		fl.u.ip4.daddr = ct->tuplehash[!dir].tuple.dst.u3.ip;
+		break;
+	case NFPROTO_IPV6:
+		fl.u.ip6.daddr = ct->tuplehash[!dir].tuple.dst.u3.in6;
+		break;
+	}
+
+	ai = nf_get_afinfo(nft_pf(pkt));
+	if (ai) {
+		ai->route(nft_net(pkt), &other_dst, &fl, false);
+		if (!other_dst)
+			return -ENOENT;
+	}
+
+	switch (nft_pf(pkt)) {
+	case NFPROTO_IPV4: {
+		const struct rtable *other_rt =
+			(const struct rtable *)other_dst;
+		const struct rtable *this_rt =
+			(const struct rtable *)this_dst;
+
+		gw[dir].ip = this_rt->rt_gateway;
+		gw[!dir].ip = other_rt->rt_gateway;
+		break;
+		}
+	case NFPROTO_IPV6:
+		break;
+	default:
+		break;
+	}
+
+	dst_release(other_dst);
+
+	return 0;
+}
+
+static bool nft_flow_offload_skip(struct sk_buff *skb)
+{
+	struct ip_options *opt  = &(IPCB(skb)->opt);
+
+	if (unlikely(opt->optlen))
+		return true;
+	if (skb_sec_path(skb))
+		return true;
+
+	return false;
+}
+
+static void nft_flow_offload_eval(const struct nft_expr *expr,
+				  struct nft_regs *regs,
+				  const struct nft_pktinfo *pkt)
+{
+	struct nft_flow_offload *priv = nft_expr_priv(expr);
+	struct nf_flowtable *flowtable = &priv->flowtable->data;
+	struct net_device *outdev = pkt->xt.state->out;
+	struct net_device *indev = pkt->xt.state->in;
+	union nf_inet_addr gateway[IP_CT_DIR_MAX];
+	enum ip_conntrack_info ctinfo;
+	int iifindex[IP_CT_DIR_MAX];
+	struct flow_offload *flow;
+	enum ip_conntrack_dir dir;
+	struct nf_conn *ct;
+	int ret;
+
+	if (nft_flow_offload_skip(pkt->skb))
+		goto out;
+
+	ct = nf_ct_get(pkt->skb, &ctinfo);
+	if (!ct)
+		goto out;
+
+	switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+		break;
+	default:
+		goto out;
+	}
+
+	if (test_bit(IPS_HELPER_BIT, &ct->status))
+		goto out;
+
+	if (ctinfo == IP_CT_NEW ||
+	    ctinfo == IP_CT_RELATED)
+		goto out;
+
+	if (test_and_set_bit(IPS_OFFLOAD_BIT, &ct->status))
+		goto out;
+
+	dir = CTINFO2DIR(ctinfo);
+	if (nft_flow_route(pkt, ct, gateway, dir) < 0)
+		goto err1;
+
+	iifindex[dir]	= indev->ifindex;
+	iifindex[!dir]	= outdev->ifindex;
+
+	flow = flow_offload_alloc(ct, iifindex, gateway);
+	if (!flow)
+		goto err1;
+
+	ret = flow_offload_add(flowtable, flow);
+	if (ret < 0)
+		goto err2;
+
+	return;
+err2:
+	flow_offload_free(flow);
+err1:
+	clear_bit(IPS_OFFLOAD_BIT, &ct->status);
+out:
+	regs->verdict.code = NFT_BREAK;
+}
+
+static int nft_flow_offload_validate(const struct nft_ctx *ctx,
+				     const struct nft_expr *expr,
+				     const struct nft_data **data)
+{
+	unsigned int hook_mask = (1 << NF_INET_FORWARD);
+
+	return nft_chain_validate_hooks(ctx->chain, hook_mask);
+}
+
+static int nft_flow_offload_init(const struct nft_ctx *ctx,
+				 const struct nft_expr *expr,
+				 const struct nlattr * const tb[])
+{
+	struct nft_flow_offload *priv = nft_expr_priv(expr);
+	u8 genmask = nft_genmask_next(ctx->net);
+	struct nft_flowtable *flowtable;
+
+	if (!tb[NFTA_FLOW_TABLE_NAME])
+		return -EINVAL;
+
+	flowtable = nf_tables_flowtable_lookup(ctx->table,
+					       tb[NFTA_FLOW_TABLE_NAME],
+					       genmask);
+	if (IS_ERR(flowtable))
+		return PTR_ERR(flowtable);
+
+	priv->flowtable = flowtable;
+	flowtable->use++;
+
+	return nf_ct_netns_get(ctx->net, ctx->afi->family);
+}
+
+static void nft_flow_offload_destroy(const struct nft_ctx *ctx,
+				     const struct nft_expr *expr)
+{
+	struct nft_flow_offload *priv = nft_expr_priv(expr);
+
+	priv->flowtable->use--;
+	nf_ct_netns_put(ctx->net, ctx->afi->family);
+}
+
+static int nft_flow_offload_dump(struct sk_buff *skb, const struct nft_expr *expr)
+{
+	struct nft_flow_offload *priv = nft_expr_priv(expr);
+
+	if (nla_put_string(skb, NFTA_FLOW_TABLE_NAME, priv->flowtable->name))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -1;
+}
+
+struct nft_expr_type nft_flow_offload_type;
+static const struct nft_expr_ops nft_flow_offload_ops = {
+	.type		= &nft_flow_offload_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_flow_offload)),
+	.eval		= nft_flow_offload_eval,
+	.init		= nft_flow_offload_init,
+	.destroy	= nft_flow_offload_destroy,
+	.validate	= nft_flow_offload_validate,
+	.dump		= nft_flow_offload_dump,
+};
+
+struct nft_expr_type nft_flow_offload_type __read_mostly = {
+	.name		= "flow_offload",
+	.ops		= &nft_flow_offload_ops,
+	.maxattr	= NFTA_FLOW_MAX,
+	.owner		= THIS_MODULE,
+};
+
+static void flow_offload_iterate_cleanup(struct flow_offload *flow, void *data)
+{
+	struct net_device *dev = data;
+
+	if (dev && flow->tuplehash[0].tuple.iifidx != dev->ifindex)
+		return;
+
+	flow_offload_dead(flow);
+}
+
+static void nft_flow_offload_iterate_cleanup(struct nf_flowtable *flowtable,
+					     void *data)
+{
+	nf_flow_table_iterate(flowtable, flow_offload_iterate_cleanup, data);
+}
+
+static int flow_offload_netdev_event(struct notifier_block *this,
+				     unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	if (event != NETDEV_DOWN)
+		return NOTIFY_DONE;
+
+	nft_flow_table_iterate(dev_net(dev), nft_flow_offload_iterate_cleanup, dev);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block flow_offload_netdev_notifier = {
+	.notifier_call	= flow_offload_netdev_event,
+};
+
+static int __init nft_flow_offload_module_init(void)
+{
+	register_netdevice_notifier(&flow_offload_netdev_notifier);
+
+	return nft_register_expr(&nft_flow_offload_type);
+}
+
+static void __exit nft_flow_offload_module_exit(void)
+{
+	struct net *net;
+
+	nft_unregister_expr(&nft_flow_offload_type);
+	unregister_netdevice_notifier(&flow_offload_netdev_notifier);
+	rtnl_lock();
+	for_each_net(net)
+		nft_flow_table_iterate(net, nft_flow_offload_iterate_cleanup, NULL);
+	rtnl_unlock();
+}
+
+module_init(nft_flow_offload_module_init);
+module_exit(nft_flow_offload_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>");
+MODULE_ALIAS_NFT_EXPR("flow_offload");
-- 
2.11.0



* [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload
  2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
                   ` (4 preceding siblings ...)
  2017-12-07 12:45 ` [PATCH nf-next RFC,v2 5/6] netfilter: nf_tables: flow offload expression Pablo Neira Ayuso
@ 2017-12-07 12:45 ` Pablo Neira Ayuso
  2017-12-08 10:18   ` Florian Westphal
  5 siblings, 1 reply; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-07 12:45 UTC (permalink / raw)
  To: netfilter-devel
  Cc: netdev, f.fainelli, simon.horman, ronye, jiri, nbd, john, kubakici, fw

This patch adds the infrastructure to offload flows to hardware, in case
the NIC/switch comes with built-in flow table capabilities.

If the hardware comes with no hardware flow tables, or they have
limitations in terms of features, this falls back to the generic
software flow table implementation.

The software flow table garbage collector skips entries that reside in
the hardware, so the hardware is responsible for releasing these flow
table entries too, via flow_offload_dead(). In the next garbage
collector run, this removes the entries from both the software and the
hardware flow table, from user context.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
@Florian: I still owe you one here. You mentioned immediate scheduling
of the workqueue thread, and I need to revisit this: the quick patch I made
is hitting splats when calling queue_delayed_work() from the packet path.
This may be my fault though.

 include/linux/netdevice.h             |  9 ++++
 include/net/netfilter/nf_flow_table.h |  1 +
 net/netfilter/nf_flow_table.c         | 26 ++++++++++++
 net/netfilter/nft_flow_offload.c      | 79 +++++++++++++++++++++++++++++++++++
 4 files changed, 115 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..5f2919775632 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,13 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct flow_offload;
+
+enum flow_offload_type {
+	FLOW_OFFLOAD_ADD	= 0,
+	FLOW_OFFLOAD_DEL,
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1281,6 +1288,8 @@ struct net_device_ops {
 	int			(*ndo_bridge_dellink)(struct net_device *dev,
 						      struct nlmsghdr *nlh,
 						      u16 flags);
+	int			(*ndo_flow_offload)(enum flow_offload_type type,
+						    struct flow_offload *flow);
 	int			(*ndo_change_carrier)(struct net_device *dev,
 						      bool new_carrier);
 	int			(*ndo_get_phys_port_id)(struct net_device *dev,
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 1a2598b4a58f..317049d5ff25 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -67,6 +67,7 @@ struct flow_offload_tuple_rhash {
 #define	FLOW_OFFLOAD_SNAT	0x1
 #define	FLOW_OFFLOAD_DNAT	0x2
 #define	FLOW_OFFLOAD_DYING	0x4
+#define	FLOW_OFFLOAD_HW		0x8
 
 struct flow_offload {
 	struct flow_offload_tuple_rhash		tuplehash[FLOW_OFFLOAD_DIR_MAX];
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index ff27dad268c3..c578c3aec0e0 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c
@@ -212,6 +212,21 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
 }
 EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
 
+static void flow_offload_hw_del(struct flow_offload *flow)
+{
+	struct net_device *indev;
+	int ret, ifindex;
+
+	rtnl_lock();
+	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
+	indev = __dev_get_by_index(&init_net, ifindex);
+	if (WARN_ON(!indev))
+		return;
+
+	ret = indev->netdev_ops->ndo_flow_offload(FLOW_OFFLOAD_DEL, flow);
+	rtnl_unlock();
+}
+
 static inline bool nf_flow_has_expired(const struct flow_offload *flow)
 {
 	return (__s32)(flow->timeout - (u32)jiffies) <= 0;
@@ -222,6 +237,11 @@ static inline bool nf_flow_is_dying(const struct flow_offload *flow)
 	return flow->flags & FLOW_OFFLOAD_DYING;
 }
 
+static inline bool nf_flow_in_hw(const struct flow_offload *flow)
+{
+	return flow->flags & FLOW_OFFLOAD_HW;
+}
+
 void nf_flow_offload_work_gc(struct work_struct *work)
 {
 	struct flow_offload_tuple_rhash *tuplehash;
@@ -250,10 +270,16 @@ void nf_flow_offload_work_gc(struct work_struct *work)
 
 		flow = container_of(tuplehash, struct flow_offload, tuplehash[0]);
 
+		if (nf_flow_in_hw(flow) &&
+		    !nf_flow_is_dying(flow))
+			continue;
+
 		if (nf_flow_has_expired(flow) ||
 		    nf_flow_is_dying(flow)) {
 			flow_offload_del(flow_table, flow);
 			nf_flow_release_ct(flow);
+			if (nf_flow_in_hw(flow))
+				flow_offload_hw_del(flow);
 		}
 		counter++;
 	}
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index f1d98a03175f..c4ee1df25a16 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -13,6 +13,64 @@
 #include <linux/netfilter/nf_conntrack_common.h>
 #include <net/netfilter/nf_flow_table.h>
 
+static LIST_HEAD(flow_hw_offload_pending_list);
+static DEFINE_SPINLOCK(flow_hw_offload_lock);
+
+struct flow_hw_offload {
+	struct list_head	list;
+	struct flow_offload	*flow;
+	struct nf_conn		*ct;
+};
+
+static int do_flow_offload(struct flow_offload *flow)
+{
+	struct net_device *indev;
+	int ret, ifindex;
+
+	rtnl_lock();
+	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
+	indev = __dev_get_by_index(&init_net, ifindex);
+	if (WARN_ON(!indev))
+		return 0;
+
+	ret = indev->netdev_ops->ndo_flow_offload(FLOW_OFFLOAD_ADD, flow);
+	if (ret >= 0)
+		flow->flags |= FLOW_OFFLOAD_HW;
+	rtnl_unlock();
+
+	return ret;
+}
+
+/* Schedule worker every 100 ms. */
+#define FLOW_HW_WORK_TIMEOUT	msecs_to_jiffies(100)
+
+static struct delayed_work nft_flow_offload_dwork;
+
+static void flow_offload_work(struct work_struct *work)
+{
+	struct flow_hw_offload *offload, *next;
+	LIST_HEAD(hw_offload_pending);
+
+	spin_lock_bh(&flow_hw_offload_lock);
+	if (!list_empty(&flow_hw_offload_pending_list))
+		list_move_tail(&flow_hw_offload_pending_list, &hw_offload_pending);
+	spin_unlock_bh(&flow_hw_offload_lock);
+
+	list_for_each_entry_safe(offload, next, &hw_offload_pending, list) {
+		if (nf_ct_is_dying(offload->ct))
+			goto next;
+
+		do_flow_offload(offload->flow);
+next:
+		nf_conntrack_put(&offload->ct->ct_general);
+		list_del(&offload->list);
+		kfree(offload);
+	}
+
+	queue_delayed_work(system_power_efficient_wq, &nft_flow_offload_dwork,
+			   FLOW_HW_WORK_TIMEOUT);
+}
+
 struct nft_flow_offload {
 	struct nft_flowtable	*flowtable;
 };
@@ -86,6 +144,7 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
 	struct net_device *outdev = pkt->xt.state->out;
 	struct net_device *indev = pkt->xt.state->in;
 	union nf_inet_addr gateway[IP_CT_DIR_MAX];
+	struct flow_hw_offload *offload;
 	enum ip_conntrack_info ctinfo;
 	int iifindex[IP_CT_DIR_MAX];
 	struct flow_offload *flow;
@@ -133,6 +192,21 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
 	if (ret < 0)
 		goto err2;
 
+	if (!indev->netdev_ops->ndo_flow_offload)
+		return;
+
+	offload = kmalloc(sizeof(struct flow_hw_offload), GFP_ATOMIC);
+	if (!offload)
+		return;
+
+	nf_conntrack_get(&ct->ct_general);
+	offload->ct = ct;
+	offload->flow = flow;
+
+	spin_lock_bh(&flow_hw_offload_lock);
+	list_add_tail(&offload->list, &flow_hw_offload_pending_list);
+	spin_unlock_bh(&flow_hw_offload_lock);
+
 	return;
 err2:
 	flow_offload_free(flow);
@@ -251,6 +325,10 @@ static int __init nft_flow_offload_module_init(void)
 {
 	register_netdevice_notifier(&flow_offload_netdev_notifier);
 
+	INIT_DEFERRABLE_WORK(&nft_flow_offload_dwork, flow_offload_work);
+	queue_delayed_work(system_power_efficient_wq, &nft_flow_offload_dwork,
+			   FLOW_HW_WORK_TIMEOUT);
+
 	return nft_register_expr(&nft_flow_offload_type);
 }
 
@@ -259,6 +337,7 @@ static void __exit nft_flow_offload_module_exit(void)
 	struct net *net;
 
 	nft_unregister_expr(&nft_flow_offload_type);
+	cancel_delayed_work_sync(&nft_flow_offload_dwork);
 	unregister_netdevice_notifier(&flow_offload_netdev_notifier);
 	rtnl_lock();
 	for_each_net(net)
-- 
2.11.0



* Re: [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit Pablo Neira Ayuso
@ 2017-12-08  6:47   ` Florian Westphal
  2017-12-08 21:00     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2017-12-08  6:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici, fw

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
> index dc947e59d03a..6b463b88182d 100644
> --- a/include/uapi/linux/netfilter/nf_conntrack_common.h
> +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
> @@ -100,6 +100,10 @@ enum ip_conntrack_status {
>  	IPS_HELPER_BIT = 13,
>  	IPS_HELPER = (1 << IPS_HELPER_BIT),
>  
> +	/* Conntrack has been offloaded to flow table. */
> +	IPS_OFFLOAD_BIT = 14,
> +	IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT),
> +
>  	/* Be careful here, modifying these bits can make things messy,
>  	 * so don't let users modify them directly.
>  	 */

I think this new bit has to be added to the UNCHANGEABLE mask below.

> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 01130392b7c0..02e195accd47 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net,
>  	hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
>  		tmp = nf_ct_tuplehash_to_ctrack(h);
>  
> +		if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
> +			continue;
> +

nit: I would move this below the ASSURED bit check. AFAIU most
(all?) offloaded conntracks are not in ASSURED state since they never
see two-way communication, but in case we have mixed flows or no
offloading in place, the ASSURED check already takes care of skipping
early drop.

> +/* Set an arbitrary timeout large enough not to ever expire, this save
> + * us a check for the IPS_OFFLOAD_BIT from the packet path via
> + * nf_ct_is_expired().
> + */
> +static void nf_ct_offload_timeout(struct nf_conn *ct)
> +{
> +       ct->timeout = nfct_time_stamp + DAY;
> +}

Not sure if it's worth adding a test to avoid the unconditional write,
e.g. something like

> +static void nf_ct_offload_timeout(struct nf_conn *ct)
> +{
	if (nf_ct_expires(ct) < DAY/2)
> +       ct->timeout = nfct_time_stamp + DAY;

but perhaps not worth it, gc_worker is infrequent.


* Re: [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4
  2017-12-07 12:44 ` [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4 Pablo Neira Ayuso
@ 2017-12-08 10:04   ` Florian Westphal
  2017-12-08 21:14     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2017-12-08 10:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici, fw

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> This patch adds the IPv4 flow table type, that implements the datapath
> flow table to forward IPv4 traffic. Rationale is:
> 
> 1) Look up for the packet in the flow table, from the ingress hook.
> 2) If there's a hit, decrement ttl and pass it on to the neighbour layer
>    for transmission.
> 3) If there's a miss, packet is passed up to the classic forwarding
>    path.
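
Step 2 in the list above is cheap because decrementing the TTL does not
require re-summing the IPv4 header: the checksum can be patched
incrementally. The sketch below shows the idea in user space. It assumes
a 20-byte header without options, stored as ten 16-bit words where each
word holds two header bytes as (first << 8 | second); the function names
are made up for illustration and this is not the kernel's
ip_decrease_ttl() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_HDR_WORDS	10	/* 20-byte header, no options */

/* Full one's-complement checksum over the header words. */
static uint16_t ipv4_hdr_checksum(const uint16_t *hdr, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += hdr[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/*
 * Decrement the TTL and patch the checksum without re-summing.
 * Word 4 is (ttl << 8 | protocol), word 5 is the checksum. Taking
 * 0x0100 away from the data means adding 0x0100 to the checksum,
 * with an end-around carry.
 */
static void ipv4_decrease_ttl(uint16_t *hdr)
{
	uint32_t check = hdr[5];

	assert((hdr[4] >> 8) > 0);	/* TTL must still be non-zero */

	check += 0x0100;
	hdr[5] = (uint16_t)(check + (check >= 0xffff));
	hdr[4] -= 0x0100;
}
```

A header whose checksum is valid sums to zero under
ipv4_hdr_checksum(), so the patched header can be checked by running the
full sum over it once more.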

Is there a plan to also handle zone IDs in future?

It's going to be messy for sure since we'd need to tell HW how to do
the zone mapping.  Perhaps only support a builtin list, e.g.
vlan id == zone...?

Don't yet see how it could be done in a generic way as the mappings can
be arbitrarily complex.

Right now afaics one could install one flow table per zone and map
this in nft, but then we still miss the part that tells the hardware
how the zone identifier was derived.

> +static bool ip_has_options(unsigned int thoff)
> +{
> +	return thoff > sizeof(struct iphdr);

I'd use
	thoff != sizeof(...)

to catch the case where ihl * 4 is smaller than sizeof(struct iphdr).
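
To make the difference concrete, here is a small user-space comparison
of the two tests. It assumes the minimal IPv4 header is 20 bytes, which
is what sizeof(struct iphdr) evaluates to in the kernel; the helper
names are hypothetical and only exist for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define IPV4_MIN_HDR_LEN	20u	/* sizeof(struct iphdr) in the kernel */

/* Header length in bytes from the 4-bit IHL field. */
static unsigned int thoff_from_ihl(unsigned int ihl)
{
	assert(ihl <= 15);	/* IHL is a 4-bit field */

	return ihl * 4;
}

/* Test as posted: only flags headers longer than 20 bytes. */
static bool ip_has_options_gt(unsigned int thoff)
{
	return thoff > IPV4_MIN_HDR_LEN;
}

/* Suggested test: flags anything that is not exactly 20 bytes. */
static bool ip_has_options_ne(unsigned int thoff)
{
	return thoff != IPV4_MIN_HDR_LEN;
}
```

For ihl = 5 both tests return false and for ihl = 6 both return true,
but a malformed header with ihl = 4 passes the first test and is only
rejected by the second.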

> +nf_flow_offload_hook(void *priv, struct sk_buff *skb,
> +		     const struct nf_hook_state *state)
> +{
> +	struct flow_offload_tuple_rhash *tuplehash;
> +	struct nf_flowtable *flow_table = priv;
> +	struct flow_offload_tuple tuple = {};
> +	union nf_inet_addr nexthop;
> +	struct flow_offload *flow;
> +	struct net_device *outdev;
> +	struct iphdr *iph;
> +
> +	if (nf_flow_tuple_ip(skb, &tuple) < 0)
> +		return NF_ACCEPT;
> +
> +	tuplehash = flow_offload_lookup(flow_table, &tuple);
> +	if (tuplehash == NULL)
> +		return NF_ACCEPT;
> +
> +	outdev = dev_get_by_index_rcu(&init_net, tuplehash->tuple.oifidx);

state->net ?


* Re: [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload
  2017-12-07 12:45 ` [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload Pablo Neira Ayuso
@ 2017-12-08 10:18   ` Florian Westphal
  2017-12-08 21:16     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2017-12-08 10:18 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici, fw

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> The software flow table garbage collector skips entries that reside in
> the hardware, so the hardware will be responsible for releasing this
> flow table entry too via flow_offload_dead(). In the next garbage
> collector run, this removes the entries both in the software and
> hardware flow table from user context.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> @Florian: I still owe you one here, you mentioned immediate scheduling
> of the workqueue thread, and I need to revisit this, the quick patch I made
> is hitting splats when calling queue_delayed_work() from packet path, this
> may be my fault though.

OK. IIRC I had suggested to just use schedule_work() instead.
In most cases (assuming system is busy) the workqueue will already be
pending anyway.

> diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
> index ff27dad268c3..c578c3aec0e0 100644
> --- a/net/netfilter/nf_flow_table.c
> +++ b/net/netfilter/nf_flow_table.c
> @@ -212,6 +212,21 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
>  }
>  EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
>  
> +static void flow_offload_hw_del(struct flow_offload *flow)
> +{
> +	struct net_device *indev;
> +	int ret, ifindex;
> +
> +	rtnl_lock();
> +	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
> +	indev = __dev_get_by_index(&init_net, ifindex);

I think this should pass struct net * as arg to flow_offload_hw_del.

> +	if (WARN_ON(!indev))
> +		return;
> +
> +	ret = indev->netdev_ops->ndo_flow_offload(FLOW_OFFLOAD_DEL, flow);
> +	rtnl_unlock();
> +}

Please no rtnl lock unless absolutely needed.
Seems this could even avoid the mutex completely by using
dev_get_by_index + dev_put.
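
A minimal user-space sketch of the pattern suggested above: look the device up in the caller's namespace, hold a reference for the duration of the driver call, then drop it. Every sketch_* name is hypothetical and only models the reference counting; the real dev_get_by_index() and dev_put() live in the kernel and are not reproduced here. Note also that the early return in the quoted flow_offload_hw_del() leaves rtnl held, which the reference-based variant avoids by construction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct net_device. */
struct sketch_dev {
	int ifindex;
	int refcnt;
};

/* Hypothetical stand-in for the per-namespace device list. */
struct sketch_net {
	struct sketch_dev *devs;
	size_t ndevs;
};

/* Look up a device by ifindex and take a reference; NULL if absent. */
static struct sketch_dev *sketch_dev_get_by_index(struct sketch_net *net,
						  int ifindex)
{
	size_t i;

	assert(net != NULL);
	for (i = 0; i < net->ndevs; i++) {
		if (net->devs[i].ifindex == ifindex) {
			net->devs[i].refcnt++;
			return &net->devs[i];
		}
	}
	return NULL;
}

/* Drop a reference previously taken by sketch_dev_get_by_index(). */
static void sketch_dev_put(struct sketch_dev *dev)
{
	assert(dev != NULL && dev->refcnt > 0);
	dev->refcnt--;
}

/* Deletion path: namespace passed in by the caller, no global lock.
 * Returns 0 on success, -1 if the input device is gone.
 */
static int sketch_flow_offload_hw_del(struct sketch_net *net, int iifidx)
{
	struct sketch_dev *indev = sketch_dev_get_by_index(net, iifidx);

	if (!indev)
		return -1;

	/* The driver's delete callback would be invoked here. */

	sketch_dev_put(indev);
	return 0;
}
```

The reference only keeps the device object alive; it does not serialize additions against deletions, so any ordering requirement between the two still needs its own lock.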

> +static int do_flow_offload(struct flow_offload *flow)
> +{
> +	struct net_device *indev;
> +	int ret, ifindex;
> +
> +	rtnl_lock();
> +	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
> +	indev = __dev_get_by_index(&init_net, ifindex);

likewise.

> +#define FLOW_HW_WORK_TIMEOUT	msecs_to_jiffies(100)
> +
> +static struct delayed_work nft_flow_offload_dwork;

I would go with struct work and no delay at all.


* Re: [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit
  2017-12-08  6:47   ` Florian Westphal
@ 2017-12-08 21:00     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-08 21:00 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici

On Fri, Dec 08, 2017 at 07:47:02AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
> > index dc947e59d03a..6b463b88182d 100644
> > --- a/include/uapi/linux/netfilter/nf_conntrack_common.h
> > +++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
> > @@ -100,6 +100,10 @@ enum ip_conntrack_status {
> >  	IPS_HELPER_BIT = 13,
> >  	IPS_HELPER = (1 << IPS_HELPER_BIT),
> >  
> > +	/* Conntrack has been offloaded to flow table. */
> > +	IPS_OFFLOAD_BIT = 14,
> > +	IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT),
> > +
> >  	/* Be careful here, modifying these bits can make things messy,
> >  	 * so don't let users modify them directly.
> >  	 */
> 
> I think this new bit has to be added to the UNCHANGEABLE mask below.

Right.
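
A rough user-space sketch of what adding the bit to the unchangeable mask buys: a status update that tries to flip a protected bit is refused. The bit numbers follow enum ip_conntrack_status as quoted above; the shortened mask and the sketch_change_status() helper are assumptions made for illustration, not the kernel's actual mask or update function.

```c
#include <assert.h>
#include <stddef.h>

/* Bit numbers as in enum ip_conntrack_status; only the ones used here. */
enum {
	IPS_EXPECTED_BIT  = 0,
	IPS_ASSURED_BIT   = 2,
	IPS_CONFIRMED_BIT = 3,
	IPS_DYING_BIT     = 9,
	IPS_OFFLOAD_BIT   = 14,
};

#define SKETCH_BIT(n)	(1UL << (n))

/* Hypothetical, shortened mask: the real one covers more bits. */
#define SKETCH_UNCHANGEABLE_MASK		\
	(SKETCH_BIT(IPS_EXPECTED_BIT) |		\
	 SKETCH_BIT(IPS_CONFIRMED_BIT) |	\
	 SKETCH_BIT(IPS_DYING_BIT) |		\
	 SKETCH_BIT(IPS_OFFLOAD_BIT))

/* Apply a requested status word unless it flips a protected bit.
 * Returns 0 on success, -1 if the request is refused.
 */
static int sketch_change_status(unsigned long *status, unsigned long requested)
{
	unsigned long changed;

	assert(status != NULL);
	changed = *status ^ requested;
	if (changed & SKETCH_UNCHANGEABLE_MASK)
		return -1;

	*status = requested;
	return 0;
}
```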

> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 01130392b7c0..02e195accd47 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net,
> >  	hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
> >  		tmp = nf_ct_tuplehash_to_ctrack(h);
> >  
> > +		if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
> > +			continue;
> > +
> 
> nit: I would move this below the ASSURED bit check, AFAIU most
> (all?) offloaded conntracks are not in ASSURED state since they never
> see two-way communication but in case we've mixed flows or no offloading
> in place then the ASSURED check takes care of skipping earlydrop
> already.

Offload happens once we enter the established state (or later, if your
rule postpones it to a later stage), that is, before we observe the full
3-way handshake in TCP and hence before the assured bit is set. So I
think we still need this check here.

> > +/* Set an arbitrary timeout large enough not to ever expire, this save
> > + * us a check for the IPS_OFFLOAD_BIT from the packet path via
> > + * nf_ct_is_expired().
> > + */
> > +static void nf_ct_offload_timeout(struct nf_conn *ct)
> > +{
> > +       ct->timeout = nfct_time_stamp + DAY;
> > +}
> 
> Not sure if it's worth adding a test to avoid unconditional write,
> e.g. something like
> 
> > +static void nf_ct_offload_timeout(struct nf_conn *ct)
> > +{
> 	if (nf_ct_expires(ct) < DAY/2)
> > +       ct->timeout = nfct_time_stamp + DAY;
> 
> but perhaps not worth it, gc_worker is infrequent.

That's fine, I'll do this.
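
A small user-space sketch of the conditional refresh agreed on above, assuming a 32-bit tick counter that may wrap. SKETCH_HZ, SKETCH_DAY, struct sketch_conn and both helpers are hypothetical stand-ins; only the "rewrite the timeout when less than half a day remains" logic comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HZ	1000u			/* assumed tick rate */
#define SKETCH_DAY	(86400u * SKETCH_HZ)	/* one day in ticks */

/* Hypothetical stand-in for the timeout field of struct nf_conn. */
struct sketch_conn {
	uint32_t timeout;
};

/* Remaining ticks before expiry, 0 if already expired. The signed
 * difference keeps the comparison correct across counter wraparound.
 */
static uint32_t sketch_expires(const struct sketch_conn *ct, uint32_t now)
{
	int32_t left = (int32_t)(ct->timeout - now);

	return left > 0 ? (uint32_t)left : 0;
}

/* Push the timeout one day ahead only when less than half a day is left.
 * Returns 1 if the timeout was rewritten, 0 if the write was skipped.
 */
static int sketch_offload_timeout(struct sketch_conn *ct, uint32_t now)
{
	assert(ct != NULL);
	if (sketch_expires(ct, now) >= SKETCH_DAY / 2)
		return 0;

	ct->timeout = now + SKETCH_DAY;
	return 1;
}
```

With this shape the field is written at most about twice a day per entry, instead of on every garbage collector pass.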


* Re: [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4
  2017-12-08 10:04   ` Florian Westphal
@ 2017-12-08 21:14     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-08 21:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici

On Fri, Dec 08, 2017 at 11:04:13AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > This patch adds the IPv4 flow table type, that implements the datapath
> > flow table to forward IPv4 traffic. Rationale is:
> > 
> > 1) Look up for the packet in the flow table, from the ingress hook.
> > 2) If there's a hit, decrement ttl and pass it on to the neighbour layer
> >    for transmission.
> > 3) If there's a miss, packet is passed up to the classic forwarding
> >    path.
> 
> Is there a plan to also handle zone IDs in future?

Zone ID is meaningful to whoever applies the policy: in the offload
approach that this patchset implements, the policy resides in the kernel.

> It's going to be messy for sure since we'd need to tell HW how to do
> the zone mapping.  Perhaps only support a builtin list, e.g.
> vlan id == zone...?

I've been considering a simpler solution, i.e. adding the input device
ifindex to the flowtable hash lookup, as part of the flow tuple. All
examples I've seen for zones basically map network interfaces to
zones.
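
To make the idea concrete, a sketch of a lookup key that carries the input ifindex next to the usual 5-tuple, so that identical addresses and ports arriving on different interfaces resolve to different flows. The struct layout and sketch_tuple_equal() are hypothetical; the patchset's struct flow_offload_tuple is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup key: 5-tuple plus input interface index. */
struct sketch_flow_tuple {
	uint32_t src_v4;
	uint32_t dst_v4;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  l4proto;
	int      iifidx;	/* input device, acts as the zone selector */
};

/* Field-by-field comparison, so struct padding never affects the result. */
static bool sketch_tuple_equal(const struct sketch_flow_tuple *a,
			       const struct sketch_flow_tuple *b)
{
	assert(a != NULL && b != NULL);
	return a->src_v4 == b->src_v4 &&
	       a->dst_v4 == b->dst_v4 &&
	       a->src_port == b->src_port &&
	       a->dst_port == b->dst_port &&
	       a->l4proto == b->l4proto &&
	       a->iifidx == b->iifidx;
}
```

A hash function over the same key would have to mix in iifidx too, otherwise flows from different interfaces would still land in the same bucket chain.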

> Don't yet see how it could be done in a generic way as the mappings can
> be arbitrarily complex.
> 
> Right now afaics one could install one flow table per zone and map
> this in nft, but then we still miss the part that tells the hardware
> how the zone identifier was derived.
> 
> > +static bool ip_has_options(unsigned int thoff)
> > +{
> > +	return thoff > sizeof(struct iphdr);
> 
> I'd use
> 	thoff != sizeof(...)
> 
> to catch case where ihl is < struct iphdr.

ok.
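
For illustration, a user-space sketch of the stricter check: the transport header offset is derived from the IHL nibble, and anything other than the plain 20-byte header is rejected, which covers both options (ihl > 5) and a malformed short header (ihl < 5). SKETCH_IPHDR_LEN and the helper names are assumptions standing in for sizeof(struct iphdr) and the patch's own functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IPHDR_LEN 20u	/* size of an IPv4 header without options */

/* Transport header offset from the first header byte (version + IHL). */
static unsigned int sketch_thoff(uint8_t version_ihl)
{
	assert((version_ihl >> 4) == 4);	/* IPv4 only in this sketch */
	return (version_ihl & 0x0fu) * 4u;
}

/* True when the packet should not take the fast path: either IP options
 * are present or the header length is shorter than the minimum.
 */
static bool sketch_needs_slow_path(unsigned int thoff)
{
	return thoff != SKETCH_IPHDR_LEN;
}
```

With the original "greater than" comparison, a header advertising ihl = 4 (16 bytes) would have been treated as option-free.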

> > +nf_flow_offload_hook(void *priv, struct sk_buff *skb,
> > +		     const struct nf_hook_state *state)
> > +{
> > +	struct flow_offload_tuple_rhash *tuplehash;
> > +	struct nf_flowtable *flow_table = priv;
> > +	struct flow_offload_tuple tuple = {};
> > +	union nf_inet_addr nexthop;
> > +	struct flow_offload *flow;
> > +	struct net_device *outdev;
> > +	struct iphdr *iph;
> > +
> > +	if (nf_flow_tuple_ip(skb, &tuple) < 0)
> > +		return NF_ACCEPT;
> > +
> > +	tuplehash = flow_offload_lookup(flow_table, &tuple);
> > +	if (tuplehash == NULL)
> > +		return NF_ACCEPT;
> > +
> > +	outdev = dev_get_by_index_rcu(&init_net, tuplehash->tuple.oifidx);
> 
> state->net ?

Yes, netns support is in my TODO list.


* Re: [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload
  2017-12-08 10:18   ` Florian Westphal
@ 2017-12-08 21:16     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 13+ messages in thread
From: Pablo Neira Ayuso @ 2017-12-08 21:16 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, netdev, f.fainelli, simon.horman, ronye, jiri,
	nbd, john, kubakici

On Fri, Dec 08, 2017 at 11:18:36AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> 
> > diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
> > index ff27dad268c3..c578c3aec0e0 100644
> > --- a/net/netfilter/nf_flow_table.c
> > +++ b/net/netfilter/nf_flow_table.c
> > @@ -212,6 +212,21 @@ int nf_flow_table_iterate(struct nf_flowtable *flow_table,
> >  }
> >  EXPORT_SYMBOL_GPL(nf_flow_table_iterate);
> >  
> > +static void flow_offload_hw_del(struct flow_offload *flow)
> > +{
> > +	struct net_device *indev;
> > +	int ret, ifindex;
> > +
> > +	rtnl_lock();
> > +	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
> > +	indev = __dev_get_by_index(&init_net, ifindex);
> 
> I think this should pass struct net * as arg to flow_offload_hw_del.
>
> > +	if (WARN_ON(!indev))
> > +		return;
> > +
> > +	ret = indev->netdev_ops->ndo_flow_offload(FLOW_OFFLOAD_DEL, flow);
> > +	rtnl_unlock();
> > +}
> 
> Please no rtnl lock unless absolutely needed.
> Seems this could even avoid the mutex completely by using
> dev_get_by_index + dev_put.

OK, we still need to make sure that additions and deletions from
hardware don't occur concurrently, but we can probably do that with
another mutex.

> > +static int do_flow_offload(struct flow_offload *flow)
> > +{
> > +	struct net_device *indev;
> > +	int ret, ifindex;
> > +
> > +	rtnl_lock();
> > +	ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx;
> > +	indev = __dev_get_by_index(&init_net, ifindex);
> 
> likewise.
> 
> > +#define FLOW_HW_WORK_TIMEOUT	msecs_to_jiffies(100)
> > +
> > +static struct delayed_work nft_flow_offload_dwork;
> 
> I would go with struct work and no delay at all.

Will have a look into this, thanks!


Thread overview: 13+ messages
2017-12-07 12:44 [PATCH nf-next RFC,v2 0/6] Flow offload infrastructure Pablo Neira Ayuso
2017-12-07 12:44 ` [PATCH nf-next RFC,v2 1/6] netfilter: nf_conntrack: add IPS_OFFLOAD status bit Pablo Neira Ayuso
2017-12-08  6:47   ` Florian Westphal
2017-12-08 21:00     ` Pablo Neira Ayuso
2017-12-07 12:44 ` [PATCH nf-next RFC,v2 2/6] netfilter: nf_tables: add flow table netlink frontend Pablo Neira Ayuso
2017-12-07 12:44 ` [PATCH nf-next RFC,v2 3/6] netfilter: add generic flow table infrastructure Pablo Neira Ayuso
2017-12-07 12:44 ` [PATCH nf-next RFC,v2 4/6] netfilter: flow table support for IPv4 Pablo Neira Ayuso
2017-12-08 10:04   ` Florian Westphal
2017-12-08 21:14     ` Pablo Neira Ayuso
2017-12-07 12:45 ` [PATCH nf-next RFC,v2 5/6] netfilter: nf_tables: flow offload expression Pablo Neira Ayuso
2017-12-07 12:45 ` [PATCH nf-next RFC,v2 6/6] netfilter: nft_flow_offload: add ndo hooks for hardware offload Pablo Neira Ayuso
2017-12-08 10:18   ` Florian Westphal
2017-12-08 21:16     ` Pablo Neira Ayuso
