linux-kernel.vger.kernel.org archive mirror
* [RFCv2 net-next 0/7] OVS conntrack support
@ 2015-03-02 21:54 Joe Stringer
  2015-03-02 21:54 ` [RFCv2 net-next 1/7] openvswitch: Serialize acts with original netlink len Joe Stringer
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:54 UTC
  To: netdev, Pablo Neira Ayuso
  Cc: linux-kernel, Justin Pettit, Andy Zhou, Thomas Graf, Patrick McHardy

The goal of this series is to allow OVS to send packets through the Linux
kernel connection tracker, and subsequently match on fields populated by
conntrack.

Sending this out as another RFC, as this is the first version to include IP
fragment support. Only IPv4 is added right now, as we'd like to get some
feedback on that approach before we implement IPv6 fragment support.

Helper support, for tracking a particular flow a la iptables CT targets, is
also yet to be addressed. I think this is just a matter of having userspace
specify the helper to use (e.g. via an 8-bit field in the conntrack action),
and setting up the conntrack template accordingly when OVS first installs the
flow containing a conntrack action.

There are some additional related items that I intend to work on, which I do
not see as prerequisite for this series:
- OVS Connlabel support.
- Allow OVS to register logging facilities for conntrack.
- Conntrack per-zone configuration.

The branch below has been updated with the corresponding userspace pieces:
https://github.com/justinpettit/ovs/tree/conntrack


RFCv2:
- Support IPv4 fragments
- Warn when ct->net is different from skb net in skb_has_valid_nfct().
- Set OVS_CS_F_TRACKED when a flow cannot be identified ("invalid")
- Continue processing packets when conntrack marks the flow invalid.
- Use PF_INET6 family when sending IPv6 packets to conntrack.
- Verify conn_* matches when deserializing metadata from netlink.
- Only allow conntrack action on IPv4/IPv6 packets.
- Remove explicit dependencies on conn_zone, conn_mark.
- General tidyups
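
To illustrate the two address-family items above (PF_INET6 for IPv6, and
rejecting anything that is not IPv4/IPv6), here is a minimal userspace sketch.
The function name ethertype_to_pf() and the MODEL_* constants are made up for
this example; the series itself does the equivalent selection inline in the
datapath.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Ethertype values for IPv4 and IPv6 (named MODEL_* to mark them as
 * local to this sketch rather than taken from kernel headers). */
#define MODEL_ETH_P_IP   0x0800
#define MODEL_ETH_P_IPV6 0x86DD

/* Hypothetical helper: map a network-byte-order ethertype to the protocol
 * family that would be handed to conntrack. PF_UNSPEC means "not an IP
 * packet", which a caller would treat as unsupported. */
int ethertype_to_pf(uint16_t eth_type_be)
{
	if (eth_type_be == htons(MODEL_ETH_P_IP))
		return PF_INET;
	if (eth_type_be == htons(MODEL_ETH_P_IPV6))
		return PF_INET6;
	return PF_UNSPEC;
}
```

A caller would refuse to install a conntrack action when the result is
PF_UNSPEC, for example for ARP frames.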

RFCv1:
- Rebase to net-next.
- Add conn_zone field to the flow key.
- Add explicit dependencies on conn_zone, conn_mark.
- Refactor conntrack changes into net/openvswitch/ovs_conntrack.*.
- Don't allow set_field() actions to change conn_state, conn_zone.
- Add OVS_CS_F_* flags to indicate connection state.
- Add "invalid" connection state.


Andy Zhou (3):
  net: refactor ip_fragment()
  net: Refactor ip_defrag() APIs
  openvswitch: Support fragmented IPv4 packets for conntrack

Joe Stringer (2):
  openvswitch: Serialize acts with original netlink len
  openvswitch: Move MASKED* macros to datapath.h

Justin Pettit (2):
  openvswitch: Add conntrack action
  openvswitch: Allow matching on conntrack mark

 drivers/net/macvlan.c               |    2 +-
 include/net/ip.h                    |   13 +-
 include/uapi/linux/openvswitch.h    |   42 +++-
 net/ipv4/ip_fragment.c              |   46 ++--
 net/ipv4/ip_input.c                 |    5 +-
 net/ipv4/ip_output.c                |  113 +++++----
 net/ipv4/netfilter/nf_defrag_ipv4.c |    2 +-
 net/netfilter/ipvs/ip_vs_core.c     |    2 +-
 net/openvswitch/Kconfig             |   11 +
 net/openvswitch/Makefile            |    1 +
 net/openvswitch/actions.c           |  140 +++++++++---
 net/openvswitch/conntrack.c         |  427 +++++++++++++++++++++++++++++++++++
 net/openvswitch/conntrack.h         |   91 ++++++++
 net/openvswitch/datapath.c          |   60 +++--
 net/openvswitch/datapath.h          |   10 +
 net/openvswitch/flow.c              |    4 +
 net/openvswitch/flow.h              |    4 +
 net/openvswitch/flow_netlink.c      |   95 ++++++--
 net/openvswitch/flow_netlink.h      |    4 +-
 net/openvswitch/vport.c             |    1 +
 net/packet/af_packet.c              |    2 +-
 21 files changed, 938 insertions(+), 137 deletions(-)
 create mode 100644 net/openvswitch/conntrack.c
 create mode 100644 net/openvswitch/conntrack.h

-- 
1.7.10.4



* [RFCv2 net-next 1/7] openvswitch: Serialize acts with original netlink len
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
@ 2015-03-02 21:54 ` Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 2/7] openvswitch: Move MASKED* macros to datapath.h Joe Stringer
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:54 UTC
  To: netdev, Pablo Neira Ayuso
  Cc: linux-kernel, Justin Pettit, azhou, Thomas Graf, Patrick McHardy

Previously, we used the kernel-internal netlink actions length to
calculate the size of messages to serialize back to userspace.
However, the sw_flow_actions may not be formatted exactly the same as the
actions on the wire, so store the original actions length when
de-serializing and re-use the original length when serializing.
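
As a rough illustration of why the two lengths can differ, the following
userspace sketch models the size calculation. Everything prefixed with model_
is a stand-in invented for this example (it is not kernel code); it assumes
the usual netlink layout of a 4-byte attribute header and 4-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sw_flow_actions. */
struct model_flow_actions {
	size_t orig_len;      /* actions length as received from userspace */
	uint32_t actions_len; /* length of the kernel-internal representation */
};

/* Round up to the 4-byte netlink attribute alignment. */
size_t model_nla_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Bytes needed for one attribute carrying 'payload' bytes: a 4-byte
 * attribute header plus the payload, padded to the alignment. */
size_t model_nla_total_size(size_t payload)
{
	return model_nla_align(4 + payload);
}

/* Size reserved for the actions attribute in a reply. It is derived from
 * the wire length, not from the (possibly larger) internal length. */
size_t model_actions_msg_size(const struct model_flow_actions *acts)
{
	return model_nla_total_size(acts->orig_len);
}
```

With orig_len = 20 and actions_len = 48, the reply reserves room for the
20-byte wire form (24 bytes with the header) rather than for the internal
48 bytes.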

Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
 net/openvswitch/datapath.c     |    2 +-
 net/openvswitch/flow.h         |    1 +
 net/openvswitch/flow_netlink.c |    1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index ae5e77c..c8c60c5 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -700,7 +700,7 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,
 
 	/* OVS_FLOW_ATTR_ACTIONS */
 	if (should_fill_actions(ufid_flags))
-		len += nla_total_size(acts->actions_len);
+		len += nla_total_size(acts->orig_len);
 
 	return len
 		+ nla_total_size(sizeof(struct ovs_flow_stats)) /* OVS_FLOW_ATTR_STATS */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a076e44..998401a 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -209,6 +209,7 @@ struct sw_flow_id {
 
 struct sw_flow_actions {
 	struct rcu_head rcu;
+	size_t orig_len;	/* From flow_cmd_new netlink actions size */
 	u32 actions_len;
 	struct nlattr actions[];
 };
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 216f20b..d5b01af 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1543,6 +1543,7 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log)
 		return ERR_PTR(-ENOMEM);
 
 	sfa->actions_len = 0;
+	sfa->orig_len = size;
 	return sfa;
 }
 
-- 
1.7.10.4



* [RFCv2 net-next 2/7] openvswitch: Move MASKED* macros to datapath.h
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
  2015-03-02 21:54 ` [RFCv2 net-next 1/7] openvswitch: Serialize acts with original netlink len Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 3/7] openvswitch: Add conntrack action Joe Stringer
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC
  To: netdev, Pablo Neira Ayuso
  Cc: linux-kernel, Justin Pettit, azhou, Thomas Graf, Patrick McHardy

This will allow the ovs-conntrack code to reuse these macros.
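
For readers unfamiliar with the macros, a small userspace sketch of their
semantics follows. The macro bodies are the same as in this patch; the
wrapper set_masked_u8() is a made-up name used only for this example.

```c
#include <stdint.h>

/* 'KEY' must not have any bits set outside of the 'MASK' */
#define OVS_MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
#define OVS_SET_MASKED(OLD, KEY, MASK) ((OLD) = OVS_MASKED(OLD, KEY, MASK))

/* Hypothetical helper: replace only the bits of 'old' selected by 'mask'
 * with the corresponding bits of 'key'; unmasked bits keep their value. */
uint8_t set_masked_u8(uint8_t old, uint8_t key, uint8_t mask)
{
	OVS_SET_MASKED(old, key, mask);
	return old;
}
```

For example, set_masked_u8(0xAB, 0x0C, 0x0F) rewrites only the low nibble and
yields 0xAC, while a zero mask leaves the old value untouched.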

Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
 net/openvswitch/actions.c  |   52 +++++++++++++++++++++-----------------------
 net/openvswitch/datapath.h |    4 ++++
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index b491c1c..ed3cb56 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -185,10 +185,6 @@ static int pop_mpls(struct sk_buff *skb, struct sw_flow_key *key,
 	return 0;
 }
 
-/* 'KEY' must not have any bits set outside of the 'MASK' */
-#define MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
-#define SET_MASKED(OLD, KEY, MASK) ((OLD) = MASKED(OLD, KEY, MASK))
-
 static int set_mpls(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		    const __be32 *mpls_lse, const __be32 *mask)
 {
@@ -201,7 +197,7 @@ static int set_mpls(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		return err;
 
 	stack = (__be32 *)skb_mpls_header(skb);
-	lse = MASKED(*stack, *mpls_lse, *mask);
+	lse = OVS_MASKED(*stack, *mpls_lse, *mask);
 	if (skb->ip_summed == CHECKSUM_COMPLETE) {
 		__be32 diff[] = { ~(*stack), lse };
 
@@ -244,9 +240,9 @@ static void ether_addr_copy_masked(u8 *dst_, const u8 *src_, const u8 *mask_)
 	const u16 *src = (const u16 *)src_;
 	const u16 *mask = (const u16 *)mask_;
 
-	SET_MASKED(dst[0], src[0], mask[0]);
-	SET_MASKED(dst[1], src[1], mask[1]);
-	SET_MASKED(dst[2], src[2], mask[2]);
+	OVS_SET_MASKED(dst[0], src[0], mask[0]);
+	OVS_SET_MASKED(dst[1], src[1], mask[1]);
+	OVS_SET_MASKED(dst[2], src[2], mask[2]);
 }
 
 static int set_eth_addr(struct sk_buff *skb, struct sw_flow_key *flow_key,
@@ -330,10 +326,10 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
 static void mask_ipv6_addr(const __be32 old[4], const __be32 addr[4],
 			   const __be32 mask[4], __be32 masked[4])
 {
-	masked[0] = MASKED(old[0], addr[0], mask[0]);
-	masked[1] = MASKED(old[1], addr[1], mask[1]);
-	masked[2] = MASKED(old[2], addr[2], mask[2]);
-	masked[3] = MASKED(old[3], addr[3], mask[3]);
+	masked[0] = OVS_MASKED(old[0], addr[0], mask[0]);
+	masked[1] = OVS_MASKED(old[1], addr[1], mask[1]);
+	masked[2] = OVS_MASKED(old[2], addr[2], mask[2]);
+	masked[3] = OVS_MASKED(old[3], addr[3], mask[3]);
 }
 
 static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
@@ -350,15 +346,15 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
 static void set_ipv6_fl(struct ipv6hdr *nh, u32 fl, u32 mask)
 {
 	/* Bits 21-24 are always unmasked, so this retains their values. */
-	SET_MASKED(nh->flow_lbl[0], (u8)(fl >> 16), (u8)(mask >> 16));
-	SET_MASKED(nh->flow_lbl[1], (u8)(fl >> 8), (u8)(mask >> 8));
-	SET_MASKED(nh->flow_lbl[2], (u8)fl, (u8)mask);
+	OVS_SET_MASKED(nh->flow_lbl[0], (u8)(fl >> 16), (u8)(mask >> 16));
+	OVS_SET_MASKED(nh->flow_lbl[1], (u8)(fl >> 8), (u8)(mask >> 8));
+	OVS_SET_MASKED(nh->flow_lbl[2], (u8)fl, (u8)mask);
 }
 
 static void set_ip_ttl(struct sk_buff *skb, struct iphdr *nh, u8 new_ttl,
 		       u8 mask)
 {
-	new_ttl = MASKED(nh->ttl, new_ttl, mask);
+	new_ttl = OVS_MASKED(nh->ttl, new_ttl, mask);
 
 	csum_replace2(&nh->check, htons(nh->ttl << 8), htons(new_ttl << 8));
 	nh->ttl = new_ttl;
@@ -384,7 +380,7 @@ static int set_ipv4(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	 * makes sense to check if the value actually changed.
 	 */
 	if (mask->ipv4_src) {
-		new_addr = MASKED(nh->saddr, key->ipv4_src, mask->ipv4_src);
+		new_addr = OVS_MASKED(nh->saddr, key->ipv4_src, mask->ipv4_src);
 
 		if (unlikely(new_addr != nh->saddr)) {
 			set_ip_addr(skb, nh, &nh->saddr, new_addr);
@@ -392,7 +388,7 @@ static int set_ipv4(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		}
 	}
 	if (mask->ipv4_dst) {
-		new_addr = MASKED(nh->daddr, key->ipv4_dst, mask->ipv4_dst);
+		new_addr = OVS_MASKED(nh->daddr, key->ipv4_dst, mask->ipv4_dst);
 
 		if (unlikely(new_addr != nh->daddr)) {
 			set_ip_addr(skb, nh, &nh->daddr, new_addr);
@@ -480,7 +476,8 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		    *(__be32 *)nh & htonl(IPV6_FLOWINFO_FLOWLABEL);
 	}
 	if (mask->ipv6_hlimit) {
-		SET_MASKED(nh->hop_limit, key->ipv6_hlimit, mask->ipv6_hlimit);
+		OVS_SET_MASKED(nh->hop_limit, key->ipv6_hlimit,
+			       mask->ipv6_hlimit);
 		flow_key->ip.ttl = nh->hop_limit;
 	}
 	return 0;
@@ -509,8 +506,8 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,
 
 	uh = udp_hdr(skb);
 	/* Either of the masks is non-zero, so do not bother checking them. */
-	src = MASKED(uh->source, key->udp_src, mask->udp_src);
-	dst = MASKED(uh->dest, key->udp_dst, mask->udp_dst);
+	src = OVS_MASKED(uh->source, key->udp_src, mask->udp_src);
+	dst = OVS_MASKED(uh->dest, key->udp_dst, mask->udp_dst);
 
 	if (uh->check && skb->ip_summed != CHECKSUM_PARTIAL) {
 		if (likely(src != uh->source)) {
@@ -550,12 +547,12 @@ static int set_tcp(struct sk_buff *skb, struct sw_flow_key *flow_key,
 		return err;
 
 	th = tcp_hdr(skb);
-	src = MASKED(th->source, key->tcp_src, mask->tcp_src);
+	src = OVS_MASKED(th->source, key->tcp_src, mask->tcp_src);
 	if (likely(src != th->source)) {
 		set_tp_port(skb, &th->source, src, &th->check);
 		flow_key->tp.src = src;
 	}
-	dst = MASKED(th->dest, key->tcp_dst, mask->tcp_dst);
+	dst = OVS_MASKED(th->dest, key->tcp_dst, mask->tcp_dst);
 	if (likely(dst != th->dest)) {
 		set_tp_port(skb, &th->dest, dst, &th->check);
 		flow_key->tp.dst = dst;
@@ -582,8 +579,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	old_csum = sh->checksum;
 	old_correct_csum = sctp_compute_cksum(skb, sctphoff);
 
-	sh->source = MASKED(sh->source, key->sctp_src, mask->sctp_src);
-	sh->dest = MASKED(sh->dest, key->sctp_dst, mask->sctp_dst);
+	sh->source = OVS_MASKED(sh->source, key->sctp_src, mask->sctp_src);
+	sh->dest = OVS_MASKED(sh->dest, key->sctp_dst, mask->sctp_dst);
 
 	new_csum = sctp_compute_cksum(skb, sctphoff);
 
@@ -744,12 +741,13 @@ static int execute_masked_set_action(struct sk_buff *skb,
 
 	switch (nla_type(a)) {
 	case OVS_KEY_ATTR_PRIORITY:
-		SET_MASKED(skb->priority, nla_get_u32(a), *get_mask(a, u32 *));
+		OVS_SET_MASKED(skb->priority, nla_get_u32(a),
+			       *get_mask(a, u32 *));
 		flow_key->phy.priority = skb->priority;
 		break;
 
 	case OVS_KEY_ATTR_SKB_MARK:
-		SET_MASKED(skb->mark, nla_get_u32(a), *get_mask(a, u32 *));
+		OVS_SET_MASKED(skb->mark, nla_get_u32(a), *get_mask(a, u32 *));
 		flow_key->phy.skb_mark = skb->mark;
 		break;
 
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 3ece945..9661a01 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -199,6 +199,10 @@ void ovs_dp_notify_wq(struct work_struct *work);
 int action_fifos_init(void);
 void action_fifos_exit(void);
 
+/* 'KEY' must not have any bits set outside of the 'MASK' */
+#define OVS_MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
+#define OVS_SET_MASKED(OLD, KEY, MASK) ((OLD) = OVS_MASKED(OLD, KEY, MASK))
+
 #define OVS_NLERR(logging_allowed, fmt, ...)			\
 do {								\
 	if (logging_allowed && net_ratelimit())			\
-- 
1.7.10.4



* [RFCv2 net-next 3/7] openvswitch: Add conntrack action
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
  2015-03-02 21:54 ` [RFCv2 net-next 1/7] openvswitch: Serialize acts with original netlink len Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 2/7] openvswitch: Move MASKED* macros to datapath.h Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 4/7] openvswitch: Allow matching on conntrack mark Joe Stringer
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC
  To: netdev, Pablo Neira Ayuso
  Cc: Justin Pettit, linux-kernel, azhou, Thomas Graf, Patrick McHardy

From: Justin Pettit <jpettit@nicira.com>

Expose the kernel connection tracker to OVS. Userspace components can
make use of the "conntrack()" action, followed by "recirculate", to
populate the conntracking state in the OVS flow key, and subsequently
match on that state.
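
As a sketch of what "match on that state" could look like after
recirculation, the snippet below tests a connection-state byte against a
value/mask pair. The flag values are the OVS_CS_F_* definitions from this
patch; conn_state_matches() is a hypothetical helper, and masked comparison
is assumed here as the matching model rather than taken from this patch.

```c
#include <stdint.h>

/* OVS_KEY_ATTR_CONN_STATE flags, as defined in this patch. */
#define OVS_CS_F_NEW         0x01 /* Beginning of a new connection. */
#define OVS_CS_F_ESTABLISHED 0x02 /* Part of an existing connection. */
#define OVS_CS_F_RELATED     0x04 /* Related to an established connection. */
#define OVS_CS_F_INVALID     0x20 /* Could not track connection. */
#define OVS_CS_F_REPLY_DIR   0x40 /* Flow is in the reply direction. */
#define OVS_CS_F_TRACKED     0x80 /* Conntrack has occurred. */

/* Hypothetical helper: nonzero when the bits of 'state' selected by 'mask'
 * equal the same bits of 'value'. Bits outside 'mask' are ignored. */
int conn_state_matches(uint8_t state, uint8_t value, uint8_t mask)
{
	return (state & mask) == (value & mask);
}
```

A flow that should only see tracked, established traffic would use
OVS_CS_F_TRACKED | OVS_CS_F_ESTABLISHED as both value and mask; a packet that
has not yet been through conntrack() has state 0 and does not match.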

IPv4 fragment handling for conntrack is added in the following patches.

Zone support added by Thomas Graf <tgraf@noironetworks.com>

Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
This can be tested with the corresponding userspace component here:
https://www.github.com/justinpettit/openvswitch conntrack

RFCv2:
- Warn when ct->net is different from skb net in skb_has_valid_nfct().
- Save the OVS CB before calling into conntrack.
- Set OVS_CS_F_TRACKED when a flow cannot be identified ("invalid")
- Continue processing packets when conntrack marks the flow invalid.
- Use PF_INET6 family when sending IPv6 packets to conntrack.
- Verify conn_* matches when deserializing metadata from netlink.
- Only allow conntrack action on IPv4/IPv6 packets.
- General tidyups

Changes since RFC:
- Rebase to net-next.
- Add conn_zone field to the flow key.
- Add explicit dependencies on conn_zone, conn_mark.
- Refactor conntrack changes into net/openvswitch/ovs_conntrack.*.
- Don't allow set_field() actions to change conn_state, conn_zone.
- Add OVS_CS_F_* flags to indicate connection state.
- Add "invalid" connection state.
---
 include/uapi/linux/openvswitch.h |   36 +++++
 net/openvswitch/Kconfig          |   11 ++
 net/openvswitch/Makefile         |    1 +
 net/openvswitch/actions.c        |    5 +
 net/openvswitch/conntrack.c      |  296 ++++++++++++++++++++++++++++++++++++++
 net/openvswitch/conntrack.h      |   77 ++++++++++
 net/openvswitch/datapath.c       |   18 ++-
 net/openvswitch/flow.c           |    3 +
 net/openvswitch/flow.h           |    2 +
 net/openvswitch/flow_netlink.c   |   82 +++++++++--
 net/openvswitch/flow_netlink.h   |    4 +-
 11 files changed, 512 insertions(+), 23 deletions(-)
 create mode 100644 net/openvswitch/conntrack.c
 create mode 100644 net/openvswitch/conntrack.h

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index bbd49a0..f1909ae 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -317,6 +317,8 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_MPLS,      /* array of struct ovs_key_mpls.
 				 * The implementation may restrict
 				 * the accepted length of the array. */
+	OVS_KEY_ATTR_CONN_STATE,/* u8 of OVS_CS_F_* */
+	OVS_KEY_ATTR_CONN_ZONE, /* u16 connection tracking zone. */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ovs_tunnel_info */
@@ -429,6 +431,15 @@ struct ovs_key_nd {
 	__u8	nd_tll[ETH_ALEN];
 };
 
+/* OVS_KEY_ATTR_CONN_STATE flags */
+#define OVS_CS_F_NEW               0x01 /* Beginning of a new connection. */
+#define OVS_CS_F_ESTABLISHED       0x02 /* Part of an existing connection. */
+#define OVS_CS_F_RELATED           0x04 /* Related to an established
+					 * connection. */
+#define OVS_CS_F_INVALID           0x20 /* Could not track connection. */
+#define OVS_CS_F_REPLY_DIR         0x40 /* Flow is in the reply direction. */
+#define OVS_CS_F_TRACKED           0x80 /* Conntrack has occurred. */
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -591,6 +602,28 @@ struct ovs_action_hash {
 };
 
 /**
+ * enum ovs_ct_attr - Attributes for %OVS_ACTION_ATTR_CT action.
+ * @OVS_CT_ATTR_FLAGS: u32 connection tracking flags.
+ * @OVS_CT_ATTR_ZONE: u16 connection tracking zone.
+ */
+enum ovs_ct_attr {
+	OVS_CT_ATTR_UNSPEC,
+	OVS_CT_ATTR_FLAGS,      /* u32 bitmask of OVS_CT_F_*. */
+	OVS_CT_ATTR_ZONE,       /* u16 zone id. */
+	__OVS_CT_ATTR_MAX
+};
+
+#define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1)
+
+/*
+ * OVS_CT_ATTR_FLAGS flags - bitmask of %OVS_CT_F_*
+ * @OVS_CT_F_COMMIT: Commits the flow to the conntrack hashtable in the
+ * specified zone. Future packets for the current connection will be
+ * considered as 'established' or 'related'.
+ */
+#define OVS_CT_F_COMMIT		0x01
+
+/**
  * enum ovs_action_attr - Action types.
  *
  * @OVS_ACTION_ATTR_OUTPUT: Output packet to port.
@@ -619,6 +652,8 @@ struct ovs_action_hash {
  * indicate the new packet contents. This could potentially still be
  * %ETH_P_MPLS if the resulting MPLS label stack is not empty.  If there
  * is no MPLS label stack, as determined by ethertype, no action is taken.
+ * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
+ * entries in the flow key.
  *
  * Only a single header can be set with a single %OVS_ACTION_ATTR_SET.  Not all
  * fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -644,6 +679,7 @@ enum ovs_action_attr {
 				       * data immediately followed by a mask.
 				       * The data must be zero for the unmasked
 				       * bits. */
+	OVS_ACTION_ATTR_CT,           /* One nested OVS_CT_ATTR_* . */
 
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index b7d818c..b108dca 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -30,6 +30,17 @@ config OPENVSWITCH
 
 	  If unsure, say N.
 
+config OPENVSWITCH_CONNTRACK
+	bool "Open vSwitch conntrack action support"
+	depends on OPENVSWITCH
+	depends on NF_CONNTRACK
+	default OPENVSWITCH
+	---help---
+	  If you say Y here, then Open vSwitch module will be able to pass
+	  packets through conntrack.
+
+	  Say N to exclude this support and reduce the binary size.
+
 config OPENVSWITCH_GRE
 	tristate "Open vSwitch GRE tunneling support"
 	depends on OPENVSWITCH
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 91b9478..7e7e2c6 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -15,6 +15,7 @@ openvswitch-y := \
 	vport-internal_dev.o \
 	vport-netdev.o
 
+openvswitch-$(CONFIG_OPENVSWITCH_CONNTRACK) += conntrack.o
 obj-$(CONFIG_OPENVSWITCH_GENEVE)+= vport-geneve.o
 obj-$(CONFIG_OPENVSWITCH_VXLAN)	+= vport-vxlan.o
 obj-$(CONFIG_OPENVSWITCH_GRE)	+= vport-gre.o
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index ed3cb56..2d801f6 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -38,6 +38,7 @@
 
 #include "datapath.h"
 #include "flow.h"
+#include "conntrack.h"
 #include "vport.h"
 
 static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
@@ -916,6 +917,10 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		case OVS_ACTION_ATTR_SAMPLE:
 			err = sample(dp, skb, key, a);
 			break;
+
+		case OVS_ACTION_ATTR_CT:
+			err = ovs_ct_execute(skb, key, nla_data(a));
+			break;
 		}
 
 		if (unlikely(err)) {
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
new file mode 100644
index 0000000..d911c4c
--- /dev/null
+++ b/net/openvswitch/conntrack.c
@@ -0,0 +1,296 @@
+/*
+ * Copyright (c) 2015 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <uapi/linux/openvswitch.h>
+
+#include "datapath.h"
+#include "conntrack.h"
+#include "flow.h"
+#include "flow_netlink.h"
+
+struct ovs_conntrack_info {
+	u32 flags;
+	u16 zone;
+	struct nf_conn *ct;
+};
+
+/* Determine whether skb->nfct is equal to the result of conntrack lookup. */
+static bool skb_nfct_cached(const struct net *net, u16 zone,
+			    const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+	if (!ct)
+		return false;
+	WARN(!net_eq(net, ct->ct_net),
+	     "Packet has conntrack association from different namespace\n");
+	if (zone != nf_ct_zone(ct))
+		return false;
+	return true;
+}
+
+static struct net *ovs_get_net(struct sk_buff *skb)
+{
+#ifdef CONFIG_NET_NS
+	struct vport *vport;
+
+	vport = OVS_CB(skb)->input_vport;
+	if (!vport)
+		return ERR_PTR(-EINVAL);
+
+	return vport->dp->net;
+#else
+	return &init_net;
+#endif
+}
+
+/* Map SKB connection state into the values used by flow definition. */
+u8 ovs_ct_get_state(const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	u8 cstate = OVS_CS_F_TRACKED;
+
+	if (!nf_ct_get(skb, &ctinfo))
+		return 0;
+
+	switch (ctinfo) {
+	case IP_CT_ESTABLISHED_REPLY:
+	case IP_CT_RELATED_REPLY:
+	case IP_CT_NEW_REPLY:
+		cstate |= OVS_CS_F_REPLY_DIR;
+		break;
+	default:
+		break;
+	}
+
+	switch (ctinfo) {
+	case IP_CT_ESTABLISHED:
+	case IP_CT_ESTABLISHED_REPLY:
+		cstate |= OVS_CS_F_ESTABLISHED;
+		break;
+	case IP_CT_RELATED:
+	case IP_CT_RELATED_REPLY:
+		cstate |= OVS_CS_F_RELATED;
+		break;
+	case IP_CT_NEW:
+	case IP_CT_NEW_REPLY:
+		cstate |= OVS_CS_F_NEW;
+		break;
+	default:
+		break;
+	}
+
+	return cstate;
+}
+
+u16 ovs_ct_get_zone(const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+
+	ct = nf_ct_get(skb, &ctinfo);
+
+	return ct ? nf_ct_zone(ct) : NF_CT_DEFAULT_ZONE;
+}
+
+bool ovs_ct_state_valid(const struct sw_flow_key *key)
+{
+	return (key->phy.conn_state &&
+		!(key->phy.conn_state & OVS_CS_F_INVALID));
+}
+
+static int ovs_ct_lookup(struct net *net, struct nf_conn *tmpl,
+			 struct sw_flow_key *key, struct sk_buff *skb)
+{
+	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
+
+	if (!skb_nfct_cached(net, zone, skb)) {
+		uint8_t pf;
+
+		/* Associate skb with specified zone. */
+		if (tmpl) {
+			atomic_inc(&tmpl->ct_general.use);
+			skb->nfct = &tmpl->ct_general;
+			skb->nfctinfo = IP_CT_NEW;
+		}
+
+		pf = key->eth.type == htons(ETH_P_IP) ? PF_INET
+		   : key->eth.type == htons(ETH_P_IPV6) ? PF_INET6
+		   : PF_UNSPEC;
+		if (nf_conntrack_in(net, pf, NF_INET_PRE_ROUTING, skb) !=
+		    NF_ACCEPT)
+			return -ENOENT;
+	}
+
+	if (skb->nfct) {
+		key->phy.conn_state = ovs_ct_get_state(skb);
+		key->phy.conn_zone = ovs_ct_get_zone(skb);
+	} else {
+		key->phy.conn_state = OVS_CS_F_TRACKED | OVS_CS_F_INVALID;
+		key->phy.conn_zone = zone;
+	}
+
+	return 0;
+}
+
+int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
+		   const struct ovs_conntrack_info *info)
+{
+	struct net *net;
+	int nh_ofs = skb_network_offset(skb);
+	struct nf_conn *tmpl = info->ct;
+	int err = -EINVAL;
+
+	net = ovs_get_net(skb);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	/* The conntrack module expects to be working at L3. */
+	skb_pull(skb, nh_ofs);
+
+	if (ovs_ct_lookup(net, tmpl, key, skb))
+		goto err_push_skb;
+
+	if (info->flags & OVS_CT_F_COMMIT && ovs_ct_state_valid(key) &&
+	    nf_conntrack_confirm(skb) != NF_ACCEPT)
+		goto err_push_skb;
+
+	err = 0;
+err_push_skb:
+	/* Point back to L2, which OVS expects. */
+	skb_push(skb, nh_ofs);
+	return err;
+}
+
+int ovs_ct_verify(u64 attrs)
+{
+#ifndef CONFIG_NF_CONNTRACK_ZONES
+	if (attrs & (1ULL << OVS_KEY_ATTR_CONN_ZONE))
+		return -ENOTSUPP;
+#endif
+	return 0;
+}
+
+int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
+		       const struct sw_flow_key *key,
+		       struct sw_flow_actions **sfa,  bool log)
+{
+	struct ovs_conntrack_info ct_info;
+	struct nf_conntrack_tuple t;
+	struct nlattr *a;
+	int rem;
+
+	if (key->eth.type != htons(ETH_P_IP) &&
+	    key->eth.type != htons(ETH_P_IPV6))
+		return -EINVAL;
+
+	memset(&ct_info, 0, sizeof(ct_info));
+
+	nla_for_each_nested(a, attr, rem) {
+		int type = nla_type(a);
+		static const u32 ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
+			[OVS_CT_ATTR_FLAGS] = sizeof(u32),
+			[OVS_CT_ATTR_ZONE] = sizeof(u16),
+		};
+
+		if (type > OVS_CT_ATTR_MAX) {
+			OVS_NLERR(log,
+				  "Unknown conntrack attr (type=%d, max=%d)\n",
+				  type, OVS_CT_ATTR_MAX);
+			return -EINVAL;
+		}
+
+		if (ovs_ct_attr_lens[type] != nla_len(a) &&
+		    ovs_ct_attr_lens[type] != -1) {
+			OVS_NLERR(log,
+				  "Conntrack attr type has unexpected length (type=%d, length=%d, expected=%d)\n",
+				  type, nla_len(a), ovs_ct_attr_lens[type]);
+			return -EINVAL;
+		}
+
+		switch (type) {
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+		case OVS_CT_ATTR_ZONE:
+			memset(&t, 0, sizeof(t));
+			ct_info.zone = nla_get_u16(a);
+			ct_info.ct = nf_conntrack_alloc(net,
+					ct_info.zone, &t, &t,
+					GFP_KERNEL);
+			if (IS_ERR(ct_info.ct))
+				return PTR_ERR(ct_info.ct);
+
+			nf_conntrack_tmpl_insert(net, ct_info.ct);
+			break;
+#endif
+		case OVS_CT_ATTR_FLAGS:
+			ct_info.flags = nla_get_u32(a);
+			break;
+		default:
+			OVS_NLERR(log, "Unknown conntrack attr (%d)\n",
+				  type);
+			return -EINVAL;
+		}
+	}
+
+	if (rem > 0) {
+		OVS_NLERR(log, "Conntrack attr has %d unknown bytes\n", rem);
+		return -EINVAL;
+	}
+
+	return ovs_nla_add_action(sfa, OVS_ACTION_ATTR_CT, &ct_info,
+				  sizeof(ct_info), log);
+}
+
+int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
+			  struct sk_buff *skb)
+{
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, OVS_ACTION_ATTR_CT);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, OVS_CT_ATTR_FLAGS, ct_info->flags))
+		return -EMSGSIZE;
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+	if (nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone))
+		return -EMSGSIZE;
+#endif
+
+	nla_nest_end(skb, start);
+
+	return 0;
+}
+
+void ovs_ct_free_acts(struct sw_flow_actions *sf_acts)
+{
+	if (sf_acts) {
+		struct ovs_conntrack_info *ct_info;
+		struct nlattr *a;
+		int rem, len = sf_acts->actions_len;
+
+		for (a = sf_acts->actions, rem = len; rem > 0;
+		     a = nla_next(a, &rem)) {
+			switch (nla_type(a)) {
+			case OVS_ACTION_ATTR_CT:
+				ct_info = nla_data(a);
+				if (ct_info->ct)
+					nf_ct_put(ct_info->ct);
+				break;
+			}
+		}
+	}
+}
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
new file mode 100644
index 0000000..4bfdb13
--- /dev/null
+++ b/net/openvswitch/conntrack.h
@@ -0,0 +1,77 @@
+/*
+ * Copyright (c) 2015 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef OVS_CONNTRACK_H
+#define OVS_CONNTRACK_H 1
+
+struct ovs_conntrack_info;
+
+#if defined(CONFIG_OPENVSWITCH_CONNTRACK)
+int ovs_ct_verify(u64 attrs);
+int ovs_ct_copy_action(struct net *, const struct nlattr *,
+		       const struct sw_flow_key *, struct sw_flow_actions **,
+		       bool log);
+int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
+
+int ovs_ct_execute(struct sk_buff *, struct sw_flow_key *,
+		   const struct ovs_conntrack_info *);
+
+u8 ovs_ct_get_state(const struct sk_buff *skb);
+u16 ovs_ct_get_zone(const struct sk_buff *skb);
+bool ovs_ct_state_valid(const struct sw_flow_key *key);
+void ovs_ct_free_acts(struct sw_flow_actions *sf_acts);
+#else
+#include <linux/errno.h>
+
+static inline int ovs_ct_verify(u64 attrs)
+{
+	return -ENOTSUPP;
+}
+
+static inline int ovs_ct_copy_action(struct net *net, const struct nlattr *nla,
+				     const struct sw_flow_key *key,
+				     struct sw_flow_actions **acts, bool log)
+{
+	return -ENOTSUPP;
+}
+
+static inline int ovs_ct_action_to_attr(const struct ovs_conntrack_info *info,
+					struct sk_buff *skb)
+{
+	return -ENOTSUPP;
+}
+
+static inline int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
+				 const struct ovs_conntrack_info *info)
+{
+	return -ENOTSUPP;
+}
+
+static inline u8 ovs_ct_get_state(const struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline u16 ovs_ct_get_zone(const struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline bool ovs_ct_state_valid(const struct sw_flow_key *key)
+{
+	return false;
+}
+
+static inline void ovs_ct_free_acts(struct sw_flow_actions *sf_acts) { }
+#endif
+#endif /* conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index c8c60c5..46f67ee 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -519,6 +519,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	struct sk_buff *packet;
 	struct sw_flow *flow;
 	struct sw_flow_actions *sf_acts;
+	struct net *net = sock_net(skb->sk);
 	struct datapath *dp;
 	struct ethhdr *eth;
 	struct vport *input_vport;
@@ -562,7 +563,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	if (err)
 		goto err_flow_free;
 
-	err = ovs_nla_copy_actions(a[OVS_PACKET_ATTR_ACTIONS],
+	err = ovs_nla_copy_actions(net, a[OVS_PACKET_ATTR_ACTIONS],
 				   &flow->key, &acts, log);
 	if (err)
 		goto err_flow_free;
@@ -867,6 +868,7 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
 
 static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 {
+	struct net *net = sock_net(skb->sk);
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow *flow = NULL, *new_flow;
@@ -916,8 +918,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		goto err_kfree_flow;
 
 	/* Validate actions. */
-	error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
-				     &acts, log);
+	error = ovs_nla_copy_actions(net, a[OVS_FLOW_ATTR_ACTIONS],
+				     &new_flow->key, &acts, log);
 	if (error) {
 		OVS_NLERR(log, "Flow actions may not be safe on all matching packets.");
 		goto err_kfree_flow;
@@ -1025,7 +1027,8 @@ error:
 }
 
 /* Factor out action copy to avoid "Wframe-larger-than=1024" warning. */
-static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
+static struct sw_flow_actions *get_flow_actions(struct net *net,
+						const struct nlattr *a,
 						const struct sw_flow_key *key,
 						const struct sw_flow_mask *mask,
 						bool log)
@@ -1035,7 +1038,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 	int error;
 
 	ovs_flow_mask_key(&masked_key, key, mask);
-	error = ovs_nla_copy_actions(a, &masked_key, &acts, log);
+	error = ovs_nla_copy_actions(net, a, &masked_key, &acts, log);
 	if (error) {
 		OVS_NLERR(log,
 			  "Actions may not be safe on all matching packets");
@@ -1047,6 +1050,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
 
 static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 {
+	struct net *net = sock_net(skb->sk);
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
@@ -1078,8 +1082,8 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 
 	/* Validate actions. */
 	if (a[OVS_FLOW_ATTR_ACTIONS]) {
-		acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, &mask,
-					log);
+		acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], &key,
+					&mask, log);
 		if (IS_ERR(acts)) {
 			error = PTR_ERR(acts);
 			goto error;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 50ec42f..de1dbaa 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -49,6 +49,7 @@
 #include "datapath.h"
 #include "flow.h"
 #include "flow_netlink.h"
+#include "conntrack.h"
 
 u64 ovs_flow_used_time(unsigned long flow_jiffies)
 {
@@ -705,6 +706,8 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
 	key->phy.priority = skb->priority;
 	key->phy.in_port = OVS_CB(skb)->input_vport->port_no;
 	key->phy.skb_mark = skb->mark;
+	key->phy.conn_state = ovs_ct_get_state(skb);
+	key->phy.conn_zone = ovs_ct_get_zone(skb);
 	key->ovs_flow_hash = 0;
 	key->recirc_id = 0;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 998401a..ad3779a 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -127,6 +127,8 @@ struct sw_flow_key {
 		u32	priority;	/* Packet QoS priority. */
 		u32	skb_mark;	/* SKB mark. */
 		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
+		u16	conn_zone;	/* Conntrack zone. */
+		u8	conn_state;	/* Connection state. */
 	} __packed phy; /* Safe when right after 'tun_key'. */
 	u32 ovs_flow_hash;		/* Datapath computed hash value.  */
 	u32 recirc_id;			/* Recirculation ID.  */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d5b01af..4264048 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
 #include <net/mpls.h>
 
 #include "flow_netlink.h"
+#include "conntrack.h"
 #include "vport-vxlan.h"
 
 struct ovs_len_tbl {
@@ -281,7 +282,7 @@ size_t ovs_key_attr_size(void)
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 22);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 24);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -290,6 +291,8 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_SKB_MARK */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_DP_HASH */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_RECIRC_ID */
+		+ nla_total_size(1)   /* OVS_KEY_ATTR_CONN_STATE */
+		+ nla_total_size(2)   /* OVS_KEY_ATTR_CONN_ZONE */
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -339,6 +342,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_TUNNEL]	 = { .len = OVS_ATTR_NESTED,
 				     .next = ovs_tunnel_key_lens, },
 	[OVS_KEY_ATTR_MPLS]	 = { .len = sizeof(struct ovs_key_mpls) },
+	[OVS_KEY_ATTR_CONN_STATE] = { .len = sizeof(u8) },
+	[OVS_KEY_ATTR_CONN_ZONE] = { .len = sizeof(u16) },
 };
 
 static bool is_all_zero(const u8 *fp, size_t size)
@@ -766,6 +771,22 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 			return -EINVAL;
 		*attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL);
 	}
+
+	if (ovs_ct_verify(*attrs))
+		return -EINVAL;
+
+	if (*attrs & (1ULL << OVS_KEY_ATTR_CONN_STATE)) {
+		uint8_t conn_state = nla_get_u8(a[OVS_KEY_ATTR_CONN_STATE]);
+
+		SW_FLOW_KEY_PUT(match, phy.conn_state, conn_state, is_mask);
+		*attrs &= ~(1ULL << OVS_KEY_ATTR_CONN_STATE);
+	}
+	if (*attrs & (1ULL << OVS_KEY_ATTR_CONN_ZONE)) {
+		uint16_t conn_zone = nla_get_u16(a[OVS_KEY_ATTR_CONN_ZONE]);
+
+		SW_FLOW_KEY_PUT(match, phy.conn_zone, conn_zone, is_mask);
+		*attrs &= ~(1ULL << OVS_KEY_ATTR_CONN_ZONE);
+	}
 	return 0;
 }
 
@@ -1312,6 +1333,12 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 	if (nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, output->phy.skb_mark))
 		goto nla_put_failure;
 
+	if (nla_put_u8(skb, OVS_KEY_ATTR_CONN_STATE, output->phy.conn_state))
+		goto nla_put_failure;
+
+	if (nla_put_u16(skb, OVS_KEY_ATTR_CONN_ZONE, output->phy.conn_zone))
+		goto nla_put_failure;
+
 	nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
 	if (!nla)
 		goto nla_put_failure;
@@ -1547,11 +1574,21 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log)
 	return sfa;
 }
 
+/* RCU callback used by ovs_nla_free_flow_actions. */
+static void rcu_free_acts_callback(struct rcu_head *rcu)
+{
+	struct sw_flow_actions *sf_acts = container_of(rcu,
+			struct sw_flow_actions, rcu);
+
+	ovs_ct_free_acts(sf_acts);
+	kfree(sf_acts);
+}
+
 /* Schedules 'sf_acts' to be freed after the next RCU grace period.
  * The caller must hold rcu_read_lock for this to be sensible. */
 void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
 {
-	kfree_rcu(sf_acts, rcu);
+	call_rcu(&sf_acts->rcu, rcu_free_acts_callback);
 }
 
 static struct nlattr *reserve_sfa_size(struct sw_flow_actions **sfa,
@@ -1608,8 +1645,8 @@ static struct nlattr *__add_action(struct sw_flow_actions **sfa,
 	return a;
 }
 
-static int add_action(struct sw_flow_actions **sfa, int attrtype,
-		      void *data, int len, bool log)
+int ovs_nla_add_action(struct sw_flow_actions **sfa, int attrtype, void *data,
+		       int len, bool log)
 {
 	struct nlattr *a;
 
@@ -1624,7 +1661,7 @@ static inline int add_nested_action_start(struct sw_flow_actions **sfa,
 	int used = (*sfa)->actions_len;
 	int err;
 
-	err = add_action(sfa, attrtype, NULL, 0, log);
+	err = ovs_nla_add_action(sfa, attrtype, NULL, 0, log);
 	if (err)
 		return err;
 
@@ -1640,12 +1677,12 @@ static inline void add_nested_action_end(struct sw_flow_actions *sfa,
 	a->nla_len = sfa->actions_len - st_offset;
 }
 
-static int __ovs_nla_copy_actions(const struct nlattr *attr,
+static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 				  const struct sw_flow_key *key,
 				  int depth, struct sw_flow_actions **sfa,
 				  __be16 eth_type, __be16 vlan_tci, bool log);
 
-static int validate_and_copy_sample(const struct nlattr *attr,
+static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 				    const struct sw_flow_key *key, int depth,
 				    struct sw_flow_actions **sfa,
 				    __be16 eth_type, __be16 vlan_tci, bool log)
@@ -1677,15 +1714,15 @@ static int validate_and_copy_sample(const struct nlattr *attr,
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SAMPLE, log);
 	if (start < 0)
 		return start;
-	err = add_action(sfa, OVS_SAMPLE_ATTR_PROBABILITY,
-			 nla_data(probability), sizeof(u32), log);
+	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PROBABILITY,
+				 nla_data(probability), sizeof(u32), log);
 	if (err)
 		return err;
 	st_acts = add_nested_action_start(sfa, OVS_SAMPLE_ATTR_ACTIONS, log);
 	if (st_acts < 0)
 		return st_acts;
 
-	err = __ovs_nla_copy_actions(actions, key, depth + 1, sfa,
+	err = __ovs_nla_copy_actions(net, actions, key, depth + 1, sfa,
 				     eth_type, vlan_tci, log);
 	if (err)
 		return err;
@@ -2007,7 +2044,7 @@ static int copy_action(const struct nlattr *from,
 	return 0;
 }
 
-static int __ovs_nla_copy_actions(const struct nlattr *attr,
+static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 				  const struct sw_flow_key *key,
 				  int depth, struct sw_flow_actions **sfa,
 				  __be16 eth_type, __be16 vlan_tci, bool log)
@@ -2031,7 +2068,8 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
 			[OVS_ACTION_ATTR_SET] = (u32)-1,
 			[OVS_ACTION_ATTR_SET_MASKED] = (u32)-1,
 			[OVS_ACTION_ATTR_SAMPLE] = (u32)-1,
-			[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash)
+			[OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
+			[OVS_ACTION_ATTR_CT] = (u32)-1,
 		};
 		const struct ovs_action_push_vlan *vlan;
 		int type = nla_type(a);
@@ -2138,13 +2176,20 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
 			break;
 
 		case OVS_ACTION_ATTR_SAMPLE:
-			err = validate_and_copy_sample(a, key, depth, sfa,
+			err = validate_and_copy_sample(net, a, key, depth, sfa,
 						       eth_type, vlan_tci, log);
 			if (err)
 				return err;
 			skip_copy = true;
 			break;
 
+		case OVS_ACTION_ATTR_CT:
+			err = ovs_ct_copy_action(net, a, key, sfa, log);
+			if (err)
+				return err;
+			skip_copy = true;
+			break;
+
 		default:
 			OVS_NLERR(log, "Unknown Action type %d", type);
 			return -EINVAL;
@@ -2163,7 +2208,7 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
 }
 
 /* 'key' must be the masked key. */
-int ovs_nla_copy_actions(const struct nlattr *attr,
+int ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			 const struct sw_flow_key *key,
 			 struct sw_flow_actions **sfa, bool log)
 {
@@ -2173,7 +2218,7 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
 	if (IS_ERR(*sfa))
 		return PTR_ERR(*sfa);
 
-	err = __ovs_nla_copy_actions(attr, key, 0, sfa, key->eth.type,
+	err = __ovs_nla_copy_actions(net, attr, key, 0, sfa, key->eth.type,
 				     key->eth.tci, log);
 	if (err)
 		kfree(*sfa);
@@ -2291,6 +2336,13 @@ int ovs_nla_put_actions(const struct nlattr *attr, int len, struct sk_buff *skb)
 			if (err)
 				return err;
 			break;
+
+		case OVS_ACTION_ATTR_CT:
+			err = ovs_ct_action_to_attr(nla_data(a), skb);
+			if (err)
+				return err;
+			break;
+
 		default:
 			if (nla_put(skb, type, nla_len(a), nla_data(a)))
 				return -EMSGSIZE;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 5c3d75b..f699dca1 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -62,9 +62,11 @@ int ovs_nla_get_identifier(struct sw_flow_id *sfid, const struct nlattr *ufid,
 			   const struct sw_flow_key *key, bool log);
 u32 ovs_nla_get_ufid_flags(const struct nlattr *attr);
 
-int ovs_nla_copy_actions(const struct nlattr *attr,
+int ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			 const struct sw_flow_key *key,
 			 struct sw_flow_actions **sfa, bool log);
+int ovs_nla_add_action(struct sw_flow_actions **sfa, int attrtype,
+		       void *data, int len, bool log);
 int ovs_nla_put_actions(const struct nlattr *attr,
 			int len, struct sk_buff *skb);
 
-- 
1.7.10.4



* [RFCv2 net-next 4/7] openvswitch: Allow matching on conntrack mark
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
                   ` (2 preceding siblings ...)
  2015-03-02 21:55 ` [RFCv2 net-next 3/7] openvswitch: Add conntrack action Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 5/7] net: refactor ip_fragment() Joe Stringer
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC (permalink / raw)
  To: netdev, Pablo Neira Ayuso
  Cc: Justin Pettit, linux-kernel, azhou, Thomas Graf, Patrick McHardy

From: Justin Pettit <jpettit@nicira.com>

Allow matching and setting the conntrack mark field. As with conntrack
state and zone, the mark is populated by executing the conntrack() action.
Unlike those fields, the conntrack mark is also writable. The
set_field() action may be used to modify the mark, which takes
effect on the most recent conntrack entry.

E.g.: actions:conntrack(zone=0),conntrack(zone=1),set_field(1->conntrack_mark)

This will perform conntrack lookup in zone 0, then lookup in zone 1,
then modify the mark for the entry in zone 1. The mark for the entry in
zone 0 is unchanged. The conntrack entry itself must be committed using the
"commit" flag in the conntrack action flags for this change to persist.
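
The semantics above can be illustrated with a small userspace sketch. It is
not code from this patch: mark_set_masked(), struct zone_entry and
set_mark_on_latest() are hypothetical names, and the sketch assumes that a
masked set keeps the old bits outside the mask and takes the new bits inside
it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: bits set in 'mask' are taken
 * from 'value'; all other bits keep their value from 'old'. */
static uint32_t mark_set_masked(uint32_t old, uint32_t value, uint32_t mask)
{
	return (old & ~mask) | (value & mask);
}

/* Hypothetical stand-in for one conntrack entry per zone lookup. */
struct zone_entry {
	uint16_t zone;
	uint32_t mark;
};

/* Only the entry found by the most recent lookup is modified; entries from
 * earlier lookups (other zones) keep their mark. */
static void set_mark_on_latest(struct zone_entry *entries, int n_lookups,
			       uint32_t value, uint32_t mask)
{
	struct zone_entry *latest;

	assert(n_lookups > 0);
	latest = &entries[n_lookups - 1];
	latest->mark = mark_set_masked(latest->mark, value, mask);
}
```

With two lookups (zone 0, then zone 1) and a set of value 1 under mask 0xf,
only the zone 1 entry changes, which matches the example above.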

Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
---
RFCv2:
- Verify conn_* matches when deserializing metadata from netlink.
---
 include/uapi/linux/openvswitch.h |    1 +
 net/openvswitch/actions.c        |    5 ++
 net/openvswitch/conntrack.c      |   98 ++++++++++++++++++++++++++++++++++++--
 net/openvswitch/conntrack.h      |   14 ++++++
 net/openvswitch/flow.c           |    1 +
 net/openvswitch/flow.h           |    1 +
 net/openvswitch/flow_netlink.c   |   14 +++++-
 7 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index f1909ae..30d70a3 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -319,6 +319,7 @@ enum ovs_key_attr {
 				 * the accepted length of the array. */
 	OVS_KEY_ATTR_CONN_STATE,/* u8 of OVS_CS_F_* */
 	OVS_KEY_ATTR_CONN_ZONE, /* u16 connection tracking zone. */
+	OVS_KEY_ATTR_CONN_MARK, /* u32 connection tracking mark */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ovs_tunnel_info */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 2d801f6..9bd9f99 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -791,6 +791,11 @@ static int execute_masked_set_action(struct sk_buff *skb,
 		err = set_mpls(skb, flow_key, nla_data(a), get_mask(a,
 								    __be32 *));
 		break;
+
+	case OVS_KEY_ATTR_CONN_MARK:
+		err = ovs_ct_set_mark(skb, flow_key, nla_get_u32(a),
+				      *get_mask(a, u32 *));
+		break;
 	}
 
 	return err;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index d911c4c..93d76a5 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -106,14 +106,23 @@ u16 ovs_ct_get_zone(const struct sk_buff *skb)
 	return ct ? nf_ct_zone(ct) : NF_CT_DEFAULT_ZONE;
 }
 
+u32 ovs_ct_get_mark(const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	return ct ? ct->mark : 0;
+}
+
 bool ovs_ct_state_valid(const struct sw_flow_key *key)
 {
 	return (key->phy.conn_state &&
 		key->phy.conn_state != OVS_CS_F_INVALID);
 }
 
-static int ovs_ct_lookup(struct net *net, struct nf_conn *tmpl,
-			 struct sw_flow_key *key, struct sk_buff *skb)
+static int ovs_ct_lookup__(struct net *net, struct nf_conn *tmpl,
+			   struct sw_flow_key *key, struct sk_buff *skb)
 {
 	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
 
@@ -138,14 +147,37 @@ static int ovs_ct_lookup(struct net *net, struct nf_conn *tmpl,
 	if (skb->nfct) {
 		key->phy.conn_state = ovs_ct_get_state(skb);
 		key->phy.conn_zone = ovs_ct_get_zone(skb);
+		key->phy.conn_mark = ovs_ct_get_mark(skb);
 	} else {
 		key->phy.conn_state = OVS_CS_F_TRACKED | OVS_CS_F_INVALID;
 		key->phy.conn_zone = zone;
+		key->phy.conn_mark = 0;
 	}
 
 	return 0;
 }
 
+static int ovs_ct_lookup(struct net *net, u16 zone, struct sw_flow_key *key,
+			 struct sk_buff *skb)
+{
+	struct nf_conntrack_tuple t;
+	struct nf_conn *tmpl = NULL;
+	int err;
+
+	if (zone != NF_CT_DEFAULT_ZONE) {
+		memset(&t, 0, sizeof(t));
+		tmpl = nf_conntrack_alloc(net, zone, &t, &t, GFP_KERNEL);
+		if (IS_ERR(tmpl))
+			return PTR_ERR(tmpl);
+	}
+
+	err = ovs_ct_lookup__(net, tmpl, key, skb);
+	if (tmpl)
+		nf_ct_put(tmpl);
+
+	return err;
+}
+
 int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
 		   const struct ovs_conntrack_info *info)
 {
@@ -161,7 +193,7 @@ int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
 	/* The conntrack module expects to be working at L3. */
 	skb_pull(skb, nh_ofs);
 
-	if (ovs_ct_lookup(net, tmpl, key, skb))
+	if (ovs_ct_lookup__(net, tmpl, key, skb))
 		goto err_push_skb;
 
 	if (info->flags & OVS_CT_F_COMMIT && ovs_ct_state_valid(key) &&
@@ -175,12 +207,72 @@ err_push_skb:
 	return err;
 }
 
+/* If conntrack is performed on a packet which is subsequently sent to
+ * userspace, then on execute the returned packet won't have conntrack
+ * available in the skb. Initialize it if it is needed.
+ *
+ * Typically this should boil down to a no-op.
+ */
+static int reinit_skb_nfct(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	struct net *net;
+	int err;
+
+	if (!ovs_ct_state_valid(key))
+		return -EINVAL;
+
+	net = ovs_get_net(skb);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	err = ovs_ct_lookup(net, key->phy.conn_zone, key, skb);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
+		    u32 conn_mark, u32 mask)
+{
+#ifdef CONFIG_NF_CONNTRACK_MARK
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+	u32 new_mark;
+	int err;
+
+	err = reinit_skb_nfct(skb, key);
+	if (err)
+		return err;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return -EINVAL;
+
+	new_mark = ct->mark;
+	OVS_SET_MASKED(new_mark, conn_mark, mask);
+	if (ct->mark != new_mark) {
+		ct->mark = new_mark;
+		nf_conntrack_event_cache(IPCT_MARK, ct);
+		key->phy.conn_mark = new_mark;
+	}
+
+	return 0;
+#else
+	return -ENOTSUPP;
+#endif
+}
+
 int ovs_ct_verify(u64 attrs)
 {
 #ifndef CONFIG_NF_CONNTRACK_ZONES
 	if (attrs & (1ULL << OVS_KEY_ATTR_CONN_ZONE))
 		return -ENOTSUPP;
 #endif
+#ifndef CONFIG_NF_CONNTRACK_MARK
+	if (attrs & (1ULL << OVS_KEY_ATTR_CONN_MARK))
+		return -ENOTSUPP;
+#endif
 	return 0;
 }
 
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 4bfdb13..d72e4f3 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -26,6 +26,9 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
 int ovs_ct_execute(struct sk_buff *, struct sw_flow_key *,
 		   const struct ovs_conntrack_info *);
 
+int ovs_ct_set_mark(struct sk_buff *, struct sw_flow_key *, u32 conn_mark,
+		    u32 mask);
+u32 ovs_ct_get_mark(const struct sk_buff *skb);
 u8 ovs_ct_get_state(const struct sk_buff *skb);
 u16 ovs_ct_get_zone(const struct sk_buff *skb);
 bool ovs_ct_state_valid(const struct sw_flow_key *key);
@@ -67,11 +70,22 @@ static inline u16 ovs_ct_get_zone(const struct sk_buff *skb)
 	return 0;
 }
 
+static inline u32 ovs_ct_get_mark(const struct sk_buff *skb)
+{
+	return 0;
+}
+
 static inline bool ovs_ct_state_valid(const struct sw_flow_key *key)
 {
 	return false;
 }
 
+static inline int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
+				  u32 conn_mark, u32 mask)
+{
+	return -ENOTSUPP;
+}
+
 static inline void ovs_ct_free_acts(struct sw_flow_actions *sf_acts) { }
 #endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index de1dbaa..2a7c6c9 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -708,6 +708,7 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
 	key->phy.skb_mark = skb->mark;
 	key->phy.conn_state = ovs_ct_get_state(skb);
 	key->phy.conn_zone = ovs_ct_get_zone(skb);
+	key->phy.conn_mark = ovs_ct_get_mark(skb);
 	key->ovs_flow_hash = 0;
 	key->recirc_id = 0;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index ad3779a..aa7eb1d 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -128,6 +128,7 @@ struct sw_flow_key {
 		u32	skb_mark;	/* SKB mark. */
 		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
 		u16	conn_zone;	/* Conntrack zone. */
+		u32	conn_mark;	/* Conntrack mark. */
 		u8	conn_state;	/* Connection state. */
 	} __packed phy; /* Safe when right after 'tun_key'. */
 	u32 ovs_flow_hash;		/* Datapath computed hash value.  */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 4264048..9c1d0c5 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -282,7 +282,7 @@ size_t ovs_key_attr_size(void)
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 24);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 25);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -293,6 +293,7 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_RECIRC_ID */
 		+ nla_total_size(1)   /* OVS_KEY_ATTR_CONN_STATE */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_CONN_ZONE */
+		+ nla_total_size(4)   /* OVS_KEY_ATTR_CONN_MARK */
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -344,6 +345,7 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_MPLS]	 = { .len = sizeof(struct ovs_key_mpls) },
 	[OVS_KEY_ATTR_CONN_STATE] = { .len = sizeof(u8) },
 	[OVS_KEY_ATTR_CONN_ZONE] = { .len = sizeof(u16) },
+	[OVS_KEY_ATTR_CONN_MARK] = { .len = sizeof(u32) },
 };
 
 static bool is_all_zero(const u8 *fp, size_t size)
@@ -787,6 +789,12 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 		SW_FLOW_KEY_PUT(match, phy.conn_zone, conn_zone, is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CONN_ZONE);
 	}
+	if (*attrs & (1ULL << OVS_KEY_ATTR_CONN_MARK)) {
+		uint32_t mark = nla_get_u32(a[OVS_KEY_ATTR_CONN_MARK]);
+
+		SW_FLOW_KEY_PUT(match, phy.conn_mark, mark, is_mask);
+		*attrs &= ~(1ULL << OVS_KEY_ATTR_CONN_MARK);
+	}
 	return 0;
 }
 
@@ -1339,6 +1347,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 	if (nla_put_u16(skb, OVS_KEY_ATTR_CONN_ZONE, output->phy.conn_zone))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, OVS_KEY_ATTR_CONN_MARK, output->phy.conn_mark))
+		goto nla_put_failure;
+
 	nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
 	if (!nla)
 		goto nla_put_failure;
@@ -1879,6 +1890,7 @@ static int validate_set(const struct nlattr *a,
 
 	case OVS_KEY_ATTR_PRIORITY:
 	case OVS_KEY_ATTR_SKB_MARK:
+	case OVS_KEY_ATTR_CONN_MARK:
 	case OVS_KEY_ATTR_ETHERNET:
 		break;
 
-- 
1.7.10.4



* [RFCv2 net-next 5/7] net: refactor ip_fragment()
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
                   ` (3 preceding siblings ...)
  2015-03-02 21:55 ` [RFCv2 net-next 4/7] openvswitch: Allow matching on conntrack mark Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-02 21:55 ` [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs Joe Stringer
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC (permalink / raw)
  To: netdev, Pablo Neira Ayuso
  Cc: Andy Zhou, linux-kernel, Justin Pettit, Thomas Graf, Patrick McHardy

From: Andy Zhou <azhou@nicira.com>

The current ip_fragment() API assumes there is a netdevice attached to
the skb. The MTU size is then derived from the attached device. However,
skbs incoming from OVS vports do not have a netdevice attached, so it is
not possible to query it for the MTU size.

This patch splits the original function into two pieces. The core
fragmentation logic is now provided by ip_fragment_mtu(). The callback
function passed to this API accepts two arguments: the skb and an
application-specific pointer. ip_fragment() retains the original API and
in turn calls ip_fragment_mtu() to do the work.

Future patches will make use of the new ip_fragment_mtu() from OVS
modules.
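
As a rough illustration of the split, here is a userspace sketch of
MTU-driven fragmentation with a callback that receives an opaque pointer.
It is not the kernel code: fragment_by_mtu(), frag_output_fn, struct
frag_stats and count_fragment() are hypothetical names, and the buffer
handling is simplified to a flat byte array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type mirroring the shape described above: the
 * fragment plus an opaque, caller-owned pointer. */
typedef int (*frag_output_fn)(const unsigned char *frag, size_t len,
			      void *output_arg);

/* Split 'payload' into pieces that fit in 'mtu' after a header of 'hlen'
 * bytes. Every piece except the last is a multiple of 8 bytes, because IPv4
 * fragment offsets are expressed in 8-byte units. Returns 0 on success or
 * the first non-zero value returned by 'output'. */
static int fragment_by_mtu(const unsigned char *payload, size_t len,
			   size_t mtu, size_t hlen, void *output_arg,
			   frag_output_fn output)
{
	size_t space, off = 0;

	assert(mtu >= hlen + 8);		/* room for at least one unit */
	space = (mtu - hlen) & ~(size_t)7;	/* data bytes per fragment */

	while (off < len) {
		size_t chunk = len - off;
		int err;

		if (chunk > space)
			chunk = space;
		err = output(payload + off, chunk, output_arg);
		if (err)
			return err;
		off += chunk;
	}
	return 0;
}

/* Example caller state passed through 'output_arg'. */
struct frag_stats {
	unsigned int count;
	size_t bytes;
};

static int count_fragment(const unsigned char *frag, size_t len, void *arg)
{
	struct frag_stats *stats = arg;

	(void)frag;
	stats->count++;
	stats->bytes += len;
	return 0;
}
```

The point of the extra pointer is that a caller without a netdevice can
still hand its own context (here a counter) to the output function, and the
MTU is supplied by the caller rather than looked up from a device.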

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 include/net/ip.h     |    3 ++
 net/ipv4/ip_output.c |  113 ++++++++++++++++++++++++++++----------------------
 2 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 025c61c..e73ac20 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -109,6 +109,9 @@ int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct sock *sk, struct sk_buff *skb);
 int ip_mc_output(struct sock *sk, struct sk_buff *skb);
 int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));
+int ip_fragment_mtu(struct sk_buff *skb, unsigned int mtu, unsigned int ll_rs,
+		    struct net_device *dev, void *output_arg,
+		    int (*output)(struct sk_buff *, void *output_arg));
 int ip_do_nat(struct sk_buff *skb);
 void ip_send_check(struct iphdr *ip);
 int __ip_local_out(struct sk_buff *skb);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d68199d..57ed8ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -472,54 +472,22 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	skb_copy_secmark(to, from);
 }
 
-/*
- *	This IP datagram is too large to be sent in one piece.  Break it up into
- *	smaller pieces (each of size equal to IP header plus
- *	a block of the data of the original IP data part) that will yet fit in a
- *	single device frame, and queue such a frame for sending.
- */
-
-int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
+int ip_fragment_mtu(struct sk_buff *skb, unsigned int mtu, unsigned int ll_rs,
+		    struct net_device *dev, void *output_arg,
+		    int (*output)(struct sk_buff *, void *output_arg))
 {
 	struct iphdr *iph;
 	int ptr;
-	struct net_device *dev;
 	struct sk_buff *skb2;
-	unsigned int mtu, hlen, left, len, ll_rs;
+	unsigned int hlen, left, len;
 	int offset;
 	__be16 not_last_frag;
-	struct rtable *rt = skb_rtable(skb);
 	int err = 0;
 
-	dev = rt->dst.dev;
-
-	/*
-	 *	Point into the IP datagram header.
-	 */
-
 	iph = ip_hdr(skb);
-
-	mtu = ip_skb_dst_mtu(skb);
-	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
-		     (IPCB(skb)->frag_max_size &&
-		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
-		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(mtu));
-		kfree_skb(skb);
-		return -EMSGSIZE;
-	}
-
-	/*
-	 *	Setup starting values.
-	 */
-
 	hlen = iph->ihl * 4;
 	mtu = mtu - hlen;	/* Size of data space */
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge)
-		mtu -= nf_bridge_mtu_reduction(skb);
-#endif
+
 	IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;
 
 	/* When frag_list is given, use it. First, check its validity:
@@ -592,10 +560,11 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				ip_send_check(iph);
 			}
 
-			err = output(skb);
+			err = output(skb, output_arg);
 
-			if (!err)
-				IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
+			if (!err && dev)
+				IP_INC_STATS(dev_net(dev),
+					     IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -605,7 +574,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+			if (dev)
+				IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
 			return 0;
 		}
 
@@ -614,7 +584,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		if (dev)
+			IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		return err;
 
 slow_path_clean:
@@ -636,10 +607,6 @@ slow_path:
 	left = skb->len - hlen;		/* Space per frame */
 	ptr = hlen;		/* Where to start from */
 
-	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
-	 * we need to make room for the encapsulating header
-	 */
-	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
 
 	/*
 	 *	Fragment the datagram.
@@ -732,21 +699,67 @@ slow_path:
 
 		ip_send_check(iph);
 
-		err = output(skb2);
+		err = output(skb2, output_arg);
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
+		if (dev)
+			IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
+	if (dev)
+		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+	if (dev)
+		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 	return err;
 }
+EXPORT_SYMBOL(ip_fragment_mtu);
+
+/* This IP datagram is too large to be sent in one piece.  Break it up into
+ * smaller pieces (each of size equal to IP header plus
+ * a block of the data of the original IP data part) that will yet fit in a
+ * single device frame, and queue such a frame for sending.
+ */
+int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
+{
+	struct iphdr *iph;
+	struct net_device *dev;
+	unsigned int mtu, ll_rs;
+	struct rtable *rt = skb_rtable(skb);
+
+	dev = rt->dst.dev;
+
+	/* Point into the IP datagram header.  */
+	iph = ip_hdr(skb);
+
+	mtu = ip_skb_dst_mtu(skb);
+	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
+		     (IPCB(skb)->frag_max_size &&
+		      IPCB(skb)->frag_max_size > mtu))) {
+		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+			  htonl(mtu));
+		kfree_skb(skb);
+		return -EMSGSIZE;
+	}
+
+	/* Setup starting values.  */
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	if (skb->nf_bridge)
+		mtu -= nf_bridge_mtu_reduction(skb);
+#endif
+	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
+	 * we need to make room for the encapsulating header
+	 */
+	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
+
+	return ip_fragment_mtu(skb, mtu, ll_rs, dev, NULL,
+			(int (*)(struct sk_buff *, void *output_arg))output);
+}
 EXPORT_SYMBOL(ip_fragment);
 
 int
-- 
1.7.10.4



* [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
                   ` (4 preceding siblings ...)
  2015-03-02 21:55 ` [RFCv2 net-next 5/7] net: refactor ip_fragment() Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-03  8:20   ` Patrick McHardy
  2015-03-02 21:55 ` [RFCv2 net-next 7/7] openvswitch: Support fragmented IPv4 packets for conntrack Joe Stringer
  2015-03-03  0:59 ` [RFCv2 net-next 0/7] OVS conntrack support Tom Herbert
  7 siblings, 1 reply; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC (permalink / raw)
  To: netdev, Pablo Neira Ayuso
  Cc: Andy Zhou, linux-kernel, Justin Pettit, Thomas Graf, Patrick McHardy

From: Andy Zhou <azhou@nicira.com>

Currently, ip_defrag() does not keep track of the maximum fragmentation
size for each fragmented packet. This information is not necessary since
current Linux IP fragmentation always fragments a packet based on output
devices' MTU.

However, this becomes more tricky when integrating with output ports that
do not have a netdevice attached, for example OVS vports. In this case,
the MTU of the output port is not always known. If the incoming maximum
fragment size is tracked during defragmentation, then these users can
refragment into reasonable sizes when sending the packets.

This patch modifies ip_defrag() to keep track of the maximum fragment
size for each packet and to report this size back to the caller once
the packet is successfully reassembled. The next patch makes use of
this.
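
A minimal userspace sketch of the bookkeeping described above, not the
kernel code itself. struct frag_queue, frag_queue_init() and
frag_queue_add() are hypothetical names made up for illustration, and
the 1/0 return convention is a simplification of what ip_frag_queue()
returns. Only the idea is taken from the patch: remember the largest
fragment seen, and hand it back through an optional pointer once
reassembly completes.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-datagram reassembly queue. */
struct frag_queue {
	unsigned int mru;       /* largest fragment length seen so far */
	unsigned int received;  /* fragments queued so far */
	unsigned int expected;  /* fragments needed to complete the datagram */
};

static void frag_queue_init(struct frag_queue *q, unsigned int expected)
{
	q->mru = 0;
	q->received = 0;
	q->expected = expected;
}

/* Queue one fragment of length len.  Returns 1 once the datagram is
 * complete and 0 while fragments are still missing.  On completion the
 * tracked maximum is written to *mru if the caller supplied a pointer;
 * callers that do not care pass NULL.
 */
static int frag_queue_add(struct frag_queue *q, unsigned int len,
			  unsigned int *mru)
{
	if (len > q->mru)
		q->mru = len;

	if (++q->received < q->expected)
		return 0;

	if (mru)
		*mru = q->mru;
	return 1;
}
```

Callers that have no use for the value pass NULL, which is how the
existing ip_defrag() users are converted in the diff below.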

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 drivers/net/macvlan.c               |    2 +-
 include/net/ip.h                    |   10 +++++---
 net/ipv4/ip_fragment.c              |   46 +++++++++++++++++++++++++----------
 net/ipv4/ip_input.c                 |    5 ++--
 net/ipv4/netfilter/nf_defrag_ipv4.c |    2 +-
 net/netfilter/ipvs/ip_vs_core.c     |    2 +-
 net/packet/af_packet.c              |    2 +-
 7 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 1df38bd..eb978e4 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -412,7 +412,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 
 	port = macvlan_port_get_rcu(skb->dev);
 	if (is_multicast_ether_addr(eth->h_dest)) {
-		skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN);
+		skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN, NULL);
 		if (!skb)
 			return RX_HANDLER_CONSUMED;
 		eth = eth_hdr(skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index e73ac20..5035deb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -494,11 +494,15 @@ enum ip_defrag_users {
 	IP_DEFRAG_MACVLAN,
 };
 
-int ip_defrag(struct sk_buff *skb, u32 user);
+int ip_defrag_net(struct net *net, struct sk_buff *skb, u32 user,
+		  unsigned int *mru);
+int ip_defrag(struct sk_buff *skb, u32 user, unsigned int *mru);
 #ifdef CONFIG_INET
-struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user);
+struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user,
+				unsigned int *mru);
 #else
-static inline struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+static inline struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user,
+					      unsigned int *mru)
 {
 	return skb;
 }
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e5b6d0d..313ca80 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -77,6 +77,7 @@ struct ipq {
 	u8		ecn; /* RFC3168 support */
 	int             iif;
 	unsigned int    rid;
+	unsigned int    mru; /* Maximum received packet fragment size */
 	struct inet_peer *peer;
 };
 
@@ -138,6 +139,7 @@ static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
 
 	const struct ip4_create_arg *arg = a;
 
+	qp->mru = 0;
 	qp->protocol = arg->iph->protocol;
 	qp->id = arg->iph->id;
 	qp->ecn = ip4_frag_ecn(arg->iph->tos);
@@ -315,7 +317,7 @@ static int ip_frag_reinit(struct ipq *qp)
 }
 
 /* Add new segment to existing queue. */
-static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
+static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb, unsigned int *mru)
 {
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
@@ -323,6 +325,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	int ihl, end;
 	int err = -ENOENT;
 	u8 ecn;
+	unsigned int len = skb->len;
 
 	if (qp->q.flags & INET_FRAG_COMPLETE)
 		goto err;
@@ -396,6 +399,12 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 found:
+	/* Maintain maximum received unit size of all the fragments we
+	 * have seen so far.
+	 */
+	if (len > qp->mru)
+		qp->mru = len;
+
 	/* We found where to put this one.  Check for overlap with
 	 * preceding fragment, and, if needed, align things so that
 	 * any overlaps are eliminated.
@@ -485,6 +494,8 @@ found:
 		skb->_skb_refdst = 0UL;
 		err = ip_frag_reasm(qp, prev, dev);
 		skb->_skb_refdst = orefdst;
+		if (!err && mru)
+			*mru = qp->mru;
 		return err;
 	}
 
@@ -628,39 +639,48 @@ out_fail:
 	return err;
 }
 
-/* Process an incoming IP datagram fragment. */
-int ip_defrag(struct sk_buff *skb, u32 user)
+int ip_defrag_net(struct net *net, struct sk_buff *skb, u32 user,
+		  unsigned int *mru)
 {
 	struct ipq *qp;
-	struct net *net;
 
-	net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
-
-	/* Lookup (or create) queue header */
 	if ((qp = ip_find(net, ip_hdr(skb), user)) != NULL) {
 		int ret;
 
 		spin_lock(&qp->q.lock);
-
-		ret = ip_frag_queue(qp, skb);
-
+		ret = ip_frag_queue(qp, skb, mru);
 		spin_unlock(&qp->q.lock);
+
 		ipq_put(qp);
 		return ret;
 	}
-
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
 	kfree_skb(skb);
 	return -ENOMEM;
 }
+EXPORT_SYMBOL(ip_defrag_net);
+
+/* Process an incoming IP datagram fragment. */
+int ip_defrag(struct sk_buff *skb, u32 user, unsigned int *mru)
+{
+	struct net *net;
+
+	net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
+
+	return ip_defrag_net(net, skb, user, mru);
+}
 EXPORT_SYMBOL(ip_defrag);
 
-struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user,
+				unsigned int *mru)
 {
 	struct iphdr iph;
 	u32 len;
 
+	if (mru)
+		*mru = 0;
+
 	if (skb->protocol != htons(ETH_P_IP))
 		return skb;
 
@@ -682,7 +702,7 @@ struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
 			if (pskb_trim_rcsum(skb, len))
 				return skb;
 			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-			if (ip_defrag(skb, user))
+			if (ip_defrag(skb, user, mru))
 				return NULL;
 			skb_clear_hash(skb);
 		}
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 3d4da2c..d59e3f6 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -168,7 +168,8 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 		     sk->sk_bound_dev_if == dev->ifindex) &&
 		    net_eq(sock_net(sk), dev_net(dev))) {
 			if (ip_is_fragment(ip_hdr(skb))) {
-				if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN))
+				if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN,
+					      NULL))
 					return true;
 			}
 			if (last) {
@@ -249,7 +250,7 @@ int ip_local_deliver(struct sk_buff *skb)
 	 */
 
 	if (ip_is_fragment(ip_hdr(skb))) {
-		if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
+		if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER, NULL))
 			return 0;
 	}
 
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 7e5ca6f..8bbe4df 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -29,7 +29,7 @@ static int nf_ct_ipv4_gather_frags(struct sk_buff *skb, u_int32_t user)
 	skb_orphan(skb);
 
 	local_bh_disable();
-	err = ip_defrag(skb, user);
+	err = ip_defrag(skb, user, NULL);
 	local_bh_enable();
 
 	if (!err) {
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b87ca32..dab1f3d 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -651,7 +651,7 @@ static inline int ip_vs_gather_frags(struct sk_buff *skb, u_int32_t user)
 	int err;
 
 	local_bh_disable();
-	err = ip_defrag(skb, user);
+	err = ip_defrag(skb, user, NULL);
 	local_bh_enable();
 	if (!err)
 		ip_send_check(ip_hdr(skb));
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 9db8369..0d8a8d2 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1359,7 +1359,7 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 	case PACKET_FANOUT_HASH:
 	default:
 		if (fanout_has_flag(f, PACKET_FANOUT_FLAG_DEFRAG)) {
-			skb = ip_check_defrag(skb, IP_DEFRAG_AF_PACKET);
+			skb = ip_check_defrag(skb, IP_DEFRAG_AF_PACKET, NULL);
 			if (!skb)
 				return 0;
 		}
-- 
1.7.10.4



* [RFCv2 net-next 7/7] openvswitch: Support fragmented IPv4 packets for conntrack
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
                   ` (5 preceding siblings ...)
  2015-03-02 21:55 ` [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs Joe Stringer
@ 2015-03-02 21:55 ` Joe Stringer
  2015-03-03  0:59 ` [RFCv2 net-next 0/7] OVS conntrack support Tom Herbert
  7 siblings, 0 replies; 11+ messages in thread
From: Joe Stringer @ 2015-03-02 21:55 UTC (permalink / raw)
  To: netdev, Pablo Neira Ayuso
  Cc: Andy Zhou, linux-kernel, Justin Pettit, Thomas Graf, Patrick McHardy

From: Andy Zhou <azhou@nicira.com>

The conntrack action now reassembles fragmented IPv4 packets and only
sends a fully reassembled IP packet to the nf_conntrack layer.

When a reassembled IP frame hits the output action, the output action
refragments it into IP fragments based on the packet's incoming
fragment size.
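
A rough sketch of the output-side decision, assuming the same length
check that do_output() uses in the diff below. needs_refrag() and
frag_count() are hypothetical helpers written for illustration; the
fragment count follows the usual IPv4 rule that every fragment except
the last carries a payload that is a multiple of 8 bytes.

```c
#include <stdbool.h>

#define ETH_HLEN 14	/* Ethernet header length in bytes */

/* True when a frame of frame_len bytes (Ethernet header included) must
 * be refragmented before output.  An mru of 0 means the packet was never
 * reassembled, so it is sent unchanged.
 */
static bool needs_refrag(unsigned int frame_len, unsigned int mru)
{
	return mru != 0 && frame_len > mru + ETH_HLEN;
}

/* Number of IPv4 fragments produced when a datagram of ip_len bytes
 * (header included, ip_len >= ihl) with an ihl-byte header is split so
 * that no fragment exceeds mru bytes.  Returns 0 if mru is too small to
 * hold the header plus one 8-byte payload unit.
 */
static unsigned int frag_count(unsigned int ip_len, unsigned int ihl,
			       unsigned int mru)
{
	unsigned int payload, per_frag;

	if (ip_len <= mru)
		return 1;
	if (mru < ihl + 8)
		return 0;

	payload = ip_len - ihl;
	per_frag = (mru - ihl) & ~7u;	/* round down to a multiple of 8 */
	return (payload + per_frag - 1) / per_frag;
}
```

For example, a 4020-byte datagram with a 20-byte header and an MRU of
1500 has 4000 bytes of payload and 1480 bytes of payload per fragment,
so it is sent as three fragments.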

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 include/uapi/linux/openvswitch.h |    5 ++-
 net/openvswitch/actions.c        |   78 ++++++++++++++++++++++++++++++++++----
 net/openvswitch/conntrack.c      |   43 ++++++++++++++++++++-
 net/openvswitch/datapath.c       |   40 ++++++++++++++++---
 net/openvswitch/datapath.h       |    6 +++
 net/openvswitch/vport.c          |    1 +
 6 files changed, 157 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 30d70a3..b947544 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -162,7 +162,9 @@ enum ovs_packet_cmd {
  * %OVS_USERSPACE_ATTR_EGRESS_TUN_PORT attribute, which is sent only if the
  * output port is actually a tunnel port. Contains the output tunnel key
  * extracted from the packet as nested %OVS_TUNNEL_KEY_ATTR_* attributes.
- *
+ * @OVS_PACKET_ATTR_MRU: Present for an %OVS_PACKET_CMD_ACTION and
+ * %OVS_PACKET_ATTR_USERSPACE action; specifies the maximum received fragment
+ * size.
  * These attributes follow the &struct ovs_header within the Generic Netlink
  * payload for %OVS_PACKET_* commands.
  */
@@ -178,6 +180,7 @@ enum ovs_packet_attr {
 	OVS_PACKET_ATTR_UNUSED2,
 	OVS_PACKET_ATTR_PROBE,      /* Packet operation is a feature probe,
 				       error logging should be suppressed. */
+	OVS_PACKET_ATTR_MRU,          /* Maximum received IP fragment size. */
 	__OVS_PACKET_ATTR_MAX
 };
 
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 9bd9f99..789e53a 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -53,6 +53,11 @@ struct deferred_action {
 	struct sw_flow_key pkt_key;
 };
 
+struct vport_frag_output_info {
+	struct vport *vport;
+	struct sw_flow_key *key;
+};
+
 #define DEFERRED_ACTION_FIFO_SIZE 10
 struct action_fifo {
 	int head;
@@ -595,14 +600,67 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
 	return 0;
 }
 
-static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
+/* Given an IP frame, reconstruct its MAC header based on flow.  */
+int ovs_setup_l2_header(struct sk_buff *skb, struct sw_flow_key *key)
+{
+	int err;
+
+	err = skb_ensure_writable(skb, ETH_HLEN);
+	if (unlikely(err))
+		return err;
+
+	__skb_push(skb, ETH_HLEN);
+	skb_reset_mac_header(skb);
+
+	ether_addr_copy(eth_hdr(skb)->h_source, key->eth.src);
+	ether_addr_copy(eth_hdr(skb)->h_dest, key->eth.dst);
+	eth_hdr(skb)->h_proto = key->eth.type;
+
+	return 0;
+}
+
+static int ovs_vport_output(struct sk_buff *skb, void *output_arg)
+{
+	struct vport_frag_output_info *arg =
+		(struct vport_frag_output_info *)output_arg;
+	struct sw_flow_key *key = arg->key;
+	struct vport *vport = arg->vport;
+	int err;
+
+	err = ovs_setup_l2_header(skb, key);
+	if (err) {
+		kfree_skb(skb);
+		return err;
+	}
+	ovs_vport_send(vport, skb);
+
+	return 0;
+}
+
+static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
+		      struct sw_flow_key *key)
 {
 	struct vport *vport = ovs_vport_rcu(dp, out_port);
+	unsigned int mru = OVS_CB(skb)->mru;
 
-	if (likely(vport))
-		ovs_vport_send(vport, skb);
-	else
+	if (likely(vport)) {
+		if (!mru || (skb->len <= mru + ETH_HLEN)) {
+			ovs_vport_send(vport, skb);
+		} else if (key->eth.type == htons(ETH_P_IP)) {
+			struct vport_frag_output_info arg;
+			unsigned int mtu = mru;
+
+			arg.vport = vport;
+			arg.key = key;
+
+			skb_pull(skb, ETH_HLEN);
+
+			ip_fragment_mtu(skb, mtu, LL_MAX_HEADER, NULL, &arg,
+					ovs_vport_output);
+		}
+	} else {
 		kfree_skb(skb);
+	}
 }
 
 static int output_userspace(struct datapath *dp, struct sk_buff *skb,
@@ -617,6 +675,7 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
 	upcall.userdata = NULL;
 	upcall.portid = 0;
 	upcall.egress_tun_info = NULL;
+	upcall.mru = OVS_CB(skb)->mru;
 
 	for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
 		 a = nla_next(a, &rem)) {
@@ -865,7 +924,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 			struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC);
 
 			if (out_skb)
-				do_output(dp, out_skb, prev_port);
+				do_output(dp, out_skb, prev_port, key);
 
 			prev_port = -1;
 		}
@@ -929,13 +988,18 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		}
 
 		if (unlikely(err)) {
-			kfree_skb(skb);
+			/* Hide stolen fragments from user space. */
+			if (err == -EINPROGRESS)
+				err = 0;
+			else
+				kfree_skb(skb);
+
 			return err;
 		}
 	}
 
 	if (prev_port != -1)
-		do_output(dp, skb, prev_port);
+		do_output(dp, skb, prev_port, key);
 	else
 		consume_skb(skb);
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 93d76a5..793d489 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -178,21 +178,60 @@ static int ovs_ct_lookup(struct net *net, u16 zone, struct sw_flow_key *key,
 	return err;
 }
 
+static int handle_fragments(struct net *net, u16 zone, struct sk_buff *skb,
+			    struct sw_flow_key *key)
+{
+	if (key->eth.type == htons(ETH_P_IP)) {
+		if (ip_is_fragment(ip_hdr(skb))) {
+			struct ovs_skb_cb ovs_cb = *OVS_CB(skb);
+			int nh_ofs = skb_network_offset(skb);
+			enum ip_defrag_users user;
+			unsigned int mru;
+			int err;
+
+			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+			user = IP_DEFRAG_CONNTRACK_IN + zone;
+			skb_pull(skb, nh_ofs);
+			err = ip_defrag_net(net, skb, user, &mru);
+			if (err)
+				return err;
+
+			/* Got a reassembled IP frame */
+			skb_clear_hash(skb);
+			ip_send_check(ip_hdr(skb));
+			skb->ignore_df = 1;
+			err = ovs_setup_l2_header(skb, key);
+			if (err)
+				return err;
+
+			ovs_cb.mru = mru;
+			*OVS_CB(skb) = ovs_cb;
+		}
+	} /* XXX Handle IPv6 */
+
+	return 0;
+}
+
 int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
 		   const struct ovs_conntrack_info *info)
 {
 	struct net *net;
-	int nh_ofs = skb_network_offset(skb);
 	struct nf_conn *tmpl = info->ct;
-	int err = -EINVAL;
+	int nh_ofs, err;
 
 	net = ovs_get_net(skb);
 	if (IS_ERR(net))
 		return PTR_ERR(net);
 
+	err = handle_fragments(net, info->zone, skb, key);
+	if (err)
+		return err;
+
 	/* The conntrack module expects to be working at L3. */
+	nh_ofs = skb_network_offset(skb);
 	skb_pull(skb, nh_ofs);
 
+	err = -EINVAL;
 	if (ovs_ct_lookup__(net, tmpl, key, skb))
 		goto err_push_skb;
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 46f67ee..1340f21 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -277,6 +277,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 		upcall.userdata = NULL;
 		upcall.portid = ovs_vport_find_upcall_portid(p, skb);
 		upcall.egress_tun_info = NULL;
+		upcall.mru = OVS_CB(skb)->mru;
 		error = ovs_dp_upcall(dp, skb, key, &upcall);
 		if (unlikely(error))
 			kfree_skb(skb);
@@ -398,9 +399,23 @@ static size_t upcall_msg_size(const struct dp_upcall_info *upcall_info,
 	if (upcall_info->egress_tun_info)
 		size += nla_total_size(ovs_tun_key_attr_size());
 
+	/* OVS_PACKET_ATTR_MRU */
+	if (upcall_info->mru)
+		size += nla_total_size(sizeof(unsigned int));
+
 	return size;
 }
 
+static void pad_packet(struct datapath *dp, struct sk_buff *skb)
+{
+	if (!(dp->user_features & OVS_DP_F_UNALIGNED)) {
+		size_t plen = NLA_ALIGN(skb->len) - skb->len;
+
+		if (plen > 0)
+			memset(skb_put(skb, plen), 0, plen);
+	}
+}
+
 static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 				  const struct sw_flow_key *key,
 				  const struct dp_upcall_info *upcall_info)
@@ -479,6 +494,16 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 		nla_nest_end(user_skb, nla);
 	}
 
+	/* Add OVS_PACKET_ATTR_MRU */
+	if (upcall_info->mru) {
+		if (nla_put_u16(user_skb, OVS_PACKET_ATTR_MRU,
+				upcall_info->mru)) {
+			err = -ENOBUFS;
+			goto out;
+		}
+		pad_packet(dp, user_skb);
+	}
+
 	/* Only reserve room for attribute header, packet data is added
 	 * in skb_zerocopy() */
 	if (!(nla = nla_reserve(user_skb, OVS_PACKET_ATTR_PACKET, 0))) {
@@ -492,12 +517,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 		goto out;
 
 	/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
-	if (!(dp->user_features & OVS_DP_F_UNALIGNED)) {
-		size_t plen = NLA_ALIGN(user_skb->len) - user_skb->len;
-
-		if (plen > 0)
-			memset(skb_put(user_skb, plen), 0, plen);
-	}
+	pad_packet(dp, user_skb);
 
 	((struct nlmsghdr *) user_skb->data)->nlmsg_len = user_skb->len;
 
@@ -526,6 +546,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	int len;
 	int err;
 	bool log = !a[OVS_PACKET_ATTR_PROBE];
+	unsigned int mru;
 
 	err = -EINVAL;
 	if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] ||
@@ -552,6 +573,12 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	else
 		packet->protocol = htons(ETH_P_802_2);
 
+	/* Set packet's mru */
+	mru = 0;
+	if (a[OVS_PACKET_ATTR_MRU])
+		mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
+	OVS_CB(packet)->mru = mru;
+
 	/* Build an sw_flow for sending this packet. */
 	flow = ovs_flow_alloc();
 	err = PTR_ERR(flow);
@@ -612,6 +639,7 @@ static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
 	[OVS_PACKET_ATTR_KEY] = { .type = NLA_NESTED },
 	[OVS_PACKET_ATTR_ACTIONS] = { .type = NLA_NESTED },
 	[OVS_PACKET_ATTR_PROBE] = { .type = NLA_FLAG },
+	[OVS_PACKET_ATTR_MRU] = { .type = NLA_U16 },
 };
 
 static const struct genl_ops dp_packet_genl_ops[] = {
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 9661a01..cfbdda1 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -98,10 +98,13 @@ struct datapath {
  * NULL if the packet is not being tunneled.
  * @input_vport: The original vport packet came in on. This value is cached
  * when a packet is received by OVS.
+ * @mru: The maximum received fragment size; 0 if the packet is not
+ * fragmented.
  */
 struct ovs_skb_cb {
 	struct ovs_tunnel_info  *egress_tun_info;
 	struct vport		*input_vport;
+	unsigned int		mru;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
@@ -114,12 +117,14 @@ struct ovs_skb_cb {
  * then no packet is sent and the packet is accounted in the datapath's @n_lost
  * counter.
  * @egress_tun_info: If nonnull, becomes %OVS_PACKET_ATTR_EGRESS_TUN_KEY.
+ * @mru: If nonzero, the maximum received IP fragment size.
  */
 struct dp_upcall_info {
 	const struct ovs_tunnel_info *egress_tun_info;
 	const struct nlattr *userdata;
 	u32 portid;
 	u8 cmd;
+	unsigned int mru;
 };
 
 /**
@@ -198,6 +203,7 @@ void ovs_dp_notify_wq(struct work_struct *work);
 
 int action_fifos_init(void);
 void action_fifos_exit(void);
+int ovs_setup_l2_header(struct sk_buff *skb, struct sw_flow_key *key);
 
 /* 'KEY' must not have any bits set outside of the 'MASK' */
 #define OVS_MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index ec2954f..184dd51 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -486,6 +486,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 
 	OVS_CB(skb)->input_vport = vport;
 	OVS_CB(skb)->egress_tun_info = NULL;
+	OVS_CB(skb)->mru = 0;
 	/* Extract flow from 'skb' into 'key'. */
 	error = ovs_flow_key_extract(tun_info, skb, &key);
 	if (unlikely(error)) {
-- 
1.7.10.4



* Re: [RFCv2 net-next 0/7] OVS conntrack support
  2015-03-02 21:54 [RFCv2 net-next 0/7] OVS conntrack support Joe Stringer
                   ` (6 preceding siblings ...)
  2015-03-02 21:55 ` [RFCv2 net-next 7/7] openvswitch: Support fragmented IPv4 packets for conntrack Joe Stringer
@ 2015-03-03  0:59 ` Tom Herbert
  7 siblings, 0 replies; 11+ messages in thread
From: Tom Herbert @ 2015-03-03  0:59 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Linux Netdev List, Pablo Neira Ayuso, LKML, Justin Pettit,
	Andy Zhou, Thomas Graf, Patrick McHardy

On Mon, Mar 2, 2015 at 1:54 PM, Joe Stringer <joestringer@nicira.com> wrote:
> The goal of this series is to allow OVS to send packets through the Linux
> kernel connection tracker, and subsequently match on fields populated by
> conntrack.
>
> Sending this out as another RFC change as this is the first time IP fragment
> support is included. Only IPv4 is added right now, as we'd like to get some
> feedback on that approach before we implement IPv6 frag support.
>
> Helper support is also yet to be addressed, for tracking a particular flow a la
> iptables CT targets. I think this is just a matter of having userspace specify
> the helper to use (eg via 8-bit field in conntrack action), and setting up the
> conntrack template accordingly when OVS first installs the flow containing a
> conntrack action.
>
> There are some additional related items that I intend to work on, which I do
> not see as prerequisite for this series:
> - OVS Connlabel support.
> - Allow OVS to register logging facilities for conntrack.
> - Conntrack per-zone configuration.
>
> The branch below has been updated with the corresponding userspace pieces:
> https://github.com/justinpettit/ovs/tree/conntrack
>
>
> RFCv2:
> - Support IPv4 fragments
> - Warn when ct->net is different from skb net in skb_has_valid_nfct().
> - Set OVS_CS_F_TRACKED when a flow cannot be identified ("invalid")
> - Continue processing packets when conntrack marks the flow invalid.
> - Use PF_INET6 family when sending IPv6 packets to conntrack.
> - Verify conn_* matches when deserializing metadata from netlink.
> - Only allow conntrack action on IPv4/IPv6 packets.
> - Remove explicit dependencies on conn_zone, conn_mark.
> - General tidyups
>
> RFCv1:
> - Rebase to net-next.
> - Add conn_zone field to the flow key.
> - Add explicit dependencies on conn_zone, conn_mark.
> - Refactor conntrack changes into net/openvswitch/ovs_conntrack.*.
> - Don't allow set_field() actions to change conn_state, conn_zone.
> - Add OVS_CS_F_* flags to indicate connection state.
> - Add "invalid" connection state.
>
>
> Andy Zhou (3):
>   net: refactor ip_fragment()
>   net: Refactor ip_defrag() APIs
>   openvswitch: Support fragmented IPv4 packets for conntrack
>
> Joe Stringer (2):
>   openvswitch: Serialize acts with original netlink len
>   openvswitch: Move MASKED* macros to datapath.h
>
> Justin Pettit (2):
>   openvswitch: Add conntrack action
>   openvswitch: Allow matching on conntrack mark
>
>  drivers/net/macvlan.c               |    2 +-
>  include/net/ip.h                    |   13 +-
>  include/uapi/linux/openvswitch.h    |   42 +++-
>  net/ipv4/ip_fragment.c              |   46 ++--
>  net/ipv4/ip_input.c                 |    5 +-
>  net/ipv4/ip_output.c                |  113 +++++----

This is a lot of change to core IP. It probably should be done in its
own patch set.

>  net/ipv4/netfilter/nf_defrag_ipv4.c |    2 +-
>  net/netfilter/ipvs/ip_vs_core.c     |    2 +-
>  net/openvswitch/Kconfig             |   11 +
>  net/openvswitch/Makefile            |    1 +
>  net/openvswitch/actions.c           |  140 +++++++++---
>  net/openvswitch/conntrack.c         |  427 +++++++++++++++++++++++++++++++++++
>  net/openvswitch/conntrack.h         |   91 ++++++++
>  net/openvswitch/datapath.c          |   60 +++--
>  net/openvswitch/datapath.h          |   10 +
>  net/openvswitch/flow.c              |    4 +
>  net/openvswitch/flow.h              |    4 +
>  net/openvswitch/flow_netlink.c      |   95 ++++++--
>  net/openvswitch/flow_netlink.h      |    4 +-
>  net/openvswitch/vport.c             |    1 +
>  net/packet/af_packet.c              |    2 +-
>  21 files changed, 938 insertions(+), 137 deletions(-)
>  create mode 100644 net/openvswitch/conntrack.c
>  create mode 100644 net/openvswitch/conntrack.h
>
> --
> 1.7.10.4


* Re: [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs
  2015-03-02 21:55 ` [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs Joe Stringer
@ 2015-03-03  8:20   ` Patrick McHardy
  2015-03-03 19:55     ` Andy Zhou
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2015-03-03  8:20 UTC (permalink / raw)
  To: Joe Stringer
  Cc: netdev, Pablo Neira Ayuso, Andy Zhou, linux-kernel,
	Justin Pettit, Thomas Graf

On 02.03, Joe Stringer wrote:
> From: Andy Zhou <azhou@nicira.com>
> 
> Currently, ip_defrag() does not keep track of the maximum fragmentation
> size for each fragmented packet. This information is not necessary since
> current Linux IP fragmentation always fragments a packet based on output
> devices' MTU.

It does, search for max_size.


* Re: [RFCv2 net-next 6/7] net: Refactor ip_defrag() APIs
  2015-03-03  8:20   ` Patrick McHardy
@ 2015-03-03 19:55     ` Andy Zhou
  0 siblings, 0 replies; 11+ messages in thread
From: Andy Zhou @ 2015-03-03 19:55 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Joe Stringer, netdev, Pablo Neira Ayuso, linux-kernel,
	Justin Pettit, Thomas Graf

Ah, I missed it.  Currently it is passed via IPCB. Would it be better
to pass it as a parameter?

On Tue, Mar 3, 2015 at 12:20 AM, Patrick McHardy <kaber@trash.net> wrote:
> On 02.03, Joe Stringer wrote:
>> From: Andy Zhou <azhou@nicira.com>
>>
>> Currently, ip_defrag() does not keep track of the maximum fragmentation
>> size for each fragmented packet. This information is not necessary since
>> current Linux IP fragmentation always fragments a packet based on output
>> devices' MTU.
>
> It does, search for max_size.

