All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
@ 2017-02-03  1:10 Jarno Rajahalme
  2017-02-03  1:10 ` [PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array Jarno Rajahalme
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

When looking for an existing conntrack entry, the packet 5-tuple
must be inverted if NAT has already been applied, as the current
packet headers do not match any conntrack tuple.  For
example, if a packet from private address X to a public address B is
source-NATted to A, the conntrack entry will have the following tuples
(ignoring the protocol and port numbers) after the conntrack entry is
committed:

Original direction tuple: (X,B)
Reply direction tuple: (B,A)

Now, if a reply packet is already transformed back to the private
address space (e.g., with a CT(nat) action), the tuple corresponding
to the current packet headers is:

Current packet tuple: (B,X)

This does not match either of the conntrack tuples above.  Normally
this does not matter, as the conntrack lookup was already done using
the tuple (B,A), but if the current packet does not match any flow in
the OVS datapath, the packet is sent to userspace via an upcall,
during which the packet's skb is freed, and the conntrack entry
pointer in the skb is lost.  When the packet is reintroduced to the
datapath, any further conntrack action will need to perform a new
conntrack lookup to find the entry again.  Prior to this patch this
second lookup failed for NATted packets.  The datapath flow setup
corresponding to the upcall can succeed, however, allowing all further
packets in the reply direction to re-use the conntrack entry pointer
in the skb, so typically the lookup failure only causes a packet drop.

The solution is to invert the tuple derived from the current packet
headers in case the conntrack state stored in the packet metadata
indicates that the packet has been transformed by NAT:

Inverted tuple: (X,B)

With this the conntrack entry can be found, matching the original
direction tuple.

This same logic also works for the original direction packets:

Current packet tuple (after NAT): (A,B)
Inverted tuple: (B,A)

While the current packet tuple (A,B) does not match either of the
conntrack tuples, the inverted one (B,A) does match the reply
direction tuple.

Since the inverted tuple matches the reverse direction tuple the
direction of the packet must be reversed as well.

Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/openvswitch/conntrack.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 54253ea..b91baa2 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -430,7 +430,7 @@ ovs_ct_get_info(const struct nf_conntrack_tuple_hash *h)
  */
 static struct nf_conn *
 ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone,
-		     u8 l3num, struct sk_buff *skb)
+		     u8 l3num, struct sk_buff *skb, bool natted)
 {
 	struct nf_conntrack_l3proto *l3proto;
 	struct nf_conntrack_l4proto *l4proto;
@@ -453,6 +453,17 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone,
 		return NULL;
 	}
 
+	/* Must invert the tuple if skb has been transformed by NAT. */
+	if (natted) {
+		struct nf_conntrack_tuple inverse;
+
+		if (!nf_ct_invert_tuple(&inverse, &tuple, l3proto, l4proto)) {
+			pr_debug("ovs_ct_find_existing: Inversion failed!\n");
+			return NULL;
+		}
+		tuple = inverse;
+	}
+
 	/* look for tuple match */
 	h = nf_conntrack_find_get(net, zone, &tuple);
 	if (!h)
@@ -460,6 +471,13 @@ ovs_ct_find_existing(struct net *net, const struct nf_conntrack_zone *zone,
 
 	ct = nf_ct_tuplehash_to_ctrack(h);
 
+	/* Inverted packet tuple matches the reverse direction conntrack tuple,
+	 * select the other tuplehash to get the right 'ctinfo' bits for this
+	 * packet.
+	 */
+	if (natted)
+		h = &ct->tuplehash[!h->tuple.dst.dir];
+
 	skb->nfct = &ct->ct_general;
 	skb->nfctinfo = ovs_ct_get_info(h);
 	return ct;
@@ -483,7 +501,9 @@ static bool skb_nfct_cached(struct net *net,
 	if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
 	    !(key->ct.state & OVS_CS_F_INVALID) &&
 	    key->ct.zone == info->zone.id)
-		ct = ovs_ct_find_existing(net, &info->zone, info->family, skb);
+		ct = ovs_ct_find_existing(net, &info->zone, info->family, skb,
+					  !!(key->ct.state
+					     & OVS_CS_F_NAT_MASK));
 	if (!ct)
 		return false;
 	if (!net_eq(net, read_pnet(&ct->ct_net)))
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-03  1:10 ` [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection Jarno Rajahalme
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

Make the array of labels in struct ovs_key_ct_label an union, adding a
u32 array of the same byte size as the existing u8 array.  It is
faster to loop through the labels 32 bits at the time, which is also
the alignment of netlink attributes.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 include/uapi/linux/openvswitch.h | 8 ++++++--
 net/openvswitch/conntrack.c      | 9 +++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 375d812..96aee34 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -446,9 +446,13 @@ struct ovs_key_nd {
 	__u8	nd_tll[ETH_ALEN];
 };
 
-#define OVS_CT_LABELS_LEN	16
+#define OVS_CT_LABELS_LEN_32	4
+#define OVS_CT_LABELS_LEN	(OVS_CT_LABELS_LEN_32 * sizeof(__u32))
 struct ovs_key_ct_labels {
-	__u8	ct_labels[OVS_CT_LABELS_LEN];
+	union {
+		__u8	ct_labels[OVS_CT_LABELS_LEN];
+		__u32	ct_labels_32[OVS_CT_LABELS_LEN_32];
+	};
 };
 
 /* OVS_KEY_ATTR_CT_STATE flags */
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index b91baa2..6730f09 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -277,8 +277,9 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
 	if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN)
 		return -ENOSPC;
 
-	err = nf_connlabels_replace(ct, (u32 *)labels, (u32 *)mask,
-				    OVS_CT_LABELS_LEN / sizeof(u32));
+	err = nf_connlabels_replace(ct, labels->ct_labels_32,
+				    mask->ct_labels_32,
+				    OVS_CT_LABELS_LEN_32);
 	if (err)
 		return err;
 
@@ -852,8 +853,8 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 {
 	size_t i;
 
-	for (i = 0; i < sizeof(*labels); i++)
-		if (labels->ct_labels[i])
+	for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
+		if (labels->ct_labels_32[i])
 			return true;
 
 	return false;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
  2017-02-03  1:10 ` [PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-06 21:46   ` Joe Stringer
  2017-02-03  1:10 ` [PATCH net-next 4/7] openvswitch: Inherit master's labels Jarno Rajahalme
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

Avoid triggering change events for setting conntrack mark or labels
before the conntrack entry has been confirmed.  Refactoring on this
patch also makes chenges in later patches easier to review.

Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark")
Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/openvswitch/conntrack.c | 87 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 63 insertions(+), 24 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 6730f09..a163c44 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -229,23 +229,17 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
 	return 0;
 }
 
-static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
+static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key,
 			   u32 ct_mark, u32 mask)
 {
 #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
-	enum ip_conntrack_info ctinfo;
-	struct nf_conn *ct;
 	u32 new_mark;
 
-	/* The connection could be invalid, in which case set_mark is no-op. */
-	ct = nf_ct_get(skb, &ctinfo);
-	if (!ct)
-		return 0;
-
 	new_mark = ct_mark | (ct->mark & ~(mask));
 	if (ct->mark != new_mark) {
 		ct->mark = new_mark;
-		nf_conntrack_event_cache(IPCT_MARK, ct);
+		if (nf_ct_is_confirmed(ct))
+			nf_conntrack_event_cache(IPCT_MARK, ct);
 		key->ct.mark = new_mark;
 	}
 
@@ -255,26 +249,59 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
 #endif
 }
 
-static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
-			     const struct ovs_key_ct_labels *labels,
-			     const struct ovs_key_ct_labels *mask)
+static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
 {
-	enum ip_conntrack_info ctinfo;
 	struct nf_conn_labels *cl;
-	struct nf_conn *ct;
-	int err;
-
-	/* The connection could be invalid, in which case set_label is no-op.*/
-	ct = nf_ct_get(skb, &ctinfo);
-	if (!ct)
-		return 0;
 
 	cl = nf_ct_labels_find(ct);
 	if (!cl) {
 		nf_ct_labels_ext_add(ct);
 		cl = nf_ct_labels_find(ct);
 	}
+
 	if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN)
+		return NULL;
+
+	return cl;
+}
+
+/* Initialize labels for a new, to be committed conntrack entry.  Note that
+ * since the new connection is not yet confirmed, and thus no-one else has
+ * access to it's labels, we simply write them over.  Also, we refrain from
+ * triggering events, as receiving change events before the create event would
+ * be confusing.
+ */
+static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
+			      const struct ovs_key_ct_labels *labels,
+			      const struct ovs_key_ct_labels *mask)
+{
+	struct nf_conn_labels *cl;
+	u32 *dst;
+	int i;
+
+	cl = ovs_ct_get_conn_labels(ct);
+	if (!cl)
+		return -ENOSPC;
+
+	dst = (u32 *)cl->bits;
+	for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
+		dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
+			(labels->ct_labels_32[i] & mask->ct_labels_32[i]);
+
+	memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
+
+	return 0;
+}
+
+static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key,
+			     const struct ovs_key_ct_labels *labels,
+			     const struct ovs_key_ct_labels *mask)
+{
+	struct nf_conn_labels *cl;
+	int err;
+
+	cl = ovs_ct_get_conn_labels(ct);
+	if (!cl)
 		return -ENOSPC;
 
 	err = nf_connlabels_replace(ct, labels->ct_labels_32,
@@ -283,7 +310,8 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
 	if (err)
 		return err;
 
-	ovs_ct_get_labels(ct, &key->ct.labels);
+	memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
+
 	return 0;
 }
 
@@ -865,25 +893,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
 			 struct sk_buff *skb)
 {
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 	int err;
 
 	err = __ovs_ct_lookup(net, key, info, skb);
 	if (err)
 		return err;
 
+	/* The connection could be invalid, in which case this is a no-op.*/
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return 0;
+
 	/* Apply changes before confirming the connection so that the initial
 	 * conntrack NEW netlink event carries the values given in the CT
 	 * action.
 	 */
 	if (info->mark.mask) {
-		err = ovs_ct_set_mark(skb, key, info->mark.value,
+		err = ovs_ct_set_mark(ct, key, info->mark.value,
 				      info->mark.mask);
 		if (err)
 			return err;
 	}
 	if (labels_nonzero(&info->labels.mask)) {
-		err = ovs_ct_set_labels(skb, key, &info->labels.value,
-					&info->labels.mask);
+		if (!nf_ct_is_confirmed(ct))
+			err = ovs_ct_init_labels(ct, key, &info->labels.value,
+						 &info->labels.mask);
+		else
+			err = ovs_ct_set_labels(ct, key, &info->labels.value,
+						&info->labels.mask);
 		if (err)
 			return err;
 	}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 4/7] openvswitch: Inherit master's labels.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
  2017-02-03  1:10 ` [PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array Jarno Rajahalme
  2017-02-03  1:10 ` [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-06 21:53   ` Joe Stringer
  2017-02-03  1:10 ` [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key Jarno Rajahalme
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

We avoid calling into nf_conntrack_in() for expected connections, as
that would remove the expectation that we want to stick around until
we are ready to commit the connection.  Instead, we do a lookup in the
expectation table directly.  However, after a successful expectation
lookup we have set the flow key label field from the master
connection, whereas nf_conntrack_in() does not do this.  This leads to
master's labels being iherited after an expectation lookup, but those
labels not being inherited after the corresponding conntrack action
with a commit flag.

This patch resolves the problem by changing the commit code path to
also inherit the master's labels to the expected connection.
Resolving this conflict in favor or inheriting the labels allows
information be passed from the master connection to related
connections, which would otherwise be much harder.  Labels can still
be set explicitly, so this change only affects the default values of
the labels in presense of a master connection.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/openvswitch/conntrack.c | 48 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index a163c44..738a4fa 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -265,6 +265,8 @@ static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
 	return cl;
 }
 
+static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
+
 /* Initialize labels for a new, to be committed conntrack entry.  Note that
  * since the new connection is not yet confirmed, and thus no-one else has
  * access to it's labels, we simply write them over.  Also, we refrain from
@@ -275,18 +277,35 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
 			      const struct ovs_key_ct_labels *labels,
 			      const struct ovs_key_ct_labels *mask)
 {
-	struct nf_conn_labels *cl;
-	u32 *dst;
-	int i;
+	struct nf_conn_labels *cl, *master_cl;
+	bool have_mask = labels_nonzero(mask);
+
+	/* Inherit master's labels to the related connection? */
+	master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL;
+
+	if (!master_cl && !have_mask)
+		return 0;   /* Nothing to do. */
 
 	cl = ovs_ct_get_conn_labels(ct);
 	if (!cl)
 		return -ENOSPC;
 
-	dst = (u32 *)cl->bits;
-	for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
-		dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
-			(labels->ct_labels_32[i] & mask->ct_labels_32[i]);
+	/* Inherit the master's labels, if any. */
+	if (master_cl) {
+		size_t len = sizeof(master_cl->bits);
+
+		memcpy(&cl->bits, &master_cl->bits,
+		       len > OVS_CT_LABELS_LEN ? OVS_CT_LABELS_LEN : len);
+	}
+	if (have_mask) {
+		u32 *dst = (u32 *)cl->bits;
+		int i;
+
+		for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
+			dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
+				(labels->ct_labels_32[i]
+				 & mask->ct_labels_32[i]);
+	}
 
 	memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
 
@@ -916,13 +935,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 		if (err)
 			return err;
 	}
-	if (labels_nonzero(&info->labels.mask)) {
-		if (!nf_ct_is_confirmed(ct))
-			err = ovs_ct_init_labels(ct, key, &info->labels.value,
-						 &info->labels.mask);
-		else
-			err = ovs_ct_set_labels(ct, key, &info->labels.value,
-						&info->labels.mask);
+	if (!nf_ct_is_confirmed(ct)) {
+		err = ovs_ct_init_labels(ct, key, &info->labels.value,
+					 &info->labels.mask);
+		if (err)
+			return err;
+	} else if (labels_nonzero(&info->labels.mask)) {
+		err = ovs_ct_set_labels(ct, key, &info->labels.value,
+					&info->labels.mask);
 		if (err)
 			return err;
 	}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
                   ` (2 preceding siblings ...)
  2017-02-03  1:10 ` [PATCH net-next 4/7] openvswitch: Inherit master's labels Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-07  7:15   ` Joe Stringer
  2017-02-03  1:10 ` [PATCH net-next 6/7] openvswitch: Add force commit Jarno Rajahalme
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

Add the fields of the conntrack original direction 5-tuple to struct
sw_flow_key.  The new fields are initially zeroed, and are populated
whenever a conntrack action is executed and either finds or generates
a conntrack entry.  This means that these fields exist for all packets
were not rejected by conntrack as untrackable.

The original tuple fields in the sw_flow_key are filled from the
original direction tuple of the conntrack entry relating to the
current packet, or from the original direction tuple of the master
conntrack entry, if the current conntrack entry has a master.
Generally, expected connections of connections having an assigned
helper (e.g., FTP), have a master conntrack entry.

The main purpose of the new conntrack original tuple fields is to
allow matching on them for policy decision purposes, with the premise
that the admissibility of tracked connections reply packets (as well
as original direction packets), and both direction packets of any
related connections may be based on ACL rules applying to the master
connection's original direction 5-tuple.  This also makes it easier to
make policy decisions when the actual packet headers might have been
transformed by NAT, as the original direction 5-tuple represents the
packet headers before any such transformation.

When using the original direction 5-tuple the admissibility of return
and/or related packets need not be based on the mere existence of a
conntrack entry, allowing separation of admission policy from the
established conntrack state.  While existence of a conntrack entry is
required for admission of the return or related packets, policy
changes can render connections that were initially admitted to be
rejected or dropped afterwards.  If the admission of the return and
related packets was based on mere conntrack state (e.g., connection
being in an established state), a policy change that would make the
connection rejected or dropped would need to find and delete all
conntrack entries affected by such a change.  When using the original
direction 5-tuple matching the affected conntrack entries can be
allowed to time out instead, as the established state of the
connection would not need to be the basis for packet admission any
more.

It should be noted that the directionality of related connections may
be the same or different than that of the master connection, and
neither the original direction 5-tuple nor the conntrack state bits
carry this information.  If needed, the directionality of the master
connection can be stored in master's conntrack mark or labels, which
are automatically inherited by the expected related connections.

The fact that neither ARP not ND packets are trackable by conntrack
allows mutual exclusion between ARP/ND and the new conntrack original
tuple fields.  Hence, the IP addresses are overlaid in union with ARP
and ND fields.  This allows the sw_flow_key to not grow much due to
this patch, but it also means that we must be careful to never use the
new key fields with ARP or ND packets.  ARP is easy to distinguish and
keep mutually exclusive based on the ethernet type, but ND being an
ICMPv6 protocol requires a bit more attention.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 include/uapi/linux/openvswitch.h | 20 ++++++++-
 net/openvswitch/actions.c        |  2 +
 net/openvswitch/conntrack.c      | 95 +++++++++++++++++++++++++++++++++++++---
 net/openvswitch/conntrack.h      | 13 +++++-
 net/openvswitch/flow.c           | 34 +++++++++++---
 net/openvswitch/flow.h           | 49 ++++++++++++++++-----
 net/openvswitch/flow_netlink.c   | 91 +++++++++++++++++++++++++++++---------
 net/openvswitch/flow_netlink.h   |  7 ++-
 8 files changed, 264 insertions(+), 47 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 96aee34..90af8b8 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -1,6 +1,6 @@
 
 /*
- * Copyright (c) 2007-2013 Nicira, Inc.
+ * Copyright (c) 2007-2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -331,6 +331,8 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_ZONE,	/* u16 connection tracking zone. */
 	OVS_KEY_ATTR_CT_MARK,	/* u32 connection tracking mark */
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking label */
+	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4,   /* struct ovs_key_ct_tuple_ipv4 */
+	OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6,   /* struct ovs_key_ct_tuple_ipv6 */
 
 #ifdef __KERNEL__
 	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ip_tunnel_info */
@@ -472,6 +474,22 @@ struct ovs_key_ct_labels {
 
 #define OVS_CS_F_NAT_MASK (OVS_CS_F_SRC_NAT | OVS_CS_F_DST_NAT)
 
+struct ovs_key_ct_tuple_ipv4 {
+	__be32 ipv4_src;
+	__be32 ipv4_dst;
+	__be16 src_port;
+	__be16 dst_port;
+	__u8   ipv4_proto;
+};
+
+struct ovs_key_ct_tuple_ipv6 {
+	__be32 ipv6_src[4];
+	__be32 ipv6_dst[4];
+	__be16 src_port;
+	__be16 dst_port;
+	__u8   ipv6_proto;
+};
+
 /**
  * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
  * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index efa9a88..b1beb2b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1074,6 +1074,8 @@ static int execute_masked_set_action(struct sk_buff *skb,
 	case OVS_KEY_ATTR_CT_ZONE:
 	case OVS_KEY_ATTR_CT_MARK:
 	case OVS_KEY_ATTR_CT_LABELS:
+	case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4:
+	case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6:
 		err = -EINVAL;
 		break;
 	}
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 738a4fa..1afe153 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -155,6 +155,59 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 	key->ct.zone = zone->id;
 	key->ct.mark = ovs_ct_get_mark(ct);
 	ovs_ct_get_labels(ct, &key->ct.labels);
+
+	/* Use the master if we have one. */
+	if (ct && ct->master)
+		ct = ct->master;
+
+	key->ct.orig_proto = 0;
+	key->ct.orig_tp.src = 0;
+	key->ct.orig_tp.dst = 0;
+	if (key->eth.type == htons(ETH_P_IP)) {
+		/* IP version must match. */
+		if (ct && nf_ct_l3num(ct) == NFPROTO_IPV4) {
+			const struct nf_conntrack_tuple *orig =
+				&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+
+			key->ipv4.ct_orig.src = orig->src.u3.ip;
+			key->ipv4.ct_orig.dst = orig->dst.u3.ip;
+			key->ct.orig_proto = orig->dst.protonum;
+			if (orig->dst.protonum == IPPROTO_ICMP) {
+				key->ct.orig_tp.src
+					= htons(orig->dst.u.icmp.type);
+				key->ct.orig_tp.dst
+					= htons(orig->dst.u.icmp.code);
+			} else {
+				key->ct.orig_tp.src = orig->src.u.all;
+				key->ct.orig_tp.dst = orig->dst.u.all;
+			}
+		} else
+			memset(&key->ipv4.ct_orig, 0,
+			       sizeof(key->ipv4.ct_orig));
+	} else if (key->eth.type == htons(ETH_P_IPV6) &&
+		   /* IPv6 orig tuple addresses overlap with ND. */
+		   !sw_flow_key_is_nd(key)) {
+		/* IP version must match. */
+		if (ct && nf_ct_l3num(ct) == NFPROTO_IPV6) {
+			const struct nf_conntrack_tuple *orig =
+				&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+
+			key->ipv6.ct_orig.src = orig->src.u3.in6;
+			key->ipv6.ct_orig.dst = orig->dst.u3.in6;
+			key->ct.orig_proto = orig->dst.protonum;
+			if (orig->dst.protonum == IPPROTO_ICMPV6) {
+				key->ct.orig_tp.src
+					= htons(orig->dst.u.icmp.type);
+				key->ct.orig_tp.dst
+					= htons(orig->dst.u.icmp.code);
+			} else {
+				key->ct.orig_tp.src = orig->src.u.all;
+				key->ct.orig_tp.dst = orig->dst.u.all;
+			}
+		} else
+			memset(&key->ipv6.ct_orig, 0,
+			       sizeof(key->ipv6.ct_orig));
+	}
 }
 
 /* Update 'key' based on skb->nfct.  If 'post_ct' is true, then OVS has
@@ -208,24 +261,54 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key)
 	ovs_ct_update_key(skb, NULL, key, false, false);
 }
 
-int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
+#define IN6_ADDR_INITIALIZER(ADDR) \
+	{ (ADDR).s6_addr32[0], (ADDR).s6_addr32[1], \
+	  (ADDR).s6_addr32[2], (ADDR).s6_addr32[3] }
+
+int ovs_ct_put_key(const struct sw_flow_key *swkey,
+		   const struct sw_flow_key *output, struct sk_buff *skb)
 {
-	if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, key->ct.state))
+	if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state))
 		return -EMSGSIZE;
 
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
-	    nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, key->ct.zone))
+	    nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone))
 		return -EMSGSIZE;
 
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) &&
-	    nla_put_u32(skb, OVS_KEY_ATTR_CT_MARK, key->ct.mark))
+	    nla_put_u32(skb, OVS_KEY_ATTR_CT_MARK, output->ct.mark))
 		return -EMSGSIZE;
 
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) &&
-	    nla_put(skb, OVS_KEY_ATTR_CT_LABELS, sizeof(key->ct.labels),
-		    &key->ct.labels))
+	    nla_put(skb, OVS_KEY_ATTR_CT_LABELS, sizeof(output->ct.labels),
+		    &output->ct.labels))
 		return -EMSGSIZE;
 
+	if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct.orig_proto) {
+		struct ovs_key_ct_tuple_ipv4 orig = {
+			output->ipv4.ct_orig.src,
+			output->ipv4.ct_orig.dst,
+			output->ct.orig_tp.src,
+			output->ct.orig_tp.dst,
+			output->ct.orig_proto,
+		};
+		if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig),
+			    &orig))
+			return -EMSGSIZE;
+	} else if (swkey->eth.type == htons(ETH_P_IPV6) &&
+		   swkey->ct.orig_proto) {
+		struct ovs_key_ct_tuple_ipv6 orig = {
+			IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.src),
+			IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst),
+			output->ct.orig_tp.src,
+			output->ct.orig_tp.dst,
+			output->ct.orig_proto,
+		};
+		if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig),
+			    &orig))
+			return -EMSGSIZE;
+	}
+
 	return 0;
 }
 
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 8f6230b..8573ab3 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -32,7 +32,8 @@ int ovs_ct_execute(struct net *, struct sk_buff *, struct sw_flow_key *,
 		   const struct ovs_conntrack_info *);
 
 void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key);
-int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb);
+int ovs_ct_put_key(const struct sw_flow_key *swkey,
+		   const struct sw_flow_key *output, struct sk_buff *skb);
 void ovs_ct_free_action(const struct nlattr *a);
 
 #define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
@@ -79,9 +80,17 @@ static inline void ovs_ct_fill_key(const struct sk_buff *skb,
 	key->ct.zone = 0;
 	key->ct.mark = 0;
 	memset(&key->ct.labels, 0, sizeof(key->ct.labels));
+	key->ct.orig_proto = 0;
+	key->ct.orig_tp.src = 0;
+	key->ct.orig_tp.dst = 0;
+	if (key->eth.type == htons(ETH_P_IP))
+		memset(&key->ipv4.ct_orig, 0, sizeof(key->ipv4.ct_orig));
+	else if (key->eth.type == htons(ETH_P_IPV6) && !sw_flow_key_is_nd(key))
+		memset(&key->ipv6.ct_orig, 0, sizeof(key->ipv6.ct_orig));
 }
 
-static inline int ovs_ct_put_key(const struct sw_flow_key *key,
+static inline int ovs_ct_put_key(const struct sw_flow_key *swkey,
+				 const struct sw_flow_key *output,
 				 struct sk_buff *skb)
 {
 	return 0;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 2c0a00f..9d4bb8e 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -765,7 +765,7 @@ static int key_extract_mac_proto(struct sk_buff *skb)
 int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 			 struct sk_buff *skb, struct sw_flow_key *key)
 {
-	int res;
+	int res, err;
 
 	/* Extract metadata from packet. */
 	if (tun_info) {
@@ -792,7 +792,6 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 	key->phy.priority = skb->priority;
 	key->phy.in_port = OVS_CB(skb)->input_vport->port_no;
 	key->phy.skb_mark = skb->mark;
-	ovs_ct_fill_key(skb, key);
 	key->ovs_flow_hash = 0;
 	res = key_extract_mac_proto(skb);
 	if (res < 0)
@@ -800,17 +799,26 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
 	key->mac_proto = res;
 	key->recirc_id = 0;
 
-	return key_extract(skb, key);
+	err = key_extract(skb, key);
+	if (!err)
+		ovs_ct_fill_key(skb, key);   /* Must be after key_extract(). */
+	return err;
 }
 
 int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr,
 				   struct sk_buff *skb,
 				   struct sw_flow_key *key, bool log)
 {
+	const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
+	u64 attrs = 0;
 	int err;
 
+	err = parse_flow_nlattrs(attr, a, &attrs, log);
+	if (err)
+		return -EINVAL;
+
 	/* Extract metadata from netlink attributes. */
-	err = ovs_nla_get_flow_metadata(net, attr, key, log);
+	err = ovs_nla_get_flow_metadata(net, a, attrs, key, log);
 	if (err)
 		return err;
 
@@ -824,5 +832,21 @@ int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr,
 	 */
 
 	skb->protocol = key->eth.type;
-	return key_extract(skb, key);
+	err = key_extract(skb, key);
+	if (err)
+		return err;
+
+	/* Check that we have conntrack original direction tuple metadata only
+	 * for packets for which it makes sense.  Otherwise the key may be
+	 * corrupted due to overlapping key fields.
+	 */
+	if (attrs & (1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4) &&
+	    key->eth.type != htons(ETH_P_IP))
+		return -EINVAL;
+	if (attrs & (1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6) &&
+	    (key->eth.type != htons(ETH_P_IPV6) ||
+	     sw_flow_key_is_nd(key)))
+		return -EINVAL;
+
+	return 0;
 }
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index f61cae7..19dc4d2 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2007-2014 Nicira, Inc.
+ * Copyright (c) 2007-2017 Nicira, Inc.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -107,10 +107,16 @@ struct sw_flow_key {
 				__be32 src;	/* IP source address. */
 				__be32 dst;	/* IP destination address. */
 			} addr;
-			struct {
-				u8 sha[ETH_ALEN];	/* ARP source hardware address. */
-				u8 tha[ETH_ALEN];	/* ARP target hardware address. */
-			} arp;
+			union {
+				struct {
+					__be32 src;
+					__be32 dst;
+				} ct_orig;	/* Conntrack original direction fields. */
+				struct {
+					u8 sha[ETH_ALEN];	/* ARP source hardware address. */
+					u8 tha[ETH_ALEN];	/* ARP target hardware address. */
+				} arp;
+			};
 		} ipv4;
 		struct {
 			struct {
@@ -118,23 +124,44 @@ struct sw_flow_key {
 				struct in6_addr dst;	/* IPv6 destination address. */
 			} addr;
 			__be32 label;			/* IPv6 flow label. */
-			struct {
-				struct in6_addr target;	/* ND target address. */
-				u8 sll[ETH_ALEN];	/* ND source link layer address. */
-				u8 tll[ETH_ALEN];	/* ND target link layer address. */
-			} nd;
+			union {
+				struct {
+					struct in6_addr src;
+					struct in6_addr dst;
+				} ct_orig;	/* Conntrack original direction fields. */
+				struct {
+					struct in6_addr target;	/* ND target address. */
+					u8 sll[ETH_ALEN];	/* ND source link layer address. */
+					u8 tll[ETH_ALEN];	/* ND target link layer address. */
+				} nd;
+			};
 		} ipv6;
 	};
 	struct {
 		/* Connection tracking fields. */
+		u8 state;
+		u8 orig_proto;		/* CT orig tuple IP protocol. */
 		u16 zone;
 		u32 mark;
-		u8 state;
+		struct {
+			__be16 src;	/* CT orig tuple tp src port. */
+			__be16 dst;	/* CT orig tuple tp dst port. */
+		} orig_tp;
+
 		struct ovs_key_ct_labels labels;
 	} ct;
 
 } __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
 
+static inline bool sw_flow_key_is_nd(const struct sw_flow_key *key)
+{
+	return key->eth.type == htons(ETH_P_IPV6) &&
+		key->ip.proto == NEXTHDR_ICMP &&
+		key->tp.dst == 0 &&
+		(key->tp.src == ntohs(NDISC_NEIGHBOUR_SOLICITATION) ||
+		 key->tp.src == ntohs(NDISC_NEIGHBOUR_ADVERTISEMENT));
+}
+
 struct sw_flow_key_range {
 	unsigned short int start;
 	unsigned short int end;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index c87d359..b2594bb 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -129,7 +129,9 @@ static bool match_validate(const struct sw_flow_match *match,
 	/* The following mask attributes allowed only if they
 	 * pass the validation tests. */
 	mask_allowed &= ~((1 << OVS_KEY_ATTR_IPV4)
+			| (1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4)
 			| (1 << OVS_KEY_ATTR_IPV6)
+			| (1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6)
 			| (1 << OVS_KEY_ATTR_TCP)
 			| (1 << OVS_KEY_ATTR_TCP_FLAGS)
 			| (1 << OVS_KEY_ATTR_UDP)
@@ -161,8 +163,11 @@ static bool match_validate(const struct sw_flow_match *match,
 
 	if (match->key->eth.type == htons(ETH_P_IP)) {
 		key_expected |= 1 << OVS_KEY_ATTR_IPV4;
-		if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
+		if (match->mask &&
+		    (match->mask->key.eth.type == htons(0xffff))) {
 			mask_allowed |= 1 << OVS_KEY_ATTR_IPV4;
+			mask_allowed |= 1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4;
+		}
 
 		if (match->key->ip.frag != OVS_FRAG_TYPE_LATER) {
 			if (match->key->ip.proto == IPPROTO_UDP) {
@@ -196,8 +201,11 @@ static bool match_validate(const struct sw_flow_match *match,
 
 	if (match->key->eth.type == htons(ETH_P_IPV6)) {
 		key_expected |= 1 << OVS_KEY_ATTR_IPV6;
-		if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
+		if (match->mask &&
+		    (match->mask->key.eth.type == htons(0xffff))) {
 			mask_allowed |= 1 << OVS_KEY_ATTR_IPV6;
+			mask_allowed |= 1 << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6;
+		}
 
 		if (match->key->ip.frag != OVS_FRAG_TYPE_LATER) {
 			if (match->key->ip.proto == IPPROTO_UDP) {
@@ -230,6 +238,12 @@ static bool match_validate(const struct sw_flow_match *match,
 						htons(NDISC_NEIGHBOUR_SOLICITATION) ||
 				    match->key->tp.src == htons(NDISC_NEIGHBOUR_ADVERTISEMENT)) {
 					key_expected |= 1 << OVS_KEY_ATTR_ND;
+					/* Original direction conntrack tuple
+					 * uses the same space as the ND fields
+					 * in the key, so both are not allowed
+					 * at the same time.
+					 */
+					mask_allowed &= ~(1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6);
 					if (match->mask && (match->mask->key.tp.src == htons(0xff)))
 						mask_allowed |= 1 << OVS_KEY_ATTR_ND;
 				}
@@ -282,7 +296,7 @@ size_t ovs_key_attr_size(void)
 	/* Whenever adding new OVS_KEY_ FIELDS, we should consider
 	 * updating this function.
 	 */
-	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 26);
+	BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 28);
 
 	return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
 		+ nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
@@ -295,6 +309,7 @@ size_t ovs_key_attr_size(void)
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_CT_ZONE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_CT_MARK */
 		+ nla_total_size(16)  /* OVS_KEY_ATTR_CT_LABELS */
+		+ nla_total_size(40)  /* OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6 */
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
 		+ nla_total_size(2)   /* OVS_KEY_ATTR_ETHERTYPE */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_VLAN */
@@ -355,6 +370,10 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
 	[OVS_KEY_ATTR_CT_ZONE]	 = { .len = sizeof(u16) },
 	[OVS_KEY_ATTR_CT_MARK]	 = { .len = sizeof(u32) },
 	[OVS_KEY_ATTR_CT_LABELS] = { .len = sizeof(struct ovs_key_ct_labels) },
+	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4] = {
+		.len = sizeof(struct ovs_key_ct_tuple_ipv4) },
+	[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6] = {
+		.len = sizeof(struct ovs_key_ct_tuple_ipv6) },
 };
 
 static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
@@ -430,9 +449,8 @@ static int parse_flow_mask_nlattrs(const struct nlattr *attr,
 	return __parse_flow_nlattrs(attr, a, attrsp, log, true);
 }
 
-static int parse_flow_nlattrs(const struct nlattr *attr,
-			      const struct nlattr *a[], u64 *attrsp,
-			      bool log)
+int parse_flow_nlattrs(const struct nlattr *attr, const struct nlattr *a[],
+		       u64 *attrsp, bool log)
 {
 	return __parse_flow_nlattrs(attr, a, attrsp, log, false);
 }
@@ -1082,6 +1100,34 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				   sizeof(*cl), is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_LABELS);
 	}
+	if (*attrs & (1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4)) {
+		const struct ovs_key_ct_tuple_ipv4 *ct;
+
+		ct = nla_data(a[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4]);
+
+		SW_FLOW_KEY_PUT(match, ipv4.ct_orig.src, ct->ipv4_src, is_mask);
+		SW_FLOW_KEY_PUT(match, ipv4.ct_orig.dst, ct->ipv4_dst, is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_tp.src, ct->src_port, is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_tp.dst, ct->dst_port, is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_proto, ct->ipv4_proto, is_mask);
+		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4);
+	}
+	if (*attrs & (1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6)) {
+		const struct ovs_key_ct_tuple_ipv6 *ct;
+
+		ct = nla_data(a[OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6]);
+
+		SW_FLOW_KEY_MEMCPY(match, ipv6.ct_orig.src, &ct->ipv6_src,
+				   sizeof(match->key->ipv6.ct_orig.src),
+				   is_mask);
+		SW_FLOW_KEY_MEMCPY(match, ipv6.ct_orig.dst, &ct->ipv6_dst,
+				   sizeof(match->key->ipv6.ct_orig.dst),
+				   is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_tp.src, ct->src_port, is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_tp.dst, ct->dst_port, is_mask);
+		SW_FLOW_KEY_PUT(match, ct.orig_proto, ct->ipv6_proto, is_mask);
+		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6);
+	}
 
 	/* For layer 3 packets the Ethernet type is provided
 	 * and treated as metadata but no MAC addresses are provided.
@@ -1430,6 +1476,10 @@ int ovs_nla_get_match(struct net *net, struct sw_flow_match *match,
 			goto free_newmask;
 	}
 
+	/* This also checks that conntrack original direction metadata exists
+	 * only for packets for which it makes sense.  Otherwise the key may be
+	 * corrupted due to overlapping key fields.
+	 */
 	if (!match_validate(match, key_attrs, mask_attrs, log))
 		err = -EINVAL;
 
@@ -1493,9 +1543,12 @@ u32 ovs_nla_get_ufid_flags(const struct nlattr *attr)
 
 /**
  * ovs_nla_get_flow_metadata - parses Netlink attributes into a flow key.
- * @key: Receives extracted in_port, priority, tun_key and skb_mark.
- * @attr: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute
- * sequence.
+ * @net: Network namespace.
+ * @key: Receives extracted in_port, priority, tun_key, skb_mark and conntrack
+ * metadata.
+ * @a: Array of netlink attributes holding parsed %OVS_KEY_ATTR_* Netlink
+ * attributes.
+ * @attrs: Bit mask for the netlink attributes included in @a.
  * @log: Boolean to allow kernel error logging.  Normally true, but when
  * probing for feature compatibility this should be passed in as false to
  * suppress unnecessary error logging.
@@ -1504,25 +1557,23 @@ u32 ovs_nla_get_ufid_flags(const struct nlattr *attr)
  * take the same form accepted by flow_from_nlattrs(), but only enough of it to
  * get the metadata, that is, the parts of the flow key that cannot be
  * extracted from the packet itself.
+ *
+ * This must be called before the packet key fields are filled in 'key'.
  */
 
-int ovs_nla_get_flow_metadata(struct net *net, const struct nlattr *attr,
-			      struct sw_flow_key *key,
-			      bool log)
+int ovs_nla_get_flow_metadata(struct net *net,
+			      const struct nlattr *a[OVS_KEY_ATTR_MAX + 1],
+			      u64 attrs, struct sw_flow_key *key, bool log)
 {
-	const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
 	struct sw_flow_match match;
-	u64 attrs = 0;
-	int err;
-
-	err = parse_flow_nlattrs(attr, a, &attrs, log);
-	if (err)
-		return -EINVAL;
 
 	memset(&match, 0, sizeof(match));
 	match.key = key;
 
 	memset(&key->ct, 0, sizeof(key->ct));
+	memset(&key->ipv4.ct_orig, 0, sizeof(key->ipv4.ct_orig));
+	memset(&key->ipv6.ct_orig, 0, sizeof(key->ipv6.ct_orig));
+
 	key->phy.in_port = DP_MAX_PORTS;
 
 	return metadata_from_nlattrs(net, &match, &attrs, a, false, log);
@@ -1584,7 +1635,7 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
 	if (nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, output->phy.skb_mark))
 		goto nla_put_failure;
 
-	if (ovs_ct_put_key(output, skb))
+	if (ovs_ct_put_key(swkey, output, skb))
 		goto nla_put_failure;
 
 	if (ovs_key_mac_proto(swkey) == MAC_PROTO_ETHERNET) {
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 45f9769..929c665 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -46,8 +46,11 @@ void ovs_match_init(struct sw_flow_match *match,
 
 int ovs_nla_put_key(const struct sw_flow_key *, const struct sw_flow_key *,
 		    int attr, bool is_mask, struct sk_buff *);
-int ovs_nla_get_flow_metadata(struct net *, const struct nlattr *,
-			      struct sw_flow_key *, bool log);
+int parse_flow_nlattrs(const struct nlattr *attr, const struct nlattr *a[],
+		       u64 *attrsp, bool log);
+int ovs_nla_get_flow_metadata(struct net *net,
+			      const struct nlattr *a[OVS_KEY_ATTR_MAX + 1],
+			      u64 attrs, struct sw_flow_key *key, bool log);
 
 int ovs_nla_put_identifier(const struct sw_flow *flow, struct sk_buff *skb);
 int ovs_nla_put_masked_key(const struct sw_flow *flow, struct sk_buff *skb);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
                   ` (3 preceding siblings ...)
  2017-02-03  1:10 ` [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-06 17:08   ` Pravin Shelar
  2017-02-07 22:15   ` Joe Stringer
  2017-02-03  1:10 ` [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key Jarno Rajahalme
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

Stateful network admission policy may allow connections to one
direction and reject connections initiated in the other direction.
After policy change it is possible that for a new connection an
overlapping conntrack entry already exist, where the connection
original direction is opposed to the new connection's initial packet.

Most importantly, conntrack state relating to the current packet gets
the "reply" designation based on whether the original direction tuple
or the reply direction tuple matched.  If this "directionality" is
wrong w.r.t. to the stateful network admission policy it may happen
that packets in neither direction are correctly admitted.

This patch adds a new "force commit" option to the OVS conntrack
action that checks the original direction of an existing conntrack
entry.  If that direction is opposed to the current packet, the
existing conntrack entry is deleted and a new one is subsequently
created in the correct direction.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 include/uapi/linux/openvswitch.h | 10 ++++++++++
 net/openvswitch/conntrack.c      | 27 +++++++++++++++++++++++++--
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 90af8b8..d5ba9a9 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -674,6 +674,10 @@ struct ovs_action_hash {
  * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
  * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address
  * translation (NAT) on the packet.
+ * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing
+ * nothing if the connection is already committed will check that the current
+ * packet is in conntrack entry's original direction.  If directionality does
+ * not match, will delete the existing conntrack entry and commit a new one.
  */
 enum ovs_ct_attr {
 	OVS_CT_ATTR_UNSPEC,
@@ -684,6 +688,12 @@ enum ovs_ct_attr {
 	OVS_CT_ATTR_HELPER,     /* netlink helper to assist detection of
 				   related connections. */
 	OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
+	OVS_CT_ATTR_FORCE_COMMIT,  /* No argument, commits connection.  If the
+				    * conntrack entry original direction tuple
+				    * does not match the current packet header
+				    * values, will delete the current conntrack
+				    * entry and create a new one.
+				    */
 	__OVS_CT_ATTR_MAX
 };
 
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1afe153..1f27f44 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -65,6 +65,7 @@ struct ovs_conntrack_info {
 	struct nf_conn *ct;
 	u8 commit : 1;
 	u8 nat : 3;                 /* enum ovs_ct_nat */
+	u8 force : 1;
 	u16 family;
 	struct md_mark mark;
 	struct md_labels labels;
@@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net,
 	 */
 	if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
 	    !(key->ct.state & OVS_CS_F_INVALID) &&
-	    key->ct.zone == info->zone.id)
+	    key->ct.zone == info->zone.id) {
 		ct = ovs_ct_find_existing(net, &info->zone, info->family, skb,
 					  !!(key->ct.state
 					     & OVS_CS_F_NAT_MASK));
+		if (ct)
+			nf_ct_get(skb, &ctinfo);
+	}
 	if (!ct)
 		return false;
 	if (!net_eq(net, read_pnet(&ct->ct_net)))
@@ -648,6 +652,19 @@ static bool skb_nfct_cached(struct net *net,
 		if (help && rcu_access_pointer(help->helper) != info->helper)
 			return false;
 	}
+	/* Force conntrack entry direction to the current packet? */
+	if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) {
+		/* Delete the conntrack entry if confirmed, else just release
+		 * the reference.
+		 */
+		if (nf_ct_is_confirmed(ct))
+			nf_ct_delete(ct, 0, 0);
+		else
+			nf_ct_put(ct);
+		skb->nfct = NULL;
+		skb->nfctinfo = 0;
+		return false;
+	}
 
 	return true;
 }
@@ -1226,6 +1243,7 @@ static int parse_nat(const struct nlattr *attr,
 
 static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
 	[OVS_CT_ATTR_COMMIT]	= { .minlen = 0, .maxlen = 0 },
+	[OVS_CT_ATTR_FORCE_COMMIT]	= { .minlen = 0, .maxlen = 0 },
 	[OVS_CT_ATTR_ZONE]	= { .minlen = sizeof(u16),
 				    .maxlen = sizeof(u16) },
 	[OVS_CT_ATTR_MARK]	= { .minlen = sizeof(struct md_mark),
@@ -1265,6 +1283,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
 		}
 
 		switch (type) {
+		case OVS_CT_ATTR_FORCE_COMMIT:
+			info->force = true;
+			/* fall through. */
 		case OVS_CT_ATTR_COMMIT:
 			info->commit = true;
 			break;
@@ -1491,7 +1512,9 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
 	if (!start)
 		return -EMSGSIZE;
 
-	if (ct_info->commit && nla_put_flag(skb, OVS_CT_ATTR_COMMIT))
+	if (ct_info->commit && nla_put_flag(skb, ct_info->force
+					    ? OVS_CT_ATTR_FORCE_COMMIT
+					    : OVS_CT_ATTR_COMMIT))
 		return -EMSGSIZE;
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
 	    nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone.id))
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
                   ` (4 preceding siblings ...)
  2017-02-03  1:10 ` [PATCH net-next 6/7] openvswitch: Add force commit Jarno Rajahalme
@ 2017-02-03  1:10 ` Jarno Rajahalme
  2017-02-07  7:15   ` Joe Stringer
  2017-02-05 22:28 ` [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted David Miller
  2017-02-06 17:07 ` Pravin Shelar
  7 siblings, 1 reply; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-03  1:10 UTC (permalink / raw)
  To: netdev; +Cc: jarno

struct sw_flow_key has two 16-bit holes. Move the most matched
conntrack match fields there.  In some typical cases this reduces the
size of the key that needs to be hashed into half and into one cache
line.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/openvswitch/conntrack.c    | 42 +++++++++++++++++++++---------------------
 net/openvswitch/conntrack.h    |  6 +++---
 net/openvswitch/flow.h         | 14 ++++++++------
 net/openvswitch/flow_netlink.c |  8 ++++----
 4 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1f27f44..e8d29c0 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -152,8 +152,8 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 				const struct nf_conntrack_zone *zone,
 				const struct nf_conn *ct)
 {
-	key->ct.state = state;
-	key->ct.zone = zone->id;
+	key->ct_state = state;
+	key->ct_zone = zone->id;
 	key->ct.mark = ovs_ct_get_mark(ct);
 	ovs_ct_get_labels(ct, &key->ct.labels);
 
@@ -161,7 +161,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 	if (ct && ct->master)
 		ct = ct->master;
 
-	key->ct.orig_proto = 0;
+	key->ct_orig_proto = 0;
 	key->ct.orig_tp.src = 0;
 	key->ct.orig_tp.dst = 0;
 	if (key->eth.type == htons(ETH_P_IP)) {
@@ -172,7 +172,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 
 			key->ipv4.ct_orig.src = orig->src.u3.ip;
 			key->ipv4.ct_orig.dst = orig->dst.u3.ip;
-			key->ct.orig_proto = orig->dst.protonum;
+			key->ct_orig_proto = orig->dst.protonum;
 			if (orig->dst.protonum == IPPROTO_ICMP) {
 				key->ct.orig_tp.src
 					= htons(orig->dst.u.icmp.type);
@@ -195,7 +195,7 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
 
 			key->ipv6.ct_orig.src = orig->src.u3.in6;
 			key->ipv6.ct_orig.dst = orig->dst.u3.in6;
-			key->ct.orig_proto = orig->dst.protonum;
+			key->ct_orig_proto = orig->dst.protonum;
 			if (orig->dst.protonum == IPPROTO_ICMPV6) {
 				key->ct.orig_tp.src
 					= htons(orig->dst.u.icmp.type);
@@ -238,7 +238,7 @@ static void ovs_ct_update_key(const struct sk_buff *skb,
 		if (ct->master)
 			state |= OVS_CS_F_RELATED;
 		if (keep_nat_flags) {
-			state |= key->ct.state & OVS_CS_F_NAT_MASK;
+			state |= key->ct_state & OVS_CS_F_NAT_MASK;
 		} else {
 			if (ct->status & IPS_SRC_NAT)
 				state |= OVS_CS_F_SRC_NAT;
@@ -269,11 +269,11 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key)
 int ovs_ct_put_key(const struct sw_flow_key *swkey,
 		   const struct sw_flow_key *output, struct sk_buff *skb)
 {
-	if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state))
+	if (nla_put_u32(skb, OVS_KEY_ATTR_CT_STATE, output->ct_state))
 		return -EMSGSIZE;
 
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
-	    nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone))
+	    nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct_zone))
 		return -EMSGSIZE;
 
 	if (IS_ENABLED(CONFIG_NF_CONNTRACK_MARK) &&
@@ -285,25 +285,25 @@ int ovs_ct_put_key(const struct sw_flow_key *swkey,
 		    &output->ct.labels))
 		return -EMSGSIZE;
 
-	if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct.orig_proto) {
+	if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct_orig_proto) {
 		struct ovs_key_ct_tuple_ipv4 orig = {
 			output->ipv4.ct_orig.src,
 			output->ipv4.ct_orig.dst,
 			output->ct.orig_tp.src,
 			output->ct.orig_tp.dst,
-			output->ct.orig_proto,
+			output->ct_orig_proto,
 		};
 		if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig),
 			    &orig))
 			return -EMSGSIZE;
 	} else if (swkey->eth.type == htons(ETH_P_IPV6) &&
-		   swkey->ct.orig_proto) {
+		   swkey->ct_orig_proto) {
 		struct ovs_key_ct_tuple_ipv6 orig = {
 			IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.src),
 			IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst),
 			output->ct.orig_tp.src,
 			output->ct.orig_tp.dst,
-			output->ct.orig_proto,
+			output->ct_orig_proto,
 		};
 		if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig),
 			    &orig))
@@ -630,11 +630,11 @@ static bool skb_nfct_cached(struct net *net,
 	 * due to an upcall.  If the connection was not confirmed, it is not
 	 * cached and needs to be run through conntrack again.
 	 */
-	if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
-	    !(key->ct.state & OVS_CS_F_INVALID) &&
-	    key->ct.zone == info->zone.id) {
+	if (!ct && key->ct_state & OVS_CS_F_TRACKED &&
+	    !(key->ct_state & OVS_CS_F_INVALID) &&
+	    key->ct_zone == info->zone.id) {
 		ct = ovs_ct_find_existing(net, &info->zone, info->family, skb,
-					  !!(key->ct.state
+					  !!(key->ct_state
 					     & OVS_CS_F_NAT_MASK));
 		if (ct)
 			nf_ct_get(skb, &ctinfo);
@@ -759,7 +759,7 @@ static void ovs_nat_update_key(struct sw_flow_key *key,
 	if (maniptype == NF_NAT_MANIP_SRC) {
 		__be16 src;
 
-		key->ct.state |= OVS_CS_F_SRC_NAT;
+		key->ct_state |= OVS_CS_F_SRC_NAT;
 		if (key->eth.type == htons(ETH_P_IP))
 			key->ipv4.addr.src = ip_hdr(skb)->saddr;
 		else if (key->eth.type == htons(ETH_P_IPV6))
@@ -781,7 +781,7 @@ static void ovs_nat_update_key(struct sw_flow_key *key,
 	} else {
 		__be16 dst;
 
-		key->ct.state |= OVS_CS_F_DST_NAT;
+		key->ct_state |= OVS_CS_F_DST_NAT;
 		if (key->eth.type == htons(ETH_P_IP))
 			key->ipv4.addr.dst = ip_hdr(skb)->daddr;
 		else if (key->eth.type == htons(ETH_P_IPV6))
@@ -906,7 +906,7 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
 		 * NAT after the nf_conntrack_in() call.  We can actually clear
 		 * the whole state, as it will be re-initialized below.
 		 */
-		key->ct.state = 0;
+		key->ct_state = 0;
 
 		/* Update the key, but keep the NAT flags. */
 		ovs_ct_update_key(skb, info, key, true, true);
@@ -922,9 +922,9 @@ static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
 		 *
 		 * NAT will be done only if the CT action has NAT, and only
 		 * once per packet (per zone), as guarded by the NAT bits in
-		 * the key->ct.state.
+		 * the key->ct_state.
 		 */
-		if (info->nat && !(key->ct.state & OVS_CS_F_NAT_MASK) &&
+		if (info->nat && !(key->ct_state & OVS_CS_F_NAT_MASK) &&
 		    (nf_ct_is_confirmed(ct) || info->commit) &&
 		    ovs_ct_nat(net, key, info, skb, ct, ctinfo) != NF_ACCEPT) {
 			return -EINVAL;
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 8573ab3..b5f7130 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -76,11 +76,11 @@ static inline int ovs_ct_execute(struct net *net, struct sk_buff *skb,
 static inline void ovs_ct_fill_key(const struct sk_buff *skb,
 				   struct sw_flow_key *key)
 {
-	key->ct.state = 0;
-	key->ct.zone = 0;
+	key->ct_state = 0;
+	key->ct_zone = 0;
 	key->ct.mark = 0;
 	memset(&key->ct.labels, 0, sizeof(key->ct.labels));
-	key->ct.orig_proto = 0;
+	key->ct_orig_proto = 0;
 	key->ct.orig_tp.src = 0;
 	key->ct.orig_tp.dst = 0;
 	if (key->eth.type == htons(ETH_P_IP))
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 19dc4d2..60ef942 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -85,6 +85,11 @@ struct sw_flow_key {
 		struct vlan_head cvlan;
 		__be16 type;		/* Ethernet frame type. */
 	} eth;
+	/* Filling a hole of two bytes. */
+	u8 ct_state;
+	u8 ct_orig_proto;		/* CT original direction tuple IP
+					 * protocol.
+					 */
 	union {
 		struct {
 			__be32 top_lse;	/* top label stack entry */
@@ -96,6 +101,7 @@ struct sw_flow_key {
 			u8     frag;	/* One of OVS_FRAG_TYPE_*. */
 		} ip;
 	};
+	u16 ct_zone;			/* Conntrack zone. */
 	struct {
 		__be16 src;		/* TCP/UDP/SCTP source port. */
 		__be16 dst;		/* TCP/UDP/SCTP destination port. */
@@ -138,16 +144,12 @@ struct sw_flow_key {
 		} ipv6;
 	};
 	struct {
-		/* Connection tracking fields. */
-		u8 state;
-		u8 orig_proto;		/* CT orig tuple IP protocol. */
-		u16 zone;
-		u32 mark;
+		/* Connection tracking fields not packed above. */
 		struct {
 			__be16 src;	/* CT orig tuple tp src port. */
 			__be16 dst;	/* CT orig tuple tp dst port. */
 		} orig_tp;
-
+		u32 mark;
 		struct ovs_key_ct_labels labels;
 	} ct;
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index b2594bb..82c3f1e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1074,14 +1074,14 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 			return -EINVAL;
 		}
 
-		SW_FLOW_KEY_PUT(match, ct.state, ct_state, is_mask);
+		SW_FLOW_KEY_PUT(match, ct_state, ct_state, is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_STATE);
 	}
 	if (*attrs & (1 << OVS_KEY_ATTR_CT_ZONE) &&
 	    ovs_ct_verify(net, OVS_KEY_ATTR_CT_ZONE)) {
 		u16 ct_zone = nla_get_u16(a[OVS_KEY_ATTR_CT_ZONE]);
 
-		SW_FLOW_KEY_PUT(match, ct.zone, ct_zone, is_mask);
+		SW_FLOW_KEY_PUT(match, ct_zone, ct_zone, is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ZONE);
 	}
 	if (*attrs & (1 << OVS_KEY_ATTR_CT_MARK) &&
@@ -1109,7 +1109,7 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 		SW_FLOW_KEY_PUT(match, ipv4.ct_orig.dst, ct->ipv4_dst, is_mask);
 		SW_FLOW_KEY_PUT(match, ct.orig_tp.src, ct->src_port, is_mask);
 		SW_FLOW_KEY_PUT(match, ct.orig_tp.dst, ct->dst_port, is_mask);
-		SW_FLOW_KEY_PUT(match, ct.orig_proto, ct->ipv4_proto, is_mask);
+		SW_FLOW_KEY_PUT(match, ct_orig_proto, ct->ipv4_proto, is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4);
 	}
 	if (*attrs & (1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6)) {
@@ -1125,7 +1125,7 @@ static int metadata_from_nlattrs(struct net *net, struct sw_flow_match *match,
 				   is_mask);
 		SW_FLOW_KEY_PUT(match, ct.orig_tp.src, ct->src_port, is_mask);
 		SW_FLOW_KEY_PUT(match, ct.orig_tp.dst, ct->dst_port, is_mask);
-		SW_FLOW_KEY_PUT(match, ct.orig_proto, ct->ipv6_proto, is_mask);
+		SW_FLOW_KEY_PUT(match, ct_orig_proto, ct->ipv6_proto, is_mask);
 		*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6);
 	}
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
                   ` (5 preceding siblings ...)
  2017-02-03  1:10 ` [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key Jarno Rajahalme
@ 2017-02-05 22:28 ` David Miller
  2017-02-06 17:06   ` Pravin Shelar
  2017-02-07  0:45   ` Joe Stringer
  2017-02-06 17:07 ` Pravin Shelar
  7 siblings, 2 replies; 31+ messages in thread
From: David Miller @ 2017-02-05 22:28 UTC (permalink / raw)
  To: jarno; +Cc: netdev

From: Jarno Rajahalme <jarno@ovn.org>
Date: Thu,  2 Feb 2017 17:10:00 -0800

> This does not match either of the conntrack tuples above.  Normally
> this does not matter, as the conntrack lookup was already done using
> the tuple (B,A), but if the current packet does not match any flow in
> the OVS datapath, the packet is sent to userspace via an upcall,
> during which the packet's skb is freed, and the conntrack entry
> pointer in the skb is lost.

This is the real bug.

If the metadata for a packet is important, as it obviously is here,
then upcalls should preserve that metadata across the upcall.  This
is exactly how NF_QUEUE handles this problem and so should OVS.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-05 22:28 ` [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted David Miller
@ 2017-02-06 17:06   ` Pravin Shelar
  2017-02-06 17:15     ` David Miller
  2017-02-07  0:45   ` Joe Stringer
  1 sibling, 1 reply; 31+ messages in thread
From: Pravin Shelar @ 2017-02-06 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: Jarno Rajahalme, Linux Kernel Network Developers

On Sun, Feb 5, 2017 at 2:28 PM, David Miller <davem@davemloft.net> wrote:
> From: Jarno Rajahalme <jarno@ovn.org>
> Date: Thu,  2 Feb 2017 17:10:00 -0800
>
>> This does not match either of the conntrack tuples above.  Normally
>> this does not matter, as the conntrack lookup was already done using
>> the tuple (B,A), but if the current packet does not match any flow in
>> the OVS datapath, the packet is sent to userspace via an upcall,
>> during which the packet's skb is freed, and the conntrack entry
>> pointer in the skb is lost.
>
> This is the real bug.
>
> If the metadata for a packet is important, as it obviously is here,
> then upcalls should preserve that metadata across the upcall.  This
> is exactly how NF_QUEUE handles this problem and so should OVS.

OVS kernel datapath serializes skb context and sends it along with skb
during upcall via genetlink parameters. userspace echoes same
information back to kernel modules. This way OVS does maintains
metadata across upcall.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
                   ` (6 preceding siblings ...)
  2017-02-05 22:28 ` [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted David Miller
@ 2017-02-06 17:07 ` Pravin Shelar
  2017-02-08  5:30   ` Jarno Rajahalme
  7 siblings, 1 reply; 31+ messages in thread
From: Pravin Shelar @ 2017-02-06 17:07 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: Linux Kernel Network Developers

On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
> When looking for an existing conntrack entry, the packet 5-tuple
> must be inverted if NAT has already been applied, as the current
> packet headers do not match any conntrack tuple.  For
> example, if a packet from private address X to a public address B is
> source-NATted to A, the conntrack entry will have the following tuples
> (ignoring the protocol and port numbers) after the conntrack entry is
> committed:
>
> Original direction tuple: (X,B)
> Reply direction tuple: (B,A)
>
> Now, if a reply packet is already transformed back to the private
> address space (e.g., with a CT(nat) action), the tuple corresponding
> to the current packet headers is:
>
> Current packet tuple: (B,X)
>
> This does not match either of the conntrack tuples above.  Normally
> this does not matter, as the conntrack lookup was already done using
> the tuple (B,A), but if the current packet does not match any flow in
> the OVS datapath, the packet is sent to userspace via an upcall,
> during which the packet's skb is freed, and the conntrack entry
> pointer in the skb is lost.  When the packet is reintroduced to the
> datapath, any further conntrack action will need to perform a new
> conntrack lookup to find the entry again.  Prior to this patch this
> second lookup failed for NATted packets.  The datapath flow setup
> corresponding to the upcall can succeed, however, allowing all further
> packets in the reply direction to re-use the conntrack entry pointer
> in the skb, so typically the lookup failure only causes a packet drop.
>
> The solution is to invert the tuple derived from the current packet
> headers in case the conntrack state stored in the packet metadata
> indicates that the packet has been transformed by NAT:
>
> Inverted tuple: (X,B)
>
> With this the conntrack entry can be found, matching the original
> direction tuple.
>
> This same logic also works for the original direction packets:
>
> Current packet tuple (after NAT): (A,B)
> Inverted tuple: (B,A)
>
> While the current packet tuple (A,B) does not match either of the
> conntrack tuples, the inverted one (B,A) does match the reply
> direction tuple.
>
> Since the inverted tuple matches the reverse direction tuple the
> direction of the packet must be reversed as well.
>
> Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

I could not apply this patch series to net-next branch. But it does
applies to net, which branch are you targeting it for?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-03  1:10 ` [PATCH net-next 6/7] openvswitch: Add force commit Jarno Rajahalme
@ 2017-02-06 17:08   ` Pravin Shelar
  2017-02-07  7:28     ` Joe Stringer
  2017-02-07 22:15   ` Joe Stringer
  1 sibling, 1 reply; 31+ messages in thread
From: Pravin Shelar @ 2017-02-06 17:08 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: Linux Kernel Network Developers

On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
> Stateful network admission policy may allow connections to one
> direction and reject connections initiated in the other direction.
> After policy change it is possible that for a new connection an
> overlapping conntrack entry already exist, where the connection
> original direction is opposed to the new connection's initial packet.
>
> Most importantly, conntrack state relating to the current packet gets
> the "reply" designation based on whether the original direction tuple
> or the reply direction tuple matched.  If this "directionality" is
> wrong w.r.t. to the stateful network admission policy it may happen
> that packets in neither direction are correctly admitted.
>
Why not have the check in all commit actions? I am not sure in which
case user would not want forced commit considering this can cause
packet admission issue?

> This patch adds a new "force commit" option to the OVS conntrack
> action that checks the original direction of an existing conntrack
> entry.  If that direction is opposed to the current packet, the
> existing conntrack entry is deleted and a new one is subsequently
> created in the correct direction.
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-06 17:06   ` Pravin Shelar
@ 2017-02-06 17:15     ` David Miller
  2017-02-07 17:14       ` Pravin Shelar
  0 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2017-02-06 17:15 UTC (permalink / raw)
  To: pshelar; +Cc: jarno, netdev

From: Pravin Shelar <pshelar@ovn.org>
Date: Mon, 6 Feb 2017 09:06:29 -0800

> On Sun, Feb 5, 2017 at 2:28 PM, David Miller <davem@davemloft.net> wrote:
>> From: Jarno Rajahalme <jarno@ovn.org>
>> Date: Thu,  2 Feb 2017 17:10:00 -0800
>>
>>> This does not match either of the conntrack tuples above.  Normally
>>> this does not matter, as the conntrack lookup was already done using
>>> the tuple (B,A), but if the current packet does not match any flow in
>>> the OVS datapath, the packet is sent to userspace via an upcall,
>>> during which the packet's skb is freed, and the conntrack entry
>>> pointer in the skb is lost.
>>
>> This is the real bug.
>>
>> If the metadata for a packet is important, as it obviously is here,
>> then upcalls should preserve that metadata across the upcall.  This
>> is exactly how NF_QUEUE handles this problem and so should OVS.
> 
> OVS kernel datapath serializes skb context and sends it along with skb
> during upcall via genetlink parameters. userspace echoes same
> information back to kernel modules. This way OVS does maintains
> metadata across upcall.

Then the conntrack state looked up before the NAT operation done in
the upcall should be maintained and therefore this bug should not
exist.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection.
  2017-02-03  1:10 ` [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection Jarno Rajahalme
@ 2017-02-06 21:46   ` Joe Stringer
  2017-02-08  5:30     ` Jarno Rajahalme
  0 siblings, 1 reply; 31+ messages in thread
From: Joe Stringer @ 2017-02-06 21:46 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
> Avoid triggering change events for setting conntrack mark or labels
> before the conntrack entry has been confirmed.  Refactoring on this
> patch also makes chenges in later patches easier to review.
>
> Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark")
> Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

Functional and cosmetic changes should be in separate patches.

> ---
>  net/openvswitch/conntrack.c | 87 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 63 insertions(+), 24 deletions(-)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 6730f09..a163c44 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -229,23 +229,17 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
>         return 0;
>  }
>
> -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
> +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key,
>                            u32 ct_mark, u32 mask)
>  {
>  #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
> -       enum ip_conntrack_info ctinfo;
> -       struct nf_conn *ct;
>         u32 new_mark;
>
> -       /* The connection could be invalid, in which case set_mark is no-op. */
> -       ct = nf_ct_get(skb, &ctinfo);
> -       if (!ct)
> -               return 0;
> -
>         new_mark = ct_mark | (ct->mark & ~(mask));
>         if (ct->mark != new_mark) {
>                 ct->mark = new_mark;
> -               nf_conntrack_event_cache(IPCT_MARK, ct);
> +               if (nf_ct_is_confirmed(ct))
> +                       nf_conntrack_event_cache(IPCT_MARK, ct);
>                 key->ct.mark = new_mark;
>         }
>
> @@ -255,26 +249,59 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
>  #endif
>  }
>
> -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
> -                            const struct ovs_key_ct_labels *labels,
> -                            const struct ovs_key_ct_labels *mask)
> +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
>  {
> -       enum ip_conntrack_info ctinfo;
>         struct nf_conn_labels *cl;
> -       struct nf_conn *ct;
> -       int err;
> -
> -       /* The connection could be invalid, in which case set_label is no-op.*/
> -       ct = nf_ct_get(skb, &ctinfo);
> -       if (!ct)
> -               return 0;
>
>         cl = nf_ct_labels_find(ct);
>         if (!cl) {
>                 nf_ct_labels_ext_add(ct);
>                 cl = nf_ct_labels_find(ct);
>         }
> +
>         if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN)
> +               return NULL;
> +
> +       return cl;
> +}
> +
> +/* Initialize labels for a new, to be committed conntrack entry.  Note that
> + * since the new connection is not yet confirmed, and thus no-one else has
> + * access to it's labels, we simply write them over.  Also, we refrain from
> + * triggering events, as receiving change events before the create event would
> + * be confusing.
> + */
> +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
> +                             const struct ovs_key_ct_labels *labels,
> +                             const struct ovs_key_ct_labels *mask)
> +{
> +       struct nf_conn_labels *cl;
> +       u32 *dst;
> +       int i;
> +
> +       cl = ovs_ct_get_conn_labels(ct);
> +       if (!cl)
> +               return -ENOSPC;
> +
> +       dst = (u32 *)cl->bits;

Is it worth extending the union to include unsigned long, to avoid
casting it to u32 here?

> +       for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
> +               dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
> +                       (labels->ct_labels_32[i] & mask->ct_labels_32[i]);
> +
> +       memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
> +
> +       return 0;
> +}
> +
> +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key,
> +                            const struct ovs_key_ct_labels *labels,
> +                            const struct ovs_key_ct_labels *mask)
> +{
> +       struct nf_conn_labels *cl;
> +       int err;
> +
> +       cl = ovs_ct_get_conn_labels(ct);
> +       if (!cl)
>                 return -ENOSPC;
>
>         err = nf_connlabels_replace(ct, labels->ct_labels_32,
> @@ -283,7 +310,8 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
>         if (err)
>                 return err;
>
> -       ovs_ct_get_labels(ct, &key->ct.labels);
> +       memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
> +

I noticed this change and started looking at ovs_ct_get_labels(). It
tries to handle length differences between nf_conn_labels matches
ovs_key_ct_labels, but perhaps it's easier to drop that code and use a
build-time assert that they're the same instead. Since 23014011ba42
("netfilter: conntrack: support a fixed size of 128 distinct labels"),
they haven't been variable length anyway so this can simplify some
code. This can be separate from this patch though.

I think that such a change would be orthogonal from the above change,
the above can stay as is (not much point sharing the function call if
it just retrieves a pointer we already have and does a memcpy).

>         return 0;
>  }
>
> @@ -865,25 +893,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>                          const struct ovs_conntrack_info *info,
>                          struct sk_buff *skb)
>  {
> +       enum ip_conntrack_info ctinfo;
> +       struct nf_conn *ct;
>         int err;
>
>         err = __ovs_ct_lookup(net, key, info, skb);
>         if (err)
>                 return err;
>
> +       /* The connection could be invalid, in which case this is a no-op.*/
> +       ct = nf_ct_get(skb, &ctinfo);
> +       if (!ct)
> +               return 0;
> +
>         /* Apply changes before confirming the connection so that the initial
>          * conntrack NEW netlink event carries the values given in the CT
>          * action.
>          */
>         if (info->mark.mask) {
> -               err = ovs_ct_set_mark(skb, key, info->mark.value,
> +               err = ovs_ct_set_mark(ct, key, info->mark.value,
>                                       info->mark.mask);
>                 if (err)
>                         return err;
>         }
>         if (labels_nonzero(&info->labels.mask)) {
> -               err = ovs_ct_set_labels(skb, key, &info->labels.value,
> -                                       &info->labels.mask);
> +               if (!nf_ct_is_confirmed(ct))
> +                       err = ovs_ct_init_labels(ct, key, &info->labels.value,
> +                                                &info->labels.mask);
> +               else
> +                       err = ovs_ct_set_labels(ct, key, &info->labels.value,
> +                                               &info->labels.mask);
>                 if (err)
>                         return err;
>         }
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 4/7] openvswitch: Inherit master's labels.
  2017-02-03  1:10 ` [PATCH net-next 4/7] openvswitch: Inherit master's labels Jarno Rajahalme
@ 2017-02-06 21:53   ` Joe Stringer
  2017-02-08  5:31     ` Jarno Rajahalme
  0 siblings, 1 reply; 31+ messages in thread
From: Joe Stringer @ 2017-02-06 21:53 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
> We avoid calling into nf_conntrack_in() for expected connections, as
> that would remove the expectation that we want to stick around until
> we are ready to commit the connection.  Instead, we do a lookup in the
> expectation table directly.  However, after a successful expectation
> lookup we have set the flow key label field from the master
> connection, whereas nf_conntrack_in() does not do this.  This leads to
> master's labels being iherited after an expectation lookup, but those
> labels not being inherited after the corresponding conntrack action
> with a commit flag.
>
> This patch resolves the problem by changing the commit code path to
> also inherit the master's labels to the expected connection.
> Resolving this conflict in favor or inheriting the labels allows
> information be passed from the master connection to related
> connections, which would otherwise be much harder.  Labels can still
> be set explicitly, so this change only affects the default values of
> the labels in presense of a master connection.
>
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> ---
>  net/openvswitch/conntrack.c | 48 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 34 insertions(+), 14 deletions(-)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index a163c44..738a4fa 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -265,6 +265,8 @@ static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
>         return cl;
>  }
>
> +static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
> +

These declarations typically live at the top of the file, not
somewhere in the middle.

>  /* Initialize labels for a new, to be committed conntrack entry.  Note that
>   * since the new connection is not yet confirmed, and thus no-one else has
>   * access to it's labels, we simply write them over.  Also, we refrain from
> @@ -275,18 +277,35 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
>                               const struct ovs_key_ct_labels *labels,
>                               const struct ovs_key_ct_labels *mask)
>  {
> -       struct nf_conn_labels *cl;
> -       u32 *dst;
> -       int i;
> +       struct nf_conn_labels *cl, *master_cl;
> +       bool have_mask = labels_nonzero(mask);
> +
> +       /* Inherit master's labels to the related connection? */
> +       master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL;
> +
> +       if (!master_cl && !have_mask)
> +               return 0;   /* Nothing to do. */
>
>         cl = ovs_ct_get_conn_labels(ct);
>         if (!cl)
>                 return -ENOSPC;
>
> -       dst = (u32 *)cl->bits;
> -       for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
> -               dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
> -                       (labels->ct_labels_32[i] & mask->ct_labels_32[i]);
> +       /* Inherit the master's labels, if any. */
> +       if (master_cl) {
> +               size_t len = sizeof(master_cl->bits);
> +
> +               memcpy(&cl->bits, &master_cl->bits,
> +                      len > OVS_CT_LABELS_LEN ? OVS_CT_LABELS_LEN : len);

Looks like this is another spot where we're trying to handle differing
label lengths, which we could simplify if there was a stronger
guarantee they're the same.

> +       }
> +       if (have_mask) {
> +               u32 *dst = (u32 *)cl->bits;
> +               int i;
> +
> +               for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
> +                       dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
> +                               (labels->ct_labels_32[i]
> +                                & mask->ct_labels_32[i]);
> +       }

By the way, is this open-coding nf_connlabels_replace()? Can
ovs_ct_set_labels() and this share the code?

>
>         memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
>
> @@ -916,13 +935,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>                 if (err)
>                         return err;
>         }
> -       if (labels_nonzero(&info->labels.mask)) {
> -               if (!nf_ct_is_confirmed(ct))
> -                       err = ovs_ct_init_labels(ct, key, &info->labels.value,
> -                                                &info->labels.mask);
> -               else
> -                       err = ovs_ct_set_labels(ct, key, &info->labels.value,
> -                                               &info->labels.mask);
> +       if (!nf_ct_is_confirmed(ct)) {
> +               err = ovs_ct_init_labels(ct, key, &info->labels.value,
> +                                        &info->labels.mask);
> +               if (err)
> +                       return err;
> +       } else if (labels_nonzero(&info->labels.mask)) {
> +               err = ovs_ct_set_labels(ct, key, &info->labels.value,
> +                                       &info->labels.mask);
>                 if (err)
>                         return err;
>         }
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-05 22:28 ` [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted David Miller
  2017-02-06 17:06   ` Pravin Shelar
@ 2017-02-07  0:45   ` Joe Stringer
  1 sibling, 0 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-07  0:45 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev, David Miller

On 5 February 2017 at 14:28, David Miller <davem@davemloft.net> wrote:
> From: Jarno Rajahalme <jarno@ovn.org>
> Date: Thu,  2 Feb 2017 17:10:00 -0800
>
>> This does not match either of the conntrack tuples above.  Normally
>> this does not matter, as the conntrack lookup was already done using
>> the tuple (B,A), but if the current packet does not match any flow in
>> the OVS datapath, the packet is sent to userspace via an upcall,
>> during which the packet's skb is freed, and the conntrack entry
>> pointer in the skb is lost.
>
> This is the real bug.
>
> If the metadata for a packet is important, as it obviously is here,
> then upcalls should preserve that metadata across the upcall.  This
> is exactly how NF_QUEUE handles this problem and so should OVS.

Looks like the patch #5 provides this preservation across upcall, so
this patch can be converted to use the key->ct.orig_* from that patch
instead of doing the invert.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
  2017-02-03  1:10 ` [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key Jarno Rajahalme
@ 2017-02-07  7:15   ` Joe Stringer
  2017-02-07 21:38     ` Joe Stringer
  2017-02-08  5:31     ` Jarno Rajahalme
  0 siblings, 2 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-07  7:15 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
> Add the fields of the conntrack original direction 5-tuple to struct
> sw_flow_key.  The new fields are initially zeroed, and are populated
> whenever a conntrack action is executed and either finds or generates
> a conntrack entry.  This means that these fields exist for all packets
> were not rejected by conntrack as untrackable.
>
> The original tuple fields in the sw_flow_key are filled from the
> original direction tuple of the conntrack entry relating to the
> current packet, or from the original direction tuple of the master
> conntrack entry, if the current conntrack entry has a master.
> Generally, expected connections of connections having an assigned
> helper (e.g., FTP), have a master conntrack entry.
>
> The main purpose of the new conntrack original tuple fields is to
> allow matching on them for policy decision purposes, with the premise
> that the admissibility of tracked connections reply packets (as well
> as original direction packets), and both direction packets of any
> related connections may be based on ACL rules applying to the master
> connection's original direction 5-tuple.  This also makes it easier to
> make policy decisions when the actual packet headers might have been
> transformed by NAT, as the original direction 5-tuple represents the
> packet headers before any such transformation.
>
> When using the original direction 5-tuple the admissibility of return
> and/or related packets need not be based on the mere existence of a
> conntrack entry, allowing separation of admission policy from the
> established conntrack state.  While existence of a conntrack entry is
> required for admission of the return or related packets, policy
> changes can render connections that were initially admitted to be
> rejected or dropped afterwards.  If the admission of the return and
> related packets was based on mere conntrack state (e.g., connection
> being in an established state), a policy change that would make the
> connection rejected or dropped would need to find and delete all
> conntrack entries affected by such a change.  When using the original
> direction 5-tuple matching the affected conntrack entries can be
> allowed to time out instead, as the established state of the
> connection would not need to be the basis for packet admission any
> more.
>
> It should be noted that the directionality of related connections may
> be the same or different than that of the master connection, and
> neither the original direction 5-tuple nor the conntrack state bits
> carry this information.  If needed, the directionality of the master
> connection can be stored in master's conntrack mark or labels, which
> are automatically inherited by the expected related connections.
>
> The fact that neither ARP not ND packets are trackable by conntrack
> allows mutual exclusion between ARP/ND and the new conntrack original
> tuple fields.  Hence, the IP addresses are overlaid in union with ARP
> and ND fields.  This allows the sw_flow_key to not grow much due to
> this patch, but it also means that we must be careful to never use the
> new key fields with ARP or ND packets.  ARP is easy to distinguish and
> keep mutually exclusive based on the ethernet type, but ND being an
> ICMPv6 protocol requires a bit more attention.
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> ---

OK, maybe we need to do something a bit more to handle the NATed
related connections to address the problem in patch 1.

<snip>

> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 738a4fa..1afe153 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -155,6 +155,59 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
>         key->ct.zone = zone->id;
>         key->ct.mark = ovs_ct_get_mark(ct);
>         ovs_ct_get_labels(ct, &key->ct.labels);
> +
> +       /* Use the master if we have one. */
> +       if (ct && ct->master)
> +               ct = ct->master;

Perhaps:

if (!ct || sw_flow_key_is_nd(key) || !is_ip_any(key->eth.type)) {
    /* zero everything */
    return;
}

One of the things this helps us to avoid is having a comment in the
middle of an if statement.

Then afterwards,
if (ct->master)
    ct = ct->master;

> +
> +       key->ct.orig_proto = 0;
> +       key->ct.orig_tp.src = 0;
> +       key->ct.orig_tp.dst = 0;
> +       if (key->eth.type == htons(ETH_P_IP)) {
> +               /* IP version must match. */
> +               if (ct && nf_ct_l3num(ct) == NFPROTO_IPV4) {

I don't quite understand how we could end up with a connection NFPROTO
that is mismatched with an IP version that we should handle here, but
if there are some legitimite cases perhaps we can pick them up and
handle them in the early exit condition above?

We can probably share a few more lines between IPv4 and IPv6 here.

> @@ -208,24 +261,54 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key)
>         ovs_ct_update_key(skb, NULL, key, false, false);
>  }
>
> -int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
> +#define IN6_ADDR_INITIALIZER(ADDR) \
> +       { (ADDR).s6_addr32[0], (ADDR).s6_addr32[1], \
> +         (ADDR).s6_addr32[2], (ADDR).s6_addr32[3] }

scripts/checkpatch.pl asks:

CHECK: Macro argument reuse 'ADDR' - possible side-effects?
#704: FILE: net/openvswitch/conntrack.c:264:
+#define IN6_ADDR_INITIALIZER(ADDR) \
+       { (ADDR).s6_addr32[0], (ADDR).s6_addr32[1], \
+         (ADDR).s6_addr32[2], (ADDR).s6_addr32[3] }

I'm guessing it just doesn't like your use of the name 'ADDR'.

> +       if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct.orig_proto) {
> +               struct ovs_key_ct_tuple_ipv4 orig = {
> +                       output->ipv4.ct_orig.src,
> +                       output->ipv4.ct_orig.dst,
> +                       output->ct.orig_tp.src,
> +                       output->ct.orig_tp.dst,
> +                       output->ct.orig_proto,
> +               };
> +               if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig),
> +                           &orig))
> +                       return -EMSGSIZE;
> +       } else if (swkey->eth.type == htons(ETH_P_IPV6) &&
> +                  swkey->ct.orig_proto) {
> +               struct ovs_key_ct_tuple_ipv6 orig = {
> +                       IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.src),
> +                       IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst),
> +                       output->ct.orig_tp.src,
> +                       output->ct.orig_tp.dst,
> +                       output->ct.orig_proto,
> +               };
> +               if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig),
> +                           &orig))
> +                       return -EMSGSIZE;
> +       }

swkey->ct.orig_proto check is common across both conditions, maybe
check that first then nest the other logic within?

> +
>         return 0;
>  }
>
> diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
> index 8f6230b..8573ab3 100644
> --- a/net/openvswitch/conntrack.h
> +++ b/net/openvswitch/conntrack.h
> @@ -32,7 +32,8 @@ int ovs_ct_execute(struct net *, struct sk_buff *, struct sw_flow_key *,
>                    const struct ovs_conntrack_info *);
>
>  void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key);
> -int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb);
> +int ovs_ct_put_key(const struct sw_flow_key *swkey,
> +                  const struct sw_flow_key *output, struct sk_buff *skb);
>  void ovs_ct_free_action(const struct nlattr *a);
>
>  #define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
> @@ -79,9 +80,17 @@ static inline void ovs_ct_fill_key(const struct sk_buff *skb,
>         key->ct.zone = 0;
>         key->ct.mark = 0;
>         memset(&key->ct.labels, 0, sizeof(key->ct.labels));
> +       key->ct.orig_proto = 0;
> +       key->ct.orig_tp.src = 0;
> +       key->ct.orig_tp.dst = 0;
> +       if (key->eth.type == htons(ETH_P_IP))
> +               memset(&key->ipv4.ct_orig, 0, sizeof(key->ipv4.ct_orig));
> +       else if (key->eth.type == htons(ETH_P_IPV6) && !sw_flow_key_is_nd(key))
> +               memset(&key->ipv6.ct_orig, 0, sizeof(key->ipv6.ct_orig));

If we only ever use these fields based on key->ct.orig_proto being
nonzero, then we m

<snip>

> +static inline bool sw_flow_key_is_nd(const struct sw_flow_key *key)
> +{
> +       return key->eth.type == htons(ETH_P_IPV6) &&
> +               key->ip.proto == NEXTHDR_ICMP &&
> +               key->tp.dst == 0 &&
> +               (key->tp.src == ntohs(NDISC_NEIGHBOUR_SOLICITATION) ||
> +                key->tp.src == ntohs(NDISC_NEIGHBOUR_ADVERTISEMENT));

Sparse complains:

net/openvswitch/flow.h:163:33: warning: cast to restricted __be16
net/openvswitch/flow.h:163:25: warning: restricted __be16 degrades to integer
net/openvswitch/flow.h:164:33: warning: cast to restricted __be16
net/openvswitch/flow.h:164:25: warning: restricted __be16 degrades to integer

> @@ -161,8 +163,11 @@ static bool match_validate(const struct sw_flow_match *match,
>
>         if (match->key->eth.type == htons(ETH_P_IP)) {
>                 key_expected |= 1 << OVS_KEY_ATTR_IPV4;
> -               if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
> +               if (match->mask &&
> +                   (match->mask->key.eth.type == htons(0xffff))) {

Minor nit, can probably drop the parentheses and keep this to one line?

> @@ -1430,6 +1476,10 @@ int ovs_nla_get_match(struct net *net, struct sw_flow_match *match,
>                         goto free_newmask;
>         }
>
> +       /* This also checks that conntrack original direction metadata exists
> +        * only for packets for which it makes sense.  Otherwise the key may be
> +        * corrupted due to overlapping key fields.
> +        */
>         if (!match_validate(match, key_attrs, mask_attrs, log))
>                 err = -EINVAL;
>

I believe that this is already explained in the relevant portion of
match_validate(), not sure it's worth highlighting that particular
piece of validation above a call to such a general function.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
  2017-02-03  1:10 ` [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key Jarno Rajahalme
@ 2017-02-07  7:15   ` Joe Stringer
  2017-02-08  1:11     ` Jarno Rajahalme
  2017-02-08  5:31     ` Jarno Rajahalme
  0 siblings, 2 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-07  7:15 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
> struct sw_flow_key has two 16-bit holes. Move the most matched
> conntrack match fields there.  In some typical cases this reduces the
> size of the key that needs to be hashed into half and into one cache
> line.
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

Looks like this misses the zeroing in ovs_nla_get_flow_metadata();
might want to double-check for any other memset/copies of the key->ct
field.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-06 17:08   ` Pravin Shelar
@ 2017-02-07  7:28     ` Joe Stringer
  2017-02-07 17:14       ` Pravin Shelar
  0 siblings, 1 reply; 31+ messages in thread
From: Joe Stringer @ 2017-02-07  7:28 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Jarno Rajahalme, Linux Kernel Network Developers

On 6 February 2017 at 09:08, Pravin Shelar <pshelar@ovn.org> wrote:
> On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
>> Stateful network admission policy may allow connections to one
>> direction and reject connections initiated in the other direction.
>> After policy change it is possible that for a new connection an
>> overlapping conntrack entry already exist, where the connection
>> original direction is opposed to the new connection's initial packet.
>>
>> Most importantly, conntrack state relating to the current packet gets
>> the "reply" designation based on whether the original direction tuple
>> or the reply direction tuple matched.  If this "directionality" is
>> wrong w.r.t. to the stateful network admission policy it may happen
>> that packets in neither direction are correctly admitted.
>>
> Why not have the check in all commit actions? I am not sure in which
> case user would not want forced commit considering this can cause
> packet admission issue?

Seems like this case has involved one direction of a connection being
handled by a flow that committed the connection. Then something has
changed and you end up with a flow handling the opposite direction,
committing the connection. What if the first flow wasn't actually
removed? Plausibly you could end up with constant ct entry churn as
the connection is recreated each time there is a packet from an
alternating direction. Having a separate flag may assist with respect
to shooting one's own foot..

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-06 17:15     ` David Miller
@ 2017-02-07 17:14       ` Pravin Shelar
  2017-02-07 21:29         ` Jarno Rajahalme
  0 siblings, 1 reply; 31+ messages in thread
From: Pravin Shelar @ 2017-02-07 17:14 UTC (permalink / raw)
  To: David Miller; +Cc: Jarno Rajahalme, Linux Kernel Network Developers

On Mon, Feb 6, 2017 at 9:15 AM, David Miller <davem@davemloft.net> wrote:
> From: Pravin Shelar <pshelar@ovn.org>
> Date: Mon, 6 Feb 2017 09:06:29 -0800
>
>> On Sun, Feb 5, 2017 at 2:28 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Jarno Rajahalme <jarno@ovn.org>
>>> Date: Thu,  2 Feb 2017 17:10:00 -0800
>>>
>>>> This does not match either of the conntrack tuples above.  Normally
>>>> this does not matter, as the conntrack lookup was already done using
>>>> the tuple (B,A), but if the current packet does not match any flow in
>>>> the OVS datapath, the packet is sent to userspace via an upcall,
>>>> during which the packet's skb is freed, and the conntrack entry
>>>> pointer in the skb is lost.
>>>
>>> This is the real bug.
>>>
>>> If the metadata for a packet is important, as it obviously is here,
>>> then upcalls should preserve that metadata across the upcall.  This
>>> is exactly how NF_QUEUE handles this problem and so should OVS.
>>
>> OVS kernel datapath serializes skb context and sends it along with skb
>> during upcall via genetlink parameters. userspace echoes same
>> information back to kernel modules. This way OVS does maintains
>> metadata across upcall.
>
> Then the conntrack state looked up before the NAT operation done in
> the upcall should be maintained and therefore this bug should not
> exist.

I think it fails because after upcall OVS is performing lookup for
nonexistent conntrack entry. This is due to the fact that the packet
is already Nat-ed. So one could argue that there is already enough
context available in OVS to lookup original conntract entry after the
upcall as shown in this patch.
But I am also fine with using original context to lookup the conntract
entry as Joe has suggested.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-07  7:28     ` Joe Stringer
@ 2017-02-07 17:14       ` Pravin Shelar
  0 siblings, 0 replies; 31+ messages in thread
From: Pravin Shelar @ 2017-02-07 17:14 UTC (permalink / raw)
  To: Joe Stringer; +Cc: Jarno Rajahalme, Linux Kernel Network Developers

On Mon, Feb 6, 2017 at 11:28 PM, Joe Stringer <joe@ovn.org> wrote:
> On 6 February 2017 at 09:08, Pravin Shelar <pshelar@ovn.org> wrote:
>> On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
>>> Stateful network admission policy may allow connections to one
>>> direction and reject connections initiated in the other direction.
>>> After policy change it is possible that for a new connection an
>>> overlapping conntrack entry already exist, where the connection
>>> original direction is opposed to the new connection's initial packet.
>>>
>>> Most importantly, conntrack state relating to the current packet gets
>>> the "reply" designation based on whether the original direction tuple
>>> or the reply direction tuple matched.  If this "directionality" is
>>> wrong w.r.t. to the stateful network admission policy it may happen
>>> that packets in neither direction are correctly admitted.
>>>
>> Why not have the check in all commit actions? I am not sure in which
>> case user would not want forced commit considering this can cause
>> packet admission issue?
>
> Seems like this case has involved one direction of a connection being
> handled by a flow that committed the connection. Then something has
> changed and you end up with a flow handling the opposite direction,
> committing the connection. What if the first flow wasn't actually
> removed? Plausibly you could end up with constant ct entry churn as
> the connection is recreated each time there is a packet from an
> alternating direction. Having a separate flag may assist with respect
> to shooting one's own foot..

I see. Thanks for explanation.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-07 17:14       ` Pravin Shelar
@ 2017-02-07 21:29         ` Jarno Rajahalme
  0 siblings, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-07 21:29 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: David Miller, Linux Kernel Network Developers


> On Feb 7, 2017, at 9:14 AM, Pravin Shelar <pshelar@ovn.org> wrote:
> 
> On Mon, Feb 6, 2017 at 9:15 AM, David Miller <davem@davemloft.net> wrote:
>> From: Pravin Shelar <pshelar@ovn.org>
>> Date: Mon, 6 Feb 2017 09:06:29 -0800
>> 
>>> On Sun, Feb 5, 2017 at 2:28 PM, David Miller <davem@davemloft.net> wrote:
>>>> From: Jarno Rajahalme <jarno@ovn.org>
>>>> Date: Thu,  2 Feb 2017 17:10:00 -0800
>>>> 
>>>>> This does not match either of the conntrack tuples above.  Normally
>>>>> this does not matter, as the conntrack lookup was already done using
>>>>> the tuple (B,A), but if the current packet does not match any flow in
>>>>> the OVS datapath, the packet is sent to userspace via an upcall,
>>>>> during which the packet's skb is freed, and the conntrack entry
>>>>> pointer in the skb is lost.
>>>> 
>>>> This is the real bug.
>>>> 
>>>> If the metadata for a packet is important, as it obviously is here,
>>>> then upcalls should preserve that metadata across the upcall.  This
>>>> is exactly how NF_QUEUE handles this problem and so should OVS.
>>> 
>>> OVS kernel datapath serializes skb context and sends it along with skb
>>> during upcall via genetlink parameters. userspace echoes same
>>> information back to kernel modules. This way OVS does maintains
>>> metadata across upcall.
>> 
>> Then the conntrack state looked up before the NAT operation done in
>> the upcall should be maintained and therefore this bug should not
>> exist.
> 

We already maintain enough metadata across the userspace upcall, but so far we have failed to use it correctly in the case of NATted packets. The NAT flags are there, based on which we (now) know that we have to do inverted lookup to locate the conntrack entry. Conntrack NAT by design only stores tuples for the incoming packets, and there are various instances of tuple inversions in netfilter code for this reason (see, for example, net/netfilter/nf_nat_core.c:nf_nat_used_tuple()). I’ll make the commit message clearer about this for the v2 of the series.

> I think it fails because after upcall OVS is performing lookup for
> nonexistent conntrack entry. This is due to the fact that the packet
> is already Nat-ed. So one could argue that there is already enough
> context available in OVS to lookup original conntract entry after the
> upcall as shown in this patch.
> But I am also fine with using original context to lookup the conntract
> entry as Joe has suggested.

For expected (i.e., related) connection, using the master’s original direction 5-tuple added to the flow meta-data in patch 5 would find the master connection, not the related connection, so it would not work.

  Jarno

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
  2017-02-07  7:15   ` Joe Stringer
@ 2017-02-07 21:38     ` Joe Stringer
  2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 0 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-07 21:38 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 6 February 2017 at 23:15, Joe Stringer <joe@ovn.org> wrote:
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> @@ -79,9 +80,17 @@ static inline void ovs_ct_fill_key(const struct sk_buff *skb,
>>         key->ct.zone = 0;
>>         key->ct.mark = 0;
>>         memset(&key->ct.labels, 0, sizeof(key->ct.labels));
>> +       key->ct.orig_proto = 0;
>> +       key->ct.orig_tp.src = 0;
>> +       key->ct.orig_tp.dst = 0;
>> +       if (key->eth.type == htons(ETH_P_IP))
>> +               memset(&key->ipv4.ct_orig, 0, sizeof(key->ipv4.ct_orig));
>> +       else if (key->eth.type == htons(ETH_P_IPV6) && !sw_flow_key_is_nd(key))
>> +               memset(&key->ipv6.ct_orig, 0, sizeof(key->ipv6.ct_orig));
>
> If we only ever use these fields based on key->ct.orig_proto being
> nonzero, then we m

I think I had my email opened in two windows so truncated this
sentence somehow. I was wondering whether we can just initialize
key->ct.orig_proto and only init the others on demand.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-03  1:10 ` [PATCH net-next 6/7] openvswitch: Add force commit Jarno Rajahalme
  2017-02-06 17:08   ` Pravin Shelar
@ 2017-02-07 22:15   ` Joe Stringer
       [not found]     ` <5B795D0B-4C7F-4297-BA2A-6BE3444033D0@ovn.org>
  2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 2 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-07 22:15 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
> Stateful network admission policy may allow connections to one
> direction and reject connections initiated in the other direction.
> After policy change it is possible that for a new connection an
> overlapping conntrack entry already exist, where the connection
> original direction is opposed to the new connection's initial packet.
>
> Most importantly, conntrack state relating to the current packet gets
> the "reply" designation based on whether the original direction tuple
> or the reply direction tuple matched.  If this "directionality" is
> wrong w.r.t. to the stateful network admission policy it may happen
> that packets in neither direction are correctly admitted.
>
> This patch adds a new "force commit" option to the OVS conntrack
> action that checks the original direction of an existing conntrack
> entry.  If that direction is opposed to the current packet, the
> existing conntrack entry is deleted and a new one is subsequently
> created in the correct direction.
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> ---
>  include/uapi/linux/openvswitch.h | 10 ++++++++++
>  net/openvswitch/conntrack.c      | 27 +++++++++++++++++++++++++--
>  2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
> index 90af8b8..d5ba9a9 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -674,6 +674,10 @@ struct ovs_action_hash {
>   * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
>   * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address
>   * translation (NAT) on the packet.
> + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing
> + * nothing if the connection is already committed will check that the current
> + * packet is in conntrack entry's original direction.  If directionality does
> + * not match, will delete the existing conntrack entry and commit a new one.
>   */
>  enum ovs_ct_attr {
>         OVS_CT_ATTR_UNSPEC,
> @@ -684,6 +688,12 @@ enum ovs_ct_attr {
>         OVS_CT_ATTR_HELPER,     /* netlink helper to assist detection of
>                                    related connections. */
>         OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
> +       OVS_CT_ATTR_FORCE_COMMIT,  /* No argument, commits connection.  If the
> +                                   * conntrack entry original direction tuple
> +                                   * does not match the current packet header
> +                                   * values, will delete the current conntrack
> +                                   * entry and create a new one.
> +                                   */

We only need one copy of the explanation, keep it above the enum, then
the inline comment can be /* No argument */.

>         __OVS_CT_ATTR_MAX
>  };
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 1afe153..1f27f44 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -65,6 +65,7 @@ struct ovs_conntrack_info {
>         struct nf_conn *ct;
>         u8 commit : 1;
>         u8 nat : 3;                 /* enum ovs_ct_nat */
> +       u8 force : 1;
>         u16 family;
>         struct md_mark mark;
>         struct md_labels labels;
> @@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net,
>          */
>         if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
>             !(key->ct.state & OVS_CS_F_INVALID) &&
> -           key->ct.zone == info->zone.id)
> +           key->ct.zone == info->zone.id) {
>                 ct = ovs_ct_find_existing(net, &info->zone, info->family, skb,
>                                           !!(key->ct.state
>                                              & OVS_CS_F_NAT_MASK));
> +               if (ct)
> +                       nf_ct_get(skb, &ctinfo);
> +       }

If ctinfo is only used with the new call below, we can unconditionally
fetch this just before it's used...

>         if (!ct)
>                 return false;
>         if (!net_eq(net, read_pnet(&ct->ct_net)))
> @@ -648,6 +652,19 @@ static bool skb_nfct_cached(struct net *net,
>                 if (help && rcu_access_pointer(help->helper) != info->helper)
>                         return false;
>         }
> +       /* Force conntrack entry direction to the current packet? */

Here.

> +       if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) {
> +               /* Delete the conntrack entry if confirmed, else just release
> +                * the reference.
> +                */
> +               if (nf_ct_is_confirmed(ct))
> +                       nf_ct_delete(ct, 0, 0);
> +               else
> +                       nf_ct_put(ct);

We've already ensured that ct is non-NULL, we can use
nf_conntrack_put() instead.

> +               skb->nfct = NULL;
> +               skb->nfctinfo = 0;
> +               return false;
> +       }
>
>         return true;
>  }
> @@ -1226,6 +1243,7 @@ static int parse_nat(const struct nlattr *attr,
>
>  static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
>         [OVS_CT_ATTR_COMMIT]    = { .minlen = 0, .maxlen = 0 },
> +       [OVS_CT_ATTR_FORCE_COMMIT]      = { .minlen = 0, .maxlen = 0 },
>         [OVS_CT_ATTR_ZONE]      = { .minlen = sizeof(u16),
>                                     .maxlen = sizeof(u16) },
>         [OVS_CT_ATTR_MARK]      = { .minlen = sizeof(struct md_mark),
> @@ -1265,6 +1283,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
>                 }
>
>                 switch (type) {
> +               case OVS_CT_ATTR_FORCE_COMMIT:
> +                       info->force = true;
> +                       /* fall through. */
>                 case OVS_CT_ATTR_COMMIT:
>                         info->commit = true;
>                         break;
> @@ -1491,7 +1512,9 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
>         if (!start)
>                 return -EMSGSIZE;
>
> -       if (ct_info->commit && nla_put_flag(skb, OVS_CT_ATTR_COMMIT))
> +       if (ct_info->commit && nla_put_flag(skb, ct_info->force
> +                                           ? OVS_CT_ATTR_FORCE_COMMIT
> +                                           : OVS_CT_ATTR_COMMIT))
>                 return -EMSGSIZE;
>         if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
>             nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone.id))
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
  2017-02-07  7:15   ` Joe Stringer
@ 2017-02-08  1:11     ` Jarno Rajahalme
  2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  1:11 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev


> On Feb 6, 2017, at 11:15 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> struct sw_flow_key has two 16-bit holes. Move the most matched
>> conntrack match fields there.  In some typical cases this reduces the
>> size of the key that needs to be hashed into half and into one cache
>> line.
>> 
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> Looks like this misses the zeroing in ovs_nla_get_flow_metadata();
> might want to double-check for any other memset/copies of the key->ct
> field.

Good catch. Looked, there are no other places to change.

Will rebase to current net-next and repost.

  Jarno

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
       [not found]     ` <5B795D0B-4C7F-4297-BA2A-6BE3444033D0@ovn.org>
@ 2017-02-08  1:32       ` Joe Stringer
  0 siblings, 0 replies; 31+ messages in thread
From: Joe Stringer @ 2017-02-08  1:32 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev

On 7 February 2017 at 17:03, Jarno Rajahalme <jarno@ovn.org> wrote:
>
> On Feb 7, 2017, at 2:15 PM, Joe Stringer <joe@ovn.org> wrote:
>
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>
> Stateful network admission policy may allow connections to one
> direction and reject connections initiated in the other direction.
> After policy change it is possible that for a new connection an
> overlapping conntrack entry already exist, where the connection
> original direction is opposed to the new connection's initial packet.
>
> Most importantly, conntrack state relating to the current packet gets
> the "reply" designation based on whether the original direction tuple
> or the reply direction tuple matched.  If this "directionality" is
> wrong w.r.t. to the stateful network admission policy it may happen
> that packets in neither direction are correctly admitted.
>
> This patch adds a new "force commit" option to the OVS conntrack
> action that checks the original direction of an existing conntrack
> entry.  If that direction is opposed to the current packet, the
> existing conntrack entry is deleted and a new one is subsequently
> created in the correct direction.
>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> ---
> include/uapi/linux/openvswitch.h | 10 ++++++++++
> net/openvswitch/conntrack.c      | 27 +++++++++++++++++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/openvswitch.h
> b/include/uapi/linux/openvswitch.h
> index 90af8b8..d5ba9a9 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -674,6 +674,10 @@ struct ovs_action_hash {
>  * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
>  * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address
>  * translation (NAT) on the packet.
> + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of
> doing
> + * nothing if the connection is already committed will check that the
> current
> + * packet is in conntrack entry's original direction.  If directionality
> does
> + * not match, will delete the existing conntrack entry and commit a new
> one.
>  */
> enum ovs_ct_attr {
>        OVS_CT_ATTR_UNSPEC,
> @@ -684,6 +688,12 @@ enum ovs_ct_attr {
>        OVS_CT_ATTR_HELPER,     /* netlink helper to assist detection of
>                                   related connections. */
>        OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
> +       OVS_CT_ATTR_FORCE_COMMIT,  /* No argument, commits connection.  If
> the
> +                                   * conntrack entry original direction
> tuple
> +                                   * does not match the current packet
> header
> +                                   * values, will delete the current
> conntrack
> +                                   * entry and create a new one.
> +                                   */
>
>
> We only need one copy of the explanation, keep it above the enum, then
> the inline comment can be /* No argument */.
>
>
> OK.
>
>        __OVS_CT_ATTR_MAX
> };
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 1afe153..1f27f44 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -65,6 +65,7 @@ struct ovs_conntrack_info {
>        struct nf_conn *ct;
>        u8 commit : 1;
>        u8 nat : 3;                 /* enum ovs_ct_nat */
> +       u8 force : 1;
>        u16 family;
>        struct md_mark mark;
>        struct md_labels labels;
> @@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net,
>         */
>        if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
>            !(key->ct.state & OVS_CS_F_INVALID) &&
> -           key->ct.zone == info->zone.id)
> +           key->ct.zone == info->zone.id) {
>                ct = ovs_ct_find_existing(net, &info->zone, info->family,
> skb,
>                                          !!(key->ct.state
>                                             & OVS_CS_F_NAT_MASK));
> +               if (ct)
> +                       nf_ct_get(skb, &ctinfo);
> +       }
>
>
> If ctinfo is only used with the new call below, we can unconditionally
> fetch this just before it's used...
>
>        if (!ct)
>                return false;
>        if (!net_eq(net, read_pnet(&ct->ct_net)))
> @@ -648,6 +652,19 @@ static bool skb_nfct_cached(struct net *net,
>                if (help && rcu_access_pointer(help->helper) != info->helper)
>                        return false;
>        }
> +       /* Force conntrack entry direction to the current packet? */
>
>
> Here.
>
>
> But then we would be executing nf_ct_get() twice in the common case?

Ah, fair enough. It's fine here.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted.
  2017-02-06 17:07 ` Pravin Shelar
@ 2017-02-08  5:30   ` Jarno Rajahalme
  0 siblings, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:30 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers


> On Feb 6, 2017, at 9:07 AM, Pravin Shelar <pshelar@ovn.org> wrote:
> 
> On Thu, Feb 2, 2017 at 5:10 PM, Jarno Rajahalme <jarno@ovn.org> wrote:
>> When looking for an existing conntrack entry, the packet 5-tuple
>> must be inverted if NAT has already been applied, as the current
>> packet headers do not match any conntrack tuple.  For
>> example, if a packet from private address X to a public address B is
>> source-NATted to A, the conntrack entry will have the following tuples
>> (ignoring the protocol and port numbers) after the conntrack entry is
>> committed:
>> 
>> Original direction tuple: (X,B)
>> Reply direction tuple: (B,A)
>> 
>> Now, if a reply packet is already transformed back to the private
>> address space (e.g., with a CT(nat) action), the tuple corresponding
>> to the current packet headers is:
>> 
>> Current packet tuple: (B,X)
>> 
>> This does not match either of the conntrack tuples above.  Normally
>> this does not matter, as the conntrack lookup was already done using
>> the tuple (B,A), but if the current packet does not match any flow in
>> the OVS datapath, the packet is sent to userspace via an upcall,
>> during which the packet's skb is freed, and the conntrack entry
>> pointer in the skb is lost.  When the packet is reintroduced to the
>> datapath, any further conntrack action will need to perform a new
>> conntrack lookup to find the entry again.  Prior to this patch this
>> second lookup failed for NATted packets.  The datapath flow setup
>> corresponding to the upcall can succeed, however, allowing all further
>> packets in the reply direction to re-use the conntrack entry pointer
>> in the skb, so typically the lookup failure only causes a packet drop.
>> 
>> The solution is to invert the tuple derived from the current packet
>> headers in case the conntrack state stored in the packet metadata
>> indicates that the packet has been transformed by NAT:
>> 
>> Inverted tuple: (X,B)
>> 
>> With this the conntrack entry can be found, matching the original
>> direction tuple.
>> 
>> This same logic also works for the original direction packets:
>> 
>> Current packet tuple (after NAT): (A,B)
>> Inverted tuple: (B,A)
>> 
>> While the current packet tuple (A,B) does not match either of the
>> conntrack tuples, the inverted one (B,A) does match the reply
>> direction tuple.
>> 
>> Since the inverted tuple matches the reverse direction tuple the
>> direction of the packet must be reversed as well.
>> 
>> Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> I could not apply this patch series to net-next branch. But it does
> applies to net, which branch are you targeting it for?

The patches were against net-next, but there likely was a merge from netfilter around the time of me sending the email out causing the difficulty. Will address all comments, rebase and post a v2 later today.

 Jarno

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection.
  2017-02-06 21:46   ` Joe Stringer
@ 2017-02-08  5:30     ` Jarno Rajahalme
  0 siblings, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:30 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev

Thanks for the review! Comments below,

  Jarno

> On Feb 6, 2017, at 1:46 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> Avoid triggering change events for setting conntrack mark or labels
>> before the conntrack entry has been confirmed.  Refactoring on this
>> patch also makes chenges in later patches easier to review.
>> 
>> Fixes: 182e3042e15d ("openvswitch: Allow matching on conntrack mark")
>> Fixes: c2ac66735870 ("openvswitch: Allow matching on conntrack label")
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> Functional and cosmetic changes should be in separate patches.
> 

OK, will split.

>> ---
>> net/openvswitch/conntrack.c | 87 ++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 63 insertions(+), 24 deletions(-)
>> 
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index 6730f09..a163c44 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -229,23 +229,17 @@ int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
>>        return 0;
>> }
>> 
>> -static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
>> +static int ovs_ct_set_mark(struct nf_conn *ct, struct sw_flow_key *key,
>>                           u32 ct_mark, u32 mask)
>> {
>> #if IS_ENABLED(CONFIG_NF_CONNTRACK_MARK)
>> -       enum ip_conntrack_info ctinfo;
>> -       struct nf_conn *ct;
>>        u32 new_mark;
>> 
>> -       /* The connection could be invalid, in which case set_mark is no-op. */
>> -       ct = nf_ct_get(skb, &ctinfo);
>> -       if (!ct)
>> -               return 0;
>> -
>>        new_mark = ct_mark | (ct->mark & ~(mask));
>>        if (ct->mark != new_mark) {
>>                ct->mark = new_mark;
>> -               nf_conntrack_event_cache(IPCT_MARK, ct);
>> +               if (nf_ct_is_confirmed(ct))
>> +                       nf_conntrack_event_cache(IPCT_MARK, ct);
>>                key->ct.mark = new_mark;
>>        }
>> 
>> @@ -255,26 +249,59 @@ static int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
>> #endif
>> }
>> 
>> -static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
>> -                            const struct ovs_key_ct_labels *labels,
>> -                            const struct ovs_key_ct_labels *mask)
>> +static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
>> {
>> -       enum ip_conntrack_info ctinfo;
>>        struct nf_conn_labels *cl;
>> -       struct nf_conn *ct;
>> -       int err;
>> -
>> -       /* The connection could be invalid, in which case set_label is no-op.*/
>> -       ct = nf_ct_get(skb, &ctinfo);
>> -       if (!ct)
>> -               return 0;
>> 
>>        cl = nf_ct_labels_find(ct);
>>        if (!cl) {
>>                nf_ct_labels_ext_add(ct);
>>                cl = nf_ct_labels_find(ct);
>>        }
>> +
>>        if (!cl || sizeof(cl->bits) < OVS_CT_LABELS_LEN)
>> +               return NULL;
>> +
>> +       return cl;
>> +}
>> +
>> +/* Initialize labels for a new, to be committed conntrack entry.  Note that
>> + * since the new connection is not yet confirmed, and thus no-one else has
>> + * access to it's labels, we simply write them over.  Also, we refrain from
>> + * triggering events, as receiving change events before the create event would
>> + * be confusing.
>> + */
>> +static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
>> +                             const struct ovs_key_ct_labels *labels,
>> +                             const struct ovs_key_ct_labels *mask)
>> +{
>> +       struct nf_conn_labels *cl;
>> +       u32 *dst;
>> +       int i;
>> +
>> +       cl = ovs_ct_get_conn_labels(ct);
>> +       if (!cl)
>> +               return -ENOSPC;
>> +
>> +       dst = (u32 *)cl->bits;
> 
> Is it worth extending the union to include unsigned long, to avoid
> casting it to u32 here?
> 

This cast is on the struct nf_conn_labels, I would not unionize it at this point. This type of cast is typical in conntrack code.

>> +       for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
>> +               dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
>> +                       (labels->ct_labels_32[i] & mask->ct_labels_32[i]);
>> +
>> +       memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
>> +
>> +       return 0;
>> +}
>> +
>> +static int ovs_ct_set_labels(struct nf_conn *ct, struct sw_flow_key *key,
>> +                            const struct ovs_key_ct_labels *labels,
>> +                            const struct ovs_key_ct_labels *mask)
>> +{
>> +       struct nf_conn_labels *cl;
>> +       int err;
>> +
>> +       cl = ovs_ct_get_conn_labels(ct);
>> +       if (!cl)
>>                return -ENOSPC;
>> 
>>        err = nf_connlabels_replace(ct, labels->ct_labels_32,
>> @@ -283,7 +310,8 @@ static int ovs_ct_set_labels(struct sk_buff *skb, struct sw_flow_key *key,
>>        if (err)
>>                return err;
>> 
>> -       ovs_ct_get_labels(ct, &key->ct.labels);
>> +       memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
>> +
> 
> I noticed this change and started looking at ovs_ct_get_labels(). It
> tries to handle length differences between nf_conn_labels matches
> ovs_key_ct_labels, but perhaps it's easier to drop that code and use a
> build-time assert that they're the same instead. Since 23014011ba42
> ("netfilter: conntrack: support a fixed size of 128 distinct labels"),
> they haven't been variable length anyway so this can simplify some
> code. This can be separate from this patch though.
> 
> I think that such a change would be orthogonal from the above change,
> the above can stay as is (not much point sharing the function call if
> it just retrieves a pointer we already have and does a memcpy).
> 

I kept this change with the refactoring patch, not with the functional change patch.

>>        return 0;
>> }
>> 
>> @@ -865,25 +893,36 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>>                         const struct ovs_conntrack_info *info,
>>                         struct sk_buff *skb)
>> {
>> +       enum ip_conntrack_info ctinfo;
>> +       struct nf_conn *ct;
>>        int err;
>> 
>>        err = __ovs_ct_lookup(net, key, info, skb);
>>        if (err)
>>                return err;
>> 
>> +       /* The connection could be invalid, in which case this is a no-op.*/
>> +       ct = nf_ct_get(skb, &ctinfo);
>> +       if (!ct)
>> +               return 0;
>> +
>>        /* Apply changes before confirming the connection so that the initial
>>         * conntrack NEW netlink event carries the values given in the CT
>>         * action.
>>         */
>>        if (info->mark.mask) {
>> -               err = ovs_ct_set_mark(skb, key, info->mark.value,
>> +               err = ovs_ct_set_mark(ct, key, info->mark.value,
>>                                      info->mark.mask);
>>                if (err)
>>                        return err;
>>        }
>>        if (labels_nonzero(&info->labels.mask)) {
>> -               err = ovs_ct_set_labels(skb, key, &info->labels.value,
>> -                                       &info->labels.mask);
>> +               if (!nf_ct_is_confirmed(ct))
>> +                       err = ovs_ct_init_labels(ct, key, &info->labels.value,
>> +                                                &info->labels.mask);
>> +               else
>> +                       err = ovs_ct_set_labels(ct, key, &info->labels.value,
>> +                                               &info->labels.mask);
>>                if (err)
>>                        return err;
>>        }
>> --
>> 2.1.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 4/7] openvswitch: Inherit master's labels.
  2017-02-06 21:53   ` Joe Stringer
@ 2017-02-08  5:31     ` Jarno Rajahalme
  0 siblings, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:31 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev


> On Feb 6, 2017, at 1:53 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> We avoid calling into nf_conntrack_in() for expected connections, as
>> that would remove the expectation that we want to stick around until
>> we are ready to commit the connection.  Instead, we do a lookup in the
>> expectation table directly.  However, after a successful expectation
>> lookup we have set the flow key label field from the master
>> connection, whereas nf_conntrack_in() does not do this.  This leads to
>> master's labels being iherited after an expectation lookup, but those
>> labels not being inherited after the corresponding conntrack action
>> with a commit flag.
>> 
>> This patch resolves the problem by changing the commit code path to
>> also inherit the master's labels to the expected connection.
>> Resolving this conflict in favor or inheriting the labels allows
>> information be passed from the master connection to related
>> connections, which would otherwise be much harder.  Labels can still
>> be set explicitly, so this change only affects the default values of
>> the labels in presense of a master connection.
>> 
>> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> ---
>> net/openvswitch/conntrack.c | 48 ++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 34 insertions(+), 14 deletions(-)
>> 
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index a163c44..738a4fa 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -265,6 +265,8 @@ static struct nf_conn_labels *ovs_ct_get_conn_labels(struct nf_conn *ct)
>>        return cl;
>> }
>> 
>> +static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
>> +
> 
> These declarations typically live at the top of the file, not
> somewhere in the middle.
> 

Moved.

>> /* Initialize labels for a new, to be committed conntrack entry.  Note that
>>  * since the new connection is not yet confirmed, and thus no-one else has
>>  * access to it's labels, we simply write them over.  Also, we refrain from
>> @@ -275,18 +277,35 @@ static int ovs_ct_init_labels(struct nf_conn *ct, struct sw_flow_key *key,
>>                              const struct ovs_key_ct_labels *labels,
>>                              const struct ovs_key_ct_labels *mask)
>> {
>> -       struct nf_conn_labels *cl;
>> -       u32 *dst;
>> -       int i;
>> +       struct nf_conn_labels *cl, *master_cl;
>> +       bool have_mask = labels_nonzero(mask);
>> +
>> +       /* Inherit master's labels to the related connection? */
>> +       master_cl = (ct->master) ? nf_ct_labels_find(ct->master) : NULL;
>> +
>> +       if (!master_cl && !have_mask)
>> +               return 0;   /* Nothing to do. */
>> 
>>        cl = ovs_ct_get_conn_labels(ct);
>>        if (!cl)
>>                return -ENOSPC;
>> 
>> -       dst = (u32 *)cl->bits;
>> -       for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
>> -               dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
>> -                       (labels->ct_labels_32[i] & mask->ct_labels_32[i]);
>> +       /* Inherit the master's labels, if any. */
>> +       if (master_cl) {
>> +               size_t len = sizeof(master_cl->bits);
>> +
>> +               memcpy(&cl->bits, &master_cl->bits,
>> +                      len > OVS_CT_LABELS_LEN ? OVS_CT_LABELS_LEN : len);
> 
> Looks like this is another spot where we're trying to handle differing
> label lengths, which we could simplify if there was a stronger
> guarantee they're the same.

Indeed, here the ‘cl’s are different instances of the same structure, so we do need not worry about the sizes at all. I’ll change this to an simple structure assignment for v2.

> 
>> +       }
>> +       if (have_mask) {
>> +               u32 *dst = (u32 *)cl->bits;
>> +               int i;
>> +
>> +               for (i = 0; i < OVS_CT_LABELS_LEN_32; i++)
>> +                       dst[i] = (dst[i] & ~mask->ct_labels_32[i]) |
>> +                               (labels->ct_labels_32[i]
>> +                                & mask->ct_labels_32[i]);
>> +       }
> 
> By the way, is this open-coding nf_connlabels_replace()? Can
> ovs_ct_set_labels() and this share the code?

nf_connlabels_replace() uses the compare-and-exchange function to change each 32-bit unit individually, and also triggers the nf netlink change event, first of which we do not need and the second of which we do not want.

> 
>> 
>>        memcpy(&key->ct.labels, cl->bits, OVS_CT_LABELS_LEN);
>> 
>> @@ -916,13 +935,14 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
>>                if (err)
>>                        return err;
>>        }
>> -       if (labels_nonzero(&info->labels.mask)) {
>> -               if (!nf_ct_is_confirmed(ct))
>> -                       err = ovs_ct_init_labels(ct, key, &info->labels.value,
>> -                                                &info->labels.mask);
>> -               else
>> -                       err = ovs_ct_set_labels(ct, key, &info->labels.value,
>> -                                               &info->labels.mask);
>> +       if (!nf_ct_is_confirmed(ct)) {
>> +               err = ovs_ct_init_labels(ct, key, &info->labels.value,
>> +                                        &info->labels.mask);
>> +               if (err)
>> +                       return err;
>> +       } else if (labels_nonzero(&info->labels.mask)) {
>> +               err = ovs_ct_set_labels(ct, key, &info->labels.value,
>> +                                       &info->labels.mask);
>>                if (err)
>>                        return err;
>>        }
>> --
>> 2.1.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key.
  2017-02-07  7:15   ` Joe Stringer
  2017-02-07 21:38     ` Joe Stringer
@ 2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:31 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev


> On Feb 6, 2017, at 11:15 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> Add the fields of the conntrack original direction 5-tuple to struct
>> sw_flow_key.  The new fields are initially zeroed, and are populated
>> whenever a conntrack action is executed and either finds or generates
>> a conntrack entry.  This means that these fields exist for all packets
>> were not rejected by conntrack as untrackable.
>> 
>> The original tuple fields in the sw_flow_key are filled from the
>> original direction tuple of the conntrack entry relating to the
>> current packet, or from the original direction tuple of the master
>> conntrack entry, if the current conntrack entry has a master.
>> Generally, expected connections of connections having an assigned
>> helper (e.g., FTP), have a master conntrack entry.
>> 
>> The main purpose of the new conntrack original tuple fields is to
>> allow matching on them for policy decision purposes, with the premise
>> that the admissibility of tracked connections reply packets (as well
>> as original direction packets), and both direction packets of any
>> related connections may be based on ACL rules applying to the master
>> connection's original direction 5-tuple.  This also makes it easier to
>> make policy decisions when the actual packet headers might have been
>> transformed by NAT, as the original direction 5-tuple represents the
>> packet headers before any such transformation.
>> 
>> When using the original direction 5-tuple the admissibility of return
>> and/or related packets need not be based on the mere existence of a
>> conntrack entry, allowing separation of admission policy from the
>> established conntrack state.  While existence of a conntrack entry is
>> required for admission of the return or related packets, policy
>> changes can render connections that were initially admitted to be
>> rejected or dropped afterwards.  If the admission of the return and
>> related packets was based on mere conntrack state (e.g., connection
>> being in an established state), a policy change that would make the
>> connection rejected or dropped would need to find and delete all
>> conntrack entries affected by such a change.  When using the original
>> direction 5-tuple matching the affected conntrack entries can be
>> allowed to time out instead, as the established state of the
>> connection would not need to be the basis for packet admission any
>> more.
>> 
>> It should be noted that the directionality of related connections may
>> be the same or different than that of the master connection, and
>> neither the original direction 5-tuple nor the conntrack state bits
>> carry this information.  If needed, the directionality of the master
>> connection can be stored in master's conntrack mark or labels, which
>> are automatically inherited by the expected related connections.
>> 
>> The fact that neither ARP not ND packets are trackable by conntrack
>> allows mutual exclusion between ARP/ND and the new conntrack original
>> tuple fields.  Hence, the IP addresses are overlaid in union with ARP
>> and ND fields.  This allows the sw_flow_key to not grow much due to
>> this patch, but it also means that we must be careful to never use the
>> new key fields with ARP or ND packets.  ARP is easy to distinguish and
>> keep mutually exclusive based on the ethernet type, but ND being an
>> ICMPv6 protocol requires a bit more attention.
>> 
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> ---
> 
> OK, maybe we need to do something a bit more to handle the NATed
> related connections to address the problem in patch 1.
> 
> <snip>
> 
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index 738a4fa..1afe153 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -155,6 +155,59 @@ static void __ovs_ct_update_key(struct sw_flow_key *key, u8 state,
>>        key->ct.zone = zone->id;
>>        key->ct.mark = ovs_ct_get_mark(ct);
>>        ovs_ct_get_labels(ct, &key->ct.labels);
>> +
>> +       /* Use the master if we have one. */
>> +       if (ct && ct->master)
>> +               ct = ct->master;
> 
> Perhaps:
> 
> if (!ct || sw_flow_key_is_nd(key) || !is_ip_any(key->eth.type)) {
>    /* zero everything */
>    return;
> }
> 
> One of the things this helps us to avoid is having a comment in the
> middle of an if statement.
> 
> Then afterwards,
> if (ct->master)
>    ct = ct->master;
> 
>> +
>> +       key->ct.orig_proto = 0;
>> +       key->ct.orig_tp.src = 0;
>> +       key->ct.orig_tp.dst = 0;
>> +       if (key->eth.type == htons(ETH_P_IP)) {
>> +               /* IP version must match. */
>> +               if (ct && nf_ct_l3num(ct) == NFPROTO_IPV4) {
> 
> I don't quite understand how we could end up with a connection NFPROTO
> that is mismatched with an IP version that we should handle here, but
> if there are some legitimite cases perhaps we can pick them up and
> handle them in the early exit condition above?
> 

One would be if an IPv4 FTP control connection opened an IPv6 data connection. Our flow key layout assumes the IP versions between master and related connection match. I added a comment to that effect.

> We can probably share a few more lines between IPv4 and IPv6 here.
> 

I refactored all the repeated code away for v2.


>> @@ -208,24 +261,54 @@ void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key)
>>        ovs_ct_update_key(skb, NULL, key, false, false);
>> }
>> 
>> -int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb)
>> +#define IN6_ADDR_INITIALIZER(ADDR) \
>> +       { (ADDR).s6_addr32[0], (ADDR).s6_addr32[1], \
>> +         (ADDR).s6_addr32[2], (ADDR).s6_addr32[3] }
> 
> scripts/checkpatch.pl asks:
> 
> CHECK: Macro argument reuse 'ADDR' - possible side-effects?
> #704: FILE: net/openvswitch/conntrack.c:264:
> +#define IN6_ADDR_INITIALIZER(ADDR) \
> +       { (ADDR).s6_addr32[0], (ADDR).s6_addr32[1], \
> +         (ADDR).s6_addr32[2], (ADDR).s6_addr32[3] }
> 
> I'm guessing it just doesn't like your use of the name 'ADDR’.

Rather it is recommending caution about ‘ADDR’ being evaluated multiple times, which is no problem here.

> 
>> +       if (swkey->eth.type == htons(ETH_P_IP) && swkey->ct.orig_proto) {
>> +               struct ovs_key_ct_tuple_ipv4 orig = {
>> +                       output->ipv4.ct_orig.src,
>> +                       output->ipv4.ct_orig.dst,
>> +                       output->ct.orig_tp.src,
>> +                       output->ct.orig_tp.dst,
>> +                       output->ct.orig_proto,
>> +               };
>> +               if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4, sizeof(orig),
>> +                           &orig))
>> +                       return -EMSGSIZE;
>> +       } else if (swkey->eth.type == htons(ETH_P_IPV6) &&
>> +                  swkey->ct.orig_proto) {
>> +               struct ovs_key_ct_tuple_ipv6 orig = {
>> +                       IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.src),
>> +                       IN6_ADDR_INITIALIZER(output->ipv6.ct_orig.dst),
>> +                       output->ct.orig_tp.src,
>> +                       output->ct.orig_tp.dst,
>> +                       output->ct.orig_proto,
>> +               };
>> +               if (nla_put(skb, OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6, sizeof(orig),
>> +                           &orig))
>> +                       return -EMSGSIZE;
>> +       }
> 
> swkey->ct.orig_proto check is common across both conditions, maybe
> check that first then nest the other logic within?

Done.

> 
>> +
>>        return 0;
>> }
>> 
>> diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
>> index 8f6230b..8573ab3 100644
>> --- a/net/openvswitch/conntrack.h
>> +++ b/net/openvswitch/conntrack.h
>> @@ -32,7 +32,8 @@ int ovs_ct_execute(struct net *, struct sk_buff *, struct sw_flow_key *,
>>                   const struct ovs_conntrack_info *);
>> 
>> void ovs_ct_fill_key(const struct sk_buff *skb, struct sw_flow_key *key);
>> -int ovs_ct_put_key(const struct sw_flow_key *key, struct sk_buff *skb);
>> +int ovs_ct_put_key(const struct sw_flow_key *swkey,
>> +                  const struct sw_flow_key *output, struct sk_buff *skb);
>> void ovs_ct_free_action(const struct nlattr *a);
>> 
>> #define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
>> @@ -79,9 +80,17 @@ static inline void ovs_ct_fill_key(const struct sk_buff *skb,
>>        key->ct.zone = 0;
>>        key->ct.mark = 0;
>>        memset(&key->ct.labels, 0, sizeof(key->ct.labels));
>> +       key->ct.orig_proto = 0;
>> +       key->ct.orig_tp.src = 0;
>> +       key->ct.orig_tp.dst = 0;
>> +       if (key->eth.type == htons(ETH_P_IP))
>> +               memset(&key->ipv4.ct_orig, 0, sizeof(key->ipv4.ct_orig));
>> +       else if (key->eth.type == htons(ETH_P_IPV6) && !sw_flow_key_is_nd(key))
>> +               memset(&key->ipv6.ct_orig, 0, sizeof(key->ipv6.ct_orig));
> 
> If we only ever use these fields based on key->ct.orig_proto being
> nonzero, then we m
> 

All these fields need to be initialized for megaflow matching purposes. We use orig_proto non-zero check to avoid encoding all-zero netlink attribute.

> <snip>
> 
>> +static inline bool sw_flow_key_is_nd(const struct sw_flow_key *key)
>> +{
>> +       return key->eth.type == htons(ETH_P_IPV6) &&
>> +               key->ip.proto == NEXTHDR_ICMP &&
>> +               key->tp.dst == 0 &&
>> +               (key->tp.src == ntohs(NDISC_NEIGHBOUR_SOLICITATION) ||
>> +                key->tp.src == ntohs(NDISC_NEIGHBOUR_ADVERTISEMENT));
> 
> Sparse complains:
> 
> net/openvswitch/flow.h:163:33: warning: cast to restricted __be16
> net/openvswitch/flow.h:163:25: warning: restricted __be16 degrades to integer
> net/openvswitch/flow.h:164:33: warning: cast to restricted __be16
> net/openvswitch/flow.h:164:25: warning: restricted __be16 degrades to integer
> 

Oops, should have used htons() instead.

>> @@ -161,8 +163,11 @@ static bool match_validate(const struct sw_flow_match *match,
>> 
>>        if (match->key->eth.type == htons(ETH_P_IP)) {
>>                key_expected |= 1 << OVS_KEY_ATTR_IPV4;
>> -               if (match->mask && (match->mask->key.eth.type == htons(0xffff)))
>> +               if (match->mask &&
>> +                   (match->mask->key.eth.type == htons(0xffff))) {
> 
> Minor nit, can probably drop the parentheses and keep this to one line?
> 

OK.

>> @@ -1430,6 +1476,10 @@ int ovs_nla_get_match(struct net *net, struct sw_flow_match *match,
>>                        goto free_newmask;
>>        }
>> 
>> +       /* This also checks that conntrack original direction metadata exists
>> +        * only for packets for which it makes sense.  Otherwise the key may be
>> +        * corrupted due to overlapping key fields.
>> +        */
>>        if (!match_validate(match, key_attrs, mask_attrs, log))
>>                err = -EINVAL;
>> 
> 
> I believe that this is already explained in the relevant portion of
> match_validate(), not sure it's worth highlighting that particular
> piece of validation above a call to such a general function.

Removed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/7] openvswitch: Add force commit.
  2017-02-07 22:15   ` Joe Stringer
       [not found]     ` <5B795D0B-4C7F-4297-BA2A-6BE3444033D0@ovn.org>
@ 2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:31 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev


> On Feb 7, 2017, at 2:15 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> Stateful network admission policy may allow connections to one
>> direction and reject connections initiated in the other direction.
>> After policy change it is possible that for a new connection an
>> overlapping conntrack entry already exist, where the connection
>> original direction is opposed to the new connection's initial packet.
>> 
>> Most importantly, conntrack state relating to the current packet gets
>> the "reply" designation based on whether the original direction tuple
>> or the reply direction tuple matched.  If this "directionality" is
>> wrong w.r.t. to the stateful network admission policy it may happen
>> that packets in neither direction are correctly admitted.
>> 
>> This patch adds a new "force commit" option to the OVS conntrack
>> action that checks the original direction of an existing conntrack
>> entry.  If that direction is opposed to the current packet, the
>> existing conntrack entry is deleted and a new one is subsequently
>> created in the correct direction.
>> 
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> ---
>> include/uapi/linux/openvswitch.h | 10 ++++++++++
>> net/openvswitch/conntrack.c      | 27 +++++++++++++++++++++++++--
>> 2 files changed, 35 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>> index 90af8b8..d5ba9a9 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -674,6 +674,10 @@ struct ovs_action_hash {
>>  * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
>>  * @OVS_CT_ATTR_NAT: Nested OVS_NAT_ATTR_* for performing L3 network address
>>  * translation (NAT) on the packet.
>> + * @OVS_CT_ATTR_FORCE_COMMIT: Like %OVS_CT_ATTR_COMMIT, but instead of doing
>> + * nothing if the connection is already committed will check that the current
>> + * packet is in conntrack entry's original direction.  If directionality does
>> + * not match, will delete the existing conntrack entry and commit a new one.
>>  */
>> enum ovs_ct_attr {
>>        OVS_CT_ATTR_UNSPEC,
>> @@ -684,6 +688,12 @@ enum ovs_ct_attr {
>>        OVS_CT_ATTR_HELPER,     /* netlink helper to assist detection of
>>                                   related connections. */
>>        OVS_CT_ATTR_NAT,        /* Nested OVS_NAT_ATTR_* */
>> +       OVS_CT_ATTR_FORCE_COMMIT,  /* No argument, commits connection.  If the
>> +                                   * conntrack entry original direction tuple
>> +                                   * does not match the current packet header
>> +                                   * values, will delete the current conntrack
>> +                                   * entry and create a new one.
>> +                                   */
> 
> We only need one copy of the explanation, keep it above the enum, then
> the inline comment can be /* No argument */.
> 

OK.

>>        __OVS_CT_ATTR_MAX
>> };
>> 
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index 1afe153..1f27f44 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -65,6 +65,7 @@ struct ovs_conntrack_info {
>>        struct nf_conn *ct;
>>        u8 commit : 1;
>>        u8 nat : 3;                 /* enum ovs_ct_nat */
>> +       u8 force : 1;
>>        u16 family;
>>        struct md_mark mark;
>>        struct md_labels labels;
>> @@ -631,10 +632,13 @@ static bool skb_nfct_cached(struct net *net,
>>         */
>>        if (!ct && key->ct.state & OVS_CS_F_TRACKED &&
>>            !(key->ct.state & OVS_CS_F_INVALID) &&
>> -           key->ct.zone == info->zone.id)
>> +           key->ct.zone == info->zone.id) {
>>                ct = ovs_ct_find_existing(net, &info->zone, info->family, skb,
>>                                          !!(key->ct.state
>>                                             & OVS_CS_F_NAT_MASK));
>> +               if (ct)
>> +                       nf_ct_get(skb, &ctinfo);
>> +       }
> 
> If ctinfo is only used with the new call below, we can unconditionally
> fetch this just before it's used...
> 
>>        if (!ct)
>>                return false;
>>        if (!net_eq(net, read_pnet(&ct->ct_net)))
>> @@ -648,6 +652,19 @@ static bool skb_nfct_cached(struct net *net,
>>                if (help && rcu_access_pointer(help->helper) != info->helper)
>>                        return false;
>>        }
>> +       /* Force conntrack entry direction to the current packet? */
> 
> Here.
> 

But then we would be executing nf_ct_get() twice in the common case?

>> +       if (info->force && CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL) {
>> +               /* Delete the conntrack entry if confirmed, else just release
>> +                * the reference.
>> +                */
>> +               if (nf_ct_is_confirmed(ct))
>> +                       nf_ct_delete(ct, 0, 0);
>> +               else
>> +                       nf_ct_put(ct);
> 
> We've already ensured that ct is non-NULL, we can use
> nf_conntrack_put() instead.
> 

Right.

>> +               skb->nfct = NULL;
>> +               skb->nfctinfo = 0;
>> +               return false;
>> +       }
>> 
>>        return true;
>> }
>> @@ -1226,6 +1243,7 @@ static int parse_nat(const struct nlattr *attr,
>> 
>> static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
>>        [OVS_CT_ATTR_COMMIT]    = { .minlen = 0, .maxlen = 0 },
>> +       [OVS_CT_ATTR_FORCE_COMMIT]      = { .minlen = 0, .maxlen = 0 },
>>        [OVS_CT_ATTR_ZONE]      = { .minlen = sizeof(u16),
>>                                    .maxlen = sizeof(u16) },
>>        [OVS_CT_ATTR_MARK]      = { .minlen = sizeof(struct md_mark),
>> @@ -1265,6 +1283,9 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
>>                }
>> 
>>                switch (type) {
>> +               case OVS_CT_ATTR_FORCE_COMMIT:
>> +                       info->force = true;
>> +                       /* fall through. */
>>                case OVS_CT_ATTR_COMMIT:
>>                        info->commit = true;
>>                        break;
>> @@ -1491,7 +1512,9 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
>>        if (!start)
>>                return -EMSGSIZE;
>> 
>> -       if (ct_info->commit && nla_put_flag(skb, OVS_CT_ATTR_COMMIT))
>> +       if (ct_info->commit && nla_put_flag(skb, ct_info->force
>> +                                           ? OVS_CT_ATTR_FORCE_COMMIT
>> +                                           : OVS_CT_ATTR_COMMIT))
>>                return -EMSGSIZE;
>>        if (IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) &&
>>            nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone.id))
>> --
>> 2.1.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key.
  2017-02-07  7:15   ` Joe Stringer
  2017-02-08  1:11     ` Jarno Rajahalme
@ 2017-02-08  5:31     ` Jarno Rajahalme
  1 sibling, 0 replies; 31+ messages in thread
From: Jarno Rajahalme @ 2017-02-08  5:31 UTC (permalink / raw)
  To: Joe Stringer; +Cc: netdev


> On Feb 6, 2017, at 11:15 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 2 February 2017 at 17:10, Jarno Rajahalme <jarno@ovn.org> wrote:
>> struct sw_flow_key has two 16-bit holes. Move the most matched
>> conntrack match fields there.  In some typical cases this reduces the
>> size of the key that needs to be hashed into half and into one cache
>> line.
>> 
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> Looks like this misses the zeroing in ovs_nla_get_flow_metadata();
> might want to double-check for any other memset/copies of the key->ct
> field.

Good catch. Looked, there are no other places to change.

Will rebase to current net-next and repost.

 Jarno

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-02-08  5:31 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-03  1:10 [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 2/7] openvswitch: Unionize ovs_key_ct_label with a u32 array Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 3/7] openvswitch: Do not trigger events for unconfirmed connection Jarno Rajahalme
2017-02-06 21:46   ` Joe Stringer
2017-02-08  5:30     ` Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 4/7] openvswitch: Inherit master's labels Jarno Rajahalme
2017-02-06 21:53   ` Joe Stringer
2017-02-08  5:31     ` Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 5/7] openvswitch: Add original direction conntrack tuple to sw_flow_key Jarno Rajahalme
2017-02-07  7:15   ` Joe Stringer
2017-02-07 21:38     ` Joe Stringer
2017-02-08  5:31     ` Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 6/7] openvswitch: Add force commit Jarno Rajahalme
2017-02-06 17:08   ` Pravin Shelar
2017-02-07  7:28     ` Joe Stringer
2017-02-07 17:14       ` Pravin Shelar
2017-02-07 22:15   ` Joe Stringer
     [not found]     ` <5B795D0B-4C7F-4297-BA2A-6BE3444033D0@ovn.org>
2017-02-08  1:32       ` Joe Stringer
2017-02-08  5:31     ` Jarno Rajahalme
2017-02-03  1:10 ` [PATCH net-next 7/7] openvswitch: Pack struct sw_flow_key Jarno Rajahalme
2017-02-07  7:15   ` Joe Stringer
2017-02-08  1:11     ` Jarno Rajahalme
2017-02-08  5:31     ` Jarno Rajahalme
2017-02-05 22:28 ` [PATCH net-next 1/7] openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted David Miller
2017-02-06 17:06   ` Pravin Shelar
2017-02-06 17:15     ` David Miller
2017-02-07 17:14       ` Pravin Shelar
2017-02-07 21:29         ` Jarno Rajahalme
2017-02-07  0:45   ` Joe Stringer
2017-02-06 17:07 ` Pravin Shelar
2017-02-08  5:30   ` Jarno Rajahalme

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.