netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC net-next] VPLS support
@ 2017-08-16 17:01 David Lamparter
  2017-08-16 17:01 ` [PATCH 1/6] bridge: learn dst metadata in FDB David Lamparter
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:01 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa

Hi all,


triggered by Amine posting VPLS support earlier, this is what I had in
mind on my end.  You may note the patches share some bits, this is
because both series are derived from hacks I did at 33C3 in December
2016.

This patchset is different in the following ways:
- one vpls device encapsulates many pseudowires
- the pseudowire is learned in the bridge FDB as dst_metadata
- to do this, the bridge is extended to learn / keep dst metadata in its fdb
  on a per-entry level
- configuring pseudowires happens through the mpls LIB, through route
  add/delete commands
- the TX path is shared with normal MPLS forwarding
  - this also means this supports ECMP without any further work and without
    copypasta'ing it from af_mpls.c
- pseudowire control word is supported

iproute2 / FRR patches are at:
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
while this patchset is also available at:
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)

I've tested some basic setups, the chain from LDP down into the kernel works
at least in these.  FRR has some testcases around from OpenBSD VPLS support,
I haven't wired that up to run against Linux / this patchset yet.

The patchset needs a lot of polishing (yes I left my TODO notes in the
commit messages), for now my primary concern is overall design feedback.
Roopa has already provided a lot of input (Thanks!);  the major topic I'm
expecting to get discussion on is the bridge FDB changes.

Note that there is a rather large spec difference to VXLAN;  VPLS has
no concept that is analog to a "VNI".  There is nothing to do on a
per-vlan basis.  (Don't get confused by "vlan pseudowire mode", that's
just a fixed tag insertion on a per-remote-endpoint level.)


Cheers / input & feedback appreciated,

-David


P.S.: For a little context on the bridge FDB changes - I'm hoping to find
some time to extend this to the MDB to allow aggregating dst metadata and
handing down a list of dst metas on TX.  This isn't specifically for VPLS
but rather to give sufficient information to the 802.11 stack to allow it to
optimize selecting rates (or unicasting) for multicast traffic by having the
multicast subscriber list known.  This is done by major commercial wifi
solutions (e.g. google "dynamic multicast optimization".)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
@ 2017-08-16 17:01 ` David Lamparter
  2017-08-16 20:38   ` Nikolay Aleksandrov
  2017-08-16 17:01 ` [PATCH 2/6] mpls: split forwarding path on rx/tx boundary David Lamparter
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:01 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

This implements holding dst metadata information in the bridge layer,
but only for unicast entries in the MAC table.  Multicast is still left
to design and implement.

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 include/net/dst_metadata.h | 19 +++++++++++++------
 net/bridge/br_device.c     |  4 ++++
 net/bridge/br_fdb.c        | 45 +++++++++++++++++++++++++++++++--------------
 net/bridge/br_input.c      |  6 ++++--
 net/bridge/br_private.h    |  4 +++-
 5 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index a803129a4849..8858dc441458 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -56,16 +56,15 @@ static inline bool skb_valid_dst(const struct sk_buff *skb)
 	return dst && !(dst->flags & DST_METADATA);
 }
 
-static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
-				       const struct sk_buff *skb_b)
+static inline int dst_metadata_cmp(const struct dst_entry *dst_a,
+				   const struct dst_entry *dst_b)
 {
 	const struct metadata_dst *a, *b;
-
-	if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+	if (!(dst_a || dst_b))
 		return 0;
 
-	a = (const struct metadata_dst *) skb_dst(skb_a);
-	b = (const struct metadata_dst *) skb_dst(skb_b);
+	a = (const struct metadata_dst *) dst_a;
+	b = (const struct metadata_dst *) dst_b;
 
 	if (!a != !b || a->type != b->type)
 		return 1;
@@ -83,6 +82,14 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 	}
 }
 
+static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
+				       const struct sk_buff *skb_b)
+{
+	if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+		return 0;
+	return dst_metadata_cmp(skb_dst(skb_a), skb_dst(skb_b));
+}
+
 void metadata_dst_free(struct metadata_dst *);
 struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
 					gfp_t flags);
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 861ae2a165f4..534cacf02f8d 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -53,6 +53,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	brstats->tx_bytes += skb->len;
 	u64_stats_update_end(&brstats->syncp);
 
+	skb_dst_drop(skb);
 	BR_INPUT_SKB_CB(skb)->brdev = dev;
 
 	skb_reset_mac_header(skb);
@@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 		else
 			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
 	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
+		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
+		if (md_dst)
+			skb_dst_set_noref(skb, md_dst);
 		br_forward(dst->dst, skb, false, true);
 	} else {
 		br_flood(br, skb, BR_PKT_UNICAST, false, true);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index a79b648aac88..0751fcb89699 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -25,11 +25,13 @@
 #include <asm/unaligned.h>
 #include <linux/if_vlan.h>
 #include <net/switchdev.h>
+#include <net/dst_metadata.h>
 #include "br_private.h"
 
 static struct kmem_cache *br_fdb_cache __read_mostly;
 static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		      const unsigned char *addr, u16 vid);
+		      struct dst_entry *md_dst, const unsigned char *addr,
+		      u16 vid);
 static void fdb_notify(struct net_bridge *br,
 		       const struct net_bridge_fdb_entry *, int);
 
@@ -174,6 +176,8 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
 	if (f->is_static)
 		fdb_del_hw_addr(br, f->addr.addr);
 
+	dst_release(rcu_access_pointer(f->md_dst));
+
 	hlist_del_init_rcu(&f->hlist);
 	fdb_notify(br, f, RTM_DELNEIGH);
 	call_rcu(&f->rcu, fdb_rcu_free);
@@ -260,7 +264,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
 
 insert:
 	/* insert new address,  may fail if invalid address or dup. */
-	fdb_insert(br, p, newaddr, 0);
+	fdb_insert(br, p, NULL, newaddr, 0);
 
 	if (!vg || !vg->num_vlans)
 		goto done;
@@ -270,7 +274,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
 	 * from under us.
 	 */
 	list_for_each_entry(v, &vg->vlan_list, vlist)
-		fdb_insert(br, p, newaddr, v->vid);
+		fdb_insert(br, p, NULL, newaddr, v->vid);
 
 done:
 	spin_unlock_bh(&br->hash_lock);
@@ -289,10 +293,11 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
 	if (f && f->is_local && !f->dst && !f->added_by_user)
 		fdb_delete_local(br, NULL, f);
 
-	fdb_insert(br, NULL, newaddr, 0);
+	fdb_insert(br, NULL, NULL, newaddr, 0);
 	vg = br_vlan_group(br);
 	if (!vg || !vg->num_vlans)
 		goto out;
+
 	/* Now remove and add entries for every VLAN configured on the
 	 * bridge.  This function runs under RTNL so the bitmap will not
 	 * change from under us.
@@ -303,7 +308,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
 		f = br_fdb_find(br, br->dev->dev_addr, v->vid);
 		if (f && f->is_local && !f->dst && !f->added_by_user)
 			fdb_delete_local(br, NULL, f);
-		fdb_insert(br, NULL, newaddr, v->vid);
+		fdb_insert(br, NULL, NULL, newaddr, v->vid);
 	}
 out:
 	spin_unlock_bh(&br->hash_lock);
@@ -477,6 +482,7 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
 
 static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
 					       struct net_bridge_port *source,
+					       struct dst_entry *md_dst,
 					       const unsigned char *addr,
 					       __u16 vid,
 					       unsigned char is_local,
@@ -488,6 +494,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
 	if (fdb) {
 		memcpy(fdb->addr.addr, addr, ETH_ALEN);
 		fdb->dst = source;
+		rcu_assign_pointer(fdb->md_dst, dst_clone(md_dst));
 		fdb->vlan_id = vid;
 		fdb->is_local = is_local;
 		fdb->is_static = is_static;
@@ -501,7 +508,8 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
 }
 
 static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		  const unsigned char *addr, u16 vid)
+		      struct dst_entry *md_dst, const unsigned char *addr,
+		      u16 vid)
 {
 	struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
 	struct net_bridge_fdb_entry *fdb;
@@ -521,7 +529,7 @@ static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 		fdb_delete(br, fdb);
 	}
 
-	fdb = fdb_create(head, source, addr, vid, 1, 1);
+	fdb = fdb_create(head, source, md_dst, addr, vid, 1, 1);
 	if (!fdb)
 		return -ENOMEM;
 
@@ -537,13 +545,14 @@ int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 	int ret;
 
 	spin_lock_bh(&br->hash_lock);
-	ret = fdb_insert(br, source, addr, vid);
+	ret = fdb_insert(br, source, NULL, addr, vid);
 	spin_unlock_bh(&br->hash_lock);
 	return ret;
 }
 
 void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
-		   const unsigned char *addr, u16 vid, bool added_by_user)
+		   struct dst_entry *md_dst, const unsigned char *addr,
+		   u16 vid, bool added_by_user)
 {
 	struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
 	struct net_bridge_fdb_entry *fdb;
@@ -558,6 +567,9 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 	      source->state == BR_STATE_FORWARDING))
 		return;
 
+	if (md_dst && !(md_dst->flags & DST_METADATA))
+		md_dst = NULL;
+
 	fdb = fdb_find_rcu(head, addr, vid);
 	if (likely(fdb)) {
 		/* attempt to update an entry for a local interface */
@@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 					source->dev->name, addr, vid);
 		} else {
 			unsigned long now = jiffies;
+			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
 
 			/* fastpath: update of existing entry */
-			if (unlikely(source != fdb->dst)) {
+			if (unlikely(source != fdb->dst ||
+			    dst_metadata_cmp(md_dst, ref_md))) {
 				fdb->dst = source;
+				dst_release(ref_md);
+				rcu_assign_pointer(fdb->md_dst,
+						dst_clone(md_dst));
 				fdb_modified = true;
 				/* Take over HW learned entry */
 				if (unlikely(fdb->added_by_external_learn))
@@ -586,7 +603,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 	} else {
 		spin_lock(&br->hash_lock);
 		if (likely(!fdb_find_rcu(head, addr, vid))) {
-			fdb = fdb_create(head, source, addr, vid, 0, 0);
+			fdb = fdb_create(head, source, md_dst, addr, vid, 0, 0);
 			if (fdb) {
 				if (unlikely(added_by_user))
 					fdb->added_by_user = 1;
@@ -781,7 +798,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 		if (!(flags & NLM_F_CREATE))
 			return -ENOENT;
 
-		fdb = fdb_create(head, source, addr, vid, 0, 0);
+		fdb = fdb_create(head, source, NULL, addr, vid, 0, 0);
 		if (!fdb)
 			return -ENOMEM;
 
@@ -844,7 +861,7 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
 		}
 		local_bh_disable();
 		rcu_read_lock();
-		br_fdb_update(br, p, addr, vid, true);
+		br_fdb_update(br, p, NULL, addr, vid, true);
 		rcu_read_unlock();
 		local_bh_enable();
 	} else if (ndm->ndm_flags & NTF_EXT_LEARNED) {
@@ -1071,7 +1088,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
 	head = &br->hash[br_mac_hash(addr, vid)];
 	fdb = br_fdb_find(br, addr, vid);
 	if (!fdb) {
-		fdb = fdb_create(head, p, addr, vid, 0, 0);
+		fdb = fdb_create(head, p, NULL, addr, vid, 0, 0);
 		if (!fdb) {
 			err = -ENOMEM;
 			goto err_unlock;
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637f58c1226..df10c87b2499 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -150,7 +150,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 	/* insert into forwarding database after filtering to avoid spoofing */
 	br = p->br;
 	if (p->flags & BR_LEARNING)
-		br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, false);
+		br_fdb_update(br, p, skb_dst(skb), eth_hdr(skb)->h_source,
+			      vid, false);
 
 	local_rcv = !!(br->dev->flags & IFF_PROMISC);
 	dest = eth_hdr(skb)->h_dest;
@@ -230,7 +231,8 @@ static void __br_handle_local_finish(struct sk_buff *skb)
 
 	/* check if vlan is allowed, to avoid spoofing */
 	if (p->flags & BR_LEARNING && br_should_learn(p, skb, &vid))
-		br_fdb_update(p->br, p, eth_hdr(skb)->h_source, vid, false);
+		br_fdb_update(p->br, p, skb_dst(skb),
+				eth_hdr(skb)->h_source, vid, false);
 }
 
 /* note: already called with rcu_read_lock */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fd9ee73e0a6d..b73f34ed765f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -164,6 +164,7 @@ struct net_bridge_vlan_group {
 struct net_bridge_fdb_entry {
 	struct hlist_node		hlist;
 	struct net_bridge_port		*dst;
+	struct dst_entry __rcu		*md_dst;
 
 	mac_addr			addr;
 	__u16				vlan_id;
@@ -524,7 +525,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf, unsigned long count,
 int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 		  const unsigned char *addr, u16 vid);
 void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
-		   const unsigned char *addr, u16 vid, bool added_by_user);
+		   struct dst_entry *md_dst, const unsigned char *addr,
+		   u16 vid, bool added_by_user);
 
 int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 		  struct net_device *dev, const unsigned char *addr, u16 vid);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/6] mpls: split forwarding path on rx/tx boundary
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
  2017-08-16 17:01 ` [PATCH 1/6] bridge: learn dst metadata in FDB David Lamparter
@ 2017-08-16 17:01 ` David Lamparter
  2017-08-19 17:10   ` kbuild test robot
  2017-08-19 17:42   ` kbuild test robot
  2017-08-16 17:01 ` [PATCH 3/6] mpls: add VPLS entry points David Lamparter
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:01 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

This makes mpls_rt_xmit() available for use in upcoming VPLS code.  Same
for mpls_route_input_rcu().

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 net/mpls/af_mpls.c  | 48 ++++++++++++++++++++++++++++++------------------
 net/mpls/internal.h |  4 ++++
 2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c5b9ce41d66f..0c5953e5d5bd 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -43,7 +43,7 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 		       struct nlmsghdr *nlh, struct net *net, u32 portid,
 		       unsigned int nlm_flags);
 
-static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
 
@@ -313,15 +313,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	struct net *net = dev_net(dev);
 	struct mpls_shim_hdr *hdr;
 	struct mpls_route *rt;
-	struct mpls_nh *nh;
 	struct mpls_entry_decoded dec;
-	struct net_device *out_dev;
-	struct mpls_dev *out_mdev;
 	struct mpls_dev *mdev;
-	unsigned int hh_len;
-	unsigned int new_header_size;
-	unsigned int mtu;
-	int err;
 
 	/* Careful this entire function runs inside of an rcu critical section */
 
@@ -356,9 +349,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 		goto drop;
 	}
 
-	nh = mpls_select_multipath(rt, skb);
-	if (!nh)
-		goto err;
 
 	/* Pop the label */
 	skb_pull(skb, sizeof(*hdr));
@@ -376,6 +366,32 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 		goto err;
 	dec.ttl -= 1;
 
+	if (mpls_rt_xmit(skb, rt, dec))
+		goto drop;
+	return 0;
+
+err:
+	MPLS_INC_STATS(mdev, rx_errors);
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+		 struct mpls_entry_decoded dec)
+{
+	struct mpls_nh *nh;
+	struct net_device *out_dev;
+	struct mpls_dev *out_mdev;
+	unsigned int hh_len;
+	unsigned int new_header_size;
+	unsigned int mtu;
+	int err;
+
+	nh = mpls_select_multipath(rt, skb);
+	if (!nh)
+		goto tx_err;
+
 	/* Find the output device */
 	out_dev = rcu_dereference(nh->nh_dev);
 	if (!mpls_output_possible(out_dev))
@@ -401,8 +417,9 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(!new_header_size && dec.bos)) {
 		/* Penultimate hop popping */
 		if (!mpls_egress(dev_net(out_dev), rt, skb, dec))
-			goto err;
+			goto tx_err;
 	} else {
+		struct mpls_shim_hdr *hdr;
 		bool bos;
 		int i;
 		skb_push(skb, new_header_size);
@@ -435,12 +452,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
 	if (out_mdev)
 		MPLS_INC_STATS(out_mdev, tx_errors);
-	goto drop;
-err:
-	MPLS_INC_STATS(mdev, rx_errors);
-drop:
-	kfree_skb(skb);
-	return NET_RX_DROP;
+	return -1;
 }
 
 static struct packet_type mpls_packet_type __read_mostly = {
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index cf65aec2e551..b70c6663d4f3 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -210,4 +210,8 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu);
 void mpls_stats_inc_outucastpkts(struct net_device *dev,
 				 const struct sk_buff *skb);
 
+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+		 struct mpls_entry_decoded dec);
+
 #endif /* MPLS_INTERNAL_H */
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/6] mpls: add VPLS entry points
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
  2017-08-16 17:01 ` [PATCH 1/6] bridge: learn dst metadata in FDB David Lamparter
  2017-08-16 17:01 ` [PATCH 2/6] mpls: split forwarding path on rx/tx boundary David Lamparter
@ 2017-08-16 17:01 ` David Lamparter
  2017-08-19 18:27   ` kbuild test robot
  2017-08-21 14:01   ` Amine Kherbouche
  2017-08-16 17:02 ` [PATCH 4/6] mpls: VPLS support David Lamparter
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:01 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

This wires up the neccessary calls for VPLS into the MPLS forwarding
pieces.  Since CONFIG_MPLS_VPLS doesn't exist yet in Kconfig, it'll
never be enabled, so we're on the stubs for now.

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 include/uapi/linux/rtnetlink.h |  1 +
 net/mpls/af_mpls.c             | 54 ++++++++++++++++++++++++++++++++++++++++++
 net/mpls/internal.h            | 29 +++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index dab7dad9e01a..b7840ed94526 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -326,6 +326,7 @@ enum rtattr_type_t {
 	RTA_PAD,
 	RTA_UID,
 	RTA_TTL_PROPAGATE,
+	RTA_VPLS_IF,
 	__RTA_MAX
 };
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0c5953e5d5bd..4d3ce007b7db 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -299,6 +299,11 @@ static bool mpls_egress(struct net *net, struct mpls_route *rt,
 		success = true;
 		break;
 	}
+	case MPT_VPLS:
+		/* nothing to do here, no TTL in Ethernet
+		 * (and we shouldn't mess with the TTL in inner IP packets,
+		 * pseudowires are supposed to be transparent) */
+		break;
 	case MPT_UNSPEC:
 		/* Should have decided which protocol it is by now */
 		break;
@@ -349,6 +354,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 		goto drop;
 	}
 
+	if (rt->rt_payload_type == MPT_VPLS)
+		return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);
 
 	/* Pop the label */
 	skb_pull(skb, sizeof(*hdr));
@@ -469,6 +476,7 @@ static const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
 struct mpls_route_config {
 	u32			rc_protocol;
 	u32			rc_ifindex;
+	u32			rc_vpls_ifindex;
 	u8			rc_via_table;
 	u8			rc_via_alen;
 	u8			rc_via[MAX_VIA_ALEN];
@@ -541,6 +549,8 @@ static void mpls_route_update(struct net *net, unsigned index,
 	rt = rtnl_dereference(platform_label[index]);
 	rcu_assign_pointer(platform_label[index], new);
 
+	vpls_label_update(index, rt, new);
+
 	mpls_notify_route(net, index, rt, new, info);
 
 	/* If we removed a route free it now */
@@ -942,6 +952,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
 	struct mpls_route __rcu **platform_label;
 	struct net *net = cfg->rc_nlinfo.nl_net;
 	struct mpls_route *rt, *old;
+	struct net_device *vpls_dev = NULL;
 	int err = -EINVAL;
 	u8 max_via_alen;
 	unsigned index;
@@ -996,6 +1007,24 @@ static int mpls_route_add(struct mpls_route_config *cfg,
 		goto errout;
 	}
 
+	if (cfg->rc_vpls_ifindex) {
+		vpls_dev = dev_get_by_index(net, cfg->rc_vpls_ifindex);
+		if (!vpls_dev) {
+			err = -ENODEV;
+			NL_SET_ERR_MSG(extack, "Invalid VPLS ifindex");
+			goto errout;
+		}
+		/* we're under RTNL; and we'll drop routes when we're
+		 * notified the device is going away. */
+		dev_put(vpls_dev);
+
+		if (!is_vpls_dev(vpls_dev)) {
+			err = -ENODEV;
+			NL_SET_ERR_MSG(extack, "Not a VPLS device");
+			goto errout;
+		}
+	}
+
 	err = -ENOMEM;
 	rt = mpls_rt_alloc(nhs, max_via_alen, max_labels);
 	if (IS_ERR(rt)) {
@@ -1006,6 +1035,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
 	rt->rt_protocol = cfg->rc_protocol;
 	rt->rt_payload_type = cfg->rc_payload_type;
 	rt->rt_ttl_propagate = cfg->rc_ttl_propagate;
+	rt->rt_vpls_dev = vpls_dev;
 
 	if (cfg->rc_mp)
 		err = mpls_nh_build_multi(cfg, rt, max_labels, extack);
@@ -1430,6 +1460,14 @@ static void mpls_ifdown(struct net_device *dev, int event)
 		if (!rt)
 			continue;
 
+		if (rt->rt_vpls_dev == dev) {
+			switch (event) {
+			case NETDEV_UNREGISTER:
+				mpls_route_update(net, index, NULL, NULL);
+				continue;
+			}
+		}
+
 		alive = 0;
 		deleted = 0;
 		change_nexthops(rt) {
@@ -1777,6 +1815,10 @@ static int rtm_to_route_config(struct sk_buff *skb,
 		case RTA_OIF:
 			cfg->rc_ifindex = nla_get_u32(nla);
 			break;
+		case RTA_VPLS_IF:
+			cfg->rc_vpls_ifindex = nla_get_u32(nla);
+			cfg->rc_payload_type = MPT_VPLS;
+			break;
 		case RTA_NEWDST:
 			if (nla_get_labels(nla, MAX_NEW_LABELS,
 					   &cfg->rc_output_labels,
@@ -1911,6 +1953,11 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
 			       ttl_propagate))
 			goto nla_put_failure;
 	}
+
+	if (rt->rt_vpls_dev)
+		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
+			goto nla_put_failure;
+
 	if (rt->rt_nhn == 1) {
 		const struct mpls_nh *nh = rt->rt_nh;
 
@@ -2220,6 +2267,10 @@ static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
 	if (nla_put_labels(skb, RTA_DST, 1, &in_label))
 		goto nla_put_failure;
 
+	if (rt->rt_vpls_dev)
+		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
+			goto nla_put_failure;
+
 	if (nh->nh_labels &&
 	    nla_put_labels(skb, RTA_NEWDST, nh->nh_labels,
 			   nh->nh_label))
@@ -2491,6 +2542,8 @@ static int __init mpls_init(void)
 
 	rtnl_af_register(&mpls_af_ops);
 
+	vpls_init();
+
 	rtnl_register(PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, 0);
 	rtnl_register(PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, 0);
 	rtnl_register(PF_MPLS, RTM_GETROUTE, mpls_getroute, mpls_dump_routes,
@@ -2510,6 +2563,7 @@ module_init(mpls_init);
 static void __exit mpls_exit(void)
 {
 	rtnl_unregister_all(PF_MPLS);
+	vpls_exit();
 	rtnl_af_unregister(&mpls_af_ops);
 	dev_remove_pack(&mpls_packet_type);
 	unregister_netdevice_notifier(&mpls_dev_notifier);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index b70c6663d4f3..876ae9993207 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -76,6 +76,7 @@ struct sk_buff;
 
 enum mpls_payload_type {
 	MPT_UNSPEC, /* IPv4 or IPv6 */
+	MPT_VPLS = 2,	/* pseudowire */
 	MPT_IPV4 = 4,
 	MPT_IPV6 = 6,
 
@@ -153,6 +154,8 @@ struct mpls_route { /* next hop label forwarding entry */
 	u8			rt_nh_size;
 	u8			rt_via_offset;
 	u8			rt_reserved1;
+	struct net_device	*rt_vpls_dev;
+
 	struct mpls_nh		rt_nh[0];
 };
 
@@ -214,4 +217,30 @@ struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
 int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
 		 struct mpls_entry_decoded dec);
 
+#ifdef CONFIG_MPLS_VPLS
+int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
+	     struct packet_type *pt, struct mpls_route *rt,
+	     struct mpls_shim_hdr *hdr, struct net_device *orig_dev);
+void vpls_label_update(unsigned label, struct mpls_route *rt_old,
+		       struct mpls_route *rt_new);
+__init int vpls_init(void);
+__exit void vpls_exit(void);
+int is_vpls_dev(struct net_device *dev);
+
+#else /* !CONFIG_MPLS_VPLS */
+static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
+{
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+static inline int is_vpls_dev(struct net_device *dev)
+{
+	return 0;
+}
+
+#define vpls_label_update(label, rt_old, rt_new) do { } while (0)
+#define vpls_init() do { } while (0)
+#define vpls_exit() do { } while (0)
+#endif
+
 #endif /* MPLS_INTERNAL_H */
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/6] mpls: VPLS support
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
                   ` (2 preceding siblings ...)
  2017-08-16 17:01 ` [PATCH 3/6] mpls: add VPLS entry points David Lamparter
@ 2017-08-16 17:02 ` David Lamparter
  2017-08-21 15:14   ` Amine Kherbouche
  2017-08-21 16:11   ` Amine Kherbouche
  2017-08-16 17:02 ` [PATCH 5/6] bridge: add VPLS pseudowire info in fdb dump David Lamparter
  2017-08-16 17:02 ` [PATCH 6/6] mpls: pseudowire control word support David Lamparter
  5 siblings, 2 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:02 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

[work-in-progress, works but needs changes]
[v2: refactored lots of things, e.g. dst_metadata, no more genetlink]

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 include/net/dst_metadata.h |  21 ++
 include/net/vpls.h         |   8 +
 net/mpls/Kconfig           |  11 ++
 net/mpls/Makefile          |   1 +
 net/mpls/vpls.c            | 469 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 510 insertions(+)
 create mode 100644 include/net/vpls.h
 create mode 100644 net/mpls/vpls.c

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 8858dc441458..aeee4ce3b654 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -3,11 +3,13 @@
 
 #include <linux/skbuff.h>
 #include <net/ip_tunnels.h>
+#include <net/vpls.h>
 #include <net/dst.h>
 
 enum metadata_type {
 	METADATA_IP_TUNNEL,
 	METADATA_HW_PORT_MUX,
+	METADATA_VPLS,
 };
 
 struct hw_port_info {
@@ -21,6 +23,7 @@ struct metadata_dst {
 	union {
 		struct ip_tunnel_info	tun_info;
 		struct hw_port_info	port_info;
+		struct vpls_info	vpls_info;
 	} u;
 };
 
@@ -49,6 +52,15 @@ static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
 	return NULL;
 }
 
+static inline struct vpls_info *skb_vpls_info(struct sk_buff *skb)
+{
+	struct metadata_dst *md_dst = skb_metadata_dst(skb);
+	if (md_dst && md_dst->type == METADATA_VPLS)
+		return &md_dst->u.vpls_info;
+	return NULL;
+}
+
+
 static inline bool skb_valid_dst(const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -73,6 +85,9 @@ static inline int dst_metadata_cmp(const struct dst_entry *dst_a,
 	case METADATA_HW_PORT_MUX:
 		return memcmp(&a->u.port_info, &b->u.port_info,
 			      sizeof(a->u.port_info));
+	case METADATA_VPLS:
+		return memcmp(&a->u.vpls_info, &b->u.vpls_info,
+			      sizeof(a->u.vpls_info));
 	case METADATA_IP_TUNNEL:
 		return memcmp(&a->u.tun_info, &b->u.tun_info,
 			      sizeof(a->u.tun_info) +
@@ -218,4 +233,10 @@ static inline struct metadata_dst *ipv6_tun_rx_dst(struct sk_buff *skb,
 				  0, ip6_flowlabel(ip6h), flags, tunnel_id,
 				  md_size);
 }
+
+static inline struct metadata_dst *vpls_rx_dst(void)
+{
+	return metadata_dst_alloc(0, METADATA_VPLS, GFP_ATOMIC);
+}
+
 #endif /* __NET_DST_METADATA_H */
diff --git a/include/net/vpls.h b/include/net/vpls.h
new file mode 100644
index 000000000000..b261e2d97734
--- /dev/null
+++ b/include/net/vpls.h
@@ -0,0 +1,8 @@
+#ifndef __NET_VPLS_H
+#define __NET_VPLS_H 1
+
+struct vpls_info {
+	u32		pw_label;
+};
+
+#endif /* __NET_VPLS_H */
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 5c467ef97311..c15ba73efb34 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -27,6 +27,17 @@ config MPLS_ROUTING
 	---help---
 	 Add support for forwarding of mpls packets.
 
+config MPLS_VPLS
+	bool "VPLS support"
+	default y
+	depends on MPLS_ROUTING && BRIDGE_NETFILTER=n
+	---help---
+	 Add support for de-&encapsulating VPLS.  Not compatible with
+	 bridge netfilter due to the latter stomping over VPLS' dst metadata.
+
+comment "disable 'Bridged IP/ARP packets filtering' for VPLS support"
+	depends on BRIDGE_NETFILTER
+
 config MPLS_IPTUNNEL
 	tristate "MPLS: IP over MPLS tunnel support"
 	depends on LWTUNNEL && MPLS_ROUTING
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 9ca923625016..3c028600a980 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
 obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o
 
 mpls_router-y := af_mpls.o
+mpls_router-$(CONFIG_MPLS_VPLS) += vpls.o
diff --git a/net/mpls/vpls.c b/net/mpls/vpls.c
new file mode 100644
index 000000000000..28ac810da6e9
--- /dev/null
+++ b/net/mpls/vpls.c
@@ -0,0 +1,469 @@
+/*
+ *  net/mpls/vpls.c
+ *
+ *  Copyright (C) 2016 David Lamparter
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/slab.h>
+#include <linux/ethtool.h>
+#include <linux/etherdevice.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/mpls.h>
+
+#include <net/rtnetlink.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+#include <net/mpls.h>
+#include <linux/module.h>
+#include <net/dst_metadata.h>
+#include <net/ip_tunnels.h>
+
+#include "internal.h"
+
+#define DRV_NAME	"vpls"
+
+#define MIN_MTU 68		/* Min L3 MTU */
+#define MAX_MTU 65535		/* Max L3 MTU (arbitrary) */
+
+struct vpls_wirelist {
+	struct rcu_head rcu;
+	size_t count;
+	unsigned wires[0];
+};
+
+struct vpls_priv {
+	struct net *encap_net;
+	struct vpls_wirelist __rcu *wires;
+};
+
+static int vpls_xmit_wire(struct sk_buff *skb, struct net_device *dev,
+			  struct vpls_priv *vpls, u32 wire)
+{
+	struct mpls_route *rt;
+	struct mpls_entry_decoded dec;
+
+	dec.bos = 1;
+	dec.ttl = 255;
+
+	rt = mpls_route_input_rcu(vpls->encap_net, wire);
+	if (!rt)
+		return -ENOENT;
+	if (rt->rt_vpls_dev != dev)
+		return -EINVAL;
+
+	return mpls_rt_xmit(skb, rt, dec);
+}
+
+static netdev_tx_t vpls_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	int err = -EINVAL, ok_count = 0;
+	struct vpls_priv *priv = netdev_priv(dev);
+	struct vpls_info *vi;
+	struct pcpu_sw_netstats *stats;
+	size_t len = skb->len;
+
+	rcu_read_lock();
+	vi = skb_vpls_info(skb);
+
+	skb_orphan(skb);
+	skb_forward_csum(skb);
+
+	if (vi) {
+		err = vpls_xmit_wire(skb, dev, priv, vi->pw_label);
+		if (err)
+			goto out_err;
+	} else {
+		struct sk_buff *cloned;
+		struct vpls_wirelist *wl;
+		size_t i;
+
+		wl = rcu_dereference(priv->wires);
+		if (wl->count == 0) {
+			dev->stats.tx_carrier_errors++;
+			goto out_err;
+		}
+
+		for (i = 0; i < wl->count; i++) {
+			cloned = skb_clone(skb, GFP_KERNEL);
+			if (vpls_xmit_wire(cloned, dev, priv, wl->wires[i]))
+				consume_skb(cloned);
+			else
+				ok_count++;
+		}
+		if (!ok_count)
+			goto out_err;
+
+		consume_skb(skb);
+	}
+
+	stats = this_cpu_ptr(dev->tstats);
+	u64_stats_update_begin(&stats->syncp);
+	stats->tx_packets++;
+	stats->tx_bytes += len;
+	u64_stats_update_end(&stats->syncp);
+
+	rcu_read_unlock();
+	return 0;
+
+out_err:
+	dev->stats.tx_errors++;
+
+	consume_skb(skb);
+	rcu_read_unlock();
+	return err;
+}
+
+int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
+	     struct packet_type *pt, struct mpls_route *rt,
+	     struct mpls_shim_hdr *hdr, struct net_device *orig_dev)
+{
+	struct net_device *dev = rt->rt_vpls_dev;
+	struct mpls_entry_decoded dec;
+	struct metadata_dst *md_dst;
+	struct pcpu_sw_netstats *stats;
+
+	if (!dev)
+		goto drop_nodev;
+
+	dec = mpls_entry_decode(hdr);
+	if (!dec.bos) {
+		dev->stats.rx_frame_errors++;
+		goto drop;
+	}
+
+	skb_pull(skb, sizeof(*hdr));
+
+	if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) {
+		dev->stats.rx_length_errors++;
+		goto drop;
+	}
+
+	md_dst = vpls_rx_dst();
+	if (unlikely(!md_dst)) {
+		netdev_err(dev, "failed to allocate dst metadata\n");
+		goto drop;
+	}
+	md_dst->u.vpls_info.pw_label = dec.label;
+
+	skb->dev = dev;
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_type_trans(skb, dev);
+	skb->ip_summed = CHECKSUM_NONE;
+	skb->pkt_type = PACKET_HOST;
+
+	skb_clear_hash(skb);
+	skb->vlan_tci = 0;
+	skb_set_queue_mapping(skb, 0);
+	skb_scrub_packet(skb, !net_eq(dev_net(in_dev), dev_net(dev)));
+
+	skb_reset_network_header(skb);
+	skb_probe_transport_header(skb, 0);
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &md_dst->dst);
+
+	stats = this_cpu_ptr(dev->tstats);
+	u64_stats_update_begin(&stats->syncp);
+	stats->rx_packets++;
+	stats->rx_bytes += skb->len;
+	u64_stats_update_end(&stats->syncp);
+
+	netif_rx(skb);
+	return 0;
+
+drop:
+	dev->stats.rx_errors++;
+drop_nodev:
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+
+void vpls_label_update(unsigned label, struct mpls_route *rt_old,
+		       struct mpls_route *rt_new)
+{
+	struct vpls_priv *priv;
+	struct vpls_wirelist *wl, *wl_new;
+	size_t i;
+
+	ASSERT_RTNL();
+
+	if (rt_old && rt_new && rt_old->rt_vpls_dev == rt_new->rt_vpls_dev)
+		return;
+
+	if (rt_old && rt_old->rt_vpls_dev) {
+		priv = netdev_priv(rt_old->rt_vpls_dev);
+		wl = rcu_dereference(priv->wires);
+
+		for (i = 0; i < wl->count; i++)
+			if (wl->wires[i] == label)
+				break;
+
+		if (i == wl->count) {
+			netdev_err(rt_old->rt_vpls_dev,
+				   "can't find pseudowire to remove!\n");
+			goto update_new;
+		}
+
+		wl_new = kmalloc(sizeof(*wl) +
+				 (wl->count - 1) * sizeof(wl->wires[0]),
+				 GFP_ATOMIC);
+		if (!wl_new) {
+			netdev_err(rt_old->rt_vpls_dev,
+				   "out of memory for pseudowire delete!\n");
+			goto update_new;
+		}
+
+		wl_new->count = wl->count - 1;
+		memcpy(wl_new->wires, wl->wires, i * sizeof(wl->wires[0]));
+		memcpy(wl_new->wires + i, wl->wires + i + 1,
+			(wl->count - i - 1) * sizeof(wl->wires[0]));
+
+		rcu_assign_pointer(priv->wires, wl_new);
+		kfree_rcu(wl, rcu);
+
+		if (wl_new->count == 0)
+			netif_carrier_off(rt_old->rt_vpls_dev);
+	}
+
+update_new:
+	if (rt_new && rt_new->rt_vpls_dev) {
+		priv = netdev_priv(rt_new->rt_vpls_dev);
+		wl = rcu_dereference(priv->wires);
+
+		wl_new = kmalloc(sizeof(*wl) +
+				 (wl->count + 1) * sizeof(wl->wires[0]),
+				 GFP_ATOMIC);
+		if (!wl_new) {
+			netdev_err(rt_new->rt_vpls_dev,
+				   "out of memory for pseudowire add!\n");
+			return;
+		}
+		wl_new->count = wl->count + 1;
+		memcpy(wl_new->wires, wl->wires,
+			wl->count * sizeof(wl->wires[0]));
+		wl_new->wires[wl->count] = label;
+
+		rcu_assign_pointer(priv->wires, wl_new);
+		kfree_rcu(wl, rcu);
+
+		if (wl_new->count == 1)
+			netif_carrier_on(rt_new->rt_vpls_dev);
+	}
+}
+
+/* fake multicast ability */
+static void vpls_set_multicast_list(struct net_device *dev)
+{
+}
+
+static int vpls_open(struct net_device *dev)
+{
+	struct vpls_priv *priv = netdev_priv(dev);
+	struct vpls_wirelist *wl;
+
+	wl = rcu_dereference(priv->wires);
+	if (wl->count > 0)
+		netif_carrier_on(dev);
+
+	return 0;
+}
+
+static int vpls_close(struct net_device *dev)
+{
+	netif_carrier_off(dev);
+	return 0;
+}
+
+static int is_valid_vpls_mtu(int new_mtu)
+{
+	return new_mtu >= MIN_MTU && new_mtu <= MAX_MTU;
+}
+
+static int vpls_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (!is_valid_vpls_mtu(new_mtu))
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static int vpls_dev_init(struct net_device *dev)
+{
+	struct vpls_priv *priv = netdev_priv(dev);
+	priv->wires = kzalloc(sizeof(struct vpls_wirelist), GFP_KERNEL);
+	if (!priv->wires)
+		return -ENOMEM;
+
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats) {
+		kfree(priv->wires);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void vpls_dev_free(struct net_device *dev)
+{
+	struct vpls_priv *priv = netdev_priv(dev);
+
+	free_percpu(dev->tstats);
+
+	if (priv->wires)
+		kfree(priv->wires);
+
+	if (priv->encap_net)
+		put_net(priv->encap_net);
+
+	free_netdev(dev);
+}
+
+static const struct net_device_ops vpls_netdev_ops = {
+	.ndo_init		= vpls_dev_init,
+	.ndo_open		= vpls_open,
+	.ndo_stop		= vpls_close,
+	.ndo_start_xmit		= vpls_xmit,
+	.ndo_change_mtu		= vpls_change_mtu,
+	.ndo_get_stats64	= ip_tunnel_get_stats64,
+	.ndo_set_rx_mode	= vpls_set_multicast_list,
+	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+int is_vpls_dev(struct net_device *dev)
+{
+	return dev->netdev_ops == &vpls_netdev_ops;
+}
+
+#define VPLS_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | \
+		       NETIF_F_HW_CSUM | NETIF_F_RXCSUM | NETIF_F_HIGHDMA)
+
+static void vpls_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+
+	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_NO_QUEUE;
+
+	dev->netdev_ops = &vpls_netdev_ops;
+	dev->features |= NETIF_F_LLTX;
+	dev->features |= VPLS_FEATURES;
+	dev->vlan_features = dev->features;
+	dev->priv_destructor = vpls_dev_free;
+
+	dev->hw_features = VPLS_FEATURES;
+	dev->hw_enc_features = VPLS_FEATURES;
+
+	netif_keep_dst(dev);
+}
+
+/*
+ * netlink interface
+ */
+
+static int vpls_validate(struct nlattr *tb[], struct nlattr *data[],
+			 struct netlink_ext_ack *extack)
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS],
+				    "Invalid Ethernet address length");
+			return -EINVAL;
+		}
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_ADDRESS],
+				    "Invalid Ethernet address");
+			return -EADDRNOTAVAIL;
+		}
+	}
+	if (tb[IFLA_MTU]) {
+		if (!is_valid_vpls_mtu(nla_get_u32(tb[IFLA_MTU]))) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_MTU],
+				    "Invalid MTU");
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static struct rtnl_link_ops vpls_link_ops;
+
+static int vpls_newlink(struct net *src_net, struct net_device *dev,
+			struct nlattr *tb[], struct nlattr *data[],
+			struct netlink_ext_ack *extack)
+{
+	int err;
+	struct vpls_priv *priv = netdev_priv(dev);
+
+	if (tb[IFLA_ADDRESS] == NULL)
+		eth_hw_addr_random(dev);
+
+	if (tb[IFLA_IFNAME])
+		nla_strlcpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
+	else
+		snprintf(dev->name, IFNAMSIZ, DRV_NAME "%%d");
+
+	err = register_netdevice(dev);
+	if (err < 0)
+		goto err;
+	priv->encap_net = get_net(src_net);
+
+	netif_carrier_off(dev);
+	return 0;
+
+err:
+	return err;
+}
+
+static void vpls_dellink(struct net_device *dev, struct list_head *head)
+{
+	unregister_netdevice_queue(dev, head);
+}
+
+
+static struct rtnl_link_ops vpls_link_ops = {
+	.kind		= DRV_NAME,
+	.priv_size	= sizeof(struct vpls_priv),
+	.setup		= vpls_setup,
+	.validate	= vpls_validate,
+	.newlink	= vpls_newlink,
+	.dellink	= vpls_dellink,
+};
+
+/*
+ * init/fini
+ */
+
+__init int vpls_init(void)
+{
+	int ret;
+
+	ret = rtnl_link_register(&vpls_link_ops);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	return ret;
+}
+
+__exit void vpls_exit(void)
+{
+	rtnl_link_unregister(&vpls_link_ops);
+}
+
+#if 0
+/* not currently available as a separate module... */
+
+module_init(vpls_init);
+module_exit(vpls_exit);
+
+MODULE_DESCRIPTION("Virtual Private LAN Service");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+#endif
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/6] bridge: add VPLS pseudowire info in fdb dump
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
                   ` (3 preceding siblings ...)
  2017-08-16 17:02 ` [PATCH 4/6] mpls: VPLS support David Lamparter
@ 2017-08-16 17:02 ` David Lamparter
  2017-08-16 17:02 ` [PATCH 6/6] mpls: pseudowire control word support David Lamparter
  5 siblings, 0 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:02 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

Add a NDA_VPLS_WIRE attribute to the FDB dump if we have dst metadata
that is indicative of a VPLS pseudowire.  This is really helpful for
debugging.

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 include/uapi/linux/neighbour.h |  1 +
 net/bridge/br_fdb.c            | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index 3199d28980b3..703089c91b04 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -27,6 +27,7 @@ enum {
 	NDA_MASTER,
 	NDA_LINK_NETNSID,
 	NDA_SRC_VNI,
+	NDA_VPLS_WIRE,
 	__NDA_MAX
 };
 
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 0751fcb89699..ffbaff914d55 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -670,6 +670,18 @@ static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,
 
 	if (fdb->vlan_id && nla_put(skb, NDA_VLAN, sizeof(u16), &fdb->vlan_id))
 		goto nla_put_failure;
+	if (fdb->md_dst) {
+		struct metadata_dst *md_dst;
+		md_dst = (struct metadata_dst *)fdb->md_dst;
+		/* ensured in br_fdb_update */
+		BUG_ON(!(md_dst->dst.flags & DST_METADATA));
+
+		if (md_dst->type == METADATA_VPLS) {
+			unsigned wire = md_dst->u.vpls_info.pw_label;
+			if (nla_put_u32(skb, NDA_VPLS_WIRE, wire))
+				goto nla_put_failure;
+		}
+	}
 
 	nlmsg_end(skb, nlh);
 	return 0;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/6] mpls: pseudowire control word support
  2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
                   ` (4 preceding siblings ...)
  2017-08-16 17:02 ` [PATCH 5/6] bridge: add VPLS pseudowire info in fdb dump David Lamparter
@ 2017-08-16 17:02 ` David Lamparter
  5 siblings, 0 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-16 17:02 UTC (permalink / raw)
  To: netdev; +Cc: amine.kherbouche, roopa, David Lamparter

[TODO: maybe rename this to MPLS_FLAGS and use it for non-pseudowire OAM
bits too (e.g. enabling G-ACh or LSP ping.)]

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 include/uapi/linux/rtnetlink.h |  4 ++++
 net/mpls/af_mpls.c             | 11 +++++++++++
 net/mpls/internal.h            |  8 ++------
 net/mpls/vpls.c                | 34 +++++++++++++++++++++++++++++++++-
 4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b7840ed94526..b5a34e0e4327 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -327,6 +327,7 @@ enum rtattr_type_t {
 	RTA_UID,
 	RTA_TTL_PROPAGATE,
 	RTA_VPLS_IF,
+	RTA_VPLS_FLAGS,
 	__RTA_MAX
 };
 
@@ -335,6 +336,9 @@ enum rtattr_type_t {
 #define RTM_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct rtmsg))))
 #define RTM_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct rtmsg))
 
+#define RTA_VPLS_F_CW_RX	(1 << 0)
+#define RTA_VPLS_F_CW_TX	(1 << 1)
+
 /* RTM_MULTIPATH --- array of struct rtnexthop.
  *
  * "struct rtnexthop" describes all necessary nexthop information,
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 4d3ce007b7db..9036beeca173 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -477,6 +477,7 @@ struct mpls_route_config {
 	u32			rc_protocol;
 	u32			rc_ifindex;
 	u32			rc_vpls_ifindex;
+	u8			rc_vpls_flags;
 	u8			rc_via_table;
 	u8			rc_via_alen;
 	u8			rc_via[MAX_VIA_ALEN];
@@ -1036,6 +1037,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
 	rt->rt_payload_type = cfg->rc_payload_type;
 	rt->rt_ttl_propagate = cfg->rc_ttl_propagate;
 	rt->rt_vpls_dev = vpls_dev;
+	rt->rt_vpls_flags = cfg->rc_vpls_flags;
 
 	if (cfg->rc_mp)
 		err = mpls_nh_build_multi(cfg, rt, max_labels, extack);
@@ -1819,6 +1821,9 @@ static int rtm_to_route_config(struct sk_buff *skb,
 			cfg->rc_vpls_ifindex = nla_get_u32(nla);
 			cfg->rc_payload_type = MPT_VPLS;
 			break;
+		case RTA_VPLS_FLAGS:
+			cfg->rc_vpls_flags = nla_get_u8(nla);
+			break;
 		case RTA_NEWDST:
 			if (nla_get_labels(nla, MAX_NEW_LABELS,
 					   &cfg->rc_output_labels,
@@ -1957,6 +1962,9 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	if (rt->rt_vpls_dev)
 		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
 			goto nla_put_failure;
+	if (rt->rt_vpls_flags)
+		if (nla_put_u8(skb, RTA_VPLS_FLAGS, rt->rt_vpls_flags))
+			goto nla_put_failure;
 
 	if (rt->rt_nhn == 1) {
 		const struct mpls_nh *nh = rt->rt_nh;
@@ -2270,6 +2278,9 @@ static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
 	if (rt->rt_vpls_dev)
 		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
 			goto nla_put_failure;
+	if (rt->rt_vpls_flags)
+		if (nla_put_u8(skb, RTA_VPLS_FLAGS, rt->rt_vpls_flags))
+			goto nla_put_failure;
 
 	if (nh->nh_labels &&
 	    nla_put_labels(skb, RTA_NEWDST, nh->nh_labels,
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index 876ae9993207..03048e3a5d83 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -79,11 +79,6 @@ enum mpls_payload_type {
 	MPT_VPLS = 2,	/* pseudowire */
 	MPT_IPV4 = 4,
 	MPT_IPV6 = 6,
-
-	/* Other types not implemented:
-	 *  - Pseudo-wire with or without control word (RFC4385)
-	 *  - GAL (RFC5586)
-	 */
 };
 
 struct mpls_nh { /* next hop label forwarding entry */
@@ -153,7 +148,8 @@ struct mpls_route { /* next hop label forwarding entry */
 	u8			rt_nhn_alive;
 	u8			rt_nh_size;
 	u8			rt_via_offset;
-	u8			rt_reserved1;
+
+	u8			rt_vpls_flags;
 	struct net_device	*rt_vpls_dev;
 
 	struct mpls_nh		rt_nh[0];
diff --git a/net/mpls/vpls.c b/net/mpls/vpls.c
index 28ac810da6e9..1496683d871c 100644
--- a/net/mpls/vpls.c
+++ b/net/mpls/vpls.c
@@ -27,6 +27,14 @@
 #define MIN_MTU 68		/* Min L3 MTU */
 #define MAX_MTU 65535		/* Max L3 MTU (arbitrary) */
 
+struct vpls_cw {
+	u8 type_flags;
+#define VPLS_CWTYPE(cw) ((cw)->type_flags & 0x0f)
+
+	u8 len;
+	u16 seqno;
+};
+
 struct vpls_wirelist {
 	struct rcu_head rcu;
 	size_t count;
@@ -53,6 +61,14 @@ static int vpls_xmit_wire(struct sk_buff *skb, struct net_device *dev,
 	if (rt->rt_vpls_dev != dev)
 		return -EINVAL;
 
+	if (rt->rt_vpls_flags & RTA_VPLS_F_CW_TX) {
+		struct vpls_cw *cw;
+		if (skb_cow(skb, sizeof(*cw)))
+			return -ENOMEM;
+		cw = skb_push(skb, sizeof(*cw));
+		memset(cw, 0, sizeof(*cw));
+	}
+
 	return mpls_rt_xmit(skb, rt, dec);
 }
 
@@ -123,6 +139,7 @@ int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
 	struct mpls_entry_decoded dec;
 	struct metadata_dst *md_dst;
 	struct pcpu_sw_netstats *stats;
+	void *next;
 
 	if (!dev)
 		goto drop_nodev;
@@ -133,7 +150,22 @@ int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
 		goto drop;
 	}
 
-	skb_pull(skb, sizeof(*hdr));
+	/* bottom label is still in the skb */
+	next = skb_pull(skb, sizeof(*hdr));
+
+	if (rt->rt_vpls_flags & RTA_VPLS_F_CW_RX) {
+		struct vpls_cw *cw = next;
+		if (unlikely(!pskb_may_pull(skb, sizeof(*cw)))) {
+			dev->stats.rx_length_errors++;
+			goto drop;
+		}
+		next = skb_pull(skb, sizeof(*cw));
+
+		if (VPLS_CWTYPE(cw) != 0) {
+			/* insert MPLS OAM implementation here */
+			goto drop_nodev;
+		}
+	}
 
 	if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) {
 		dev->stats.rx_length_errors++;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-16 17:01 ` [PATCH 1/6] bridge: learn dst metadata in FDB David Lamparter
@ 2017-08-16 20:38   ` Nikolay Aleksandrov
  2017-08-17 11:03     ` David Lamparter
  2017-08-17 16:16     ` David Lamparter
  0 siblings, 2 replies; 26+ messages in thread
From: Nikolay Aleksandrov @ 2017-08-16 20:38 UTC (permalink / raw)
  To: David Lamparter, netdev; +Cc: amine.kherbouche, roopa, stephen, bridge

On 16/08/17 20:01, David Lamparter wrote:
> This implements holding dst metadata information in the bridge layer,
> but only for unicast entries in the MAC table.  Multicast is still left
> to design and implement.
> 
> Signed-off-by: David Lamparter <equinox@diac24.net>
> ---

Hi David,
Sorry but I do not agree with this change, adding a special case for VPLS in the bridge code and
hitting the fast path for everyone in a few different places for a feature that the majority
will not use does not sound acceptable to me. We've been trying hard to optimize it, trying to
avoid additional cache lines, removing tests and keeping special cases to a minimum.
I understand that you want to use the fdb tables and avoid duplication, but this is not
worth it. There're other similar use cases and they have their own private fdb tables,
that way the user can opt out and is much cleaner and separated.

As you've noted this is only an RFC so I will not point out every issue, but there seems
to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.

By the way I've added the bridge maintainers to the CC list.

Thanks,
 Nik

>  include/net/dst_metadata.h | 19 +++++++++++++------
>  net/bridge/br_device.c     |  4 ++++
>  net/bridge/br_fdb.c        | 45 +++++++++++++++++++++++++++++++--------------
>  net/bridge/br_input.c      |  6 ++++--
>  net/bridge/br_private.h    |  4 +++-
>  5 files changed, 55 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
> index a803129a4849..8858dc441458 100644
> --- a/include/net/dst_metadata.h
> +++ b/include/net/dst_metadata.h
> @@ -56,16 +56,15 @@ static inline bool skb_valid_dst(const struct sk_buff *skb)
>  	return dst && !(dst->flags & DST_METADATA);
>  }
>  
> -static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
> -				       const struct sk_buff *skb_b)
> +static inline int dst_metadata_cmp(const struct dst_entry *dst_a,
> +				   const struct dst_entry *dst_b)
>  {
>  	const struct metadata_dst *a, *b;
> -
> -	if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
> +	if (!(dst_a || dst_b))
>  		return 0;
>  
> -	a = (const struct metadata_dst *) skb_dst(skb_a);
> -	b = (const struct metadata_dst *) skb_dst(skb_b);
> +	a = (const struct metadata_dst *) dst_a;
> +	b = (const struct metadata_dst *) dst_b;
>  
>  	if (!a != !b || a->type != b->type)
>  		return 1;
> @@ -83,6 +82,14 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
>  	}
>  }
>  
> +static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
> +				       const struct sk_buff *skb_b)
> +{
> +	if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
> +		return 0;
> +	return dst_metadata_cmp(skb_dst(skb_a), skb_dst(skb_b));
> +}
> +
>  void metadata_dst_free(struct metadata_dst *);
>  struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
>  					gfp_t flags);
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index 861ae2a165f4..534cacf02f8d 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -53,6 +53,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>  	brstats->tx_bytes += skb->len;
>  	u64_stats_update_end(&brstats->syncp);
>  
> +	skb_dst_drop(skb);
>  	BR_INPUT_SKB_CB(skb)->brdev = dev;
>  
>  	skb_reset_mac_header(skb);
> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>  		else
>  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
>  	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
> +		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
> +		if (md_dst)
> +			skb_dst_set_noref(skb, md_dst);
>  		br_forward(dst->dst, skb, false, true);
>  	} else {
>  		br_flood(br, skb, BR_PKT_UNICAST, false, true);
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index a79b648aac88..0751fcb89699 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -25,11 +25,13 @@
>  #include <asm/unaligned.h>
>  #include <linux/if_vlan.h>
>  #include <net/switchdev.h>
> +#include <net/dst_metadata.h>
>  #include "br_private.h"
>  
>  static struct kmem_cache *br_fdb_cache __read_mostly;
>  static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
> -		      const unsigned char *addr, u16 vid);
> +		      struct dst_entry *md_dst, const unsigned char *addr,
> +		      u16 vid);
>  static void fdb_notify(struct net_bridge *br,
>  		       const struct net_bridge_fdb_entry *, int);
>  
> @@ -174,6 +176,8 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
>  	if (f->is_static)
>  		fdb_del_hw_addr(br, f->addr.addr);
>  
> +	dst_release(rcu_access_pointer(f->md_dst));
> +
>  	hlist_del_init_rcu(&f->hlist);
>  	fdb_notify(br, f, RTM_DELNEIGH);
>  	call_rcu(&f->rcu, fdb_rcu_free);
> @@ -260,7 +264,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
>  
>  insert:
>  	/* insert new address,  may fail if invalid address or dup. */
> -	fdb_insert(br, p, newaddr, 0);
> +	fdb_insert(br, p, NULL, newaddr, 0);
>  
>  	if (!vg || !vg->num_vlans)
>  		goto done;
> @@ -270,7 +274,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
>  	 * from under us.
>  	 */
>  	list_for_each_entry(v, &vg->vlan_list, vlist)
> -		fdb_insert(br, p, newaddr, v->vid);
> +		fdb_insert(br, p, NULL, newaddr, v->vid);
>  
>  done:
>  	spin_unlock_bh(&br->hash_lock);
> @@ -289,10 +293,11 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
>  	if (f && f->is_local && !f->dst && !f->added_by_user)
>  		fdb_delete_local(br, NULL, f);
>  
> -	fdb_insert(br, NULL, newaddr, 0);
> +	fdb_insert(br, NULL, NULL, newaddr, 0);
>  	vg = br_vlan_group(br);
>  	if (!vg || !vg->num_vlans)
>  		goto out;
> +
>  	/* Now remove and add entries for every VLAN configured on the
>  	 * bridge.  This function runs under RTNL so the bitmap will not
>  	 * change from under us.
> @@ -303,7 +308,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
>  		f = br_fdb_find(br, br->dev->dev_addr, v->vid);
>  		if (f && f->is_local && !f->dst && !f->added_by_user)
>  			fdb_delete_local(br, NULL, f);
> -		fdb_insert(br, NULL, newaddr, v->vid);
> +		fdb_insert(br, NULL, NULL, newaddr, v->vid);
>  	}
>  out:
>  	spin_unlock_bh(&br->hash_lock);
> @@ -477,6 +482,7 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
>  
>  static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
>  					       struct net_bridge_port *source,
> +					       struct dst_entry *md_dst,
>  					       const unsigned char *addr,
>  					       __u16 vid,
>  					       unsigned char is_local,
> @@ -488,6 +494,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
>  	if (fdb) {
>  		memcpy(fdb->addr.addr, addr, ETH_ALEN);
>  		fdb->dst = source;
> +		rcu_assign_pointer(fdb->md_dst, dst_clone(md_dst));
>  		fdb->vlan_id = vid;
>  		fdb->is_local = is_local;
>  		fdb->is_static = is_static;
> @@ -501,7 +508,8 @@ static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
>  }
>  
>  static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
> -		  const unsigned char *addr, u16 vid)
> +		      struct dst_entry *md_dst, const unsigned char *addr,
> +		      u16 vid)
>  {
>  	struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
>  	struct net_bridge_fdb_entry *fdb;
> @@ -521,7 +529,7 @@ static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
>  		fdb_delete(br, fdb);
>  	}
>  
> -	fdb = fdb_create(head, source, addr, vid, 1, 1);
> +	fdb = fdb_create(head, source, md_dst, addr, vid, 1, 1);
>  	if (!fdb)
>  		return -ENOMEM;
>  
> @@ -537,13 +545,14 @@ int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
>  	int ret;
>  
>  	spin_lock_bh(&br->hash_lock);
> -	ret = fdb_insert(br, source, addr, vid);
> +	ret = fdb_insert(br, source, NULL, addr, vid);
>  	spin_unlock_bh(&br->hash_lock);
>  	return ret;
>  }
>  
>  void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
> -		   const unsigned char *addr, u16 vid, bool added_by_user)
> +		   struct dst_entry *md_dst, const unsigned char *addr,
> +		   u16 vid, bool added_by_user)
>  {
>  	struct hlist_head *head = &br->hash[br_mac_hash(addr, vid)];
>  	struct net_bridge_fdb_entry *fdb;
> @@ -558,6 +567,9 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>  	      source->state == BR_STATE_FORWARDING))
>  		return;
>  
> +	if (md_dst && !(md_dst->flags & DST_METADATA))
> +		md_dst = NULL;
> +
>  	fdb = fdb_find_rcu(head, addr, vid);
>  	if (likely(fdb)) {
>  		/* attempt to update an entry for a local interface */
> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>  					source->dev->name, addr, vid);
>  		} else {
>  			unsigned long now = jiffies;
> +			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
>  
>  			/* fastpath: update of existing entry */
> -			if (unlikely(source != fdb->dst)) {
> +			if (unlikely(source != fdb->dst ||
> +			    dst_metadata_cmp(md_dst, ref_md))) {
>  				fdb->dst = source;
> +				dst_release(ref_md);
> +				rcu_assign_pointer(fdb->md_dst,
> +						dst_clone(md_dst));
>  				fdb_modified = true;
>  				/* Take over HW learned entry */
>  				if (unlikely(fdb->added_by_external_learn))
> @@ -586,7 +603,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>  	} else {
>  		spin_lock(&br->hash_lock);
>  		if (likely(!fdb_find_rcu(head, addr, vid))) {
> -			fdb = fdb_create(head, source, addr, vid, 0, 0);
> +			fdb = fdb_create(head, source, md_dst, addr, vid, 0, 0);
>  			if (fdb) {
>  				if (unlikely(added_by_user))
>  					fdb->added_by_user = 1;
> @@ -781,7 +798,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
>  		if (!(flags & NLM_F_CREATE))
>  			return -ENOENT;
>  
> -		fdb = fdb_create(head, source, addr, vid, 0, 0);
> +		fdb = fdb_create(head, source, NULL, addr, vid, 0, 0);
>  		if (!fdb)
>  			return -ENOMEM;
>  
> @@ -844,7 +861,7 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
>  		}
>  		local_bh_disable();
>  		rcu_read_lock();
> -		br_fdb_update(br, p, addr, vid, true);
> +		br_fdb_update(br, p, NULL, addr, vid, true);
>  		rcu_read_unlock();
>  		local_bh_enable();
>  	} else if (ndm->ndm_flags & NTF_EXT_LEARNED) {
> @@ -1071,7 +1088,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
>  	head = &br->hash[br_mac_hash(addr, vid)];
>  	fdb = br_fdb_find(br, addr, vid);
>  	if (!fdb) {
> -		fdb = fdb_create(head, p, addr, vid, 0, 0);
> +		fdb = fdb_create(head, p, NULL, addr, vid, 0, 0);
>  		if (!fdb) {
>  			err = -ENOMEM;
>  			goto err_unlock;
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 7637f58c1226..df10c87b2499 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -150,7 +150,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>  	/* insert into forwarding database after filtering to avoid spoofing */
>  	br = p->br;
>  	if (p->flags & BR_LEARNING)
> -		br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, false);
> +		br_fdb_update(br, p, skb_dst(skb), eth_hdr(skb)->h_source,
> +			      vid, false);
>  
>  	local_rcv = !!(br->dev->flags & IFF_PROMISC);
>  	dest = eth_hdr(skb)->h_dest;
> @@ -230,7 +231,8 @@ static void __br_handle_local_finish(struct sk_buff *skb)
>  
>  	/* check if vlan is allowed, to avoid spoofing */
>  	if (p->flags & BR_LEARNING && br_should_learn(p, skb, &vid))
> -		br_fdb_update(p->br, p, eth_hdr(skb)->h_source, vid, false);
> +		br_fdb_update(p->br, p, skb_dst(skb),
> +				eth_hdr(skb)->h_source, vid, false);
>  }
>  
>  /* note: already called with rcu_read_lock */
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index fd9ee73e0a6d..b73f34ed765f 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -164,6 +164,7 @@ struct net_bridge_vlan_group {
>  struct net_bridge_fdb_entry {
>  	struct hlist_node		hlist;
>  	struct net_bridge_port		*dst;
> +	struct dst_entry __rcu		*md_dst;
>  
>  	mac_addr			addr;
>  	__u16				vlan_id;
> @@ -524,7 +525,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf, unsigned long count,
>  int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
>  		  const unsigned char *addr, u16 vid);
>  void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
> -		   const unsigned char *addr, u16 vid, bool added_by_user);
> +		   struct dst_entry *md_dst, const unsigned char *addr,
> +		   u16 vid, bool added_by_user);
>  
>  int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
>  		  struct net_device *dev, const unsigned char *addr, u16 vid);
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-16 20:38   ` Nikolay Aleksandrov
@ 2017-08-17 11:03     ` David Lamparter
  2017-08-17 11:39       ` Nikolay Aleksandrov
  2017-08-17 16:16     ` David Lamparter
  1 sibling, 1 reply; 26+ messages in thread
From: David Lamparter @ 2017-08-17 11:03 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: David Lamparter, netdev, amine.kherbouche, roopa, stephen, bridge

Hi Nikolay,


On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
> On 16/08/17 20:01, David Lamparter wrote:
> > This implements holding dst metadata information in the bridge layer,
> > but only for unicast entries in the MAC table.  Multicast is still left
> > to design and implement.
> 
> Sorry but I do not agree with this change, adding a special case for
> VPLS in the bridge code

I don't think this is specific to VPLS at all, though you're right that
VPLS is the only user currently.

> and hitting the fast path for everyone in a few different places for a
> feature that the majority will not use does not sound acceptable to
> me. We've been trying hard to optimize it, trying to avoid additional
> cache lines, removing tests and keeping special cases to a minimum. 

skb->dst is on the same cacheline as skb->len.
fdb->md_dst is on the same cacheline as fdb->dst.
Both will be 0 in a lot of cases, so this should be two null checks on
data that is hot in the cache.  Are you sure this is an actual problem?

> I understand that you want to use the fdb tables and avoid
> duplication, but this is not worth it. There're other similar use
> cases and they have their own private fdb tables, that way the user
> can opt out and is much cleaner and separated.

Sure, this can be done.  I think it's a noticeable performance penalty
to have the entire fdb copied (multiple times for H-VPLS even), but I
understand that it's preferable to have the normal cases faster in
exchange.  As the previous paragraph notes, I still wonder if that hit
to the normal case exists though.

I will leave this to Amine, he's paid to work on VPLS while I'm doing
this for fun ;)

There is however another concern I have here.  As I noted in my
introductory mail, I'm working on the bridge MDB making similar changes.
And there's actually strong use cases for this in both VPLS and the
802.11 code (though I'm not sure I can code the latter one up, it's
related to rate control and this is seriously complicated - the goal is
to select a multicast rate based on the now-known receiving STAs).

I really hope you're not suggesting the entire MDB with IPv4 & IPv6
snooping be duplicated into both VPLS and mac80211?

> As you've noted this is only an RFC so I will not point out every issue, but there seems
> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.

Right, Thanks! ... I only thought about concurrent access, forgetting
about concurrent modification...  I'll replace it with an xchg I think.
(No need for a lock that way)

That said, now that I think about it, the skb_dst_set_noref() in the
following chunk is probably not safe if the VPLS device has a qdisc on
it.  I was relying on the fact that br_dev_xmit is holding RCU there,
but if the SKB is queued, md_dst might go away in the meantime...
... probably need to change this to dst_hold() + skb_dst_set()...


-David

> > diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> > index 861ae2a165f4..534cacf02f8d 100644
> > --- a/net/bridge/br_device.c
> > +++ b/net/bridge/br_device.c
> > @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
> >  		else
> >  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
> >  	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
> > +		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
> > +		if (md_dst)
> > +			skb_dst_set_noref(skb, md_dst);
> >  		br_forward(dst->dst, skb, false, true);
> >  	} else {
> >  		br_flood(br, skb, BR_PKT_UNICAST, false, true);


> > diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> > index a79b648aac88..0751fcb89699 100644
> > @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
> >  					source->dev->name, addr, vid);
> >  		} else {
> >  			unsigned long now = jiffies;
> > +			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
> >  
> >  			/* fastpath: update of existing entry */
> > -			if (unlikely(source != fdb->dst)) {
> > +			if (unlikely(source != fdb->dst ||
> > +			    dst_metadata_cmp(md_dst, ref_md))) {
> >  				fdb->dst = source;
> > +				dst_release(ref_md);
> > +				rcu_assign_pointer(fdb->md_dst,
> > +						dst_clone(md_dst));
> >  				fdb_modified = true;
> >  				/* Take over HW learned entry */
> >  				if (unlikely(fdb->added_by_external_learn))

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 11:03     ` David Lamparter
@ 2017-08-17 11:39       ` Nikolay Aleksandrov
  2017-08-17 11:51         ` Nikolay Aleksandrov
  2017-08-17 12:45         ` David Lamparter
  0 siblings, 2 replies; 26+ messages in thread
From: Nikolay Aleksandrov @ 2017-08-17 11:39 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev, roopa, bridge, amine.kherbouche

On 17/08/17 14:03, David Lamparter wrote:
> Hi Nikolay,
> 
> 
> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
>> On 16/08/17 20:01, David Lamparter wrote:
>>> This implements holding dst metadata information in the bridge layer,
>>> but only for unicast entries in the MAC table.  Multicast is still left
>>> to design and implement.
>>
>> Sorry but I do not agree with this change, adding a special case for
>> VPLS in the bridge code
> 
> I don't think this is specific to VPLS at all, though you're right that
> VPLS is the only user currently.
> 
>> and hitting the fast path for everyone in a few different places for a
>> feature that the majority will not use does not sound acceptable to
>> me. We've been trying hard to optimize it, trying to avoid additional
>> cache lines, removing tests and keeping special cases to a minimum. 
> 
> skb->dst is on the same cacheline as skb->len.
> fdb->md_dst is on the same cacheline as fdb->dst.
> Both will be 0 in a lot of cases, so this should be two null checks on
> data that is hot in the cache.  Are you sure this is an actual problem?
> 

Sure - no, I haven't benchmarked it, but I don't see skb->len being on
the same cache line as _skb_refdst assuming 64 byte cache lines.
But again any special cases, in my opinion, should be handled on their own,
it is both about the fast path and the code complexity that they bring in.

>> I understand that you want to use the fdb tables and avoid
>> duplication, but this is not worth it. There're other similar use
>> cases and they have their own private fdb tables, that way the user
>> can opt out and is much cleaner and separated.
> 
> Sure, this can be done.  I think it's a noticeable performance penalty
> to have the entire fdb copied (multiple times for H-VPLS even), but I
> understand that it's preferable to have the normal cases faster in
> exchange.  As the previous paragraph notes, I still wonder if that hit
> to the normal case exists though.
> 
> I will leave this to Amine, he's paid to work on VPLS while I'm doing
> this for fun ;)
> 
> There is however another concern I have here.  As I noted in my
> introductory mail, I'm working on the bridge MDB making similar changes.
> And there's actually strong use cases for this in both VPLS and the
> 802.11 code (though I'm not sure I can code the latter one up, it's
> related to rate control and this is seriously complicated - the goal is
> to select a multicast rate based on the now-known receiving STAs).
> 
> I really hope you're not suggesting the entire MDB with IPv4 & IPv6
> snooping be duplicated into both VPLS and mac80211?
> 

Code can always be shared if there are more users, no need to stuff everything in
the bridge, but I'm not that familiar with this case, once patches are out I can
comment further.

>> As you've noted this is only an RFC so I will not point out every issue, but there seems
>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.
> 
> Right, Thanks! ... I only thought about concurrent access, forgetting
> about concurrent modification...  I'll replace it with an xchg I think.
> (No need for a lock that way)

I think you can still lose references to a dst that way, what if someone changes the
dst you read before the xchg and you xchg it ?

> 
> That said, now that I think about it, the skb_dst_set_noref() in the
> following chunk is probably not safe if the VPLS device has a qdisc on
> it.  I was relying on the fact that br_dev_xmit is holding RCU there,
> but if the SKB is queued, md_dst might go away in the meantime...
> ... probably need to change this to dst_hold() + skb_dst_set()...
> 
> 
> -David
> 
>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
>>> index 861ae2a165f4..534cacf02f8d 100644
>>> --- a/net/bridge/br_device.c
>>> +++ b/net/bridge/br_device.c
>>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>>>  		else
>>>  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
>>>  	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
>>> +		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
>>> +		if (md_dst)
>>> +			skb_dst_set_noref(skb, md_dst);
>>>  		br_forward(dst->dst, skb, false, true);
>>>  	} else {
>>>  		br_flood(br, skb, BR_PKT_UNICAST, false, true);
> 
> 
>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>> index a79b648aac88..0751fcb89699 100644
>>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>>>  					source->dev->name, addr, vid);
>>>  		} else {
>>>  			unsigned long now = jiffies;
>>> +			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
>>>  
>>>  			/* fastpath: update of existing entry */
>>> -			if (unlikely(source != fdb->dst)) {
>>> +			if (unlikely(source != fdb->dst ||
>>> +			    dst_metadata_cmp(md_dst, ref_md))) {
>>>  				fdb->dst = source;
>>> +				dst_release(ref_md);
>>> +				rcu_assign_pointer(fdb->md_dst,
>>> +						dst_clone(md_dst));
>>>  				fdb_modified = true;
>>>  				/* Take over HW learned entry */
>>>  				if (unlikely(fdb->added_by_external_learn))

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 11:39       ` Nikolay Aleksandrov
@ 2017-08-17 11:51         ` Nikolay Aleksandrov
  2017-08-17 12:10           ` David Lamparter
  2017-08-17 12:45         ` David Lamparter
  1 sibling, 1 reply; 26+ messages in thread
From: Nikolay Aleksandrov @ 2017-08-17 11:51 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev, roopa, bridge, amine.kherbouche

On 17/08/17 14:39, Nikolay Aleksandrov wrote:
> On 17/08/17 14:03, David Lamparter wrote:
>> Hi Nikolay,
>>
>>
>> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
>>> On 16/08/17 20:01, David Lamparter wrote:
>>>> This implements holding dst metadata information in the bridge layer,
>>>> but only for unicast entries in the MAC table.  Multicast is still left
>>>> to design and implement.
>>>
>>> Sorry but I do not agree with this change, adding a special case for
>>> VPLS in the bridge code
>>
>> I don't think this is specific to VPLS at all, though you're right that
>> VPLS is the only user currently.
>>
>>> and hitting the fast path for everyone in a few different places for a
>>> feature that the majority will not use does not sound acceptable to
>>> me. We've been trying hard to optimize it, trying to avoid additional
>>> cache lines, removing tests and keeping special cases to a minimum. 
>>
>> skb->dst is on the same cacheline as skb->len.
>> fdb->md_dst is on the same cacheline as fdb->dst.
>> Both will be 0 in a lot of cases, so this should be two null checks on
>> data that is hot in the cache.  Are you sure this is an actual problem?
>>
> 
> Sure - no, I haven't benchmarked it, but I don't see skb->len being on
> the same cache line as _skb_refdst assuming 64 byte cache lines.

I should've been clearer - that obviously depends on the kernel config, but
in order for them to be in the same line you need to disable either one of 
conntrack, bridge_netfilter or xfrm, these are almost always enabled (at
least in all major distributions).

> But again any special cases, in my opinion, should be handled on their own,
> it is both about the fast path and the code complexity that they bring in.
> 
>>> I understand that you want to use the fdb tables and avoid
>>> duplication, but this is not worth it. There're other similar use
>>> cases and they have their own private fdb tables, that way the user
>>> can opt out and is much cleaner and separated.
>>
>> Sure, this can be done.  I think it's a noticeable performance penalty
>> to have the entire fdb copied (multiple times for H-VPLS even), but I
>> understand that it's preferable to have the normal cases faster in
>> exchange.  As the previous paragraph notes, I still wonder if that hit
>> to the normal case exists though.
>>
>> I will leave this to Amine, he's paid to work on VPLS while I'm doing
>> this for fun ;)
>>
>> There is however another concern I have here.  As I noted in my
>> introductory mail, I'm working on the bridge MDB making similar changes.
>> And there's actually strong use cases for this in both VPLS and the
>> 802.11 code (though I'm not sure I can code the latter one up, it's
>> related to rate control and this is seriously complicated - the goal is
>> to select a multicast rate based on the now-known receiving STAs).
>>
>> I really hope you're not suggesting the entire MDB with IPv4 & IPv6
>> snooping be duplicated into both VPLS and mac80211?
>>
> 
> Code can always be shared if there are more users, no need to stuff everything in
> the bridge, but I'm not that familiar with this case, once patches are out I can
> comment further.
> 
>>> As you've noted this is only an RFC so I will not point out every issue, but there seems
>>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.
>>
>> Right, Thanks! ... I only thought about concurrent access, forgetting
>> about concurrent modification...  I'll replace it with an xchg I think.
>> (No need for a lock that way)
> 
> I think you can still lose references to a dst that way, what if someone changes the
> dst you read before the xchg and you xchg it ?
> 
>>
>> That said, now that I think about it, the skb_dst_set_noref() in the
>> following chunk is probably not safe if the VPLS device has a qdisc on
>> it.  I was relying on the fact that br_dev_xmit is holding RCU there,
>> but if the SKB is queued, md_dst might go away in the meantime...
>> ... probably need to change this to dst_hold() + skb_dst_set()...
>>
>>
>> -David
>>
>>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
>>>> index 861ae2a165f4..534cacf02f8d 100644
>>>> --- a/net/bridge/br_device.c
>>>> +++ b/net/bridge/br_device.c
>>>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>  		else
>>>>  			br_flood(br, skb, BR_PKT_MULTICAST, false, true);
>>>>  	} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
>>>> +		struct dst_entry *md_dst = rcu_dereference(dst->md_dst);
>>>> +		if (md_dst)
>>>> +			skb_dst_set_noref(skb, md_dst);
>>>>  		br_forward(dst->dst, skb, false, true);
>>>>  	} else {
>>>>  		br_flood(br, skb, BR_PKT_UNICAST, false, true);
>>
>>
>>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>>> index a79b648aac88..0751fcb89699 100644
>>>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
>>>>  					source->dev->name, addr, vid);
>>>>  		} else {
>>>>  			unsigned long now = jiffies;
>>>> +			struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst);
>>>>  
>>>>  			/* fastpath: update of existing entry */
>>>> -			if (unlikely(source != fdb->dst)) {
>>>> +			if (unlikely(source != fdb->dst ||
>>>> +			    dst_metadata_cmp(md_dst, ref_md))) {
>>>>  				fdb->dst = source;
>>>> +				dst_release(ref_md);
>>>> +				rcu_assign_pointer(fdb->md_dst,
>>>> +						dst_clone(md_dst));
>>>>  				fdb_modified = true;
>>>>  				/* Take over HW learned entry */
>>>>  				if (unlikely(fdb->added_by_external_learn))
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 11:51         ` Nikolay Aleksandrov
@ 2017-08-17 12:10           ` David Lamparter
  2017-08-17 12:19             ` Nikolay Aleksandrov
  2017-08-17 12:20             ` David Lamparter
  0 siblings, 2 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-17 12:10 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: David Lamparter, netdev, amine.kherbouche, roopa, stephen, bridge

On Thu, Aug 17, 2017 at 02:51:12PM +0300, Nikolay Aleksandrov wrote:
> On 17/08/17 14:39, Nikolay Aleksandrov wrote:
> > On 17/08/17 14:03, David Lamparter wrote:
> >> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
[cut]
> >>> and hitting the fast path for everyone in a few different places for a
> >>> feature that the majority will not use does not sound acceptable to
> >>> me. We've been trying hard to optimize it, trying to avoid additional
> >>> cache lines, removing tests and keeping special cases to a minimum. 
> >>
> >> skb->dst is on the same cacheline as skb->len.
> >> fdb->md_dst is on the same cacheline as fdb->dst.
> >> Both will be 0 in a lot of cases, so this should be two null checks on
> >> data that is hot in the cache.  Are you sure this is an actual problem?
> >>
> > 
> > Sure - no, I haven't benchmarked it, but I don't see skb->len being on
> > the same cache line as _skb_refdst assuming 64 byte cache lines.
> 
> I should've been clearer - that obviously depends on the kernel config, but
> in order for them to be in the same line you need to disable either one of 
> conntrack, bridge_netfilter or xfrm, these are almost always enabled (at
> least in all major distributions).

Did I miscount somewhere?  This is what I counted:
offs    size
00      16      next/prev/other union bits
16      8       sk
24      8       dev
32	32      cb (first 32 bytes)
---- boundary @ 64
64      16      cb (last 16 bytes)
80      8       _skb_refdst
88      8       destructor
96      8 (0)   sp
104     8 (0)   _nfct
112     8 (0)   nf_bridge
120     4       len
124     4       data_len
---- boundary @ 128
128     2       mac_len
130     2       hdr_len


-David

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 12:10           ` David Lamparter
@ 2017-08-17 12:19             ` Nikolay Aleksandrov
  2017-08-17 12:20             ` David Lamparter
  1 sibling, 0 replies; 26+ messages in thread
From: Nikolay Aleksandrov @ 2017-08-17 12:19 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev, roopa, bridge, amine.kherbouche

On 17/08/17 15:10, David Lamparter wrote:
> On Thu, Aug 17, 2017 at 02:51:12PM +0300, Nikolay Aleksandrov wrote:
>> On 17/08/17 14:39, Nikolay Aleksandrov wrote:
>>> On 17/08/17 14:03, David Lamparter wrote:
>>>> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
> [cut]
>>>>> and hitting the fast path for everyone in a few different places for a
>>>>> feature that the majority will not use does not sound acceptable to
>>>>> me. We've been trying hard to optimize it, trying to avoid additional
>>>>> cache lines, removing tests and keeping special cases to a minimum. 
>>>>
>>>> skb->dst is on the same cacheline as skb->len.
>>>> fdb->md_dst is on the same cacheline as fdb->dst.
>>>> Both will be 0 in a lot of cases, so this should be two null checks on
>>>> data that is hot in the cache.  Are you sure this is an actual problem?
>>>>
>>>
>>> Sure - no, I haven't benchmarked it, but I don't see skb->len being on
>>> the same cache line as _skb_refdst assuming 64 byte cache lines.
>>
>> I should've been clearer - that obviously depends on the kernel config, but
>> in order for them to be in the same line you need to disable either one of 
>> conntrack, bridge_netfilter or xfrm, these are almost always enabled (at
>> least in all major distributions).
> 
> Did I miscount somewhere?  This is what I counted:
> offs    size
> 00      16      next/prev/other union bits
> 16      8       sk
> 24      8       dev
> 32	32      cb (first 32 bytes)
> ---- boundary @ 64
> 64      16      cb (last 16 bytes)
> 80      8       _skb_refdst
> 88      8       destructor
> 96      8 (0)   sp
> 104     8 (0)   _nfct
> 112     8 (0)   nf_bridge
> 120     4       len
> 124     4       data_len
> ---- boundary @ 128
> 128     2       mac_len
> 130     2       hdr_len
> 
> 
> -David
> 
What kernel ?

pahole -C sk_buff 

struct sk_buff {
	union {
		struct {
			struct sk_buff * next;           /*     0     8 */
			struct sk_buff * prev;           /*     8     8 */
			union {
				ktime_t tstamp;          /*           8 */
				u64 skb_mstamp;          /*           8 */
			};                               /*    16     8 */
		};                                       /*          24 */
		struct rb_node     rbnode;               /*          24 */
	};                                               /*     0    24 */
	struct sock *              sk;                   /*    24     8 */
	union {
		struct net_device * dev;                 /*           8 */
		long unsigned int  dev_scratch;          /*           8 */
	};                                               /*    32     8 */
	char                       cb[48];               /*    40    48 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
	long unsigned int          _skb_refdst;          /*    88     8 */
	void                       (*destructor)(struct sk_buff *); /*    96     8 */
	struct sec_path *          sp;                   /*   104     8 */
	long unsigned int          _nfct;                /*   112     8 */
	struct nf_bridge_info *    nf_bridge;            /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	unsigned int               len;                  /*   128     4 */
	unsigned int               data_len;             /*   132     4 */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 12:10           ` David Lamparter
  2017-08-17 12:19             ` Nikolay Aleksandrov
@ 2017-08-17 12:20             ` David Lamparter
  1 sibling, 0 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-17 12:20 UTC (permalink / raw)
  To: David Lamparter
  Cc: Nikolay Aleksandrov, netdev, amine.kherbouche, roopa, stephen, bridge

On Thu, Aug 17, 2017 at 02:10:20PM +0200, David Lamparter wrote:
> On Thu, Aug 17, 2017 at 02:51:12PM +0300, Nikolay Aleksandrov wrote:
> > On 17/08/17 14:39, Nikolay Aleksandrov wrote:
> > > On 17/08/17 14:03, David Lamparter wrote:
> > >> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
> [cut]
> > >>> and hitting the fast path for everyone in a few different places for a
> > >>> feature that the majority will not use does not sound acceptable to
> > >>> me. We've been trying hard to optimize it, trying to avoid additional
> > >>> cache lines, removing tests and keeping special cases to a minimum. 
> > >>
> > >> skb->dst is on the same cacheline as skb->len.
> > >> fdb->md_dst is on the same cacheline as fdb->dst.
> > >> Both will be 0 in a lot of cases, so this should be two null checks on
> > >> data that is hot in the cache.  Are you sure this is an actual problem?
> > >>
> > > 
> > > Sure - no, I haven't benchmarked it, but I don't see skb->len being on
> > > the same cache line as _skb_refdst assuming 64 byte cache lines.
> > 
> > I should've been clearer - that obviously depends on the kernel config, but
> > in order for them to be in the same line you need to disable either one of 
> > conntrack, bridge_netfilter or xfrm, these are almost always enabled (at
> > least in all major distributions).
> 
> Did I miscount somewhere?  This is what I counted:
> offs    size
> 00      16      next/prev/other union bits

Argh, struct rb_node is 24 bytes.  *sigh*

Am I going to be stoned for saying that maybe the conditional fields
(sp, nfcd, nf_bridge) should be moved down? :D

btw: nf_bridge / BRIDGE_NETFILTER is incompatible with this to begin
with because it tramples over skb->dst with its DST_FAKE_RTABLE dst.


-David

> 16      8       sk
> 24      8       dev
> 32	32      cb (first 32 bytes)
> ---- boundary @ 64
> 64      16      cb (last 16 bytes)
> 80      8       _skb_refdst
> 88      8       destructor
> 96      8 (0)   sp
> 104     8 (0)   _nfct
> 112     8 (0)   nf_bridge
> 120     4       len
> 124     4       data_len
> ---- boundary @ 128
> 128     2       mac_len
> 130     2       hdr_len

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 11:39       ` Nikolay Aleksandrov
  2017-08-17 11:51         ` Nikolay Aleksandrov
@ 2017-08-17 12:45         ` David Lamparter
  2017-08-17 13:04           ` Nikolay Aleksandrov
  1 sibling, 1 reply; 26+ messages in thread
From: David Lamparter @ 2017-08-17 12:45 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: David Lamparter, netdev, amine.kherbouche, roopa, stephen, bridge

On Thu, Aug 17, 2017 at 02:39:43PM +0300, Nikolay Aleksandrov wrote:
> On 17/08/17 14:03, David Lamparter wrote:
> > On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
> >> On 16/08/17 20:01, David Lamparter wrote:
> >> and hitting the fast path for everyone in a few different places for a
> >> feature that the majority will not use does not sound acceptable to
> >> me. We've been trying hard to optimize it, trying to avoid additional
> >> cache lines, removing tests and keeping special cases to a minimum. 
> > 
> > skb->dst is on the same cacheline as skb->len.
> > fdb->md_dst is on the same cacheline as fdb->dst.
> > Both will be 0 in a lot of cases, so this should be two null checks on
> > data that is hot in the cache.  Are you sure this is an actual problem?
> > 
> 
> Sure - no, I haven't benchmarked it, but I don't see skb->len being on
> the same cache line as _skb_refdst assuming 64 byte cache lines.
> But again any special cases, in my opinion, should be handled on their own,
> it is both about the fast path and the code complexity that they bring in.

(separate thread)

[cut]
> > I really hope you're not suggesting the entire MDB with IPv4 & IPv6
> > snooping be duplicated into both VPLS and mac80211?
> 
> Code can always be shared if there are more users, no need to stuff
> everything in the bridge,

The MDB code is far from trivial, has several configuration knobs, and
even sends its own queries if configured to do so.  It can also use
quite a bit of memory of there's a nontrivial number of multicast
groups.  I *really* think it shouldn't be duplicated.

> but I'm not that familiar with this case, once patches are out I can
> comment further.

I've pushed my hacks to:
https://github.com/eqvinox/vpls-linux-kernel/commits/mdb-hack
(top two commits)

THIS IS ABSOLUTELY A PROOF OF CONCEPT.  It doesn't un-learn dst
metadata, it probably leaks buckets, and it may kill your cat.
(I haven't pushed my attempts at mac80211, because I haven't gotten
anywhere useful there just yet.)

> >> As you've noted this is only an RFC so I will not point out every issue, but there seems
> >> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.
> > 
> > Right, Thanks! ... I only thought about concurrent access, forgetting
> > about concurrent modification...  I'll replace it with an xchg I think.
> > (No need for a lock that way)
> 
> I think you can still lose references to a dst that way, what if someone changes the
> dst you read before the xchg and you xchg it ?

The dst to be released is the return from (atomic) xchg, not the value
read earlier for comparison.  This can happen in parallel, but apart
from a little extra churn in the update case it has no ill effects.

If someone changes it in the meantime, they have new dst information for
the fdb entry, and so do we.  With xchg'ing it, either one of the
updates will stick and the other will be properly released.  Considering
that there is no correct ordering here (either packet could be processed
a nanosecond later or earlier), this is perfectly fine as an outcome.


-David

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-17 12:45         ` David Lamparter
@ 2017-08-17 13:04           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 26+ messages in thread
From: Nikolay Aleksandrov @ 2017-08-17 13:04 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev, amine.kherbouche, roopa, stephen, bridge

On 17/08/17 15:45, David Lamparter wrote:
> On Thu, Aug 17, 2017 at 02:39:43PM +0300, Nikolay Aleksandrov wrote:
>> On 17/08/17 14:03, David Lamparter wrote:
>>> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
>>>> On 16/08/17 20:01, David Lamparter wrote:
>>>> and hitting the fast path for everyone in a few different places for a
>>>> feature that the majority will not use does not sound acceptable to
>>>> me. We've been trying hard to optimize it, trying to avoid additional
>>>> cache lines, removing tests and keeping special cases to a minimum. 
>>>
>>> skb->dst is on the same cacheline as skb->len.
>>> fdb->md_dst is on the same cacheline as fdb->dst.
>>> Both will be 0 in a lot of cases, so this should be two null checks on
>>> data that is hot in the cache.  Are you sure this is an actual problem?
>>>
>>
>> Sure - no, I haven't benchmarked it, but I don't see skb->len being on
>> the same cache line as _skb_refdst assuming 64 byte cache lines.
>> But again any special cases, in my opinion, should be handled on their own,
>> it is both about the fast path and the code complexity that they bring in.
> 
> (separate thread)
> 
> [cut]
>>> I really hope you're not suggesting the entire MDB with IPv4 & IPv6
>>> snooping be duplicated into both VPLS and mac80211?
>>
>> Code can always be shared if there are more users, no need to stuff
>> everything in the bridge,
> 
> The MDB code is far from trivial, has several configuration knobs, and
> even sends its own queries if configured to do so.  It can also use
> quite a bit of memory of there's a nontrivial number of multicast
> groups.  I *really* think it shouldn't be duplicated.
> 
>> but I'm not that familiar with this case, once patches are out I can
>> comment further.
> 
> I've pushed my hacks to:
> https://github.com/eqvinox/vpls-linux-kernel/commits/mdb-hack
> (top two commits)
> 
> THIS IS ABSOLUTELY A PROOF OF CONCEPT.  It doesn't un-learn dst
> metadata, it probably leaks buckets, and it may kill your cat.
> (I haven't pushed my attempts at mac80211, because I haven't gotten
> anywhere useful there just yet.)
> 
>>>> As you've noted this is only an RFC so I will not point out every issue, but there seems
>>>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU.
>>>
>>> Right, Thanks! ... I only thought about concurrent access, forgetting
>>> about concurrent modification...  I'll replace it with an xchg I think.
>>> (No need for a lock that way)
>>
>> I think you can still lose references to a dst that way, what if someone changes the
>> dst you read before the xchg and you xchg it ?
> 
> The dst to be released is the return from (atomic) xchg, not the value
> read earlier for comparison.  This can happen in parallel, but apart
> from a little extra churn in the update case it has no ill effects.
> 
> If someone changes it in the meantime, they have new dst information for
> the fdb entry, and so do we.  With xchg'ing it, either one of the
> updates will stick and the other will be properly released.  Considering
> that there is no correct ordering here (either packet could be processed
> a nanosecond later or earlier), this is perfectly fine as an outcome.

Yep right you are, my bad.

> 
> 
> -David
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] bridge: learn dst metadata in FDB
  2017-08-16 20:38   ` Nikolay Aleksandrov
  2017-08-17 11:03     ` David Lamparter
@ 2017-08-17 16:16     ` David Lamparter
  1 sibling, 0 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-17 16:16 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: David Lamparter, netdev, amine.kherbouche, roopa, stephen, bridge

On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote:
> On 16/08/17 20:01, David Lamparter wrote:
> > This implements holding dst metadata information in the bridge layer,
> > but only for unicast entries in the MAC table.  Multicast is still left
> > to design and implement.
> > 
> > Signed-off-by: David Lamparter <equinox@diac24.net>
> > ---
> 
> Hi David,
> Sorry but I do not agree with this change, adding a special case for VPLS

To prove that this is not a special case for VPLS, I have attached a
patch for GRETAP multicast+unicast learning below.

It's just 24(!) lines added to get functionality similar to "basic
VXLAN" (i.e. multicast with dataplane learning.)


-David

---
From: David Lamparter <equinox@diac24.net>
Date: Thu, 17 Aug 2017 18:11:16 +0200
Subject: [PATCH] gretap: support multicast + unicast learning

This enables using an IPv4 multicast destination for gretap and enables
learning unicast destinations through the bridge fdb.

Signed-off-by: David Lamparter <equinox@diac24.net>
---
 net/ipv4/ip_gre.c    | 27 +++++++++++++++++++++++----
 net/ipv4/ip_tunnel.c |  1 +
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 7a7829e839c2..e58f8ccb2c87 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -266,7 +266,8 @@ static int __ipgre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi,
 			skb_pop_mac_header(skb);
 		else
 			skb_reset_mac_header(skb);
-		if (tunnel->collect_md) {
+		if (tunnel->collect_md
+		    || ipv4_is_multicast(tunnel->parms.iph.daddr)) {
 			__be16 flags;
 			__be64 tun_id;
 
@@ -379,7 +380,7 @@ static struct rtable *gre_get_rt(struct sk_buff *skb,
 static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 			__be16 proto)
 {
-	struct ip_tunnel_info *tun_info;
+	struct ip_tunnel_info *tun_info, flipped;
 	const struct ip_tunnel_key *key;
 	struct rtable *rt = NULL;
 	struct flowi4 fl;
@@ -390,10 +391,22 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 	int err;
 
 	tun_info = skb_tunnel_info(skb);
-	if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+	if (unlikely(!tun_info ||
 		     ip_tunnel_info_af(tun_info) != AF_INET))
 		goto err_free_skb;
 
+	if (!(tun_info->mode & IP_TUNNEL_INFO_TX)) {
+		struct ip_tunnel *tunnel = netdev_priv(dev);
+
+		flipped = *tun_info;
+		flipped.mode |= IP_TUNNEL_INFO_TX;
+		flipped.key.u.ipv4.dst = tun_info->key.u.ipv4.src;
+		flipped.key.u.ipv4.src = tunnel->parms.iph.saddr;
+		flipped.key.tp_src = tun_info->key.tp_dst;
+		flipped.key.tp_dst = tun_info->key.tp_src;
+		tun_info = &flipped;
+	}
+
 	key = &tun_info->key;
 	use_cache = ip_tunnel_dst_cache_usable(skb, tun_info);
 	if (use_cache)
@@ -507,8 +520,9 @@ static netdev_tx_t gre_tap_xmit(struct sk_buff *skb,
 				struct net_device *dev)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct ip_tunnel_info *tun_info = skb_tunnel_info(skb);
 
-	if (tunnel->collect_md) {
+	if (tunnel->collect_md || tun_info) {
 		gre_fb_xmit(skb, dev, htons(ETH_P_TEB));
 		return NETDEV_TX_OK;
 	}
@@ -933,6 +947,7 @@ static int gre_tap_init(struct net_device *dev)
 {
 	__gre_tunnel_init(dev);
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	netif_keep_dst(dev);
 
 	return ip_tunnel_init(dev);
 }
@@ -940,6 +955,10 @@ static int gre_tap_init(struct net_device *dev)
 static const struct net_device_ops gre_tap_netdev_ops = {
 	.ndo_init		= gre_tap_init,
 	.ndo_uninit		= ip_tunnel_uninit,
+#ifdef CONFIG_NET_IPGRE_BROADCAST
+	.ndo_open		= ipgre_open,
+	.ndo_stop		= ipgre_close,
+#endif
 	.ndo_start_xmit		= gre_tap_xmit,
 	.ndo_set_mac_address 	= eth_mac_addr,
 	.ndo_validate_addr	= eth_validate_addr,
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 129d1a3616f8..451c11fc9ae5 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -140,6 +140,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net *itn,
 
 	hlist_for_each_entry_rcu(t, head, hash_node) {
 		if ((local != t->parms.iph.saddr || t->parms.iph.daddr != 0) &&
+		    (local != t->parms.iph.saddr || !ipv4_is_multicast(t->parms.iph.daddr)) &&
 		    (local != t->parms.iph.daddr || !ipv4_is_multicast(local)))
 			continue;
 
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] mpls: split forwarding path on rx/tx boundary
  2017-08-16 17:01 ` [PATCH 2/6] mpls: split forwarding path on rx/tx boundary David Lamparter
@ 2017-08-19 17:10   ` kbuild test robot
  2017-08-19 17:42   ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-08-19 17:10 UTC (permalink / raw)
  To: David Lamparter
  Cc: kbuild-all, netdev, amine.kherbouche, roopa, David Lamparter

[-- Attachment #1: Type: text/plain, Size: 3916 bytes --]

Hi David,

[auto build test WARNING on net-next/master]
[also build test WARNING on next-20170817]
[cannot apply to v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/David-Lamparter/bridge-learn-dst-metadata-in-FDB/20170820-001849
config: s390-allmodconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=s390 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from net/mpls/af_mpls.c:1:
   net/mpls/af_mpls.c: In function 'mpls_rt_xmit':
>> include/linux/compiler.h:239:26: warning: 'out_dev' may be used uninitialized in this function [-Wmaybe-uninitialized]
     case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
                             ^
   net/mpls/af_mpls.c:384:21: note: 'out_dev' was declared here
     struct net_device *out_dev;
                        ^~~~~~~
--
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from net//mpls/af_mpls.c:1:
   net//mpls/af_mpls.c: In function 'mpls_rt_xmit':
>> include/linux/compiler.h:239:26: warning: 'out_dev' may be used uninitialized in this function [-Wmaybe-uninitialized]
     case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
                             ^
   net//mpls/af_mpls.c:384:21: note: 'out_dev' was declared here
     struct net_device *out_dev;
                        ^~~~~~~

vim +/out_dev +239 include/linux/compiler.h

230fa253 Christian Borntraeger 2014-11-25  232  
d976441f Andrey Ryabinin       2015-10-19  233  #define __READ_ONCE_SIZE						\
d976441f Andrey Ryabinin       2015-10-19  234  ({									\
d976441f Andrey Ryabinin       2015-10-19  235  	switch (size) {							\
d976441f Andrey Ryabinin       2015-10-19  236  	case 1: *(__u8 *)res = *(volatile __u8 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  237  	case 2: *(__u16 *)res = *(volatile __u16 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  238  	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19 @239  	case 8: *(__u64 *)res = *(volatile __u64 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  240  	default:							\
d976441f Andrey Ryabinin       2015-10-19  241  		barrier();						\
d976441f Andrey Ryabinin       2015-10-19  242  		__builtin_memcpy((void *)res, (const void *)p, size);	\
d976441f Andrey Ryabinin       2015-10-19  243  		barrier();						\
d976441f Andrey Ryabinin       2015-10-19  244  	}								\
d976441f Andrey Ryabinin       2015-10-19  245  })
d976441f Andrey Ryabinin       2015-10-19  246  

:::::: The code at line 239 was first introduced by commit
:::::: d976441f44bc5d48635d081d277aa76556ffbf8b compiler, atomics, kasan: Provide READ_ONCE_NOCHECK()

:::::: TO: Andrey Ryabinin <aryabinin@virtuozzo.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 47252 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] mpls: split forwarding path on rx/tx boundary
  2017-08-16 17:01 ` [PATCH 2/6] mpls: split forwarding path on rx/tx boundary David Lamparter
  2017-08-19 17:10   ` kbuild test robot
@ 2017-08-19 17:42   ` kbuild test robot
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-08-19 17:42 UTC (permalink / raw)
  To: David Lamparter
  Cc: kbuild-all, netdev, amine.kherbouche, roopa, David Lamparter

[-- Attachment #1: Type: text/plain, Size: 6964 bytes --]

Hi David,

[auto build test WARNING on net-next/master]
[also build test WARNING on next-20170817]
[cannot apply to v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/David-Lamparter/bridge-learn-dst-metadata-in-FDB/20170820-001849
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   net/mpls/af_mpls.c: In function 'mpls_rt_xmit':
>> net/mpls/af_mpls.c:452:45: warning: 'out_dev' may be used uninitialized in this function [-Wmaybe-uninitialized]
     out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
                                                ^

vim +/out_dev +452 net/mpls/af_mpls.c

679ee57f David Lamparter   2017-08-16  379  
679ee57f David Lamparter   2017-08-16  380  int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
679ee57f David Lamparter   2017-08-16  381  		 struct mpls_entry_decoded dec)
679ee57f David Lamparter   2017-08-16  382  {
679ee57f David Lamparter   2017-08-16  383  	struct mpls_nh *nh;
679ee57f David Lamparter   2017-08-16  384  	struct net_device *out_dev;
679ee57f David Lamparter   2017-08-16  385  	struct mpls_dev *out_mdev;
679ee57f David Lamparter   2017-08-16  386  	unsigned int hh_len;
679ee57f David Lamparter   2017-08-16  387  	unsigned int new_header_size;
679ee57f David Lamparter   2017-08-16  388  	unsigned int mtu;
679ee57f David Lamparter   2017-08-16  389  	int err;
679ee57f David Lamparter   2017-08-16  390  
679ee57f David Lamparter   2017-08-16  391  	nh = mpls_select_multipath(rt, skb);
679ee57f David Lamparter   2017-08-16  392  	if (!nh)
679ee57f David Lamparter   2017-08-16  393  		goto tx_err;
679ee57f David Lamparter   2017-08-16  394  
27d69105 Robert Shearman   2017-01-16  395  	/* Find the output device */
27d69105 Robert Shearman   2017-01-16  396  	out_dev = rcu_dereference(nh->nh_dev);
27d69105 Robert Shearman   2017-01-16  397  	if (!mpls_output_possible(out_dev))
27d69105 Robert Shearman   2017-01-16  398  		goto tx_err;
27d69105 Robert Shearman   2017-01-16  399  
0189197f Eric W. Biederman 2015-03-03  400  	/* Verify the destination can hold the packet */
f8efb73c Roopa Prabhu      2015-10-23  401  	new_header_size = mpls_nh_header_size(nh);
0189197f Eric W. Biederman 2015-03-03  402  	mtu = mpls_dev_mtu(out_dev);
0189197f Eric W. Biederman 2015-03-03  403  	if (mpls_pkt_too_big(skb, mtu - new_header_size))
27d69105 Robert Shearman   2017-01-16  404  		goto tx_err;
0189197f Eric W. Biederman 2015-03-03  405  
0189197f Eric W. Biederman 2015-03-03  406  	hh_len = LL_RESERVED_SPACE(out_dev);
0189197f Eric W. Biederman 2015-03-03  407  	if (!out_dev->header_ops)
0189197f Eric W. Biederman 2015-03-03  408  		hh_len = 0;
0189197f Eric W. Biederman 2015-03-03  409  
0189197f Eric W. Biederman 2015-03-03  410  	/* Ensure there is enough space for the headers in the skb */
0189197f Eric W. Biederman 2015-03-03  411  	if (skb_cow(skb, hh_len + new_header_size))
27d69105 Robert Shearman   2017-01-16  412  		goto tx_err;
0189197f Eric W. Biederman 2015-03-03  413  
0189197f Eric W. Biederman 2015-03-03  414  	skb->dev = out_dev;
0189197f Eric W. Biederman 2015-03-03  415  	skb->protocol = htons(ETH_P_MPLS_UC);
0189197f Eric W. Biederman 2015-03-03  416  
0189197f Eric W. Biederman 2015-03-03  417  	if (unlikely(!new_header_size && dec.bos)) {
0189197f Eric W. Biederman 2015-03-03  418  		/* Penultimate hop popping */
5b441ac8 Robert Shearman   2017-03-10  419  		if (!mpls_egress(dev_net(out_dev), rt, skb, dec))
679ee57f David Lamparter   2017-08-16  420  			goto tx_err;
0189197f Eric W. Biederman 2015-03-03  421  	} else {
679ee57f David Lamparter   2017-08-16  422  		struct mpls_shim_hdr *hdr;
0189197f Eric W. Biederman 2015-03-03  423  		bool bos;
0189197f Eric W. Biederman 2015-03-03  424  		int i;
0189197f Eric W. Biederman 2015-03-03  425  		skb_push(skb, new_header_size);
0189197f Eric W. Biederman 2015-03-03  426  		skb_reset_network_header(skb);
0189197f Eric W. Biederman 2015-03-03  427  		/* Push the new labels */
0189197f Eric W. Biederman 2015-03-03  428  		hdr = mpls_hdr(skb);
0189197f Eric W. Biederman 2015-03-03  429  		bos = dec.bos;
f8efb73c Roopa Prabhu      2015-10-23  430  		for (i = nh->nh_labels - 1; i >= 0; i--) {
f8efb73c Roopa Prabhu      2015-10-23  431  			hdr[i] = mpls_entry_encode(nh->nh_label[i],
f8efb73c Roopa Prabhu      2015-10-23  432  						   dec.ttl, 0, bos);
0189197f Eric W. Biederman 2015-03-03  433  			bos = false;
0189197f Eric W. Biederman 2015-03-03  434  		}
0189197f Eric W. Biederman 2015-03-03  435  	}
0189197f Eric W. Biederman 2015-03-03  436  
27d69105 Robert Shearman   2017-01-16  437  	mpls_stats_inc_outucastpkts(out_dev, skb);
27d69105 Robert Shearman   2017-01-16  438  
eb7809f0 Robert Shearman   2015-12-10  439  	/* If via wasn't specified then send out using device address */
eb7809f0 Robert Shearman   2015-12-10  440  	if (nh->nh_via_table == MPLS_NEIGH_TABLE_UNSPEC)
eb7809f0 Robert Shearman   2015-12-10  441  		err = neigh_xmit(NEIGH_LINK_TABLE, out_dev,
eb7809f0 Robert Shearman   2015-12-10  442  				 out_dev->dev_addr, skb);
eb7809f0 Robert Shearman   2015-12-10  443  	else
eb7809f0 Robert Shearman   2015-12-10  444  		err = neigh_xmit(nh->nh_via_table, out_dev,
eb7809f0 Robert Shearman   2015-12-10  445  				 mpls_nh_via(rt, nh), skb);
0189197f Eric W. Biederman 2015-03-03  446  	if (err)
0189197f Eric W. Biederman 2015-03-03  447  		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
0189197f Eric W. Biederman 2015-03-03  448  				    __func__, err);
0189197f Eric W. Biederman 2015-03-03  449  	return 0;
0189197f Eric W. Biederman 2015-03-03  450  
27d69105 Robert Shearman   2017-01-16  451  tx_err:
27d69105 Robert Shearman   2017-01-16 @452  	out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
27d69105 Robert Shearman   2017-01-16  453  	if (out_mdev)
27d69105 Robert Shearman   2017-01-16  454  		MPLS_INC_STATS(out_mdev, tx_errors);
679ee57f David Lamparter   2017-08-16  455  	return -1;
0189197f Eric W. Biederman 2015-03-03  456  }
0189197f Eric W. Biederman 2015-03-03  457  

:::::: The code at line 452 was first introduced by commit
:::::: 27d691056bde4a6feca5e83fd92b787332c46302 mpls: Packet stats

:::::: TO: Robert Shearman <rshearma@brocade.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50949 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] mpls: add VPLS entry points
  2017-08-16 17:01 ` [PATCH 3/6] mpls: add VPLS entry points David Lamparter
@ 2017-08-19 18:27   ` kbuild test robot
  2017-08-21 14:01   ` Amine Kherbouche
  1 sibling, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-08-19 18:27 UTC (permalink / raw)
  To: David Lamparter
  Cc: kbuild-all, netdev, amine.kherbouche, roopa, David Lamparter

[-- Attachment #1: Type: text/plain, Size: 2846 bytes --]

Hi David,

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20170817]
[cannot apply to v4.13-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/David-Lamparter/bridge-learn-dst-metadata-in-FDB/20170820-001849
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   In file included from net/mpls/af_mpls.c:25:0:
>> net/mpls/internal.h:231:19: error: function declaration isn't a prototype [-Werror=strict-prototypes]
    static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
                      ^
   net/mpls/internal.h: In function 'vpls_rcv':
>> net/mpls/internal.h:233:2: warning: passing argument 1 of 'kfree_skb' makes pointer from integer without a cast
     kfree_skb(skb);
     ^
   In file included from net/mpls/af_mpls.c:2:0:
   include/linux/skbuff.h:956:6: note: expected 'struct sk_buff *' but argument is of type 'int'
    void kfree_skb(struct sk_buff *skb);
         ^
   net/mpls/af_mpls.c: In function 'mpls_rt_xmit':
   net/mpls/af_mpls.c:459:45: warning: 'out_dev' may be used uninitialized in this function [-Wmaybe-uninitialized]
     out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
                                                ^
   cc1: some warnings being treated as errors
--
   In file included from net/mpls/mpls_iptunnel.c:28:0:
>> net/mpls/internal.h:231:19: error: function declaration isn't a prototype [-Werror=strict-prototypes]
    static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
                      ^
   net/mpls/internal.h: In function 'vpls_rcv':
>> net/mpls/internal.h:233:2: warning: passing argument 1 of 'kfree_skb' makes pointer from integer without a cast
     kfree_skb(skb);
     ^
   In file included from net/mpls/mpls_iptunnel.c:14:0:
   include/linux/skbuff.h:956:6: note: expected 'struct sk_buff *' but argument is of type 'int'
    void kfree_skb(struct sk_buff *skb);
         ^
   cc1: some warnings being treated as errors

vim +231 net/mpls/internal.h

   229	
   230	#else /* !CONFIG_MPLS_VPLS */
 > 231	static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
   232	{
 > 233		kfree_skb(skb);
   234		return NET_RX_DROP;
   235	}
   236	static inline int is_vpls_dev(struct net_device *dev)
   237	{
   238		return 0;
   239	}
   240	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50949 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] mpls: add VPLS entry points
  2017-08-16 17:01 ` [PATCH 3/6] mpls: add VPLS entry points David Lamparter
  2017-08-19 18:27   ` kbuild test robot
@ 2017-08-21 14:01   ` Amine Kherbouche
  2017-08-21 15:55     ` David Lamparter
  1 sibling, 1 reply; 26+ messages in thread
From: Amine Kherbouche @ 2017-08-21 14:01 UTC (permalink / raw)
  To: David Lamparter, netdev; +Cc: roopa



On 08/16/2017 07:01 PM, David Lamparter wrote:
> This wires up the neccessary calls for VPLS into the MPLS forwarding
> pieces.  Since CONFIG_MPLS_VPLS doesn't exist yet in Kconfig, it'll
> never be enabled, so we're on the stubs for now.
>
> Signed-off-by: David Lamparter <equinox@diac24.net>
> ---
>  include/uapi/linux/rtnetlink.h |  1 +
>  net/mpls/af_mpls.c             | 54 ++++++++++++++++++++++++++++++++++++++++++
>  net/mpls/internal.h            | 29 +++++++++++++++++++++++
>  3 files changed, 84 insertions(+)
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index dab7dad9e01a..b7840ed94526 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -326,6 +326,7 @@ enum rtattr_type_t {
>  	RTA_PAD,
>  	RTA_UID,
>  	RTA_TTL_PROPAGATE,
> +	RTA_VPLS_IF,
>  	__RTA_MAX
>  };
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 0c5953e5d5bd..4d3ce007b7db 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -299,6 +299,11 @@ static bool mpls_egress(struct net *net, struct mpls_route *rt,
>  		success = true;
>  		break;
>  	}
> +	case MPT_VPLS:
> +		/* nothing to do here, no TTL in Ethernet
> +		 * (and we shouldn't mess with the TTL in inner IP packets,
> +		 * pseudowires are supposed to be transparent) */
> +		break;
>  	case MPT_UNSPEC:
>  		/* Should have decided which protocol it is by now */
>  		break;
> @@ -349,6 +354,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>  		goto drop;
>  	}
>
> +	if (rt->rt_payload_type == MPT_VPLS)
> +		return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);

you should get the ret value of vpls_rcv() and increment stats if error 
occurs.
		int ret;
		ret = vpls_rc(...)
		if (ret)
			goto err;
		return NET_RX_SUCCESS;
>
>  	/* Pop the label */
>  	skb_pull(skb, sizeof(*hdr));
> @@ -469,6 +476,7 @@ static const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
>  struct mpls_route_config {
>  	u32			rc_protocol;
>  	u32			rc_ifindex;
> +	u32			rc_vpls_ifindex;
>  	u8			rc_via_table;
>  	u8			rc_via_alen;
>  	u8			rc_via[MAX_VIA_ALEN];
> @@ -541,6 +549,8 @@ static void mpls_route_update(struct net *net, unsigned index,
>  	rt = rtnl_dereference(platform_label[index]);
>  	rcu_assign_pointer(platform_label[index], new);
>
> +	vpls_label_update(index, rt, new);
> +
>  	mpls_notify_route(net, index, rt, new, info);
>
>  	/* If we removed a route free it now */
> @@ -942,6 +952,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
>  	struct mpls_route __rcu **platform_label;
>  	struct net *net = cfg->rc_nlinfo.nl_net;
>  	struct mpls_route *rt, *old;
> +	struct net_device *vpls_dev = NULL;
>  	int err = -EINVAL;
>  	u8 max_via_alen;
>  	unsigned index;
> @@ -996,6 +1007,24 @@ static int mpls_route_add(struct mpls_route_config *cfg,
>  		goto errout;
>  	}
>
> +	if (cfg->rc_vpls_ifindex) {
> +		vpls_dev = dev_get_by_index(net, cfg->rc_vpls_ifindex);
> +		if (!vpls_dev) {
> +			err = -ENODEV;
> +			NL_SET_ERR_MSG(extack, "Invalid VPLS ifindex");
> +			goto errout;
> +		}
> +		/* we're under RTNL; and we'll drop routes when we're
> +		 * notified the device is going away. */
> +		dev_put(vpls_dev);
> +
> +		if (!is_vpls_dev(vpls_dev)) {
> +			err = -ENODEV;
> +			NL_SET_ERR_MSG(extack, "Not a VPLS device");
> +			goto errout;
> +		}
> +	}
> +
>  	err = -ENOMEM;
>  	rt = mpls_rt_alloc(nhs, max_via_alen, max_labels);
>  	if (IS_ERR(rt)) {
> @@ -1006,6 +1035,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
>  	rt->rt_protocol = cfg->rc_protocol;
>  	rt->rt_payload_type = cfg->rc_payload_type;
>  	rt->rt_ttl_propagate = cfg->rc_ttl_propagate;
> +	rt->rt_vpls_dev = vpls_dev;
>
>  	if (cfg->rc_mp)
>  		err = mpls_nh_build_multi(cfg, rt, max_labels, extack);
> @@ -1430,6 +1460,14 @@ static void mpls_ifdown(struct net_device *dev, int event)
>  		if (!rt)
>  			continue;
>
> +		if (rt->rt_vpls_dev == dev) {
> +			switch (event) {
> +			case NETDEV_UNREGISTER:
> +				mpls_route_update(net, index, NULL, NULL);
> +				continue;
> +			}
> +		}
> +
>  		alive = 0;
>  		deleted = 0;
>  		change_nexthops(rt) {
> @@ -1777,6 +1815,10 @@ static int rtm_to_route_config(struct sk_buff *skb,
>  		case RTA_OIF:
>  			cfg->rc_ifindex = nla_get_u32(nla);
>  			break;
> +		case RTA_VPLS_IF:
> +			cfg->rc_vpls_ifindex = nla_get_u32(nla);
> +			cfg->rc_payload_type = MPT_VPLS;
> +			break;
>  		case RTA_NEWDST:
>  			if (nla_get_labels(nla, MAX_NEW_LABELS,
>  					   &cfg->rc_output_labels,
> @@ -1911,6 +1953,11 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
>  			       ttl_propagate))
>  			goto nla_put_failure;
>  	}
> +
> +	if (rt->rt_vpls_dev)
> +		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
> +			goto nla_put_failure;
> +
>  	if (rt->rt_nhn == 1) {
>  		const struct mpls_nh *nh = rt->rt_nh;
>
> @@ -2220,6 +2267,10 @@ static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
>  	if (nla_put_labels(skb, RTA_DST, 1, &in_label))
>  		goto nla_put_failure;
>
> +	if (rt->rt_vpls_dev)
> +		if (nla_put_u32(skb, RTA_VPLS_IF, rt->rt_vpls_dev->ifindex))
> +			goto nla_put_failure;
> +
>  	if (nh->nh_labels &&
>  	    nla_put_labels(skb, RTA_NEWDST, nh->nh_labels,
>  			   nh->nh_label))
> @@ -2491,6 +2542,8 @@ static int __init mpls_init(void)
>
>  	rtnl_af_register(&mpls_af_ops);
>
> +	vpls_init();
> +
>  	rtnl_register(PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, 0);
>  	rtnl_register(PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, 0);
>  	rtnl_register(PF_MPLS, RTM_GETROUTE, mpls_getroute, mpls_dump_routes,
> @@ -2510,6 +2563,7 @@ module_init(mpls_init);
>  static void __exit mpls_exit(void)
>  {
>  	rtnl_unregister_all(PF_MPLS);
> +	vpls_exit();
>  	rtnl_af_unregister(&mpls_af_ops);
>  	dev_remove_pack(&mpls_packet_type);
>  	unregister_netdevice_notifier(&mpls_dev_notifier);
> diff --git a/net/mpls/internal.h b/net/mpls/internal.h
> index b70c6663d4f3..876ae9993207 100644
> --- a/net/mpls/internal.h
> +++ b/net/mpls/internal.h
> @@ -76,6 +76,7 @@ struct sk_buff;
>
>  enum mpls_payload_type {
>  	MPT_UNSPEC, /* IPv4 or IPv6 */
> +	MPT_VPLS = 2,	/* pseudowire */
>  	MPT_IPV4 = 4,
>  	MPT_IPV6 = 6,
>
> @@ -153,6 +154,8 @@ struct mpls_route { /* next hop label forwarding entry */
>  	u8			rt_nh_size;
>  	u8			rt_via_offset;
>  	u8			rt_reserved1;
> +	struct net_device	*rt_vpls_dev;
> +
>  	struct mpls_nh		rt_nh[0];
>  };
>
> @@ -214,4 +217,30 @@ struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
>  int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
>  		 struct mpls_entry_decoded dec);
>
> +#ifdef CONFIG_MPLS_VPLS
> +int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
> +	     struct packet_type *pt, struct mpls_route *rt,
> +	     struct mpls_shim_hdr *hdr, struct net_device *orig_dev);
> +void vpls_label_update(unsigned label, struct mpls_route *rt_old,
> +		       struct mpls_route *rt_new);
> +__init int vpls_init(void);
> +__exit void vpls_exit(void);
> +int is_vpls_dev(struct net_device *dev);
> +
> +#else /* !CONFIG_MPLS_VPLS */
> +static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
> +{
> +	kfree_skb(skb);
> +	return NET_RX_DROP;

just return NET_RX_DROP;

> +}
> +static inline int is_vpls_dev(struct net_device *dev)
> +{
> +	return 0;
> +}
> +
> +#define vpls_label_update(label, rt_old, rt_new) do { } while (0)
> +#define vpls_init() do { } while (0)
> +#define vpls_exit() do { } while (0)
> +#endif

s/ #endif/#endif /* CONFIG_MPLS_VPLS *//

> +
>  #endif /* MPLS_INTERNAL_H */
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] mpls: VPLS support
  2017-08-16 17:02 ` [PATCH 4/6] mpls: VPLS support David Lamparter
@ 2017-08-21 15:14   ` Amine Kherbouche
  2017-08-21 16:18     ` David Lamparter
  2017-08-21 16:11   ` Amine Kherbouche
  1 sibling, 1 reply; 26+ messages in thread
From: Amine Kherbouche @ 2017-08-21 15:14 UTC (permalink / raw)
  To: David Lamparter, netdev; +Cc: roopa


> new file mode 100644
> index 000000000000..28ac810da6e9
> --- /dev/null
> +++ b/net/mpls/vpls.c
> @@ -0,0 +1,469 @@
> +/*
> + *  net/mpls/vpls.c
> + *
> + *  Copyright (C) 2016 David Lamparter
> + *
> + */
> +
> +#include <linux/netdevice.h>
> +#include <linux/slab.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/u64_stats_sync.h>
> +#include <linux/mpls.h>
> +
> +#include <net/rtnetlink.h>
> +#include <net/dst.h>
> +#include <net/xfrm.h>
> +#include <net/mpls.h>
> +#include <linux/module.h>
> +#include <net/dst_metadata.h>
> +#include <net/ip_tunnels.h>

some headers are not needed.


> +
> +static netdev_tx_t vpls_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	int err = -EINVAL, ok_count = 0;
> +	struct vpls_priv *priv = netdev_priv(dev);
> +	struct vpls_info *vi;
> +	struct pcpu_sw_netstats *stats;
> +	size_t len = skb->len;
> +
> +	rcu_read_lock();
> +	vi = skb_vpls_info(skb);
> +
> +	skb_orphan(skb);
> +	skb_forward_csum(skb);
> +
> +	if (vi) {
> +		err = vpls_xmit_wire(skb, dev, priv, vi->pw_label);
> +		if (err)
> +			goto out_err;
> +	} else {

the rcu_read_lock() is just needed for the else statement right ?

> +		struct sk_buff *cloned;
> +		struct vpls_wirelist *wl;
> +		size_t i;
> +
> +		wl = rcu_dereference(priv->wires);
> +		if (wl->count == 0) {
> +			dev->stats.tx_carrier_errors++;
> +			goto out_err;
> +		}
> +
> +		for (i = 0; i < wl->count; i++) {
> +			cloned = skb_clone(skb, GFP_KERNEL);
> +			if (vpls_xmit_wire(cloned, dev, priv, wl->wires[i]))
> +				consume_skb(cloned);
> +			else
> +				ok_count++;
> +		}
> +		if (!ok_count)
> +			goto out_err;
> +
> +		consume_skb(skb);
> +	}
> +
> +	stats = this_cpu_ptr(dev->tstats);
> +	u64_stats_update_begin(&stats->syncp);
> +	stats->tx_packets++;
> +	stats->tx_bytes += len;
> +	u64_stats_update_end(&stats->syncp);
> +
> +	rcu_read_unlock();
> +	return 0;
> +}


> +int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
> +	     struct packet_type *pt, struct mpls_route *rt,
> +	     struct mpls_shim_hdr *hdr, struct net_device *orig_dev)
> +{
> +	struct net_device *dev = rt->rt_vpls_dev;
> +	struct mpls_entry_decoded dec;
> +	struct metadata_dst *md_dst;
> +	struct pcpu_sw_netstats *stats;
> +
> +	if (!dev)
> +		goto drop_nodev;
> +

we can get dec.bos from mpls stack as a function param, no need to 
decode the mpls header again.

> +	dec = mpls_entry_decode(hdr);
> +	if (!dec.bos) {
> +		dev->stats.rx_frame_errors++;
> +		goto drop;
> +	}
> +
> +	skb_pull(skb, sizeof(*hdr));
> +
> +	if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) {
> +		dev->stats.rx_length_errors++;
> +		goto drop;
> +	}



> +static const struct net_device_ops vpls_netdev_ops = {
> +	.ndo_init		= vpls_dev_init,
> +	.ndo_open		= vpls_open,
> +	.ndo_stop		= vpls_close,
> +	.ndo_start_xmit		= vpls_xmit,
> +	.ndo_change_mtu		= vpls_change_mtu,
> +	.ndo_get_stats64	= ip_tunnel_get_stats64,
> +	.ndo_set_rx_mode	= vpls_set_multicast_list,
> +	.ndo_set_mac_address	= eth_mac_addr,
> +	.ndo_features_check	= passthru_features_check,
> +};

can you add .ndo_get_stats64 function ?


>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] mpls: add VPLS entry points
  2017-08-21 14:01   ` Amine Kherbouche
@ 2017-08-21 15:55     ` David Lamparter
  2017-08-21 16:13       ` Amine Kherbouche
  0 siblings, 1 reply; 26+ messages in thread
From: David Lamparter @ 2017-08-21 15:55 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: David Lamparter, netdev, roopa

On Mon, Aug 21, 2017 at 04:01:15PM +0200, Amine Kherbouche wrote:
> On 08/16/2017 07:01 PM, David Lamparter wrote:
> > This wires up the neccessary calls for VPLS into the MPLS forwarding
> > pieces.  Since CONFIG_MPLS_VPLS doesn't exist yet in Kconfig, it'll
> > never be enabled, so we're on the stubs for now.
[...]
> > +	if (rt->rt_payload_type == MPT_VPLS)
> > +		return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);
> 
> you should get the ret value of vpls_rcv() and increment stats if error 
> occurs.

An error in vpls_rcv() is not a receive error on the outer device's MPLS
layer;  the packet was received correctly (and counted for that at the
beginning of mpls_forward()) and dispatched onto an appropriately
configured VPLS device.  vpls_rcv() counts its own stats on the inner
device.

[...]
> > +static inline int vpls_rcv(skb, in_dev, pt, rt, hdr, orig_dev)
> > +{
> > +	kfree_skb(skb);
> > +	return NET_RX_DROP;
> 
> just return NET_RX_DROP;

This is not correct;  the skb ownership is passed down to vpls_rcv(),
which itself will pass it on via netif_rx().  This is also why the call
in mpls_forward() does *not* "goto drop", instead directly returning.
vpls_rcv() consumes the skb in all cases.  Doing this the other way
around would incur extra overhead through a skb_clone() in vpls_rcv()
before giving it to netif_rx().

Note that the vpls_rcv() dummy function in this patch can't be called
because the kernel will refuse to install a VPLS route into the label
table (the is_vpls_device() check will be 0).  The dummy is just to keep
the code tidy.

[...]
> > +#endif
> s/ #endif/#endif /* CONFIG_MPLS_VPLS *//

Fixed in next ver (along with the bugged prototype reported by the
build bot.)

Cheers,


-David

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] mpls: VPLS support
  2017-08-16 17:02 ` [PATCH 4/6] mpls: VPLS support David Lamparter
  2017-08-21 15:14   ` Amine Kherbouche
@ 2017-08-21 16:11   ` Amine Kherbouche
  1 sibling, 0 replies; 26+ messages in thread
From: Amine Kherbouche @ 2017-08-21 16:11 UTC (permalink / raw)
  To: David Lamparter, netdev; +Cc: roopa


Is it right to put vpls.c file in /net/mpls, I know it's depends 
essentially from the mpls stack, but vpls is a driver so moving it in 
/drivers/net should be cleaner but we have to export some mpls functions.

Let's see others opinion.

On 08/16/2017 07:02 PM, David Lamparter wrote:
> [work-in-progress, works but needs changes]
> [v2: refactored lots of things, e.g. dst_metadata, no more genetlink]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] mpls: add VPLS entry points
  2017-08-21 15:55     ` David Lamparter
@ 2017-08-21 16:13       ` Amine Kherbouche
  0 siblings, 0 replies; 26+ messages in thread
From: Amine Kherbouche @ 2017-08-21 16:13 UTC (permalink / raw)
  To: David Lamparter; +Cc: netdev, roopa



On 08/21/2017 05:55 PM, David Lamparter wrote:
>>> +	if (rt->rt_payload_type == MPT_VPLS)
>>> > > +		return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);
>> >
>> > you should get the ret value of vpls_rcv() and increment stats if error
>> > occurs.
> An error in vpls_rcv() is not a receive error on the outer device's MPLS
> layer;  the packet was received correctly (and counted for that at the
> beginning of mpls_forward()) and dispatched onto an appropriately
> configured VPLS device.  vpls_rcv() counts its own stats on the inner
> device.

Right.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] mpls: VPLS support
  2017-08-21 15:14   ` Amine Kherbouche
@ 2017-08-21 16:18     ` David Lamparter
  0 siblings, 0 replies; 26+ messages in thread
From: David Lamparter @ 2017-08-21 16:18 UTC (permalink / raw)
  To: Amine Kherbouche; +Cc: David Lamparter, netdev, roopa

On Mon, Aug 21, 2017 at 05:14:46PM +0200, Amine Kherbouche wrote:
> > +#include <net/ip_tunnels.h>
> 
> some headers are not needed.

Which ones?  net/ip_tunnels.h is needed for ip_tunnel_get_stats64().

> > +static netdev_tx_t vpls_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > +	int err = -EINVAL, ok_count = 0;
> > +	struct vpls_priv *priv = netdev_priv(dev);
> > +	struct vpls_info *vi;
> > +	struct pcpu_sw_netstats *stats;
> > +	size_t len = skb->len;
> > +
> > +	rcu_read_lock();
> > +	vi = skb_vpls_info(skb);
> > +
> > +	skb_orphan(skb);
> > +	skb_forward_csum(skb);
> > +
> > +	if (vi) {
> > +		err = vpls_xmit_wire(skb, dev, priv, vi->pw_label);
> > +		if (err)
> > +			goto out_err;
> > +	} else {
> 
> the rcu_read_lock() is just needed for the else statement right ?

Yeah, it's leftover from an earlier version where it was used for what
is now skb_vpls_info().  Fixed in next version.

> > +int vpls_rcv(struct sk_buff *skb, struct net_device *in_dev,
> > +	     struct packet_type *pt, struct mpls_route *rt,
> > +	     struct mpls_shim_hdr *hdr, struct net_device *orig_dev)
> > +{
> > +	struct net_device *dev = rt->rt_vpls_dev;
> > +	struct mpls_entry_decoded dec;
[...]
> > +	dec = mpls_entry_decode(hdr);
> 
> we can get dec.bos from mpls stack as a function param, no need to 
> decode the mpls header again.

I didn't do that for 2 reasons:
- mpls_entry_decode is a very small inlined function
- for future OAM GAL label support, it will need to decode a 2nd label

> > +static const struct net_device_ops vpls_netdev_ops = {
> > +	.ndo_init		= vpls_dev_init,
> > +	.ndo_open		= vpls_open,
> > +	.ndo_stop		= vpls_close,
> > +	.ndo_start_xmit		= vpls_xmit,
> > +	.ndo_change_mtu		= vpls_change_mtu,
> > +	.ndo_get_stats64	= ip_tunnel_get_stats64,
> > +	.ndo_set_rx_mode	= vpls_set_multicast_list,
> > +	.ndo_set_mac_address	= eth_mac_addr,
> > +	.ndo_features_check	= passthru_features_check,
> > +};
> 
> can you add .ndo_get_stats64 function ?

It uses the same stats as ip tunnels:
+	.ndo_get_stats64	= ip_tunnel_get_stats64,

I don't think copypasting improves code quality here(???)

Cheers,


-David

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-08-21 16:18 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-16 17:01 [RFC net-next] VPLS support David Lamparter
2017-08-16 17:01 ` [PATCH 1/6] bridge: learn dst metadata in FDB David Lamparter
2017-08-16 20:38   ` Nikolay Aleksandrov
2017-08-17 11:03     ` David Lamparter
2017-08-17 11:39       ` Nikolay Aleksandrov
2017-08-17 11:51         ` Nikolay Aleksandrov
2017-08-17 12:10           ` David Lamparter
2017-08-17 12:19             ` Nikolay Aleksandrov
2017-08-17 12:20             ` David Lamparter
2017-08-17 12:45         ` David Lamparter
2017-08-17 13:04           ` Nikolay Aleksandrov
2017-08-17 16:16     ` David Lamparter
2017-08-16 17:01 ` [PATCH 2/6] mpls: split forwarding path on rx/tx boundary David Lamparter
2017-08-19 17:10   ` kbuild test robot
2017-08-19 17:42   ` kbuild test robot
2017-08-16 17:01 ` [PATCH 3/6] mpls: add VPLS entry points David Lamparter
2017-08-19 18:27   ` kbuild test robot
2017-08-21 14:01   ` Amine Kherbouche
2017-08-21 15:55     ` David Lamparter
2017-08-21 16:13       ` Amine Kherbouche
2017-08-16 17:02 ` [PATCH 4/6] mpls: VPLS support David Lamparter
2017-08-21 15:14   ` Amine Kherbouche
2017-08-21 16:18     ` David Lamparter
2017-08-21 16:11   ` Amine Kherbouche
2017-08-16 17:02 ` [PATCH 5/6] bridge: add VPLS pseudowire info in fdb dump David Lamparter
2017-08-16 17:02 ` [PATCH 6/6] mpls: pseudowire control word support David Lamparter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).