* [PATCH net-next 0/8] Basic MPLS support
@ 2015-02-25 17:09 Eric W. Biederman
  2015-02-25 17:13 ` [PATCH net-next 1/8] mpls: Refactor how the mpls module is built Eric W. Biederman
                   ` (10 more replies)
  0 siblings, 11 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


While trying to figure out what MPLS is and why MPLS support is not in
the kernel, on a lark I sat down and wrote an MPLS implementation so I
could answer those questions for myself.

From what I can tell the short answer is that MPLS is trivially simple
and that we don't have an in-kernel implementation because no one has
sat down and done the work to produce a good, mergeable implementation.

MPLS has its good sides and its bad sides but at the end of the day
MPLS has users, and having an in-kernel implementation should help us
understand MPLS and focus our conversations dealing with MPLS and
VRFs.

Having MPLS in our toolkit as the entire world begins playing with
overlay networks aka ``network virtualization'' to support VM and
container migration seems appropriate as MPLS is the historical solution
to this problem.

Constructive criticism about the netlink interface is especially
appreciated.  Hopefully we can have at least one protocol in the kernel
where the netlink interface doesn't have nasty corner cases.

As for Linux users: from the conversations I had at netdev01, this
sounds like a case of "if I build it, people will use the code".

Eric

Eric W. Biederman (8):
      mpls: Refactor how the mpls module is built
      mpls: Basic routing support
      mpls: Add a sysctl to control the size of the mpls label table
      mpls: Basic support for adding and removing routes
      mpls: Functions for reading and wrinting mpls labels over netlink
      mpls: Netlink commands to add, remove, and dump routes
      mpls: Multicast route table change notifications
      ipmpls: Basic device for injecting packets into an mpls tunnel

 Documentation/networking/mpls-sysctl.txt |  20 +
 include/linux/socket.h                   |   2 +
 include/net/net_namespace.h              |   4 +
 include/net/netns/mpls.h                 |  17 +
 include/uapi/linux/if_arp.h              |   1 +
 include/uapi/linux/rtnetlink.h           |   4 +
 net/Makefile                             |   2 +-
 net/mpls/Kconfig                         |  28 +-
 net/mpls/Makefile                        |   2 +
 net/mpls/af_mpls.c                       | 919 +++++++++++++++++++++++++++++++
 net/mpls/internal.h                      |  59 ++
 net/mpls/ipmpls.c                        | 219 ++++++++
 12 files changed, 1275 insertions(+), 2 deletions(-)


* [PATCH net-next 1/8] mpls: Refactor how the mpls module is built
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
@ 2015-02-25 17:13 ` Eric W. Biederman
  2015-02-26  2:05   ` Simon Horman
  2015-02-25 17:14 ` [PATCH net-next 2/8] mpls: Basic routing support Eric W. Biederman
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


This refactoring is needed to allow more than just mpls gso support to
be built into the mpls module.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/Makefile     |  2 +-
 net/mpls/Kconfig | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/net/Makefile b/net/Makefile
index 38704bdf941a..3995613e5510 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -69,7 +69,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
-obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
+obj-$(CONFIG_MPLS)		+= mpls/
 obj-$(CONFIG_HSR)		+= hsr/
 ifneq ($(CONFIG_NET_SWITCHDEV),)
 obj-y				+= switchdev/
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 37421db88965..a77fbcdd04ee 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -1,9 +1,25 @@
 #
 # MPLS configuration
 #
+
+menuconfig MPLS
+	tristate "MultiProtocol Label Switching"
+	default n
+	---help---
+	  MultiProtocol Label Switching routes packets through logical
+	  circuits.  Originally conceived as a way of routing packets at
+	  hardware speeds (before hardware was capable of routing ipv4 packets),
+	  MPLS remains a simple way of making tunnels.
+
+	  If you have not heard of MPLS you probably want to say N here.
+
+if MPLS
+
 config NET_MPLS_GSO
-	tristate "MPLS: GSO support"
+	bool "MPLS: GSO support"
 	help
 	 This is helper module to allow segmentation of non-MPLS GSO packets
 	 that have had MPLS stack entries pushed onto them and thus
 	 become MPLS GSO packets.
+
+endif # MPLS
-- 
2.2.1


* [PATCH net-next 2/8] mpls: Basic routing support
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
  2015-02-25 17:13 ` [PATCH net-next 1/8] mpls: Refactor how the mpls module is built Eric W. Biederman
@ 2015-02-25 17:14 ` Eric W. Biederman
  2015-02-25 17:15 ` [PATCH net-next 3/8] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


This change adds a new Kconfig option MPLS_ROUTING.

The core of this change is the code to take an mpls packet
received from another machine, look that packet up in a
routing table, and forward the packet on.

Support of MPLS over ATM is not considered or attempted here.
This implementation follows RFC3032 and implements the MPLS
shim header that can pass over essentially any network.

What RFC3031 refers to as the Incoming Label Map (ILM)
I call net->mpls.platform_label[].  What RFC3031 refers to
as the Next Hop Label Forwarding Entry (NHLFE) I call mpls_route.
Though calling it the label forwarding information base (lfib)
might also be valid.

Further the implementation forwards packets as described in RFC3032.
There is no need, and given the original motivation for MPLS a strong
disincentive, to have a flexible label forwarding path.  In essence
the logic is that the topmost label is read, looked up, removed, and
replaced by 0 or more new labels, and the packet is then sent out the
specified interface to its next hop.
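
The forwarding step described above can be sketched in userspace C
roughly as follows.  This is only an illustration under stated
assumptions: the sk_* names, the fixed-size label stack, and the plain
pointer table are hypothetical and are not part of the patch, and TTL
handling, MTU checks, and RCU are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_MAX_NEW_LABELS 2	/* same limit as MAX_NEW_LABELS in the patch */
#define SK_MAX_STACK      8	/* arbitrary depth limit for this sketch */

/* Hypothetical stand-in for a route: the labels to push on output */
struct sk_route {
	uint8_t  nlabels;
	uint32_t label[SK_MAX_NEW_LABELS];
};

/* Hypothetical label stack in host byte order; label[0] is the top */
struct sk_stack {
	size_t   depth;
	uint32_t label[SK_MAX_STACK];
};

/* Read the top label, look it up, remove it, and replace it with the
 * route's 0 or more new labels.  Returns 0 on success, -1 for "drop".
 */
static int sk_forward(struct sk_stack *st,
		      const struct sk_route *const *table, size_t table_len)
{
	const struct sk_route *rt;
	uint32_t top;
	size_t rest;

	if (st->depth == 0)
		return -1;

	top = st->label[0];
	if (top >= table_len || !table[top])
		return -1;
	rt = table[top];

	rest = st->depth - 1;
	if (rest + rt->nlabels > SK_MAX_STACK)
		return -1;

	/* Shift the remaining labels, then write the new ones on top */
	memmove(&st->label[rt->nlabels], &st->label[1],
		rest * sizeof(st->label[0]));
	memcpy(&st->label[0], rt->label,
	       rt->nlabels * sizeof(st->label[0]));
	st->depth = rest + rt->nlabels;
	return 0;
}
```

A route with nlabels == 1 behaves as a label swap, and a route with
nlabels == 0 simply pops the top label.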

Quite a few optional features are not implemented here.  Among them
are generation of ICMP errors when the TTL is exceeded or the packet
is larger than the next hop MTU (those conditions are detected and the
packets are dropped instead of generating an icmp error).  The traffic
class field is always set to 0.  The implementation focuses on IP over
MPLS and does not handle egress of other kinds of protocols.

Instead of implementing coordination with the neighbour table and
sorting out how to input next hops in a different address family (for
which there is value), I was lazy and implemented a next hop mac
address instead.  The code is simpler and there are flavors of MPLS
such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
appropriate, so a next hop by mac address would need to be implemented
at some point.

Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.

Decoding the mpls header must be done by first byteswapping a 32bit big
endian word into the local cpu endian and then bit shifting to extract
the pieces.  There is no C bit-field that can represent a wire format
mpls header on a little endian machine as the low bits of the 20bit
label wind up in the wrong half of the third byte.  Therefore internally
everything is dealt with in cpu native byte order except when writing
to and reading from a packet.
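
As a rough userspace illustration of the byteswap-then-shift approach,
the sketch below encodes and decodes one label stack entry using the
RFC3032 field layout (20 bit label, 3 bit traffic class, 1 bit bottom
of stack, 8 bit TTL).  The lse_* names and LSE_* constants are
hypothetical and local to this example; the patch itself uses the
kernel's MPLS_LS_* macros and cpu_to_be32/be32_to_cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl, ntohl */

/* Bit positions within the host byte order 32bit word (RFC3032) */
#define LSE_LABEL_SHIFT	12
#define LSE_TC_SHIFT	9
#define LSE_S_SHIFT	8
#define LSE_TTL_SHIFT	0

struct lse_decoded {
	uint32_t label;
	uint8_t  ttl;
	uint8_t  tc;
	uint8_t  bos;
};

/* Build the entry in host byte order, then convert to wire order */
static uint32_t lse_encode(uint32_t label, uint8_t ttl, uint8_t tc, bool bos)
{
	uint32_t host;

	assert(label <= 0xFFFFF && tc <= 7);
	host = (label << LSE_LABEL_SHIFT) |
	       ((uint32_t)tc << LSE_TC_SHIFT) |
	       ((bos ? 1u : 0u) << LSE_S_SHIFT) |
	       ((uint32_t)ttl << LSE_TTL_SHIFT);
	return htonl(host);
}

/* Convert from wire order first, then shift and mask out each field */
static struct lse_decoded lse_decode(uint32_t wire)
{
	uint32_t host = ntohl(wire);
	struct lse_decoded d;

	d.label = (host >> LSE_LABEL_SHIFT) & 0xFFFFF;
	d.tc    = (host >> LSE_TC_SHIFT) & 0x7;
	d.bos   = (host >> LSE_S_SHIFT) & 0x1;
	d.ttl   = (host >> LSE_TTL_SHIFT) & 0xFF;
	return d;
}
```

For label 100 with TTL 64 and the bottom of stack bit set, the wire
bytes are 00 06 41 40: the low four bits of the label share the third
byte with the traffic class and bottom of stack bits, which is why a
little endian bit-field cannot describe the layout.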

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/socket.h      |   2 +
 include/net/net_namespace.h |   4 +
 include/net/netns/mpls.h    |  15 ++
 net/mpls/Kconfig            |   5 +
 net/mpls/Makefile           |   1 +
 net/mpls/af_mpls.c          | 336 ++++++++++++++++++++++++++++++++++++++++++++
 net/mpls/internal.h         |  56 ++++++++
 7 files changed, 419 insertions(+)
 create mode 100644 include/net/netns/mpls.h
 create mode 100644 net/mpls/af_mpls.c
 create mode 100644 net/mpls/internal.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5c19cba34dce..fab4d0ddf4ed 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -181,6 +181,7 @@ struct ucred {
 #define AF_WANPIPE	25	/* Wanpipe API Sockets */
 #define AF_LLC		26	/* Linux LLC			*/
 #define AF_IB		27	/* Native InfiniBand address	*/
+#define AF_MPLS		28	/* MPLS */
 #define AF_CAN		29	/* Controller Area Network      */
 #define AF_TIPC		30	/* TIPC sockets			*/
 #define AF_BLUETOOTH	31	/* Bluetooth sockets 		*/
@@ -226,6 +227,7 @@ struct ucred {
 #define PF_WANPIPE	AF_WANPIPE
 #define PF_LLC		AF_LLC
 #define PF_IB		AF_IB
+#define PF_MPLS		AF_MPLS
 #define PF_CAN		AF_CAN
 #define PF_TIPC		AF_TIPC
 #define PF_BLUETOOTH	AF_BLUETOOTH
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 36faf4990c4b..2cb9acb618e9 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -26,6 +26,7 @@
 #endif
 #include <net/netns/nftables.h>
 #include <net/netns/xfrm.h>
+#include <net/netns/mpls.h>
 #include <linux/ns_common.h>
 
 struct user_namespace;
@@ -130,6 +131,9 @@ struct net {
 #if IS_ENABLED(CONFIG_IP_VS)
 	struct netns_ipvs	*ipvs;
 #endif
+#if IS_ENABLED(CONFIG_MPLS)
+	struct netns_mpls	mpls;
+#endif
 	struct sock		*diag_nlsk;
 	atomic_t		fnhe_genid;
 };
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
new file mode 100644
index 000000000000..f90aaf8d4f89
--- /dev/null
+++ b/include/net/netns/mpls.h
@@ -0,0 +1,15 @@
+/*
+ * mpls in net namespaces
+ */
+
+#ifndef __NETNS_MPLS_H__
+#define __NETNS_MPLS_H__
+
+struct mpls_route;
+
+struct netns_mpls {
+	size_t platform_labels;
+	struct mpls_route __rcu * __rcu *platform_label;
+};
+
+#endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index a77fbcdd04ee..f4286ee7e2b0 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -22,4 +22,9 @@ config NET_MPLS_GSO
 	 that have had MPLS stack entries pushed onto them and thus
 	 become MPLS GSO packets.
 
+config MPLS_ROUTING
+	bool "MPLS: routing support"
+	help
+	 Add support for forwarding of mpls packets.
+
 endif # MPLS
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 6dec088c2d0f..60af15f1960e 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -2,3 +2,4 @@
 # Makefile for MPLS.
 #
 obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
+obj-$(CONFIG_MPLS_ROUTING) += af_mpls.o
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
new file mode 100644
index 000000000000..c84c8057d3df
--- /dev/null
+++ b/net/mpls/af_mpls.c
@@ -0,0 +1,336 @@
+#include <linux/types.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <linux/module.h>
+#include <linux/if_arp.h>
+#include <linux/ipv6.h>
+#include <linux/mpls.h>
+#include <net/ip.h>
+#include <net/dst.h>
+#include <net/sock.h>
+#include <net/arp.h>
+#include <net/ip_fib.h>
+#include <net/netevent.h>
+#include <net/netns/generic.h>
+#include "internal.h"
+
+#define MAX_NEW_LABELS 2
+
+/* This maximum ha length copied from the definition of struct neighbour */
+#define MAX_HA_LEN (ALIGN(MAX_ADDR_LEN, sizeof(unsigned long)))
+
+struct mpls_route { /* next hop label forwarding entry */
+	struct net_device 	*rt_dev;
+	unsigned char		rt_ha[MAX_HA_LEN];
+	u8			rt_protocol; /* routing protocol that set this entry */
+	u8			rt_labels;
+	u32			rt_label[MAX_NEW_LABELS];
+	struct rcu_head		rt_rcu;
+};
+
+static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
+{
+	struct mpls_route *rt = NULL;
+
+	if (index < net->mpls.platform_labels) {
+		struct mpls_route __rcu **platform_label =
+			rcu_dereference(net->mpls.platform_label);
+		rt = rcu_dereference(platform_label[index]);
+	}
+	return rt;
+}
+
+static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
+{
+	/* The size of the layer 2.5 labels to be added for this route */
+	return rt->rt_labels * sizeof(struct mpls_shim_hdr);
+}
+
+static unsigned int mpls_rt_mtu(const struct mpls_route *rt)
+{
+	/* The amount of data the layer 2 frame can hold */
+	return rt->rt_dev->mtu;
+}
+
+static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+{
+	if (skb->len <= mtu)
+		return false;
+
+	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+		return false;
+
+	return true;
+}
+
+static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
+			struct mpls_entry_decoded dec)
+{
+	/* RFC4385 and RFC5586 encode other packets in mpls such that
+	 * they don't conflict with the ip version number, making
+	 * decoding by examining the ip version correct in everything
+	 * except for the strangest cases.
+	 *
+	 * The strange cases if we choose to support them will require
+	 * manual configuration.
+	 */
+	struct iphdr *hdr4 = ip_hdr(skb);
+	bool success = true;
+
+	if (hdr4->version == 4) {
+		skb->protocol = htons(ETH_P_IP);
+		csum_replace2(&hdr4->check,
+			      htons(hdr4->ttl << 8),
+			      htons(dec.ttl << 8));
+		hdr4->ttl = dec.ttl;
+	}
+	else if (hdr4->version == 6) {
+		struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+		skb->protocol = htons(ETH_P_IPV6);
+		hdr6->hop_limit = dec.ttl;
+	}
+	else
+		/* version 0 and version 1 are used by pseudo wires */
+		success = false;
+	return success;
+}
+
+static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
+			struct packet_type *pt, struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+	struct mpls_shim_hdr *hdr;
+	struct mpls_route *rt;
+	struct mpls_entry_decoded dec;
+	struct net_device *out_dev;
+	unsigned int hh_len;
+	unsigned int new_header_size;
+	unsigned int mtu;
+	int err;
+
+	/* Careful this entire function runs inside of an rcu critical section */
+
+	if (skb->pkt_type != PACKET_HOST)
+		goto drop;
+
+	if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
+		goto drop;
+
+	if (!pskb_may_pull(skb, sizeof(*hdr)))
+		goto drop;
+
+	/* Read and decode the label */
+	hdr = mpls_hdr(skb);
+	dec = mpls_entry_decode(hdr);
+
+	/* Pop the label */
+	skb_pull(skb, sizeof(*hdr));
+	skb_reset_network_header(skb);
+
+	skb_orphan(skb);
+
+	rt = mpls_route_input_rcu(net, dec.label);
+	if (!rt)
+		goto drop;
+
+	if (skb_warn_if_lro(skb))
+		goto drop;
+
+	skb_forward_csum(skb);
+
+	/* Verify ttl is valid */
+	if (dec.ttl <= 2)
+		goto drop;
+	dec.ttl -= 1;
+
+	/* Verify the destination can hold the packet */
+	new_header_size = mpls_rt_header_size(rt);
+	mtu = mpls_rt_mtu(rt);
+	if (mpls_pkt_too_big(skb, mtu - new_header_size))
+		goto drop;
+
+	out_dev = rt->rt_dev;
+	hh_len = LL_RESERVED_SPACE(out_dev);
+	if (!out_dev->header_ops)
+		hh_len = 0;
+
+	/* Ensure there is enough space for the headers in the skb */
+	if (skb_cow(skb, hh_len + new_header_size))
+		goto drop;
+
+	skb->dev = out_dev;
+	skb->protocol = htons(ETH_P_MPLS_UC);
+
+	if (unlikely(!new_header_size && dec.bos)) {
+		/* Penultimate hop popping */
+		if (!mpls_egress(rt, skb, dec))
+			goto drop;
+	} else {
+		bool bos;
+		int i;
+		skb_push(skb, new_header_size);
+		skb_reset_network_header(skb);
+		/* Push the new labels */
+		hdr = mpls_hdr(skb);
+		bos = dec.bos;
+		for (i = rt->rt_labels - 1; i >= 0; i--) {
+			hdr[i] = mpls_entry_encode(rt->rt_label[i], dec.ttl, 0, bos);
+			bos = false;
+		}
+	}
+
+	err = dev_hard_header(skb, out_dev, ntohs(skb->protocol),
+				rt->rt_ha, NULL, skb->len);
+	if (err >= 0)
+		err = dev_queue_xmit(skb);
+	if (err) {
+		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
+				    __func__, err);
+		goto drop;
+	}
+	return 0;
+
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+
+static struct packet_type mpls_packet_type __read_mostly = {
+	.type = cpu_to_be16(ETH_P_MPLS_UC),
+	.func = mpls_forward,
+};
+
+static struct mpls_route *mpls_rt_alloc(void)
+{
+	struct mpls_route *rt;
+
+	rt = kzalloc(sizeof(*rt), GFP_KERNEL);
+	return rt;
+}
+
+static void mpls_rt_free(struct mpls_route *rt)
+{
+	if (rt)
+		kfree_rcu(rt, rt_rcu);
+}
+
+static void mpls_route_update(struct net *net, unsigned index,
+			      struct net_device *dev, struct mpls_route *new,
+			      const struct nl_info *info)
+{
+	struct mpls_route *rt, *old = NULL;
+
+	ASSERT_RTNL();
+
+	rt = net->mpls.platform_label[index];
+	if (!dev || (rt && (rt->rt_dev == dev))) {
+		rcu_assign_pointer(net->mpls.platform_label[index], new);
+		old = rt;
+	}
+
+	/* If we removed a route free it now */
+	mpls_rt_free(old);
+}
+
+static void mpls_ifdown(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	unsigned index;
+
+	for (index = 16; index < net->mpls.platform_labels; index++)
+		mpls_route_update(net, index, dev, NULL, NULL);
+}
+
+static int mpls_dev_notify(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	switch(event) {
+	case NETDEV_DOWN:
+	case NETDEV_UNREGISTER:
+		mpls_ifdown(dev);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block mpls_dev_notifier = {
+	.notifier_call = mpls_dev_notify,
+};
+
+static int mpls_net_init(struct net *net)
+{
+	net->mpls.platform_labels = 0;
+	net->mpls.platform_label = NULL;
+
+	return 0;
+}
+
+static void mpls_net_exit(struct net *net)
+{
+	unsigned int index;
+
+	/* An rcu grace period has elapsed since there was a device in
+	 * the network namespace (and thus the last in flight packet)
+	 * left this network namespace.  This is because
+	 * unregister_netdevice_many and netdev_run_todo has completed
+	 * for each network device that was in this network namespace.
+	 *
+	 * As such no additional rcu synchronization is necessary when
+	 * freeing the platform_label table.
+	 */
+	rtnl_lock();
+	for (index = 0; index < net->mpls.platform_labels; index++) {
+		struct mpls_route *rt = net->mpls.platform_label[index];
+		rcu_assign_pointer(net->mpls.platform_label[index], NULL);
+		mpls_rt_free(rt);
+	}
+	rtnl_unlock();
+
+	kvfree(net->mpls.platform_label);
+}
+
+static struct pernet_operations mpls_net_ops = {
+	.init = mpls_net_init,
+	.exit = mpls_net_exit,
+};
+
+static int __init mpls_init(void)
+{
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct mpls_shim_hdr) != 4);
+
+	err = register_pernet_subsys(&mpls_net_ops);
+	if (err)
+		goto out;
+
+	err = register_netdevice_notifier(&mpls_dev_notifier);
+	if (err)
+		goto out_unregister_pernet;
+
+	dev_add_pack(&mpls_packet_type);
+
+	err = 0;
+out:
+	return err;
+
+out_unregister_pernet:
+	unregister_pernet_subsys(&mpls_net_ops);
+	goto out;
+}
+module_init(mpls_init);
+
+static void __exit mpls_exit(void)
+{
+	dev_remove_pack(&mpls_packet_type);
+	unregister_netdevice_notifier(&mpls_dev_notifier);
+	unregister_pernet_subsys(&mpls_net_ops);
+}
+module_exit(mpls_exit);
+
+MODULE_DESCRIPTION("MultiProtocol Label Switching");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_NETPROTO(PF_MPLS);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
new file mode 100644
index 000000000000..c2944cb84d48
--- /dev/null
+++ b/net/mpls/internal.h
@@ -0,0 +1,56 @@
+#ifndef MPLS_INTERNAL_H
+#define MPLS_INTERNAL_H
+
+#define LABEL_IPV4_EXPLICIT_NULL	0 /* RFC3032 */
+#define LABEL_ROUTER_ALERT_LABEL	1 /* RFC3032 */
+#define LABEL_IPV6_EXPLICIT_NULL	2 /* RFC3032 */
+#define LABEL_IMPLICIT_NULL		3 /* RFC3032 */
+#define LABEL_ENTROPY_INDICATOR		7 /* RFC6790 */
+#define LABEL_GAL			13 /* RFC5586 */
+#define LABEL_OAM_ALERT			14 /* RFC3429 */
+#define LABEL_EXTENSION			15 /* RFC7274 */
+
+
+struct mpls_shim_hdr {
+	__be32 label_stack_entry;
+};
+
+struct mpls_entry_decoded {
+	u32 label;
+	u8 ttl;
+	u8 tc;
+	u8 bos;
+};
+
+struct sk_buff;
+
+static inline struct mpls_shim_hdr *mpls_hdr(const struct sk_buff *skb)
+{
+	return (struct mpls_shim_hdr *)skb_network_header(skb);
+}
+
+static inline struct mpls_shim_hdr mpls_entry_encode(u32 label, unsigned ttl, unsigned tc, bool bos)
+{
+	struct mpls_shim_hdr result;
+	result.label_stack_entry =
+		cpu_to_be32((label << MPLS_LS_LABEL_SHIFT) |
+			    (tc << MPLS_LS_TC_SHIFT) |
+			    (bos ? (1 << MPLS_LS_S_SHIFT) : 0) |
+			    (ttl << MPLS_LS_TTL_SHIFT));
+	return result;
+}
+
+static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *hdr)
+{
+	struct mpls_entry_decoded result;
+	unsigned entry = be32_to_cpu(hdr->label_stack_entry);
+
+	result.label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;
+	result.ttl = (entry & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT;
+	result.tc =  (entry & MPLS_LS_TC_MASK) >> MPLS_LS_TC_SHIFT;
+	result.bos = (entry & MPLS_LS_S_MASK) >> MPLS_LS_S_SHIFT;
+
+	return result;
+}
+
+#endif /* MPLS_INTERNAL_H */
-- 
2.2.1


* [PATCH net-next 3/8] mpls: Add a sysctl to control the size of the mpls label table
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
  2015-02-25 17:13 ` [PATCH net-next 1/8] mpls: Refactor how the mpls module is built Eric W. Biederman
  2015-02-25 17:14 ` [PATCH net-next 2/8] mpls: Basic routing support Eric W. Biederman
@ 2015-02-25 17:15 ` Eric W. Biederman
  2015-02-25 17:16 ` [PATCH net-next 4/8] mpls: Basic support for adding and removing routes Eric W. Biederman
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


This sysctl gives two benefits.  By defaulting the table size to 0,
mpls, even when compiled in and enabled, defaults to not forwarding
any packets.  This prevents unpleasant surprises for users.

The other benefit is that, as mpls labels are allocated locally, a
small dense label table may be used, which saves memory and is
extremely simple and efficient to implement.

This sysctl allows userspace to choose the restrictions on the label
table size that userspace applications need to cope with.
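
To make the dense table idea concrete, here is a minimal userspace
sketch of a resizable label table indexed directly by label value.  It
is an assumption-laden illustration, not the patch's code: sk_table and
its helpers are hypothetical names, entries are opaque pointers the
table does not own, and there is no locking or RCU grace period.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical dense table: slot[label] is the entry for that label */
struct sk_table {
	size_t limit;	/* number of slots, 0 means nothing is forwarded */
	void **slot;
};

/* Resize to new_limit slots, keeping the entries that still fit.
 * Returns 0 on success and -1 if the allocation fails.
 */
static int sk_table_resize(struct sk_table *t, size_t new_limit)
{
	void **slots = NULL;
	size_t keep;

	if (new_limit) {
		slots = calloc(new_limit, sizeof(*slots));
		if (!slots)
			return -1;
	}

	keep = t->limit < new_limit ? t->limit : new_limit;
	if (keep)
		memcpy(slots, t->slot, keep * sizeof(*slots));

	free(t->slot);
	t->slot = slots;
	t->limit = new_limit;
	return 0;
}

/* Lookup is a bounds check plus one array index */
static void *sk_table_lookup(const struct sk_table *t, size_t label)
{
	return label < t->limit ? t->slot[label] : NULL;
}
```

With a limit of 0 every lookup misses, which corresponds to the
default of not forwarding any packets.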

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/networking/mpls-sysctl.txt |  20 +++++
 include/net/netns/mpls.h                 |   2 +
 net/mpls/af_mpls.c                       | 140 +++++++++++++++++++++++++++++++
 3 files changed, 162 insertions(+)
 create mode 100644 Documentation/networking/mpls-sysctl.txt

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
new file mode 100644
index 000000000000..639ddf0ece9b
--- /dev/null
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -0,0 +1,20 @@
+/proc/sys/net/mpls/* Variables:
+
+platform_labels - INTEGER
+	Number of entries in the platform label table.  It is not
+	possible to configure forwarding for label values equal to or
+	greater than the number of platform labels.
+
+	A dense utilization of the entries in the platform label table
+	is possible and expected as the platform labels are locally
+	allocated.
+
+	If the number of platform label table entries is set to 0 no
+	label will be recognized by the kernel and mpls forwarding
+	will be disabled.
+
+	Reducing this value will remove all label routing entries that
+	no longer fit in the table.
+
+	Possible values: 0 - 1048575
+	Default: 0
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index f90aaf8d4f89..d29203651c01 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -6,10 +6,12 @@
 #define __NETNS_MPLS_H__
 
 struct mpls_route;
+struct ctl_table_header;
 
 struct netns_mpls {
 	size_t platform_labels;
 	struct mpls_route __rcu * __rcu *platform_label;
+	struct ctl_table_header *ctl;
 };
 
 #endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c84c8057d3df..d49a54ea288e 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1,6 +1,7 @@
 #include <linux/types.h>
 #include <linux/skbuff.h>
 #include <linux/socket.h>
+#include <linux/sysctl.h>
 #include <linux/net.h>
 #include <linux/module.h>
 #include <linux/if_arp.h>
@@ -29,6 +30,9 @@ struct mpls_route { /* next hop label forwarding entry */
 	struct rcu_head		rt_rcu;
 };
 
+static int zero = 0;
+static int label_limit = (1 << 20) - 1;
+
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
@@ -260,18 +264,154 @@ static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+static int resize_platform_label_table(struct net *net, size_t limit)
+{
+	size_t size = sizeof(struct mpls_route *) * limit;
+	size_t old_limit;
+	size_t cp_size;
+	struct mpls_route __rcu **labels = NULL, **old;
+	struct mpls_route *rt0 = NULL, *rt2 = NULL;
+	unsigned index;
+
+	if (size) {
+		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (!labels)
+			labels = vzalloc(size);
+
+		if (!labels)
+			goto nolabels;
+	}
+
+	/* In case the predefined labels need to be populated */
+	if (limit > LABEL_IPV4_EXPLICIT_NULL) {
+		rt0 = mpls_rt_alloc();
+		if (!rt0)
+			goto nort0;
+		rt0->rt_dev = net->loopback_dev;
+		rt0->rt_protocol = RTPROT_KERNEL;
+	}
+	if (limit > LABEL_IPV6_EXPLICIT_NULL) {
+		rt2 = mpls_rt_alloc();
+		if (!rt2)
+			goto nort2;
+		rt2->rt_dev = net->loopback_dev;
+		rt2->rt_protocol = RTPROT_KERNEL;
+	}
+
+	rtnl_lock();
+	/* Remember the original table */
+	old = net->mpls.platform_label;
+	old_limit = net->mpls.platform_labels;
+
+	/* Free any labels beyond the new table */
+	for (index = limit; index < old_limit; index++)
+		mpls_route_update(net, index, NULL, NULL, NULL);
+
+	/* Copy over the old labels */
+	cp_size = size;
+	if (old_limit < limit)
+		cp_size = old_limit * sizeof(struct mpls_route *);
+
+	memcpy(labels, old, cp_size);
+
+	/* If needed set the predefined labels */
+	if ((old_limit <= LABEL_IPV6_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV6_EXPLICIT_NULL)) {
+		labels[LABEL_IPV6_EXPLICIT_NULL] = rt2;
+		rt2 = NULL;
+	}
+
+	if ((old_limit <= LABEL_IPV4_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV4_EXPLICIT_NULL)) {
+		labels[LABEL_IPV4_EXPLICIT_NULL] = rt0;
+		rt0 = NULL;
+	}
+
+	/* Update the global pointers */
+	net->mpls.platform_labels = limit;
+	net->mpls.platform_label = labels;
+
+	rtnl_unlock();
+
+	mpls_rt_free(rt2);
+	mpls_rt_free(rt0);
+
+	if (old) {
+		synchronize_rcu();
+		kvfree(old);
+	}
+	return 0;
+
+nort2:
+	mpls_rt_free(rt0);
+nort0:
+	kvfree(labels);
+nolabels:
+	return -ENOMEM;
+}
+
+static int mpls_platform_labels(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = table->data;
+	int platform_labels = net->mpls.platform_labels;
+	int ret;
+	struct ctl_table tmp = {
+		.procname	= table->procname,
+		.data		= &platform_labels,
+		.maxlen		= sizeof(int),
+		.mode		= table->mode,
+		.extra1		= &zero,
+		.extra2		= &label_limit,
+	};
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0)
+		ret = resize_platform_label_table(net, platform_labels);
+
+	return ret;
+}
+
+static struct ctl_table mpls_table[] = {
+	{
+		.procname	= "platform_labels",
+		.data		= NULL,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= mpls_platform_labels,
+	},
+	{ }
+};
+
 static int mpls_net_init(struct net *net)
 {
+	struct ctl_table *table;
+
 	net->mpls.platform_labels = 0;
 	net->mpls.platform_label = NULL;
 
+	table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL);
+	if (table == NULL)
+		return -ENOMEM;
+
+	table[0].data = net;
+	net->mpls.ctl = register_net_sysctl(net, "net/mpls", table);
+	if (net->mpls.ctl == NULL)
+		return -ENOMEM;
+
 	return 0;
 }
 
 static void mpls_net_exit(struct net *net)
 {
+	struct ctl_table *table;
 	unsigned int index;
 
+	table = net->mpls.ctl->ctl_table_arg;
+	unregister_net_sysctl_table(net->mpls.ctl);
+	kfree(table);
+
 	/* An rcu grace period has elapsed since there was a device in
 	 * the network namespace (and thus the last in flight packet)
 	 * left this network namespace.  This is because
-- 
2.2.1


* [PATCH net-next 4/8] mpls: Basic support for adding and removing routes
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (2 preceding siblings ...)
  2015-02-25 17:15 ` [PATCH net-next 3/8] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
@ 2015-02-25 17:16 ` Eric W. Biederman
  2015-02-25 17:16 ` [PATCH net-next 5/8] mpls: Functions for reading and wrinting mpls labels over netlink Eric W. Biederman
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


mpls_route_add and mpls_route_del implement the basic logic for adding
and removing Next Hop Label Forwarding Entries from the MPLS input
label map.  The addition and subtraction is done in a way that is
consistent with how the existing routing tables in Linux are
maintained, hence all of the work to deal with NLM_F_APPEND,
NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
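
The flag handling can be summarised by the small userspace sketch
below, which mirrors the order of the checks in mpls_route_add.  The
SK_NLM_F_* constants are local copies carrying the same values as the
netlink NLM_F_* flags, and sk_check_add_flags is a hypothetical helper
that does not exist in the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Local copies of the netlink "new" request modifier flag values */
#define SK_NLM_F_REPLACE 0x100
#define SK_NLM_F_EXCL    0x200
#define SK_NLM_F_CREATE  0x400
#define SK_NLM_F_APPEND  0x800

/* Decide whether an add request may proceed, given whether a route
 * already exists for the label.  Returns 0 or a negative errno.
 */
static int sk_check_add_flags(unsigned int flags, bool exists)
{
	/* Append makes no sense with one route per label */
	if (flags & SK_NLM_F_APPEND)
		return -EINVAL;

	/* Exclusive create must not find an existing route */
	if ((flags & SK_NLM_F_EXCL) && exists)
		return -EEXIST;

	/* Without replace an existing route may not be overwritten */
	if (!(flags & SK_NLM_F_REPLACE) && exists)
		return -EEXIST;

	/* Without create a missing route may not be added */
	if (!(flags & SK_NLM_F_CREATE) && !exists)
		return -ENOENT;

	return 0;
}
```

So a plain create on an occupied label fails with -EEXIST, a replace
without create on an empty label fails with -ENOENT, and create plus
replace succeeds either way.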

Cases that are not clearly defined, such as changing the interpretation
of the mpls reserved labels, are not allowed.

Because it seems like the right thing to do, adding an MPLS route without
specifying an input label and allowing the kernel to pick a free label
table entry is supported.  The implementation is currently less than optimal
but that can be changed.
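
The label selection amounts to a linear scan for the first empty slot
above the reserved range, roughly as in the userspace sketch below.
The names are hypothetical stand-ins for find_free_label and
LABEL_NOT_SPECIFIED, and the table is a plain array of opaque pointers
rather than the per-namespace RCU table.

```c
#include <stddef.h>

#define SK_FIRST_UNRESERVED	16		/* labels 0-15 are reserved */
#define SK_LABEL_NOT_SPECIFIED	(1u << 20)	/* one past the 20 bit range */

/* Return the first free label at or above 16, or
 * SK_LABEL_NOT_SPECIFIED when the table has no free slot.
 */
static unsigned int sk_find_free_label(void *const *slot, size_t limit)
{
	size_t index;

	for (index = SK_FIRST_UNRESERVED; index < limit; index++) {
		if (!slot[index])
			return (unsigned int)index;
	}
	return SK_LABEL_NOT_SPECIFIED;
}
```

The scan is O(n) in the table size, which is the "less than optimal"
part; because the sentinel is larger than any valid table limit, a
caller's ordinary range check rejects it without a special case.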

As I don't have anything else to test with, ethernet and the loopback
device are the only two device types currently supported for forwarding
MPLS over.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index d49a54ea288e..6a9ef31e0129 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -16,6 +16,7 @@
 #include <net/netns/generic.h>
 #include "internal.h"
 
+#define LABEL_NOT_SPECIFIED (1<<20)
 #define MAX_NEW_LABELS 2
 
 /* This maximum ha length copied from the definition of struct neighbour */
@@ -205,6 +206,18 @@ static struct packet_type mpls_packet_type __read_mostly = {
 	.func = mpls_forward,
 };
 
+struct mpls_route_config {
+	u32		rc_protocol;
+	u32		rc_ifindex;
+	u32		rc_ha_len;
+	u8		rc_ha[MAX_HA_LEN];
+	u32		rc_label;
+	u32		rc_output_labels;
+	u32		rc_output_label[MAX_NEW_LABELS];
+	u32		rc_nlflags;
+	struct nl_info	rc_nlinfo;
+};
+
 static struct mpls_route *mpls_rt_alloc(void)
 {
 	struct mpls_route *rt;
@@ -237,6 +250,127 @@ static void mpls_route_update(struct net *net, unsigned index,
 	mpls_rt_free(old);
 }
 
+static unsigned find_free_label(struct net *net)
+{
+	unsigned index;
+	for (index = 16; index < net->mpls.platform_labels; index++) {
+		if (!net->mpls.platform_label[index])
+			return index;
+	}
+	return LABEL_NOT_SPECIFIED;
+}
+
+static int mpls_route_add(struct mpls_route_config *cfg)
+{
+	struct net *net = cfg->rc_nlinfo.nl_net;
+	struct net_device *dev = NULL;
+	struct mpls_route *rt, *old;
+	unsigned index;
+	int i;
+	int err = -EINVAL;
+
+	index = cfg->rc_label;
+
+	/* If a label was not specified during insert pick one */
+	if ((index == LABEL_NOT_SPECIFIED) &&
+	    (cfg->rc_nlflags & NLM_F_CREATE)) {
+		index = find_free_label(net);
+	}
+
+	/* The first 16 labels are reserved, and may not be set */
+	if (index < 16)
+		goto errout;
+
+	/* The full 20 bit range may not be supported. */
+	if (index >= net->mpls.platform_labels)
+		goto errout;
+
+	/* Ensure only a supported number of labels are present */
+	if (cfg->rc_output_labels > MAX_NEW_LABELS)
+		goto errout;
+
+	err = -ENODEV;
+	dev = dev_get_by_index(net, cfg->rc_ifindex);
+	if (!dev)
+		goto errout;
+
+	err = -ENETDOWN;
+	if (!(dev->flags & IFF_UP))
+		goto errout;
+
+	/* For now just support ethernet devices */
+	err = -EINVAL;
+	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
+		goto errout;
+
+	err = -EINVAL;
+	if (dev->addr_len != cfg->rc_ha_len)
+		goto errout;
+
+	/* Append makes no sense with mpls */
+	err = -EINVAL;
+	if (cfg->rc_nlflags & NLM_F_APPEND)
+		goto errout;
+
+	err = -EEXIST;
+	old = net->mpls.platform_label[index];
+	if ((cfg->rc_nlflags & NLM_F_EXCL) && old)
+		goto errout;
+
+	err = -EEXIST;
+	if (!(cfg->rc_nlflags & NLM_F_REPLACE) && old)
+		goto errout;
+
+	err = -ENOENT;
+	if (!(cfg->rc_nlflags & NLM_F_CREATE) && !old)
+		goto errout;
+
+	err = -ENOMEM;
+	rt = mpls_rt_alloc();
+	if (!rt)
+		goto errout;
+
+	rt->rt_labels = cfg->rc_output_labels;
+	for (i = 0; i < rt->rt_labels; i++)
+		rt->rt_label[i] = cfg->rc_output_label[i];
+	rt->rt_protocol = cfg->rc_protocol;
+	rt->rt_dev = dev;
+	memcpy(rt->rt_ha, cfg->rc_ha, dev->addr_len);
+
+	mpls_route_update(net, index, NULL, rt, &cfg->rc_nlinfo);
+
+	dev_put(dev);
+	return 0;
+
+errout:
+	if (dev)
+		dev_put(dev);
+	return err;
+}
+
+static int mpls_route_del(struct mpls_route_config *cfg)
+{
+	struct net *net = cfg->rc_nlinfo.nl_net;
+	unsigned index;
+	int err = -EINVAL;
+
+	index = cfg->rc_label;
+
+	/* The first 16 labels are reserved, and may not be removed */
+	if (index < 16)
+		goto errout;
+
+	/* The full 20 bit range may not be supported */
+	if (index >= net->mpls.platform_labels)
+		goto errout;
+
+	mpls_route_update(net, index, NULL, NULL, &cfg->rc_nlinfo);
+
+	err = 0;
+errout:
+	return err;
+}
+
 static void mpls_ifdown(struct net_device *dev)
 {
 	struct net *net = dev_net(dev);
-- 
2.2.1

* [PATCH net-next 5/8] mpls: Functions for reading and writing mpls labels over netlink
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (3 preceding siblings ...)
  2015-02-25 17:16 ` [PATCH net-next 4/8] mpls: Basic support for adding and removing routes Eric W. Biederman
@ 2015-02-25 17:16 ` Eric W. Biederman
  2015-02-25 17:17 ` [PATCH net-next 6/8] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


Reading and writing addresses in network byte order in netlink is
traditional and I see no reason to change that.  MPLS is interesting
as it effectively has variable length addresses (the MPLS label
stack).  To represent these variable length addresses in netlink
I use a valid MPLS label stack (complete with stop bit).

This achieves two things: a well defined existing format is used,
and the data can be interpreted without looking at its length.

Not needing to look at the length to decode the variable length
network representation allows existing userspace functions
such as inet_ntop to be used without needing to change their
prototype.
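
As an illustration only, the encoding described above can be sketched in
userspace C.  The helper names labels_to_wire and wire_to_labels are
hypothetical and not part of this patch; the shift values are assumed to
match the standard label stack entry layout (20-bit label, 3-bit TC,
1-bit stop bit, 8-bit TTL).  The decoder stops at the stop bit, which is
why no attribute length is needed to interpret the data.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field layout of one 32-bit label stack entry. */
#define LSE_LABEL_SHIFT 12
#define LSE_S_SHIFT      8
#define LSE_LABEL_MAX   ((1u << 20) - 1)

/* Hypothetical helper: write n labels as a label stack in network byte
 * order.  TTL and TC stay zero; only the last entry has the stop bit set. */
static void labels_to_wire(const uint32_t *label, size_t n, uint32_t *wire)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t bos = (i == n - 1) ? 1u : 0u;

		assert(label[i] <= LSE_LABEL_MAX);
		wire[i] = htonl((label[i] << LSE_LABEL_SHIFT) |
				(bos << LSE_S_SHIFT));
	}
}

/* Hypothetical helper: walk a wire-format stack until the stop bit is seen.
 * max is only the capacity of the output array.  Returns the number of
 * labels decoded, or -1 if max entries are read without a stop bit. */
static int wire_to_labels(const uint32_t *wire, size_t max, uint32_t *label)
{
	for (size_t i = 0; i < max; i++) {
		uint32_t entry = ntohl(wire[i]);

		label[i] = entry >> LSE_LABEL_SHIFT;
		if (entry & (1u << LSE_S_SHIFT))
			return (int)(i + 1);
	}
	return -1;
}
```

The kernel helpers in this patch additionally reject entries whose ttl or
tc fields are non-zero; the sketch leaves that check out for brevity.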

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c  | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/mpls/internal.h |  3 +++
 2 files changed, 60 insertions(+)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 6a9ef31e0129..75f24609f297 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -398,6 +398,63 @@ static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+int nla_put_labels(struct sk_buff *skb, int attrtype,
+		   u8 labels, const u32 label[])
+{
+	struct nlattr *nla;
+	struct mpls_shim_hdr *nla_label;
+	bool bos;
+	int i;
+	nla = nla_reserve(skb, attrtype, labels*4);
+	if (!nla)
+		return -EMSGSIZE;
+
+	nla_label = nla_data(nla);
+	bos = true;
+	for (i = labels - 1; i >= 0; i--) {
+		nla_label[i] = mpls_entry_encode(label[i], 0, 0, bos);
+		bos = false;
+	}
+
+	return 0;
+}
+
+int nla_get_labels(const struct nlattr *nla,
+		   u32 max_labels, u32 *labels, u32 label[])
+{
+	unsigned len = nla_len(nla);
+	unsigned nla_labels;
+	struct mpls_shim_hdr *nla_label;
+	bool bos;
+	int i;
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len/4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_label = nla_data(nla);
+	bos = true;
+	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
+		struct mpls_entry_decoded dec;
+		dec = mpls_entry_decode(nla_label + i);
+
+		/* Ensure the bottom of stack flag is properly set
+		 * and ttl and tc are both clear.
+		 */
+		if ((dec.bos != bos) || dec.ttl || dec.tc)
+			return -EINVAL;
+
+		label[i] = dec.label;
+	}
+	*labels = nla_labels;
+	return 0;
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index c2944cb84d48..fb6de92052c4 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -53,4 +53,7 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }
 
+int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels, const u32 label[]);
+int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+
 #endif /* MPLS_INTERNAL_H */
-- 
2.2.1

* [PATCH net-next 6/8] mpls: Netlink commands to add, remove, and dump routes
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (4 preceding siblings ...)
  2015-02-25 17:16 ` [PATCH net-next 5/8] mpls: Functions for reading and writing mpls labels over netlink Eric W. Biederman
@ 2015-02-25 17:17 ` Eric W. Biederman
  2015-02-25 17:18 ` [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel Eric W. Biederman
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


This change adds two new netlink routing attributes:
RTA_LLGATEWAY and RTA_NEWDST.

RTA_LLGATEWAY specifies the destination gateway by its address in the
underlying address family instead of in MPLS as RTA_GATEWAY would
require.  That is, RTA_LLGATEWAY allows specifying the next hop by the
next hop's MAC address.

There are places where MPLS is used where the next hop address cannot
be specified as an IPv4 or an IPv6 address, and this allows me to avoid
figuring out how to teach the neighbour table to deal with next hops
in a different address family.  I expect at some point someone will
add support for next hop addresses as IPv4 and IPv6 addresses, as
that allows replacing the next hop machine without having to
reconfigure the rest of the network.

RTA_NEWDST specifies the destination address to forward the packet
with.  MPLS typically changes its destination address at every hop.
For a swap operation RTA_NEWDST is specified with a length of one label.
For a push operation RTA_NEWDST is specified with two or more labels.
For a pop operation RTA_NEWDST is not specified or equivalently an empty
RTA_NEWDST is specified.
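
The mapping from the RTA_NEWDST label count to an operation can be
sketched as below.  This is a userspace illustration, not code from this
patch: newdst_op and apply_newdst are hypothetical names, and the sketch
assumes forwarding always removes the incoming top label before prepending
the RTA_NEWDST labels, with index 0 being the top of the stack.

```c
#include <assert.h>
#include <string.h>

enum mpls_op { MPLS_OP_POP, MPLS_OP_SWAP, MPLS_OP_PUSH };

/* Name the operation implied by the number of labels in RTA_NEWDST. */
static enum mpls_op newdst_op(int nlabels)
{
	assert(nlabels >= 0);
	if (nlabels == 0)
		return MPLS_OP_POP;
	return (nlabels == 1) ? MPLS_OP_SWAP : MPLS_OP_PUSH;
}

/* Replace the top label of in[] (depth in_n) with the nlabels entries of
 * newdst[].  Returns the new stack depth, or -1 if the input stack is
 * empty or the result does not fit in out[]. */
static int apply_newdst(const unsigned *in, int in_n,
			const unsigned *newdst, int nlabels,
			unsigned *out, int out_cap)
{
	int rest = in_n - 1;	/* labels below the incoming top label */

	if (in_n < 1 || nlabels < 0 || nlabels + rest > out_cap)
		return -1;
	if (nlabels > 0)
		memcpy(out, newdst, (size_t)nlabels * sizeof(*out));
	if (rest > 0)
		memcpy(out + nlabels, in + 1, (size_t)rest * sizeof(*out));
	return nlabels + rest;
}
```

With an incoming stack of {100, 7}, a one-label RTA_NEWDST of {200} yields
{200, 7} (swap), {300, 200} yields {300, 200, 7} (push), and an absent or
empty RTA_NEWDST yields {7} (pop).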

Those new netlink attributes are used to implement handling of rt-netlink
RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the
MPLS label table.

rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message,
verifies that no unhandled attributes or unhandled values are present, and
sets up the data structures for mpls_route_add and mpls_route_del.

I did my best to match up with the existing conventions with the caveats
that MPLS addresses are all destination-specific-addresses, and so
don't properly have a scope.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/uapi/linux/rtnetlink.h |   2 +
 net/mpls/af_mpls.c             | 192 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 194 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 5cc5d66bf519..da9889a4dec0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -303,6 +303,8 @@ enum rtattr_type_t {
 	RTA_TABLE,
 	RTA_MARK,
 	RTA_MFC_STATS,
+	RTA_LLGATEWAY,
+	RTA_NEWDST,
 	__RTA_MAX
 };
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 75f24609f297..5cf9aa68c32f 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -206,6 +206,11 @@ static struct packet_type mpls_packet_type __read_mostly = {
 	.func = mpls_forward,
 };
 
+const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
+	[RTA_DST]		= { .type = NLA_U32 },
+	[RTA_OIF]		= { .type = NLA_U32 },
+};
+
 struct mpls_route_config {
 	u32		rc_protocol;
 	u32		rc_ifindex;
@@ -455,6 +460,189 @@ int nla_get_labels(const struct nlattr *nla,
 	return 0;
 }
 
+static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
+			       struct mpls_route_config *cfg)
+{
+	struct rtmsg *rtm;
+	struct nlattr *tb[RTA_MAX+1];
+	int index;
+	int err;
+
+	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_mpls_policy);
+	if (err < 0)
+		goto errout;
+
+	err = -EINVAL;
+	rtm = nlmsg_data(nlh);
+	memset(cfg, 0, sizeof(*cfg));
+
+	if (rtm->rtm_family != AF_MPLS)
+		goto errout;
+	if (rtm->rtm_dst_len != 20)
+		goto errout;
+	if (rtm->rtm_src_len != 0)
+		goto errout;
+	if (rtm->rtm_tos != 0)
+		goto errout;
+	if (rtm->rtm_table != RT_TABLE_MAIN)
+		goto errout;
+	/* Any value is acceptable for rtm_protocol */
+
+	/* As mpls uses destination specific addresses
+	 * (or source specific address in the case of multicast)
+	 * all addresses have universal scope.
+	 */
+	if (rtm->rtm_scope != RT_SCOPE_UNIVERSE)
+		goto errout;
+	if (rtm->rtm_type != RTN_UNICAST)
+		goto errout;
+	if (rtm->rtm_flags != 0)
+		goto errout;
+
+	cfg->rc_label		= LABEL_NOT_SPECIFIED;
+	cfg->rc_protocol	= rtm->rtm_protocol;
+	cfg->rc_nlflags		= nlh->nlmsg_flags;
+	cfg->rc_nlinfo.portid	= NETLINK_CB(skb).portid;
+	cfg->rc_nlinfo.nlh	= nlh;
+	cfg->rc_nlinfo.nl_net	= sock_net(skb->sk);
+
+	for (index = 0; index <= RTA_MAX; index++) {
+		struct nlattr *nla = tb[index];
+		if (!nla)
+			continue;
+
+		switch(index) {
+		case RTA_OIF:
+			cfg->rc_ifindex = nla_get_u32(nla);
+			break;
+		case RTA_NEWDST:
+			if (nla_get_labels(nla, MAX_NEW_LABELS,
+					   &cfg->rc_output_labels,
+					   cfg->rc_output_label))
+				goto errout;
+			break;
+		case RTA_DST:
+		{
+			u32 label_count;
+			if (nla_get_labels(nla, 1, &label_count,
+					   &cfg->rc_label))
+				goto errout;
+
+			/* The first 16 labels are reserved, and may not be set */
+			if (cfg->rc_label < 16)
+				goto errout;
+
+			break;
+		}
+		case RTA_LLGATEWAY:
+			cfg->rc_ha_len = nla_len(nla);
+			if (cfg->rc_ha_len > MAX_HA_LEN)
+				goto errout;
+
+			memcpy(cfg->rc_ha, nla_data(nla), cfg->rc_ha_len);
+			break;
+		default:
+			/* Unsupported attribute */
+			goto errout;
+		}
+	}
+
+	err = 0;
+errout:
+	return err;
+}
+
+static int mpls_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct mpls_route_config cfg;
+	int err;
+
+	err = rtm_to_route_config(skb, nlh, &cfg);
+	if (err < 0)
+		return err;
+
+	return mpls_route_del(&cfg);
+}
+
+
+static int mpls_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct mpls_route_config cfg;
+	int err;
+
+	err = rtm_to_route_config(skb, nlh, &cfg);
+	if (err < 0)
+		return err;
+
+	return mpls_route_add(&cfg);
+}
+
+static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
+			   u32 label, struct mpls_route *rt, int flags)
+{
+	struct nlmsghdr *nlh;
+	struct rtmsg *rtm;
+
+	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*rtm), flags);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	rtm = nlmsg_data(nlh);
+	rtm->rtm_family = AF_MPLS;
+	rtm->rtm_dst_len = 20;
+	rtm->rtm_src_len = 0;
+	rtm->rtm_tos = 0;
+	rtm->rtm_table = RT_TABLE_MAIN;
+	rtm->rtm_protocol = rt->rt_protocol;
+	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
+	rtm->rtm_type = RTN_UNICAST;
+	rtm->rtm_flags = 0;
+
+	if (rt->rt_labels &&
+	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
+		goto nla_put_failure;
+	if (nla_put(skb, RTA_LLGATEWAY, rt->rt_dev->addr_len, rt->rt_ha))
+		goto nla_put_failure;
+	if (nla_put_u32(skb, RTA_OIF, rt->rt_dev->ifindex))
+		goto nla_put_failure;
+	if (nla_put_labels(skb, RTA_DST, 1, &label))
+		goto nla_put_failure;
+
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	unsigned int index;
+
+	ASSERT_RTNL();
+
+	index = cb->args[0];
+	if (index < 16)
+		index = 16;
+
+	for (; index < net->mpls.platform_labels; index++) {
+		struct mpls_route *rt;
+		rt = net->mpls.platform_label[index];
+		if (!rt)
+			continue;
+
+		if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid,
+				    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+				    index, rt, NLM_F_MULTI) < 0)
+			break;
+	}
+	cb->args[0] = index;
+
+	return skb->len;
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
@@ -644,6 +832,9 @@ static int __init mpls_init(void)
 
 	dev_add_pack(&mpls_packet_type);
 
+	rtnl_register(PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, NULL);
+	rtnl_register(PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, NULL);
+	rtnl_register(PF_MPLS, RTM_GETROUTE, NULL, mpls_dump_routes, NULL);
 	err = 0;
 out:
 	return err;
@@ -656,6 +847,7 @@ module_init(mpls_init);
 
 static void __exit mpls_exit(void)
 {
+	rtnl_unregister_all(PF_MPLS);
 	dev_remove_pack(&mpls_packet_type);
 	unregister_netdevice_notifier(&mpls_dev_notifier);
 	unregister_pernet_subsys(&mpls_net_ops);
-- 
2.2.1

* [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (5 preceding siblings ...)
  2015-02-25 17:17 ` [PATCH net-next 6/8] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
@ 2015-02-25 17:18 ` Eric W. Biederman
  2015-03-05  9:17   ` Vivek Venkatraman
  2015-02-25 17:19 ` [PATCH net-next 7/8] mpls: Multicast route table change notifications Eric W. Biederman
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


Allow creating an mpls tunnel endpoint with

ip link add type ipmpls.

This tunnel has an mpls label for its link layer address, and by
default sends all ingress packets over loopback to the local MPLS
forwarding logic which performs all of the work.

Ingress IPv4, IPv6 and MPLS packets are supported.
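
A rough userspace sketch of how the pushed label stack entry depends on
the ingress protocol follows.  The helper ipmpls_build_entry and the
PROTO_* names are hypothetical; the values are assumed to be the usual
EtherType numbers, and the entry is shown in host byte order with the
standard label/stop-bit/TTL layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROTO_IPV4    0x0800u	/* assumed value of ETH_P_IP */
#define PROTO_IPV6    0x86DDu	/* assumed value of ETH_P_IPV6 */
#define PROTO_MPLS_UC 0x8847u	/* assumed value of ETH_P_MPLS_UC */

/* Hypothetical helper: build the host-order entry the tunnel would push.
 * inner_ttl is the IPv4 TTL, the IPv6 hop limit, or the TTL of the inner
 * MPLS entry.  Returns false for protocols the tunnel does not carry. */
static bool ipmpls_build_entry(uint32_t label, uint16_t inner_proto,
			       uint8_t inner_ttl, uint32_t *entry)
{
	bool bos;

	switch (inner_proto) {
	case PROTO_IPV4:
	case PROTO_IPV6:
		bos = true;	/* the new entry is the whole stack */
		break;
	case PROTO_MPLS_UC:
		bos = false;	/* the inner stack carries the stop bit */
		break;
	default:
		return false;
	}

	assert(label < (1u << 20));
	*entry = (label << 12) | ((uint32_t)bos << 8) | inner_ttl;
	return true;
}
```

Copying the TTL from the inner packet keeps hop counting continuous across
the tunnel; any other protocol is rejected, as in the drop path of the
transmit function.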

A new arp type ARPHRD_MPLS is defined for network devices
whose link-layer address is an mpls label stack.

This is the most bare bones version of this tunnel device I can think
of.  Not even packet counters have been implemented. Offloads
and features in general are not supported, just to keep it simple and
obviously correct to start with.  In principle it should be able to
allow binding to a physical network device and pass all of the
offloads through ipmpls like the vlan, macvlan, or even ipvlan does,
allowing a very fast light weight connection to the network.

The technically tricky bit of residing over something besides
the loopback device is how to get the next-hop mac address.
Neighbour table integration?  Something else?

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/uapi/linux/if_arp.h |   1 +
 net/mpls/Kconfig            |   5 +
 net/mpls/Makefile           |   1 +
 net/mpls/ipmpls.c           | 219 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 226 insertions(+)
 create mode 100644 net/mpls/ipmpls.c

diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index 4d024d75d64b..17d669fd1781 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -88,6 +88,7 @@
 #define ARPHRD_IEEE80211_RADIOTAP 803	/* IEEE 802.11 + radiotap header */
 #define ARPHRD_IEEE802154	  804
 #define ARPHRD_IEEE802154_MONITOR 805	/* IEEE 802.15.4 network monitor */
+#define ARPHRD_MPLS	806		/* IP and IPv6 over MPLS tunnels */
 
 #define ARPHRD_PHONET	820		/* PhoNet media type		*/
 #define ARPHRD_PHONET_PIPE 821		/* PhoNet pipe header		*/
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index f4286ee7e2b0..4a6106dabfa8 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -27,4 +27,9 @@ config MPLS_ROUTING
 	help
 	 Add support for forwarding of mpls packets.
 
+config MPLS_IPTUNNEL
+	bool "MPLS: IP over MPLS tunnel support"
+	help
+	 A network device that allows sending ip packets into an mpls tunnel
+
 endif # MPLS
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 60af15f1960e..4b578c80b9c5 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -3,3 +3,4 @@
 #
 obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
 obj-$(CONFIG_MPLS_ROUTING) += af_mpls.o
+obj-$(CONFIG_MPLS_IPTUNNEL) += ipmpls.o
diff --git a/net/mpls/ipmpls.c b/net/mpls/ipmpls.c
new file mode 100644
index 000000000000..96938748654f
--- /dev/null
+++ b/net/mpls/ipmpls.c
@@ -0,0 +1,219 @@
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/if_arp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/module.h>
+#include <linux/mpls.h>
+#include "internal.h"
+
+static LIST_HEAD(ipmpls_dev_list);
+
+struct ipmpls_dev_priv {
+	unsigned label;
+	struct net_device *out_dev;
+	struct list_head list;
+	struct net_device *dev;
+};
+
+static netdev_tx_t ipmpls_dev_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+	struct net_device *out_dev = priv->out_dev;
+	struct mpls_shim_hdr *hdr;
+	bool bottom_of_stack = true;
+	unsigned ttl;
+	int ret;
+
+	/* Obtain the ttl */
+	if (skb->protocol == htons(ETH_P_IP)) {
+		ttl = ip_hdr(skb)->ttl;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ttl = ipv6_hdr(skb)->hop_limit;
+	} else if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+		ttl = mpls_entry_decode(mpls_hdr(skb)).ttl;
+		bottom_of_stack = false;
+	} else
+		goto drop;
+
+	skb_orphan(skb);
+
+	skb->inner_protocol = skb->protocol;
+	skb->inner_network_header = skb->network_header;
+
+	skb_push(skb, sizeof(*hdr));
+	skb_reset_network_header(skb);
+	hdr = mpls_hdr(skb);
+	*hdr = mpls_entry_encode(priv->label, ttl, 0, bottom_of_stack);
+
+	skb->dev = out_dev;
+	skb->protocol = htons(ETH_P_MPLS_UC);
+
+	ret = dev_hard_header(skb, out_dev, ETH_P_MPLS_UC,
+			      out_dev->dev_addr, NULL, skb->len);
+	if (ret >= 0)
+		ret = dev_queue_xmit(skb);
+	if (ret)
+		goto drop;
+
+	return 0;
+
+drop:
+	/* TODO keep packet counters */
+	kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int ipmpls_dev_init(struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+
+	list_add_tail(&priv->list, &ipmpls_dev_list);
+
+	return 0;
+}
+
+
+static void ipmpls_dev_uninit(struct net_device *dev)
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+	list_del_init(&priv->list);
+}
+
+static void ipmpls_dev_free(struct net_device *dev)
+{
+	free_netdev(dev);
+}
+
+static const struct net_device_ops ipmpls_netdev_ops = {
+	.ndo_init		= ipmpls_dev_init,
+	.ndo_start_xmit		= ipmpls_dev_xmit,
+	.ndo_uninit		= ipmpls_dev_uninit,
+};
+
+#define IPMPLS_FEATURES (NETIF_F_SG | 			\
+			 NETIF_F_FRAGLIST |		\
+			 NETIF_F_HIGHDMA |		\
+			 NETIF_F_VLAN_CHALLENGED)
+
+static void ipmpls_dev_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &ipmpls_netdev_ops;
+
+	dev->type		= ARPHRD_MPLS;
+	dev->flags = IFF_NOARP;
+	dev->flags &= ~IFF_MULTICAST;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_LLTX;
+	dev->features		|= IPMPLS_FEATURES;
+	dev->hw_features	|= IPMPLS_FEATURES;
+	dev->vlan_features	= 0;
+
+	dev->destructor = ipmpls_dev_free;
+}
+
+static int ipmpls_dev_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	unsigned label;
+	if (!tb[IFLA_ADDRESS])
+		return -EADDRNOTAVAIL;
+	if (nla_len(tb[IFLA_ADDRESS]) != 4)
+		return -EINVAL;
+
+	label = be32_to_cpu(nla_get_be32(tb[IFLA_ADDRESS]));
+	if (label >= (1 << 20))
+		return -EADDRNOTAVAIL;
+
+	return 0;
+}
+
+static int ipmpls_dev_newlink(struct net *src_net, struct net_device *dev,
+			struct nlattr *tb[], struct nlattr *data[])
+{
+	struct ipmpls_dev_priv *priv = netdev_priv(dev);
+	u32 labels;
+
+	if (nla_get_labels(tb[IFLA_ADDRESS], 1, &labels, &priv->label))
+		return -EINVAL;
+
+	priv->out_dev = src_net->loopback_dev;
+	priv->dev = dev;
+
+	dev->hard_header_len =
+		priv->out_dev->hard_header_len + sizeof(struct mpls_shim_hdr);
+
+	return register_netdevice(dev);
+}
+
+static void ipmpls_dev_dellink(struct net_device *dev, struct list_head *head)
+{
+	unregister_netdevice_queue(dev, head);
+}
+
+static struct rtnl_link_ops ipmpls_ops = {
+	.kind		= "ipmpls",
+	.priv_size	= sizeof(struct ipmpls_dev_priv),
+	.setup		= ipmpls_dev_setup,
+	.validate	= ipmpls_dev_validate,
+	.newlink	= ipmpls_dev_newlink,
+	.dellink	= ipmpls_dev_dellink,
+};
+
+static int ipmpls_dev_notify(struct notifier_block *this, unsigned long event,
+			     void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	if (event == NETDEV_UNREGISTER) {
+		struct ipmpls_dev_priv *priv, *priv2;
+		LIST_HEAD(list_kill);
+
+		/* Ignore netns device moves */
+		if (dev->reg_state != NETREG_UNREGISTERING)
+			goto done;
+
+		list_for_each_entry_safe(priv, priv2, &ipmpls_dev_list, list) {
+			if (priv->out_dev != dev)
+				continue;
+
+			ipmpls_dev_dellink(priv->dev, &list_kill);
+		}
+		unregister_netdevice_many(&list_kill);
+	}
+done:
+	return NOTIFY_OK;
+}
+
+static struct notifier_block ipmpls_dev_notifier = {
+	.notifier_call = ipmpls_dev_notify,
+};
+
+static int __init ipmpls_init(void)
+{
+	int err;
+
+	err = register_netdevice_notifier(&ipmpls_dev_notifier);
+	if (err)
+		goto out;
+
+	err = rtnl_link_register(&ipmpls_ops);
+	if (err)
+		goto out_unregister_notifier;
+out:
+	return err;
+out_unregister_notifier:
+	unregister_netdevice_notifier(&ipmpls_dev_notifier);
+	goto out;
+}
+module_init(ipmpls_init);
+
+static void __exit ipmpls_exit(void)
+{
+	rtnl_link_unregister(&ipmpls_ops);
+	unregister_netdevice_notifier(&ipmpls_dev_notifier);
+}
+module_exit(ipmpls_exit);
+
+MODULE_ALIAS_RTNL_LINK("ipmpls");
-- 
2.2.1

* [PATCH net-next 7/8] mpls: Multicast route table change notifications
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (6 preceding siblings ...)
  2015-02-25 17:18 ` [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel Eric W. Biederman
@ 2015-02-25 17:19 ` Eric W. Biederman
  2015-02-26  7:21   ` roopa
  2015-02-25 17:37 ` [PATCH iproute2] mpls: Add basic mpls support to iproute Eric W. Biederman
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago


Unlike IPv4 this code notifies on all cases where mpls routes
are added or removed as that was the simplest to implement.

In particular routes being removed because a network interface
goes down or is removed are notified about.  Are there technical
arguments for handling this differently?  Userspace developers
don't particularly like the way IPv4 handles route removal
on ifdown.

For now reserved labels are handled automatically and userspace
is not notified.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/uapi/linux/rtnetlink.h |  2 ++
 net/mpls/af_mpls.c             | 60 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index da9889a4dec0..481d2516ccd0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -625,6 +625,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_IPV6_NETCONF	RTNLGRP_IPV6_NETCONF
 	RTNLGRP_MDB,
 #define RTNLGRP_MDB		RTNLGRP_MDB
+	RTNLGRP_MPLS_ROUTE,
+#define RTNLGRP_MPLS_ROUTE	RTNLGRP_MPLS_ROUTE
 	__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX	(__RTNLGRP_MAX - 1)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 5cf9aa68c32f..90e45461c8e2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -34,6 +34,10 @@ struct mpls_route { /* next hop label forwarding entry */
 static int zero = 0;
 static int label_limit = (1 << 20) - 1;
 
+static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
+		       struct nlmsghdr *nlh, struct net *net, u32 portid,
+		       unsigned int nlm_flags);
+
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
@@ -237,6 +241,20 @@ static void mpls_rt_free(struct mpls_route *rt)
 		kfree_rcu(rt, rt_rcu);
 }
 
+static void mpls_notify_route(struct net *net, unsigned index,
+			      struct mpls_route *old, struct mpls_route *new,
+			      const struct nl_info *info)
+{
+	struct nlmsghdr *nlh = info ? info->nlh : NULL;
+	unsigned portid = info ? info->portid : 0;
+	int event = new ? RTM_NEWROUTE : RTM_DELROUTE;
+	struct mpls_route *rt = new ? new : old;
+	unsigned nlm_flags = (old && new) ? NLM_F_REPLACE : 0;
+	/* Ignore reserved labels for now */
+	if (rt && (index >= 16))
+		rtmsg_lfib(event, index, rt, nlh, net, portid, nlm_flags);
+}
+
 static void mpls_route_update(struct net *net, unsigned index,
 			      struct net_device *dev, struct mpls_route *new,
 			      const struct nl_info *info)
@@ -251,6 +269,8 @@ static void mpls_route_update(struct net *net, unsigned index,
 		old = rt;
 	}
 
+	mpls_notify_route(net, index, old, new, info);
+
 	/* If we removed a route free it now */
 	mpls_rt_free(old);
 }
@@ -643,6 +663,46 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+static inline size_t lfib_nlmsg_size(struct mpls_route *rt)
+{
+	size_t payload =
+		NLMSG_ALIGN(sizeof(struct rtmsg))
+		+ ((rt->rt_labels == 0) ? 0 :		/* RTA_NEWDST */
+		   nla_total_size(rt->rt_labels *4))
+		+ nla_total_size(rt->rt_dev->addr_len)	/* RTA_LLGATEWAY */
+		+ nla_total_size(4)			/* RTA_OIF */
+		+ nla_total_size(4);			/* RTA_DST */
+
+	return payload;
+}
+
+static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
+		       struct nlmsghdr *nlh, struct net *net, u32 portid,
+		       unsigned int nlm_flags)
+{
+	struct sk_buff *skb;
+	u32 seq = nlh ? nlh->nlmsg_seq : 0;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(lfib_nlmsg_size(rt), GFP_KERNEL);
+	if (skb == NULL)
+		goto errout;
+
+	err = mpls_dump_route(skb, portid, seq, event, label, rt, nlm_flags);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in lfib_nlmsg_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+	rtnl_notify(skb, net, portid, RTNLGRP_MPLS_ROUTE, nlh, GFP_KERNEL);
+
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_MPLS_ROUTE, err);
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
-- 
2.2.1

* [PATCH iproute2] mpls: Add basic mpls support to iproute
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (7 preceding siblings ...)
  2015-02-25 17:19 ` [PATCH net-next 7/8] mpls: Multicast route table change notifications Eric W. Biederman
@ 2015-02-25 17:37 ` Eric W. Biederman
  2015-02-26  6:58 ` [PATCH net-next 0/8] Basic MPLS support roopa
  2015-02-27 21:21 ` David Miller
  10 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-25 17:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, roopa, santiago, David Miller


This includes support for two new netlink attributes and mpls address
parsing and printing routines.

I don't like how I have AF_MPLS and the defines from include/uapi/linux/mpls.h
duplicated in include/utils.h but I drew a blank when thinking of a
better way to handle this.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

The kernel side of this code hasn't gone in yet, so I expect it is
probably premature to pull this code into iproute2 but at the same
time this code is needed to use and understand the kernel code so I am
posting it now, and will resend later if needed.

 Makefile                  |  3 +++
 include/linux/rtnetlink.h |  4 +++
 include/utils.h           | 41 +++++++++++++++++++++++++++++
 ip/ip.c                   |  4 +++
 ip/ipmonitor.c            |  3 +++
 ip/iproute.c              | 36 +++++++++++++++++++++++++
 lib/mpls_ntop.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++++
 lib/mpls_pton.c           | 58 ++++++++++++++++++++++++++++++++++++++++
 lib/utils.c               | 26 ++++++++++++++++--
 9 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 lib/mpls_ntop.c
 create mode 100644 lib/mpls_pton.c

diff --git a/Makefile b/Makefile
index 9dbb29f3d0cd..ca6c2e141308 100644
--- a/Makefile
+++ b/Makefile
@@ -26,6 +26,9 @@ ADDLIB+=dnet_ntop.o dnet_pton.o
 #options for ipx
 ADDLIB+=ipx_ntop.o ipx_pton.o
 
+#options for mpls
+ADDLIB+=mpls_ntop.o mpls_pton.o
+
 CC = gcc
 HOSTCC = gcc
 DEFINES += -D_GNU_SOURCE
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 3eb78105399b..cf0866d1a8ff 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -303,6 +303,8 @@ enum rtattr_type_t {
 	RTA_TABLE,
 	RTA_MARK,
 	RTA_MFC_STATS,
+	RTA_LLGATEWAY,
+	RTA_NEWDST,
 	__RTA_MAX
 };
 
@@ -621,6 +623,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_IPV6_NETCONF	RTNLGRP_IPV6_NETCONF
 	RTNLGRP_MDB,
 #define RTNLGRP_MDB		RTNLGRP_MDB
+	RTNLGRP_MPLS_ROUTE,
+#define RTNLGRP_MPLS_ROUTE	RTNLGRP_MPLS_ROUTE
 	__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX	(__RTNLGRP_MAX - 1)
diff --git a/include/utils.h b/include/utils.h
index 3da22837d2e6..f36fee83bfbe 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -77,6 +77,44 @@ struct ipx_addr {
 	u_int8_t  ipx_node[IPX_NODE_LEN];
 };
 
+#ifndef AF_MPLS
+# define AF_MPLS 28
+#endif
+#ifndef PF_MPLS
+# define PF_MPLS AF_MPLS
+#endif
+
+#ifndef MPLS_LS_LABEL_MASK
+# define MPLS_LS_LABEL_MASK      0xFFFFF000
+#endif
+#ifndef MPLS_LS_LABEL_SHIFT
+# define MPLS_LS_LABEL_SHIFT     12
+#endif
+#ifndef MPLS_LS_TC_MASK
+# define MPLS_LS_TC_MASK         0x00000E00
+#endif
+#ifndef MPLS_LS_TC_SHIFT
+# define MPLS_LS_TC_SHIFT        9
+#endif
+#ifndef MPLS_LS_S_MASK
+# define MPLS_LS_S_MASK          0x00000100
+#endif
+#ifndef MPLS_LS_S_SHIFT
+# define MPLS_LS_S_SHIFT         8
+#endif
+#ifndef MPLS_LS_TTL_MASK
+# define MPLS_LS_TTL_MASK        0x000000FF
+#endif
+#ifndef MPLS_LS_TTL_SHIFT
+# define MPLS_LS_TTL_SHIFT       0
+#endif
+
+/* Maximum number of labels our helpers support */
+#define MPLS_MAX_LABELS 8
+struct mpls_addr {
+	u_int32_t label_stack_entry;
+};
+
 extern __u32 get_addr32(const char *name);
 extern int get_addr_1(inet_prefix *dst, const char *arg, int family);
 extern int get_prefix_1(inet_prefix *dst, char *arg, int family);
@@ -119,6 +157,9 @@ int dnet_pton(int af, const char *src, void *addr);
 const char *ipx_ntop(int af, const void *addr, char *str, size_t len);
 int ipx_pton(int af, const char *src, void *addr);
 
+const char *mpls_ntop(int af, const void *addr, char *str, size_t len);
+int mpls_pton(int af, const char *src, void *addr);
+
 extern int __iproute2_hz_internal;
 extern int __get_hz(void);
 
diff --git a/ip/ip.c b/ip/ip.c
index da16b15f8b55..53be50dd378b 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -200,6 +200,8 @@ int main(int argc, char **argv)
 				preferred_family = AF_PACKET;
 			else if (strcmp(argv[1], "ipx") == 0)
 				preferred_family = AF_IPX;
+			else if (strcmp(argv[1], "mpls") == 0)
+				preferred_family = AF_MPLS;
 			else if (strcmp(argv[1], "bridge") == 0)
 				preferred_family = AF_BRIDGE;
 			else if (strcmp(argv[1], "help") == 0)
@@ -216,6 +218,8 @@ int main(int argc, char **argv)
 			preferred_family = AF_IPX;
 		} else if (strcmp(opt, "-D") == 0) {
 			preferred_family = AF_DECnet;
+		} else if (strcmp(opt, "-M") == 0) {
+			preferred_family = AF_MPLS;
 		} else if (strcmp(opt, "-B") == 0) {
 			preferred_family = AF_BRIDGE;
 		} else if (matches(opt, "-human") == 0 ||
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 5ec8f4181222..03e50c7eb787 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -163,6 +163,7 @@ int do_ipmonitor(int argc, char **argv)
 	groups |= nl_mgrp(RTNLGRP_NEIGH);
 	groups |= nl_mgrp(RTNLGRP_IPV4_NETCONF);
 	groups |= nl_mgrp(RTNLGRP_IPV6_NETCONF);
+	groups |= nl_mgrp(RTNLGRP_MPLS_ROUTE);
 
 	rtnl_close(&rth);
 
@@ -229,6 +230,8 @@ int do_ipmonitor(int argc, char **argv)
 			groups |= nl_mgrp(RTNLGRP_IPV4_ROUTE);
 		if (!preferred_family || preferred_family == AF_INET6)
 			groups |= nl_mgrp(RTNLGRP_IPV6_ROUTE);
+		if (!preferred_family || preferred_family == AF_MPLS)
+			groups |= nl_mgrp(RTNLGRP_MPLS_ROUTE);
 	}
 	if (lmroute) {
 		if (!preferred_family || preferred_family == AF_INET)
diff --git a/ip/iproute.c b/ip/iproute.c
index 76d8e36ccc2b..939b661b2a7a 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -23,6 +23,7 @@
 #include <netinet/ip.h>
 #include <arpa/inet.h>
 #include <linux/in_route.h>
+#include <linux/if_arp.h>
 #include <errno.h>
 
 #include "rt_names.h"
@@ -278,6 +279,8 @@ static int calc_host_len(const struct rtmsg *r)
 		return 16;
 	else if (r->rtm_family == AF_IPX)
 		return 80;
+	else if (r->rtm_family == AF_MPLS)
+		return 20;
 	else
 		return -1;
 }
@@ -386,6 +389,13 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 	} else if (r->rtm_src_len) {
 		fprintf(fp, "from 0/%u ", r->rtm_src_len);
 	}
+	if (tb[RTA_NEWDST]) {
+		fprintf(fp, "as %s ", format_host(r->rtm_family,
+						  RTA_PAYLOAD(tb[RTA_NEWDST]),
+						  RTA_DATA(tb[RTA_NEWDST]),
+						  abuf, sizeof(abuf))
+			);
+	}
 	if (r->rtm_tos && filter.tosmask != -1) {
 		SPRINT_BUF(b1);
 		fprintf(fp, "tos %s ", rtnl_dsfield_n2a(r->rtm_tos, b1, sizeof(b1)));
@@ -398,6 +408,14 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 				    RTA_DATA(tb[RTA_GATEWAY]),
 				    abuf, sizeof(abuf)));
 	}
+	if (tb[RTA_LLGATEWAY]) {
+		SPRINT_BUF(b1);
+		fprintf(fp, "llvia %s ",
+			ll_addr_n2a(RTA_DATA(tb[RTA_LLGATEWAY]),
+				    RTA_PAYLOAD(tb[RTA_LLGATEWAY]),
+				    ARPHRD_VOID /* Unknown link-layer address type */,
+				    b1, sizeof(b1)));
+	}
 	if (tb[RTA_OIF] && filter.oifmask != -1)
 		fprintf(fp, "dev %s ", ll_index_to_name(*(int*)RTA_DATA(tb[RTA_OIF])));
 
@@ -770,6 +788,13 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			if (req.r.rtm_family == AF_UNSPEC)
 				req.r.rtm_family = addr.family;
 			addattr_l(&req.n, sizeof(req), RTA_PREFSRC, &addr.data, addr.bytelen);
+		} else if (strcmp(*argv, "as") == 0) {
+			inet_prefix addr;
+			NEXT_ARG();
+			get_addr(&addr, *argv, req.r.rtm_family);
+			if (req.r.rtm_family == AF_UNSPEC)
+				req.r.rtm_family = addr.family;
+			addattr_l(&req.n, sizeof(req), RTA_NEWDST, &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "via") == 0) {
 			inet_prefix addr;
 			gw_ok = 1;
@@ -778,6 +803,17 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			if (req.r.rtm_family == AF_UNSPEC)
 				req.r.rtm_family = addr.family;
 			addattr_l(&req.n, sizeof(req), RTA_GATEWAY, &addr.data, addr.bytelen);
+		} else if (strcmp(*argv, "llvia") == 0) {
+			char abuf[32];
+			int len;
+			gw_ok = 1;
+			NEXT_ARG();
+			len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
+			if (len <= 0) {
+				invarg("Invalid llvia address", *argv);
+				len = 0;
+			}
+			addattr_l(&req.n, sizeof(req), RTA_LLGATEWAY, abuf, len);
 		} else if (strcmp(*argv, "from") == 0) {
 			inet_prefix addr;
 			NEXT_ARG();
diff --git a/lib/mpls_ntop.c b/lib/mpls_ntop.c
new file mode 100644
index 000000000000..c6c7afae75b8
--- /dev/null
+++ b/lib/mpls_ntop.c
@@ -0,0 +1,48 @@
+#include <errno.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+
+#include "utils.h"
+
+static const char *mpls_ntop1(const struct mpls_addr *addr, char *buf, size_t buflen)
+{
+	size_t destlen = buflen;
+	char *dest = buf;
+	int count;
+
+	for (count = 0; count < MPLS_MAX_LABELS; count++) {
+		uint32_t entry = ntohl(addr[count].label_stack_entry);
+		uint32_t label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;
+		int len = snprintf(dest, destlen, "%u", label);
+
+		/* Is this the end? */
+		if (entry & MPLS_LS_S_MASK)
+			return buf;
+
+
+		dest += len;
+		destlen -= len;
+		if (destlen) {
+			*dest = '/';
+			dest++;
+			destlen--;
+		}
+	}
+	errno = E2BIG;
+	return NULL;
+}
+
+const char *mpls_ntop(int af, const void *addr, char *buf, size_t buflen)
+{
+	switch(af) {
+	case AF_MPLS:
+		errno = 0;
+		return mpls_ntop1((struct mpls_addr *)addr, buf, buflen);
+	default:
+		errno = EAFNOSUPPORT;
+	}
+
+	return NULL;
+}
diff --git a/lib/mpls_pton.c b/lib/mpls_pton.c
new file mode 100644
index 000000000000..be99b159b256
--- /dev/null
+++ b/lib/mpls_pton.c
@@ -0,0 +1,58 @@
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+
+#include "utils.h"
+
+
+static int mpls_pton1(const char *name, struct mpls_addr *addr)
+{
+	char *endp;
+	unsigned long label;
+	unsigned count;
+
+	for (count = 0; count < MPLS_MAX_LABELS; count++) {
+		unsigned long label;
+
+		label = strtoul(name, &endp, 0);
+		/* Fail when the label value is out of range */
+		if (label >= (1 << 20))
+			return 0;
+
+		if (endp == name) /* no digits */
+			return 0;
+
+		addr->label_stack_entry = htonl(label << MPLS_LS_LABEL_SHIFT);
+		if (*endp == '\0') {
+			addr->label_stack_entry |= htonl(1 << MPLS_LS_S_SHIFT);
+			return 1;
+		}
+
+		/* Bad character in the address */
+		if (*endp != '/')
+			return 0;
+
+		name = endp + 1;
+		addr += 1;
+	}
+	/* The address was too long */
+	return 0;
+}
+
+int mpls_pton(int af, const char *src, void *addr)
+{
+	int err;
+
+	switch(af) {
+	case AF_MPLS:
+		errno = 0;
+		err = mpls_pton1(src, (struct mpls_addr *)addr);
+		break;
+	default:
+		errno = EAFNOSUPPORT;
+		err = -1;
+	}
+
+	return err;
+}
diff --git a/lib/utils.c b/lib/utils.c
index efebe189758f..8385eeb2c30e 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -389,7 +389,7 @@ int get_addr_1(inet_prefix *addr, const char *name, int family)
 	if (strcmp(name, "default") == 0 ||
 	    strcmp(name, "all") == 0 ||
 	    strcmp(name, "any") == 0) {
-		if (family == AF_DECnet)
+		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
 		addr->family = family;
 		addr->bytelen = (family == AF_INET6 ? 16 : 4);
@@ -419,6 +419,23 @@ int get_addr_1(inet_prefix *addr, const char *name, int family)
 		return 0;
 	}
 
+	if (family == AF_MPLS) {
+		int i;
+		addr->family = AF_MPLS;
+		if (mpls_pton(AF_MPLS, name, addr->data) <= 0)
+			return -1;
+		addr->bytelen = 4;
+		addr->bitlen = 20;
+		/* How many bytes do I need? */
+		for (i = 0; i < MPLS_MAX_LABELS; i++) {
+			if (ntohl(addr->data[i]) & MPLS_LS_S_MASK) {
+				addr->bytelen = (i + 1)*4;
+				break;
+			}
+		}
+		return 0;
+	}
+
 	addr->family = AF_INET;
 	if (family != AF_UNSPEC && family != AF_INET)
 		return -1;
@@ -442,7 +459,7 @@ int get_prefix_1(inet_prefix *dst, char *arg, int family)
 	if (strcmp(arg, "default") == 0 ||
 	    strcmp(arg, "any") == 0 ||
 	    strcmp(arg, "all") == 0) {
-		if (family == AF_DECnet)
+		if ((family == AF_DECnet) || (family == AF_MPLS))
 			return -1;
 		dst->family = family;
 		dst->bytelen = 0;
@@ -463,6 +480,9 @@ int get_prefix_1(inet_prefix *dst, char *arg, int family)
 		case AF_DECnet:
 			dst->bitlen = 16;
 			break;
+		case AF_MPLS:
+			dst->bitlen = 20;
+			break;
 		default:
 		case AF_INET:
 			dst->bitlen = 32;
@@ -630,6 +650,8 @@ const char *rt_addr_n2a(int af, const void *addr, char *buf, int buflen)
 	case AF_INET:
 	case AF_INET6:
 		return inet_ntop(af, addr, buf, buflen);
+	case AF_MPLS:
+		return mpls_ntop(af, addr, buf, buflen);
 	case AF_IPX:
 		return ipx_ntop(af, addr, buf, buflen);
 	case AF_DECnet:
-- 
2.2.1
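
For readers following the MPLS_LS_* masks in the utils.h hunk, here is a
minimal standalone sketch of how one 32-bit label stack entry is packed
and how a stack is rendered in the "100/200" form that mpls_ntop() aims
for. The lse_* names and LSE_* macros are hypothetical and not part of
iproute2; only the field layout (20-bit label, 3-bit TC, 1-bit bottom of
stack, 8-bit TTL) is taken from the patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */

/* Same field layout as the MPLS_LS_* macros in the patch. */
#define LSE_LABEL_MASK	0xFFFFF000u
#define LSE_LABEL_SHIFT	12
#define LSE_TC_MASK	0x00000E00u
#define LSE_TC_SHIFT	9
#define LSE_S_MASK	0x00000100u
#define LSE_TTL_MASK	0x000000FFu

/* Hypothetical helper: pack the four fields, return network byte order. */
static uint32_t lse_encode(uint32_t label, unsigned tc, int bos, unsigned ttl)
{
	uint32_t host = (label << LSE_LABEL_SHIFT) & LSE_LABEL_MASK;

	host |= ((uint32_t)tc << LSE_TC_SHIFT) & LSE_TC_MASK;
	host |= bos ? LSE_S_MASK : 0;
	host |= ttl & LSE_TTL_MASK;
	return htonl(host);
}

/* Extract the 20-bit label from an entry in network byte order. */
static uint32_t lse_label(uint32_t wire)
{
	return (ntohl(wire) & LSE_LABEL_MASK) >> LSE_LABEL_SHIFT;
}

/* Is the bottom-of-stack bit set? */
static int lse_bos(uint32_t wire)
{
	return (ntohl(wire) & LSE_S_MASK) != 0;
}

/*
 * Hypothetical helper: format up to max entries as "l1/l2/...".
 * Returns 0 on success, -1 if the buffer is too small or no
 * bottom-of-stack entry is found within max entries.
 */
static int lse_stack_format(const uint32_t *stack, unsigned max,
			    char *buf, size_t len)
{
	size_t used = 0;
	unsigned i;

	for (i = 0; i < max; i++) {
		int n = snprintf(buf + used, len - used, "%s%u",
				 i ? "/" : "", (unsigned)lse_label(stack[i]));

		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output would be truncated */
		used += n;
		if (lse_bos(stack[i]))
			return 0;
	}
	return -1;
}
```

Checking the snprintf() return value against the remaining space before
advancing keeps the remaining-length counter from wrapping when the
buffer is short.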

* Re: [PATCH net-next 1/8] mpls: Refactor how the mpls module is built
  2015-02-25 17:13 ` [PATCH net-next 1/8] mpls: Refactor how the mpls module is built Eric W. Biederman
@ 2015-02-26  2:05   ` Simon Horman
  2015-02-26  2:15     ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Simon Horman @ 2015-02-26  2:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

On Wed, Feb 25, 2015 at 11:13:02AM -0600, Eric W. Biederman wrote:
> 
> This refactoring is needed to allow more than just mpls gso support to
> be built into the mpls moddule.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  net/Makefile     |  2 +-
>  net/mpls/Kconfig | 18 +++++++++++++++++-
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/net/Makefile b/net/Makefile
> index 38704bdf941a..3995613e5510 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -69,7 +69,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
>  obj-$(CONFIG_NFC)		+= nfc/
>  obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
>  obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
> -obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
> +obj-$(CONFIG_MPLS)		+= mpls/
>  obj-$(CONFIG_HSR)		+= hsr/
>  ifneq ($(CONFIG_NET_SWITCHDEV),)
>  obj-y				+= switchdev/
> diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
> index 37421db88965..a77fbcdd04ee 100644
> --- a/net/mpls/Kconfig
> +++ b/net/mpls/Kconfig
> @@ -1,9 +1,25 @@
>  #
>  # MPLS configuration
>  #
> +
> +menuconfig MPLS
> +	tristate "MultiProtocol Label Switching"
> +	default n
> +	---help---
> +	  MultiProtocol Label Switching routes packets through logical
> +	  circuits.  Originally conceved as a way of routing packets at
> +	  hardware speeds (before hardware was capable of routing ipv4 packets),
> +	  MPLS remains as simple way of making tunnels.
> +
> +	  If you have not heard of MPLS you probably want to say N here.
> +
> +if MPLS
> +
>  config NET_MPLS_GSO
> -	tristate "MPLS: GSO support"
> +	bool "MPLS: GSO support"
>  	help
>  	 This is helper module to allow segmentation of non-MPLS GSO packets
>  	 that have had MPLS stack entries pushed onto them and thus
>  	 become MPLS GSO packets.
> +
> +endif # MPLS

Is the implication here that MPLS must be selected to allow NET_MPLS_GSO to
be selected? That is, if NET_MPLS_GSO is to be used to handle MPLS packets
emitted by OVS, does MPLS now also need to be selected?

* Re: [PATCH net-next 1/8] mpls: Refactor how the mpls module is built
  2015-02-26  2:05   ` Simon Horman
@ 2015-02-26  2:15     ` Eric W. Biederman
  2015-02-26  2:28       ` Simon Horman
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-26  2:15 UTC (permalink / raw)
  To: Simon Horman; +Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

Simon Horman <horms@verge.net.au> writes:

> On Wed, Feb 25, 2015 at 11:13:02AM -0600, Eric W. Biederman wrote:
>> 
>> This refactoring is needed to allow more than just mpls gso support to
>> be built into the mpls moddule.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  net/Makefile     |  2 +-
>>  net/mpls/Kconfig | 18 +++++++++++++++++-
>>  2 files changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/Makefile b/net/Makefile
>> index 38704bdf941a..3995613e5510 100644
>> --- a/net/Makefile
>> +++ b/net/Makefile
>> @@ -69,7 +69,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
>>  obj-$(CONFIG_NFC)		+= nfc/
>>  obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
>>  obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
>> -obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
>> +obj-$(CONFIG_MPLS)		+= mpls/
>>  obj-$(CONFIG_HSR)		+= hsr/
>>  ifneq ($(CONFIG_NET_SWITCHDEV),)
>>  obj-y				+= switchdev/
>> diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
>> index 37421db88965..a77fbcdd04ee 100644
>> --- a/net/mpls/Kconfig
>> +++ b/net/mpls/Kconfig
>> @@ -1,9 +1,25 @@
>>  #
>>  # MPLS configuration
>>  #
>> +
>> +menuconfig MPLS
>> +	tristate "MultiProtocol Label Switching"
>> +	default n
>> +	---help---
>> +	  MultiProtocol Label Switching routes packets through logical
>> +	  circuits.  Originally conceved as a way of routing packets at
>> +	  hardware speeds (before hardware was capable of routing ipv4 packets),
>> +	  MPLS remains as simple way of making tunnels.
>> +
>> +	  If you have not heard of MPLS you probably want to say N here.
>> +
>> +if MPLS
>> +
>>  config NET_MPLS_GSO
>> -	tristate "MPLS: GSO support"
>> +	bool "MPLS: GSO support"
>>  	help
>>  	 This is helper module to allow segmentation of non-MPLS GSO packets
>>  	 that have had MPLS stack entries pushed onto them and thus
>>  	 become MPLS GSO packets.
>> +
>> +endif # MPLS
>
> Is the implication here that MPLS must be selected to allow NET_MPLS_GSO to
> be selected? That is if NET_MPLS_GSO is to be used to handle MPLS packets
> emitted by OVS then now MPLS also needs to be selected?

Yes.

That is the way we seem to handle this for other protocols and I could
not see an easy way to build multiple modules from a single Makefile.

I am a tad afraid that this Kconfig clause will cause problems with
make oldconfig as it stands, and I will be happy to take suggestions
on how to do this better.

The other MPLS bits that are added in the following patches are behind
their own Kconfig options, so there is no danger of getting more MPLS
code than desired.

Eric

* Re: [PATCH net-next 1/8] mpls: Refactor how the mpls module is built
  2015-02-26  2:15     ` Eric W. Biederman
@ 2015-02-26  2:28       ` Simon Horman
  0 siblings, 0 replies; 88+ messages in thread
From: Simon Horman @ 2015-02-26  2:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

On Wed, Feb 25, 2015 at 08:15:43PM -0600, Eric W. Biederman wrote:
> Simon Horman <horms@verge.net.au> writes:
> 
> > On Wed, Feb 25, 2015 at 11:13:02AM -0600, Eric W. Biederman wrote:
> >> 
> >> This refactoring is needed to allow more than just mpls gso support to
> >> be built into the mpls moddule.
> >> 
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  net/Makefile     |  2 +-
> >>  net/mpls/Kconfig | 18 +++++++++++++++++-
> >>  2 files changed, 18 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/net/Makefile b/net/Makefile
> >> index 38704bdf941a..3995613e5510 100644
> >> --- a/net/Makefile
> >> +++ b/net/Makefile
> >> @@ -69,7 +69,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
> >>  obj-$(CONFIG_NFC)		+= nfc/
> >>  obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
> >>  obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
> >> -obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
> >> +obj-$(CONFIG_MPLS)		+= mpls/
> >>  obj-$(CONFIG_HSR)		+= hsr/
> >>  ifneq ($(CONFIG_NET_SWITCHDEV),)
> >>  obj-y				+= switchdev/
> >> diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
> >> index 37421db88965..a77fbcdd04ee 100644
> >> --- a/net/mpls/Kconfig
> >> +++ b/net/mpls/Kconfig
> >> @@ -1,9 +1,25 @@
> >>  #
> >>  # MPLS configuration
> >>  #
> >> +
> >> +menuconfig MPLS
> >> +	tristate "MultiProtocol Label Switching"
> >> +	default n
> >> +	---help---
> >> +	  MultiProtocol Label Switching routes packets through logical
> >> +	  circuits.  Originally conceved as a way of routing packets at
> >> +	  hardware speeds (before hardware was capable of routing ipv4 packets),
> >> +	  MPLS remains as simple way of making tunnels.
> >> +
> >> +	  If you have not heard of MPLS you probably want to say N here.
> >> +
> >> +if MPLS
> >> +
> >>  config NET_MPLS_GSO
> >> -	tristate "MPLS: GSO support"
> >> +	bool "MPLS: GSO support"
> >>  	help
> >>  	 This is helper module to allow segmentation of non-MPLS GSO packets
> >>  	 that have had MPLS stack entries pushed onto them and thus
> >>  	 become MPLS GSO packets.
> >> +
> >> +endif # MPLS
> >
> > Is the implication here that MPLS must be selected to allow NET_MPLS_GSO to
> > be selected? That is if NET_MPLS_GSO is to be used to handle MPLS packets
> > emitted by OVS then now MPLS also needs to be selected?
> 
> Yes.
> 
> That is the way we seem to handle this for other protocols and I could
> not see an easy way to build multiple modules from a single Makefile.
> 
> I am a tad afraid that this Kconfig clause will cause problems with
> make oldconfig as it stands, and I will be happy to take suggestions
> on how to do this better.
> 
> The other MPLS bits that are added in the following patches are behind
> their own Kconfig options so there is no danger in getting more MPLS
> code than desired.

Thanks, that part seems reasonable to me.

I'm also unsure of a better way to handle this.

Reviewed-by: Simon Horman <horms@verge.net.au>

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (8 preceding siblings ...)
  2015-02-25 17:37 ` [PATCH iproute2] mpls: Add basic mpls support to iproute Eric W. Biederman
@ 2015-02-26  6:58 ` roopa
  2015-02-27 21:21 ` David Miller
  10 siblings, 0 replies; 88+ messages in thread
From: roopa @ 2015-02-26  6:58 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev, Stephen Hemminger, santiago

On 2/25/15, 9:09 AM, Eric W. Biederman wrote:
> While trying to figure out what MPLS is and why MPLS support is not in
> the kernel on a lark I sat down and wrote an MPLS implemenation, so I
> could answer those questions for myself.
>
>  From what I can tell the short answer is MPLS is trivial-simple and the
> we don't have an in-kernel implementation because no one has sat down
> and done the work to have a good mergable implementation.

>
> MPLS has it's good sides and it's bad sides but at the end of the day
> MPLS has users, and having an in-kernel implementation should help us
> understand MPLS and focus our conversations dealing with MPLS and
> VRFs.

very much agree.
>
> Having MPLS in our toolkit as the entire world begins playing with
> overlay networks aka ``network virtualization'' to support VM and
> container migration seems appropriate as MPLS is the historical solution
> to this problem.
>
> Constructive criticism about the netlink interface is especially
> appreciated.  Hopefully we can have at least one protocol in the kernel
> where the netlink interface doesn't have nasty corner case.
>
> As for linux users.  The conversations I had at netdev01 this sounds
> like a case of if I build it people will use the code.
ack again.

Thanks, Eric!

* Re: [PATCH net-next 7/8] mpls: Multicast route table change notifications
  2015-02-25 17:19 ` [PATCH net-next 7/8] mpls: Multicast route table change notifications Eric W. Biederman
@ 2015-02-26  7:21   ` roopa
  2015-02-26 14:03     ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: roopa @ 2015-02-26  7:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, Stephen Hemminger, santiago, Vivek Venkatraman

On 2/25/15, 9:19 AM, Eric W. Biederman wrote:
> Unlike IPv4 this code notifies on all cases where mpls routes
> are added or removed as that was the simplest to implement.
>
> In particular routes being removed because a network interface
> goes down or is removed are notified about.  Are there technical
> arguments for handling this differently ? Userspace developers
> don't particularly like the way IPv4 handles route removal
> on ifdown.
That is true. However, from previous emails on this topic on netdev,
there is no reason to notify userspace of these deletes, thereby creating
a notification storm, when userspace can figure this out on its own. That
seems like a valid reason.
(Your approach resembles IPv6, which does generate these notifications,
and userspace is usually happy with this.)

Thanks.

* Re: [PATCH net-next 7/8] mpls: Multicast route table change notifications
  2015-02-26  7:21   ` roopa
@ 2015-02-26 14:03     ` Eric W. Biederman
  2015-02-26 15:12       ` roopa
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-26 14:03 UTC (permalink / raw)
  To: roopa
  Cc: David Miller, netdev, Stephen Hemminger, santiago, Vivek Venkatraman

roopa <roopa@cumulusnetworks.com> writes:

> On 2/25/15, 9:19 AM, Eric W. Biederman wrote:
>> Unlike IPv4 this code notifies on all cases where mpls routes
>> are added or removed as that was the simplest to implement.
>>
>> In particular routes being removed because a network interface
>> goes down or is removed are notified about.  Are there technical
>> arguments for handling this differently ? Userspace developers
>> don't particularly like the way IPv4 handles route removal
>> on ifdown.
> that is true. However, from previous emails on this topic on netdev,
> there is no reason to notify these deletes to userspace thereby creating a
> notification storm
> when userspace can figure this out. Which seems like a valid reason.
> (Your approach resembles IPv6 which does generate these notifications and
> userspace is usually happy with this).

Grr.  There is an even better way to do this.

The semantically best way to handle this is to simply not use routes for
forwarding when the network interface is down, the carrier is down, or
the network device has gone away.

Apparently there are some multi-path scenarios that already do this
legitimately, and routes going away automatically can cause userspace
other kinds of problems.

In MPLS I especially don't want to free the routing table slot until I
know that the change has propagated in the network and I can be
reasonably confident that no-one will send me traffic on that label.
Otherwise there is a chance the label will be reused too soon.
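
One way to picture the concern about reusing a label too soon is a slot
that stops forwarding at once but stays reserved for a while. The sketch
below is only an illustration under assumptions: a fixed hold-down
interval stands in for "the change has propagated", and the label_slot
type and slot_* functions are hypothetical, not code from this series.

```c
#include <stdbool.h>
#include <stdint.h>

enum slot_state { SLOT_FREE, SLOT_ACTIVE, SLOT_DRAINING };

/* Hypothetical per-label slot in a label table. */
struct label_slot {
	enum slot_state state;
	uint64_t reusable_at;	/* meaningful only while SLOT_DRAINING */
};

/* Stop forwarding now, but keep the slot reserved for hold_down ticks. */
static void slot_withdraw(struct label_slot *s, uint64_t now,
			  uint64_t hold_down)
{
	if (s->state != SLOT_ACTIVE)
		return;
	s->state = SLOT_DRAINING;
	s->reusable_at = now + hold_down;
}

/* Only active slots are used for forwarding. */
static bool slot_forwards(const struct label_slot *s)
{
	return s->state == SLOT_ACTIVE;
}

/* Claim the slot for a new route; fails while the hold-down is running. */
static bool slot_claim(struct label_slot *s, uint64_t now)
{
	if (s->state == SLOT_DRAINING && now >= s->reusable_at)
		s->state = SLOT_FREE;
	if (s->state != SLOT_FREE)
		return false;
	s->state = SLOT_ACTIVE;
	return true;
}
```

Separating "no longer forwards" from "may be reallocated" is the point
of the sketch; how the real code decides that peers have stopped using
the label is left open here.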

Grumble.  That is a code change I need to make.  Grumble.

I also need to look and see if those multi-path scenarios report a next
hop as dead or just rely on the network interface state (which I think
they do) as sufficient information relayed to userspace.

Eric

* Re: [PATCH net-next 7/8] mpls: Multicast route table change notifications
  2015-02-26 14:03     ` Eric W. Biederman
@ 2015-02-26 15:12       ` roopa
  2015-03-05  1:56         ` Andy Gospodarek
  0 siblings, 1 reply; 88+ messages in thread
From: roopa @ 2015-02-26 15:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, Stephen Hemminger, santiago, Vivek Venkatraman

On 2/26/15, 6:03 AM, Eric W. Biederman wrote:
> roopa <roopa@cumulusnetworks.com> writes:
>
>> On 2/25/15, 9:19 AM, Eric W. Biederman wrote:
>>> Unlike IPv4 this code notifies on all cases where mpls routes
>>> are added or removed as that was the simplest to implement.
>>>
>>> In particular routes being removed because a network interface
>>> goes down or is removed are notified about.  Are there technical
>>> arguments for handling this differently ? Userspace developers
>>> don't particularly like the way IPv4 handles route removal
>>> on ifdown.
>> that is true. However, from previous emails on this topic on netdev,
>> there is no reason to notify these deletes to userspace thereby creating a
>> notification storm
>> when userspace can figure this out. Which seems like a valid reason.
>> (Your approach resembles IPv6 which does generate these notifications and
>> userspace is usually happy with this).
> Grr.  There is an even better way to do this.
>
> The semantically best way to handle this is to simply not use routes for
> forwarding where the network inteface is down, the carrier is down, or
> the network device has gone away for forwarding.

Agreed, and we have an internal patch that does this for regular routing
on carrier down (which we will upstream soon).
>
> Apparently there are some multi-path scenearios that already do this
> legitimately, and routes going away auto-matically can cause userspace
> other kinds of problems.
>
> In MPLS I especially don't want to free the routing table slot until I
> know that the change has propagated in the network and I can be
> reasonably confident that no-one will send me traffic on that label.
> Otherwise there is a chance the label will be reused too soon.
ack
>
> Grumble.  That is a code change I need to make.  Grumble.
>
> I also need to look and see if those multi-path scenarios report a next
> hop as dead or just rely on the network interface state (which I think
> it is) to be sufficient information relayed to userspace
>
They are marked DEAD on ifdown today (AFAIR they don't generate a
notification in IPv4) and are skipped during route lookup.
Only when all the nexthops in a multi-path route are dead is the
multipath route declared dead and deleted (with no notification to
userspace in the IPv4 case).
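
The behaviour described above can be sketched roughly as follows. This
is a simplified model, not the kernel's code: NH_DEAD is a hypothetical
stand-in modelled on the RTNH_F_DEAD flag, and the mp_route type and
route_* functions are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define NH_DEAD 0x1u	/* hypothetical; modelled on RTNH_F_DEAD */

struct nexthop {
	int ifindex;
	unsigned flags;
};

struct mp_route {
	struct nexthop *nh;
	size_t count;
};

/* On ifdown, mark every nexthop that uses the interface as dead. */
static void route_ifdown(struct mp_route *rt, int ifindex)
{
	size_t i;

	for (i = 0; i < rt->count; i++)
		if (rt->nh[i].ifindex == ifindex)
			rt->nh[i].flags |= NH_DEAD;
}

/* Pick among the live nexthops only; NULL when none is left. */
static const struct nexthop *route_select(const struct mp_route *rt,
					  unsigned hash)
{
	size_t alive = 0, pick, i;

	for (i = 0; i < rt->count; i++)
		if (!(rt->nh[i].flags & NH_DEAD))
			alive++;
	if (!alive)
		return NULL;

	pick = hash % alive;
	for (i = 0; i < rt->count; i++) {
		if (rt->nh[i].flags & NH_DEAD)
			continue;
		if (pick-- == 0)
			return &rt->nh[i];
	}
	return NULL;
}

/* The route as a whole is dead only when every nexthop is dead. */
static bool route_is_dead(const struct mp_route *rt)
{
	return route_select(rt, 0) == NULL;
}
```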

Thanks,
Roopa

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
                   ` (9 preceding siblings ...)
  2015-02-26  6:58 ` [PATCH net-next 0/8] Basic MPLS support roopa
@ 2015-02-27 21:21 ` David Miller
  2015-02-28  0:58   ` Eric W. Biederman
  10 siblings, 1 reply; 88+ messages in thread
From: David Miller @ 2015-02-27 21:21 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, roopa, stephen, santiago

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 25 Feb 2015 11:09:23 -0600

> While trying to figure out what MPLS is and why MPLS support is not in
> the kernel on a lark I sat down and wrote an MPLS implemenation, so I
> could answer those questions for myself.
> 
> From what I can tell the short answer is MPLS is trivial-simple and the
> we don't have an in-kernel implementation because no one has sat down
> and done the work to have a good mergable implementation.
> 
> MPLS has it's good sides and it's bad sides but at the end of the day
> MPLS has users, and having an in-kernel implementation should help us
> understand MPLS and focus our conversations dealing with MPLS and
> VRFs.
> 
> Having MPLS in our toolkit as the entire world begins playing with
> overlay networks aka ``network virtualization'' to support VM and
> container migration seems appropriate as MPLS is the historical solution
> to this problem.
> 
> Constructive criticism about the netlink interface is especially
> appreciated.  Hopefully we can have at least one protocol in the kernel
> where the netlink interface doesn't have nasty corner case.
> 
> As for linux users.  The conversations I had at netdev01 this sounds
> like a case of if I build it people will use the code.

At a high level I have no objections to this work and I'm in fact
extremely happy to see someone working on this.

However I would ask you to reconsider the neighbour handling issue.

It seems to me that routing daemons are going to more naturally work
with ipv4 addresses as MPLS next hops, and therefore when that's the
case we should too.

Why?

Because then the neighbour layer handles failover transparently for
you.

Think about it, if we have a case where some other resolving mechanism
would be used for MPLS nexthops, there would need to be some kind of
fail over handling mechanism for it as well.

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-02-27 21:21 ` David Miller
@ 2015-02-28  0:58   ` Eric W. Biederman
  2015-03-02  0:05     ` Shrijeet Mukherjee
  2015-03-02  4:03     ` David Miller
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-02-28  0:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, stephen, santiago

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Wed, 25 Feb 2015 11:09:23 -0600
>
>> While trying to figure out what MPLS is and why MPLS support is not in
>> the kernel on a lark I sat down and wrote an MPLS implemenation, so I
>> could answer those questions for myself.
>> 
>> From what I can tell the short answer is MPLS is trivial-simple and the
>> we don't have an in-kernel implementation because no one has sat down
>> and done the work to have a good mergable implementation.
>> 
>> MPLS has it's good sides and it's bad sides but at the end of the day
>> MPLS has users, and having an in-kernel implementation should help us
>> understand MPLS and focus our conversations dealing with MPLS and
>> VRFs.
>> 
>> Having MPLS in our toolkit as the entire world begins playing with
>> overlay networks aka ``network virtualization'' to support VM and
>> container migration seems appropriate as MPLS is the historical solution
>> to this problem.
>> 
>> Constructive criticism about the netlink interface is especially
>> appreciated.  Hopefully we can have at least one protocol in the kernel
>> where the netlink interface doesn't have nasty corner case.
>> 
>> As for linux users.  The conversations I had at netdev01 this sounds
>> like a case of if I build it people will use the code.
>
> At a high level I have no objections to this work and I'm in fact
> extremely happy to see someone working on this.

Thank you.  That statement alone I think is enough to ensure that
someone completes this work.

> However I would ask you to reconsider the neighbour handling issue.
>
> It seems to me that routing daemons are going to more naturally work
> with ipv4 addresses as MPLS next hops, and therefore when that's the
> case we should too.
>
> Why?
>
> Because then the neighbour layer handles failover transparently for
> you.

I have no objection to using the neighbour table for ipv4 or ipv6
next hops.  I simply did not implement them out of expediency.

Part of that expediency was the realization that waiting for neighbour
resolution before transmitting packets requires that the packets have dst
entries, something that is not otherwise required.  That seems to add
a noticeable amount of complexity to the forwarding code.  If nothing
else I have to manage dst objects and their packet-specific lifetimes.

There is also my experience in router contexts that says arp or
neighbour discovery is usually the last thing to know (short of
gratuitous arps) that a neighbour has failed.  So some other protocol
is needed to detect failure.

At the same time, if you have a static configuration, arp or ipv6
neighbour discovery is the only thing you have, so those protocols
definitely have some value.

I think to properly handle ipv4 and ipv6 next hops I would need to pull
the neighbour cache apart and put it back together again while
reexamining all of its assumptions about which things are a good idea to
optimize.   That feels like more work in benchmarking etc than the MPLS
code has been so far.  

Little details make a big difference, especially this question: when
we are caching the link-layer header, do we take a performance hit if we
don't cache the protocol type in the cached link-layer header?   Upon
that question revolves the effort of refactoring the neighbour cache to
support multiple protocol types.  There are other questions, such as is
there actually a benefit in caching the link-layer header?

> Think about it, if we have a case where some other resolving mechanism
> would be used for MPLS nexthops, there would need to be some kind of
> fail over handling mechanism for it as well.

Good question.  What I know for certain is that the MPLS-TP
specification does not use IPv4 or IPv6 next hops.  I think in those use
cases some of the next hops don't actually have link-layer addresses,
and I expect some of them are designed to be used on machines where the
control plane and the data plane are separate interfaces.  Which
suggests that there would not be any next hop resolution as we are
familiar with it, in the case of ethernet and related networks.

I don't know if any of those weird cases apply to Linux.  That is I
don't know if anyone would ever connect one of those weird MPLS users
as a nexthop to an MPLS speaking linux box.

So I don't know that the usual conditions do not apply or if we would
ever actually need a Link-Layer Gateway address.  I just know that
coding MPLS in that way seemed much simpler, easier and more performant
than figuring out the neighbor cache.

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* RE: [PATCH net-next 0/8] Basic MPLS support
  2015-02-28  0:58   ` Eric W. Biederman
@ 2015-03-02  0:05     ` Shrijeet Mukherjee
  2015-03-02  4:03     ` David Miller
  1 sibling, 0 replies; 88+ messages in thread
From: Shrijeet Mukherjee @ 2015-03-02  0:05 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller; +Cc: netdev, Roopa Prabhu, stephen, santiago

>There is also my experience in router contexts that says arp or neighbour
>discovery is usually the last thing to know (short of gratuitous arps)
>that a neighbour has failed.  So some other protocol is needed to detect
>failure.

Definitely a longer discussion is needed here for the router/switch use
case. Our current position on this (not unique) is to let the protocol
manager which handles the neighbor state machine manage the need or lack
thereof of the GARPs.

This also came up in the L3 offload/switchdev side of the discussion,
where the need may be even more pronounced.

This has led to the existence of mechanisms like BFD, which in my opinion
cause more problems than they solve. But it sounds like some sunlight on
mechanisms that can be used here may have far reaching benefits.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-02-28  0:58   ` Eric W. Biederman
  2015-03-02  0:05     ` Shrijeet Mukherjee
@ 2015-03-02  4:03     ` David Miller
  2015-03-02  5:10       ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-02  4:03 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, roopa, stephen, santiago

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 27 Feb 2015 18:58:09 -0600

> Part of that expediency was the realization that waiting for neighbour
> resolution before transmitting packets requires the packets have dst
> entries.  Something that is not otherwise required.  That seems to add
> a noticeable amount of complexity to the forwarding code.  If nothing
> else I have to manage dst objects and their packet specific lifetimes.

There is no requirement as such, in fact you can use your MPLS
forwarding frames to trigger neighbour resolution.

You just put IPv4/IPv6 addresses in your mpls routes, and then
at transmit time:

	rcu_read_lock();
	n = __ipv4_neigh_lookup_noref(&arp_tbl, &mpls_route->v4addr, dev, false);
	if (unlikely(!n))
		n = __neigh_create(&arp_tbl, &mpls_route->v4addr, dev, false);
	if (!IS_ERR(n)) {
		const struct hh_cache *hh = &n->hh;

		if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
			return neigh_hh_output(hh, skb);
		else
			return n->output(n, skb);
	}
	rcu_read_unlock();

> I think to properly handle ipv4 and ipv6 next hops I would need to pull
> the neighbour cache apart and put it back together again while
> reexamining all of its assumptions about which things are a good idea to
> optimize.   That feels like more work in benchmarking etc than the MPLS
> code has been so far.  

No you don't, the neigh state machine is built to properly handle
everything, see above.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-03-02  4:03     ` David Miller
@ 2015-03-02  5:10       ` Eric W. Biederman
  2015-03-02  5:53         ` David Miller
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  5:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, stephen, santiago

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 27 Feb 2015 18:58:09 -0600
>
>> Part of that expediency was the realization that waiting for neighbour
>> resolution before transmitting packets requires the packets have dst
>> entries.  Something that is not otherwise required.  That seems to add
>> a noticeable amount of complexity to the forwarding code.  If nothing
>> else I have to manage dst objects and their packet specific lifetimes.
>
> There is no requirement as such, in fact you can use your MPLS
> forwarding frames to trigger neighbour resolution.
>
> You just put IPv4/IPv6 addresses in your mpls routes, and then
> at transmit time:
>
> 	rcu_read_lock();
> 	n = __ipv4_neigh_lookup_noref(&arp_tbl, &mpls_route->v4addr, dev, false);
> 	if (unlikely(!n))
> 		n = __neigh_create(&arp_tbl, &mpls_route->v4addr, dev, false);
> 	if (!IS_ERR(n)) {
> 		const struct hh_cache *hh = &n->hh;
>
> 		if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
> 			return neigh_hh_output(hh, skb);
> 		else
> 			return n->output(n, skb);
> 	}
> 	rcu_read_unlock();

Which fails miserably.  neigh_hh_output will use an ethertype of
ETH_P_IP from the cached header, instead of a header type of
ETH_P_MPLS_UC from skb->protocol.

Just using n->output is better but if you look at neigh_resolve_output
frames without a dst entry will be dropped.
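
The ethertype problem can be modelled in user space.  What follows is
only a sketch under stated assumptions: struct cached_hdr,
build_cached_hdr(), emit_with_cache() and emit_per_packet() are
hypothetical stand-ins and not kernel interfaces; only the two
ethertype values are real.  It shows that a header cached for one
protocol is copied verbatim, so the packet's own protocol never reaches
the wire unless the header is built per packet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN       6
#define ETH_HLEN       14
#define ETH_P_IP       0x0800	/* IPv4 ethertype */
#define ETH_P_MPLS_UC  0x8847	/* MPLS unicast ethertype */

/* Hypothetical model of a cached link-layer header: the whole Ethernet
 * header, ethertype included, is built once and later copied as-is. */
struct cached_hdr {
	uint8_t bytes[ETH_HLEN];
};

void build_cached_hdr(struct cached_hdr *hh, const uint8_t *dst,
		      const uint8_t *src, uint16_t ethertype)
{
	memcpy(hh->bytes, dst, ETH_ALEN);
	memcpy(hh->bytes + ETH_ALEN, src, ETH_ALEN);
	hh->bytes[12] = (uint8_t)(ethertype >> 8);	/* network byte order */
	hh->bytes[13] = (uint8_t)(ethertype & 0xff);
}

/* Fast path: prepend the cached header.  The packet's own protocol is
 * never consulted.  Returns the ethertype that ends up in the frame. */
uint16_t emit_with_cache(const struct cached_hdr *hh, uint8_t *frame)
{
	assert(hh != NULL && frame != NULL);
	memcpy(frame, hh->bytes, ETH_HLEN);
	return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Slow path: build the header for each packet from its own protocol. */
uint16_t emit_per_packet(const uint8_t *dst, const uint8_t *src,
			 uint16_t pkt_proto, uint8_t *frame)
{
	struct cached_hdr tmp;

	build_cached_hdr(&tmp, dst, src, pkt_proto);
	return emit_with_cache(&tmp, frame);
}
```

With a header cached for ETH_P_IP, emit_with_cache() yields 0x0800 even
when the payload is an MPLS frame, while emit_per_packet() called with
ETH_P_MPLS_UC yields 0x8847.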

>> I think to properly handle ipv4 and ipv6 next hops I would need to pull
>> the neighbour cache apart and put it back together again while
>> reexamining all of its assumptions about which things are a good idea to
>> optimize.   That feels like more work in benchmarking etc than the MPLS
>> code has been so far.  
>
> No you don't, the neigh state machine is built to properly handle
> everything, see above.

The state machine is fine.  Things like hardware header caching and the
teql driver cause some interesting issues.

That said I have figured out how to sort out the neighbour cache
without touching the fast path.  (Assuming I don't try to use the
cached header).

My neighbour table patches just need a final look over and then I will
send them out.

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/8] Basic MPLS support
  2015-03-02  5:10       ` Eric W. Biederman
@ 2015-03-02  5:53         ` David Miller
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
  1 sibling, 0 replies; 88+ messages in thread
From: David Miller @ 2015-03-02  5:53 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, roopa, stephen, santiago

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 01 Mar 2015 23:10:12 -0600

> My neighbour table patches just need a final look over and then I will
> send them out.

Ok, looking forward to it.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH net-next 0/15] Neighbour table and ax25 cleanups
  2015-03-02  5:10       ` Eric W. Biederman
  2015-03-02  5:53         ` David Miller
@ 2015-03-02  5:59         ` Eric W. Biederman
  2015-03-02  5:59           ` [PATCH net-next 01/15] ax25: In ax25_rebuild_header add missing kfree_skb Eric W. Biederman
                             ` (15 more replies)
  1 sibling, 16 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  5:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


While looking at the neighbour table to see what it would take to allow
using next hops in a different address family than the current packets
I found a partial resolution for my issues and I stumbled upon some
work that makes the neighbour table code easier to understand and
maintain.

Long ago in a much younger kernel ax25 found a hack to use
dev_rebuild_header to transmit its packets instead of going through
what today is ndo_start_xmit.

When the neighbour table was rewritten into its current form the ax25
code was such a challenge that arp_broken_ops appeared in arp.c and
neigh_compat_output appeared in neighbour.c to keep the ax25 hack alive.

With a little bit of work I was able to remove some of the hack that
is the ax25 transmit path for ip packets and to isolate what remains
into a slightly more readable piece of code in ax25_ip.c.  This removes
the need for the generic code to worry about ax25 special cases.

After cleaning up the old ax25 hacks I also performed a little bit of
work on neigh_resolve_output to remove the need for a dst entry and to
ensure cached headers get a deterministic protocol value in their cached
header.   This guarantees that a cached header will not be different
depending on which protocol of packet is transmitted, and it allows
packets to be transmitted that don't have a dst entry.  There remains
a small amount of code that takes advantage of when packets have a dst
entry but that is something different.
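
To illustrate what a deterministic protocol value in the cached header
means, here is a small user-space model.  It is a sketch and not the
patch itself: struct neigh_table_model, struct neigh_model and both
hh_init_*() helpers are hypothetical names, and taking the protocol
from the neighbour table is an assumption of this sketch about where a
packet-independent value could come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP       0x0800	/* IPv4 ethertype */
#define ETH_P_MPLS_UC  0x8847	/* MPLS unicast ethertype */

/* Hypothetical per-table state, e.g. ETH_P_IP for an ARP-like table. */
struct neigh_table_model {
	uint16_t protocol;
};

struct neigh_model {
	const struct neigh_table_model *tbl;
	uint16_t cached_proto;	/* 0 means no header has been cached yet */
};

/* Packet-dependent variant: whichever packet arrives first decides
 * which protocol value is cached. */
void hh_init_from_packet(struct neigh_model *n, uint16_t pkt_proto)
{
	if (!n->cached_proto)
		n->cached_proto = pkt_proto;
}

/* Deterministic variant: the cached value depends only on the table,
 * so it is the same no matter which packet triggered the caching. */
void hh_init_from_table(struct neigh_model *n)
{
	assert(n->tbl != NULL);
	if (!n->cached_proto)
		n->cached_proto = n->tbl->protocol;
}
```

In the first variant an MPLS frame that arrives first leaves 0x8847 in
a cache that later IPv4 packets would reuse; in the second the cache
always holds the table's protocol.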

Eric W. Biederman (15):
      ax25: In ax25_rebuild_header add missing kfree_skb
      rose: Set the destination address in rose_header
      rose: Transmit packets in rose_xmit not rose_rebuild_header
      ax25/kiss: Replace ax_header_ops with ax25_header_ops
      ax25/6pack: Replace sp_header_ops with ax25_header_ops
      ax25: Make ax25_header and ax25_rebuild_header static
      ax25: Refactor to use private neighbour operations.
      arp: Remove special case to give AX25 it's open arp operations.
      neigh: Move neigh_compat_output into ax25_ip.c
      ax25: Stop calling/abusing dev_rebuild_header
      ax25: Stop depending on arp_find
      net: Kill dev_rebuild_header
      arp: Kill arp_find
      neigh: Don't require dst in neigh_hh_init
      neigh: Don't require a dst in neigh_resolve_output

 drivers/firewire/net.c                    |  13 ----
 drivers/isdn/i4l/isdn_net.c               |  33 ----------
 drivers/media/dvb-core/dvb_net.c          |   1 -
 drivers/net/arcnet/arcnet.c               |  55 ----------------
 drivers/net/hamradio/6pack.c              |  30 +--------
 drivers/net/hamradio/baycom_epp.c         |   2 +
 drivers/net/hamradio/bpqether.c           |   2 +
 drivers/net/hamradio/dmascc.c             |   2 +
 drivers/net/hamradio/hdlcdrv.c            |   2 +
 drivers/net/hamradio/mkiss.c              |  35 +---------
 drivers/net/hamradio/scc.c                |   2 +
 drivers/net/hamradio/yam.c                |   2 +
 drivers/net/ipvlan/ipvlan_main.c          |   1 -
 drivers/net/macvlan.c                     |   1 -
 drivers/net/wireless/hostap/hostap_main.c |   1 -
 include/linux/etherdevice.h               |   1 -
 include/linux/netdevice.h                 |  12 +---
 include/net/arp.h                         |   1 -
 include/net/ax25.h                        |   8 ++-
 include/net/neighbour.h                   |   2 +-
 net/802/fc.c                              |  21 ------
 net/802/fddi.c                            |  26 --------
 net/802/hippi.c                           |  28 --------
 net/8021q/vlan_dev.c                      |  35 ----------
 net/ax25/ax25_ip.c                        |  76 +++++++++++++++++-----
 net/core/neighbour.c                      |  34 ++--------
 net/decnet/dn_neigh.c                     |   1 +
 net/ethernet/eth.c                        |  34 ----------
 net/ipv4/arp.c                            | 103 +-----------------------------
 net/ipv6/ndisc.c                          |   1 +
 net/netrom/nr_dev.c                       |  31 ---------
 net/rose/rose_dev.c                       |  53 ++++-----------
 32 files changed, 105 insertions(+), 544 deletions(-)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH net-next 01/15] ax25: In ax25_rebuild_header add missing kfree_skb
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
@ 2015-03-02  5:59           ` Eric W. Biederman
  2015-03-02  6:01           ` [PATCH net-next 02/15] rose: Set the destination address in rose_header Eric W. Biederman
                             ` (14 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  5:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


In the unlikely (impossible?) event that we attempt to transmit
an ax25 packet over a non-ax25 device free the skb so we don't
leak it.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ax25/ax25_ip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index 67de6b33f2c3..db3c283821d1 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -129,6 +129,7 @@ int ax25_rebuild_header(struct sk_buff *skb)
 		dev = skb->dev;
 
 	if ((ax25_dev = ax25_dev_ax25dev(dev)) == NULL) {
+		kfree_skb(skb);
 		goto put;
 	}
 
-- 
2.2.1
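
The leak fixed above follows a general rule: a function that owns its
buffer must release it on every exit, including early ones.  A minimal
user-space sketch of that rule, assuming a hypothetical struct
buf_model with a live_buffers counter standing in for skb accounting
(none of these names exist in the kernel):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb; live_buffers counts outstanding ones. */
struct buf_model {
	int dev_is_ax25;
};

int live_buffers;

struct buf_model *buf_alloc(int dev_is_ax25)
{
	struct buf_model *b = malloc(sizeof(*b));

	if (b) {
		b->dev_is_ax25 = dev_is_ax25;
		live_buffers++;
	}
	return b;
}

void buf_free(struct buf_model *b)
{
	free(b);
	live_buffers--;
}

/* Models a function that consumes the buffer on every path. */
int rebuild_header_model(struct buf_model *b)
{
	assert(b != NULL);
	if (!b->dev_is_ax25) {
		buf_free(b);	/* the early exit must release the buffer too */
		goto put;
	}

	/* Normal path: in this model, transmitting consumes the buffer. */
	buf_free(b);
put:
	return 1;
}
```

Without the buf_free() call in the early exit, live_buffers would stay
at one after the call returns, which is the model's version of the
leak.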


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 02/15] rose: Set the destination address in rose_header
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
  2015-03-02  5:59           ` [PATCH net-next 01/15] ax25: In ax25_rebuild_header add missing kfree_skb Eric W. Biederman
@ 2015-03-02  6:01           ` Eric W. Biederman
  2015-03-02  6:02           ` [PATCH net-next 03/15] rose: Transmit packets in rose_xmit not rose_rebuild_header Eric W. Biederman
                             ` (13 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


Not setting the destination address is a bug that I suspect causes no
problems today, as only the arp code seems to call dev_hard_header and
the description I have of rose is that it is expected to be used with a
static neighbour table.

I have derived the offset and the length of the rose destination address
from rose_rebuild_header where arp_find calls neigh_ha_snapshot to set
the destination address.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/rose/rose_dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/rose/rose_dev.c b/net/rose/rose_dev.c
index 50005888be57..24d2b40b6c6b 100644
--- a/net/rose/rose_dev.c
+++ b/net/rose/rose_dev.c
@@ -41,6 +41,9 @@ static int rose_header(struct sk_buff *skb, struct net_device *dev,
 {
 	unsigned char *buff = skb_push(skb, ROSE_MIN_LEN + 2);
 
+	if (daddr)
+		memcpy(buff + 7, daddr, dev->addr_len);
+
 	*buff++ = ROSE_GFI | ROSE_Q_BIT;
 	*buff++ = 0x00;
 	*buff++ = ROSE_DATA;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 03/15] rose: Transmit packets in rose_xmit not rose_rebuild_header
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
  2015-03-02  5:59           ` [PATCH net-next 01/15] ax25: In ax25_rebuild_header add missing kfree_skb Eric W. Biederman
  2015-03-02  6:01           ` [PATCH net-next 02/15] rose: Set the destination address in rose_header Eric W. Biederman
@ 2015-03-02  6:02           ` Eric W. Biederman
  2015-03-02  6:03           ` [PATCH net-next 04/15] ax25/kiss: Replace ax_header_ops with ax25_header_ops Eric W. Biederman
                             ` (12 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


Patterned after the similar code in net/netrom this turns out
to be a trivial obviously correct transformation.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/rose/rose_dev.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/net/rose/rose_dev.c b/net/rose/rose_dev.c
index 24d2b40b6c6b..90209c1fa49b 100644
--- a/net/rose/rose_dev.c
+++ b/net/rose/rose_dev.c
@@ -59,38 +59,14 @@ static int rose_header(struct sk_buff *skb, struct net_device *dev,
 static int rose_rebuild_header(struct sk_buff *skb)
 {
 #ifdef CONFIG_INET
-	struct net_device *dev = skb->dev;
-	struct net_device_stats *stats = &dev->stats;
 	unsigned char *bp = (unsigned char *)skb->data;
-	struct sk_buff *skbn;
-	unsigned int len;
 
 	if (arp_find(bp + 7, skb)) {
 		return 1;
 	}
 
-	if ((skbn = skb_clone(skb, GFP_ATOMIC)) == NULL) {
-		kfree_skb(skb);
-		return 1;
-	}
-
-	if (skb->sk != NULL)
-		skb_set_owner_w(skbn, skb->sk);
-
-	kfree_skb(skb);
-
-	len = skbn->len;
-
-	if (!rose_route_frame(skbn, NULL)) {
-		kfree_skb(skbn);
-		stats->tx_errors++;
-		return 1;
-	}
-
-	stats->tx_packets++;
-	stats->tx_bytes += len;
 #endif
-	return 1;
+	return 0;
 }
 
 static int rose_set_mac_address(struct net_device *dev, void *addr)
@@ -137,13 +113,21 @@ static int rose_close(struct net_device *dev)
 static netdev_tx_t rose_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct net_device_stats *stats = &dev->stats;
+	unsigned int len = skb->len;
 
 	if (!netif_running(dev)) {
 		printk(KERN_ERR "ROSE: rose_xmit - called when iface is down\n");
 		return NETDEV_TX_BUSY;
 	}
-	dev_kfree_skb(skb);
-	stats->tx_errors++;
+
+	if (!rose_route_frame(skb, NULL)) {
+		dev_kfree_skb(skb);
+		stats->tx_errors++;
+		return NETDEV_TX_OK;
+	}
+
+	stats->tx_packets++;
+	stats->tx_bytes += len;
 	return NETDEV_TX_OK;
 }
 
-- 
2.2.1
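
One detail in the rose_xmit hunk is worth spelling out: the length is
read before the packet is handed to the routing function, because a
successful hand-off transfers ownership and the packet must not be
touched afterwards.  A user-space sketch of that pattern, assuming
hypothetical struct pkt_model, struct stats_model and
route_frame_model() (a stand-in that returns nonzero when it has taken
the packet):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pkt_model {
	unsigned int len;
};

struct stats_model {
	unsigned long tx_packets;
	unsigned long tx_bytes;
	unsigned long tx_errors;
};

/* Hypothetical router: on success it takes ownership of the packet and
 * returns 1; on failure it returns 0 and the caller still owns it. */
int route_frame_model(struct pkt_model *p, int routable)
{
	if (!routable)
		return 0;
	free(p);	/* consumed by the routing layer */
	return 1;
}

void xmit_model(struct pkt_model *p, struct stats_model *st, int routable)
{
	unsigned int len;

	assert(p != NULL && st != NULL);
	len = p->len;	/* read before p may be consumed */

	if (!route_frame_model(p, routable)) {
		free(p);	/* still ours on failure, so drop it here */
		st->tx_errors++;
		return;
	}

	st->tx_packets++;
	st->tx_bytes += len;	/* safe: uses the saved copy, not p->len */
}
```

Reading p->len after a successful route_frame_model() call would be a
use-after-free in this model, which is why the length is saved first.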


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 04/15] ax25/kiss: Replace ax_header_ops with ax25_header_ops
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (2 preceding siblings ...)
  2015-03-02  6:02           ` [PATCH net-next 03/15] rose: Transmit packets in rose_xmit not rose_rebuild_header Eric W. Biederman
@ 2015-03-02  6:03           ` Eric W. Biederman
  2015-03-02  6:03           ` [PATCH net-next 05/15] ax25/6pack: Replace sp_header_ops " Eric W. Biederman
                             ` (11 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


The two sets of header operations are functionally identical; remove the
duplicate definition.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/hamradio/mkiss.c | 33 +--------------------------------
 1 file changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index f990bb1c3e02..e37c8d515ce8 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -573,32 +573,6 @@ static int ax_open_dev(struct net_device *dev)
 	return 0;
 }
 
-#if defined(CONFIG_AX25) || defined(CONFIG_AX25_MODULE)
-
-/* Return the frame type ID */
-static int ax_header(struct sk_buff *skb, struct net_device *dev,
-		     unsigned short type, const void *daddr,
-		     const void *saddr, unsigned len)
-{
-#ifdef CONFIG_INET
-	if (type != ETH_P_AX25)
-		return ax25_hard_header(skb, dev, type, daddr, saddr, len);
-#endif
-	return 0;
-}
-
-
-static int ax_rebuild_header(struct sk_buff *skb)
-{
-#ifdef CONFIG_INET
-	return ax25_rebuild_header(skb);
-#else
-	return 0;
-#endif
-}
-
-#endif	/* CONFIG_{AX25,AX25_MODULE} */
-
 /* Open the low-level part of the AX25 channel. Easy! */
 static int ax_open(struct net_device *dev)
 {
@@ -662,11 +636,6 @@ static int ax_close(struct net_device *dev)
 	return 0;
 }
 
-static const struct header_ops ax_header_ops = {
-	.create    = ax_header,
-	.rebuild   = ax_rebuild_header,
-};
-
 static const struct net_device_ops ax_netdev_ops = {
 	.ndo_open            = ax_open_dev,
 	.ndo_stop            = ax_close,
@@ -682,7 +651,7 @@ static void ax_setup(struct net_device *dev)
 	dev->addr_len        = 0;
 	dev->type            = ARPHRD_AX25;
 	dev->tx_queue_len    = 10;
-	dev->header_ops      = &ax_header_ops;
+	dev->header_ops      = &ax25_header_ops;
 	dev->netdev_ops	     = &ax_netdev_ops;
 
 
-- 
2.2.1
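
The deduplication above relies on an operations table being a shared
constant that any number of devices can point at.  A small user-space
sketch of that idea, with hypothetical names throughout (struct
hdr_ops_model, struct dev_model and the two setup helpers are
illustrations, not the kernel's types):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a header operations table. */
struct hdr_ops_model {
	int (*create)(int type);
	int (*rebuild)(void);
};

struct dev_model {
	const struct hdr_ops_model *header_ops;
};

int shared_create(int type)
{
	return type;	/* placeholder behaviour for the sketch */
}

int shared_rebuild(void)
{
	return 1;	/* placeholder behaviour for the sketch */
}

/* One definition, exported once, used by every driver. */
const struct hdr_ops_model shared_header_ops = {
	.create  = shared_create,
	.rebuild = shared_rebuild,
};

/* Two drivers that previously carried identical private copies. */
void kiss_setup_model(struct dev_model *d)
{
	assert(d != NULL);
	d->header_ops = &shared_header_ops;
}

void sixpack_setup_model(struct dev_model *d)
{
	assert(d != NULL);
	d->header_ops = &shared_header_ops;
}
```

Both devices end up with the same pointer, so a later change to the
shared operations reaches every driver at once.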

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 05/15] ax25/6pack: Replace sp_header_ops with ax25_header_ops
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (3 preceding siblings ...)
  2015-03-02  6:03           ` [PATCH net-next 04/15] ax25/kiss: Replace ax_header_ops with ax25_header_ops Eric W. Biederman
@ 2015-03-02  6:03           ` Eric W. Biederman
  2015-03-02  6:04           ` [PATCH net-next 06/15] ax25: Make ax25_header and ax25_rebuild_header static Eric W. Biederman
                             ` (10 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


The two sets of header operations are functionally identical; remove
the duplicate definition.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/hamradio/6pack.c | 28 +---------------------------
 1 file changed, 1 insertion(+), 27 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index daca0dee88f3..2533933c79dc 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -284,18 +284,6 @@ static int sp_close(struct net_device *dev)
 	return 0;
 }
 
-/* Return the frame type ID */
-static int sp_header(struct sk_buff *skb, struct net_device *dev,
-		     unsigned short type, const void *daddr,
-		     const void *saddr, unsigned len)
-{
-#ifdef CONFIG_INET
-	if (type != ETH_P_AX25)
-		return ax25_hard_header(skb, dev, type, daddr, saddr, len);
-#endif
-	return 0;
-}
-
 static int sp_set_mac_address(struct net_device *dev, void *addr)
 {
 	struct sockaddr_ax25 *sa = addr;
@@ -309,20 +297,6 @@ static int sp_set_mac_address(struct net_device *dev, void *addr)
 	return 0;
 }
 
-static int sp_rebuild_header(struct sk_buff *skb)
-{
-#ifdef CONFIG_INET
-	return ax25_rebuild_header(skb);
-#else
-	return 0;
-#endif
-}
-
-static const struct header_ops sp_header_ops = {
-	.create		= sp_header,
-	.rebuild	= sp_rebuild_header,
-};
-
 static const struct net_device_ops sp_netdev_ops = {
 	.ndo_open		= sp_open_dev,
 	.ndo_stop		= sp_close,
@@ -337,7 +311,7 @@ static void sp_setup(struct net_device *dev)
 	dev->destructor		= free_netdev;
 	dev->mtu		= SIXP_MTU;
 	dev->hard_header_len	= AX25_MAX_HEADER_LEN;
-	dev->header_ops 	= &sp_header_ops;
+	dev->header_ops 	= &ax25_header_ops;
 
 	dev->addr_len		= AX25_ADDR_LEN;
 	dev->type		= ARPHRD_AX25;
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 06/15] ax25: Make ax25_header and ax25_rebuild_header static
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (4 preceding siblings ...)
  2015-03-02  6:03           ` [PATCH net-next 05/15] ax25/6pack: Replace sp_header_ops " Eric W. Biederman
@ 2015-03-02  6:04           ` Eric W. Biederman
  2015-03-02  6:05           ` [PATCH net-next 07/15] ax25: Refactor to use private neighbour operations Eric W. Biederman
                             ` (9 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


The only user is in ax25_ip.c so stop exporting these functions.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/ax25.h |  3 ---
 net/ax25/ax25_ip.c | 18 ++++++++----------
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/net/ax25.h b/include/net/ax25.h
index bf0396e9a5d3..7385a64b61b8 100644
--- a/include/net/ax25.h
+++ b/include/net/ax25.h
@@ -366,9 +366,6 @@ int ax25_kiss_rcv(struct sk_buff *, struct net_device *, struct packet_type *,
 		  struct net_device *);
 
 /* ax25_ip.c */
-int ax25_hard_header(struct sk_buff *, struct net_device *, unsigned short,
-		     const void *, const void *, unsigned int);
-int ax25_rebuild_header(struct sk_buff *);
 extern const struct header_ops ax25_header_ops;
 
 /* ax25_out.c */
diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index db3c283821d1..d93103ba8cec 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -46,9 +46,9 @@
 
 #ifdef CONFIG_INET
 
-int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
-		     unsigned short type, const void *daddr,
-		     const void *saddr, unsigned int len)
+static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
+			    unsigned short type, const void *daddr,
+			    const void *saddr, unsigned int len)
 {
 	unsigned char *buff;
 
@@ -100,7 +100,7 @@ int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
 	return -AX25_HEADER_LEN;	/* Unfinished header */
 }
 
-int ax25_rebuild_header(struct sk_buff *skb)
+static int ax25_rebuild_header(struct sk_buff *skb)
 {
 	struct sk_buff *ourskb;
 	unsigned char *bp  = skb->data;
@@ -218,14 +218,14 @@ put:
 
 #else	/* INET */
 
-int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
-		     unsigned short type, const void *daddr,
-		     const void *saddr, unsigned int len)
+static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
+			    unsigned short type, const void *daddr,
+			    const void *saddr, unsigned int len)
 {
 	return -AX25_HEADER_LEN;
 }
 
-int ax25_rebuild_header(struct sk_buff *skb)
+static int ax25_rebuild_header(struct sk_buff *skb)
 {
 	return 1;
 }
@@ -237,7 +237,5 @@ const struct header_ops ax25_header_ops = {
 	.rebuild = ax25_rebuild_header,
 };
 
-EXPORT_SYMBOL(ax25_hard_header);
-EXPORT_SYMBOL(ax25_rebuild_header);
 EXPORT_SYMBOL(ax25_header_ops);
 
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 07/15]  ax25: Refactor to use private neighbour operations.
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (5 preceding siblings ...)
  2015-03-02  6:04           ` [PATCH net-next 06/15] ax25: Make ax25_header and ax25_rebuild_header static Eric W. Biederman
@ 2015-03-02  6:05           ` Eric W. Biederman
  2015-03-02  6:06           ` [PATCH net-next 08/15] arp: Remove special case to give AX25 it's open arp operations Eric W. Biederman
                             ` (8 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


AX25 already has its own private arp cache operations to isolate
its abuse of dev_rebuild_header to transmit packets.  Add a function
ax25_neigh_construct that will allow all of the ax25 devices to
force using these operations, so that the generic arp code does
not need to.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/hamradio/6pack.c      |  2 ++
 drivers/net/hamradio/baycom_epp.c |  2 ++
 drivers/net/hamradio/bpqether.c   |  2 ++
 drivers/net/hamradio/dmascc.c     |  2 ++
 drivers/net/hamradio/hdlcdrv.c    |  2 ++
 drivers/net/hamradio/mkiss.c      |  2 ++
 drivers/net/hamradio/scc.c        |  2 ++
 drivers/net/hamradio/yam.c        |  2 ++
 include/net/ax25.h                |  5 +++++
 net/ax25/ax25_ip.c                | 21 +++++++++++++++++++++
 10 files changed, 42 insertions(+)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 2533933c79dc..0b8393ca8c80 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -302,6 +302,7 @@ static const struct net_device_ops sp_netdev_ops = {
 	.ndo_stop		= sp_close,
 	.ndo_start_xmit		= sp_xmit,
 	.ndo_set_mac_address    = sp_set_mac_address,
+	.ndo_neigh_construct	= ax25_neigh_construct,
 };
 
 static void sp_setup(struct net_device *dev)
@@ -315,6 +316,7 @@ static void sp_setup(struct net_device *dev)
 
 	dev->addr_len		= AX25_ADDR_LEN;
 	dev->type		= ARPHRD_AX25;
+	dev->neigh_priv_len	= sizeof(struct ax25_neigh_priv);
 	dev->tx_queue_len	= 10;
 
 	/* Only activated in AX.25 mode */
diff --git a/drivers/net/hamradio/baycom_epp.c b/drivers/net/hamradio/baycom_epp.c
index a98c153f371e..3539ab392f7d 100644
--- a/drivers/net/hamradio/baycom_epp.c
+++ b/drivers/net/hamradio/baycom_epp.c
@@ -1109,6 +1109,7 @@ static const struct net_device_ops baycom_netdev_ops = {
 	.ndo_do_ioctl	     = baycom_ioctl,
 	.ndo_start_xmit      = baycom_send_packet,
 	.ndo_set_mac_address = baycom_set_mac_address,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /*
@@ -1146,6 +1147,7 @@ static void baycom_probe(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 	
 	dev->type = ARPHRD_AX25;           /* AF_AX25 device */
+	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;        /* eth_mtu is the default */
 	dev->addr_len = AX25_ADDR_LEN;     /* sizeof an ax.25 address */
diff --git a/drivers/net/hamradio/bpqether.c b/drivers/net/hamradio/bpqether.c
index c2894e43840e..bce105b16ed0 100644
--- a/drivers/net/hamradio/bpqether.c
+++ b/drivers/net/hamradio/bpqether.c
@@ -469,6 +469,7 @@ static const struct net_device_ops bpq_netdev_ops = {
 	.ndo_start_xmit	     = bpq_xmit,
 	.ndo_set_mac_address = bpq_set_mac_address,
 	.ndo_do_ioctl	     = bpq_ioctl,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void bpq_setup(struct net_device *dev)
@@ -486,6 +487,7 @@ static void bpq_setup(struct net_device *dev)
 #endif
 
 	dev->type            = ARPHRD_AX25;
+	dev->neigh_priv_len  = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu             = AX25_DEF_PACLEN;
 	dev->addr_len        = AX25_ADDR_LEN;
diff --git a/drivers/net/hamradio/dmascc.c b/drivers/net/hamradio/dmascc.c
index 0fad408f24aa..abab7be77406 100644
--- a/drivers/net/hamradio/dmascc.c
+++ b/drivers/net/hamradio/dmascc.c
@@ -433,6 +433,7 @@ module_exit(dmascc_exit);
 static void __init dev_setup(struct net_device *dev)
 {
 	dev->type = ARPHRD_AX25;
+	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN;
 	dev->mtu = 1500;
 	dev->addr_len = AX25_ADDR_LEN;
@@ -447,6 +448,7 @@ static const struct net_device_ops scc_netdev_ops = {
 	.ndo_start_xmit = scc_send_packet,
 	.ndo_do_ioctl = scc_ioctl,
 	.ndo_set_mac_address = scc_set_mac_address,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static int __init setup_adapter(int card_base, int type, int n)
diff --git a/drivers/net/hamradio/hdlcdrv.c b/drivers/net/hamradio/hdlcdrv.c
index c67a27245072..435868a7b69c 100644
--- a/drivers/net/hamradio/hdlcdrv.c
+++ b/drivers/net/hamradio/hdlcdrv.c
@@ -626,6 +626,7 @@ static const struct net_device_ops hdlcdrv_netdev = {
 	.ndo_start_xmit = hdlcdrv_send_packet,
 	.ndo_do_ioctl	= hdlcdrv_ioctl,
 	.ndo_set_mac_address = hdlcdrv_set_mac_address,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /*
@@ -676,6 +677,7 @@ static void hdlcdrv_setup(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 	
 	dev->type = ARPHRD_AX25;           /* AF_AX25 device */
+	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;        /* eth_mtu is the default */
 	dev->addr_len = AX25_ADDR_LEN;     /* sizeof an ax.25 address */
diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index e37c8d515ce8..c12ec2c2b594 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -641,6 +641,7 @@ static const struct net_device_ops ax_netdev_ops = {
 	.ndo_stop            = ax_close,
 	.ndo_start_xmit	     = ax_xmit,
 	.ndo_set_mac_address = ax_set_mac_address,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void ax_setup(struct net_device *dev)
@@ -650,6 +651,7 @@ static void ax_setup(struct net_device *dev)
 	dev->hard_header_len = 0;
 	dev->addr_len        = 0;
 	dev->type            = ARPHRD_AX25;
+	dev->neigh_priv_len  = sizeof(struct ax25_neigh_priv);
 	dev->tx_queue_len    = 10;
 	dev->header_ops      = &ax25_header_ops;
 	dev->netdev_ops	     = &ax_netdev_ops;
diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index 57be9e0e98a6..b305f51eb420 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1550,6 +1550,7 @@ static const struct net_device_ops scc_netdev_ops = {
 	.ndo_set_mac_address = scc_net_set_mac_address,
 	.ndo_get_stats       = scc_net_get_stats,
 	.ndo_do_ioctl        = scc_net_ioctl,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /* ----> Initialize device <----- */
@@ -1567,6 +1568,7 @@ static void scc_net_setup(struct net_device *dev)
 	dev->flags      = 0;
 
 	dev->type = ARPHRD_AX25;
+	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;
 	dev->addr_len = AX25_ADDR_LEN;
diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index 717433cfb81d..89d9da7a0c51 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -1100,6 +1100,7 @@ static const struct net_device_ops yam_netdev_ops = {
 	.ndo_start_xmit      = yam_send_packet,
 	.ndo_do_ioctl 	     = yam_ioctl,
 	.ndo_set_mac_address = yam_set_mac_address,
+	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void yam_setup(struct net_device *dev)
@@ -1128,6 +1129,7 @@ static void yam_setup(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 
 	dev->type = ARPHRD_AX25;
+	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN;
 	dev->mtu = AX25_MTU;
 	dev->addr_len = AX25_ADDR_LEN;
diff --git a/include/net/ax25.h b/include/net/ax25.h
index 7385a64b61b8..45feeba7a325 100644
--- a/include/net/ax25.h
+++ b/include/net/ax25.h
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/atomic.h>
+#include <net/neighbour.h>
 
 #define	AX25_T1CLAMPLO  		1
 #define	AX25_T1CLAMPHI 			(30 * HZ)
@@ -366,7 +367,11 @@ int ax25_kiss_rcv(struct sk_buff *, struct net_device *, struct packet_type *,
 		  struct net_device *);
 
 /* ax25_ip.c */
+int ax25_neigh_construct(struct neighbour *neigh);
 extern const struct header_ops ax25_header_ops;
+struct ax25_neigh_priv {
+	struct neigh_ops ops;
+};
 
 /* ax25_out.c */
 ax25_cb *ax25_send_frame(struct sk_buff *, int, ax25_address *, ax25_address *,
diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index d93103ba8cec..bff12e0c9090 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -216,6 +216,22 @@ put:
 	return 1;
 }
 
+int ax25_neigh_construct(struct neighbour *neigh)
+{
+	/* This trouble could be saved if ax25 would write a proper
+	 * dev_queue_xmit function.
+	 */
+	struct ax25_neigh_priv *priv = neighbour_priv(neigh);
+
+	if (neigh->tbl->family != AF_INET)
+		return -EINVAL;
+
+	priv->ops = *neigh->ops;
+	priv->ops.output = neigh_compat_output;
+	priv->ops.connected_output = neigh_compat_output;
+	return 0;
+}
+
 #else	/* INET */
 
 static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
@@ -230,6 +246,10 @@ static int ax25_rebuild_header(struct sk_buff *skb)
 	return 1;
 }
 
+int ax25_neigh_construct(struct neighbour *neigh)
+{
+	return 0;
+}
 #endif
 
 const struct header_ops ax25_header_ops = {
@@ -238,4 +258,5 @@ const struct header_ops ax25_header_ops = {
 };
 
 EXPORT_SYMBOL(ax25_header_ops);
+EXPORT_SYMBOL(ax25_neigh_construct);
 
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 08/15] arp: Remove special case to give AX25 it's open arp operations.
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (6 preceding siblings ...)
  2015-03-02  6:05           ` [PATCH net-next 07/15] ax25: Refactor to use private neighbour operations Eric W. Biederman
@ 2015-03-02  6:06           ` Eric W. Biederman
  2015-03-02  6:07           ` [PATCH net-next 09/15] neigh: Move neigh_compat_output into ax25_ip.c Eric W. Biederman
                             ` (7 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


The special case has been pushed out into ax25_neigh_construct, so there
is no need to keep this code in arp.c.
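
For readers unfamiliar with the pattern, here is a minimal userspace sketch
of what "pushing the special case out" into a per-device constructor looks
like: the generic code installs a shared ops table, and the device hook
copies it into private storage and overrides the output entries.  All type
and function names below are hypothetical stand-ins, not the kernel's
definitions, and the final step that points ops at the private copy is an
assumption of the sketch; it does not appear in the quoted hunks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; illustration only. */
struct packet { int len; };
struct neigh;

struct neigh_ops_sketch {
	int family;
	int (*output)(struct neigh *n, struct packet *p);
	int (*connected_output)(struct neigh *n, struct packet *p);
};

struct neigh {
	const struct neigh_ops_sketch *ops;	/* shared, read-only table */
	struct neigh_ops_sketch priv_ops;	/* per-neighbour private copy */
};

enum { SKETCH_FAMILY = 2 };	/* arbitrary tag used by this sketch */

/* Generic output path: returns the packet length so callers can tell
 * which path ran.
 */
static int generic_output(struct neigh *n, struct packet *p)
{
	(void)n;
	return p->len;
}

/* Device-specific output path: returns the negated length. */
static int compat_output(struct neigh *n, struct packet *p)
{
	(void)n;
	return -p->len;
}

static const struct neigh_ops_sketch generic_ops = {
	.family		  = SKETCH_FAMILY,
	.output		  = generic_output,
	.connected_output = generic_output,
};

/* The protocol constructor no longer needs to know about odd devices. */
static void generic_constructor(struct neigh *n)
{
	n->ops = &generic_ops;
}

/* The device hook copies the shared table and overrides two entries. */
static int device_neigh_construct(struct neigh *n)
{
	assert(n->ops != NULL);

	if (n->ops->family != SKETCH_FAMILY)
		return -1;

	n->priv_ops = *n->ops;
	n->priv_ops.output = compat_output;
	n->priv_ops.connected_output = compat_output;
	n->ops = &n->priv_ops;	/* assumption: not shown in the patch */
	return 0;
}
```

Calling generic_constructor() and then device_neigh_construct() on the
same neighbour leaves the shared table untouched while that one neighbour
uses the overridden output functions.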

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/arp.c | 37 -------------------------------------
 1 file changed, 37 deletions(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 205e1472aa78..2557cf9a4648 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -149,14 +149,6 @@ static const struct neigh_ops arp_direct_ops = {
 	.connected_output =	neigh_direct_output,
 };
 
-static const struct neigh_ops arp_broken_ops = {
-	.family =		AF_INET,
-	.solicit =		arp_solicit,
-	.error_report =		arp_error_report,
-	.output =		neigh_compat_output,
-	.connected_output =	neigh_compat_output,
-};
-
 struct neigh_table arp_tbl = {
 	.family		= AF_INET,
 	.key_len	= 4,
@@ -260,35 +252,6 @@ static int arp_constructor(struct neighbour *neigh)
 		   in old paradigm.
 		 */
 
-#if 1
-		/* So... these "amateur" devices are hopeless.
-		   The only thing, that I can say now:
-		   It is very sad that we need to keep ugly obsolete
-		   code to make them happy.
-
-		   They should be moved to more reasonable state, now
-		   they use rebuild_header INSTEAD OF hard_start_xmit!!!
-		   Besides that, they are sort of out of date
-		   (a lot of redundant clones/copies, useless in 2.1),
-		   I wonder why people believe that they work.
-		 */
-		switch (dev->type) {
-		default:
-			break;
-		case ARPHRD_ROSE:
-#if IS_ENABLED(CONFIG_AX25)
-		case ARPHRD_AX25:
-#if IS_ENABLED(CONFIG_NETROM)
-		case ARPHRD_NETROM:
-#endif
-			neigh->ops = &arp_broken_ops;
-			neigh->output = neigh->ops->output;
-			return 0;
-#else
-			break;
-#endif
-		}
-#endif
 		if (neigh->type == RTN_MULTICAST) {
 			neigh->nud_state = NUD_NOARP;
 			arp_mc_map(addr, neigh->ha, dev, 1);
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 09/15] neigh: Move neigh_compat_output into ax25_ip.c
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (7 preceding siblings ...)
  2015-03-02  6:06           ` [PATCH net-next 08/15] arp: Remove special case to give AX25 it's open arp operations Eric W. Biederman
@ 2015-03-02  6:07           ` Eric W. Biederman
  2015-03-02  6:08           ` [PATCH net-next 10/15] ax25: Stop calling/abusing dev_rebuild_header Eric W. Biederman
                             ` (6 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


The only caller is now ax25_neigh_construct, so move
neigh_compat_output into ax25_ip.c, make it static, and rename it
ax25_neigh_output.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/neighbour.h |  1 -
 net/ax25/ax25_ip.c      | 18 ++++++++++++++++--
 net/core/neighbour.c    | 20 --------------------
 3 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 76f708486aae..bc66babb5f27 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -268,7 +268,6 @@ void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
 int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb);
 int neigh_connected_output(struct neighbour *neigh, struct sk_buff *skb);
-int neigh_compat_output(struct neighbour *neigh, struct sk_buff *skb);
 int neigh_direct_output(struct neighbour *neigh, struct sk_buff *skb);
 struct neighbour *neigh_event_ns(struct neigh_table *tbl,
 						u8 *lladdr, void *saddr,
diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index bff12e0c9090..cc7415b33cfb 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -216,6 +216,20 @@ put:
 	return 1;
 }
 
+static int ax25_neigh_output(struct neighbour *neigh, struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+
+	__skb_pull(skb, skb_network_offset(skb));
+
+	if (dev_hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
+			    skb->len) < 0 &&
+	    dev_rebuild_header(skb))
+		return 0;
+
+	return dev_queue_xmit(skb);
+}
+
 int ax25_neigh_construct(struct neighbour *neigh)
 {
 	/* This trouble could be saved if ax25 would write a proper
@@ -227,8 +241,8 @@ int ax25_neigh_construct(struct neighbour *neigh)
 		return -EINVAL;
 
 	priv->ops = *neigh->ops;
-	priv->ops.output = neigh_compat_output;
-	priv->ops.connected_output = neigh_compat_output;
+	priv->ops.output = ax25_neigh_output;
+	priv->ops.connected_output = ax25_neigh_output;
 	return 0;
 }
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 70fe9e10ac86..8a319ff3e8d1 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1280,26 +1280,6 @@ static void neigh_hh_init(struct neighbour *n, struct dst_entry *dst)
 	write_unlock_bh(&n->lock);
 }
 
-/* This function can be used in contexts, where only old dev_queue_xmit
- * worked, f.e. if you want to override normal output path (eql, shaper),
- * but resolution is not made yet.
- */
-
-int neigh_compat_output(struct neighbour *neigh, struct sk_buff *skb)
-{
-	struct net_device *dev = skb->dev;
-
-	__skb_pull(skb, skb_network_offset(skb));
-
-	if (dev_hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
-			    skb->len) < 0 &&
-	    dev_rebuild_header(skb))
-		return 0;
-
-	return dev_queue_xmit(skb);
-}
-EXPORT_SYMBOL(neigh_compat_output);
-
 /* Slow and careful. */
 
 int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 10/15] ax25: Stop calling/abusing dev_rebuild_header
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (8 preceding siblings ...)
  2015-03-02  6:07           ` [PATCH net-next 09/15] neigh: Move neigh_compat_output into ax25_ip.c Eric W. Biederman
@ 2015-03-02  6:08           ` Eric W. Biederman
  2015-03-02  6:09           ` [PATCH net-next 11/15] ax25: Stop depending on arp_find Eric W. Biederman
                             ` (5 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


- Rename ax25_rebuild_header to ax25_neigh_xmit and call it from
  ax25_neigh_output directly.  The rename is to make it clear
  that this is not a rebuild_header operation.

- Remove ax25_rebuild_header from ax25_header_ops.

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ax25/ax25_ip.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index cc7415b33cfb..08803e820f1d 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -100,7 +100,7 @@ static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
 	return -AX25_HEADER_LEN;	/* Unfinished header */
 }
 
-static int ax25_rebuild_header(struct sk_buff *skb)
+static int ax25_neigh_xmit(struct sk_buff *skb)
 {
 	struct sk_buff *ourskb;
 	unsigned char *bp  = skb->data;
@@ -224,7 +224,7 @@ static int ax25_neigh_output(struct neighbour *neigh, struct sk_buff *skb)
 
 	if (dev_hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
 			    skb->len) < 0 &&
-	    dev_rebuild_header(skb))
+	    ax25_neigh_xmit(skb))
 		return 0;
 
 	return dev_queue_xmit(skb);
@@ -255,11 +255,6 @@ static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
 	return -AX25_HEADER_LEN;
 }
 
-static int ax25_rebuild_header(struct sk_buff *skb)
-{
-	return 1;
-}
-
 int ax25_neigh_construct(struct neighbour *neigh)
 {
 	return 0;
@@ -268,7 +263,6 @@ int ax25_neigh_construct(struct neighbour *neigh)
 
 const struct header_ops ax25_header_ops = {
 	.create = ax25_hard_header,
-	.rebuild = ax25_rebuild_header,
 };
 
 EXPORT_SYMBOL(ax25_header_ops);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 11/15] ax25: Stop depending on arp_find
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (9 preceding siblings ...)
  2015-03-02  6:08           ` [PATCH net-next 10/15] ax25: Stop calling/abusing dev_rebuild_header Eric W. Biederman
@ 2015-03-02  6:09           ` Eric W. Biederman
  2015-03-02  6:11           ` [PATCH net-next 12/15] net: Kill dev_rebuild_header Eric W. Biederman
                             ` (4 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


Have ax25_neigh_output perform ordinary arp resolution before calling
ax25_neigh_xmit.

Call dev_hard_header in ax25_neigh_output with a destination address so
it will not fail, and the destination mac address will not need to be
set in ax25_neigh_xmit.

Remove arp_find from ax25_neigh_xmit (the ordinary arp resolution added
to ax25_neigh_output removes the need for calling arp_find).

Document how close ax25_neigh_output is to neigh_resolve_output.
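
The resolution step relies on the same read/retry loop around the
hardware address that neigh_resolve_output uses.  As a rough illustration
of that loop only, here is a simplified userspace sketch built on C11
atomics.  The names (ha_seq, ha_read_begin, ha_read_retry, ha_write,
ha_snapshot) and the buffer size are hypothetical, and this is not the
kernel's seqlock implementation: it uses sequentially consistent atomics
and a plain memcpy, which is enough to show the control flow but not a
substitute for the real primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define HA_LEN 7	/* arbitrary example size for the address buffer */

struct ha_seq {
	atomic_uint seq;		/* even: stable, odd: write in progress */
	unsigned char ha[HA_LEN];
};

static void ha_init(struct ha_seq *s)
{
	atomic_init(&s->seq, 0);
	memset(s->ha, 0, HA_LEN);
}

/* Wait until no writer is active and return the sequence number seen. */
static unsigned int ha_read_begin(struct ha_seq *s)
{
	unsigned int v;

	while ((v = atomic_load(&s->seq)) & 1u)
		;	/* writer active, spin */
	return v;
}

/* Non-zero means a writer ran since ha_read_begin(); the copy is stale. */
static int ha_read_retry(struct ha_seq *s, unsigned int start)
{
	return atomic_load(&s->seq) != start;
}

/* Writer: bump to odd, update the address, bump back to even. */
static void ha_write(struct ha_seq *s, const unsigned char *new_ha)
{
	assert(new_ha != NULL);

	atomic_fetch_add(&s->seq, 1);
	memcpy(s->ha, new_ha, HA_LEN);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader: redo the copy until it was taken without a concurrent write,
 * mirroring the do/while loop in the patch.
 */
static void ha_snapshot(struct ha_seq *s, unsigned char *out)
{
	unsigned int seq;

	do {
		seq = ha_read_begin(s);
		memcpy(out, s->ha, HA_LEN);
	} while (ha_read_retry(s, seq));
}
```

In the patch the body of the loop builds the link-layer header from
neigh->ha rather than copying it out, which is why the skb is pulled back
to the network header at the top of every iteration.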

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ax25/ax25_ip.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index 08803e820f1d..e030c64ebfb7 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -115,9 +115,6 @@ static int ax25_neigh_xmit(struct sk_buff *skb)
 	dst = (ax25_address *)(bp + 1);
 	src = (ax25_address *)(bp + 8);
 
-	if (arp_find(bp + 1, skb))
-		return 1;
-
 	route = ax25_get_route(dst, NULL);
 	if (route) {
 		digipeat = route->digipeat;
@@ -218,16 +215,35 @@ put:
 
 static int ax25_neigh_output(struct neighbour *neigh, struct sk_buff *skb)
 {
-	struct net_device *dev = skb->dev;
-
-	__skb_pull(skb, skb_network_offset(skb));
-
-	if (dev_hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
-			    skb->len) < 0 &&
-	    ax25_neigh_xmit(skb))
-		return 0;
+	/* Except for calling ax25_neigh_xmit instead of
+	 * dev_queue_xmit this is neigh_resolve_output.
+	 */
+	int rc = 0;
+
+	if (!neigh_event_send(neigh, skb)) {
+		int err;
+		struct net_device *dev = neigh->dev;
+		unsigned int seq;
+
+		do {
+			__skb_pull(skb, skb_network_offset(skb));
+			seq = read_seqbegin(&neigh->ha_lock);
+			err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+					      neigh->ha, NULL, skb->len);
+		} while (read_seqretry(&neigh->ha_lock, seq));
+
+		if (err >= 0) {
+			ax25_neigh_xmit(skb);
+		} else
+			goto out_kfree_skb;
+	}
+out:
+	return rc;
 
-	return dev_queue_xmit(skb);
+out_kfree_skb:
+	rc = -EINVAL;
+	kfree_skb(skb);
+	goto out;
 }
 
 int ax25_neigh_construct(struct neighbour *neigh)
-- 
2.2.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 12/15] net: Kill dev_rebuild_header
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (10 preceding siblings ...)
  2015-03-02  6:09           ` [PATCH net-next 11/15] ax25: Stop depending on arp_find Eric W. Biederman
@ 2015-03-02  6:11           ` Eric W. Biederman
  2015-03-02  6:12           ` [PATCH net-next 13/15] arp: Kill arp_find Eric W. Biederman
                             ` (3 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


Now that there are no more users, kill dev_rebuild_header and all of its
implementations.

This is long overdue.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/firewire/net.c                    | 13 --------
 drivers/isdn/i4l/isdn_net.c               | 33 -------------------
 drivers/media/dvb-core/dvb_net.c          |  1 -
 drivers/net/arcnet/arcnet.c               | 55 -------------------------------
 drivers/net/ipvlan/ipvlan_main.c          |  1 -
 drivers/net/macvlan.c                     |  1 -
 drivers/net/wireless/hostap/hostap_main.c |  1 -
 include/linux/etherdevice.h               |  1 -
 include/linux/netdevice.h                 | 12 +------
 net/802/fc.c                              | 21 ------------
 net/802/fddi.c                            | 26 ---------------
 net/802/hippi.c                           | 28 ----------------
 net/8021q/vlan_dev.c                      | 35 --------------------
 net/ethernet/eth.c                        | 34 -------------------
 net/netrom/nr_dev.c                       | 31 -----------------
 net/rose/rose_dev.c                       | 14 --------
 16 files changed, 1 insertion(+), 306 deletions(-)

diff --git a/drivers/firewire/net.c b/drivers/firewire/net.c
index 2c68da1ceeee..f4ea80d602f7 100644
--- a/drivers/firewire/net.c
+++ b/drivers/firewire/net.c
@@ -237,18 +237,6 @@ static int fwnet_header_create(struct sk_buff *skb, struct net_device *net,
 	return -net->hard_header_len;
 }
 
-static int fwnet_header_rebuild(struct sk_buff *skb)
-{
-	struct fwnet_header *h = (struct fwnet_header *)skb->data;
-
-	if (get_unaligned_be16(&h->h_proto) == ETH_P_IP)
-		return arp_find((unsigned char *)&h->h_dest, skb);
-
-	dev_notice(&skb->dev->dev, "unable to resolve type %04x addresses\n",
-		   be16_to_cpu(h->h_proto));
-	return 0;
-}
-
 static int fwnet_header_cache(const struct neighbour *neigh,
 			      struct hh_cache *hh, __be16 type)
 {
@@ -282,7 +270,6 @@ static int fwnet_header_parse(const struct sk_buff *skb, unsigned char *haddr)
 
 static const struct header_ops fwnet_header_ops = {
 	.create         = fwnet_header_create,
-	.rebuild        = fwnet_header_rebuild,
 	.cache		= fwnet_header_cache,
 	.cache_update	= fwnet_header_cache_update,
 	.parse          = fwnet_header_parse,
diff --git a/drivers/isdn/i4l/isdn_net.c b/drivers/isdn/i4l/isdn_net.c
index 94affa5e6f28..546b7e81161d 100644
--- a/drivers/isdn/i4l/isdn_net.c
+++ b/drivers/isdn/i4l/isdn_net.c
@@ -1951,38 +1951,6 @@ static int isdn_net_header(struct sk_buff *skb, struct net_device *dev,
 	return len;
 }
 
-/* We don't need to send arp, because we have point-to-point connections. */
-static int
-isdn_net_rebuild_header(struct sk_buff *skb)
-{
-	struct net_device *dev = skb->dev;
-	isdn_net_local *lp = netdev_priv(dev);
-	int ret = 0;
-
-	if (lp->p_encap == ISDN_NET_ENCAP_ETHER) {
-		struct ethhdr *eth = (struct ethhdr *) skb->data;
-
-		/*
-		 *      Only ARP/IP is currently supported
-		 */
-
-		if (eth->h_proto != htons(ETH_P_IP)) {
-			printk(KERN_WARNING
-			       "isdn_net: %s don't know how to resolve type %d addresses?\n",
-			       dev->name, (int) eth->h_proto);
-			memcpy(eth->h_source, dev->dev_addr, dev->addr_len);
-			return 0;
-		}
-		/*
-		 *      Try to get ARP to resolve the header.
-		 */
-#ifdef CONFIG_INET
-		ret = arp_find(eth->h_dest, skb);
-#endif
-	}
-	return ret;
-}
-
 static int isdn_header_cache(const struct neighbour *neigh, struct hh_cache *hh,
 			     __be16 type)
 {
@@ -2005,7 +1973,6 @@ static void isdn_header_cache_update(struct hh_cache *hh,
 
 static const struct header_ops isdn_header_ops = {
 	.create = isdn_net_header,
-	.rebuild = isdn_net_rebuild_header,
 	.cache = isdn_header_cache,
 	.cache_update = isdn_header_cache_update,
 };
diff --git a/drivers/media/dvb-core/dvb_net.c b/drivers/media/dvb-core/dvb_net.c
index 686d3277dad1..4a77cb02dffc 100644
--- a/drivers/media/dvb-core/dvb_net.c
+++ b/drivers/media/dvb-core/dvb_net.c
@@ -1190,7 +1190,6 @@ static int dvb_net_stop(struct net_device *dev)
 static const struct header_ops dvb_header_ops = {
 	.create		= eth_header,
 	.parse		= eth_header_parse,
-	.rebuild	= eth_rebuild_header,
 };
 
 
diff --git a/drivers/net/arcnet/arcnet.c b/drivers/net/arcnet/arcnet.c
index 09de683c167e..10f71c732b59 100644
--- a/drivers/net/arcnet/arcnet.c
+++ b/drivers/net/arcnet/arcnet.c
@@ -104,7 +104,6 @@ EXPORT_SYMBOL(arcnet_timeout);
 static int arcnet_header(struct sk_buff *skb, struct net_device *dev,
 			 unsigned short type, const void *daddr,
 			 const void *saddr, unsigned len);
-static int arcnet_rebuild_header(struct sk_buff *skb);
 static int go_tx(struct net_device *dev);
 
 static int debug = ARCNET_DEBUG;
@@ -312,7 +311,6 @@ static int choose_mtu(void)
 
 static const struct header_ops arcnet_header_ops = {
 	.create = arcnet_header,
-	.rebuild = arcnet_rebuild_header,
 };
 
 static const struct net_device_ops arcnet_netdev_ops = {
@@ -538,59 +536,6 @@ static int arcnet_header(struct sk_buff *skb, struct net_device *dev,
 	return proto->build_header(skb, dev, type, _daddr);
 }
 
-
-/* 
- * Rebuild the ARCnet hard header. This is called after an ARP (or in the
- * future other address resolution) has completed on this sk_buff. We now
- * let ARP fill in the destination field.
- */
-static int arcnet_rebuild_header(struct sk_buff *skb)
-{
-	struct net_device *dev = skb->dev;
-	struct arcnet_local *lp = netdev_priv(dev);
-	int status = 0;		/* default is failure */
-	unsigned short type;
-	uint8_t daddr=0;
-	struct ArcProto *proto;
-	/*
-	 * XXX: Why not use skb->mac_len?
-	 */
-	if (skb->network_header - skb->mac_header != 2) {
-		BUGMSG(D_NORMAL,
-		       "rebuild_header: shouldn't be here! (hdrsize=%d)\n",
-		       (int)(skb->network_header - skb->mac_header));
-		return 0;
-	}
-	type = *(uint16_t *) skb_pull(skb, 2);
-	BUGMSG(D_DURING, "rebuild header for protocol %Xh\n", type);
-
-	if (type == ETH_P_IP) {
-#ifdef CONFIG_INET
-		BUGMSG(D_DURING, "rebuild header for ethernet protocol %Xh\n", type);
-		status = arp_find(&daddr, skb) ? 1 : 0;
-		BUGMSG(D_DURING, " rebuilt: dest is %d; protocol %Xh\n",
-		       daddr, type);
-#endif
-	} else {
-		BUGMSG(D_NORMAL,
-		       "I don't understand ethernet protocol %Xh addresses!\n", type);
-		dev->stats.tx_errors++;
-		dev->stats.tx_aborted_errors++;
-	}
-
-	/* if we couldn't resolve the address... give up. */
-	if (!status)
-		return 0;
-
-	/* add the _real_ header this time! */
-	proto = arc_proto_map[lp->default_proto[daddr]];
-	proto->build_header(skb, dev, type, daddr);
-
-	return 1;		/* success */
-}
-
-
-
 /* Called by the kernel in order to transmit a packet. */
 netdev_tx_t arcnet_send_packet(struct sk_buff *skb,
 				     struct net_device *dev)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 4f4099d5603d..2950c3780230 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -336,7 +336,6 @@ static int ipvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 
 static const struct header_ops ipvlan_header_ops = {
 	.create  	= ipvlan_hard_header,
-	.rebuild	= eth_rebuild_header,
 	.parse		= eth_header_parse,
 	.cache		= eth_header_cache,
 	.cache_update	= eth_header_cache_update,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 1df38bdae2ee..b5e3320ca506 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -550,7 +550,6 @@ static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 
 static const struct header_ops macvlan_hard_header_ops = {
 	.create  	= macvlan_hard_header,
-	.rebuild	= eth_rebuild_header,
 	.parse		= eth_header_parse,
 	.cache		= eth_header_cache,
 	.cache_update	= eth_header_cache_update,
diff --git a/drivers/net/wireless/hostap/hostap_main.c b/drivers/net/wireless/hostap/hostap_main.c
index 52919ad42726..8f9f3e9fbfce 100644
--- a/drivers/net/wireless/hostap/hostap_main.c
+++ b/drivers/net/wireless/hostap/hostap_main.c
@@ -798,7 +798,6 @@ static void prism2_tx_timeout(struct net_device *dev)
 
 const struct header_ops hostap_80211_ops = {
 	.create		= eth_header,
-	.rebuild	= eth_rebuild_header,
 	.cache		= eth_header_cache,
 	.cache_update	= eth_header_cache_update,
 	.parse		= hostap_80211_header_parse,
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 1d869d185a0d..606563ef8a72 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -35,7 +35,6 @@ extern const struct header_ops eth_header_ops;
 
 int eth_header(struct sk_buff *skb, struct net_device *dev, unsigned short type,
 	       const void *daddr, const void *saddr, unsigned len);
-int eth_rebuild_header(struct sk_buff *skb);
 int eth_header_parse(const struct sk_buff *skb, unsigned char *haddr);
 int eth_header_cache(const struct neighbour *neigh, struct hh_cache *hh,
 		     __be16 type);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5897b4ea5a3f..2007f3b44d05 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -261,7 +261,6 @@ struct header_ops {
 			   unsigned short type, const void *daddr,
 			   const void *saddr, unsigned int len);
 	int	(*parse)(const struct sk_buff *skb, unsigned char *haddr);
-	int	(*rebuild)(struct sk_buff *skb);
 	int	(*cache)(const struct neighbour *neigh, struct hh_cache *hh, __be16 type);
 	void	(*cache_update)(struct hh_cache *hh,
 				const struct net_device *dev,
@@ -1346,7 +1345,7 @@ enum netdev_priv_flags {
  *			if one wants to override the ndo_*() functions
  *	@ethtool_ops:	Management operations
  *	@fwd_ops:	Management operations
- *	@header_ops:	Includes callbacks for creating,parsing,rebuilding,etc
+ *	@header_ops:	Includes callbacks for creating,parsing,caching,etc
  *			of Layer 2 headers.
  *
  *	@flags:		Interface flags (a la BSD)
@@ -2399,15 +2398,6 @@ static inline int dev_parse_header(const struct sk_buff *skb,
 	return dev->header_ops->parse(skb, haddr);
 }
 
-static inline int dev_rebuild_header(struct sk_buff *skb)
-{
-	const struct net_device *dev = skb->dev;
-
-	if (!dev->header_ops || !dev->header_ops->rebuild)
-		return 0;
-	return dev->header_ops->rebuild(skb);
-}
-
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int len);
 int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
 static inline int unregister_gifconf(unsigned int family)
diff --git a/net/802/fc.c b/net/802/fc.c
index 7c174b6750cd..7b9219022418 100644
--- a/net/802/fc.c
+++ b/net/802/fc.c
@@ -75,29 +75,8 @@ static int fc_header(struct sk_buff *skb, struct net_device *dev,
 	return -hdr_len;
 }
 
-/*
- *	A neighbour discovery of some species (eg arp) has completed. We
- *	can now send the packet.
- */
-
-static int fc_rebuild_header(struct sk_buff *skb)
-{
-#ifdef CONFIG_INET
-	struct fch_hdr *fch=(struct fch_hdr *)skb->data;
-	struct fcllc *fcllc=(struct fcllc *)(skb->data+sizeof(struct fch_hdr));
-	if(fcllc->ethertype != htons(ETH_P_IP)) {
-		printk("fc_rebuild_header: Don't know how to resolve type %04X addresses ?\n", ntohs(fcllc->ethertype));
-		return 0;
-	}
-	return arp_find(fch->daddr, skb);
-#else
-	return 0;
-#endif
-}
-
 static const struct header_ops fc_header_ops = {
 	.create	 = fc_header,
-	.rebuild = fc_rebuild_header,
 };
 
 static void fc_setup(struct net_device *dev)
diff --git a/net/802/fddi.c b/net/802/fddi.c
index 59e7346f1193..7d3a0af954e8 100644
--- a/net/802/fddi.c
+++ b/net/802/fddi.c
@@ -87,31 +87,6 @@ static int fddi_header(struct sk_buff *skb, struct net_device *dev,
 	return -hl;
 }
 
-
-/*
- * Rebuild the FDDI MAC header. This is called after an ARP
- * (or in future other address resolution) has completed on
- * this sk_buff.  We now let ARP fill in the other fields.
- */
-
-static int fddi_rebuild_header(struct sk_buff	*skb)
-{
-	struct fddihdr *fddi = (struct fddihdr *)skb->data;
-
-#ifdef CONFIG_INET
-	if (fddi->hdr.llc_snap.ethertype == htons(ETH_P_IP))
-		/* Try to get ARP to resolve the header and fill destination address */
-		return arp_find(fddi->daddr, skb);
-	else
-#endif
-	{
-		printk("%s: Don't know how to resolve type %04X addresses.\n",
-		       skb->dev->name, ntohs(fddi->hdr.llc_snap.ethertype));
-		return 0;
-	}
-}
-
-
 /*
  * Determine the packet's protocol ID and fill in skb fields.
  * This routine is called before an incoming packet is passed
@@ -177,7 +152,6 @@ EXPORT_SYMBOL(fddi_change_mtu);
 
 static const struct header_ops fddi_header_ops = {
 	.create		= fddi_header,
-	.rebuild	= fddi_rebuild_header,
 };
 
 
diff --git a/net/802/hippi.c b/net/802/hippi.c
index 2e03f8259dd5..ade1a52cdcff 100644
--- a/net/802/hippi.c
+++ b/net/802/hippi.c
@@ -91,33 +91,6 @@ static int hippi_header(struct sk_buff *skb, struct net_device *dev,
 
 
 /*
- * Rebuild the HIPPI MAC header. This is called after an ARP has
- * completed on this sk_buff. We now let ARP fill in the other fields.
- */
-
-static int hippi_rebuild_header(struct sk_buff *skb)
-{
-	struct hippi_hdr *hip = (struct hippi_hdr *)skb->data;
-
-	/*
-	 * Only IP is currently supported
-	 */
-
-	if(hip->snap.ethertype != htons(ETH_P_IP))
-	{
-		printk(KERN_DEBUG "%s: unable to resolve type %X addresses.\n",skb->dev->name,ntohs(hip->snap.ethertype));
-		return 0;
-	}
-
-	/*
-	 * We don't support dynamic ARP on HIPPI, but we use the ARP
-	 * static ARP tables to hold the I-FIELDs.
-	 */
-	return arp_find(hip->le.daddr, skb);
-}
-
-
-/*
  *	Determine the packet's protocol ID.
  */
 
@@ -186,7 +159,6 @@ EXPORT_SYMBOL(hippi_neigh_setup_dev);
 
 static const struct header_ops hippi_header_ops = {
 	.create		= hippi_header,
-	.rebuild	= hippi_rebuild_header,
 };
 
 
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 118956448cf6..1dcfec8b49f3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -37,39 +37,6 @@
 #include <linux/netpoll.h>
 
 /*
- *	Rebuild the Ethernet MAC header. This is called after an ARP
- *	(or in future other address resolution) has completed on this
- *	sk_buff. We now let ARP fill in the other fields.
- *
- *	This routine CANNOT use cached dst->neigh!
- *	Really, it is used only when dst->neigh is wrong.
- *
- * TODO:  This needs a checkup, I'm ignorant here. --BLG
- */
-static int vlan_dev_rebuild_header(struct sk_buff *skb)
-{
-	struct net_device *dev = skb->dev;
-	struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
-
-	switch (veth->h_vlan_encapsulated_proto) {
-#ifdef CONFIG_INET
-	case htons(ETH_P_IP):
-
-		/* TODO:  Confirm this will work with VLAN headers... */
-		return arp_find(veth->h_dest, skb);
-#endif
-	default:
-		pr_debug("%s: unable to resolve type %X addresses\n",
-			 dev->name, ntohs(veth->h_vlan_encapsulated_proto));
-
-		ether_addr_copy(veth->h_source, dev->dev_addr);
-		break;
-	}
-
-	return 0;
-}
-
-/*
  *	Create the VLAN header for an arbitrary protocol layer
  *
  *	saddr=NULL	means use device source address
@@ -534,7 +501,6 @@ static int vlan_dev_get_lock_subclass(struct net_device *dev)
 
 static const struct header_ops vlan_header_ops = {
 	.create	 = vlan_dev_hard_header,
-	.rebuild = vlan_dev_rebuild_header,
 	.parse	 = eth_header_parse,
 };
 
@@ -554,7 +520,6 @@ static int vlan_passthru_hard_header(struct sk_buff *skb, struct net_device *dev
 
 static const struct header_ops vlan_passthru_header_ops = {
 	.create	 = vlan_passthru_hard_header,
-	.rebuild = dev_rebuild_header,
 	.parse	 = eth_header_parse,
 };
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 238f38d21641..8dbdf6c910b7 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -113,39 +113,6 @@ int eth_header(struct sk_buff *skb, struct net_device *dev,
 EXPORT_SYMBOL(eth_header);
 
 /**
- * eth_rebuild_header- rebuild the Ethernet MAC header.
- * @skb: socket buffer to update
- *
- * This is called after an ARP or IPV6 ndisc it's resolution on this
- * sk_buff. We now let protocol (ARP) fill in the other fields.
- *
- * This routine CANNOT use cached dst->neigh!
- * Really, it is used only when dst->neigh is wrong.
- */
-int eth_rebuild_header(struct sk_buff *skb)
-{
-	struct ethhdr *eth = (struct ethhdr *)skb->data;
-	struct net_device *dev = skb->dev;
-
-	switch (eth->h_proto) {
-#ifdef CONFIG_INET
-	case htons(ETH_P_IP):
-		return arp_find(eth->h_dest, skb);
-#endif
-	default:
-		netdev_dbg(dev,
-		       "%s: unable to resolve type %X addresses.\n",
-		       dev->name, ntohs(eth->h_proto));
-
-		memcpy(eth->h_source, dev->dev_addr, ETH_ALEN);
-		break;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(eth_rebuild_header);
-
-/**
  * eth_get_headlen - determine the the length of header for an ethernet frame
  * @data: pointer to start of frame
  * @len: total length of frame
@@ -369,7 +336,6 @@ EXPORT_SYMBOL(eth_validate_addr);
 const struct header_ops eth_header_ops ____cacheline_aligned = {
 	.create		= eth_header,
 	.parse		= eth_header_parse,
-	.rebuild	= eth_rebuild_header,
 	.cache		= eth_header_cache,
 	.cache_update	= eth_header_cache_update,
 };
diff --git a/net/netrom/nr_dev.c b/net/netrom/nr_dev.c
index 6ae063cebf7d..988f542481a8 100644
--- a/net/netrom/nr_dev.c
+++ b/net/netrom/nr_dev.c
@@ -65,36 +65,6 @@ int nr_rx_ip(struct sk_buff *skb, struct net_device *dev)
 	return 1;
 }
 
-#ifdef CONFIG_INET
-
-static int nr_rebuild_header(struct sk_buff *skb)
-{
-	unsigned char *bp = skb->data;
-
-	if (arp_find(bp + 7, skb))
-		return 1;
-
-	bp[6] &= ~AX25_CBIT;
-	bp[6] &= ~AX25_EBIT;
-	bp[6] |= AX25_SSSID_SPARE;
-	bp    += AX25_ADDR_LEN;
-
-	bp[6] &= ~AX25_CBIT;
-	bp[6] |= AX25_EBIT;
-	bp[6] |= AX25_SSSID_SPARE;
-
-	return 0;
-}
-
-#else
-
-static int nr_rebuild_header(struct sk_buff *skb)
-{
-	return 1;
-}
-
-#endif
-
 static int nr_header(struct sk_buff *skb, struct net_device *dev,
 		     unsigned short type,
 		     const void *daddr, const void *saddr, unsigned int len)
@@ -188,7 +158,6 @@ static netdev_tx_t nr_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static const struct header_ops nr_header_ops = {
 	.create	= nr_header,
-	.rebuild= nr_rebuild_header,
 };
 
 static const struct net_device_ops nr_netdev_ops = {
diff --git a/net/rose/rose_dev.c b/net/rose/rose_dev.c
index 90209c1fa49b..369ca81a8c5d 100644
--- a/net/rose/rose_dev.c
+++ b/net/rose/rose_dev.c
@@ -56,19 +56,6 @@ static int rose_header(struct sk_buff *skb, struct net_device *dev,
 	return -37;
 }
 
-static int rose_rebuild_header(struct sk_buff *skb)
-{
-#ifdef CONFIG_INET
-	unsigned char *bp = (unsigned char *)skb->data;
-
-	if (arp_find(bp + 7, skb)) {
-		return 1;
-	}
-
-#endif
-	return 0;
-}
-
 static int rose_set_mac_address(struct net_device *dev, void *addr)
 {
 	struct sockaddr *sa = addr;
@@ -133,7 +120,6 @@ static netdev_tx_t rose_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static const struct header_ops rose_header_ops = {
 	.create	= rose_header,
-	.rebuild = rose_rebuild_header,
 };
 
 static const struct net_device_ops rose_netdev_ops = {
-- 
2.2.1



* [PATCH net-next 13/15] arp: Kill arp_find
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (11 preceding siblings ...)
  2015-03-02  6:11           ` [PATCH net-next 12/15] net: Kill dev_rebuild_header Eric W. Biederman
@ 2015-03-02  6:12           ` Eric W. Biederman
  2015-03-02  6:13           ` [PATCH net-next 14/15] neigh: Don't require dst in neigh_hh_init Eric W. Biederman
                             ` (2 subsequent siblings)
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


There are no more callers so kill this function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/arp.h |  1 -
 net/ipv4/arp.c    | 65 -------------------------------------------------------
 2 files changed, 66 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 73c49864076b..21ee1860abbc 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -47,7 +47,6 @@ static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, u32
 }
 
 void arp_init(void);
-int arp_find(unsigned char *haddr, struct sk_buff *skb);
 int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 void arp_send(int type, int ptype, __be32 dest_ip,
 	      struct net_device *dev, __be32 src_ip,
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 2557cf9a4648..bca5b9d9b442 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -396,71 +396,6 @@ static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev)
 	return flag;
 }
 
-/* OBSOLETE FUNCTIONS */
-
-/*
- *	Find an arp mapping in the cache. If not found, post a request.
- *
- *	It is very UGLY routine: it DOES NOT use skb->dst->neighbour,
- *	even if it exists. It is supposed that skb->dev was mangled
- *	by a virtual device (eql, shaper). Nobody but broken devices
- *	is allowed to use this function, it is scheduled to be removed. --ANK
- */
-
-static int arp_set_predefined(int addr_hint, unsigned char *haddr,
-			      __be32 paddr, struct net_device *dev)
-{
-	switch (addr_hint) {
-	case RTN_LOCAL:
-		pr_debug("arp called for own IP address\n");
-		memcpy(haddr, dev->dev_addr, dev->addr_len);
-		return 1;
-	case RTN_MULTICAST:
-		arp_mc_map(paddr, haddr, dev, 1);
-		return 1;
-	case RTN_BROADCAST:
-		memcpy(haddr, dev->broadcast, dev->addr_len);
-		return 1;
-	}
-	return 0;
-}
-
-
-int arp_find(unsigned char *haddr, struct sk_buff *skb)
-{
-	struct net_device *dev = skb->dev;
-	__be32 paddr;
-	struct neighbour *n;
-
-	if (!skb_dst(skb)) {
-		pr_debug("arp_find is called with dst==NULL\n");
-		kfree_skb(skb);
-		return 1;
-	}
-
-	paddr = rt_nexthop(skb_rtable(skb), ip_hdr(skb)->daddr);
-	if (arp_set_predefined(inet_addr_type(dev_net(dev), paddr), haddr,
-			       paddr, dev))
-		return 0;
-
-	n = __neigh_lookup(&arp_tbl, &paddr, dev, 1);
-
-	if (n) {
-		n->used = jiffies;
-		if (n->nud_state & NUD_VALID || neigh_event_send(n, skb) == 0) {
-			neigh_ha_snapshot(haddr, n, dev);
-			neigh_release(n);
-			return 0;
-		}
-		neigh_release(n);
-	} else
-		kfree_skb(skb);
-	return 1;
-}
-EXPORT_SYMBOL(arp_find);
-
-/* END OF OBSOLETE FUNCTIONS */
-
 /*
  * Check if we can use proxy ARP for this path
  */
-- 
2.2.1


* [PATCH net-next 14/15] neigh: Don't require dst in neigh_hh_init
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (12 preceding siblings ...)
  2015-03-02  6:12           ` [PATCH net-next 13/15] arp: Kill arp_find Eric W. Biederman
@ 2015-03-02  6:13           ` Eric W. Biederman
  2015-03-02  6:14           ` [PATCH net-next 15/15] neigh: Don't require a dst in neigh_resolve_output Eric W. Biederman
  2015-03-02 21:44           ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups David Miller
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev

This results in a neigh_hh_init that will cache the same values
regardless of the packets flowing through it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/neighbour.h | 1 +
 net/core/neighbour.c    | 8 ++++----
 net/decnet/dn_neigh.c   | 1 +
 net/ipv4/arp.c          | 1 +
 net/ipv6/ndisc.c        | 1 +
 5 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index bc66babb5f27..9f912e4d4232 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -193,6 +193,7 @@ struct neigh_table {
 	int			family;
 	int			entry_size;
 	int			key_len;
+	__be16			protocol;
 	__u32			(*hash)(const void *pkey,
 					const struct net_device *dev,
 					__u32 *hash_rnd);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 8a319ff3e8d1..af72b863e968 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1263,10 +1263,10 @@ struct neighbour *neigh_event_ns(struct neigh_table *tbl,
 EXPORT_SYMBOL(neigh_event_ns);
 
 /* called with read_lock_bh(&n->lock); */
-static void neigh_hh_init(struct neighbour *n, struct dst_entry *dst)
+static void neigh_hh_init(struct neighbour *n)
 {
-	struct net_device *dev = dst->dev;
-	__be16 prot = dst->ops->protocol;
+	struct net_device *dev = n->dev;
+	__be16 prot = n->tbl->protocol;
 	struct hh_cache	*hh = &n->hh;
 
 	write_lock_bh(&n->lock);
@@ -1296,7 +1296,7 @@ int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
 		unsigned int seq;
 
 		if (dev->header_ops->cache && !neigh->hh.hh_len)
-			neigh_hh_init(neigh, dst);
+			neigh_hh_init(neigh);
 
 		do {
 			__skb_pull(skb, skb_network_offset(skb));
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 7ca7c3143da3..f123c6c6748c 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -97,6 +97,7 @@ struct neigh_table dn_neigh_table = {
 	.family =			PF_DECnet,
 	.entry_size =			NEIGH_ENTRY_SIZE(sizeof(struct dn_neigh)),
 	.key_len =			sizeof(__le16),
+	.protocol =			cpu_to_be16(ETH_P_DNA_RT),
 	.hash =				dn_neigh_hash,
 	.constructor =			dn_neigh_construct,
 	.id =				"dn_neigh_cache",
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index bca5b9d9b442..6b8aad6a0d7d 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -152,6 +152,7 @@ static const struct neigh_ops arp_direct_ops = {
 struct neigh_table arp_tbl = {
 	.family		= AF_INET,
 	.key_len	= 4,
+	.protocol	= cpu_to_be16(ETH_P_IP),
 	.hash		= arp_hash,
 	.constructor	= arp_constructor,
 	.proxy_redo	= parp_redo,
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 471ed24aabae..e363bbc2420d 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -117,6 +117,7 @@ static const struct neigh_ops ndisc_direct_ops = {
 struct neigh_table nd_tbl = {
 	.family =	AF_INET6,
 	.key_len =	sizeof(struct in6_addr),
+	.protocol =	cpu_to_be16(ETH_P_IPV6),
 	.hash =		ndisc_hash,
 	.constructor =	ndisc_constructor,
 	.pconstructor =	pndisc_constructor,
-- 
2.2.1
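
For readers following the thread, the effect of this patch can be sketched
in userspace C.  This is a minimal model, not the kernel code: the sketch_*
types and the function are hypothetical stand-ins, the protocol is kept in
host byte order for simplicity (the kernel field is __be16), and the 14-byte
header length is just the Ethernet case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * that matter for this patch are modelled. */
struct sketch_dev {
	const char *name;
};

struct sketch_neigh_table {
	int      family;
	uint16_t protocol;	/* link-layer protocol, fixed per table */
};

struct sketch_neighbour {
	struct sketch_neigh_table *tbl;
	struct sketch_dev         *dev;
	uint16_t                   hh_proto;	/* protocol stored in the cached header */
	int                        hh_len;	/* 0 means no cached header yet */
};

/* Models the patched neigh_hh_init: everything it needs comes from the
 * neighbour itself, so no dst (and therefore no packet) is consulted. */
static void sketch_hh_init(struct sketch_neighbour *n)
{
	assert(n != NULL && n->tbl != NULL && n->dev != NULL);

	if (n->hh_len)		/* header already cached, nothing to do */
		return;

	n->hh_proto = n->tbl->protocol;
	n->hh_len   = 14;	/* Ethernet header length, for illustration only */
}
```

Because the protocol comes from the table and the device from the neighbour,
the cached value is fixed the first time the header cache is initialised,
whichever packet happens to trigger it.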



* [PATCH net-next 15/15] neigh: Don't require a dst in neigh_resolve_output
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (13 preceding siblings ...)
  2015-03-02  6:13           ` [PATCH net-next 14/15] neigh: Don't require dst in neigh_hh_init Eric W. Biederman
@ 2015-03-02  6:14           ` Eric W. Biederman
  2015-03-02 21:44           ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups David Miller
  15 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-02  6:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ralf Baechle, linux-hams


Having a dst helps a little bit for teql but is fundamentally
unnecessary, and there are code paths where no dst is available in
which it would be nice to use the neighbour cache.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/core/neighbour.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index af72b863e968..0f48ea3affed 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1284,12 +1284,8 @@ static void neigh_hh_init(struct neighbour *n)
 
 int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
 {
-	struct dst_entry *dst = skb_dst(skb);
 	int rc = 0;
 
-	if (!dst)
-		goto discard;
-
 	if (!neigh_event_send(neigh, skb)) {
 		int err;
 		struct net_device *dev = neigh->dev;
@@ -1312,8 +1308,6 @@ int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb)
 	}
 out:
 	return rc;
-discard:
-	neigh_dbg(1, "%s: dst=%p neigh=%p\n", __func__, dst, neigh);
 out_kfree_skb:
 	rc = -EINVAL;
 	kfree_skb(skb);
-- 
2.2.1
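
The behavioural difference in this patch can be illustrated with a small
userspace model.  The names below are hypothetical and the model collapses
header construction into a single flag; it only shows that a packet without
a dst used to be rejected with -EINVAL and is now handled like any other.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff. */
struct sketch_skb {
	const void *dst;	/* NULL when the packet has no route attached */
	bool        header_ok;	/* whether building the link-layer header succeeds */
};

/* Models the old behaviour: a missing dst discards the packet. */
static int sketch_resolve_output_old(const struct sketch_skb *skb)
{
	assert(skb != NULL);

	if (!skb->dst)
		return -EINVAL;
	return skb->header_ok ? 0 : -EINVAL;
}

/* Models the patched behaviour: the dst is not looked at at all. */
static int sketch_resolve_output_new(const struct sketch_skb *skb)
{
	assert(skb != NULL);

	return skb->header_ok ? 0 : -EINVAL;
}
```

Packets that do carry a dst take the same path in both versions; only the
dst-less case changes.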



* Re: [PATCH net-next 0/15] Neighbour table and ax25 cleanups
  2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
                             ` (14 preceding siblings ...)
  2015-03-02  6:14           ` [PATCH net-next 15/15] neigh: Don't require a dst in neigh_resolve_output Eric W. Biederman
@ 2015-03-02 21:44           ` David Miller
  2015-03-03 15:41             ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Eric W. Biederman
  15 siblings, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-02 21:44 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, ralf, linux-hams

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 01 Mar 2015 23:59:11 -0600

> While looking at the neighbour table to see what it would take to allow
> using next hops in a different address family than the current packets
> I found a partial resolution for my issues and I stumbled upon some
> work that makes the neighbour table code easier to understand and
> maintain.
> 
> Long ago in a much younger kernel ax25 found a hack to use
> dev_rebuild_header to transmit its packets instead of going through
> what today is ndo_start_xmit.
> 
> When the neighbour table was rewritten into its current form the ax25
> code was such a challenge that arp_broken_ops appeared in arp.c and
> neigh_compat_output appeared in neighbour.c to keep the ax25 hack alive.
> 
> With a little bit of work I was able to remove some of the hack that
> is the ax25 transmit path for ip packets and to isolate what remains
> into a slightly more readable piece of code in ax25_ip.c, removing the
> need for the generic code to worry about ax25 special cases.
> 
> After cleaning up the old ax25 hacks I also performed a little bit of
> work on neigh_resolve_output to remove the need for a dst entry and to
> ensure cached headers get a deterministic protocol value in their cached
> header.   This guarantees that a cached header will not be different
> depending on which protocol of packet is transmitted, and it allows
> packets to be transmitted that don't have a dst entry.  There remains
> a small amount of code that takes advantage of when packets have a dst
> entry but that is something different.

Wow... simply, wow.

I thought we'd be stuck with that crap forever, great work!

Applied to net-next, thanks again!


* [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-02 21:44           ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups David Miller
@ 2015-03-03 15:41             ` Eric W. Biederman
  2015-03-03 19:45               ` David Miller
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-03 15:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ralf, linux-hams


Before the ax25 stack calls dev_queue_xmit it always calls
ax25_type_trans which sets skb->protocol to ETH_P_AX25.

This means that by looking at the protocol type it is possible to
detect IP packets that have not been munged by the ax25 stack in
ndo_start_xmit and call a function to munge them.

Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and
value to be appropriate for an ndo_start_xmit function.

Update all of the ax25 devices to test the protocol type for ETH_P_IP
and return ax25_ip_xmit as the first thing they do.  This preserves
the existing semantics of IP packet processing, but the timing will be
a little different as the IP packets now pass through the qdisc layer
before reaching the ax25 ip packet processing.

Remove the now unnecessary ax25 neighbour table operations.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/net/hamradio/6pack.c      |  5 ++--
 drivers/net/hamradio/baycom_epp.c |  5 ++--
 drivers/net/hamradio/bpqether.c   |  5 ++--
 drivers/net/hamradio/dmascc.c     |  5 ++--
 drivers/net/hamradio/hdlcdrv.c    |  5 ++--
 drivers/net/hamradio/mkiss.c      |  5 ++--
 drivers/net/hamradio/scc.c        |  5 ++--
 drivers/net/hamradio/yam.c        |  5 ++--
 include/net/ax25.h                |  5 +---
 net/ax25/ax25_ip.c                | 60 ++++-----------------------------------
 10 files changed, 31 insertions(+), 74 deletions(-)

diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c
index 0b8393ca8c80..7c4a4151ef0f 100644
--- a/drivers/net/hamradio/6pack.c
+++ b/drivers/net/hamradio/6pack.c
@@ -247,6 +247,9 @@ static netdev_tx_t sp_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct sixpack *sp = netdev_priv(dev);
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	spin_lock_bh(&sp->lock);
 	/* We were not busy, so we are now... :-) */
 	netif_stop_queue(dev);
@@ -302,7 +305,6 @@ static const struct net_device_ops sp_netdev_ops = {
 	.ndo_stop		= sp_close,
 	.ndo_start_xmit		= sp_xmit,
 	.ndo_set_mac_address    = sp_set_mac_address,
-	.ndo_neigh_construct	= ax25_neigh_construct,
 };
 
 static void sp_setup(struct net_device *dev)
@@ -316,7 +318,6 @@ static void sp_setup(struct net_device *dev)
 
 	dev->addr_len		= AX25_ADDR_LEN;
 	dev->type		= ARPHRD_AX25;
-	dev->neigh_priv_len	= sizeof(struct ax25_neigh_priv);
 	dev->tx_queue_len	= 10;
 
 	/* Only activated in AX.25 mode */
diff --git a/drivers/net/hamradio/baycom_epp.c b/drivers/net/hamradio/baycom_epp.c
index 3539ab392f7d..83c7cce0d172 100644
--- a/drivers/net/hamradio/baycom_epp.c
+++ b/drivers/net/hamradio/baycom_epp.c
@@ -772,6 +772,9 @@ static int baycom_send_packet(struct sk_buff *skb, struct net_device *dev)
 {
 	struct baycom_state *bc = netdev_priv(dev);
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	if (skb->data[0] != 0) {
 		do_kiss_params(bc, skb->data, skb->len);
 		dev_kfree_skb(skb);
@@ -1109,7 +1112,6 @@ static const struct net_device_ops baycom_netdev_ops = {
 	.ndo_do_ioctl	     = baycom_ioctl,
 	.ndo_start_xmit      = baycom_send_packet,
 	.ndo_set_mac_address = baycom_set_mac_address,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /*
@@ -1147,7 +1149,6 @@ static void baycom_probe(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 	
 	dev->type = ARPHRD_AX25;           /* AF_AX25 device */
-	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;        /* eth_mtu is the default */
 	dev->addr_len = AX25_ADDR_LEN;     /* sizeof an ax.25 address */
diff --git a/drivers/net/hamradio/bpqether.c b/drivers/net/hamradio/bpqether.c
index bce105b16ed0..63ff08a26da8 100644
--- a/drivers/net/hamradio/bpqether.c
+++ b/drivers/net/hamradio/bpqether.c
@@ -251,6 +251,9 @@ static netdev_tx_t bpq_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct net_device *orig_dev;
 	int size;
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	/*
 	 * Just to be *really* sure not to send anything if the interface
 	 * is down, the ethernet device may have gone.
@@ -469,7 +472,6 @@ static const struct net_device_ops bpq_netdev_ops = {
 	.ndo_start_xmit	     = bpq_xmit,
 	.ndo_set_mac_address = bpq_set_mac_address,
 	.ndo_do_ioctl	     = bpq_ioctl,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void bpq_setup(struct net_device *dev)
@@ -487,7 +489,6 @@ static void bpq_setup(struct net_device *dev)
 #endif
 
 	dev->type            = ARPHRD_AX25;
-	dev->neigh_priv_len  = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu             = AX25_DEF_PACLEN;
 	dev->addr_len        = AX25_ADDR_LEN;
diff --git a/drivers/net/hamradio/dmascc.c b/drivers/net/hamradio/dmascc.c
index abab7be77406..c3d377770616 100644
--- a/drivers/net/hamradio/dmascc.c
+++ b/drivers/net/hamradio/dmascc.c
@@ -433,7 +433,6 @@ module_exit(dmascc_exit);
 static void __init dev_setup(struct net_device *dev)
 {
 	dev->type = ARPHRD_AX25;
-	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN;
 	dev->mtu = 1500;
 	dev->addr_len = AX25_ADDR_LEN;
@@ -448,7 +447,6 @@ static const struct net_device_ops scc_netdev_ops = {
 	.ndo_start_xmit = scc_send_packet,
 	.ndo_do_ioctl = scc_ioctl,
 	.ndo_set_mac_address = scc_set_mac_address,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static int __init setup_adapter(int card_base, int type, int n)
@@ -922,6 +920,9 @@ static int scc_send_packet(struct sk_buff *skb, struct net_device *dev)
 	unsigned long flags;
 	int i;
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	/* Temporarily stop the scheduler feeding us packets */
 	netif_stop_queue(dev);
 
diff --git a/drivers/net/hamradio/hdlcdrv.c b/drivers/net/hamradio/hdlcdrv.c
index 435868a7b69c..49fe59b180a8 100644
--- a/drivers/net/hamradio/hdlcdrv.c
+++ b/drivers/net/hamradio/hdlcdrv.c
@@ -404,6 +404,9 @@ static netdev_tx_t hdlcdrv_send_packet(struct sk_buff *skb,
 {
 	struct hdlcdrv_state *sm = netdev_priv(dev);
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	if (skb->data[0] != 0) {
 		do_kiss_params(sm, skb->data, skb->len);
 		dev_kfree_skb(skb);
@@ -626,7 +629,6 @@ static const struct net_device_ops hdlcdrv_netdev = {
 	.ndo_start_xmit = hdlcdrv_send_packet,
 	.ndo_do_ioctl	= hdlcdrv_ioctl,
 	.ndo_set_mac_address = hdlcdrv_set_mac_address,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /*
@@ -677,7 +679,6 @@ static void hdlcdrv_setup(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 	
 	dev->type = ARPHRD_AX25;           /* AF_AX25 device */
-	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;        /* eth_mtu is the default */
 	dev->addr_len = AX25_ADDR_LEN;     /* sizeof an ax.25 address */
diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index c12ec2c2b594..17058c490b79 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -529,6 +529,9 @@ static netdev_tx_t ax_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mkiss *ax = netdev_priv(dev);
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	if (!netif_running(dev))  {
 		printk(KERN_ERR "mkiss: %s: xmit call when iface is down\n", dev->name);
 		return NETDEV_TX_BUSY;
@@ -641,7 +644,6 @@ static const struct net_device_ops ax_netdev_ops = {
 	.ndo_stop            = ax_close,
 	.ndo_start_xmit	     = ax_xmit,
 	.ndo_set_mac_address = ax_set_mac_address,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void ax_setup(struct net_device *dev)
@@ -651,7 +653,6 @@ static void ax_setup(struct net_device *dev)
 	dev->hard_header_len = 0;
 	dev->addr_len        = 0;
 	dev->type            = ARPHRD_AX25;
-	dev->neigh_priv_len  = sizeof(struct ax25_neigh_priv);
 	dev->tx_queue_len    = 10;
 	dev->header_ops      = &ax25_header_ops;
 	dev->netdev_ops	     = &ax_netdev_ops;
diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index b305f51eb420..ce88df33fe17 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1550,7 +1550,6 @@ static const struct net_device_ops scc_netdev_ops = {
 	.ndo_set_mac_address = scc_net_set_mac_address,
 	.ndo_get_stats       = scc_net_get_stats,
 	.ndo_do_ioctl        = scc_net_ioctl,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 /* ----> Initialize device <----- */
@@ -1568,7 +1567,6 @@ static void scc_net_setup(struct net_device *dev)
 	dev->flags      = 0;
 
 	dev->type = ARPHRD_AX25;
-	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN + AX25_BPQ_HEADER_LEN;
 	dev->mtu = AX25_DEF_PACLEN;
 	dev->addr_len = AX25_ADDR_LEN;
@@ -1641,6 +1639,9 @@ static netdev_tx_t scc_net_tx(struct sk_buff *skb, struct net_device *dev)
 	unsigned long flags;
 	char kisscmd;
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	if (skb->len > scc->stat.bufsize || skb->len < 2) {
 		scc->dev_stat.tx_dropped++;	/* bogus frame */
 		dev_kfree_skb(skb);
diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index 89d9da7a0c51..1a4729c36aa4 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -597,6 +597,9 @@ static netdev_tx_t yam_send_packet(struct sk_buff *skb,
 {
 	struct yam_port *yp = netdev_priv(dev);
 
+	if (skb->protocol == htons(ETH_P_IP))
+		return ax25_ip_xmit(skb);
+
 	skb_queue_tail(&yp->send_queue, skb);
 	dev->trans_start = jiffies;
 	return NETDEV_TX_OK;
@@ -1100,7 +1103,6 @@ static const struct net_device_ops yam_netdev_ops = {
 	.ndo_start_xmit      = yam_send_packet,
 	.ndo_do_ioctl 	     = yam_ioctl,
 	.ndo_set_mac_address = yam_set_mac_address,
-	.ndo_neigh_construct = ax25_neigh_construct,
 };
 
 static void yam_setup(struct net_device *dev)
@@ -1129,7 +1131,6 @@ static void yam_setup(struct net_device *dev)
 	dev->header_ops = &ax25_header_ops;
 
 	dev->type = ARPHRD_AX25;
-	dev->neigh_priv_len = sizeof(struct ax25_neigh_priv);
 	dev->hard_header_len = AX25_MAX_HEADER_LEN;
 	dev->mtu = AX25_MTU;
 	dev->addr_len = AX25_ADDR_LEN;
diff --git a/include/net/ax25.h b/include/net/ax25.h
index 45feeba7a325..16a923a3a43a 100644
--- a/include/net/ax25.h
+++ b/include/net/ax25.h
@@ -367,11 +367,8 @@ int ax25_kiss_rcv(struct sk_buff *, struct net_device *, struct packet_type *,
 		  struct net_device *);
 
 /* ax25_ip.c */
-int ax25_neigh_construct(struct neighbour *neigh);
+netdev_tx_t ax25_ip_xmit(struct sk_buff *skb);
 extern const struct header_ops ax25_header_ops;
-struct ax25_neigh_priv {
-	struct neigh_ops ops;
-};
 
 /* ax25_out.c */
 ax25_cb *ax25_send_frame(struct sk_buff *, int, ax25_address *, ax25_address *,
diff --git a/net/ax25/ax25_ip.c b/net/ax25/ax25_ip.c
index e030c64ebfb7..8b35af4ef93e 100644
--- a/net/ax25/ax25_ip.c
+++ b/net/ax25/ax25_ip.c
@@ -100,7 +100,7 @@ static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
 	return -AX25_HEADER_LEN;	/* Unfinished header */
 }
 
-static int ax25_neigh_xmit(struct sk_buff *skb)
+netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 {
 	struct sk_buff *ourskb;
 	unsigned char *bp  = skb->data;
@@ -210,56 +210,7 @@ put:
 	if (route)
 		ax25_put_route(route);
 
-	return 1;
-}
-
-static int ax25_neigh_output(struct neighbour *neigh, struct sk_buff *skb)
-{
-	/* Except for calling ax25_neigh_xmit instead of
-	 * dev_queue_xmit this is neigh_resolve_output.
-	 */
-	int rc = 0;
-
-	if (!neigh_event_send(neigh, skb)) {
-		int err;
-		struct net_device *dev = neigh->dev;
-		unsigned int seq;
-
-		do {
-			__skb_pull(skb, skb_network_offset(skb));
-			seq = read_seqbegin(&neigh->ha_lock);
-			err = dev_hard_header(skb, dev, ntohs(skb->protocol),
-					      neigh->ha, NULL, skb->len);
-		} while (read_seqretry(&neigh->ha_lock, seq));
-
-		if (err >= 0) {
-			ax25_neigh_xmit(skb);
-		} else
-			goto out_kfree_skb;
-	}
-out:
-	return rc;
-
-out_kfree_skb:
-	rc = -EINVAL;
-	kfree_skb(skb);
-	goto out;
-}
-
-int ax25_neigh_construct(struct neighbour *neigh)
-{
-	/* This trouble could be saved if ax25 would right a proper
-	 * dev_queue_xmit function.
-	 */
-	struct ax25_neigh_priv *priv = neighbour_priv(neigh);
-
-	if (neigh->tbl->family != AF_INET)
-		return -EINVAL;
-
-	priv->ops = *neigh->ops;
-	priv->ops.output = ax25_neigh_output;
-	priv->ops.connected_output = ax25_neigh_output;
-	return 0;
+	return NETDEV_TX_OK;
 }
 
 #else	/* INET */
@@ -271,9 +222,10 @@ static int ax25_hard_header(struct sk_buff *skb, struct net_device *dev,
 	return -AX25_HEADER_LEN;
 }
 
-int ax25_neigh_construct(struct neighbour *neigh)
+netdev_tx_t ax25_ip_xmit(struct sk_buff *skb)
 {
-	return 0;
+	kfree_skb(skb);
+	return NETDEV_TX_OK;
 }
 #endif
 
@@ -282,5 +234,5 @@ const struct header_ops ax25_header_ops = {
 };
 
 EXPORT_SYMBOL(ax25_header_ops);
-EXPORT_SYMBOL(ax25_neigh_construct);
+EXPORT_SYMBOL(ax25_ip_xmit);
 
-- 
2.2.1
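
The dispatch that each driver gains in this patch can be sketched in
userspace C as follows.  This is only a model of the decision, under stated
assumptions: the sketch_* names and the SKETCH_ETH_P_* constants are
hypothetical local copies (0x0800 for IP and 0x0002 for AX.25, matching the
values in linux/if_ether.h), and the stand-in for ax25_ip_xmit merely
re-tags the packet instead of building a real AX.25 frame.

```c
#include <arpa/inet.h>	/* htons */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_IP   0x0800
#define SKETCH_ETH_P_AX25 0x0002

enum sketch_path {
	SKETCH_PATH_IP_MUNGE,	/* handed to the ax25 IP processing */
	SKETCH_PATH_DRIVER,	/* handled by the driver's normal transmit code */
};

/* Hypothetical stand-in for an sk_buff. */
struct sketch_skb {
	uint16_t protocol;	/* network byte order, like skb->protocol */
};

/* Stand-in for ax25_ip_xmit: the sketch only marks the packet as having
 * been through the ax25 stack by re-tagging it as AX.25. */
static enum sketch_path sketch_ax25_ip_xmit(struct sketch_skb *skb)
{
	skb->protocol = htons(SKETCH_ETH_P_AX25);
	return SKETCH_PATH_IP_MUNGE;
}

/* Models the test added at the top of every ax25 ndo_start_xmit. */
static enum sketch_path sketch_driver_xmit(struct sketch_skb *skb)
{
	assert(skb != NULL);

	if (skb->protocol == htons(SKETCH_ETH_P_IP))
		return sketch_ax25_ip_xmit(skb);

	return SKETCH_PATH_DRIVER;
}
```

A packet still tagged ETH_P_IP has not been through the ax25 stack yet and
is diverted; once it is tagged ETH_P_AX25 the same transmit function treats
it as an ordinary frame.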



* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-03 15:41             ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Eric W. Biederman
@ 2015-03-03 19:45               ` David Miller
  2015-03-03 20:22                 ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-03 19:45 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, ralf, linux-hams

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 03 Mar 2015 09:41:47 -0600

> 
> Before the ax25 stack calls dev_queue_xmit it always calls
> ax25_type_trans which sets skb->protocol to ETH_P_AX25.
> 
> This means that by looking at the protocol type it is possible to
> detect IP packets that have not been munged by the ax25 stack in
> ndo_start_xmit and call a function to munge them.
> 
> Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and
> value to be appropriate for an ndo_start_xmit function.
> 
> Update all of the ax25 devices to test the protocol type for ETH_P_IP
> and return ax25_ip_xmit as the first thing they do.  This preserves
> the existing semantics of IP packet processing, but the timing will be
> a little different as the IP packets now pass through the qdisc layer
> before reaching the ax25 ip packet processing.
> 
> Remove the now unnecessary ax25 neighbour table operations.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Another nice cleanup, applied, thanks Eric.


* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-03 19:45               ` David Miller
@ 2015-03-03 20:22                 ` Eric W. Biederman
  2015-03-03 20:33                   ` David Miller
  2015-03-05 10:14                   ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Steven Whitehouse
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-03 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ralf, linux-hams

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 03 Mar 2015 09:41:47 -0600
>
>> 
>> Before the ax25 stack calls dev_queue_xmit it always calls
>> ax25_type_trans which sets skb->protocol to ETH_P_AX25.
>> 
>> Which means that by looking at the protocol type it is possible to
>> detect IP packets that have not been munged by the ax25 stack in
>> ndo_start_xmit and call a function to munge them.
>> 
>> Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and
>> value to be appropriate for an ndo_start_xmit function.
>> 
>> Update all of the ax25 devices to test the protocol type for ETH_P_IP
>> and return ax25_ip_xmit as the first thing they do.  This preserves
>> the existing semantics of IP packet processing, but the timing will be
>> a little different as the IP packets now pass through the qdisc layer
>> before reaching the ax25 ip packet processing.
>> 
>> Remove the now unnecessary ax25 neighbour table operations.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Another nice cleanup, applied, thanks Eric.

We can almost universally use the same procedures for generating
link layer headers from neighbour table entries now.  I had hoped
to optimize things by removing function pointers.

The big hold out is DECnet that sets src_mac based on the DECnet source
address.  

Which leads me to the conclusion that since DECnet has a different
algorithm for setting the src_mac than everything else in the kernel
DECnet neighbour table entries can not be used for nexthops for other
protocols :(

DECnet also abuses neigh->output to select by output device which kind
of DECnet header to put on the packets.  But that is easily fixable.

Anyway slowly slowly the neighbour table code becomes more readable
and maintainable.

Eric


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-03 20:22                 ` Eric W. Biederman
@ 2015-03-03 20:33                   ` David Miller
  2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
  2015-03-05 10:14                   ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Steven Whitehouse
  1 sibling, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-03 20:33 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, ralf, linux-hams

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 03 Mar 2015 14:22:41 -0600

> Anyway slowly slowly the neighbour table code becomes more readable
> and maintainable.

Without question, thanks again Eric.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-03 20:33                   ` David Miller
@ 2015-03-03 23:09                     ` Eric W. Biederman
  2015-03-03 23:10                       ` [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref Eric W. Biederman
                                         ` (3 more replies)
  0 siblings, 4 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-03 23:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


In preparation for using the IPv4 and IPv6 neighbour tables in my mpls
code this patchset factors out ___neigh_lookup_noref from
__ipv4_neigh_lookup_noref, __ipv6_neigh_lookup_noref and neigh_lookup,
allowing the lookup logic to be shared between the different
implementations, at what appears to be no cost (the same assembly
is generated for ip6_finish_output2 and ip_finish_output2).

After that I add a simple function that takes an address family and an
address, consults the neighbour table, and sends the packet to the
appropriate location.  The address family argument decouples callers
of neigh_xmit from the address families the packets are sent over
(so the ipv6 module can be loaded after mpls and a previously
configured ipv6 next hop will start working).

The refactoring in ___neigh_lookup_noref may be a bit overkill but it
feels like the right thing to do.  Especially since the same code is
generated.

Eric W. Biederman (2):
      neigh: Factor out ___neigh_lookup_noref
      neigh: Add helper function neigh_xmit

 include/net/arp.h       | 19 ++++-------------
 include/net/ndisc.h     | 19 +----------------
 include/net/neighbour.h | 55 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/neighbour.c    | 54 ++++++++++++++++++++++++++++++++++--------------
 net/decnet/dn_neigh.c   |  6 ++++++
 net/ipv4/arp.c          |  9 +++++++-
 net/ipv6/ndisc.c        |  7 +++++++
 7 files changed, 120 insertions(+), 49 deletions(-)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref
  2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
@ 2015-03-03 23:10                       ` Eric W. Biederman
  2015-03-04 14:53                         ` Andy Gospodarek
  2015-03-03 23:11                       ` [PATCH net-next 2/2] neigh: Add helper function neigh_xmit Eric W. Biederman
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-03 23:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


While looking at the mpls code I found myself writing yet another
version of neigh_lookup_noref.  We currently have
__ipv4_neigh_lookup_noref and __ipv6_neigh_lookup_noref.

So to make my work a little easier and to make it a smidge easier to
verify/maintain the mpls code in the future I stopped and wrote
___neigh_lookup_noref.  Then I rewrote __ipv4_neigh_lookup_noref and
__ipv6_neigh_lookup_noref in terms of this new function.  I tested my
new version by verifying that the same code is generated in
ip_finish_output2 and ip6_finish_output2 where these functions are
inlined.

To get to ___neigh_lookup_noref I added a new neighbour cache table
function key_eq, so that the static size of the key would be
available.

I also added __neigh_lookup_noref for people who want to look up
a neighbour table entry quickly but don't know which neighbour table
they are going to look up.
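
As a rough illustration of why passing the comparator and hash as
parameters costs nothing for the fixed-size callers, here is a minimal
userspace sketch.  It is not the kernel code: struct entry, struct table,
lookup_v4, lookup_any and table_insert are hypothetical stand-ins, an int
replaces the net_device pointer, and there is no RCU.  The point is only
that lookup_v4 hands compile-time constant functions to an inline helper,
while lookup_any pays for two indirect calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-ins for struct neighbour / neigh_table. */
struct entry {
	struct entry *next;
	int dev;		/* stands in for struct net_device * */
	uint8_t key[16];
};

typedef bool (*key_eq_fn)(const struct entry *e, const void *pkey);
typedef uint32_t (*hash_fn)(const void *pkey, int dev);

struct table {
	unsigned int shift;	/* log2 of the bucket count, must be >= 1 */
	struct entry **buckets;
	key_eq_fn key_eq;
	hash_fn hash;
};

/* Fixed-size comparator: the key length is a compile-time constant. */
static inline bool key_eq32(const struct entry *e, const void *pkey)
{
	return memcmp(e->key, pkey, 4) == 0;
}

static inline uint32_t hash32(const void *pkey, int dev)
{
	uint32_t key;

	memcpy(&key, pkey, sizeof(key));
	return (key ^ (uint32_t)dev) * 0x9E3779B1u;
}

/* Shared bucket walk; comparator and hash arrive as parameters. */
static inline struct entry *lookup(const struct table *tbl, key_eq_fn key_eq,
				   hash_fn hash, const void *pkey, int dev)
{
	uint32_t bucket = hash(pkey, dev) >> (32 - tbl->shift);
	struct entry *e;

	for (e = tbl->buckets[bucket]; e != NULL; e = e->next) {
		if (e->dev == dev && key_eq(e, pkey))
			return e;
	}
	return NULL;
}

/* Caller that knows its table: constant functions, so all can be inlined. */
static inline struct entry *lookup_v4(const struct table *tbl, uint32_t key,
				      int dev)
{
	return lookup(tbl, key_eq32, hash32, &key, dev);
}

/* Caller that does not know its table: goes through the table's pointers. */
static inline struct entry *lookup_any(const struct table *tbl,
				       const void *pkey, int dev)
{
	return lookup(tbl, tbl->key_eq, tbl->hash, pkey, dev);
}

/* Helper for the sketch only: push an entry onto the head of its bucket. */
static inline void table_insert(struct table *tbl, struct entry *e)
{
	uint32_t bucket = tbl->hash(e->key, e->dev) >> (32 - tbl->shift);

	e->next = tbl->buckets[bucket];
	tbl->buckets[bucket] = e;
}
```

Both entry points walk the same bucket and return the same entry; only
the way the comparator and hash are reached differs.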

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/arp.h       | 19 ++++--------------
 include/net/ndisc.h     | 19 +-----------------
 include/net/neighbour.h | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/neighbour.c    | 20 +++++--------------
 net/decnet/dn_neigh.c   |  6 ++++++
 net/ipv4/arp.c          |  9 ++++++++-
 net/ipv6/ndisc.c        |  7 +++++++
 7 files changed, 83 insertions(+), 49 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 21ee1860abbc..5e0f891d476c 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -9,28 +9,17 @@
 
 extern struct neigh_table arp_tbl;
 
-static inline u32 arp_hashfn(u32 key, const struct net_device *dev, u32 hash_rnd)
+static inline u32 arp_hashfn(const void *pkey, const struct net_device *dev, u32 *hash_rnd)
 {
+	u32 key = *(const u32 *)pkey;
 	u32 val = key ^ hash32_ptr(dev);
 
-	return val * hash_rnd;
+	return val * hash_rnd[0];
 }
 
 static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev, u32 key)
 {
-	struct neigh_hash_table *nht = rcu_dereference_bh(arp_tbl.nht);
-	struct neighbour *n;
-	u32 hash_val;
-
-	hash_val = arp_hashfn(key, dev, nht->hash_rnd[0]) >> (32 - nht->hash_shift);
-	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
-	     n != NULL;
-	     n = rcu_dereference_bh(n->next)) {
-		if (n->dev == dev && *(u32 *)n->primary_key == key)
-			return n;
-	}
-
-	return NULL;
+	return ___neigh_lookup_noref(&arp_tbl, neigh_key_eq32, arp_hashfn, &key, dev);
 }
 
 static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, u32 key)
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index 6bbda34d5e59..b3a7751251b4 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -156,24 +156,7 @@ static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, _
 
 static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device *dev, const void *pkey)
 {
-	struct neigh_hash_table *nht;
-	const u32 *p32 = pkey;
-	struct neighbour *n;
-	u32 hash_val;
-
-	nht = rcu_dereference_bh(nd_tbl.nht);
-	hash_val = ndisc_hashfn(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
-	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
-	     n != NULL;
-	     n = rcu_dereference_bh(n->next)) {
-		u32 *n32 = (u32 *) n->primary_key;
-		if (n->dev == dev &&
-		    ((n32[0] ^ p32[0]) | (n32[1] ^ p32[1]) |
-		     (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0)
-			return n;
-	}
-
-	return NULL;
+	return ___neigh_lookup_noref(&nd_tbl, neigh_key_eq128, ndisc_hashfn, pkey, dev);
 }
 
 static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, const void *pkey)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 9f912e4d4232..14e3f017966b 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -197,6 +197,7 @@ struct neigh_table {
 	__u32			(*hash)(const void *pkey,
 					const struct net_device *dev,
 					__u32 *hash_rnd);
+	bool			(*key_eq)(const struct neighbour *, const void *pkey);
 	int			(*constructor)(struct neighbour *);
 	int			(*pconstructor)(struct pneigh_entry *);
 	void			(*pdestructor)(struct pneigh_entry *);
@@ -247,6 +248,57 @@ static inline void *neighbour_priv(const struct neighbour *n)
 #define NEIGH_UPDATE_F_ISROUTER			0x40000000
 #define NEIGH_UPDATE_F_ADMIN			0x80000000
 
+
+static inline bool neigh_key_eq16(const struct neighbour *n, const void *pkey)
+{
+	return *(const u16 *)n->primary_key == *(const u16 *)pkey;
+}
+
+static inline bool neigh_key_eq32(const struct neighbour *n, const void *pkey)
+{
+	return *(const u32 *)n->primary_key == *(const u32 *)pkey;
+}
+
+static inline bool neigh_key_eq128(const struct neighbour *n, const void *pkey)
+{
+	const u32 *n32 = (const u32 *)n->primary_key;
+	const u32 *p32 = pkey;
+
+	return ((n32[0] ^ p32[0]) | (n32[1] ^ p32[1]) |
+		(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
+}
+
+static inline struct neighbour *___neigh_lookup_noref(
+	struct neigh_table *tbl,
+	bool (*key_eq)(const struct neighbour *n, const void *pkey),
+	__u32 (*hash)(const void *pkey,
+		      const struct net_device *dev,
+		      __u32 *hash_rnd),
+	const void *pkey,
+	struct net_device *dev)
+{
+	struct neigh_hash_table *nht = rcu_dereference_bh(tbl->nht);
+	struct neighbour *n;
+	u32 hash_val;
+
+	hash_val = hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
+	     n != NULL;
+	     n = rcu_dereference_bh(n->next)) {
+		if (n->dev == dev && key_eq(n, pkey))
+			return n;
+	}
+
+	return NULL;
+}
+
+static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
+						     const void *pkey,
+						     struct net_device *dev)
+{
+	return ___neigh_lookup_noref(tbl, tbl->key_eq, tbl->hash, pkey, dev);
+}
+
 void neigh_table_init(int index, struct neigh_table *tbl);
 int neigh_table_clear(int index, struct neigh_table *tbl);
 struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 0f48ea3affed..fe3c6eac5805 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -397,25 +397,15 @@ struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
 			       struct net_device *dev)
 {
 	struct neighbour *n;
-	int key_len = tbl->key_len;
-	u32 hash_val;
-	struct neigh_hash_table *nht;
 
 	NEIGH_CACHE_STAT_INC(tbl, lookups);
 
 	rcu_read_lock_bh();
-	nht = rcu_dereference_bh(tbl->nht);
-	hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
-
-	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
-	     n != NULL;
-	     n = rcu_dereference_bh(n->next)) {
-		if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
-			if (!atomic_inc_not_zero(&n->refcnt))
-				n = NULL;
-			NEIGH_CACHE_STAT_INC(tbl, hits);
-			break;
-		}
+	n = __neigh_lookup_noref(tbl, pkey, dev);
+	if (n) {
+		if (!atomic_inc_not_zero(&n->refcnt))
+			n = NULL;
+		NEIGH_CACHE_STAT_INC(tbl, hits);
 	}
 
 	rcu_read_unlock_bh();
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index f123c6c6748c..ee7d1cef0027 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -93,12 +93,18 @@ static u32 dn_neigh_hash(const void *pkey,
 	return jhash_2words(*(__u16 *)pkey, 0, hash_rnd[0]);
 }
 
+static bool dn_key_eq(const struct neighbour *neigh, const void *pkey)
+{
+	return neigh_key_eq16(neigh, pkey);
+}
+
 struct neigh_table dn_neigh_table = {
 	.family =			PF_DECnet,
 	.entry_size =			NEIGH_ENTRY_SIZE(sizeof(struct dn_neigh)),
 	.key_len =			sizeof(__le16),
 	.protocol =			cpu_to_be16(ETH_P_DNA_RT),
 	.hash =				dn_neigh_hash,
+	.key_eq =			dn_key_eq,
 	.constructor =			dn_neigh_construct,
 	.id =				"dn_neigh_cache",
 	.parms ={
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 6b8aad6a0d7d..5f5c674e130a 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -122,6 +122,7 @@
  *	Interface to generic neighbour cache.
  */
 static u32 arp_hash(const void *pkey, const struct net_device *dev, __u32 *hash_rnd);
+static bool arp_key_eq(const struct neighbour *n, const void *pkey);
 static int arp_constructor(struct neighbour *neigh);
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb);
 static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb);
@@ -154,6 +155,7 @@ struct neigh_table arp_tbl = {
 	.key_len	= 4,
 	.protocol	= cpu_to_be16(ETH_P_IP),
 	.hash		= arp_hash,
+	.key_eq		= arp_key_eq,
 	.constructor	= arp_constructor,
 	.proxy_redo	= parp_redo,
 	.id		= "arp_cache",
@@ -209,7 +211,12 @@ static u32 arp_hash(const void *pkey,
 		    const struct net_device *dev,
 		    __u32 *hash_rnd)
 {
-	return arp_hashfn(*(u32 *)pkey, dev, *hash_rnd);
+	return arp_hashfn(pkey, dev, hash_rnd);
+}
+
+static bool arp_key_eq(const struct neighbour *neigh, const void *pkey)
+{
+	return neigh_key_eq32(neigh, pkey);
 }
 
 static int arp_constructor(struct neighbour *neigh)
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index e363bbc2420d..247ad7c298f7 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -84,6 +84,7 @@ do {								\
 static u32 ndisc_hash(const void *pkey,
 		      const struct net_device *dev,
 		      __u32 *hash_rnd);
+static bool ndisc_key_eq(const struct neighbour *neigh, const void *pkey);
 static int ndisc_constructor(struct neighbour *neigh);
 static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb);
 static void ndisc_error_report(struct neighbour *neigh, struct sk_buff *skb);
@@ -119,6 +120,7 @@ struct neigh_table nd_tbl = {
 	.key_len =	sizeof(struct in6_addr),
 	.protocol =	cpu_to_be16(ETH_P_IPV6),
 	.hash =		ndisc_hash,
+	.key_eq =	ndisc_key_eq,
 	.constructor =	ndisc_constructor,
 	.pconstructor =	pndisc_constructor,
 	.pdestructor =	pndisc_destructor,
@@ -295,6 +297,11 @@ static u32 ndisc_hash(const void *pkey,
 	return ndisc_hashfn(pkey, dev, hash_rnd);
 }
 
+static bool ndisc_key_eq(const struct neighbour *n, const void *pkey)
+{
+	return neigh_key_eq128(n, pkey);
+}
+
 static int ndisc_constructor(struct neighbour *neigh)
 {
 	struct in6_addr *addr = (struct in6_addr *)&neigh->primary_key;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 2/2] neigh: Add helper function neigh_xmit
  2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
  2015-03-03 23:10                       ` [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref Eric W. Biederman
@ 2015-03-03 23:11                       ` Eric W. Biederman
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
  2015-03-04  5:25                       ` [PATCH net-next 0/2] Neighbour table prep for MPLS David Miller
  3 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-03 23:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


For MPLS I am building the code so that either the neighbour mac
address can be specified or we can have a next hop in ipv4 or ipv6.

The kind of next hop we have is indicated by the neighbour table
pointer.  A neighbour table pointer of NULL is a link layer address.
A non-NULL neighbour table pointer indicates which neighbour table and
thus which address family the next hop address is in that we need to
look up.

The code either sends a packet directly or looks up the appropriate
neighbour table entry and sends the packet.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/neighbour.h |  3 +++
 net/core/neighbour.c    | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 14e3f017966b..afb8237b0a8c 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -358,6 +358,7 @@ void neigh_for_each(struct neigh_table *tbl,
 		    void (*cb)(struct neighbour *, void *), void *cookie);
 void __neigh_for_each_release(struct neigh_table *tbl,
 			      int (*cb)(struct neighbour *));
+int neigh_xmit(int fam, struct net_device *, const void *, struct sk_buff *);
 void pneigh_for_each(struct neigh_table *tbl,
 		     void (*cb)(struct pneigh_entry *));
 
@@ -511,4 +512,6 @@ static inline void neigh_ha_snapshot(char *dst, const struct neighbour *n,
 		memcpy(dst, n->ha, dev->addr_len);
 	} while (read_seqretry(&n->ha_lock, seq));
 }
+
+
 #endif
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index fe3c6eac5805..cffaf00561e7 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2391,6 +2391,40 @@ void __neigh_for_each_release(struct neigh_table *tbl,
 }
 EXPORT_SYMBOL(__neigh_for_each_release);
 
+int neigh_xmit(int family, struct net_device *dev,
+	       const void *addr, struct sk_buff *skb)
+{
+	int err;
+	if (family == AF_PACKET) {
+		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+				      addr, NULL, skb->len);
+		if (err < 0)
+			goto out_kfree_skb;
+		err = dev_queue_xmit(skb);
+	} else {
+		struct neigh_table *tbl;
+		struct neighbour *neigh;
+
+		err = -ENETDOWN;
+		tbl = neigh_find_table(family);
+		if (!tbl)
+			goto out;
+		neigh = __neigh_lookup_noref(tbl, addr, dev);
+		if (!neigh)
+			neigh = __neigh_create(tbl, addr, dev, false);
+		err = PTR_ERR(neigh);
+		if (IS_ERR(neigh))
+			goto out_kfree_skb;
+		err = neigh->output(neigh, skb);
+	}
+out:
+	return err;
+out_kfree_skb:
+	kfree_skb(skb);
+	goto out;
+}
+EXPORT_SYMBOL(neigh_xmit);
+
 #ifdef CONFIG_PROC_FS
 
 static struct neighbour *neigh_get_first(struct seq_file *seq)
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 0/7] Basic MPLS support take 2
  2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
  2015-03-03 23:10                       ` [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref Eric W. Biederman
  2015-03-03 23:11                       ` [PATCH net-next 2/2] neigh: Add helper function neigh_xmit Eric W. Biederman
@ 2015-03-04  1:06                       ` Eric W. Biederman
  2015-03-04  1:10                         ` [PATCH net-next 1/7] mpls: Refactor how the mpls module is built Eric W. Biederman
                                           ` (7 more replies)
  2015-03-04  5:25                       ` [PATCH net-next 0/2] Neighbour table prep for MPLS David Miller
  3 siblings, 8 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


On top of my two pending neighbour table prep patches here is the mpls
support refactored to use them, and edited to not drop routes when
an interface goes down.  Additionally the addition of RTA_LLGATEWAY
has been replaced with the addition of RTA_VIA, an
attribute that includes the address family as well as the address
of the next hop.

MPLS is at its heart simple and I have endeavoured to maintain that
simplicity in my implementation.

This is an implementation of an RFC3032 forwarding engine, and basic MPLS
egress logic, which should make linux sufficient to be an mpls
forwarding node or to be an LSR (Label Switching Router) as it says in all
of the MPLS documents.  The ingress support will follow but it deserves
its own discussion so I am pushing it separately.

Eric W. Biederman (7):
      mpls: Refactor how the mpls module is built
      mpls: Basic routing support
      mpls: Add a sysctl to control the size of the mpls label table
      mpls: Basic support for adding and removing routes
      mpls: Functions for reading and wrinting mpls labels over netlink
      mpls: Netlink commands to add, remove, and dump routes
      mpls: Multicast route table change notifications

 Documentation/networking/mpls-sysctl.txt |  20 +
 include/linux/socket.h                   |   2 +
 include/net/net_namespace.h              |   4 +
 include/net/netns/mpls.h                 |  17 +
 include/uapi/linux/rtnetlink.h           |  10 +
 net/Makefile                             |   2 +-
 net/mpls/Kconfig                         |  23 +-
 net/mpls/Makefile                        |   1 +
 net/mpls/af_mpls.c                       | 974 +++++++++++++++++++++++++++++++
 net/mpls/internal.h                      |  59 ++
 10 files changed, 1110 insertions(+), 2 deletions(-)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH net-next 1/7] mpls: Refactor how the mpls module is built
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
@ 2015-03-04  1:10                         ` Eric W. Biederman
  2015-03-04  1:10                         ` [PATCH net-next 2/7] mpls: Basic routing support Eric W. Biederman
                                           ` (6 subsequent siblings)
  7 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


This refactoring is needed to allow more than just mpls gso
support to be built into the mpls module.

Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/Makefile     |  2 +-
 net/mpls/Kconfig | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/net/Makefile b/net/Makefile
index 38704bdf941a..3995613e5510 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -69,7 +69,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
-obj-$(CONFIG_NET_MPLS_GSO)	+= mpls/
+obj-$(CONFIG_MPLS)		+= mpls/
 obj-$(CONFIG_HSR)		+= hsr/
 ifneq ($(CONFIG_NET_SWITCHDEV),)
 obj-y				+= switchdev/
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index 37421db88965..a77fbcdd04ee 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -1,9 +1,25 @@
 #
 # MPLS configuration
 #
+
+menuconfig MPLS
+	tristate "MultiProtocol Label Switching"
+	default n
+	---help---
+	  MultiProtocol Label Switching routes packets through logical
+	  circuits.  Originally conceived as a way of routing packets at
+	  hardware speeds (before hardware was capable of routing ipv4 packets),
+	  MPLS remains a simple way of making tunnels.
+
+	  If you have not heard of MPLS you probably want to say N here.
+
+if MPLS
+
 config NET_MPLS_GSO
-	tristate "MPLS: GSO support"
+	bool "MPLS: GSO support"
 	help
 	 This is helper module to allow segmentation of non-MPLS GSO packets
 	 that have had MPLS stack entries pushed onto them and thus
 	 become MPLS GSO packets.
+
+endif # MPLS
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 2/7] mpls: Basic routing support
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
  2015-03-04  1:10                         ` [PATCH net-next 1/7] mpls: Refactor how the mpls module is built Eric W. Biederman
@ 2015-03-04  1:10                         ` Eric W. Biederman
  2015-03-05 16:36                           ` Vivek Venkatraman
  2015-03-04  1:11                         ` [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
                                           ` (5 subsequent siblings)
  7 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


This change adds a new Kconfig option MPLS_ROUTING.

The core of this change is the code to look at an mpls packet received
from another machine, look that packet up in a routing table, and
forward the packet on.

Support of MPLS over ATM is not considered or attempted here.  This
implementation follows RFC3032 and implements the MPLS shim header that
can pass over essentially any network.

What RFC3031 refers to as the Incoming Label Map (ILM) I call
net->mpls.platform_label[].  What RFC3031 refers to as the Next Hop
Label Forwarding Entry (NHLFE) I call mpls_route.  Though calling it the
label forwarding information base (lfib) might also be valid.

Further the implementation forwards packets as described in RFC3032.
There is no need and given the original motivation for MPLS a strong
disincentive to have a flexible label forwarding path.  In essence
the logic is the topmost label is read, looked up, removed, and
replaced by 0 or more new labels and then sent out the specified
interface to its next hop.
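
The read, look up, remove, replace sequence can be sketched in userspace
roughly as follows.  This is only an illustration under assumptions: the
label stack is modelled as an array with the top label at index 0, and
struct route, struct label_stack and forward() are hypothetical names,
not the structures in af_mpls.c.  The TTL handling (drop when the TTL is
1 or less, otherwise decrement) matches the drop-instead-of-ICMP
behaviour described in the next paragraph.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NEW_LABELS	2
#define MAX_STACK	8

/* Hypothetical per-label forwarding entry (an NHLFE in RFC3031 terms). */
struct route {
	int out_ifindex;		/* interface to send the packet out */
	uint8_t n_labels;		/* 0 means pop only */
	uint32_t labels[MAX_NEW_LABELS];/* labels[0] becomes the new top */
};

/* Hypothetical model of a packet's label stack, label[0] is the top. */
struct label_stack {
	size_t depth;
	uint32_t label[MAX_STACK];
};

/*
 * Returns the output interface index, or -1 if the packet is dropped.
 * ilm is the incoming label map: an array indexed by label value.
 */
static int forward(const struct route *const *ilm, size_t ilm_size,
		   struct label_stack *st, uint8_t *ttl)
{
	const struct route *rt;
	size_t rest;

	if (st->depth == 0)
		return -1;

	/* Read the topmost label and look it up. */
	if (st->label[0] >= ilm_size)
		return -1;
	rt = ilm[st->label[0]];
	if (rt == NULL)
		return -1;

	/* Drop instead of forwarding once the TTL is used up. */
	if (*ttl <= 1)
		return -1;
	*ttl -= 1;

	/* Remove the top label and replace it with 0 or more new ones. */
	rest = st->depth - 1;
	if (rest + rt->n_labels > MAX_STACK)
		return -1;
	memmove(&st->label[rt->n_labels], &st->label[1],
		rest * sizeof(st->label[0]));
	memcpy(&st->label[0], rt->labels,
	       rt->n_labels * sizeof(st->label[0]));
	st->depth = rest + rt->n_labels;

	return rt->out_ifindex;
}
```

A route with n_labels of 0 is a plain pop, one label is a swap, and two
labels is a swap plus a push.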

Quite a few optional features are not implemented here.  Among them
are generation of ICMP errors when the TTL is exceeded or the packet
is larger than the next hop MTU (those conditions are detected and the
packets are dropped instead of generating an icmp error).  The traffic
class field is always set to 0.  The implementation focuses on IP over
MPLS and does not handle egress of other kinds of protocols.

Instead of implementing coordination with the neighbour table and
sorting out how to input next hops in a different address family (for
which there is value), I was lazy and implemented a next hop mac
address instead.  The code is simpler and there are flavors of MPLS
such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
appropriate so a next hop by mac address would need to be implemented
at some point.

Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.

Decoding the mpls header must be done by first byteswapping a 32bit big
endian word into the local cpu endian and then bit shifting to extract
the pieces.  There is no C bit-field that can represent a wire format
mpls header on a little endian machine as the low bits of the 20bit
label wind up in the wrong half of the third byte.  Therefore internally
everything is dealt with in cpu native byte order except when writing
to and reading from a packet.
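
For illustration, a minimal userspace sketch of that decode, assuming the
RFC3032 layout of a label stack entry (20 bit label, 3 bit traffic class,
1 bit bottom of stack, 8 bit TTL).  The names shim_decoded, load_be32,
shim_decode and shim_encode are made up for this example and are not the
helpers in internal.h.

```c
#include <stdint.h>

/* Hypothetical decoded form of one label stack entry, cpu byte order. */
struct shim_decoded {
	uint32_t label;	/* 20 bits */
	uint8_t tc;	/* 3 bits, traffic class */
	uint8_t bos;	/* 1 bit, bottom of stack */
	uint8_t ttl;	/* 8 bits */
};

/* Assemble a big endian 32bit word from the wire, on any host. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Byteswap first, then shift and mask out the individual fields. */
static struct shim_decoded shim_decode(const uint8_t wire[4])
{
	uint32_t entry = load_be32(wire);
	struct shim_decoded dec;

	dec.label = (entry & 0xFFFFF000u) >> 12;
	dec.tc = (uint8_t)((entry & 0x00000E00u) >> 9);
	dec.bos = (uint8_t)((entry & 0x00000100u) >> 8);
	dec.ttl = (uint8_t)(entry & 0x000000FFu);
	return dec;
}

/* The reverse: build the word in cpu order, then store it big endian. */
static void shim_encode(uint8_t wire[4], struct shim_decoded dec)
{
	uint32_t entry = ((dec.label & 0xFFFFFu) << 12) |
			 ((uint32_t)(dec.tc & 0x7u) << 9) |
			 (dec.bos ? 0x100u : 0u) |
			 (uint32_t)dec.ttl;

	wire[0] = (uint8_t)(entry >> 24);
	wire[1] = (uint8_t)(entry >> 16);
	wire[2] = (uint8_t)(entry >> 8);
	wire[3] = (uint8_t)entry;
}
```

The label's low four bits share the third wire byte with the traffic
class and bottom of stack bits, which is why the fields are pulled out
with shifts on the swapped word rather than with a bit-field.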

For management simplicity if a label is configured to forward out
an interface that is down the packet is dropped early.  Similarly
if a network interface is removed rt_dev is updated to NULL
(so no reference is preserved) and any packets for that label
are dropped.  Keeping the label entries in the kernel allows
the kernel label table to function as the definitive source
of which labels are allocated and which are not.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/socket.h      |   2 +
 include/net/net_namespace.h |   4 +
 include/net/netns/mpls.h    |  15 ++
 net/mpls/Kconfig            |   5 +
 net/mpls/Makefile           |   1 +
 net/mpls/af_mpls.c          | 349 ++++++++++++++++++++++++++++++++++++++++++++
 net/mpls/internal.h         |  56 +++++++
 7 files changed, 432 insertions(+)
 create mode 100644 include/net/netns/mpls.h
 create mode 100644 net/mpls/af_mpls.c
 create mode 100644 net/mpls/internal.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5c19cba34dce..fab4d0ddf4ed 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -181,6 +181,7 @@ struct ucred {
 #define AF_WANPIPE	25	/* Wanpipe API Sockets */
 #define AF_LLC		26	/* Linux LLC			*/
 #define AF_IB		27	/* Native InfiniBand address	*/
+#define AF_MPLS		28	/* MPLS */
 #define AF_CAN		29	/* Controller Area Network      */
 #define AF_TIPC		30	/* TIPC sockets			*/
 #define AF_BLUETOOTH	31	/* Bluetooth sockets 		*/
@@ -226,6 +227,7 @@ struct ucred {
 #define PF_WANPIPE	AF_WANPIPE
 #define PF_LLC		AF_LLC
 #define PF_IB		AF_IB
+#define PF_MPLS		AF_MPLS
 #define PF_CAN		AF_CAN
 #define PF_TIPC		AF_TIPC
 #define PF_BLUETOOTH	AF_BLUETOOTH
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 36faf4990c4b..2cb9acb618e9 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -26,6 +26,7 @@
 #endif
 #include <net/netns/nftables.h>
 #include <net/netns/xfrm.h>
+#include <net/netns/mpls.h>
 #include <linux/ns_common.h>
 
 struct user_namespace;
@@ -130,6 +131,9 @@ struct net {
 #if IS_ENABLED(CONFIG_IP_VS)
 	struct netns_ipvs	*ipvs;
 #endif
+#if IS_ENABLED(CONFIG_MPLS)
+	struct netns_mpls	mpls;
+#endif
 	struct sock		*diag_nlsk;
 	atomic_t		fnhe_genid;
 };
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
new file mode 100644
index 000000000000..f90aaf8d4f89
--- /dev/null
+++ b/include/net/netns/mpls.h
@@ -0,0 +1,15 @@
+/*
+ * mpls in net namespaces
+ */
+
+#ifndef __NETNS_MPLS_H__
+#define __NETNS_MPLS_H__
+
+struct mpls_route;
+
+struct netns_mpls {
+	size_t platform_labels;
+	struct mpls_route __rcu * __rcu *platform_label;
+};
+
+#endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/Kconfig b/net/mpls/Kconfig
index a77fbcdd04ee..f4286ee7e2b0 100644
--- a/net/mpls/Kconfig
+++ b/net/mpls/Kconfig
@@ -22,4 +22,9 @@ config NET_MPLS_GSO
 	 that have had MPLS stack entries pushed onto them and thus
 	 become MPLS GSO packets.
 
+config MPLS_ROUTING
+	bool "MPLS: routing support"
+	help
+	 Add support for forwarding of mpls packets.
+
 endif # MPLS
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 6dec088c2d0f..60af15f1960e 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -2,3 +2,4 @@
 # Makefile for MPLS.
 #
 obj-$(CONFIG_NET_MPLS_GSO) += mpls_gso.o
+obj-$(CONFIG_MPLS_ROUTING) += af_mpls.o
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
new file mode 100644
index 000000000000..924377736b2a
--- /dev/null
+++ b/net/mpls/af_mpls.c
@@ -0,0 +1,349 @@
+#include <linux/types.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <linux/module.h>
+#include <linux/if_arp.h>
+#include <linux/ipv6.h>
+#include <linux/mpls.h>
+#include <net/ip.h>
+#include <net/dst.h>
+#include <net/sock.h>
+#include <net/arp.h>
+#include <net/ip_fib.h>
+#include <net/netevent.h>
+#include <net/netns/generic.h>
+#include "internal.h"
+
+#define MAX_NEW_LABELS 2
+
+/* This maximum ha length copied from the definition of struct neighbour */
+#define MAX_VIA_ALEN (ALIGN(MAX_ADDR_LEN, sizeof(unsigned long)))
+
+struct mpls_route { /* next hop label forwarding entry */
+	struct net_device 	*rt_dev;
+	struct rcu_head		rt_rcu;
+	u32			rt_label[MAX_NEW_LABELS];
+	u8			rt_protocol; /* routing protocol that set this entry */
+	u8			rt_labels:2,
+				rt_via_alen:6;
+	unsigned short		rt_via_family;
+	u8			rt_via[0];
+};
+
+static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
+{
+	struct mpls_route *rt = NULL;
+
+	if (index < net->mpls.platform_labels) {
+		struct mpls_route __rcu **platform_label =
+			rcu_dereference(net->mpls.platform_label);
+		rt = rcu_dereference(platform_label[index]);
+	}
+	return rt;
+}
+
+static bool mpls_output_possible(const struct net_device *dev)
+{
+	return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
+}
+
+static unsigned int mpls_rt_header_size(const struct mpls_route *rt)
+{
+	/* The size of the layer 2.5 labels to be added for this route */
+	return rt->rt_labels * sizeof(struct mpls_shim_hdr);
+}
+
+static unsigned int mpls_dev_mtu(const struct net_device *dev)
+{
+	/* The amount of data the layer 2 frame can hold */
+	return dev->mtu;
+}
+
+static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+{
+	if (skb->len <= mtu)
+		return false;
+
+	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+		return false;
+
+	return true;
+}
+
+static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
+			struct mpls_entry_decoded dec)
+{
+	/* RFC4385 and RFC5586 encode other packets in mpls such that
+	 * they don't conflict with the ip version number, making
+	 * decoding by examining the ip version correct in everything
+	 * except for the strangest cases.
+	 *
+	 * The strange cases if we choose to support them will require
+	 * manual configuration.
+	 */
+	struct iphdr *hdr4 = ip_hdr(skb);
+	bool success = true;
+
+	if (hdr4->version == 4) {
+		skb->protocol = htons(ETH_P_IP);
+		csum_replace2(&hdr4->check,
+			      htons(hdr4->ttl << 8),
+			      htons(dec.ttl << 8));
+		hdr4->ttl = dec.ttl;
+	}
+	else if (hdr4->version == 6) {
+		struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+		skb->protocol = htons(ETH_P_IPV6);
+		hdr6->hop_limit = dec.ttl;
+	}
+	else
+		/* version 0 and version 1 are used by pseudo wires */
+		success = false;
+	return success;
+}
+
+static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
+			struct packet_type *pt, struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+	struct mpls_shim_hdr *hdr;
+	struct mpls_route *rt;
+	struct mpls_entry_decoded dec;
+	struct net_device *out_dev;
+	unsigned int hh_len;
+	unsigned int new_header_size;
+	unsigned int mtu;
+	int err;
+
+	/* Careful this entire function runs inside of an rcu critical section */
+
+	if (skb->pkt_type != PACKET_HOST)
+		goto drop;
+
+	if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
+		goto drop;
+
+	if (!pskb_may_pull(skb, sizeof(*hdr)))
+		goto drop;
+
+	/* Read and decode the label */
+	hdr = mpls_hdr(skb);
+	dec = mpls_entry_decode(hdr);
+
+	/* Pop the label */
+	skb_pull(skb, sizeof(*hdr));
+	skb_reset_network_header(skb);
+
+	skb_orphan(skb);
+
+	rt = mpls_route_input_rcu(net, dec.label);
+	if (!rt)
+		goto drop;
+
+	/* Find the output device */
+	out_dev = rt->rt_dev;
+	if (!mpls_output_possible(out_dev))
+		goto drop;
+
+	if (skb_warn_if_lro(skb))
+		goto drop;
+
+	skb_forward_csum(skb);
+
+	/* Verify ttl is valid */
+	if (dec.ttl <= 1)
+		goto drop;
+	dec.ttl -= 1;
+
+	/* Verify the destination can hold the packet */
+	new_header_size = mpls_rt_header_size(rt);
+	mtu = mpls_dev_mtu(out_dev);
+	if (mpls_pkt_too_big(skb, mtu - new_header_size))
+		goto drop;
+
+	hh_len = LL_RESERVED_SPACE(out_dev);
+	if (!out_dev->header_ops)
+		hh_len = 0;
+
+	/* Ensure there is enough space for the headers in the skb */
+	if (skb_cow(skb, hh_len + new_header_size))
+		goto drop;
+
+	skb->dev = out_dev;
+	skb->protocol = htons(ETH_P_MPLS_UC);
+
+	if (unlikely(!new_header_size && dec.bos)) {
+		/* Penultimate hop popping */
+		if (!mpls_egress(rt, skb, dec))
+			goto drop;
+	} else {
+		bool bos;
+		int i;
+		skb_push(skb, new_header_size);
+		skb_reset_network_header(skb);
+		/* Push the new labels */
+		hdr = mpls_hdr(skb);
+		bos = dec.bos;
+		for (i = rt->rt_labels - 1; i >= 0; i--) {
+			hdr[i] = mpls_entry_encode(rt->rt_label[i], dec.ttl, 0, bos);
+			bos = false;
+		}
+	}
+
+	err = neigh_xmit(rt->rt_via_family, out_dev, rt->rt_via, skb);
+	if (err)
+		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
+				    __func__, err);
+	return 0;
+
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+
+static struct packet_type mpls_packet_type __read_mostly = {
+	.type = cpu_to_be16(ETH_P_MPLS_UC),
+	.func = mpls_forward,
+};
+
+static struct mpls_route *mpls_rt_alloc(size_t alen)
+{
+	struct mpls_route *rt;
+
+	rt = kzalloc(sizeof(*rt) + alen, GFP_KERNEL);
+	if (rt)
+		rt->rt_via_alen = alen;
+	return rt;
+}
+
+static void mpls_rt_free(struct mpls_route *rt)
+{
+	if (rt)
+		kfree_rcu(rt, rt_rcu);
+}
+
+static void mpls_route_update(struct net *net, unsigned index,
+			      struct net_device *dev, struct mpls_route *new,
+			      const struct nl_info *info)
+{
+	struct mpls_route *rt, *old = NULL;
+
+	ASSERT_RTNL();
+
+	rt = net->mpls.platform_label[index];
+	if (!dev || (rt && (rt->rt_dev == dev))) {
+		rcu_assign_pointer(net->mpls.platform_label[index], new);
+		old = rt;
+	}
+
+	/* If we removed a route free it now */
+	mpls_rt_free(old);
+}
+
+static void mpls_ifdown(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	unsigned index;
+
+	for (index = 0; index < net->mpls.platform_labels; index++) {
+		struct mpls_route *rt = net->mpls.platform_label[index];
+		if (!rt)
+			continue;
+		if (rt->rt_dev != dev)
+			continue;
+		rt->rt_dev = NULL;
+	}
+}
+
+static int mpls_dev_notify(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	switch(event) {
+	case NETDEV_UNREGISTER:
+		mpls_ifdown(dev);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block mpls_dev_notifier = {
+	.notifier_call = mpls_dev_notify,
+};
+
+static int mpls_net_init(struct net *net)
+{
+	net->mpls.platform_labels = 0;
+	net->mpls.platform_label = NULL;
+
+	return 0;
+}
+
+static void mpls_net_exit(struct net *net)
+{
+	unsigned int index;
+
+	/* An rcu grace period has elapsed since there was a device in
+	 * the network namespace (and thus the last in flight packet)
+	 * left this network namespace.  This is because
+	 * unregister_netdevice_many and netdev_run_todo has completed
+	 * for each network device that was in this network namespace.
+	 *
+	 * As such no additional rcu synchronization is necessary when
+	 * freeing the platform_label table.
+	 */
+	rtnl_lock();
+	for (index = 0; index < net->mpls.platform_labels; index++) {
+		struct mpls_route *rt = net->mpls.platform_label[index];
+		rcu_assign_pointer(net->mpls.platform_label[index], NULL);
+		mpls_rt_free(rt);
+	}
+	rtnl_unlock();
+
+	kvfree(net->mpls.platform_label);
+}
+
+static struct pernet_operations mpls_net_ops = {
+	.init = mpls_net_init,
+	.exit = mpls_net_exit,
+};
+
+static int __init mpls_init(void)
+{
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct mpls_shim_hdr) != 4);
+
+	err = register_pernet_subsys(&mpls_net_ops);
+	if (err)
+		goto out;
+
+	err = register_netdevice_notifier(&mpls_dev_notifier);
+	if (err)
+		goto out_unregister_pernet;
+
+	dev_add_pack(&mpls_packet_type);
+
+	err = 0;
+out:
+	return err;
+
+out_unregister_pernet:
+	unregister_pernet_subsys(&mpls_net_ops);
+	goto out;
+}
+module_init(mpls_init);
+
+static void __exit mpls_exit(void)
+{
+	dev_remove_pack(&mpls_packet_type);
+	unregister_netdevice_notifier(&mpls_dev_notifier);
+	unregister_pernet_subsys(&mpls_net_ops);
+}
+module_exit(mpls_exit);
+
+MODULE_DESCRIPTION("MultiProtocol Label Switching");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_NETPROTO(PF_MPLS);
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
new file mode 100644
index 000000000000..c2944cb84d48
--- /dev/null
+++ b/net/mpls/internal.h
@@ -0,0 +1,56 @@
+#ifndef MPLS_INTERNAL_H
+#define MPLS_INTERNAL_H
+
+#define LABEL_IPV4_EXPLICIT_NULL	0 /* RFC3032 */
+#define LABEL_ROUTER_ALERT_LABEL	1 /* RFC3032 */
+#define LABEL_IPV6_EXPLICIT_NULL	2 /* RFC3032 */
+#define LABEL_IMPLICIT_NULL		3 /* RFC3032 */
+#define LABEL_ENTROPY_INDICATOR		7 /* RFC6790 */
+#define LABEL_GAL			13 /* RFC5586 */
+#define LABEL_OAM_ALERT			14 /* RFC3429 */
+#define LABEL_EXTENSION			15 /* RFC7274 */
+
+
+struct mpls_shim_hdr {
+	__be32 label_stack_entry;
+};
+
+struct mpls_entry_decoded {
+	u32 label;
+	u8 ttl;
+	u8 tc;
+	u8 bos;
+};
+
+struct sk_buff;
+
+static inline struct mpls_shim_hdr *mpls_hdr(const struct sk_buff *skb)
+{
+	return (struct mpls_shim_hdr *)skb_network_header(skb);
+}
+
+static inline struct mpls_shim_hdr mpls_entry_encode(u32 label, unsigned ttl, unsigned tc, bool bos)
+{
+	struct mpls_shim_hdr result;
+	result.label_stack_entry =
+		cpu_to_be32((label << MPLS_LS_LABEL_SHIFT) |
+			    (tc << MPLS_LS_TC_SHIFT) |
+			    (bos ? (1 << MPLS_LS_S_SHIFT) : 0) |
+			    (ttl << MPLS_LS_TTL_SHIFT));
+	return result;
+}
+
+static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *hdr)
+{
+	struct mpls_entry_decoded result;
+	unsigned entry = be32_to_cpu(hdr->label_stack_entry);
+
+	result.label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;
+	result.ttl = (entry & MPLS_LS_TTL_MASK) >> MPLS_LS_TTL_SHIFT;
+	result.tc =  (entry & MPLS_LS_TC_MASK) >> MPLS_LS_TC_SHIFT;
+	result.bos = (entry & MPLS_LS_S_MASK) >> MPLS_LS_S_SHIFT;
+
+	return result;
+}
+
+#endif /* MPLS_INTERNAL_H */
-- 
2.2.1


* [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
  2015-03-04  1:10                         ` [PATCH net-next 1/7] mpls: Refactor how the mpls module is built Eric W. Biederman
  2015-03-04  1:10                         ` [PATCH net-next 2/7] mpls: Basic routing support Eric W. Biederman
@ 2015-03-04  1:11                         ` Eric W. Biederman
  2015-03-05  9:45                           ` Vivek Venkatraman
  2015-03-04  1:12                         ` [PATCH net-next 4/7] mpls: Basic support for adding and removing routes Eric W. Biederman
                                           ` (4 subsequent siblings)
  7 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


This sysctl gives two benefits.  By defaulting the table size to 0
mpls even when compiled in and enabled defaults to not forwarding
any packets.  This prevents unpleasant surprises for users.

The other benefit is that as mpls labels are allocated locally a small
dense label table may be used, which saves memory and is extremely
simple and efficient to implement.

This sysctl allows userspace to choose the restrictions on the label
table size userspace applications need to cope with.
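
A rough userspace sketch of the dense-table idea described above, not the
kernel code: the names label_table, table_resize and table_lookup are made
up for illustration, and the "routes" are plain strings.  It only shows why
a table size of 0 disables forwarding and why a lookup is a bounds check
plus an array index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the per-namespace label table. */
struct label_table {
	size_t nlabels;       /* plays the role of the platform_labels sysctl */
	const char **route;   /* one slot per label value; NULL means no route */
};

/* Resize the table, keeping only the entries that still fit.
 * Returns 0 on success and -1 if memory could not be allocated.
 */
static int table_resize(struct label_table *t, size_t nlabels)
{
	const char **route = NULL;
	size_t i, keep;

	assert(t != NULL);
	if (nlabels) {
		route = calloc(nlabels, sizeof(*route));
		if (!route)
			return -1;
	}
	keep = t->nlabels < nlabels ? t->nlabels : nlabels;
	for (i = 0; i < keep; i++)
		route[i] = t->route[i];
	free(t->route);
	t->route = route;
	t->nlabels = nlabels;
	return 0;
}

/* With nlabels == 0 every label misses, so nothing is forwarded. */
static const char *table_lookup(const struct label_table *t, unsigned label)
{
	if (label >= t->nlabels)
		return NULL;
	return t->route[label];
}
```

Shrinking the table in this sketch silently drops the entries that no
longer fit, which mirrors the behaviour the sysctl documentation describes.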

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/networking/mpls-sysctl.txt |  20 +++++
 include/net/netns/mpls.h                 |   2 +
 net/mpls/af_mpls.c                       | 146 +++++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+)
 create mode 100644 Documentation/networking/mpls-sysctl.txt

diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
new file mode 100644
index 000000000000..639ddf0ece9b
--- /dev/null
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -0,0 +1,20 @@
+/proc/sys/net/mpls/* Variables:
+
+platform_labels - INTEGER
+	Number of entries in the platform label table.  It is not
+	possible to configure forwarding for label values equal to or
+	greater than the number of platform labels.
+
+	A dense utilization of the entries in the platform label table
+	is possible and expected as the platform labels are locally
+	allocated.
+
+	If the number of platform label table entries is set to 0 no
+	label will be recognized by the kernel and mpls forwarding
+	will be disabled.
+
+	Reducing this value will remove all label routing entries that
+	no longer fit in the table.
+
+	Possible values: 0 - 1048575
+	Default: 0
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index f90aaf8d4f89..d29203651c01 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -6,10 +6,12 @@
 #define __NETNS_MPLS_H__
 
 struct mpls_route;
+struct ctl_table_header;
 
 struct netns_mpls {
 	size_t platform_labels;
 	struct mpls_route __rcu * __rcu *platform_label;
+	struct ctl_table_header *ctl;
 };
 
 #endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 924377736b2a..b097125dfa33 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1,6 +1,7 @@
 #include <linux/types.h>
 #include <linux/skbuff.h>
 #include <linux/socket.h>
+#include <linux/sysctl.h>
 #include <linux/net.h>
 #include <linux/module.h>
 #include <linux/if_arp.h>
@@ -31,6 +32,9 @@ struct mpls_route { /* next hop label forwarding entry */
 	u8			rt_via[0];
 };
 
+static int zero = 0;
+static int label_limit = (1 << 20) - 1;
+
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
@@ -273,18 +277,160 @@ static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+static int resize_platform_label_table(struct net *net, size_t limit)
+{
+	size_t size = sizeof(struct mpls_route *) * limit;
+	size_t old_limit;
+	size_t cp_size;
+	struct mpls_route __rcu **labels = NULL, **old;
+	struct mpls_route *rt0 = NULL, *rt2 = NULL;
+	unsigned index;
+
+	if (size) {
+		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+		if (!labels)
+			labels = vzalloc(size);
+
+		if (!labels)
+			goto nolabels;
+	}
+
+	/* In case the predefined labels need to be populated */
+	if (limit > LABEL_IPV4_EXPLICIT_NULL) {
+		struct net_device *lo = net->loopback_dev;
+		rt0 = mpls_rt_alloc(lo->addr_len);
+		if (!rt0)
+			goto nort0;
+		rt0->rt_dev = lo;
+		rt0->rt_protocol = RTPROT_KERNEL;
+		rt0->rt_via_family = AF_PACKET;
+		memcpy(rt0->rt_via, lo->dev_addr, lo->addr_len);
+	}
+	if (limit > LABEL_IPV6_EXPLICIT_NULL) {
+		struct net_device *lo = net->loopback_dev;
+		rt2 = mpls_rt_alloc(lo->addr_len);
+		if (!rt2)
+			goto nort2;
+		rt2->rt_dev = lo;
+		rt2->rt_protocol = RTPROT_KERNEL;
+		rt2->rt_via_family = AF_PACKET;
+		memcpy(rt2->rt_via, lo->dev_addr, lo->addr_len);
+	}
+
+	rtnl_lock();
+	/* Remember the original table */
+	old = net->mpls.platform_label;
+	old_limit = net->mpls.platform_labels;
+
+	/* Free any labels beyond the new table */
+	for (index = limit; index < old_limit; index++)
+		mpls_route_update(net, index, NULL, NULL, NULL);
+
+	/* Copy over the old labels */
+	cp_size = size;
+	if (old_limit < limit)
+		cp_size = old_limit * sizeof(struct mpls_route *);
+
+	memcpy(labels, old, cp_size);
+
+	/* If needed set the predefined labels */
+	if ((old_limit <= LABEL_IPV6_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV6_EXPLICIT_NULL)) {
+		labels[LABEL_IPV6_EXPLICIT_NULL] = rt2;
+		rt2 = NULL;
+	}
+
+	if ((old_limit <= LABEL_IPV4_EXPLICIT_NULL) &&
+	    (limit > LABEL_IPV4_EXPLICIT_NULL)) {
+		labels[LABEL_IPV4_EXPLICIT_NULL] = rt0;
+		rt0 = NULL;
+	}
+
+	/* Update the global pointers */
+	net->mpls.platform_labels = limit;
+	net->mpls.platform_label = labels;
+
+	rtnl_unlock();
+
+	mpls_rt_free(rt2);
+	mpls_rt_free(rt0);
+
+	if (old) {
+		synchronize_rcu();
+		kvfree(old);
+	}
+	return 0;
+
+nort2:
+	mpls_rt_free(rt0);
+nort0:
+	kvfree(labels);
+nolabels:
+	return -ENOMEM;
+}
+
+static int mpls_platform_labels(struct ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = table->data;
+	int platform_labels = net->mpls.platform_labels;
+	int ret;
+	struct ctl_table tmp = {
+		.procname	= table->procname,
+		.data		= &platform_labels,
+		.maxlen		= sizeof(int),
+		.mode		= table->mode,
+		.extra1		= &zero,
+		.extra2		= &label_limit,
+	};
+
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
+	if (write && ret == 0)
+		ret = resize_platform_label_table(net, platform_labels);
+
+	return ret;
+}
+
+static struct ctl_table mpls_table[] = {
+	{
+		.procname	= "platform_labels",
+		.data		= NULL,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= mpls_platform_labels,
+	},
+	{ }
+};
+
 static int mpls_net_init(struct net *net)
 {
+	struct ctl_table *table;
+
 	net->mpls.platform_labels = 0;
 	net->mpls.platform_label = NULL;
 
+	table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL);
+	if (table == NULL)
+		return -ENOMEM;
+
+	table[0].data = net;
+	net->mpls.ctl = register_net_sysctl(net, "net/mpls", table);
+	if (net->mpls.ctl == NULL)
+		return -ENOMEM;
+
 	return 0;
 }
 
 static void mpls_net_exit(struct net *net)
 {
+	struct ctl_table *table;
 	unsigned int index;
 
+	table = net->mpls.ctl->ctl_table_arg;
+	unregister_net_sysctl_table(net->mpls.ctl);
+	kfree(table);
+
 	/* An rcu grace period has elapsed since there was a device in
 	 * the network namespace (and thus the last in flight packet)
 	 * left this network namespace.  This is because
-- 
2.2.1


* [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
                                           ` (2 preceding siblings ...)
  2015-03-04  1:11                         ` [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
@ 2015-03-04  1:12                         ` Eric W. Biederman
  2015-03-04  8:13                           ` roopa
  2015-03-04  1:13                         ` [PATCH net-next 5/7] mpls: Functions for reading and wrinting mpls labels over netlink Eric W. Biederman
                                           ` (3 subsequent siblings)
  7 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


mpls_route_add and mpls_route_del implement the basic logic for adding
and removing Next Hop Label Forwarding Entries from the MPLS input
label map.  The addition and removal are done in a way that is
consistent with how the existing routing tables in Linux are
maintained, hence all of the work to deal with NLM_F_APPEND,
NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
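
The flag handling can be sketched in userspace roughly as follows.  This is
an illustration only: route_add_check is a made-up name, and the F_*
constants are local copies of the NLM_F_* values so the sketch builds
without kernel headers.  The order of the checks follows mpls_route_add in
the patch below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Local copies of the netlink modifier flags (NLM_F_REPLACE and friends). */
#define F_REPLACE 0x100
#define F_EXCL    0x200
#define F_CREATE  0x400
#define F_APPEND  0x800

/* Decide whether an add request may proceed, given whether the label
 * already has a route.  Returns 0 or a negative errno value.
 */
static int route_add_check(unsigned flags, bool exists)
{
	if (flags & F_APPEND)                   /* append makes no sense here */
		return -EINVAL;
	if ((flags & F_EXCL) && exists)         /* caller insisted on a new entry */
		return -EEXIST;
	if (!(flags & F_REPLACE) && exists)     /* overwrite was not requested */
		return -EEXIST;
	if (!(flags & F_CREATE) && !exists)     /* nothing there to change */
		return -ENOENT;
	return 0;
}

/* The usual flag combinations sent by routing tools. */
static void route_add_check_examples(void)
{
	/* "add": create only if absent */
	assert(route_add_check(F_CREATE | F_EXCL, false) == 0);
	assert(route_add_check(F_CREATE | F_EXCL, true) == -EEXIST);
	/* "replace": create or overwrite */
	assert(route_add_check(F_CREATE | F_REPLACE, true) == 0);
	/* "change": overwrite only */
	assert(route_add_check(F_REPLACE, false) == -ENOENT);
}
```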

Cases that are not clearly defined, such as changing the interpretation
of the mpls reserved labels, are not allowed.

Because it seems like the right thing to do, adding an MPLS route without
specifying an input label and allowing the kernel to pick a free label
table entry is supported.  The implementation is currently less than optimal
but that can be changed.
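
For illustration, the label picking amounts to a linear scan over the
unreserved part of the table.  The sketch below is a userspace
approximation with a made-up name (pick_free_label) and a plain pointer
array standing in for the label table; it also shows why the approach is
less than optimal, since every allocation walks the table from label 16.

```c
#include <assert.h>
#include <stddef.h>

#define FIRST_UNRESERVED_LABEL 16u     /* labels 0-15 are reserved */
#define LABEL_NOT_SPECIFIED (1u << 20) /* one past the 20 bit label space */

/* Return the lowest unreserved label whose slot is empty, or
 * LABEL_NOT_SPECIFIED when the table has no free slot.
 */
static unsigned pick_free_label(void *const *table, size_t nlabels)
{
	size_t i;

	assert(table != NULL || nlabels == 0);
	for (i = FIRST_UNRESERVED_LABEL; i < nlabels; i++) {
		if (!table[i])
			return (unsigned)i;
	}
	return LABEL_NOT_SPECIFIED;
}
```

Because LABEL_NOT_SPECIFIED is larger than any valid table size, a caller
that feeds the result into the usual "index >= platform_labels" check
rejects the request without a special case.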

As I don't have anything else to test with, ethernet and the loopback
device are the only two device types currently supported for forwarding
MPLS over.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index b097125dfa33..e432f092f2fb 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -16,6 +16,7 @@
 #include <net/netns/generic.h>
 #include "internal.h"
 
+#define LABEL_NOT_SPECIFIED (1<<20)
 #define MAX_NEW_LABELS 2
 
 /* This maximum ha length copied from the definition of struct neighbour */
@@ -211,6 +212,19 @@ static struct packet_type mpls_packet_type __read_mostly = {
 	.func = mpls_forward,
 };
 
+struct mpls_route_config {
+	u32		rc_protocol;
+	u32		rc_ifindex;
+	u16		rc_via_family;
+	u16		rc_via_alen;
+	u8		rc_via[MAX_VIA_ALEN];
+	u32		rc_label;
+	u32		rc_output_labels;
+	u32		rc_output_label[MAX_NEW_LABELS];
+	u32		rc_nlflags;
+	struct nl_info	rc_nlinfo;
+};
+
 static struct mpls_route *mpls_rt_alloc(size_t alen)
 {
 	struct mpls_route *rt;
@@ -245,6 +259,125 @@ static void mpls_route_update(struct net *net, unsigned index,
 	mpls_rt_free(old);
 }
 
+static unsigned find_free_label(struct net *net)
+{
+	unsigned index;
+	for (index = 16; index < net->mpls.platform_labels; index++) {
+		if (!net->mpls.platform_label[index])
+			return index;
+	}
+	return LABEL_NOT_SPECIFIED;
+}
+
+static int mpls_route_add(struct mpls_route_config *cfg)
+{
+	struct net *net = cfg->rc_nlinfo.nl_net;
+	struct net_device *dev = NULL;
+	struct mpls_route *rt, *old;
+	unsigned index;
+	int i;
+	int err = -EINVAL;
+
+	index = cfg->rc_label;
+
+	/* If a label was not specified during insert pick one */
+	if ((index == LABEL_NOT_SPECIFIED) &&
+	    (cfg->rc_nlflags & NLM_F_CREATE)) {
+		index = find_free_label(net);
+	}
+
+	/* The first 16 labels are reserved, and may not be set */
+	if (index < 16)
+		goto errout;
+
+	/* The full 20 bit range may not be supported. */
+	if (index >= net->mpls.platform_labels)
+		goto errout;
+
+	/* Ensure only a supported number of labels are present */
+	if (cfg->rc_output_labels > MAX_NEW_LABELS)
+		goto errout;
+
+	err = -ENODEV;
+	dev = dev_get_by_index(net, cfg->rc_ifindex);
+	if (!dev)
+		goto errout;
+
+	/* For now just support ethernet devices */
+	err = -EINVAL;
+	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
+		goto errout;
+
+	err = -EINVAL;
+	if ((cfg->rc_via_family == AF_PACKET) &&
+	    (dev->addr_len != cfg->rc_via_alen))
+		goto errout;
+
+	/* Append makes no sense with mpls */
+	err = -EINVAL;
+	if (cfg->rc_nlflags & NLM_F_APPEND)
+		goto errout;
+
+	err = -EEXIST;
+	old = net->mpls.platform_label[index];
+	if ((cfg->rc_nlflags & NLM_F_EXCL) && old)
+		goto errout;
+
+	err = -EEXIST;
+	if (!(cfg->rc_nlflags & NLM_F_REPLACE) && old)
+		goto errout;
+
+	err = -ENOENT;
+	if (!(cfg->rc_nlflags & NLM_F_CREATE) && !old)
+		goto errout;
+
+	err = -ENOMEM;
+	rt = mpls_rt_alloc(cfg->rc_via_alen);
+	if (!rt)
+		goto errout;
+
+	rt->rt_labels = cfg->rc_output_labels;
+	for (i = 0; i < rt->rt_labels; i++)
+		rt->rt_label[i] = cfg->rc_output_label[i];
+	rt->rt_protocol = cfg->rc_protocol;
+	rt->rt_dev = dev;
+	rt->rt_via_family = cfg->rc_via_family;
+	memcpy(rt->rt_via, cfg->rc_via, cfg->rc_via_alen);
+
+	mpls_route_update(net, index, NULL, rt, &cfg->rc_nlinfo);
+
+	dev_put(dev);
+	return 0;
+
+errout:
+	if (dev)
+		dev_put(dev);
+	return err;
+}
+
+static int mpls_route_del(struct mpls_route_config *cfg)
+{
+	struct net *net = cfg->rc_nlinfo.nl_net;
+	unsigned index;
+	int err = -EINVAL;
+
+	index = cfg->rc_label;
+
+	/* The first 16 labels are reserved, and may not be removed */
+	if (index < 16)
+		goto errout;
+
+	/* The full 20 bit range may not be supported */
+	if (index >= net->mpls.platform_labels)
+		goto errout;
+
+	mpls_route_update(net, index, NULL, NULL, &cfg->rc_nlinfo);
+
+	err = 0;
+errout:
+	return err;
+}
+
 static void mpls_ifdown(struct net_device *dev)
 {
 	struct net *net = dev_net(dev);
-- 
2.2.1


* [PATCH net-next 5/7] mpls: Functions for reading and wrinting mpls labels over netlink
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
                                           ` (3 preceding siblings ...)
  2015-03-04  1:12                         ` [PATCH net-next 4/7] mpls: Basic support for adding and removing routes Eric W. Biederman
@ 2015-03-04  1:13                         ` Eric W. Biederman
  2015-03-04  1:13                         ` [PATCH net-next 6/7] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
                                           ` (2 subsequent siblings)
  7 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


Reading and writing addresses in network byte order in netlink is
traditional and I see no reason to change that.  MPLS is interesting
as effectively it has variable length addresses (the MPLS label
stack).  To represent these variable length addresses in netlink
I use a valid MPLS label stack (complete with stop bit).

This achieves two things: a well defined existing format is used,
and the data can be interpreted without looking at its length.

Not needing to look at the length to decode the variable length
network representation allows existing userspace functions
such as inet_ntop to be used without needing to change their
prototype.
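
A minimal userspace sketch of that wire format, assuming the RFC 3032
layout of a label stack entry (20 bit label, 3 bit traffic class, bottom of
stack bit, 8 bit ttl).  stack_encode and stack_decode are hypothetical
names, not functions from this patch.  The decoder stops at the bottom of
stack bit, so the only length it takes is a safety cap.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LS_LABEL_SHIFT 12          /* label occupies the top 20 bits */
#define LS_S_BIT       (1u << 8)   /* bottom of stack flag */

/* Write n labels in network byte order; only the last entry carries the
 * bottom of stack bit, and ttl and tc are left at zero.
 */
static void stack_encode(uint32_t *wire, const uint32_t *label, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t entry;

		assert(label[i] < (1u << 20));
		entry = label[i] << LS_LABEL_SHIFT;
		if (i == n - 1)
			entry |= LS_S_BIT;
		wire[i] = htonl(entry);
	}
}

/* Read labels until the bottom of stack bit is seen.  Returns the number
 * of labels read, or 0 if no terminator was found within max entries.
 */
static size_t stack_decode(const uint32_t *wire, uint32_t *label, size_t max)
{
	size_t i;

	for (i = 0; i < max; i++) {
		uint32_t entry = ntohl(wire[i]);

		label[i] = entry >> LS_LABEL_SHIFT;
		if (entry & LS_S_BIT)
			return i + 1;
	}
	return 0;
}
```

The kernel side in this patch still validates the attribute length; the
self-terminating format mainly helps readers that are only handed a
pointer.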

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c  | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/mpls/internal.h |  3 +++
 2 files changed, 60 insertions(+)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index e432f092f2fb..2d6612a10e30 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -410,6 +410,63 @@ static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+int nla_put_labels(struct sk_buff *skb, int attrtype,
+		   u8 labels, const u32 label[])
+{
+	struct nlattr *nla;
+	struct mpls_shim_hdr *nla_label;
+	bool bos;
+	int i;
+	nla = nla_reserve(skb, attrtype, labels*4);
+	if (!nla)
+		return -EMSGSIZE;
+
+	nla_label = nla_data(nla);
+	bos = true;
+	for (i = labels - 1; i >= 0; i--) {
+		nla_label[i] = mpls_entry_encode(label[i], 0, 0, bos);
+		bos = false;
+	}
+
+	return 0;
+}
+
+int nla_get_labels(const struct nlattr *nla,
+		   u32 max_labels, u32 *labels, u32 label[])
+{
+	unsigned len = nla_len(nla);
+	unsigned nla_labels;
+	struct mpls_shim_hdr *nla_label;
+	bool bos;
+	int i;
+
+	/* len needs to be an even multiple of 4 (the label size) */
+	if (len & 3)
+		return -EINVAL;
+
+	/* Limit the number of new labels allowed */
+	nla_labels = len/4;
+	if (nla_labels > max_labels)
+		return -EINVAL;
+
+	nla_label = nla_data(nla);
+	bos = true;
+	for (i = nla_labels - 1; i >= 0; i--, bos = false) {
+		struct mpls_entry_decoded dec;
+		dec = mpls_entry_decode(nla_label + i);
+
+		/* Ensure the bottom of stack flag is properly set
+		 * and ttl and tc are both clear.
+		 */
+		if ((dec.bos != bos) || dec.ttl || dec.tc)
+			return -EINVAL;
+
+		label[i] = dec.label;
+	}
+	*labels = nla_labels;
+	return 0;
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index c2944cb84d48..fb6de92052c4 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -53,4 +53,7 @@ static inline struct mpls_entry_decoded mpls_entry_decode(struct mpls_shim_hdr *
 	return result;
 }
 
+int nla_put_labels(struct sk_buff *skb, int attrtype,  u8 labels, const u32 label[]);
+int nla_get_labels(const struct nlattr *nla, u32 max_labels, u32 *labels, u32 label[]);
+
 #endif /* MPLS_INTERNAL_H */
-- 
2.2.1


* [PATCH net-next 6/7] mpls: Netlink commands to add, remove, and dump routes
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
                                           ` (4 preceding siblings ...)
  2015-03-04  1:13                         ` [PATCH net-next 5/7] mpls: Functions for reading and wrinting mpls labels over netlink Eric W. Biederman
@ 2015-03-04  1:13                         ` Eric W. Biederman
  2015-03-04  1:14                         ` [PATCH net-next 7/7] mpls: Multicast route table change notifications Eric W. Biederman
  2015-03-04  5:27                         ` [PATCH net-next 0/7] Basic MPLS support take 2 David Miller
  7 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


This change adds two new netlink routing attributes:
RTA_VIA and RTA_NEWDST.

RTA_VIA specifies the next machine to send a packet to
like RTA_GATEWAY.  RTA_VIA differs from RTA_GATEWAY in that it
includes the address family of the address of the next machine to send
a packet to.  Currently the MPLS code supports addresses in AF_INET,
AF_INET6 and AF_PACKET.  For AF_INET and AF_INET6 the destination mac
address is acquired from the neighbour table.  For AF_PACKET the
destination mac_address is specified in the netlink configuration.
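
As a sketch of what that implies for validation: an RTA_VIA payload is a
two byte address family followed by the address, and the acceptable address
length depends on the family.  via_check below is a hypothetical userspace
helper, not code from this patch; it folds together the per-family checks
done while parsing and the AF_PACKET check against the device address
length, and adds an explicit guard for payloads shorter than the family
field.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6, AF_PACKET */

#define VIA_FAMILY_BYTES 2   /* size of the leading address family field */

/* Check an RTA_VIA payload of payload_len bytes for the given family.
 * dev_addr_len is the hardware address length of the output device.
 * Returns 0 if the address length is acceptable and -1 otherwise.
 */
static int via_check(unsigned short family, size_t payload_len,
		     size_t dev_addr_len)
{
	size_t alen;

	assert(dev_addr_len <= 32);  /* MAX_ADDR_LEN in the kernel */
	if (payload_len < VIA_FAMILY_BYTES)
		return -1;
	alen = payload_len - VIA_FAMILY_BYTES;

	switch (family) {
	case AF_INET:
		return alen == 4 ? 0 : -1;
	case AF_INET6:
		return alen == 16 ? 0 : -1;
	case AF_PACKET:
		/* a raw mac address must match the device's address length */
		return alen == dev_addr_len ? 0 : -1;
	default:
		return -1;               /* unsupported address family */
	}
}
```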

I think raw destination mac address support with the family AF_PACKET
will prove useful.  There is MPLS-TP which is defined to operate
on machines that do not support internet packets of any flavor.  Further
there seem to be corner cases where it can be useful.  At this point
I don't care much either way.

RTA_NEWDST specifies the destination address to forward the packet
with.  MPLS typically changes its destination address at every hop.
For a swap operation RTA_NEWDST is specified with a length of one label.
For a push operation RTA_NEWDST is specified with two or more labels.
For a pop operation RTA_NEWDST is not specified or equivalently an empty
RTA_NEWDST is specified.
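
The three cases can be pictured as one operation: remove the top label and
put the RTA_NEWDST labels in its place.  The sketch below works on a plain
array in userspace; apply_newdst and MAX_DEPTH are made up for
illustration, and real forwarding operates on packet headers rather than on
an array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 8   /* arbitrary capacity chosen for this sketch */

/* Replace the top label (index 0) of a stack of the given depth with n
 * new labels.  The stack array must have room for MAX_DEPTH entries.
 * Returns the new depth, or -1 if the request cannot be applied.
 */
static int apply_newdst(uint32_t *stack, int depth,
			const uint32_t *newdst, int n)
{
	int rest = depth - 1;   /* labels left below the one being replaced */

	assert(stack != NULL);
	if (depth < 1 || n < 0 || rest + n > MAX_DEPTH)
		return -1;

	/* move the remaining labels, then write the new ones on top */
	memmove(stack + n, stack + 1, (size_t)rest * sizeof(*stack));
	if (n)
		memcpy(stack, newdst, (size_t)n * sizeof(*stack));
	return rest + n;
}
```

With n equal to one the depth is unchanged (swap), with two or more it
grows (push), and with zero it shrinks by one (pop).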

Those new netlink attributes are used to implement handling of rt-netlink
RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the
MPLS label table.

rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message,
verifies that no unhandled attributes or unhandled values are present, and
sets up the data structures for mpls_route_add and mpls_route_del.

I did my best to match up with the existing conventions with the caveats
that MPLS addresses are all destination-specific-addresses, and so
don't properly have a scope.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/uapi/linux/rtnetlink.h |   8 ++
 net/mpls/af_mpls.c             | 229 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 237 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 5cc5d66bf519..bad65550ae3e 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -303,6 +303,8 @@ enum rtattr_type_t {
 	RTA_TABLE,
 	RTA_MARK,
 	RTA_MFC_STATS,
+	RTA_VIA,
+	RTA_NEWDST,
 	__RTA_MAX
 };
 
@@ -344,6 +346,12 @@ struct rtnexthop {
 #define RTNH_SPACE(len)	RTNH_ALIGN(RTNH_LENGTH(len))
 #define RTNH_DATA(rtnh)   ((struct rtattr*)(((char*)(rtnh)) + RTNH_LENGTH(0)))
 
+/* RTA_VIA */
+struct rtvia {
+	__kernel_sa_family_t	rtvia_family;
+	__u8			rtvia_addr[0];
+};
+
 /* RTM_CACHEINFO */
 
 struct rta_cacheinfo {
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 2d6612a10e30..b4d7cec398d2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -212,6 +212,11 @@ static struct packet_type mpls_packet_type __read_mostly = {
 	.func = mpls_forward,
 };
 
+const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
+	[RTA_DST]		= { .type = NLA_U32 },
+	[RTA_OIF]		= { .type = NLA_U32 },
+};
+
 struct mpls_route_config {
 	u32		rc_protocol;
 	u32		rc_ifindex;
@@ -410,6 +415,22 @@ static struct notifier_block mpls_dev_notifier = {
 	.notifier_call = mpls_dev_notify,
 };
 
+static int nla_put_via(struct sk_buff *skb,
+		       u16 family, const void *addr, int alen)
+{
+	struct nlattr *nla;
+	struct rtvia *via;
+
+	nla = nla_reserve(skb, RTA_VIA, alen + 2);
+	if (!nla)
+		return -EMSGSIZE;
+
+	via = nla_data(nla);
+	via->rtvia_family = family;
+	memcpy(via->rtvia_addr, addr, alen);
+	return 0;
+}
+
 int nla_put_labels(struct sk_buff *skb, int attrtype,
 		   u8 labels, const u32 label[])
 {
@@ -467,6 +488,210 @@ int nla_get_labels(const struct nlattr *nla,
 	return 0;
 }
 
+static int rtm_to_route_config(struct sk_buff *skb,  struct nlmsghdr *nlh,
+			       struct mpls_route_config *cfg)
+{
+	struct rtmsg *rtm;
+	struct nlattr *tb[RTA_MAX+1];
+	int index;
+	int err;
+
+	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_mpls_policy);
+	if (err < 0)
+		goto errout;
+
+	err = -EINVAL;
+	rtm = nlmsg_data(nlh);
+	memset(cfg, 0, sizeof(*cfg));
+
+	if (rtm->rtm_family != AF_MPLS)
+		goto errout;
+	if (rtm->rtm_dst_len != 20)
+		goto errout;
+	if (rtm->rtm_src_len != 0)
+		goto errout;
+	if (rtm->rtm_tos != 0)
+		goto errout;
+	if (rtm->rtm_table != RT_TABLE_MAIN)
+		goto errout;
+	/* Any value is acceptable for rtm_protocol */
+
+	/* As mpls uses destination specific addresses
+	 * (or source specific address in the case of multicast)
+	 * all addresses have universal scope.
+	 */
+	if (rtm->rtm_scope != RT_SCOPE_UNIVERSE)
+		goto errout;
+	if (rtm->rtm_type != RTN_UNICAST)
+		goto errout;
+	if (rtm->rtm_flags != 0)
+		goto errout;
+
+	cfg->rc_label		= LABEL_NOT_SPECIFIED;
+	cfg->rc_protocol	= rtm->rtm_protocol;
+	cfg->rc_nlflags		= nlh->nlmsg_flags;
+	cfg->rc_nlinfo.portid	= NETLINK_CB(skb).portid;
+	cfg->rc_nlinfo.nlh	= nlh;
+	cfg->rc_nlinfo.nl_net	= sock_net(skb->sk);
+
+	for (index = 0; index <= RTA_MAX; index++) {
+		struct nlattr *nla = tb[index];
+		if (!nla)
+			continue;
+
+		switch(index) {
+		case RTA_OIF:
+			cfg->rc_ifindex = nla_get_u32(nla);
+			break;
+		case RTA_NEWDST:
+			if (nla_get_labels(nla, MAX_NEW_LABELS,
+					   &cfg->rc_output_labels,
+					   cfg->rc_output_label))
+				goto errout;
+			break;
+		case RTA_DST:
+		{
+			u32 label_count;
+			if (nla_get_labels(nla, 1, &label_count,
+					   &cfg->rc_label))
+				goto errout;
+
+			/* The first 16 labels are reserved, and may not be set */
+			if (cfg->rc_label < 16)
+				goto errout;
+
+			break;
+		}
+		case RTA_VIA:
+		{
+			struct rtvia *via = nla_data(nla);
+			cfg->rc_via_family = via->rtvia_family;
+			cfg->rc_via_alen   = nla_len(nla) - 2;
+			if (cfg->rc_via_alen > MAX_VIA_ALEN)
+				goto errout;
+
+			/* Validate the address family */
+			switch(cfg->rc_via_family) {
+			case AF_PACKET:
+				break;
+			case AF_INET:
+				if (cfg->rc_via_alen != 4)
+					goto errout;
+				break;
+			case AF_INET6:
+				if (cfg->rc_via_alen != 16)
+					goto errout;
+				break;
+			default:
+				/* Unsupported address family */
+				goto errout;
+			}
+
+			memcpy(cfg->rc_via, via->rtvia_addr, cfg->rc_via_alen);
+			break;
+		}
+		default:
+			/* Unsupported attribute */
+			goto errout;
+		}
+	}
+
+	err = 0;
+errout:
+	return err;
+}
+
+static int mpls_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct mpls_route_config cfg;
+	int err;
+
+	err = rtm_to_route_config(skb, nlh, &cfg);
+	if (err < 0)
+		return err;
+
+	return mpls_route_del(&cfg);
+}
+
+
+static int mpls_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	struct mpls_route_config cfg;
+	int err;
+
+	err = rtm_to_route_config(skb, nlh, &cfg);
+	if (err < 0)
+		return err;
+
+	return mpls_route_add(&cfg);
+}
+
+static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event,
+			   u32 label, struct mpls_route *rt, int flags)
+{
+	struct nlmsghdr *nlh;
+	struct rtmsg *rtm;
+
+	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*rtm), flags);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	rtm = nlmsg_data(nlh);
+	rtm->rtm_family = AF_MPLS;
+	rtm->rtm_dst_len = 20;
+	rtm->rtm_src_len = 0;
+	rtm->rtm_tos = 0;
+	rtm->rtm_table = RT_TABLE_MAIN;
+	rtm->rtm_protocol = rt->rt_protocol;
+	rtm->rtm_scope = RT_SCOPE_UNIVERSE;
+	rtm->rtm_type = RTN_UNICAST;
+	rtm->rtm_flags = 0;
+
+	if (rt->rt_labels &&
+	    nla_put_labels(skb, RTA_NEWDST, rt->rt_labels, rt->rt_label))
+		goto nla_put_failure;
+	if (nla_put_via(skb, rt->rt_via_family, rt->rt_via, rt->rt_via_alen))
+		goto nla_put_failure;
+	if (rt->rt_dev && nla_put_u32(skb, RTA_OIF, rt->rt_dev->ifindex))
+		goto nla_put_failure;
+	if (nla_put_labels(skb, RTA_DST, 1, &label))
+		goto nla_put_failure;
+
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	unsigned int index;
+
+	ASSERT_RTNL();
+
+	index = cb->args[0];
+	if (index < 16)
+		index = 16;
+
+	for (; index < net->mpls.platform_labels; index++) {
+		struct mpls_route *rt;
+		rt = net->mpls.platform_label[index];
+		if (!rt)
+			continue;
+
+		if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid,
+				    cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+				    index, rt, NLM_F_MULTI) < 0)
+			break;
+	}
+	cb->args[0] = index;
+
+	return skb->len;
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
@@ -662,6 +887,9 @@ static int __init mpls_init(void)
 
 	dev_add_pack(&mpls_packet_type);
 
+	rtnl_register(PF_MPLS, RTM_NEWROUTE, mpls_rtm_newroute, NULL, NULL);
+	rtnl_register(PF_MPLS, RTM_DELROUTE, mpls_rtm_delroute, NULL, NULL);
+	rtnl_register(PF_MPLS, RTM_GETROUTE, NULL, mpls_dump_routes, NULL);
 	err = 0;
 out:
 	return err;
@@ -674,6 +902,7 @@ module_init(mpls_init);
 
 static void __exit mpls_exit(void)
 {
+	rtnl_unregister_all(PF_MPLS);
 	dev_remove_pack(&mpls_packet_type);
 	unregister_netdevice_notifier(&mpls_dev_notifier);
 	unregister_pernet_subsys(&mpls_net_ops);
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH net-next 7/7] mpls: Multicast route table change notifications
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
                                           ` (5 preceding siblings ...)
  2015-03-04  1:13                         ` [PATCH net-next 6/7] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
@ 2015-03-04  1:14                         ` Eric W. Biederman
  2015-03-04  5:27                         ` [PATCH net-next 0/7] Basic MPLS support take 2 David Miller
  7 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  1:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, Stephen Hemminger, santiago, Simon Horman


Unlike IPv4, this code notifies in all cases where mpls routes
are added or removed, and it never automatically removes routes.
This avoids both the userspace confusion that is caused by omitting
route updates and the possibility of a flood of netlink traffic
when an interface goes down.

For now reserved labels are handled automatically and userspace
is not notified.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/uapi/linux/rtnetlink.h |  2 ++
 net/mpls/af_mpls.c             | 60 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index bad65550ae3e..06f75a407f74 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -631,6 +631,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_IPV6_NETCONF	RTNLGRP_IPV6_NETCONF
 	RTNLGRP_MDB,
 #define RTNLGRP_MDB		RTNLGRP_MDB
+	RTNLGRP_MPLS_ROUTE,
+#define RTNLGRP_MPLS_ROUTE	RTNLGRP_MPLS_ROUTE
 	__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX	(__RTNLGRP_MAX - 1)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index b4d7cec398d2..75a994a50381 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -36,6 +36,10 @@ struct mpls_route { /* next hop label forwarding entry */
 static int zero = 0;
 static int label_limit = (1 << 20) - 1;
 
+static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
+		       struct nlmsghdr *nlh, struct net *net, u32 portid,
+		       unsigned int nlm_flags);
+
 static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
 	struct mpls_route *rt = NULL;
@@ -246,6 +250,20 @@ static void mpls_rt_free(struct mpls_route *rt)
 		kfree_rcu(rt, rt_rcu);
 }
 
+static void mpls_notify_route(struct net *net, unsigned index,
+			      struct mpls_route *old, struct mpls_route *new,
+			      const struct nl_info *info)
+{
+	struct nlmsghdr *nlh = info ? info->nlh : NULL;
+	unsigned portid = info ? info->portid : 0;
+	int event = new ? RTM_NEWROUTE : RTM_DELROUTE;
+	struct mpls_route *rt = new ? new : old;
+	unsigned nlm_flags = (old && new) ? NLM_F_REPLACE : 0;
+	/* Ignore reserved labels for now */
+	if (rt && (index >= 16))
+		rtmsg_lfib(event, index, rt, nlh, net, portid, nlm_flags);
+}
+
 static void mpls_route_update(struct net *net, unsigned index,
 			      struct net_device *dev, struct mpls_route *new,
 			      const struct nl_info *info)
@@ -260,6 +278,8 @@ static void mpls_route_update(struct net *net, unsigned index,
 		old = rt;
 	}
 
+	mpls_notify_route(net, index, old, new, info);
+
 	/* If we removed a route free it now */
 	mpls_rt_free(old);
 }
@@ -692,6 +712,46 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+static inline size_t lfib_nlmsg_size(struct mpls_route *rt)
+{
+	size_t payload =
+		NLMSG_ALIGN(sizeof(struct rtmsg))
+		+ nla_total_size(2 + rt->rt_via_alen)	/* RTA_VIA */
+		+ nla_total_size(4);			/* RTA_DST */
+	if (rt->rt_labels)				/* RTA_NEWDST */
+		payload += nla_total_size(rt->rt_labels * 4);
+	if (rt->rt_dev)					/* RTA_OIF */
+		payload += nla_total_size(4);
+	return payload;
+}
+
+static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
+		       struct nlmsghdr *nlh, struct net *net, u32 portid,
+		       unsigned int nlm_flags)
+{
+	struct sk_buff *skb;
+	u32 seq = nlh ? nlh->nlmsg_seq : 0;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(lfib_nlmsg_size(rt), GFP_KERNEL);
+	if (skb == NULL)
+		goto errout;
+
+	err = mpls_dump_route(skb, portid, seq, event, label, rt, nlm_flags);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in lfib_nlmsg_size */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+	rtnl_notify(skb, net, portid, RTNLGRP_MPLS_ROUTE, nlh, GFP_KERNEL);
+
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_MPLS_ROUTE, err);
+}
+
 static int resize_platform_label_table(struct net *net, size_t limit)
 {
 	size_t size = sizeof(struct mpls_route *) * limit;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
                                         ` (2 preceding siblings ...)
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
@ 2015-03-04  5:25                       ` David Miller
  2015-03-04  5:53                         ` Eric W. Biederman
  3 siblings, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-04  5:25 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 03 Mar 2015 17:09:35 -0600

> In preparation for using the IPv4 and IPv6 neighbour tables in my mpls
> code this patchset factors out ___neigh_lookup_noref from
> __ipv4_neigh_lookup_noref, __ipv6_neigh_lookup_noref and neigh_lookup,
> allowing the lookup logic to be shared between the different
> implementations, at what appears to be no cost.  (Aka the same assembly
> is generated for ip6_finish_output2 and ip_finish_output2).
> 
> After that I add a simple function that takes an address family and an
> address, consults the neighbour table, and sends the packet to the
> appropriate location.  The address family argument decouples callers
> of neigh_xmit from the address families the packets are sent over.
> (Aka the ipv6 module can be loaded after mpls and a previously
> configured ipv6 next hop will start working).
> 
> The refactoring in ___neigh_lookup_noref may be a bit overkill but it
> feels like the right thing to do.  Especially since the same code is
> generated.

Series applied, thanks.

Maybe we can make neigh_table_find() faster by making it a direct
array demux of some kind instead of some switch statement thing?
It's the only thing I don't like about neigh_xmit().

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/7] Basic MPLS support take 2
  2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
                                           ` (6 preceding siblings ...)
  2015-03-04  1:14                         ` [PATCH net-next 7/7] mpls: Multicast route table change notifications Eric W. Biederman
@ 2015-03-04  5:27                         ` David Miller
  2015-03-04  6:13                           ` Eric W. Biederman
  7 siblings, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-04  5:27 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev, roopa, stephen, santiago, horms

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 03 Mar 2015 19:06:39 -0600

> On top of my two pending neighbour table prep patches here is the mpls
> support refactored to use them, and edited to not drop routes when
> an interface goes down.  Additionally the addition of RTA_LLGATEWAY
> has been replaced with the addition of RTA_VIA, an attribute that
> includes the address family as well as the address of the next hop.
> 
> MPLS is at its heart simple and I have endeavoured to maintain that
> simplicity in my implementation.
> 
> This is an implementation of an RFC3032 forwarding engine, and basic MPLS
> egress logic, which should make linux sufficient to be an mpls
> forwarding node or to be an LSR (Label Switched Router) as it says in all
> of the MPLS documents.  The ingress support will follow but it deserves
> its own discussion so I am pushing it separately.

Ok, with the neigh changes, I think I'm fine with this.

Series applied, thanks.

Please think very carefully about the netlink API you've created to
config this stuff, once we hit the next merge window we will be stuck
with this thing forever.

Thanks.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-04  5:25                       ` [PATCH net-next 0/2] Neighbour table prep for MPLS David Miller
@ 2015-03-04  5:53                         ` Eric W. Biederman
  2015-03-04 14:56                           ` Andy Gospodarek
  2015-03-04 21:04                           ` David Miller
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  5:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 03 Mar 2015 17:09:35 -0600
>
>> In preparation for using the IPv4 and IPv6 neighbour tables in my mpls
>> code this patchset factors out ___neigh_lookup_noref from
>> __ipv4_neigh_lookup_noref, __ipv6_neigh_lookup_noref and neigh_lookup,
>> allowing the lookup logic to be shared between the different
>> implementations, at what appears to be no cost.  (Aka the same assembly
>> is generated for ip6_finish_output2 and ip_finish_output2).
>> 
>> After that I add a simple function that takes an address family and an
>> address, consults the neighbour table, and sends the packet to the
>> appropriate location.  The address family argument decouples callers
>> of neigh_xmit from the address families the packets are sent over.
>> (Aka the ipv6 module can be loaded after mpls and a previously
>> configured ipv6 next hop will start working).
>> 
>> The refactoring in ___neigh_lookup_noref may be a bit overkill but it
>> feels like the right thing to do.  Especially since the same code is
>> generated.
>
> Series applied, thanks.
>
> Maybe we can make neigh_table_find() faster by making it a direct
> array demux of some kind instead of some switch statement thing?
> It's the only thing I don't like about neigh_xmit().

We could potentially translate the numbers into the enumeration that is
NEIGH_ARP_TABLE, NEIGH_ND_TABLE, and NEIGH_DN_TABLE.  Or waste a little
bit of memory by having a 30 entry array and looking things up by
address family number.  The only disadvantage I can see to using AF_NNN
as the index is that it might be a little less cache friendly.
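
A rough sketch of the array-demux idea, for illustration only.  The
SK_AF_* constants, the table layout, and neigh_table_index() are
hypothetical names made up for this example; the address family values
are assumed to match the usual Linux numbering, and the 30 entry bound
is the one mentioned above.  Slots store index + 1 so that a
zero-initialised entry means "no table for this family":

```c
/* Hypothetical stand-ins for AF_INET, AF_INET6 and AF_DECnet
 * (assumed to follow the usual Linux numbering). */
enum { SK_AF_INET = 2, SK_AF_INET6 = 10, SK_AF_DECNET = 12, SK_AF_MAX = 30 };

/* Table indices as named in the message; NEIGH_NO_TABLE is an
 * addition for this sketch. */
enum {
	NEIGH_NO_TABLE  = -1,
	NEIGH_ARP_TABLE = 0,
	NEIGH_ND_TABLE  = 1,
	NEIGH_DN_TABLE  = 2,
};

/* Direct demux: one slot per address family, holding index + 1.
 * Families without an entry stay 0 and map to NEIGH_NO_TABLE. */
static const unsigned char af_to_slot[SK_AF_MAX] = {
	[SK_AF_INET]   = NEIGH_ARP_TABLE + 1,
	[SK_AF_INET6]  = NEIGH_ND_TABLE + 1,
	[SK_AF_DECNET] = NEIGH_DN_TABLE + 1,
};

/* Return the neighbour table index for an address family,
 * or NEIGH_NO_TABLE when the family has no table. */
static int neigh_table_index(int family)
{
	if (family < 0 || family >= SK_AF_MAX)
		return NEIGH_NO_TABLE;
	return (int)af_to_slot[family] - 1;
}
```

The lookup is a bounds check plus one byte load, at the price of a
mostly empty 30 byte array, which is the memory versus cache trade-off
described above.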

Other issues: the hh header cache doesn't work. (How much do we care?)

I worry a little that supporting the AF_PACKET case might cause problems
in the future.

The cumulus folks are probably going to want to use neigh_xmit so they
can have ipv6 nexthops on ipv4.  Using this for IPv4 and losing the
header cache worries me a little.

But it seems like a good starting point.  And right now I am very ready
to say good enough for now and move on to the next thing.

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 0/7] Basic MPLS support take 2
  2015-03-04  5:27                         ` [PATCH net-next 0/7] Basic MPLS support take 2 David Miller
@ 2015-03-04  6:13                           ` Eric W. Biederman
  0 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04  6:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, roopa, stephen, santiago, horms

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 03 Mar 2015 19:06:39 -0600
>
>> On top of my two pending neighbour table prep patches here is the mpls
>> support refactored to use them, and edited to not drop routes when
>> an interface goes down.  Additionally the addition of RTA_LLGATEWAY
>> has been replaced with the addition of RTA_VIA, an attribute that
>> includes the address family as well as the address of the next hop.
>> 
>> MPLS is at its heart simple and I have endeavoured to maintain that
>> simplicity in my implementation.
>> 
>> This is an implementation of an RFC3032 forwarding engine, and basic MPLS
>> egress logic, which should make linux sufficient to be an mpls
>> forwarding node or to be an LSR (Label Switched Router) as it says in all
>> of the MPLS documents.  The ingress support will follow but it deserves
>> its own discussion so I am pushing it separately.
>
> Ok, with the neigh changes, I think I'm fine with this.
>
> Series applied, thanks.
>
> Please think very carefully about the netlink API you've created to
> config this stuff, once we hit the next merge window we will be stuck
> with this thing forever.

I have done my best to keep things minimal so only the bare essentials
are present at this point.

I will also take suggestions from anyone else who has a thought.

What I worry most about is that the table size is a sysctl, and thus
table size changes and their notifications don't happen over netlink.

I have done my best with the netlink API and other than the above table
size issue I don't have any real reservations.

I do worry about the alignment of struct rtvia.  A 16bit field before
addresses is not exactly ideal for address alignment.  But short of
wasting space I don't see any real good alternatives.
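
To make the alignment concern concrete, here is a small userspace
sketch.  It assumes a 4 byte netlink attribute header on a 4 byte
aligned attribute, and mirrors struct rtvia with a hypothetical
struct rtvia_sketch; SK_NLA_HDRLEN, via_addr_offset() and
via_read_ipv4() are names invented for this example and are not part
of the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed size of a netlink attribute header (u16 len + u16 type). */
#define SK_NLA_HDRLEN 4

/* Userspace mirror of struct rtvia: a 16 bit family followed
 * directly by the address bytes. */
struct rtvia_sketch {
	uint16_t family;	/* stands in for __kernel_sa_family_t */
	uint8_t  addr[];	/* address bytes follow immediately */
};

/* Offset of the address bytes from the start of the attribute. */
static size_t via_addr_offset(void)
{
	return SK_NLA_HDRLEN + offsetof(struct rtvia_sketch, addr);
}

/* Read a 4 byte address without assuming 32 bit alignment:
 * memcpy avoids a misaligned load through a u32 pointer. */
static uint32_t via_read_ipv4(const struct rtvia_sketch *via)
{
	uint32_t a;

	memcpy(&a, via->addr, sizeof(a));
	return a;
}
```

Under these assumptions the address starts 6 bytes into the attribute,
so it is 2 byte aligned but not 4 byte aligned, and readers have to
copy it out rather than dereference it as a u32 or an in6_addr.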

If and when we implement forwarding of MPLS multicast there will have
to be some interesting changes.  Instead of having destination specific
addresses there are source specific addresses so a RTA_NEWSRC attribute
will probably have to be added to complement the RTA_NEWDST attribute.

But in essence that all fits into the existing routing message model
and iproute2 has required very minimal changes so far. 

Is there a standard location to document netlink messages?  I don't
think I have tripped over one at this point.

I hope this gets us to the point where other folks who care about mpls
can jump in and make incremental improvements.

Eric

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-04  1:12                         ` [PATCH net-next 4/7] mpls: Basic support for adding and removing routes Eric W. Biederman
@ 2015-03-04  8:13                           ` roopa
  2015-03-04 20:36                             ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: roopa @ 2015-03-04  8:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
> mpls_route_add and mpls_route_del implement the basic logic for adding
> and removing Next Hop Label Forwarding Entries from the MPLS input
> label map.  The addition and subtraction is done in a way that is
> consistent with how the existing routing table in Linux are
> maintained.  Thus all of the work to deal with NLM_F_APPEND,
> NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
>
> Cases that are not clearly defined such as changing the interpretation
> of the mpls reserved labels is not allowed.
>
> Because it seems like the right thing to do adding an MPLS route without
> specifying an input label and allowing the kernel to pick a free label
> table entry is supported.   The implementation is currently less than optimal
> but that can be changed.
>
> As I don't have anything else to test with only ethernet and the loopback
> device are the only two device types currently supported for forwarding
> MPLS over.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>   net/mpls/af_mpls.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 133 insertions(+)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index b097125dfa33..e432f092f2fb 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -16,6 +16,7 @@
>   #include <net/netns/generic.h>
>   #include "internal.h"
>   
> +#define LABEL_NOT_SPECIFIED (1<<20)
>   #define MAX_NEW_LABELS 2
>   
>   /* This maximum ha length copied from the definition of struct neighbour */
> @@ -211,6 +212,19 @@ static struct packet_type mpls_packet_type __read_mostly = {
>   	.func = mpls_forward,
>   };
>   
> +struct mpls_route_config {
> +	u32		rc_protocol;
> +	u32		rc_ifindex;
> +	u16		rc_via_family;
> +	u16		rc_via_alen;
> +	u8		rc_via[MAX_VIA_ALEN];
> +	u32		rc_label;
> +	u32		rc_output_labels;
> +	u32		rc_output_label[MAX_NEW_LABELS];
> +	u32		rc_nlflags;
> +	struct nl_info	rc_nlinfo;
> +};
> +
>   static struct mpls_route *mpls_rt_alloc(size_t alen)
>   {
>   	struct mpls_route *rt;
> @@ -245,6 +259,125 @@ static void mpls_route_update(struct net *net, unsigned index,
>   	mpls_rt_free(old);
>   }
>   
> +static unsigned find_free_label(struct net *net)
> +{
> +	unsigned index;
> +	for (index = 16; index < net->mpls.platform_labels; index++) {
> +		if (!net->mpls.platform_label[index])
> +			return index;
> +	}
> +	return LABEL_NOT_SPECIFIED;
> +}
> +
> +static int mpls_route_add(struct mpls_route_config *cfg)
> +{
> +	struct net *net = cfg->rc_nlinfo.nl_net;
> +	struct net_device *dev = NULL;
> +	struct mpls_route *rt, *old;
> +	unsigned index;
> +	int i;
> +	int err = -EINVAL;
> +
> +	index = cfg->rc_label;
> +
> +	/* If a label was not specified during insert pick one */
> +	if ((index == LABEL_NOT_SPECIFIED) &&
> +	    (cfg->rc_nlflags & NLM_F_CREATE)) {
> +		index = find_free_label(net);
> +	}
> +
> +	/* The first 16 labels are reserved, and may not be set */
> +	if (index < 16)
> +		goto errout;
> +
> +	/* The full 20 bit range may not be supported. */
> +	if (index >= net->mpls.platform_labels)
> +		goto errout;
> +
> +	/* Ensure only a supported number of labels are present */
> +	if (cfg->rc_output_labels > MAX_NEW_LABELS)
> +		goto errout;
> +
> +	err = -ENODEV;
> +	dev = dev_get_by_index(net, cfg->rc_ifindex);
> +	if (!dev)
> +		goto errout;
> +
> +	/* For now just support ethernet devices */
> +	err = -EINVAL;
> +	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
> +		goto errout;
> +
> +	err = -EINVAL;
> +	if ((cfg->rc_via_family == AF_PACKET) &&
> +	    (dev->addr_len != cfg->rc_via_alen))
> +		goto errout;
> +
> +	/* Append makes no sense with mpls */
> +	err = -EINVAL;

minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
I did not find any other such case for consistency check.

> +	if (cfg->rc_nlflags & NLM_F_APPEND)
> +		goto errout;
> +
>
Thanks,
Roopa

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref
  2015-03-03 23:10                       ` [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref Eric W. Biederman
@ 2015-03-04 14:53                         ` Andy Gospodarek
  2015-03-04 15:58                           ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Andy Gospodarek @ 2015-03-04 14:53 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev

On Tue, Mar 03, 2015 at 05:10:44PM -0600, Eric W. Biederman wrote:
> 
> While looking at the mpls code I found myself writing yet another
> version of neigh_lookup_noref.  We currently have
> __ipv4_neigh_lookup_noref and __ipv6_neigh_lookup_noref.
> 
> So to make my work a little easier and to make it a smidge easier to
> verify/maintain the mpls code in the future I stopped and wrote
> ___neigh_lookup_noref.  Then I rewrote __ipv4_neigh_lookup_noref and
> __ipv6_neigh_lookup_noref in terms of this new function.  I tested my
> new version by verifying that the same code is generated in
> ip_finish_output2 and ip6_finish_output2 where these functions are
> inlined.
> 
> To get to ___neigh_lookup_noref I added a new neighbour cache table
> function key_eq, so that the static size of the key would be
> available.
> 
> I also added __neigh_lookup_noref for people who want to look up
> a neighbour table entry quickly but don't know which neighbour table
> they are going to look up.

While I understand your intent here, you really do need to know which
neighbour table is being used in order to do the look-up with your new
function, so this changelog isn't quite accurate.  I know Dave has
already accepted this patch, but it did not appear in the tree I just
updated, so hopefully there is time to fix this if you agree with me.

I realize patch 2/2 allows one to not specify a table as look-up is done
for you in neigh_xmit, but ___neigh_lookup_noref will clearly panic if
no valid table is passed.

Otherwise the patch-set looks good to me.

Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>

> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/net/arp.h       | 19 ++++--------------
>  include/net/ndisc.h     | 19 +-----------------
>  include/net/neighbour.h | 52 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/neighbour.c    | 20 +++++--------------
>  net/decnet/dn_neigh.c   |  6 ++++++
>  net/ipv4/arp.c          |  9 ++++++++-
>  net/ipv6/ndisc.c        |  7 +++++++
>  7 files changed, 83 insertions(+), 49 deletions(-)
> 
> diff --git a/include/net/arp.h b/include/net/arp.h
> index 21ee1860abbc..5e0f891d476c 100644
> --- a/include/net/arp.h
> +++ b/include/net/arp.h
> @@ -9,28 +9,17 @@
>  
>  extern struct neigh_table arp_tbl;
>  
> -static inline u32 arp_hashfn(u32 key, const struct net_device *dev, u32 hash_rnd)
> +static inline u32 arp_hashfn(const void *pkey, const struct net_device *dev, u32 *hash_rnd)
>  {
> +	u32 key = *(const u32 *)pkey;
>  	u32 val = key ^ hash32_ptr(dev);
>  
> -	return val * hash_rnd;
> +	return val * hash_rnd[0];
>  }
>  
>  static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev, u32 key)
>  {
> -	struct neigh_hash_table *nht = rcu_dereference_bh(arp_tbl.nht);
> -	struct neighbour *n;
> -	u32 hash_val;
> -
> -	hash_val = arp_hashfn(key, dev, nht->hash_rnd[0]) >> (32 - nht->hash_shift);
> -	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
> -	     n != NULL;
> -	     n = rcu_dereference_bh(n->next)) {
> -		if (n->dev == dev && *(u32 *)n->primary_key == key)
> -			return n;
> -	}
> -
> -	return NULL;
> +	return ___neigh_lookup_noref(&arp_tbl, neigh_key_eq32, arp_hashfn, &key, dev);
>  }
>  
>  static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, u32 key)
> diff --git a/include/net/ndisc.h b/include/net/ndisc.h
> index 6bbda34d5e59..b3a7751251b4 100644
> --- a/include/net/ndisc.h
> +++ b/include/net/ndisc.h
> @@ -156,24 +156,7 @@ static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, _
>  
>  static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device *dev, const void *pkey)
>  {
> -	struct neigh_hash_table *nht;
> -	const u32 *p32 = pkey;
> -	struct neighbour *n;
> -	u32 hash_val;
> -
> -	nht = rcu_dereference_bh(nd_tbl.nht);
> -	hash_val = ndisc_hashfn(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
> -	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
> -	     n != NULL;
> -	     n = rcu_dereference_bh(n->next)) {
> -		u32 *n32 = (u32 *) n->primary_key;
> -		if (n->dev == dev &&
> -		    ((n32[0] ^ p32[0]) | (n32[1] ^ p32[1]) |
> -		     (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0)
> -			return n;
> -	}
> -
> -	return NULL;
> +	return ___neigh_lookup_noref(&nd_tbl, neigh_key_eq128, ndisc_hashfn, pkey, dev);
>  }
>  
>  static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, const void *pkey)
> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
> index 9f912e4d4232..14e3f017966b 100644
> --- a/include/net/neighbour.h
> +++ b/include/net/neighbour.h
> @@ -197,6 +197,7 @@ struct neigh_table {
>  	__u32			(*hash)(const void *pkey,
>  					const struct net_device *dev,
>  					__u32 *hash_rnd);
> +	bool			(*key_eq)(const struct neighbour *, const void *pkey);
>  	int			(*constructor)(struct neighbour *);
>  	int			(*pconstructor)(struct pneigh_entry *);
>  	void			(*pdestructor)(struct pneigh_entry *);
> @@ -247,6 +248,57 @@ static inline void *neighbour_priv(const struct neighbour *n)
>  #define NEIGH_UPDATE_F_ISROUTER			0x40000000
>  #define NEIGH_UPDATE_F_ADMIN			0x80000000
>  
> +
> +static inline bool neigh_key_eq16(const struct neighbour *n, const void *pkey)
> +{
> +	return *(const u16 *)n->primary_key == *(const u16 *)pkey;
> +}
> +
> +static inline bool neigh_key_eq32(const struct neighbour *n, const void *pkey)
> +{
> +	return *(const u32 *)n->primary_key == *(const u32 *)pkey;
> +}
> +
> +static inline bool neigh_key_eq128(const struct neighbour *n, const void *pkey)
> +{
> +	const u32 *n32 = (const u32 *)n->primary_key;
> +	const u32 *p32 = pkey;
> +
> +	return ((n32[0] ^ p32[0]) | (n32[1] ^ p32[1]) |
> +		(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
> +}
> +
> +static inline struct neighbour *___neigh_lookup_noref(
> +	struct neigh_table *tbl,
> +	bool (*key_eq)(const struct neighbour *n, const void *pkey),
> +	__u32 (*hash)(const void *pkey,
> +		      const struct net_device *dev,
> +		      __u32 *hash_rnd),
> +	const void *pkey,
> +	struct net_device *dev)
> +{
> +	struct neigh_hash_table *nht = rcu_dereference_bh(tbl->nht);
> +	struct neighbour *n;
> +	u32 hash_val;
> +
> +	hash_val = hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
> +	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
> +	     n != NULL;
> +	     n = rcu_dereference_bh(n->next)) {
> +		if (n->dev == dev && key_eq(n, pkey))
> +			return n;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
> +						     const void *pkey,
> +						     struct net_device *dev)
> +{
> +	return ___neigh_lookup_noref(tbl, tbl->key_eq, tbl->hash, pkey, dev);
> +}
> +
>  void neigh_table_init(int index, struct neigh_table *tbl);
>  int neigh_table_clear(int index, struct neigh_table *tbl);
>  struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 0f48ea3affed..fe3c6eac5805 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -397,25 +397,15 @@ struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
>  			       struct net_device *dev)
>  {
>  	struct neighbour *n;
> -	int key_len = tbl->key_len;
> -	u32 hash_val;
> -	struct neigh_hash_table *nht;
>  
>  	NEIGH_CACHE_STAT_INC(tbl, lookups);
>  
>  	rcu_read_lock_bh();
> -	nht = rcu_dereference_bh(tbl->nht);
> -	hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
> -
> -	for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
> -	     n != NULL;
> -	     n = rcu_dereference_bh(n->next)) {
> -		if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
> -			if (!atomic_inc_not_zero(&n->refcnt))
> -				n = NULL;
> -			NEIGH_CACHE_STAT_INC(tbl, hits);
> -			break;
> -		}
> +	n = __neigh_lookup_noref(tbl, pkey, dev);
> +	if (n) {
> +		if (!atomic_inc_not_zero(&n->refcnt))
> +			n = NULL;
> +		NEIGH_CACHE_STAT_INC(tbl, hits);
>  	}
>  
>  	rcu_read_unlock_bh();
> diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
> index f123c6c6748c..ee7d1cef0027 100644
> --- a/net/decnet/dn_neigh.c
> +++ b/net/decnet/dn_neigh.c
> @@ -93,12 +93,18 @@ static u32 dn_neigh_hash(const void *pkey,
>  	return jhash_2words(*(__u16 *)pkey, 0, hash_rnd[0]);
>  }
>  
> +static bool dn_key_eq(const struct neighbour *neigh, const void *pkey)
> +{
> +	return neigh_key_eq16(neigh, pkey);
> +}
> +
>  struct neigh_table dn_neigh_table = {
>  	.family =			PF_DECnet,
>  	.entry_size =			NEIGH_ENTRY_SIZE(sizeof(struct dn_neigh)),
>  	.key_len =			sizeof(__le16),
>  	.protocol =			cpu_to_be16(ETH_P_DNA_RT),
>  	.hash =				dn_neigh_hash,
> +	.key_eq =			dn_key_eq,
>  	.constructor =			dn_neigh_construct,
>  	.id =				"dn_neigh_cache",
>  	.parms ={
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index 6b8aad6a0d7d..5f5c674e130a 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -122,6 +122,7 @@
>   *	Interface to generic neighbour cache.
>   */
>  static u32 arp_hash(const void *pkey, const struct net_device *dev, __u32 *hash_rnd);
> +static bool arp_key_eq(const struct neighbour *n, const void *pkey);
>  static int arp_constructor(struct neighbour *neigh);
>  static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb);
>  static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb);
> @@ -154,6 +155,7 @@ struct neigh_table arp_tbl = {
>  	.key_len	= 4,
>  	.protocol	= cpu_to_be16(ETH_P_IP),
>  	.hash		= arp_hash,
> +	.key_eq		= arp_key_eq,
>  	.constructor	= arp_constructor,
>  	.proxy_redo	= parp_redo,
>  	.id		= "arp_cache",
> @@ -209,7 +211,12 @@ static u32 arp_hash(const void *pkey,
>  		    const struct net_device *dev,
>  		    __u32 *hash_rnd)
>  {
> -	return arp_hashfn(*(u32 *)pkey, dev, *hash_rnd);
> +	return arp_hashfn(pkey, dev, hash_rnd);
> +}
> +
> +static bool arp_key_eq(const struct neighbour *neigh, const void *pkey)
> +{
> +	return neigh_key_eq32(neigh, pkey);
>  }
>  
>  static int arp_constructor(struct neighbour *neigh)
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index e363bbc2420d..247ad7c298f7 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -84,6 +84,7 @@ do {								\
>  static u32 ndisc_hash(const void *pkey,
>  		      const struct net_device *dev,
>  		      __u32 *hash_rnd);
> +static bool ndisc_key_eq(const struct neighbour *neigh, const void *pkey);
>  static int ndisc_constructor(struct neighbour *neigh);
>  static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb);
>  static void ndisc_error_report(struct neighbour *neigh, struct sk_buff *skb);
> @@ -119,6 +120,7 @@ struct neigh_table nd_tbl = {
>  	.key_len =	sizeof(struct in6_addr),
>  	.protocol =	cpu_to_be16(ETH_P_IPV6),
>  	.hash =		ndisc_hash,
> +	.key_eq =	ndisc_key_eq,
>  	.constructor =	ndisc_constructor,
>  	.pconstructor =	pndisc_constructor,
>  	.pdestructor =	pndisc_destructor,
> @@ -295,6 +297,11 @@ static u32 ndisc_hash(const void *pkey,
>  	return ndisc_hashfn(pkey, dev, hash_rnd);
>  }
>  
> +static bool ndisc_key_eq(const struct neighbour *n, const void *pkey)
> +{
> +	return neigh_key_eq128(n, pkey);
> +}
> +
>  static int ndisc_constructor(struct neighbour *neigh)
>  {
>  	struct in6_addr *addr = (struct in6_addr *)&neigh->primary_key;
> -- 
> 2.2.1
> 


* Re: [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-04  5:53                         ` Eric W. Biederman
@ 2015-03-04 14:56                           ` Andy Gospodarek
  2015-03-04 21:04                           ` David Miller
  1 sibling, 0 replies; 88+ messages in thread
From: Andy Gospodarek @ 2015-03-04 14:56 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev

On Tue, Mar 03, 2015 at 11:53:21PM -0600, Eric W. Biederman wrote:
> David Miller <davem@davemloft.net> writes:
> 
> > From: ebiederm@xmission.com (Eric W. Biederman)
> > Date: Tue, 03 Mar 2015 17:09:35 -0600
> >
> >> In preparation for using the IPv4 and IPv6 neighbour tables in my mpls
> >> code this patchset factors out ___neigh_lookup_noref from
> >> __ipv4_neigh_lookup_noref, __ipv6_neigh_lookup_noref and neigh_lookup,
> >> allowing the lookup logic to be shared between the different
> >> implementations, at what appears to be no cost.  (Aka the same assembly
> >> is generated for ip6_finish_output2 and ip_finish_output2).
> >> 
> >> After that I add a simple function that takes an address family and an
> >> address, consults the neighbour table, and sends the packet to the
> >> appropriate location.  The address family argument decouples callers
> >> of neigh_xmit from the address families the packets are sent over.
> >> (Aka the ipv6 module can be loaded after mpls and a previously
> >> configured ipv6 next hop will start working).
> >> 
> >> The refactoring in ___neigh_lookup_noref may be a bit overkill but it
> >> feels like the right thing to do.  Especially since the same code is
> >> generated.
> >
> > Series applied, thanks.
> >
> > Maybe we can make neigh_table_find() faster by making it a direct
> > array demux of some kind instead of some switch statement thing?
> > It's the only thing I don't like about neigh_xmit().
> 
> We could potentially translate the numbers into the enumeration that is
> NEIGH_ARP_TABLE, NEIGH_ND_TABLE, and NEIGH_DN_TABLE.  Or waste a little
> bit of memory by having a 30 entry array and looking things up by address
> protocol number.   The only disadvantage I can see to using AF_NNN as
> the index is that it might be a little less cache friendly.
> 
> Other issues: the hh header cache doesn't work.  (How much do we care?)
> 
> I worry a little that supporting the AF_PACKET case might cause problems
> in the future.
> 
> The cumulus folks are probably going to want to use neigh_xmit so they
> can have ipv6 nexthops on ipv4.  Using this for IPv4 and losing the
> header cache worries me a little.

Agreed, this will be good.  I had done something a bit different coming
off the discussions at netconf, but I'll rebase to this and use it
instead.  Thanks, Eric!


* Re: [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref
  2015-03-04 14:53                         ` Andy Gospodarek
@ 2015-03-04 15:58                           ` Eric W. Biederman
  2015-03-04 16:30                             ` Andy Gospodarek
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04 15:58 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: David Miller, netdev

Andy Gospodarek <gospo@cumulusnetworks.com> writes:

> On Tue, Mar 03, 2015 at 05:10:44PM -0600, Eric W. Biederman wrote:
>> 
>> While looking at the mpls code I found myself writing yet another
>> version of neigh_lookup_noref.  We currently have __ipv4_neigh_lookup_noref
>> and __ipv6_neigh_lookup_noref.
>> 
>> So to make my work a little easier and to make it a smidge easier to
>> verify/maintain the mpls code in the future I stopped and wrote
>> ___neigh_lookup_noref.  Then I rewrote __ipv4_neigh_lookup_noref and
>> __ipv6_neigh_lookup_noref in terms of this new function.  I tested my new
>> version by verifying that the same code is generated in
>> ip_finish_output2 and ip6_finish_output2 where these functions are
>> inlined.
>> 
>> To get to ___neigh_lookup_noref I added a new neighbour cache table
>> function key_eq, so that the static size of the key would be
>> available.
>> 
>> I also added __neigh_lookup_noref for people who want to look up
>> a neighbour table entry quickly but don't know which neighbour table
>> they are going to look up.
>
> While I understand your intent here, you do really need to know which
> neighbour table is being used in order to do the look-up with your new
> function, so this changelog isn't quite accurate.  I know Dave has
> already accepted this patch, but it did not appear in the tree I just
> updated, so hopefully there is time to fix this if you agree with me.

Currently __ipv4_neigh_lookup_noref and __ipv6_neigh_lookup_noref hard code the
table.  __neigh_lookup_noref works without needing to hard code the
neighbour table.  The neighbour table being a variable in the code and
not a hard coded value is what I was referring to above when I said you
don't need to know your neighbour table.  That is, you still need a
neighbour table; it just doesn't need to be hard coded.

Eric
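
A minimal userspace sketch of the distinction described above, assuming
simplified stand-in types (struct entry, struct table, an int in place of
struct net_device *) that are not the kernel's.  The shared helper takes
key_eq and hash as parameters; one wrapper reads them from the table (the
table is a variable), the other passes constants (the table kind is hard
coded), which is what lets a compiler inline the fixed-size comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct entry {
	struct entry *next;
	int dev;		/* stands in for struct net_device * */
	uint8_t key[16];	/* stands in for primary_key */
};

struct table {
	bool (*key_eq)(const struct entry *e, const void *pkey);
	uint32_t (*hash)(const void *pkey, int dev);
	unsigned int shift;	/* log2 of the bucket count, 1..31 */
	struct entry **buckets;
};

/* Fixed-size 32-bit key comparison, in the spirit of neigh_key_eq32(). */
static bool key_eq32(const struct entry *e, const void *pkey)
{
	uint32_t a, b;

	memcpy(&a, e->key, sizeof(a));
	memcpy(&b, pkey, sizeof(b));
	return a == b;
}

/* Illustrative hash only; not the kernel's arp_hashfn(). */
static uint32_t hash32(const void *pkey, int dev)
{
	uint32_t k;

	memcpy(&k, pkey, sizeof(k));
	return (k ^ (uint32_t)dev) * 2654435761u;
}

/* Shared lookup: the comparison and hash are passed in by the caller. */
static inline struct entry *lookup_generic(
	struct table *tbl,
	bool (*key_eq)(const struct entry *e, const void *pkey),
	uint32_t (*hash)(const void *pkey, int dev),
	const void *pkey, int dev)
{
	uint32_t h = hash(pkey, dev) >> (32 - tbl->shift);
	struct entry *e;

	for (e = tbl->buckets[h]; e != NULL; e = e->next) {
		if (e->dev == dev && key_eq(e, pkey))
			return e;
	}
	return NULL;
}

/* Table is a variable: callbacks are read from the table at run time. */
static inline struct entry *lookup_any_table(struct table *tbl,
					     const void *pkey, int dev)
{
	return lookup_generic(tbl, tbl->key_eq, tbl->hash, pkey, dev);
}

/* Table kind is hard coded: callbacks are compile-time constants. */
static inline struct entry *lookup_v4(struct table *tbl,
				      const void *pkey, int dev)
{
	return lookup_generic(tbl, key_eq32, hash32, pkey, dev);
}
```

Both wrappers return the same entry for the same table; they differ only
in whether the comparison is known at compile time.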


* Re: [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref
  2015-03-04 15:58                           ` Eric W. Biederman
@ 2015-03-04 16:30                             ` Andy Gospodarek
  0 siblings, 0 replies; 88+ messages in thread
From: Andy Gospodarek @ 2015-03-04 16:30 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev

On Wed, Mar 04, 2015 at 09:58:28AM -0600, Eric W. Biederman wrote:
> Andy Gospodarek <gospo@cumulusnetworks.com> writes:
> 
> > On Tue, Mar 03, 2015 at 05:10:44PM -0600, Eric W. Biederman wrote:
> >> 
> >> While looking at the mpls code I found myself writing yet another
> >> version of neigh_lookup_noref.  We currently have __ipv4_neigh_lookup_noref
> >> and __ipv6_neigh_lookup_noref.
> >> 
> >> So to make my work a little easier and to make it a smidge easier to
> >> verify/maintain the mpls code in the future I stopped and wrote
> >> ___neigh_lookup_noref.  Then I rewrote __ipv4_neigh_lookup_noref and
> >> __ipv6_neigh_lookup_noref in terms of this new function.  I tested my new
> >> version by verifying that the same code is generated in
> >> ip_finish_output2 and ip6_finish_output2 where these functions are
> >> inlined.
> >> 
> >> To get to ___neigh_lookup_noref I added a new neighbour cache table
> >> function key_eq, so that the static size of the key would be
> >> available.
> >> 
> >> I also added __neigh_lookup_noref for people who want to look up
> >> a neighbour table entry quickly but don't know which neighbour table
> >> they are going to look up.
> >
> > While I understand your intent here, you do really need to know which
> > neighbour table is being used in order to do the look-up with your new
> > function, so this changelog isn't quite accurate.  I know Dave has
> > already accepted this patch, but it did not appear in the tree I just
> > updated, so hopefully there is time to fix this if you agree with me.
> 
> Currently __ipv4_neigh_lookup_noref and __ipv6_neigh_lookup_noref hard code the
> table.  __neigh_lookup_noref works without needing to hard code the
> neighbour table.  The neighbour table being a variable in the code and
> not a hard coded value is what I was referring to above when I said you
> don't need to know your neighbour table.  That is, you still need a
> neighbour table; it just doesn't need to be hard coded.

Thanks for the clarification.


* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-04  8:13                           ` roopa
@ 2015-03-04 20:36                             ` Eric W. Biederman
  2015-03-05  0:30                               ` roopa
  2015-03-05  2:50                               ` Bill Fink
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-04 20:36 UTC (permalink / raw)
  To: roopa; +Cc: David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

roopa <roopa@cumulusnetworks.com> writes:

> On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
>> mpls_route_add and mpls_route_del implement the basic logic for adding
>> and removing Next Hop Label Forwarding Entries from the MPLS input
>> label map.  The addition and subtraction is done in a way that is
>> consistent with how the existing routing tables in Linux are
>> maintained.  Thus all of the work to deal with NLM_F_APPEND,
>> NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
>>
>> Cases that are not clearly defined, such as changing the interpretation
>> of the mpls reserved labels, are not allowed.
>>
>> Because it seems like the right thing to do, adding an MPLS route without
>> specifying an input label and allowing the kernel to pick a free label
>> table entry is supported.   The implementation is currently less than optimal
>> but that can be changed.
>>
>> As I don't have anything else to test with, ethernet and the loopback
>> device are the only two device types currently supported for forwarding
>> MPLS over.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>   net/mpls/af_mpls.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 133 insertions(+)
>>
>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>> index b097125dfa33..e432f092f2fb 100644
>> --- a/net/mpls/af_mpls.c
>> +++ b/net/mpls/af_mpls.c
>> @@ -16,6 +16,7 @@
>>   #include <net/netns/generic.h>
>>   #include "internal.h"
>>   +#define LABEL_NOT_SPECIFIED (1<<20)
>>   #define MAX_NEW_LABELS 2
>>     /* This maximum ha length copied from the definition of struct
>> neighbour */
>> @@ -211,6 +212,19 @@ static struct packet_type mpls_packet_type __read_mostly = {
>>   	.func = mpls_forward,
>>   };
>>   +struct mpls_route_config {
>> +	u32		rc_protocol;
>> +	u32		rc_ifindex;
>> +	u16		rc_via_family;
>> +	u16		rc_via_alen;
>> +	u8		rc_via[MAX_VIA_ALEN];
>> +	u32		rc_label;
>> +	u32		rc_output_labels;
>> +	u32		rc_output_label[MAX_NEW_LABELS];
>> +	u32		rc_nlflags;
>> +	struct nl_info	rc_nlinfo;
>> +};
>> +
>>   static struct mpls_route *mpls_rt_alloc(size_t alen)
>>   {
>>   	struct mpls_route *rt;
>> @@ -245,6 +259,125 @@ static void mpls_route_update(struct net *net, unsigned index,
>>   	mpls_rt_free(old);
>>   }
>>   +static unsigned find_free_label(struct net *net)
>> +{
>> +	unsigned index;
>> +	for (index = 16; index < net->mpls.platform_labels; index++) {
>> +		if (!net->mpls.platform_label[index])
>> +			return index;
>> +	}
>> +	return LABEL_NOT_SPECIFIED;
>> +}
>> +
>> +static int mpls_route_add(struct mpls_route_config *cfg)
>> +{
>> +	struct net *net = cfg->rc_nlinfo.nl_net;
>> +	struct net_device *dev = NULL;
>> +	struct mpls_route *rt, *old;
>> +	unsigned index;
>> +	int i;
>> +	int err = -EINVAL;
>> +
>> +	index = cfg->rc_label;
>> +
>> +	/* If a label was not specified during insert pick one */
>> +	if ((index == LABEL_NOT_SPECIFIED) &&
>> +	    (cfg->rc_nlflags & NLM_F_CREATE)) {
>> +		index = find_free_label(net);
>> +	}
>> +
>> +	/* The first 16 labels are reserved, and may not be set */
>> +	if (index < 16)
>> +		goto errout;
>> +
>> +	/* The full 20 bit range may not be supported. */
>> +	if (index >= net->mpls.platform_labels)
>> +		goto errout;
>> +
>> +	/* Ensure only a supported number of labels are present */
>> +	if (cfg->rc_output_labels > MAX_NEW_LABELS)
>> +		goto errout;
>> +
>> +	err = -ENODEV;
>> +	dev = dev_get_by_index(net, cfg->rc_ifindex);
>> +	if (!dev)
>> +		goto errout;
>> +
>> +	/* For now just support ethernet devices */
>> +	err = -EINVAL;
>> +	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
>> +		goto errout;
>> +
>> +	err = -EINVAL;
>> +	if ((cfg->rc_via_family == AF_PACKET) &&
>> +	    (dev->addr_len != cfg->rc_via_alen))
>> +		goto errout;
>> +
>> +	/* Append makes no sense with mpls */
>> +	err = -EINVAL;
>
> minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
> really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
> I did not find any other such case for consistency check.

Yes.  IPv4 implements NLM_F_APPEND and IPv6 ignores it.

I will add a patch to change the error code.

Do you happen to know what NLM_F_APPEND means?  I couldn't figure it out
when glancing through the IPv4 code.

NLM_F_REPLACE seems obvious.  Though it seems to have exactly the
opposite meaning of NLM_F_EXCL.  Which seems to make NLM_F_EXCL
redundant.

My hunch is that there are meanings here that apply when you are doing
longest prefix matching that don't quite apply when you are doing
exact match routing.

Eric
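
A minimal sketch of how the create/replace/exclusive flags could combine on
an exact-match label table, including the "pick a free label" case from the
quoted patch.  Everything here is an assumption for illustration: the names
(label_add, struct label_table, the SK_* constants) are hypothetical, the
flag values are chosen to match NLM_F_REPLACE, NLM_F_EXCL and NLM_F_CREATE
in <linux/netlink.h>, and the existing-entry policy is one plausible reading
rather than what the patch does past the quoted lines.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical names; values chosen to match the NLM_F_* flags. */
#define SK_F_REPLACE		0x100u
#define SK_F_EXCL		0x200u
#define SK_F_CREATE		0x400u
#define SK_LABEL_NOT_SPECIFIED	(1u << 20)
#define SK_RESERVED_LABELS	16u

struct label_table {
	unsigned int size;	/* plays the role of platform_labels */
	const void **slot;	/* NULL means the label is free */
};

/* Linear scan for a free label, skipping the reserved range. */
static unsigned int find_free(const struct label_table *t)
{
	unsigned int i;

	for (i = SK_RESERVED_LABELS; i < t->size; i++) {
		if (!t->slot[i])
			return i;
	}
	return SK_LABEL_NOT_SPECIFIED;
}

/* Returns the label used on success, or a negative errno value. */
static int label_add(struct label_table *t, unsigned int label,
		     const void *route, unsigned int flags)
{
	/* If no label was given on create, pick one. */
	if (label == SK_LABEL_NOT_SPECIFIED && (flags & SK_F_CREATE))
		label = find_free(t);

	/* Reserved labels and labels beyond the table are rejected. */
	if (label < SK_RESERVED_LABELS || label >= t->size)
		return -EINVAL;

	if (t->slot[label]) {
		/* Existing entry: only an explicit replace may overwrite. */
		if ((flags & SK_F_EXCL) || !(flags & SK_F_REPLACE))
			return -EEXIST;
	} else if (!(flags & SK_F_CREATE)) {
		/* No entry and the caller did not ask to create one. */
		return -ENOENT;
	}

	t->slot[label] = route;
	return (int)label;
}
```

Under this policy an add without SK_F_REPLACE already fails on an occupied
label, so SK_F_EXCL only changes the outcome when it is combined with
SK_F_REPLACE, which is the redundancy noted above for exact-match tables.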


* Re: [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-04  5:53                         ` Eric W. Biederman
  2015-03-04 14:56                           ` Andy Gospodarek
@ 2015-03-04 21:04                           ` David Miller
  2015-03-05 12:35                             ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: David Miller @ 2015-03-04 21:04 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 03 Mar 2015 23:53:21 -0600

> We could potentially translate the numbers into the enumeration that is
> NEIGH_ARP_TABLE, NEIGH_ND_TABLE, and NEIGH_DN_TABLE.  Or waste a little
> bit of memory by having a 30 entry array and looking things up by address
> protocol number.   The only disadvantage I can see to using AF_NNN as
> the index is that it might be a little less cache friendly.

Yes, you can just store NEIGH_*_TABLE in your route entries and
pass that directly into neigh_xmit(), or something like that.

> Other issues: the hh header cache doesn't work.  (How much do we care?)
> 
> I worry a little that supporting the AF_PACKET case might cause problems
> in the future.
> 
> The cumulus folks are probably going to want to use neigh_xmit so they
> can have ipv6 nexthops on ipv4.  Using this for IPv4 and losing the
> header cache worries me a little.

We can have variable hard header caches per neigh entry if we really
want to.  The only issue is, again, making the demux simple.
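
A small sketch of the two demux styles under discussion, assuming made-up
names throughout (enum tbl_index, table_find_switch, table_find_index) and
illustrative address family numbers.  The first function maps a family to a
table with a switch; the second takes a small table index, as a route entry
could store, and indexes an array directly.

```c
#include <stddef.h>

/* Hypothetical indices modelled on NEIGH_ARP_TABLE, NEIGH_ND_TABLE, ... */
enum tbl_index { TBL_ARP, TBL_ND, TBL_DN, TBL_NR };

/* Illustrative address family numbers, not taken from any header. */
enum { FAM_INET = 2, FAM_INET6 = 10, FAM_DECNET = 12 };

struct table {
	const char *id;
};

static struct table arp_table = { "arp" };
static struct table nd_table = { "nd" };
static struct table dn_table = { "dn" };

static struct table *tables[TBL_NR] = {
	[TBL_ARP] = &arp_table,
	[TBL_ND]  = &nd_table,
	[TBL_DN]  = &dn_table,
};

/* Switch-style demux keyed by address family. */
static struct table *table_find_switch(int family)
{
	switch (family) {
	case FAM_INET:
		return tables[TBL_ARP];
	case FAM_INET6:
		return tables[TBL_ND];
	case FAM_DECNET:
		return tables[TBL_DN];
	default:
		return NULL;
	}
}

/* Direct array demux keyed by a pre-translated table index. */
static struct table *table_find_index(unsigned int index)
{
	return index < TBL_NR ? tables[index] : NULL;
}
```

With the index form the family-to-table translation happens once, when the
route is configured, instead of on every transmitted packet.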


* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-04 20:36                             ` Eric W. Biederman
@ 2015-03-05  0:30                               ` roopa
  2015-03-05  2:50                               ` Bill Fink
  1 sibling, 0 replies; 88+ messages in thread
From: roopa @ 2015-03-05  0:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

On 3/4/15, 12:36 PM, Eric W. Biederman wrote:
> roopa <roopa@cumulusnetworks.com> writes:
>
>> On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
>>> mpls_route_add and mpls_route_del implement the basic logic for adding
>>> and removing Next Hop Label Forwarding Entries from the MPLS input
>>> label map.  The addition and subtraction is done in a way that is
>>> consistent with how the existing routing tables in Linux are
>>> maintained.  Thus all of the work to deal with NLM_F_APPEND,
>>> NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
>>>
>>> Cases that are not clearly defined, such as changing the interpretation
>>> of the mpls reserved labels, are not allowed.
>>>
>>> Because it seems like the right thing to do, adding an MPLS route without
>>> specifying an input label and allowing the kernel to pick a free label
>>> table entry is supported.   The implementation is currently less than optimal
>>> but that can be changed.
>>>
>>> As I don't have anything else to test with, ethernet and the loopback
>>> device are the only two device types currently supported for forwarding
>>> MPLS over.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>    net/mpls/af_mpls.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 133 insertions(+)
>>>
>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>>> index b097125dfa33..e432f092f2fb 100644
>>> --- a/net/mpls/af_mpls.c
>>> +++ b/net/mpls/af_mpls.c
>>> @@ -16,6 +16,7 @@
>>>    #include <net/netns/generic.h>
>>>    #include "internal.h"
>>>    +#define LABEL_NOT_SPECIFIED (1<<20)
>>>    #define MAX_NEW_LABELS 2
>>>      /* This maximum ha length copied from the definition of struct
>>> neighbour */
>>> @@ -211,6 +212,19 @@ static struct packet_type mpls_packet_type __read_mostly = {
>>>    	.func = mpls_forward,
>>>    };
>>>    +struct mpls_route_config {
>>> +	u32		rc_protocol;
>>> +	u32		rc_ifindex;
>>> +	u16		rc_via_family;
>>> +	u16		rc_via_alen;
>>> +	u8		rc_via[MAX_VIA_ALEN];
>>> +	u32		rc_label;
>>> +	u32		rc_output_labels;
>>> +	u32		rc_output_label[MAX_NEW_LABELS];
>>> +	u32		rc_nlflags;
>>> +	struct nl_info	rc_nlinfo;
>>> +};
>>> +
>>>    static struct mpls_route *mpls_rt_alloc(size_t alen)
>>>    {
>>>    	struct mpls_route *rt;
>>> @@ -245,6 +259,125 @@ static void mpls_route_update(struct net *net, unsigned index,
>>>    	mpls_rt_free(old);
>>>    }
>>>    +static unsigned find_free_label(struct net *net)
>>> +{
>>> +	unsigned index;
>>> +	for (index = 16; index < net->mpls.platform_labels; index++) {
>>> +		if (!net->mpls.platform_label[index])
>>> +			return index;
>>> +	}
>>> +	return LABEL_NOT_SPECIFIED;
>>> +}
>>> +
>>> +static int mpls_route_add(struct mpls_route_config *cfg)
>>> +{
>>> +	struct net *net = cfg->rc_nlinfo.nl_net;
>>> +	struct net_device *dev = NULL;
>>> +	struct mpls_route *rt, *old;
>>> +	unsigned index;
>>> +	int i;
>>> +	int err = -EINVAL;
>>> +
>>> +	index = cfg->rc_label;
>>> +
>>> +	/* If a label was not specified during insert pick one */
>>> +	if ((index == LABEL_NOT_SPECIFIED) &&
>>> +	    (cfg->rc_nlflags & NLM_F_CREATE)) {
>>> +		index = find_free_label(net);
>>> +	}
>>> +
>>> +	/* The first 16 labels are reserved, and may not be set */
>>> +	if (index < 16)
>>> +		goto errout;
>>> +
>>> +	/* The full 20 bit range may not be supported. */
>>> +	if (index >= net->mpls.platform_labels)
>>> +		goto errout;
>>> +
>>> +	/* Ensure only a supported number of labels are present */
>>> +	if (cfg->rc_output_labels > MAX_NEW_LABELS)
>>> +		goto errout;
>>> +
>>> +	err = -ENODEV;
>>> +	dev = dev_get_by_index(net, cfg->rc_ifindex);
>>> +	if (!dev)
>>> +		goto errout;
>>> +
>>> +	/* For now just support ethernet devices */
>>> +	err = -EINVAL;
>>> +	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
>>> +		goto errout;
>>> +
>>> +	err = -EINVAL;
>>> +	if ((cfg->rc_via_family == AF_PACKET) &&
>>> +	    (dev->addr_len != cfg->rc_via_alen))
>>> +		goto errout;
>>> +
>>> +	/* Append makes no sense with mpls */
>>> +	err = -EINVAL;
>> minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
>> really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
>> I did not find any other such case for consistency check.
> Yes.  IPv4 implements NLM_F_APPEND and IPv6 ignores it.
>
> I will add a patch to change the error code.

Thanks.
>
> Do you happen to know what NLM_F_APPEND means?  I couldn't figure it out
> when glancing through the IPv4 code.
>
> NLM_F_REPLACE seems obvious.  Though it seems to have exactly the
> opposite meaning of NLM_F_EXCL.  Which seems to make NLM_F_EXCL
> redundant.
>
> My hunch is that there are meanings here that apply when you are doing
> longest prefix matching that don't quite apply when you are doing
> exact match routing.

NLM_F_APPEND only affects the position where the new fib_alias gets added
(fib_insert_alias). By default a new fib_alias is added at the front;
NLM_F_APPEND makes it go to the tail instead, AFAICT. (This applies to
fib_aliases for the same prefix or fib_info.)

Thanks,
Roopa
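
A minimal sketch of the head-versus-tail behaviour described above, using a
plain singly linked list.  struct alias, alias_insert and SK_F_APPEND are
hypothetical names (the flag value is chosen to match NLM_F_APPEND); the
sketch models only the insert position and ignores any other ordering the
real fib_alias list may apply.

```c
#include <stddef.h>

/* Hypothetical flag; value chosen to match NLM_F_APPEND. */
#define SK_F_APPEND 0x800u

struct alias {
	struct alias *next;
	int id;
};

/*
 * Insert new_alias into the list for one prefix: at the front by
 * default, at the tail when the append flag is set.
 */
static void alias_insert(struct alias **head, struct alias *new_alias,
			 unsigned int flags)
{
	struct alias **pos = head;

	if (flags & SK_F_APPEND) {
		/* Walk to the link that terminates the list. */
		while (*pos)
			pos = &(*pos)->next;
	}
	new_alias->next = *pos;
	*pos = new_alias;
}
```

An exact-match table such as the MPLS label map has one entry per key, so
there is no list position for the flag to select.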


* Re: [PATCH net-next 7/8] mpls: Multicast route table change notifications
  2015-02-26 15:12       ` roopa
@ 2015-03-05  1:56         ` Andy Gospodarek
  0 siblings, 0 replies; 88+ messages in thread
From: Andy Gospodarek @ 2015-03-05  1:56 UTC (permalink / raw)
  To: roopa
  Cc: Eric W. Biederman, David Miller, netdev, Stephen Hemminger,
	santiago, Vivek Venkatraman

On Thu, Feb 26, 2015 at 07:12:34AM -0800, roopa wrote:
> On 2/26/15, 6:03 AM, Eric W. Biederman wrote:
> >roopa <roopa@cumulusnetworks.com> writes:
> >
> >>On 2/25/15, 9:19 AM, Eric W. Biederman wrote:
> >>>Unlike IPv4 this code notifies on all cases where mpls routes
> >>>are added or removed as that was the simplest to implement.
> >>>
> >>>In particular routes being removed because a network interface
> >>>goes down or is removed are notified about.  Are there technical
> >>>arguments for handling this differently ? Userspace developers
> >>>don't particularly like the way IPv4 handles route removal
> >>>on ifdown.
> >>that is true. However, from previous emails on this topic on netdev,
> >>there is no reason to notify these deletes to userspace thereby creating a
> >>notification storm
> >>when userspace can figure this out. Which seems like a valid reason.
> >>(Your approach resembles IPv6 which does generate these notifications and
> >>userspace is usually happy with this).
> >Grr.  There is an even better way to do this.
> >
> >The semantically best way to handle this is to simply not use routes for
> >forwarding where the network interface is down, the carrier is down, or
> >the network device has gone away.
> 
> agreed, And we have an internal patch that does this for regular routing
> on carrier down (which we will upstream soon).
Yep, I should be able to easily forward-port it from 3.17 to net-next
without much issue.  Eric, feel free to email me directly if you want to
see what I've got now.

> >
> >Apparently there are some multi-path scenarios that already do this
> >legitimately, and routes going away automatically can cause userspace
> >other kinds of problems.
> >
> >In MPLS I especially don't want to free the routing table slot until I
> >know that the change has propagated in the network and I can be
> >reasonably confident that no-one will send me traffic on that label.
> >Otherwise there is a chance the label will be reused too soon.
> ack
> >
> >Grumble.  That is a code change I need to make.  Grumble.
> >
> >I also need to look and see if those multi-path scenarios report a next
> >hop as dead or just rely on the network interface state (which I think
> >it is) to be sufficient information relayed to userspace
> >
> they are marked DEAD on ifdown today (AFAIR they don't generate a
> notification in IPv4) and are skipped during route lookup.
> Only when all the nexthops in a multi-path route are dead is the
> multipath route declared dead
> and deleted (with no notification to userspace in the IPv4 case).
> 
> Thanks,
> Roopa
> 


* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-04 20:36                             ` Eric W. Biederman
  2015-03-05  0:30                               ` roopa
@ 2015-03-05  2:50                               ` Bill Fink
  2015-03-05 11:54                                 ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: Bill Fink @ 2015-03-05  2:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: roopa, David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

On Wed, 04 Mar 2015, Eric W. Biederman wrote:

> roopa <roopa@cumulusnetworks.com> writes:
> 
> > On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
> > >> mpls_route_add and mpls_route_del implement the basic logic for adding
> > >> and removing Next Hop Label Forwarding Entries from the MPLS input
> > >> label map.  The addition and subtraction is done in a way that is
> > >> consistent with how the existing routing tables in Linux are
> > >> maintained.  Thus all of the work to deal with NLM_F_APPEND,
> > >> NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.
> > >>
> > >> Cases that are not clearly defined, such as changing the interpretation
> > >> of the mpls reserved labels, are not allowed.
> > >>
> > >> Because it seems like the right thing to do, adding an MPLS route without
> > >> specifying an input label and allowing the kernel to pick a free label
> > >> table entry is supported.   The implementation is currently less than optimal
> > >> but that can be changed.
> > >>
> > >> As I don't have anything else to test with, ethernet and the loopback
> > >> device are the only two device types currently supported for forwarding
> > >> MPLS over.
> >>
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>   net/mpls/af_mpls.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 133 insertions(+)
> >>
> >> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> >> index b097125dfa33..e432f092f2fb 100644
> >> --- a/net/mpls/af_mpls.c
> >> +++ b/net/mpls/af_mpls.c
> >> @@ -16,6 +16,7 @@
> >>   #include <net/netns/generic.h>
> >>   #include "internal.h"
> >>   +#define LABEL_NOT_SPECIFIED (1<<20)
> >>   #define MAX_NEW_LABELS 2
> >>     /* This maximum ha length copied from the definition of struct
> >> neighbour */
> >> @@ -211,6 +212,19 @@ static struct packet_type mpls_packet_type __read_mostly = {
> >>   	.func = mpls_forward,
> >>   };
> >>   +struct mpls_route_config {
> >> +	u32		rc_protocol;
> >> +	u32		rc_ifindex;
> >> +	u16		rc_via_family;
> >> +	u16		rc_via_alen;
> >> +	u8		rc_via[MAX_VIA_ALEN];
> >> +	u32		rc_label;
> >> +	u32		rc_output_labels;
> >> +	u32		rc_output_label[MAX_NEW_LABELS];
> >> +	u32		rc_nlflags;
> >> +	struct nl_info	rc_nlinfo;
> >> +};
> >> +
> >>   static struct mpls_route *mpls_rt_alloc(size_t alen)
> >>   {
> >>   	struct mpls_route *rt;
> >> @@ -245,6 +259,125 @@ static void mpls_route_update(struct net *net, unsigned index,
> >>   	mpls_rt_free(old);
> >>   }
> >>   +static unsigned find_free_label(struct net *net)
> >> +{
> >> +	unsigned index;
> >> +	for (index = 16; index < net->mpls.platform_labels; index++) {
> >> +		if (!net->mpls.platform_label[index])
> >> +			return index;
> >> +	}
> >> +	return LABEL_NOT_SPECIFIED;
> >> +}
> >> +
> >> +static int mpls_route_add(struct mpls_route_config *cfg)
> >> +{
> >> +	struct net *net = cfg->rc_nlinfo.nl_net;
> >> +	struct net_device *dev = NULL;
> >> +	struct mpls_route *rt, *old;
> >> +	unsigned index;
> >> +	int i;
> >> +	int err = -EINVAL;
> >> +
> >> +	index = cfg->rc_label;
> >> +
> >> +	/* If a label was not specified during insert pick one */
> >> +	if ((index == LABEL_NOT_SPECIFIED) &&
> >> +	    (cfg->rc_nlflags & NLM_F_CREATE)) {
> >> +		index = find_free_label(net);
> >> +	}
> >> +
> >> +	/* The first 16 labels are reserved, and may not be set */
> >> +	if (index < 16)
> >> +		goto errout;
> >> +
> >> +	/* The full 20 bit range may not be supported. */
> >> +	if (index >= net->mpls.platform_labels)
> >> +		goto errout;
> >> +
> >> +	/* Ensure only a supported number of labels are present */
> >> +	if (cfg->rc_output_labels > MAX_NEW_LABELS)
> >> +		goto errout;
> >> +
> >> +	err = -ENODEV;
> >> +	dev = dev_get_by_index(net, cfg->rc_ifindex);
> >> +	if (!dev)
> >> +		goto errout;
> >> +
> >> +	/* For now just support ethernet devices */
> >> +	err = -EINVAL;
> >> +	if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_LOOPBACK))
> >> +		goto errout;
> >> +
> >> +	err = -EINVAL;
> >> +	if ((cfg->rc_via_family == AF_PACKET) &&
> >> +	    (dev->addr_len != cfg->rc_via_alen))
> >> +		goto errout;
> >> +
> >> +	/* Append makes no sense with mpls */
> >> +	err = -EINVAL;
> >
> > minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
> > really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
> > I did not find any other such case for consistency check.
> 
> Yes.  IPv4 implements NLM_F_APPEND and IPv6 ignores it.
> 
> I will add a patch to change the error code.

I believe the error -ENOTSUPP is deprecated and -EOPNOTSUPP should
be used instead.

						-Bill

* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-02-25 17:18 ` [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel Eric W. Biederman
@ 2015-03-05  9:17   ` Vivek Venkatraman
  2015-03-05 14:00     ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-05  9:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

It is great to see an MPLS data plane implementation make it into the
kernel. I have a couple of questions on this patch.

On Wed, Feb 25, 2015 at 9:18 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> Allow creating an mpls tunnel endpoint with
>
> ip link add type ipmpls.
>
> This tunnel has an mpls label for it's link layer address, and by
> default sends all ingress packets over loopback to the local MPLS
> forwarding logic which performs all of the work.
>

Is it correct that to achieve IPoMPLS, each LSP has to be installed as
a link/netdevice?

If ingress packets are looped back with the label associated with the
link to hit the MPLS forwarding logic, how does it work if each packet
then has to be forwarded with a different label stack? One use case is a
common IP/MPLS application such as L3VPNs (RFC 4364) where multiple
VPNs may reside over the same LSP, each having its own VPN (inner)
label.

> Ingress IPv4, IPv6 and MPLS packets are supported.
>
> A new arp type ARPHRD_MPLS is defined for network devices that
> whose link-layer addresses is an mpls label stack.
>
> This is the most bare bones version of this tunnel device I can think
> of.  Not even packet counters have been implemented. Offloads
> and features in general are not supported, just to keep it simple and
> obviously correct to start with.  In principle it should be able to
> allow binding to a physical network device and pass all of the
> offloads through ipmpls like the vlan, macvlan, or even ipvlan does.
> Allowing a very fast light weight connection to the network.
>
> The technical tricky bit to residing over something besides
> the loopback device is how to get the next-hop mac address.
> Neighbour table integration?  Something else?
>

Thanks,
Vivek

* Re: [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table
  2015-03-04  1:11                         ` [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
@ 2015-03-05  9:45                           ` Vivek Venkatraman
  2015-03-05 13:22                             ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-05  9:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This sysctl gives two benefits.  By defaulting the table size to 0
> mpls even when compiled in and enabled defaults to not forwarding
> any packets.  This prevents unpleasant surprises for users.
>
> The other benefit is that as mpls labels are allocated locally a dense
> table a small dense label table may be used which saves memory and
> is extremely simple and efficient to implement.
>

The label space is often partitioned into multiple sets in MPLS and
used for different purposes - for example, LSP labels, VPN labels,
Segment labels. This in turn means that the table may no longer be
dense. A sysctl allowing min and max label that spans the sets of
labels may be useful. Or should the ILM be made a hash table?

> This sysctl allows userspace to choose the restrictions on the label
> table size userspace applications need to cope with.
>

Vivek

* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-03 20:22                 ` Eric W. Biederman
  2015-03-03 20:33                   ` David Miller
@ 2015-03-05 10:14                   ` Steven Whitehouse
  2015-03-06 20:44                     ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: Steven Whitehouse @ 2015-03-05 10:14 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller; +Cc: netdev, ralf, linux-hams

Hi,

On 03/03/15 20:22, Eric W. Biederman wrote:
> David Miller <davem@davemloft.net> writes:
>
>> From: ebiederm@xmission.com (Eric W. Biederman)
>> Date: Tue, 03 Mar 2015 09:41:47 -0600
>>
>>> Before the ax25 stack calls dev_queue_xmit it always calls
>>> ax25_type_trans which sets skb->protocol to ETH_P_AX25.
>>>
>>> Which means that by looking at the protocol type it is possible to
>>> detect IP packets that have not been munged by the ax25 stack in
>>> ndo_start_xmit and call a function to munge them.
>>>
>>> Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and
>>> value to be appropriate for an ndo_start_xmit function.
>>>
>>> Update all of the ax25 devices to test the protocol type for ETH_P_IP
>>> and return ax25_ip_xmit as the first thing they do.  This preserves
>>> the existing semantics of IP packet processing, but the timing will be
>>> a little different as the IP packets now pass through the qdisc layer
>>> before reaching the ax25 ip packet processing.
>>>
>>> Remove the now unnecessary ax25 neighbour table operations.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> Another nice cleanup, applied, thanks Eric.
> We can almost universally use the same procedures for generating
> link layer headers from neighbour table entries now.  I had hoped
> to optimized things by removing function pointers.
>
> The big hold out is DECnet that sets src_mac based on the DECnet source
> address.
That is a requirement of DECnet I'm afraid - DECnet does not have an 
exact equivalent of ARP/ndisc and many hosts will refuse to communicate 
if the MAC address is not the expected one based on the DECnet address. 
This is one bit of DECnet that is being used (to the best of my 
knowledge) and working.

> Which leads me to the conclusion that since DECnet has a different
> algorithm for setting the src_mac than everything else in the kernel
> DECnet neighbour table entries can not be used for nexthops for other
> protocols :(
>
> DECnet also abuses neigh->output to select by output device which kind
> of DECnet header to put on the packets.  But that is easily fixable.
One way to fix it would be to drop support for non-broadcast devices. We 
don't have an implementation of DDCMP currently. Ethernet is the only 
working DECnet device at the moment. PPP could also potentially work, 
with a bit of tweaking, but strangely PPP is a broadcast device so far
as DECnet is concerned.

Steve.


* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-05  2:50                               ` Bill Fink
@ 2015-03-05 11:54                                 ` Eric W. Biederman
  2015-03-05 19:10                                   ` Bill Fink
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 11:54 UTC (permalink / raw)
  To: Bill Fink
  Cc: roopa, David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

Bill Fink <billfink@mindspring.com> writes:

> On Wed, 04 Mar 2015, Eric W. Biederman wrote:
>
>> roopa <roopa@cumulusnetworks.com> writes:
>> 
>> > On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
>> >> +	/* Append makes no sense with mpls */
>> >> +	err = -EINVAL;
>> >
>> > minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
>> > really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
>> > I did not find any other such case for consistency check.
>> 
>> Yes.  IPv4 implements NLM_F_APPEND and IPv6 ignores it.
>> 
>> I will add a patch to change the error code.
>
> I believe the error -ENOTSUPP is deprecated and -EOPNOTSUPP should
> be used instead.

Ack.

In particular ENOTSUPP is not allowed to make it to user space while
EOPNOTSUPP is.

Which makes me a little leery when I grep the kernel code and see so
many uses of ENOTSUPP in the kernel.
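
A minimal userspace sketch of the flag check being discussed, assuming
the error code is changed as suggested.  check_route_flags is a
hypothetical helper, not the code from the patch; it only shows
NLM_F_APPEND being refused with EOPNOTSUPP, the code that is allowed to
reach user space.

```c
#include <errno.h>
#include <linux/netlink.h>	/* NLM_F_APPEND, NLM_F_CREATE */

/* Hypothetical helper (not from the patch): reject netlink flags that
 * an mpls route add cannot honour.
 */
static int check_route_flags(unsigned int nlflags)
{
	/* Append makes no sense with mpls.  Use EOPNOTSUPP, which may be
	 * returned to user space, rather than the kernel-internal ENOTSUPP.
	 */
	if (nlflags & NLM_F_APPEND)
		return -EOPNOTSUPP;

	return 0;
}
```

A caller would pass the nlmsg_flags of the request and return any
non-zero result to the sender unchanged.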

Eric

* Re: [PATCH net-next 0/2] Neighbour table prep for MPLS
  2015-03-04 21:04                           ` David Miller
@ 2015-03-05 12:35                             ` Eric W. Biederman
  0 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 12:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andy Gospodarek

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 03 Mar 2015 23:53:21 -0600
>
>> We could potentially translate the numbers into the enumeration that is
>> NEIGH_ARP_TABLE, NEIGH_ND_TABLE, and NEIGH_DN_TABLE.  Or waste a little
>> bit of memory in have a 30 entry array and looking things up by address
>> protocol number.   The only disadvantage I can see to using AF_NNN as
>> the index is that it might be a little less cache friendly.
>
> Yes, you can just store NEIGH_*_TABLE in your route entries and
> pass that directly into neigh_xmit(), or something like that.

And using the NEIGH_*_TABLE defines doesn't look too bad.  I walked a
little way down the path of what it would take to remove NEIGH_*_TABLE
altogether and replace NEIGH_*_TABLE with AF_*, but the loops that run
over each possible neighbour table just seemed too horrible to
convert that way.

So I now have an implementation that changes my routing tables to hold
NEIGH_*_TABLE and it doesn't look bad at all.  Especially given that I
already have to filter the address families whose neighbour tables I can
use.
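
Roughly, the filtering and translation could look like the sketch
below.  This is userspace C for illustration only: the enum is a
stand-in for the kernel's NEIGH_*_TABLE indexes (its values are
assumed), and via_family_to_table is a hypothetical name.

```c
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Stand-in for the kernel's NEIGH_*_TABLE indexes; values are illustrative. */
enum neigh_table_index {
	NEIGH_ARP_TABLE,
	NEIGH_ND_TABLE,
	NEIGH_NR_TABLES
};

/* Hypothetical helper: translate the address family of a next hop into
 * the neighbour table index stored in the route.  Returns -1 for any
 * family whose neighbour table cannot be used.
 */
static int via_family_to_table(int family)
{
	switch (family) {
	case AF_INET:
		return NEIGH_ARP_TABLE;
	case AF_INET6:
		return NEIGH_ND_TABLE;
	default:
		return -1;
	}
}
```

The translation happens once when the route is configured, so the
forwarding path only ever sees the small table index.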

>> Other issues the hh header cache doesn't work. (How much do we care).
>> 
>> I worry a little that supporting AF_PACKET case might cause problems
>> in the future.
>> 
>> The cumulus folks are probably going to want to use neigh_xmit so they
>> can have ipv6 nexthops on ipv4.  Using this for IPv4 and loosing the
>> header cache worries me a little.
>
> We can have variable hard header caches per neigh entry if we really
> want to.  The only issue is, again, making the demux simple.

This is where things start creeping up on benchmarking that is really
more work than I am ready to take on for this project.

I think it would be really interesting to know if the hardware header
caches are worth it, and if they are is it only because it avoids a
function pointer or because all of our data comes from the same cache
line.

Looking at the code I find it interesting that we reserve less space
for the hardware header than we do for the hardware address itself.

Optimization opportunities clearly abound.

Eric

* Re: [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table
  2015-03-05  9:45                           ` Vivek Venkatraman
@ 2015-03-05 13:22                             ` Eric W. Biederman
  2015-03-05 14:38                               ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 13:22 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

Vivek Venkatraman <vivek@cumulusnetworks.com> writes:

> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> This sysctl gives two benefits.  By defaulting the table size to 0
>> mpls even when compiled in and enabled defaults to not forwarding
>> any packets.  This prevents unpleasant surprises for users.
>>
>> The other benefit is that as mpls labels are allocated locally a dense
>> table a small dense label table may be used which saves memory and
>> is extremely simple and efficient to implement.
>>
>
> The label space is often partitioned into multiple sets in MPLS and
> used for different purposes - for example, LSP labels, VPN labels,
> Segment labels. This in turn means that the table may no longer be
> dense. A sysctl allowing min and max label that spans the sets of
> labels may be useful. Or should the ILM be made a hash table?

Good question.

These kinds of labels are a local label management problem.

Given how nice it is to have a reasonably dense label space I am not
keen to abandon the notion of having a dense label space, as it makes
the code simple and fast for forwarding mpls packets.
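
To make "simple and fast" concrete, here is a sketch of what a dense
table lookup amounts to: one bounds check and one array index.  It is
plain userspace C with a simplified stand-in for the per-namespace
state (field names follow the patch, label_lookup is a hypothetical
name), and it leaves out the rcu handling the kernel needs.

```c
#include <stddef.h>

struct mpls_route;	/* opaque for this sketch */

/* Simplified stand-in for the per-namespace mpls state in the patch. */
struct label_table {
	unsigned int platform_labels;		/* table size (the sysctl) */
	struct mpls_route **platform_label;	/* routes indexed by label */
};

/* Hypothetical helper: find the route for an incoming label.
 * Labels outside the configured table size simply have no route.
 */
static struct mpls_route *label_lookup(const struct label_table *t,
				       unsigned int label)
{
	if (label >= t->platform_labels)
		return NULL;

	return t->platform_label[label];
}
```

With a table size of 0 every lookup fails, which is how the default
ends up forwarding nothing.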

That said my code is a starting point.  If you have a real-world use
case and can show a better way to deal with it, go for it.
Now is definitely the time to evolve the API.

Eric

* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-05  9:17   ` Vivek Venkatraman
@ 2015-03-05 14:00     ` Eric W. Biederman
  2015-03-05 16:25       ` Vivek Venkatraman
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 14:00 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

Vivek Venkatraman <vivek@cumulusnetworks.com> writes:

> It is great to see an MPLS data plane implementation make it into the
> kernel. I have a couple of questions on this patch.
>
> On Wed, Feb 25, 2015 at 9:18 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>>
>> Allow creating an mpls tunnel endpoint with
>>
>> ip link add type ipmpls.
>>
>> This tunnel has an mpls label for it's link layer address, and by
>> default sends all ingress packets over loopback to the local MPLS
>> forwarding logic which performs all of the work.
>>
>
> Is it correct that to achieve IPoMPLS, each LSP has to be installed as
> a link/netdevice?

This is still a bit in flux.  The ingress logic is not yet merged.  When
I resent the patches I did not resend this one as I am less happy with
it than I am about the others and the problem is orthogonal.

> If ingress packets loopback with the label associated with the link to
> hit the MPLS forwarding logic, how does it work if each packet has to
> be then forwarded with a different label stack? One use case is a
> common IP/MPLS application such as L3VPNs (RFC 4364) where multiple
> VPNs may reside over the same LSP, each having its own VPN (inner)
> label.

If we continue using this approach (which I picked because it was simple
for bootstrapping and testing) the way it would work is that you have a
local label that when you forward packets with that label all of the
other needed labels are pushed.

That said I think the approach I chose has a lot going for it.

Fundamentally I think the ingress to an mpls tunnel needs
the same knobs and parameters as struct mpls_route.  Aka which machine
do we forward the packets to, and which labels do we push.

The extra decrement of the hop count on ingress is not my favorite
thing.

The question in my mind is how do we select which mpls route to use.
Spending a local label for that purpose does not seem particularly
unreasonable.

Using one network device per tunnel is a bit more questionable.  I keep
playing with ideas that would allow a single device to serve multiple
mpls tunnels.

For going from normal ip routing to mpls routing, somewhere we need the
destination ip prefix to mpls tunnel mapping. There are a couple of
possible ways this could be solved.
- One ingress network device per mpls tunnel.
- One ingress network device with a configurable routing
  prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
  does something like this for ATM virtual circuits.
- One ingress network device that looks at IP_ROUTE_CLASSID and
  uses that to select the mpls labels to use.
- Teach the IP network stack how to insert packets in tunnels without
  needing a magic netdevice.

None of the ideas I have thought of so far feels just right.

At the same time I don't think there is a lot of wiggle room in the
fundamentals.  Mapping ip routes to mpls tunnels in a way that software
can process quickly and efficiently, with code that is maintainable, does
not leave a lot of wiggle room at the end of the day.

Eric

* Re: [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table
  2015-03-05 13:22                             ` Eric W. Biederman
@ 2015-03-05 14:38                               ` Eric W. Biederman
  2015-03-05 16:49                                 ` Vivek Venkatraman
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 14:38 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

ebiederm@xmission.com (Eric W. Biederman) writes:

> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>
>> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>
>>> This sysctl gives two benefits.  By defaulting the table size to 0
>>> mpls even when compiled in and enabled defaults to not forwarding
>>> any packets.  This prevents unpleasant surprises for users.
>>>
>>> The other benefit is that as mpls labels are allocated locally a dense
>>> table a small dense label table may be used which saves memory and
>>> is extremely simple and efficient to implement.
>>>
>>
>> The label space is often partitioned into multiple sets in MPLS and
>> used for different purposes - for example, LSP labels, VPN labels,
>> Segment labels. This in turn means that the table may no longer be
>> dense. A sysctl allowing min and max label that spans the sets of
>> labels may be useful. Or should the ILM be made a hash table?
>
> Good question.
>
> These kinds of labels are a local label management problem.
>
> Given how nice it is to have a reasonably dense label space I am not
> keen to abandon the notion of having a dense label space, as it makes
> the code simple and fast for forwarding mpls packets.
>
> That said my code is a starting point.  If you have a real world use
> case and you can show a better way to deal with it.  Go for it.
> Now is definitely time to evolve the API.

A couple more thoughts.  

The rtnetlink interface and my implementation carry a type field so it
is possible to mark which routing protocol uses an mpls route.

The global routing table is already over 500,000 routes so the 1 million
forwarding equivalence classes of mpls with a single label may be
exhausted in the not too distant future so a dense label space may be a
necessity.

In a similar vein, when I look at top of rack switches and their
hardware forwarding capacity it looks like they are in the ballpark
of 32K MPLS routes.

All of which says to me that the MPLS label space is limited and it
should be managed as a precious resource.  (A good example of why I
might want to rethink my mpls ingress path).

So while I can see arguments for one use of labels getting one quota of
labels and another use of labels getting another quota when I look at
the space there are not that many labels and I don't see how or why it
would make sense to manage the labels explicitly with ranges.

At some point for MPLS multicast traffic and MPLS source specific
addresses if we choose to support those we will need a hash table
as those addresses are assigned by others, though in that case
we will be limited in our egress set of labels we can use.

So I think MPLS interfaces need to encourage thrifty label use,
which in my mind almost certainly means not manually allocated
label use.
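
As a sketch of what automatic allocation from a dense table looks
like, modelled on find_free_label from patch 4 but written as
standalone userspace C: the macro names and the value of the
LABEL_NOT_SPECIFIED sentinel are assumptions made for this example.

```c
#include <stddef.h>

#define MPLS_LABEL_SPACE	(1u << 20)	/* 20-bit label: 1,048,576 values */
#define MPLS_RESERVED_LABELS	16u		/* labels 0-15 are reserved */
#define LABEL_NOT_SPECIFIED	MPLS_LABEL_SPACE /* assumed sentinel, outside the valid range */

/* Return the lowest unused, non-reserved label in a dense table of
 * nslots entries, or LABEL_NOT_SPECIFIED if the table is full.
 * A NULL slot means the label is free.
 */
static unsigned int find_free_label(void *const *slots, unsigned int nslots)
{
	unsigned int index;

	for (index = MPLS_RESERVED_LABELS; index < nslots; index++) {
		if (!slots[index])
			return index;
	}
	return LABEL_NOT_SPECIFIED;
}
```

Always handing out the lowest free label is what keeps the table dense
and small; a single label gives at most MPLS_LABEL_SPACE minus
MPLS_RESERVED_LABELS usable values.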

Eric

* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-05 14:00     ` Eric W. Biederman
@ 2015-03-05 16:25       ` Vivek Venkatraman
  2015-03-05 19:52         ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-05 16:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

On Thu, Mar 5, 2015 at 6:00 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>
>> It is great to see an MPLS data plane implementation make it into the
>> kernel. I have a couple of questions on this patch.
>>
>> On Wed, Feb 25, 2015 at 9:18 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>>
>>> Allow creating an mpls tunnel endpoint with
>>>
>>> ip link add type ipmpls.
>>>
>>> This tunnel has an mpls label for it's link layer address, and by
>>> default sends all ingress packets over loopback to the local MPLS
>>> forwarding logic which performs all of the work.
>>>
>>
>> Is it correct that to achieve IPoMPLS, each LSP has to be installed as
>> a link/netdevice?
>
> This is still a bit in flux.  The ingress logic is not yet merged.  When
> I resent the patches I did not resend this one as I am less happy with
> it than I am about the others and the problem is orthogonal.
>
>> If ingress packets loopback with the label associated with the link to
>> hit the MPLS forwarding logic, how does it work if each packet has to
>> be then forwarded with a different label stack? One use case is a
>> common IP/MPLS application such as L3VPNs (RFC 4364) where multiple
>> VPNs may reside over the same LSP, each having its own VPN (inner)
>> label.
>
> If we continue using this approach (which I picked because it was simple
> for bootstrapping and testing) the way it would work is that you have a
> local label that when you forward packets with that label all of the
> other needed labels are pushed.
>

Yes, I can see that this approach is simple for bootstrapping.

However, I think the need for a local label is going to be a bit of a
challenge as well as not intuitive. I say the latter because at an
ingress LSP (i.e., the kernel is performing an MPLS LER function), you
are only pushing labels just based on normal IP routing (or L2, if
implementing a pseudowire), so needing to assign a local label that
then gets popped seems convoluted. The challenge is because the local
label has to be unique for the label stack that needs to be imposed,
it is not just a 1-to-1 mapping with the tunnel.

> That said I think the approach I chose has a lot going for it.
>
> Fundamentally I think the ingress to an mpls tunnel needs
> the same knobs and parameters as struct mpls_route.  Aka which machine
> do we forward the packets to, and which labels do we push.
>
> The extra decrement of the hop count on ingress is not my favorite
> thing.
>
> The question in my mind is how do we select which mpls route to use.
> Spending a local label for that purpose does not seem particularly
> unreasonable.
>
> Using one network device per tunnel is a bit more questionable.  I keep
> playing with ideas that would allow a single device to serve multiple
> mpls tunnels.
>

For the scenario I mentioned (L3VPNs) which would be common at the
edge, isn't it a network device per "VPN" (or more precisely, per VPN
per LSP)? I don't think this scales well.

> For going from normal ip routing to mpls routing, somewhere we need the
> destination ip prefix to mpls tunnel mapping. There are a couple of
> possible ways this could be solved.
> - One ingress network device per mpls tunnel.
> - One ingress network device with a configurable routing
>   prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
>   does something like this for ATM virtual circuits.
> - One ingress network device that looks at IP_ROUTE_CLASSID and
>   uses that to select the mpls labels to use.
> - Teach the IP network stack how to insert packets in tunnels without
>   needing a magic netdevice.
>

I feel it should be along the lines of "teach the IP network stack how
to push labels". In general, MPLS LSPs can be set up as hop-by-hop
routed LSPs (when using a signaling protocol like LDP or BGP) as well
as tunnels that may take a different path than normal routing. I feel
it is good if the dataplane can support both models. In the former,
the IP network stack should push the labels which are just
encapsulation and then just transmit on the underlying netdevice that
corresponds to the neighbor interface. To achieve this, maybe it is
the neighbor (nexthop) that has to reference the mpls_route. In the
latter (LSPs are treated as tunnels and/or this is the only model
supported), the IP network stack would still need to impose any inner
labels (i.e., VPN or pseudowire, later on Entropy or Segment labels)
and then transmit over the tunnel netdevice which would impose the
tunnel label.
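
To illustrate what "pushing labels as encapsulation" means at the byte
level, here is a userspace sketch of the RFC 3032 shim header and of
writing a label stack in front of a packet.  The function and struct
names are hypothetical; the loop mirrors the one in mpls_forward from
patch 2, where only the last entry written keeps the bottom of stack
bit.

```c
#include <stdint.h>
#include <stdbool.h>
#include <arpa/inet.h>	/* htonl, ntohl */

/* RFC 3032 shim entry, viewed as a host-order 32-bit word:
 * label: bits 31-12, traffic class: bits 11-9,
 * bottom of stack: bit 8, TTL: bits 7-0.
 */
struct mpls_decoded {
	uint32_t label;
	uint8_t tc;
	bool bos;
	uint8_t ttl;
};

/* Build one shim entry in wire (big endian) byte order. */
static uint32_t shim_encode(uint32_t label, uint8_t ttl, uint8_t tc, bool bos)
{
	return htonl(((label & 0xfffffu) << 12) |
		     ((uint32_t)(tc & 0x7) << 9) |
		     ((uint32_t)bos << 8) |
		     ttl);
}

/* Byteswap to cpu order first, then shift out the fields. */
static struct mpls_decoded shim_decode(uint32_t wire)
{
	uint32_t v = ntohl(wire);
	struct mpls_decoded d = {
		.label = v >> 12,
		.tc = (v >> 9) & 0x7,
		.bos = (v >> 8) & 0x1,
		.ttl = v & 0xff,
	};
	return d;
}

/* Write n labels into hdr[0..n-1], outermost label first.  The last
 * entry inherits bos; every entry above it has the bit cleared.
 */
static void push_labels(uint32_t *hdr, const uint32_t *labels, int n,
			uint8_t ttl, bool bos)
{
	for (int i = n - 1; i >= 0; i--) {
		hdr[i] = shim_encode(labels[i], ttl, 0, bos);
		bos = false;
	}
}
```

In the L3VPN case the VPN (inner) label would be the last element of
labels and the LSP (outer) label the first, whichever part of the stack
ends up doing the push.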


> None of the ideas I have thought of so far feels just right.
>
> At the same time I don't think there is a lot of wiggle room in the
> fundamentals.  Mapping ip routes to mpls tunnels in a way that software
> can process quickly and efficiently, with code that is maintainable, does
> not leave a lot of wiggle room at the end of the day.
>
> Eric

Vivek

* Re: [PATCH net-next 2/7] mpls: Basic routing support
  2015-03-04  1:10                         ` [PATCH net-next 2/7] mpls: Basic routing support Eric W. Biederman
@ 2015-03-05 16:36                           ` Vivek Venkatraman
  2015-03-05 18:42                             ` Eric W. Biederman
  0 siblings, 1 reply; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-05 16:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

On Tue, Mar 3, 2015 at 5:10 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This change adds a new Kconfig option MPLS_ROUTING.
>
> The core of this change is the code to look at an mpls packet received
> from another machine.  Look that packet up in a routing table and
> forward the packet on.
>
> Support of MPLS over ATM is not considered or attempted here.  This
> implementation follows RFC3032 and implements the MPLS shim header that
> can pass over essentially any network.
>
> What RFC3031 refers to as the Incoming Label Map (ILM) I call
> net->mpls.platform_label[].  What RFC3031 refers to as the Next Hop
> Label Forwarding Entry (NHLFE) I call mpls_route.  Though calling it the
> label forwarding information base (lfib) might also be valid.
>

This currently does not allow for ECMP when acting as a transit, correct?

> Further the implementation forwards packets as described in RFC3032.
> There is no need and given the original motivation for MPLS a strong
> disincentive to have a flexible label forwarding path.  In essence
> the logic is the topmost label is read, looked up, removed, and
> replaced by 0 or more new labels and then sent out the specified
> interface to its next hop.
>
> Quite a few optional features are not implemented here.  Among them
> are generation of ICMP errors when the TTL is exceeded or the packet
> is larger than the next hop MTU (those conditions are detected and the
> packets are dropped instead of generating an icmp error).  The traffic
> class field is always set to 0.  The implementation focuses on IP over
> MPLS and does not handle egress of other kinds of protocols.
>
> Instead of implementing coordination with the neighbour table and
> sorting out how to input next hops in a different address family (for
> which there is value), I was lazy and implemented a next hop mac
> address instead.  The code is simpler and there are flavors of MPLS
> such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
> appropriate so a next hop by mac address would need to be implemented
> at some point.
>

I guess the above is no longer the case with this revised patch which
can support an IPv4 or IPv6 next hop too, right?

> Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.
>
> Decoding the mpls header must be done by first byteswapping a 32bit big
> endian word into the local cpu endian and then bit shifting to extract
> the pieces.  There is no C bit-field that can represent a wire format
> mpls header on a little endian machine as the low bits of the 20bit
> label wind up in the wrong half of the third byte.  Therefore internally
> everything is dealt with in cpu native byte order except when writing
> to and reading from a packet.
>
> For management simplicity if a label is configured to forward out
> an interface that is down the packet is dropped early.  Similarly
> if a network interface is removed rt_dev is updated to NULL
> (so no reference is preserved) and any packets for that label
> are dropped.  Keeping the label entries in the kernel allows
> the kernel label table to function as the definitive source
> of which labels are allocated and which are not.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/socket.h      |   2 +
>  include/net/net_namespace.h |   4 +
>  include/net/netns/mpls.h    |  15 ++
>  net/mpls/Kconfig            |   5 +
>  net/mpls/Makefile           |   1 +
>  net/mpls/af_mpls.c          | 349 ++++++++++++++++++++++++++++++++++++++++++++
>  net/mpls/internal.h         |  56 +++++++
>  7 files changed, 432 insertions(+)
>  create mode 100644 include/net/netns/mpls.h
>  create mode 100644 net/mpls/af_mpls.c
>  create mode 100644 net/mpls/internal.h
>
> <snip>
> +
> +static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
> +                       struct packet_type *pt, struct net_device *orig_dev)
> +{
> +       struct net *net = dev_net(dev);
> +       struct mpls_shim_hdr *hdr;
> +       struct mpls_route *rt;
> +       struct mpls_entry_decoded dec;
> +       struct net_device *out_dev;
> +       unsigned int hh_len;
> +       unsigned int new_header_size;
> +       unsigned int mtu;
> +       int err;
> +
> +       /* Careful this entire function runs inside of an rcu critical section */
> +
> +       if (skb->pkt_type != PACKET_HOST)
> +               goto drop;
> +
> +       if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
> +               goto drop;
> +
> +       if (!pskb_may_pull(skb, sizeof(*hdr)))
> +               goto drop;
> +
> +       /* Read and decode the label */
> +       hdr = mpls_hdr(skb);
> +       dec = mpls_entry_decode(hdr);
> +
> +       /* Pop the label */
> +       skb_pull(skb, sizeof(*hdr));
> +       skb_reset_network_header(skb);
> +
> +       skb_orphan(skb);
> +
> +       rt = mpls_route_input_rcu(net, dec.label);
> +       if (!rt)
> +               goto drop;
> +
> +       /* Find the output device */
> +       out_dev = rt->rt_dev;
> +       if (!mpls_output_possible(out_dev))
> +               goto drop;
> +
> +       if (skb_warn_if_lro(skb))
> +               goto drop;
> +
> +       skb_forward_csum(skb);
> +
> +       /* Verify ttl is valid */
> +       if (dec.ttl <= 2)

Why is this "<= 2"?

> +               goto drop;
> +       dec.ttl -= 1;
> +
> +       /* Verify the destination can hold the packet */
> +       new_header_size = mpls_rt_header_size(rt);
> +       mtu = mpls_dev_mtu(out_dev);
> +       if (mpls_pkt_too_big(skb, mtu - new_header_size))
> +               goto drop;
> +
> +       hh_len = LL_RESERVED_SPACE(out_dev);
> +       if (!out_dev->header_ops)
> +               hh_len = 0;
> +
> +       /* Ensure there is enough space for the headers in the skb */
> +       if (skb_cow(skb, hh_len + new_header_size))
> +               goto drop;
> +
> +       skb->dev = out_dev;
> +       skb->protocol = htons(ETH_P_MPLS_UC);
> +
> +       if (unlikely(!new_header_size && dec.bos)) {
> +               /* Penultimate hop popping */
> +               if (!mpls_egress(rt, skb, dec))
> +                       goto drop;
> +       } else {
> +               bool bos;
> +               int i;
> +               skb_push(skb, new_header_size);
> +               skb_reset_network_header(skb);
> +               /* Push the new labels */
> +               hdr = mpls_hdr(skb);
> +               bos = dec.bos;
> +               for (i = rt->rt_labels - 1; i >= 0; i--) {
> +                       hdr[i] = mpls_entry_encode(rt->rt_label[i], dec.ttl, 0, bos);
> +                       bos = false;
> +               }
> +       }
> +
> +       err = neigh_xmit(rt->rt_via_family, out_dev, rt->rt_via, skb);
> +       if (err)
> +               net_dbg_ratelimited("%s: packet transmission failed: %d\n",
> +                                   __func__, err);
> +       return 0;
> +
> +drop:
> +       kfree_skb(skb);
> +       return NET_RX_DROP;
> +}
> +

Vivek


* Re: [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table
  2015-03-05 14:38                               ` Eric W. Biederman
@ 2015-03-05 16:49                                 ` Vivek Venkatraman
  0 siblings, 0 replies; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-05 16:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

On Thu, Mar 5, 2015 at 6:38 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>>
>>> On Tue, Mar 3, 2015 at 5:11 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>
>>>> This sysctl gives two benefits.  By defaulting the table size to 0
>>>> mpls even when compiled in and enabled defaults to not forwarding
>>>> any packets.  This prevents unpleasant surprises for users.
>>>>
>>>> The other benefit is that as mpls labels are allocated locally a
>>>> small dense label table may be used which saves memory and
>>>> is extremely simple and efficient to implement.
>>>>
>>>
>>> The label space is often partitioned into multiple sets in MPLS and
>>> used for different purposes - for example, LSP labels, VPN labels,
>>> Segment labels. This in turn means that the table may no longer be
>>> dense. A sysctl allowing min and max label that spans the sets of
>>> labels may be useful. Or should the ILM be made a hash table?
>>
>> Good question.
>>
>> These kinds of labels are a local label management problem.
>>
>> Given how nice it is to have a reasonably dense label space I am not
>> keen to abandon the notion of having a dense label space, as it makes
>> the code simple and fast for forwarding mpls packets.
>>

That's true. I guess this can be considered a local label management issue.
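
To make the trade-off concrete, the dense table amounts to roughly the
following.  This is only a sketch: struct sketch_route, struct
sketch_label_table and sketch_label_lookup are made-up names for
illustration, not the ones used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a route entry; not the patch's struct mpls_route. */
struct sketch_route {
	int out_ifindex;
};

/* Hypothetical dense label table: slot[label] is the route for that label. */
struct sketch_label_table {
	size_t size;                 /* number of slots; 0 means forward nothing */
	struct sketch_route **slot;  /* NULL entries are unallocated labels */
};

/* O(1) lookup: one bounds check and one array index, no hashing. */
struct sketch_route *sketch_label_lookup(const struct sketch_label_table *t,
					 uint32_t label)
{
	assert(t != NULL);
	if (label >= t->size)
		return NULL;
	return t->slot[label];
}
```

With a size of 0 every lookup misses, which matches the default of not
forwarding any packets.  A sparse label space would waste slots here,
which is where a hash table would come in.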

>> That said my code is a starting point.  If you have a real world use
>> case and you can show a better way to deal with it.  Go for it.
>> Now is definitely time to evolve the API.
>
> A couple more thoughts.
>
> The rtnetlink interface and my implementation carries a type field so it
> is possible to mark which routing protocol uses an mpls route.
>
> The global routing table is already over 500,000 routes so the 1 million
> forwarding equivalence classes of mpls with a single label may be
> exhausted in the not too distant future so a dense label space may be a
> necessity.
>
> In a similar vein.  When I look at top of rack switches and their
> hardware forwarding capacity it looks like they are in the ballpark
> of 32K MPLS routes.
>
> All of which says to me that the MPLS label space is limited and it
> should be managed as a precious resource.  (A good example of why I
> might want to rethink my mpls ingress path).
>
> So while I can see arguments for one use of labels getting one quota of
> labels and another use of labels getting another quota when I look at
> the space there are not that many labels and I don't see how or why it
> would make sense to manage the labels explicitly with ranges.
>

The ability to impose a label stack and the way a FEC is defined at
the edge are what may help against exhaustion of the label space.
It is usually for the label stack that multiple ranges of labels are
used. Again, I guess that can be handled by the application and the
dataplane can treat the entire set of labels as a single flat range.

> At some point for MPLS multicast traffic and MPLS source specific
> addresses if we choose to support those we will need a hash table
> as those addresses are assigned by others, though in that case
> we will be limited in our egress set of labels we can use.
>
> So I think MPLS interfaces need to encourage thrifty label use,
> which in my mind almost certainly means not manually allocated
> label use.
>
> Eric


* Re: [PATCH net-next 2/7] mpls: Basic routing support
  2015-03-05 16:36                           ` Vivek Venkatraman
@ 2015-03-05 18:42                             ` Eric W. Biederman
  0 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 18:42 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago, Simon Horman

Vivek Venkatraman <vivek@cumulusnetworks.com> writes:

> On Tue, Mar 3, 2015 at 5:10 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> This change adds a new Kconfig option MPLS_ROUTING.
>>
>> The core of this change is the code to look at an mpls packet received
>> from another machine.  Look that packet up in a routing table and
>> forward the packet on.
>>
>> Support of MPLS over ATM is not considered or attempted here.  This
>> implementation follows RFC3032 and implements the MPLS shim header that
>> can pass over essentially any network.
>>
>> What RFC3031 refers to as the Incoming Label Map (ILM) I call
>> net->mpls.platform_label[].  What RFC3031 refers to as the Next Hop
>> Label Forwarding Entry (NHLFE) I call mpls_route.  Though calling it the
>> label forwarding information base (lfib) might also be valid.
>>
>
> This currently does not allow for ECMP when acting as a transit,
> correct?

Correct.  There is no fundamental reason for that, ECMP just has not
been implemented yet.

>> Further the implementation forwards packets as described in RFC3032.
>> There is no need and given the original motivation for MPLS a strong
>> disincentive to have a flexible label forwarding path.  In essence
>> the logic is the topmost label is read, looked up, removed, and
>> replaced by 0 or more new labels and then sent out the specified
>> interface to its next hop.
>>
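
The "replaced by 0 or more new labels" step can be sketched roughly as
below.  This is a simplified illustration, not the code from the patch:
sketch_push_labels is a made-up name, the entries are kept in cpu byte
order, and the traffic class is left at 0.  Only the last entry written
inherits the bottom-of-stack bit of the label that was popped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fill entries[0..n-1] with label stack entries in cpu byte order.
 * Layout per entry: label (20 bits) | tc (3 bits) | bos (1 bit) | ttl (8 bits).
 * The loop runs from the innermost label outwards so that only the
 * innermost new entry can carry the bottom-of-stack bit.
 */
void sketch_push_labels(uint32_t *entries, const uint32_t *labels, int n,
			uint8_t ttl, bool bos)
{
	int i;

	assert(n >= 0);
	for (i = n - 1; i >= 0; i--) {
		entries[i] = (labels[i] << 12) | ((uint32_t)bos << 8) | ttl;
		bos = false;
	}
}
```

With n == 0 nothing is written, which is the case where the label is
simply popped.
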
>> Quite a few optional features are not implemented here.  Among them
>> are generation of ICMP errors when the TTL is exceeded or the packet
>> is larger than the next hop MTU (those conditions are detected and the
>> packets are dropped instead of generating an icmp error).  The traffic
>> class field is always set to 0.  The implementation focuses on IP over
>> MPLS and does not handle egress of other kinds of protocols.
>>
>> Instead of implementing coordination with the neighbour table and
>> sorting out how to input next hops in a different address family (for
>> which there is value).  I was lazy and implemented a next hop mac
>> address instead.  The code is simpler and there are flavors of MPLS
>> such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
>> appropriate so a next hop by mac address would need to be implemented
>> at some point.
>>
>
> I guess the above is no longer the case with this revised patch which
> can support a IPv4 or IPv6 next hop too, right?

Correct.

>> Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.
>>
>> Decoding the mpls header must be done by first byteswapping a 32bit big
>> endian word into the local cpu endian and then bit shifting to extract
>> the pieces.  There is no C bit-field that can represent a wire format
>> mpls header on a little endian machine as the low bits of the 20bit
>> label wind up in the wrong half of the third byte.  Therefore internally
>> everything is dealt with in cpu native byte order except when writing
>> to and reading from a packet.
>>
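
As a rough illustration of that decoding, assuming the RFC3032 layout
of label (20 bits), traffic class (3 bits), bottom of stack (1 bit) and
ttl (8 bits): the names below (struct sketch_shim, sketch_shim_decode,
sketch_shim_encode) are made up for this sketch and are not the helpers
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One decoded label stack entry, all fields in cpu byte order. */
struct sketch_shim {
	uint32_t label;
	uint8_t  tc;
	bool     bos;
	uint8_t  ttl;
};

/* Assemble the 4 wire bytes (big endian) into a cpu-order word, then shift. */
struct sketch_shim sketch_shim_decode(const uint8_t wire[4])
{
	uint32_t entry = ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
			 ((uint32_t)wire[2] << 8)  |  (uint32_t)wire[3];
	struct sketch_shim dec;

	dec.label = entry >> 12;
	dec.tc    = (entry >> 9) & 0x7;
	dec.bos   = (entry >> 8) & 0x1;
	dec.ttl   = entry & 0xff;
	return dec;
}

/* Inverse operation: build the cpu-order word, then store it big endian. */
void sketch_shim_encode(uint8_t wire[4], uint32_t label, uint8_t ttl,
			uint8_t tc, bool bos)
{
	uint32_t entry;

	assert(label < (1u << 20));
	assert(tc < 8);
	entry = (label << 12) | ((uint32_t)tc << 9) | ((uint32_t)bos << 8) | ttl;
	wire[0] = entry >> 24;
	wire[1] = (entry >> 16) & 0xff;
	wire[2] = (entry >> 8) & 0xff;
	wire[3] = entry & 0xff;
}
```

For label 100 with bos set and a ttl of 64 the wire bytes are
00 06 41 40: the low four bits of the label share the third byte with
the traffic class and bos bits, which is what defeats a bit-field.
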
>> For management simplicity if a label is configured to forward out
>> an interface that is down the packet is dropped early.  Similarly
>> if a network interface is removed rt_dev is updated to NULL
>> (so no reference is preserved) and any packets for that label
>> are dropped.  Keeping the label entries in the kernel allows
>> the kernel label table to function as the definitive source
>> of which labels are allocated and which are not.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  include/linux/socket.h      |   2 +
>>  include/net/net_namespace.h |   4 +
>>  include/net/netns/mpls.h    |  15 ++
>>  net/mpls/Kconfig            |   5 +
>>  net/mpls/Makefile           |   1 +
>>  net/mpls/af_mpls.c          | 349 ++++++++++++++++++++++++++++++++++++++++++++
>>  net/mpls/internal.h         |  56 +++++++
>>  7 files changed, 432 insertions(+)
>>  create mode 100644 include/net/netns/mpls.h
>>  create mode 100644 net/mpls/af_mpls.c
>>  create mode 100644 net/mpls/internal.h
>>
>> <snip>
>> +
>> +static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>> +                       struct packet_type *pt, struct net_device *orig_dev)
>> +{
>> +       struct net *net = dev_net(dev);
>> +       struct mpls_shim_hdr *hdr;
>> +       struct mpls_route *rt;
>> +       struct mpls_entry_decoded dec;
>> +       struct net_device *out_dev;
>> +       unsigned int hh_len;
>> +       unsigned int new_header_size;
>> +       unsigned int mtu;
>> +       int err;
>> +
>> +       /* Careful this entire function runs inside of an rcu critical section */
>> +
>> +       if (skb->pkt_type != PACKET_HOST)
>> +               goto drop;
>> +
>> +       if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL)
>> +               goto drop;
>> +
>> +       if (!pskb_may_pull(skb, sizeof(*hdr)))
>> +               goto drop;
>> +
>> +       /* Read and decode the label */
>> +       hdr = mpls_hdr(skb);
>> +       dec = mpls_entry_decode(hdr);
>> +
>> +       /* Pop the label */
>> +       skb_pull(skb, sizeof(*hdr));
>> +       skb_reset_network_header(skb);
>> +
>> +       skb_orphan(skb);
>> +
>> +       rt = mpls_route_input_rcu(net, dec.label);
>> +       if (!rt)
>> +               goto drop;
>> +
>> +       /* Find the output device */
>> +       out_dev = rt->rt_dev;
>> +       if (!mpls_output_possible(out_dev))
>> +               goto drop;
>> +
>> +       if (skb_warn_if_lro(skb))
>> +               goto drop;
>> +
>> +       skb_forward_csum(skb);
>> +
>> +       /* Verify ttl is valid */
>> +       if (dec.ttl <= 2)
>
> Why is this "<= 2"?

It appears I rewrote that section one too many times; it should be <= 1.

>> +               goto drop;
>> +       dec.ttl -= 1;
>> +
>> +       /* Verify the destination can hold the packet */
>> +       new_header_size = mpls_rt_header_size(rt);
>> +       mtu = mpls_dev_mtu(out_dev);
>> +       if (mpls_pkt_too_big(skb, mtu - new_header_size))
>> +               goto drop;
>> +
>> +       hh_len = LL_RESERVED_SPACE(out_dev);
>> +       if (!out_dev->header_ops)
>> +               hh_len = 0;
>> +
>> +       /* Ensure there is enough space for the headers in the skb */
>> +       if (skb_cow(skb, hh_len + new_header_size))
>> +               goto drop;
>> +
>> +       skb->dev = out_dev;
>> +       skb->protocol = htons(ETH_P_MPLS_UC);
>> +
>> +       if (unlikely(!new_header_size && dec.bos)) {
>> +               /* Penultimate hop popping */
>> +               if (!mpls_egress(rt, skb, dec))
>> +                       goto drop;
>> +       } else {
>> +               bool bos;
>> +               int i;
>> +               skb_push(skb, new_header_size);
>> +               skb_reset_network_header(skb);
>> +               /* Push the new labels */
>> +               hdr = mpls_hdr(skb);
>> +               bos = dec.bos;
>> +               for (i = rt->rt_labels - 1; i >= 0; i--) {
>> +                       hdr[i] = mpls_entry_encode(rt->rt_label[i], dec.ttl, 0, bos);
>> +                       bos = false;
>> +               }
>> +       }
>> +
>> +       err = neigh_xmit(rt->rt_via_family, out_dev, rt->rt_via, skb);
>> +       if (err)
>> +               net_dbg_ratelimited("%s: packet transmission failed: %d\n",
>> +                                   __func__, err);
>> +       return 0;
>> +
>> +drop:
>> +       kfree_skb(skb);
>> +       return NET_RX_DROP;
>> +}
>> +
>
> Vivek


* Re: [PATCH net-next 4/7] mpls: Basic support for adding and removing routes
  2015-03-05 11:54                                 ` Eric W. Biederman
@ 2015-03-05 19:10                                   ` Bill Fink
  0 siblings, 0 replies; 88+ messages in thread
From: Bill Fink @ 2015-03-05 19:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: roopa, David Miller, netdev, Stephen Hemminger, santiago, Simon Horman

On Thu, 05 Mar 2015, Eric W. Biederman wrote:

> Bill Fink <billfink@mindspring.com> writes:
> 
> > On Wed, 04 Mar 2015, Eric W. Biederman wrote:
> >
> >> roopa <roopa@cumulusnetworks.com> writes:
> >> 
> >> > On 3/3/15, 5:12 PM, Eric W. Biederman wrote:
> >> >> +	/* Append makes no sense with mpls */
> >> >> +	err = -EINVAL;
> >> >
> >> > minor nit: should this be -ENOTSUPP in that case ? (NLM_F_REPLACE and NLM_F_APPEND are
> >> > really operations. But, one can argue that they are an attribute of the msg and hence -EINVAL might be ok).
> >> > I did not find any other such case for consistency check.
> >> 
> >> Yes.  IPv4 implements NLM_F_APPEND and IPv6 ignores it.
> >> 
> >> I will add a patch to change the error code.
> >
> > I believe the error -ENOTSUPP is deprecated and -EOPNOTSUPP should
> > be used instead.
> 
> Ack.
> 
> In particular ENOTSUPP is not allowed to make it to user space while
> EOPNOTSUPP is.
> 
> Which makes me a little leery when I grep the kernel code and see so
> many uses of ENOTSUPP.

Perhaps checkpatch.pl could be modified to warn about new uses
of ENOTSUPP.

						-Bill


* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-05 16:25       ` Vivek Venkatraman
@ 2015-03-05 19:52         ` Eric W. Biederman
  2015-03-06  6:05           ` Vivek Venkatraman
  2015-03-07 10:36           ` Robert Shearman
  0 siblings, 2 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-05 19:52 UTC (permalink / raw)
  To: Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

Vivek Venkatraman <vivek@cumulusnetworks.com> writes:

> On Thu, Mar 5, 2015 at 6:00 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>>
>>> It is great to see an MPLS data plane implementation make it into the
>>> kernel. I have a couple of questions on this patch.
>>>
>>> On Wed, Feb 25, 2015 at 9:18 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>>
>>>>
>>>> Allow creating an mpls tunnel endpoint with
>>>>
>>>> ip link add type ipmpls.
>>>>
>>>> This tunnel has an mpls label for its link layer address, and by
>>>> default sends all ingress packets over loopback to the local MPLS
>>>> forwarding logic which performs all of the work.
>>>>
>>>
>>> Is it correct that to achieve IPoMPLS, each LSP has to be installed as
>>> a link/netdevice?
>>
>> This is still a bit in flux.  The ingress logic is not yet merged.  When
>> I resent the patches I did not resend this one as I am less happy with
>> it than I am about the others and the problem is orthogonal.
>>
>>> If ingress packets loopback with the label associated with the link to
>>> hit the MPLS forwarding logic, how does it work if each packet has to
>>> be then forwarded with a different label stack? One use case is a
>>> common IP/MPLS application such as L3VPNs (RFC 4364) where multiple
>>> VPNs may reside over the same LSP, each having its own VPN (inner)
>>> label.
>>
>> If we continue using this approach (which I picked because it was simple
>> for bootstrapping and testing) the way it would work is that you have a
>> local label that when you forward packets with that label all of the
>> other needed labels are pushed.
>>
>
> Yes, I can see that this approach is simple for bootstrapping.
>
> However, I think the need for a local label is going to be bit of a
> challenge as well as not intuitive. I say the latter because at an
> ingress LSP (i.e., the kernel is performing an MPLS LER function), you
> are only pushing labels just based on normal IP routing (or L2, if
> implementing a pseudowire), so needing to assign a local label that
> then gets popped seems convoluted. The challenge is because the local
> label has to be unique for the label stack that needs to be imposed,
> it is not just a 1-to-1 mapping with the tunnel.

Agreed.

>> That said I think the approach I chose has a lot going for it.
>>
>> Fundamentally I think the ingress to an mpls tunnel needs
>> the same knobs and parameters as struct mpls_route.  Aka which machine
>> do we forward the packets to, and which labels do we push.
>>
>> The extra decrement of the hop count on ingress is not my favorite
>> thing.
>>
>> The question in my mind is how do we select which mpls route to use.
>> Spending a local label for that purpose does not seem particularly
>> unreasonable.
>>
>> Using one network device per tunnel is a bit more questionable.  I keep
>> playing with ideas that would allow a single device to serve multiple
>> mpls tunnels.
>>
>
> For the scenario I mentioned (L3VPNs) which would be common at the
> edge, isn't it a network device per "VPN" (or more precisely, per VPN
> per LSP)? I don't think this scales well.

We need a data structure in the kernel for each Forwarding Equivalence
Class (aka per VPN per LSP); the only question is how expensive that
data structure should be.

In big-O notation the scaling is equal.  The practical question is how
large our constant factors are and whether they are a problem.  If the
L3VPN results in enough entries on a machine then it is a scaling
problem, otherwise not so much.

>> For going from normal ip routing to mpls routing somewhere we need the
>> destination ip prefix to mpls tunnel mapping. There are a couple of
>> possible ways this could be solved.
>> - One ingress network device per mpls tunnel.
>> - One ingress network device with a configurable routing
>>   prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
>>   does something like this for ATM virtual circuits.
>> - One ingress network device that looks at IP_ROUTE_CLASSID and
>>   uses that to select the mpls labels to use.
>> - Teach the IP network stack how to insert packets in tunnels without
>>   needing a magic netdevice.
>>
>
> I feel it should be along the lines of "teach the IP network stack how
> to push labels".

That phrasing sets off alarm bells in my mind of mpls-specific hacks in
the kernel, which most likely will cause performance regressions and
maintenance complications.

> In general, MPLS LSPs can be setup as hop-by-hop
> routed LSPs (when using a signaling protocol like LDP or BGP) as well
> as tunnels that may take a different path than normal routing. I feel
> it is good if the dataplane can support both models. In the former,
> the IP network stack should push the labels which are just
> encapsulation and then just transmit on the underlying netdevice that
> corresponds to the neighbor interface. To achieve this, maybe it is
> the neighbor (nexthop) that has to reference the mpls_route. In the
> latter (LSPs are treated as tunnels and/or this is the only model
> supported), the IP network stack would still need to impose any inner
> labels (i.e., VPN or pseudowire, later on Entropy or Segment labels)
> and then transmit over the tunnel netdevice which would impose the
> tunnel label.

Potentially.  This part of the discussion has reached the point where I
need to see code to carry this part of the discussion any farther.

Eric


* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-05 19:52         ` Eric W. Biederman
@ 2015-03-06  6:05           ` Vivek Venkatraman
  2015-03-07 10:36           ` Robert Shearman
  1 sibling, 0 replies; 88+ messages in thread
From: Vivek Venkatraman @ 2015-03-06  6:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

On Thu, Mar 5, 2015 at 11:52 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>
>> On Thu, Mar 5, 2015 at 6:00 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>>>
>>>> It is great to see an MPLS data plane implementation make it into the
>>>> kernel. I have a couple of questions on this patch.
>>>>
>>>> On Wed, Feb 25, 2015 at 9:18 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>>
>>>>>
>>>>> Allow creating an mpls tunnel endpoint with
>>>>>
>>>>> ip link add type ipmpls.
>>>>>
>>>>> This tunnel has an mpls label for its link layer address, and by
>>>>> default sends all ingress packets over loopback to the local MPLS
>>>>> forwarding logic which performs all of the work.
>>>>>
>>>>
>>>> Is it correct that to achieve IPoMPLS, each LSP has to be installed as
>>>> a link/netdevice?
>>>
>>> This is still a bit in flux.  The ingress logic is not yet merged.  When
>>> I resent the patches I did not resend this one as I am less happy with
>>> it than I am about the others and the problem is orthogonal.
>>>
>>>> If ingress packets loopback with the label associated with the link to
>>>> hit the MPLS forwarding logic, how does it work if each packet has to
>>>> be then forwarded with a different label stack? One use case is a
>>>> common IP/MPLS application such as L3VPNs (RFC 4364) where multiple
>>>> VPNs may reside over the same LSP, each having its own VPN (inner)
>>>> label.
>>>
>>> If we continue using this approach (which I picked because it was simple
>>> for bootstrapping and testing) the way it would work is that you have a
>>> local label that when you forward packets with that label all of the
>>> other needed labels are pushed.
>>>
>>
>> Yes, I can see that this approach is simple for bootstrapping.
>>
>> However, I think the need for a local label is going to be bit of a
>> challenge as well as not intuitive. I say the latter because at an
>> ingress LSP (i.e., the kernel is performing an MPLS LER function), you
>> are only pushing labels just based on normal IP routing (or L2, if
>> implementing a pseudowire), so needing to assign a local label that
>> then gets popped seems convoluted. The challenge is because the local
>> label has to be unique for the label stack that needs to be imposed,
>> it is not just a 1-to-1 mapping with the tunnel.
>
> Agreed.
>
>>> That said I think the approach I chose has a lot going for it.
>>>
>>> Fundamentally I think the ingress to an mpls tunnel needs
>>> the same knobs and parameters as struct mpls_route.  Aka which machine
>>> do we forward the packets to, and which labels do we push.
>>>
>>> The extra decrement of the hop count on ingress is not my favorite
>>> thing.
>>>
>>> The question in my mind is how do we select which mpls route to use.
>>> Spending a local label for that purpose does not seem particularly
>>> unreasonable.
>>>
>>> Using one network device per tunnel is a bit more questionable.  I keep
>>> playing with ideas that would allow a single device to serve multiple
>>> mpls tunnels.
>>>
>>
>> For the scenario I mentioned (L3VPNs) which would be common at the
>> edge, isn't it a network device per "VPN" (or more precisely, per VPN
>> per LSP)? I don't think this scales well.
>
> We need a data structure in the kernel for each Forwarding Equivalence
> Class (aka per VPN per LSP); the only question is how expensive that
> data structure should be.
>
> In big-O notation the scaling is equal.  The practical question is how
> large our constant factors are and whether they are a problem.  If the
> L3VPN results in enough entries on a machine then it is a scaling
> problem, otherwise not so much.
>
>>> For going from normal ip routing to mpls routing somewhere we need the
>>> destination ip prefix to mpls tunnel mapping. There are a couple of
>>> possible ways this could be solved.
>>> - One ingress network device per mpls tunnel.
>>> - One ingress network device with a configurable routing
>>>   prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
>>>   does something like this for ATM virtual circuits.
>>> - One ingress network device that looks at IP_ROUTE_CLASSID and
>>>   uses that to select the mpls labels to use.
>>> - Teach the IP network stack how to insert packets in tunnels without
>>>   needing a magic netdevice.
>>>
>>
>> I feel it should be along the lines of "teach the IP network stack how
>> to push labels".
>
> That phrasing sets off alarm bells in my mind of mpls-specific hacks in
> the kernel, which most likely will cause performance regressions and
> maintenance complications.
>
>> In general, MPLS LSPs can be setup as hop-by-hop
>> routed LSPs (when using a signaling protocol like LDP or BGP) as well
>> as tunnels that may take a different path than normal routing. I feel
>> it is good if the dataplane can support both models. In the former,
>> the IP network stack should push the labels which are just
>> encapsulation and then just transmit on the underlying netdevice that
>> corresponds to the neighbor interface. To achieve this, maybe it is
>> the neighbor (nexthop) that has to reference the mpls_route. In the
>> latter (LSPs are treated as tunnels and/or this is the only model
>> supported), the IP network stack would still need to impose any inner
>> labels (i.e., VPN or pseudowire, later on Entropy or Segment labels)
>> and then transmit over the tunnel netdevice which would impose the
>> tunnel label.
>
> Potentially.  This part of the discussion has reached the point where I
> need to see code to carry this part of the discussion any farther.
>
> Eric

I'm in full agreement too that there shouldn't be any mpls-specific
hacks in the kernel.

Thank you for the discussion. We shall take your patches and
brainstorm internally on what we'd add or change. As soon as we have
code to share, I'll come back to seek opinion and continue the
discussion.

Vivek


* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-05 10:14                   ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Steven Whitehouse
@ 2015-03-06 20:44                     ` Eric W. Biederman
  2015-03-14  0:33                       ` Steven Whitehouse
  0 siblings, 1 reply; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-06 20:44 UTC (permalink / raw)
  To: Steven Whitehouse; +Cc: David Miller, netdev, ralf, linux-hams

Steven Whitehouse <swhiteho@redhat.com> writes:

> Hi,


>> We can almost universally use the same procedures for generating
>> link layer headers from neighbour table entries now.  I had hoped
>> to optimize things by removing function pointers.
>>
>> The big hold out is DECnet that sets src_mac based on the DECnet source
>> address.
> That is a requirement of DECnet I'm afraid - DECnet does not have an exact
> equivalent of ARP/ndisc and many hosts will refuse to communicate if the MAC
> address is not the expected one based on the DECnet address. This is one bit of
> DECnet that is being used (to the best of my knowledge) and working.

Having a mac address that matches the DECnet address completely makes
sense.  Sourcing packets with a different mac address than the device's
default mac address makes sense.  This latter case doesn't usually happen
for IP packets but we force it with the macvlan driver for example.

>> Which leads me to the conclusion that since DECnet has a different
>> algorithm for setting the src_mac than everything else in the kernel
>> DECnet neighbour table entries can not be used for nexthops for other
>> protocols :(
>>
>> DECnet also abuses neigh->output to select by output device which kind
>> of DECnet header to put on the packets.  But that is easily fixable.
> One way to fix it would be to drop support for non-broadcast devices. We don't
> have an implementation of DDCMP currently. Ethernet is the only working DECnet
> device at the moment. PPP could also potentially work, with a bit of tweaking,
> but strangely PPP is a broadcast device so far as DECnet is concerned,

What I wound up doing was just a shuffle of which function is in
the neigh->output method, which meant there was no need to disable any
of the current code.

That should fix interactions between DECnet and drivers like sch_teql
and the netfilter bridge code which, after a packet has already been
output, turn around and do:

	dst = skb_dst(skb);
	neigh = n = dst_neigh_lookup_skb(dst, skb);
	if (dst->dev != dev)
		neigh = __neigh_lookup_errno(n->tbl, n->primary_key, dev);
	neigh->output(neigh, skb);

If the DECnet code hit one of those code paths today, I think it would
result in double DECnet headers.  :(

When looking at the DECnet code there is a flag that enables phase3
support but I don't see it set anywhere.  Should DECnet phase3
support actually work?

Eric


* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-05 19:52         ` Eric W. Biederman
  2015-03-06  6:05           ` Vivek Venkatraman
@ 2015-03-07 10:36           ` Robert Shearman
  2015-03-07 21:12             ` Eric W. Biederman
  1 sibling, 1 reply; 88+ messages in thread
From: Robert Shearman @ 2015-03-07 10:36 UTC (permalink / raw)
  To: Eric W. Biederman, Vivek Venkatraman
  Cc: David Miller, netdev, roopa, Stephen Hemminger, santiago

On 05/03/15 19:52, Eric W. Biederman wrote:
> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>
>> On Thu, Mar 5, 2015 at 6:00 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> For going from normal ip routing to mpls routing somewhere we need the
>>> destination ip prefix to mpls tunnel mapping. There are a couple of
>>> possible ways this could be solved.
>>> - One ingress network device per mpls tunnel.
>>> - One ingress network device with a configurable routing
>>>    prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
>>>    does something like this for ATM virtual circuits.
>>> - One ingress network device that looks at IP_ROUTE_CLASSID and
>>>    uses that to select the mpls labels to use.
>>> - Teach the IP network stack how to insert packets in tunnels without
>>>    needing a magic netdevice.
>>>
>>
>> I feel it should be along the lines of "teach the IP network stack how
>> to push labels".
>
> That phrasing sets off alarm bells in my mind of mpls-specific hacks in
> the kernel, which most likely will cause performance regressions and
> maintenance complications.

Other than the TTL and label-use issues already pointed out, it will 
also be tricky to perform UCMP & ECMP with a mix of labeled and 
unlabeled paths, unless the forwarding information that the routing 
protocols install in the imposition case is substantially different from 
the incoming-label case (in which case it will overly complicate the 
routing protocols).

There are also cases where it's highly desirable to use different 
subsets of available paths for incoming IP traffic, compared to incoming 
labeled traffic (eiBGP multipath) and this could be tricky to do without 
the IP stack doing the selection of the path to use.

There's also the issue of memory usage with route scale to be concerned 
with, with some of the solutions being better in this respect than 
others. Naturally, the "teach the IP network stack how to push labels"
will scale the best, especially if routing information were to be shared 
with the label table where possible.

>
>> In general, MPLS LSPs can be setup as hop-by-hop
>> routed LSPs (when using a signaling protocol like LDP or BGP) as well
>> as tunnels that may take a different path than normal routing. I feel
>> it is good if the dataplane can support both models. In the former,
>> the IP network stack should push the labels which are just
>> encapsulation and then just transmit on the underlying netdevice that
>> corresponds to the neighbor interface. To achieve this, maybe it is
>> the neighbor (nexthop) that has to reference the mpls_route. In the
>> latter (LSPs are treated as tunnels and/or this is the only model
>> supported), the IP network stack would still need to impose any inner
>> labels (i.e., VPN or pseudowire, later on Entropy or Segment labels)
>> and then transmit over the tunnel netdevice which would impose the
>> tunnel label.
>
> Potentially.  This part of the discussion has reached the point where I
> need to see code to carry this part of the discussion any farther.

Another discussion point is whether using collapsed label stacks for
VPN prefixes will work adequately under scale when faced with IGP 
reconvergence events. The alternative would be to allow the control 
plane to install "push-and-lookup" type forwarding entries, essentially 
behaving as a recursive MPLS route in a similar way to what was proposed 
in the ipmpls tunnel - this would separate the VPN routing entries from 
the IGP ones, meaning that the forwarding information for the latter can 
change independently from the former. This can be done without further 
changes to the netlink protocol, so isn't a big priority right now.

Thanks,
Rob

>
> Eric


* Re: [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel
  2015-03-07 10:36           ` Robert Shearman
@ 2015-03-07 21:12             ` Eric W. Biederman
  0 siblings, 0 replies; 88+ messages in thread
From: Eric W. Biederman @ 2015-03-07 21:12 UTC (permalink / raw)
  To: Robert Shearman
  Cc: Vivek Venkatraman, David Miller, netdev, roopa,
	Stephen Hemminger, santiago

Robert Shearman <rshearma@brocade.com> writes:

> On 05/03/15 19:52, Eric W. Biederman wrote:
>> Vivek Venkatraman <vivek@cumulusnetworks.com> writes:
>>
>>> On Thu, Mar 5, 2015 at 6:00 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> For going from normal ip routing to mpls routing somewhere we need the
>>>> destination ip prefix to mpls tunnel mapping. There are a couple of
>>>> possible ways this could be solved.
>>>> - One ingress network device per mpls tunnel.
>>>> - One ingress network device with a configurable routing
>>>>    prefix to mpls mapping.  Possibly loaded on the fly.  net/atm/clip.c
>>>>    does something like this for ATM virtual circuits.
>>>> - One ingress network device that looks at IP_ROUTE_CLASSID and
>>>>    use that to select the mpls labels to use.
>>>> - Teach the IP network stack how to insert packets in tunnels without
>>>>    needing a magic netdevice.
>>>>
>>>
>>> I feel it should be along the lines of "teach the IP network stack how
>>> to push labels".
>>
>> That phrasing sets off alarm bells in my mind of mpls specific hacks in
>> the kernel, which most likely will cause performance regression and
>> maintenance complications.
>
> Other than the TTL and label-use issues already pointed out, it will also be
> tricky to perform UCMP & ECMP with a mix of labeled and unlabeled paths, unless
> the forwarding information that the routing protocols install in the imposition
> case is substantially different from the incoming-label case (in which case it
> will overly complicate the routing protocols).

Six of one, half a dozen of the other.  But I agree keeping track of
labels that are only used to forward IP traffic is likely an unnecessary
complication.

> There are also cases where it's highly desirable to use different subsets of
> available paths for incoming IP traffic, compared to incoming labeled traffic
> (eiBGP multipath) and this could be tricky to do without the IP stack doing the
> selection of the path to use.

We definitely want to use the standard routing table to do routing.

> There's also the issue of memory usage with route scale to be concerned with,
> with some of the solutions being better in this respect than others. Naturally,
> the "teach the IP network stack how to push labels" will scale the best,
> especially if routing information were to be shared with the label table where
> possible.
>
>>
>>> In general, MPLS LSPs can be setup as hop-by-hop
>>> routed LSPs (when using a signaling protocol like LDP or BGP) as well
>>> as tunnels that may take a different path than normal routing. I feel
>>> it is good if the dataplane can support both models. In the former,
>>> the IP network stack should push the labels which are just
>>> encapsulation and then just transmit on the underlying netdevice that
>>> corresponds to the neighbor interface. To achieve this, maybe it is
>>> the neighbor (nexthop) that has to reference the mpls_route. In the
>>> latter (LSPs are treated as tunnels and/or this is the only model
>>> supported), the IP network stack would still need to impose any inner
>>> labels (i.e., VPN or pseudowire, later on Entropy or Segment labels)
>>> and then transmit over the tunnel netdevice which would impose the
>>> tunnel label.
>>
>> Potentially.  This part of the discussion has reached the point where I
>> need to see code to carry this part of the discussion any farther.
>
> Another discussion point is whether using collapsed label stacks for VPN
> prefixes will work adequately under scale when faced with IGP reconvergence
> events. The alternative would be to allow the control plane to install
> "push-and-lookup" type forwarding entries, essentially behaving as a recursive
> MPLS route in a similar way to what was proposed in the ipmpls tunnel - this
> would separate the VPN routing entries from the IGP ones, meaning that the
> forwarding information for the latter can change independently from the
> former. This can be done without further changes to the netlink protocol, so
> isn't a big priority right now.

You can currently make multiple trips through the MPLS forwarding
stack, you just need to set your output interface to "lo".   If that
case becomes heavily used we may want to optimize it, but the code
as implemented should 

Eric
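
To make the "multiple trips" idea concrete, here is a toy model in plain C.
It is only a sketch under stated assumptions, not the kernel code: the
struct toy_route table and the toy_forward() helper are hypothetical names,
label handling is reduced to a swap, and the trip through "lo" is modelled
as a bounded loop rather than a real transmit and receive on the loopback
device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy route: swap in_label for out_label, send out dev. */
struct toy_route {
	unsigned in_label;
	unsigned out_label;
	const char *dev;
};

/*
 * Look up *label, swap it, and repeat the lookup whenever the route
 * points at "lo".  Returns the final output device name, or NULL if no
 * route matches or max_passes trips were not enough (a crude guard
 * against routing loops in this model).
 */
static const char *toy_forward(const struct toy_route *tbl, size_t n,
			       unsigned *label, int max_passes)
{
	assert(tbl != NULL && label != NULL);

	for (int pass = 0; pass < max_passes; pass++) {
		const struct toy_route *rt = NULL;

		for (size_t i = 0; i < n; i++) {
			if (tbl[i].in_label == *label) {
				rt = &tbl[i];
				break;
			}
		}
		if (rt == NULL)
			return NULL;		/* no route for this label */

		*label = rt->out_label;		/* swap the label */
		if (strcmp(rt->dev, "lo") != 0)
			return rt->dev;		/* leaves the box here */
		/* dev is "lo": the packet comes back in, look it up again */
	}
	return NULL;
}
```

With a table holding 100 -> 200 via "lo" and 200 -> 300 via "eth0", a packet
arriving with label 100 takes two passes and leaves on eth0 carrying label
300.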


* Re: [PATCH net-next] ax25: Stop using magic neighbour cache operations.
  2015-03-06 20:44                     ` Eric W. Biederman
@ 2015-03-14  0:33                       ` Steven Whitehouse
  0 siblings, 0 replies; 88+ messages in thread
From: Steven Whitehouse @ 2015-03-14  0:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev, ralf, linux-hams

Hi,

On 06/03/15 20:44, Eric W. Biederman wrote:
> Steven Whitehouse <swhiteho@redhat.com> writes:
>
>> Hi,
>
>>> We can almost universally use the same procedures for generating
>>> link layer headers from neighbour table entries now.  I had hoped
>>> to optimize things by removing function pointers.
>>>
>>> The big hold out is DECnet that sets src_mac based on the DECnet source
>>> address.
>> That is a requirement of DECnet I'm afraid - DECnet does not have an exact
>> equivalent of ARP/ndisc and many hosts will refuse to communicate if the MAC
>> address is not the expected one based on the DECnet address. This is one bit of
>> DECnet that is being used (to the best of my knowledge) and working.
> Having a mac address that matches the DECnet address completely makes
> sense.  Sourcing packets with a different mac address than the devices
> default mac address makes sense.  This latter case doesn't usually happen
> for IP packets but we force it with the macvlan driver for example.
The history is that when DECnet was being written, it was not generally 
reliable to have a different MAC to that of the ethernet card - it 
worked for some cards, but by no means all of them. So it is something 
that would be done slightly differently with modern cards.
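
For anyone not familiar with the mapping being discussed, a minimal sketch
in C follows. It assumes the usual Phase IV convention: a 16-bit address
with a 6-bit area above a 10-bit node number, and a MAC made of the fixed
prefix AA:00:04:00 followed by that address, low byte first. The helper
names dn_make_addr() and dn_addr_to_mac() are made up for this example and
are not the kernel's own functions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack area.node into a 16-bit Phase IV address (host byte order). */
static uint16_t dn_make_addr(unsigned area, unsigned node)
{
	assert(area < 64 && node < 1024);
	return (uint16_t)((area << 10) | node);
}

/* Fill mac[] with AA:00:04:00 plus the address, low byte first. */
static void dn_addr_to_mac(uint16_t addr, uint8_t mac[6])
{
	mac[0] = 0xAA;
	mac[1] = 0x00;
	mac[2] = 0x04;
	mac[3] = 0x00;
	mac[4] = (uint8_t)(addr & 0xff);	/* low byte */
	mac[5] = (uint8_t)(addr >> 8);		/* high byte */
}
```

Node 1.1 is address 1025 (0x0401), so its expected MAC is
AA:00:04:00:01:04, whatever address the card shipped with.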

>>> Which leads me to the conclusion that since DECnet has a different
>>> algorithm for setting the src_mac than everything else in the kernel
>>> DECnet neighbour table entries can not be used for nexthops for other
>>> protocols :(
>>>
>>> DECnet also abuses neigh->output to select by output device which kind
>>> of DECnet header to put on the packets.  But that is easily fixable.
>> One way to fix it would be to drop support for non-broadcast devices. We don't
>> have an implementation of DDCMP currently. Ethernet is the only working DECnet
>> device at the moment. PPP could also potentially work, with a bit of tweaking,
>> but strangely PPP is a broadcast device so far as DECnet is concerned,
> What I wound up doing was just doing a shuffle of which function was in
> the neigh->output method.  Which meant there was no need to disable any
> of the current code.
>
> That should fix interactions between DECnet and drivers like sch_teql
> and the netfilter bridge code which after a packet has already been
> output turn around and do:
>
> 	dst = skb_dst(skb);
> 	neigh = n = dst_neigh_lookup_skb(dst, skb);
> 	if (dst->dev != dev)
> 		neigh = __neigh_lookup_errno(n->tbl, n->primary_key, dev);
> 	neigh->output(neigh, skb);
>
> Which I think if the DECnet code hit one of those code paths today would
> result in double DECnet headers.  :(
>
> When looking at the DECnet code there is a flag that enables phase3
> support but I don't see it set anywhere.  Should DECnet phase3
> support actually work?
>
> Eric

Ok. That sounds reasonable to me. The phase 3 stuff is another stub 
really... most of what is needed is there, but not all of it. Phase 3 is 
basically the same as Phase IV but without areas and broadcast devices, 
so in order to make Phase 3 devices reachable on a Phase IV network, the 
Phase IV router has to add an area and forward the packet on, and remove 
the area when sending to the Phase 3 node. However since we don't have 
non-broadcast DECnet devices, there is no point in having Phase3 support 
really. At least it is one area on which the spec is quite useful 
though, as this is fairly well described.
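
As a rough illustration of the add/remove-area step, here is a sketch in C.
It assumes the Phase IV layout of a 6-bit area above a 10-bit node number
and works on host-order values (on the wire the address is little-endian).
The names dn_add_area() and dn_strip_area() are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DN_NODE_MASK	0x03ffu	/* low 10 bits: node number */
#define DN_AREA_SHIFT	10	/* area sits above the node bits */

/* Phase 3 node -> Phase IV address: the router supplies the area. */
static uint16_t dn_add_area(uint16_t node, unsigned area)
{
	assert(area < 64);
	return (uint16_t)((area << DN_AREA_SHIFT) | (node & DN_NODE_MASK));
}

/* Phase IV address -> Phase 3 node: drop the area bits again. */
static uint16_t dn_strip_area(uint16_t addr)
{
	return (uint16_t)(addr & DN_NODE_MASK);
}
```

So node 5 behind a router in area 2 is seen as 2.5 (0x0805) on the Phase IV
side, and turns back into plain node 5 when the packet is sent to it.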

Steve.


Thread overview: 88+ messages
2015-02-25 17:09 [PATCH net-next 0/8] Basic MPLS support Eric W. Biederman
2015-02-25 17:13 ` [PATCH net-next 1/8] mpls: Refactor how the mpls module is built Eric W. Biederman
2015-02-26  2:05   ` Simon Horman
2015-02-26  2:15     ` Eric W. Biederman
2015-02-26  2:28       ` Simon Horman
2015-02-25 17:14 ` [PATCH net-next 2/8] mpls: Basic routing support Eric W. Biederman
2015-02-25 17:15 ` [PATCH net-next 3/8] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
2015-02-25 17:16 ` [PATCH net-next 4/8] mpls: Basic support for adding and removing routes Eric W. Biederman
2015-02-25 17:16 ` [PATCH net-next 5/8] mpls: Functions for reading and wrinting mpls labels over netlink Eric W. Biederman
2015-02-25 17:17 ` [PATCH net-next 6/8] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
2015-02-25 17:18 ` [PATCH net-next 8/8] ipmpls: Basic device for injecting packets into an mpls tunnel Eric W. Biederman
2015-03-05  9:17   ` Vivek Venkatraman
2015-03-05 14:00     ` Eric W. Biederman
2015-03-05 16:25       ` Vivek Venkatraman
2015-03-05 19:52         ` Eric W. Biederman
2015-03-06  6:05           ` Vivek Venkatraman
2015-03-07 10:36           ` Robert Shearman
2015-03-07 21:12             ` Eric W. Biederman
2015-02-25 17:19 ` [PATCH net-next 7/8] mpls: Multicast route table change notifications Eric W. Biederman
2015-02-26  7:21   ` roopa
2015-02-26 14:03     ` Eric W. Biederman
2015-02-26 15:12       ` roopa
2015-03-05  1:56         ` Andy Gospodarek
2015-02-25 17:37 ` [PATCH iproute2] mpls: Add basic mpls support to iproute Eric W. Biederman
2015-02-26  6:58 ` [PATCH net-next 0/8] Basic MPLS support roopa
2015-02-27 21:21 ` David Miller
2015-02-28  0:58   ` Eric W. Biederman
2015-03-02  0:05     ` Shrijeet Mukherjee
2015-03-02  4:03     ` David Miller
2015-03-02  5:10       ` Eric W. Biederman
2015-03-02  5:53         ` David Miller
2015-03-02  5:59         ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups Eric W. Biederman
2015-03-02  5:59           ` [PATCH net-next 01/15] ax25: In ax25_rebuild_header add missing kfree_skb Eric W. Biederman
2015-03-02  6:01           ` [PATCH net-next 02/15] rose: Set the destination address in rose_header Eric W. Biederman
2015-03-02  6:02           ` [PATCH net-next 03/15] rose: Transmit packets in rose_xmit not rose_rebuild_header Eric W. Biederman
2015-03-02  6:03           ` [PATCH net-next 04/15] ax25/kiss: Replace ax_header_ops with ax25_header_ops Eric W. Biederman
2015-03-02  6:03           ` [PATCH net-next 05/15] ax25/6pack: Replace sp_header_ops " Eric W. Biederman
2015-03-02  6:04           ` [PATCH net-next 06/15] ax25: Make ax25_header and ax25_rebuild_header static Eric W. Biederman
2015-03-02  6:05           ` [PATCH net-next 07/15] ax25: Refactor to use private neighbour operations Eric W. Biederman
2015-03-02  6:06           ` [PATCH net-next 08/15] arp: Remove special case to give AX25 it's open arp operations Eric W. Biederman
2015-03-02  6:07           ` [PATCH net-next 09/15] neigh: Move neigh_compat_output into ax25_ip.c Eric W. Biederman
2015-03-02  6:08           ` [PATCH net-next 10/15] ax25: Stop calling/abusing dev_rebuild_header Eric W. Biederman
2015-03-02  6:09           ` [PATCH net-next 11/15] ax25: Stop depending on arp_find Eric W. Biederman
2015-03-02  6:11           ` [PATCH net-next 12/15] net: Kill dev_rebuild_header Eric W. Biederman
2015-03-02  6:12           ` [PATCH net-next 13/15] arp: Kill arp_find Eric W. Biederman
2015-03-02  6:13           ` [PATCH net-next 14/15] neigh: Don't require dst in neigh_hh_init Eric W. Biederman
2015-03-02  6:14           ` [PATCH net-next 15/15] neigh: Don't require a dst in neigh_resolve_output Eric W. Biederman
2015-03-02 21:44           ` [PATCH net-next 0/15] Neighbour table and ax25 cleanups David Miller
2015-03-03 15:41             ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Eric W. Biederman
2015-03-03 19:45               ` David Miller
2015-03-03 20:22                 ` Eric W. Biederman
2015-03-03 20:33                   ` David Miller
2015-03-03 23:09                     ` [PATCH net-next 0/2] Neighbour table prep for MPLS Eric W. Biederman
2015-03-03 23:10                       ` [PATCH net-next 1/2] neigh: Factor out ___neigh_lookup_noref Eric W. Biederman
2015-03-04 14:53                         ` Andy Gospodarek
2015-03-04 15:58                           ` Eric W. Biederman
2015-03-04 16:30                             ` Andy Gospodarek
2015-03-03 23:11                       ` [PATCH net-next 2/2] neigh: Add helper function neigh_xmit Eric W. Biederman
2015-03-04  1:06                       ` [PATCH net-next 0/7] Basic MPLS support take 2 Eric W. Biederman
2015-03-04  1:10                         ` [PATCH net-next 1/7] mpls: Refactor how the mpls module is built Eric W. Biederman
2015-03-04  1:10                         ` [PATCH net-next 2/7] mpls: Basic routing support Eric W. Biederman
2015-03-05 16:36                           ` Vivek Venkatraman
2015-03-05 18:42                             ` Eric W. Biederman
2015-03-04  1:11                         ` [PATCH net-next 3/7] mpls: Add a sysctl to control the size of the mpls label table Eric W. Biederman
2015-03-05  9:45                           ` Vivek Venkatraman
2015-03-05 13:22                             ` Eric W. Biederman
2015-03-05 14:38                               ` Eric W. Biederman
2015-03-05 16:49                                 ` Vivek Venkatraman
2015-03-04  1:12                         ` [PATCH net-next 4/7] mpls: Basic support for adding and removing routes Eric W. Biederman
2015-03-04  8:13                           ` roopa
2015-03-04 20:36                             ` Eric W. Biederman
2015-03-05  0:30                               ` roopa
2015-03-05  2:50                               ` Bill Fink
2015-03-05 11:54                                 ` Eric W. Biederman
2015-03-05 19:10                                   ` Bill Fink
2015-03-04  1:13                         ` [PATCH net-next 5/7] mpls: Functions for reading and wrinting mpls labels over netlink Eric W. Biederman
2015-03-04  1:13                         ` [PATCH net-next 6/7] mpls: Netlink commands to add, remove, and dump routes Eric W. Biederman
2015-03-04  1:14                         ` [PATCH net-next 7/7] mpls: Multicast route table change notifications Eric W. Biederman
2015-03-04  5:27                         ` [PATCH net-next 0/7] Basic MPLS support take 2 David Miller
2015-03-04  6:13                           ` Eric W. Biederman
2015-03-04  5:25                       ` [PATCH net-next 0/2] Neighbour table prep for MPLS David Miller
2015-03-04  5:53                         ` Eric W. Biederman
2015-03-04 14:56                           ` Andy Gospodarek
2015-03-04 21:04                           ` David Miller
2015-03-05 12:35                             ` Eric W. Biederman
2015-03-05 10:14                   ` [PATCH net-next] ax25: Stop using magic neighbour cache operations Steven Whitehouse
2015-03-06 20:44                     ` Eric W. Biederman
2015-03-14  0:33                       ` Steven Whitehouse
