All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
@ 2016-09-12 18:01 Mahesh Bandewar
  2016-09-14  3:24 ` David Ahern
  0 siblings, 1 reply; 5+ messages in thread
From: Mahesh Bandewar @ 2016-09-12 18:01 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

In a typical IPvlan L3 setup where master is in default-ns and
each slave is into different (slave) ns. In this setup egress
packet processing for traffic originating from slave-ns will
hit all NF_HOOKs in slave-ns as well as default-ns. However same
is not true for ingress processing. All these NF_HOOKs are
hit only in the slave-ns skipping them in the default-ns.
IPvlan in L3 mode is restrictive and if admins want to deploy
iptables rules in default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() (added as part of l3mdev
enhancements) to perform input route lookup on RX packets without
changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing over skb to L4.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig                 |  1 +
 drivers/net/ipvlan/ipvlan.h         |  7 +++
 drivers/net/ipvlan/ipvlan_core.c    | 94 +++++++++++++++++++++++++++++++++++++
 drivers/net/ipvlan/ipvlan_main.c    | 60 ++++++++++++++++++++---
 include/uapi/linux/if_link.h        |  1 +
 6 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..24196cef7c91 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
 	There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-	ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | L3 }
+	ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | l3 | l3s }
 
 	e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+	This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly less
+performance but that shouldn't matter since you are choosing this mode over plain-L3
+mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
 	These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..8768a625350d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
     tristate "IP-VLAN support"
     depends on INET
     depends on IPV6
+    depends on NET_L3_MASTER_DEV
     ---help---
       This allows one to create virtual devices off of a main interface
       and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..68b270b59ba9 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <linux/inetdevice.h>
+#include <linux/netfilter.h>
 #include <net/ip.h>
 #include <net/ip6_route.h>
 #include <net/rtnetlink.h>
 #include <net/route.h>
 #include <net/addrconf.h>
+#include <net/l3mdev.h>
 
 #define IPVLAN_DRV	"ipvlan"
 #define IPV_DRV_VER	"0.1"
@@ -96,6 +98,7 @@ struct ipvl_port {
 	struct work_struct	wq;
 	struct sk_buff_head	backlog;
 	int			count;
+	bool			ipt_hook_added;
 	struct rcu_head		rcu;
 };
 
@@ -124,4 +127,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan,
 				   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+			      u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+			     const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	case IPVLAN_MODE_L2:
 		return ipvlan_xmit_mode_l2(skb, dev);
 	case IPVLAN_MODE_L3:
+	case IPVLAN_MODE_L3S:
 		return ipvlan_xmit_mode_l3(skb, dev);
 	}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
 		return ipvlan_handle_mode_l2(pskb, port);
 	case IPVLAN_MODE_L3:
 		return ipvlan_handle_mode_l3(pskb, port);
+	case IPVLAN_MODE_L3S:
+		return RX_HANDLER_PASS;
 	}
 
 	/* Should not reach here */
@@ -672,3 +675,94 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
 	kfree_skb(skb);
 	return RX_HANDLER_CONSUMED;
 }
+
+static struct ipvl_addr *ipvlan_skb_to_addr(struct sk_buff *skb,
+					    struct net_device *dev)
+{
+	struct ipvl_addr *addr = NULL;
+	struct ipvl_port *port;
+	void *lyr3h;
+	int addr_type;
+
+	if (!dev || !netif_is_ipvlan_port(dev))
+		goto out;
+
+	port = ipvlan_port_get_rcu(dev);
+	if (!port || port->mode != IPVLAN_MODE_L3S)
+		goto out;
+
+	lyr3h = ipvlan_get_L3_hdr(skb, &addr_type);
+	if (!lyr3h)
+		goto out;
+
+	addr = ipvlan_addr_lookup(port, lyr3h, addr_type, true);
+out:
+	return addr;
+}
+
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+			      u16 proto)
+{
+	struct ipvl_addr *addr;
+	struct net_device *sdev;
+
+	addr = ipvlan_skb_to_addr(skb, dev);
+	if (!addr)
+		goto out;
+
+	sdev = addr->master->dev;
+	switch (proto) {
+	case AF_INET:
+	{
+		int err;
+		struct iphdr *ip4h = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
+					   ip4h->tos, sdev);
+		if (unlikely(err))
+			goto out;
+		break;
+	}
+	case AF_INET6:
+	{
+		struct dst_entry *dst;
+		struct ipv6hdr *ip6h = ipv6_hdr(skb);
+		int flags = RT6_LOOKUP_F_HAS_SADDR;
+		struct flowi6 fl6 = {
+			.flowi6_iif   = sdev->ifindex,
+			.daddr        = ip6h->daddr,
+			.saddr        = ip6h->saddr,
+			.flowlabel    = ip6_flowinfo(ip6h),
+			.flowi6_mark  = skb->mark,
+			.flowi6_proto = ip6h->nexthdr,
+		};
+
+		skb_dst_drop(skb);
+		dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
+		skb_dst_set(skb, dst);
+		break;
+	}
+	default:
+		break;
+	}
+
+out:
+	return skb;
+}
+
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+			     const struct nf_hook_state *state)
+{
+	struct ipvl_addr *addr;
+	unsigned int len;
+
+	addr = ipvlan_skb_to_addr(skb, skb->dev);
+	if (!addr)
+		goto out;
+
+	skb->dev = addr->master->dev;
+	len = skb->len + ETH_HLEN;
+	ipvlan_count_rx(addr->master, len, true, false);
+out:
+	return NF_ACCEPT;
+}
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 18b4e8c7f68a..d02be277e1db 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -9,24 +9,65 @@
 
 #include "ipvlan.h"
 
+static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
+	{
+		.hook     = ipvlan_nf_input,
+		.pf       = NFPROTO_IPV4,
+		.hooknum  = NF_INET_LOCAL_IN,
+		.priority = INT_MAX,
+	},
+	{
+		.hook     = ipvlan_nf_input,
+		.pf       = NFPROTO_IPV6,
+		.hooknum  = NF_INET_LOCAL_IN,
+		.priority = INT_MAX,
+	},
+};
+
+static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
+	.l3mdev_l3_rcv = ipvlan_l3_rcv,
+};
+
 static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
 {
 	ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
 }
 
-static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
+static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
 {
 	struct ipvl_dev *ipvlan;
+	int err = 0;
 
+	ASSERT_RTNL();
 	if (port->mode != nval) {
+		if (nval == IPVLAN_MODE_L3S) {
+			port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
+			port->dev->priv_flags |= IFF_L3MDEV_MASTER;
+			if (!port->ipt_hook_added) {
+				err = _nf_register_hooks(ipvl_nfops,
+							ARRAY_SIZE(ipvl_nfops));
+				if (!err)
+					port->ipt_hook_added = true;
+				else
+					return err;
+			}
+		} else {
+			port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
+			port->dev->l3mdev_ops = NULL;
+			if (port->ipt_hook_added)
+				_nf_unregister_hooks(ipvl_nfops,
+						     ARRAY_SIZE(ipvl_nfops));
+			port->ipt_hook_added = false;
+		}
 		list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
-			if (nval == IPVLAN_MODE_L3)
+			if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S)
 				ipvlan->dev->flags |= IFF_NOARP;
 			else
 				ipvlan->dev->flags &= ~IFF_NOARP;
 		}
 		port->mode = nval;
 	}
+	return err;
 }
 
 static int ipvlan_port_create(struct net_device *dev)
@@ -132,7 +173,8 @@ static int ipvlan_open(struct net_device *dev)
 	struct net_device *phy_dev = ipvlan->phy_dev;
 	struct ipvl_addr *addr;
 
-	if (ipvlan->port->mode == IPVLAN_MODE_L3)
+	if (ipvlan->port->mode == IPVLAN_MODE_L3 ||
+	    ipvlan->port->mode == IPVLAN_MODE_L3S)
 		dev->flags |= IFF_NOARP;
 	else
 		dev->flags &= ~IFF_NOARP;
@@ -372,13 +414,14 @@ static int ipvlan_nl_changelink(struct net_device *dev,
 {
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 	struct ipvl_port *port = ipvlan_port_get_rtnl(ipvlan->phy_dev);
+	int err = 0;
 
 	if (data && data[IFLA_IPVLAN_MODE]) {
 		u16 nmode = nla_get_u16(data[IFLA_IPVLAN_MODE]);
 
-		ipvlan_set_port_mode(port, nmode);
+		err = ipvlan_set_port_mode(port, nmode);
 	}
-	return 0;
+	return err;
 }
 
 static size_t ipvlan_nl_getsize(const struct net_device *dev)
@@ -473,10 +516,13 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
 		unregister_netdevice(dev);
 		return err;
 	}
+	err = ipvlan_set_port_mode(port, mode);
+	if (err) {
+		unregister_netdevice(dev);
+		return err;
+	}
 
 	list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
-	ipvlan_set_port_mode(port, mode);
-
 	netif_stacked_transfer_operstate(phy_dev, dev);
 	return 0;
 }
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9bf3aecfe05b..a615583bab09 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -464,6 +464,7 @@ enum {
 enum ipvlan_mode {
 	IPVLAN_MODE_L2 = 0,
 	IPVLAN_MODE_L3,
+	IPVLAN_MODE_L3S,
 	IPVLAN_MODE_MAX
 };
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
  2016-09-12 18:01 [PATCHv2 next 3/3] ipvlan: Introduce l3s mode Mahesh Bandewar
@ 2016-09-14  3:24 ` David Ahern
  2016-09-14 16:30   ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 1 reply; 5+ messages in thread
From: David Ahern @ 2016-09-14  3:24 UTC (permalink / raw)
  To: Mahesh Bandewar, netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar

On 9/12/16 12:01 PM, Mahesh Bandewar wrote:

> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
> +			      u16 proto)
> +{
> +	struct ipvl_addr *addr;
> +	struct net_device *sdev;
> +
> +	addr = ipvlan_skb_to_addr(skb, dev);
> +	if (!addr)
> +		goto out;
> +
> +	sdev = addr->master->dev;
> +	switch (proto) {
> +	case AF_INET:
> +	{
> +		int err;
> +		struct iphdr *ip4h = ip_hdr(skb);
> +
> +		err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
> +					   ip4h->tos, sdev);
> +		if (unlikely(err))
> +			goto out;
> +		break;
> +	}
> +	case AF_INET6:
> +	{
> +		struct dst_entry *dst;
> +		struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +		int flags = RT6_LOOKUP_F_HAS_SADDR;
> +		struct flowi6 fl6 = {
> +			.flowi6_iif   = sdev->ifindex,
> +			.daddr        = ip6h->daddr,
> +			.saddr        = ip6h->saddr,
> +			.flowlabel    = ip6_flowinfo(ip6h),
> +			.flowi6_mark  = skb->mark,
> +			.flowi6_proto = ip6h->nexthdr,
> +		};
> +
> +		skb_dst_drop(skb);
> +		dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
> +		skb_dst_set(skb, dst);
> +		break;
> +	}
> +	default:
> +		break;
> +	}

Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?


> +
> +out:
> +	return skb;
> +}
> +
> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
> +			     const struct nf_hook_state *state)
> +{
> +	struct ipvl_addr *addr;
> +	unsigned int len;
> +
> +	addr = ipvlan_skb_to_addr(skb, skb->dev);
> +	if (!addr)
> +		goto out;
> +
> +	skb->dev = addr->master->dev;
> +	len = skb->len + ETH_HLEN;
> +	ipvlan_count_rx(addr->master, len, true, false);
> +out:
> +	return NF_ACCEPT;
> +}
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 18b4e8c7f68a..d02be277e1db 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -9,24 +9,65 @@
>  
>  #include "ipvlan.h"
>  
> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
> +	{
> +		.hook     = ipvlan_nf_input,
> +		.pf       = NFPROTO_IPV4,
> +		.hooknum  = NF_INET_LOCAL_IN,
> +		.priority = INT_MAX,
> +	},
> +	{
> +		.hook     = ipvlan_nf_input,
> +		.pf       = NFPROTO_IPV6,
> +		.hooknum  = NF_INET_LOCAL_IN,
> +		.priority = INT_MAX,
> +	},
> +};
> +
> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
> +	.l3mdev_l3_rcv = ipvlan_l3_rcv,
> +};
> +
>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>  {
>  	ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>  }
>  
> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>  {
>  	struct ipvl_dev *ipvlan;
> +	int err = 0;
>  
> +	ASSERT_RTNL();
>  	if (port->mode != nval) {
> +		if (nval == IPVLAN_MODE_L3S) {
> +			port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
> +			port->dev->priv_flags |= IFF_L3MDEV_MASTER;
> +			if (!port->ipt_hook_added) {
> +				err = _nf_register_hooks(ipvl_nfops,
> +							ARRAY_SIZE(ipvl_nfops));

That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.

 
> +				if (!err)
> +					port->ipt_hook_added = true;
> +				else
> +					return err;
> +			}
> +		} else {
> +			port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
> +			port->dev->l3mdev_ops = NULL;
> +			if (port->ipt_hook_added)
> +				_nf_unregister_hooks(ipvl_nfops,
> +						     ARRAY_SIZE(ipvl_nfops));
> +			port->ipt_hook_added = false;
> +		}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
  2016-09-14  3:24 ` David Ahern
@ 2016-09-14 16:30   ` Mahesh Bandewar (महेश बंडेवार)
  2016-09-14 18:04     ` David Ahern
  0 siblings, 1 reply; 5+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-09-14 16:30 UTC (permalink / raw)
  To: David Ahern; +Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller

Hi David, thanks for the comments.

On Tue, Sep 13, 2016 at 8:24 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 9/12/16 12:01 PM, Mahesh Bandewar wrote:
>
>> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
>> +                           u16 proto)
>> +{
>> +     struct ipvl_addr *addr;
>> +     struct net_device *sdev;
>> +
>> +     addr = ipvlan_skb_to_addr(skb, dev);
>> +     if (!addr)
>> +             goto out;
>> +
>> +     sdev = addr->master->dev;
>> +     switch (proto) {
>> +     case AF_INET:
>> +     {
>> +             int err;
>> +             struct iphdr *ip4h = ip_hdr(skb);
>> +
>> +             err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
>> +                                        ip4h->tos, sdev);
>> +             if (unlikely(err))
>> +                     goto out;
>> +             break;
>> +     }
>> +     case AF_INET6:
>> +     {
>> +             struct dst_entry *dst;
>> +             struct ipv6hdr *ip6h = ipv6_hdr(skb);
>> +             int flags = RT6_LOOKUP_F_HAS_SADDR;
>> +             struct flowi6 fl6 = {
>> +                     .flowi6_iif   = sdev->ifindex,
>> +                     .daddr        = ip6h->daddr,
>> +                     .saddr        = ip6h->saddr,
>> +                     .flowlabel    = ip6_flowinfo(ip6h),
>> +                     .flowi6_mark  = skb->mark,
>> +                     .flowi6_proto = ip6h->nexthdr,
>> +             };
>> +
>> +             skb_dst_drop(skb);
>> +             dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
>> +             skb_dst_set(skb, dst);
>> +             break;
>> +     }
>> +     default:
>> +             break;
>> +     }
>
> Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?
>
I can but it's small enough to have it together. Also 'proto' is a
parameter to the handler and makes it easier However do you see any
issue having just one function?

>
>> +
>> +out:
>> +     return skb;
>> +}
>> +
>> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
>> +                          const struct nf_hook_state *state)
>> +{
>> +     struct ipvl_addr *addr;
>> +     unsigned int len;
>> +
>> +     addr = ipvlan_skb_to_addr(skb, skb->dev);
>> +     if (!addr)
>> +             goto out;
>> +
>> +     skb->dev = addr->master->dev;
>> +     len = skb->len + ETH_HLEN;
>> +     ipvlan_count_rx(addr->master, len, true, false);
>> +out:
>> +     return NF_ACCEPT;
>> +}
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>> index 18b4e8c7f68a..d02be277e1db 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -9,24 +9,65 @@
>>
>>  #include "ipvlan.h"
>>
>> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
>> +     {
>> +             .hook     = ipvlan_nf_input,
>> +             .pf       = NFPROTO_IPV4,
>> +             .hooknum  = NF_INET_LOCAL_IN,
>> +             .priority = INT_MAX,
>> +     },
>> +     {
>> +             .hook     = ipvlan_nf_input,
>> +             .pf       = NFPROTO_IPV6,
>> +             .hooknum  = NF_INET_LOCAL_IN,
>> +             .priority = INT_MAX,
>> +     },
>> +};
>> +
>> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
>> +     .l3mdev_l3_rcv = ipvlan_l3_rcv,
>> +};
>> +
>>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>>  {
>>       ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>>  }
>>
>> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>  {
>>       struct ipvl_dev *ipvlan;
>> +     int err = 0;
>>
>> +     ASSERT_RTNL();
>>       if (port->mode != nval) {
>> +             if (nval == IPVLAN_MODE_L3S) {
>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>> +                     if (!port->ipt_hook_added) {
>> +                             err = _nf_register_hooks(ipvl_nfops,
>> +                                                     ARRAY_SIZE(ipvl_nfops));
>
> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>
Do you mean per slave device?  It's registering it for a port (so only
once) depending on the mode. If the mode != l3s it wont bother
registering to keep current modes as they are.
I'm sure dst idea could work as well (as you have suggested) but
l3mdev + ipt-hook is simpler and probably far less intrusive.

>
>> +                             if (!err)
>> +                                     port->ipt_hook_added = true;
>> +                             else
>> +                                     return err;
>> +                     }
>> +             } else {
>> +                     port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
>> +                     port->dev->l3mdev_ops = NULL;
>> +                     if (port->ipt_hook_added)
>> +                             _nf_unregister_hooks(ipvl_nfops,
>> +                                                  ARRAY_SIZE(ipvl_nfops));
>> +                     port->ipt_hook_added = false;
>> +             }
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
  2016-09-14 16:30   ` Mahesh Bandewar (महेश बंडेवार)
@ 2016-09-14 18:04     ` David Ahern
  2016-09-14 18:47       ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 1 reply; 5+ messages in thread
From: David Ahern @ 2016-09-14 18:04 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश
	बंडेवार)
  Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller

On 9/14/16 10:30 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> Hi David, thanks for the comments.
> 
> On Tue, Sep 13, 2016 at 8:24 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> On 9/12/16 12:01 PM, Mahesh Bandewar wrote:
>>
>>> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
>>> +                           u16 proto)
>>> +{
>>> +     struct ipvl_addr *addr;
>>> +     struct net_device *sdev;
>>> +
>>> +     addr = ipvlan_skb_to_addr(skb, dev);
>>> +     if (!addr)
>>> +             goto out;
>>> +
>>> +     sdev = addr->master->dev;
>>> +     switch (proto) {
>>> +     case AF_INET:
>>> +     {
>>> +             int err;
>>> +             struct iphdr *ip4h = ip_hdr(skb);
>>> +
>>> +             err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
>>> +                                        ip4h->tos, sdev);
>>> +             if (unlikely(err))
>>> +                     goto out;
>>> +             break;
>>> +     }
>>> +     case AF_INET6:
>>> +     {
>>> +             struct dst_entry *dst;
>>> +             struct ipv6hdr *ip6h = ipv6_hdr(skb);
>>> +             int flags = RT6_LOOKUP_F_HAS_SADDR;
>>> +             struct flowi6 fl6 = {
>>> +                     .flowi6_iif   = sdev->ifindex,
>>> +                     .daddr        = ip6h->daddr,
>>> +                     .saddr        = ip6h->saddr,
>>> +                     .flowlabel    = ip6_flowinfo(ip6h),
>>> +                     .flowi6_mark  = skb->mark,
>>> +                     .flowi6_proto = ip6h->nexthdr,
>>> +             };
>>> +
>>> +             skb_dst_drop(skb);
>>> +             dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
>>> +             skb_dst_set(skb, dst);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             break;
>>> +     }
>>
>> Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?
>>
> I can but it's small enough to have it together. Also 'proto' is a
> parameter to the handler and makes it easier However do you see any
> issue having just one function?

no, just readability comment about putting the case statements in helper functions.

> 
>>
>>> +
>>> +out:
>>> +     return skb;
>>> +}
>>> +
>>> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
>>> +                          const struct nf_hook_state *state)
>>> +{
>>> +     struct ipvl_addr *addr;
>>> +     unsigned int len;
>>> +
>>> +     addr = ipvlan_skb_to_addr(skb, skb->dev);
>>> +     if (!addr)
>>> +             goto out;
>>> +
>>> +     skb->dev = addr->master->dev;
>>> +     len = skb->len + ETH_HLEN;
>>> +     ipvlan_count_rx(addr->master, len, true, false);
>>> +out:
>>> +     return NF_ACCEPT;
>>> +}
>>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>>> index 18b4e8c7f68a..d02be277e1db 100644
>>> --- a/drivers/net/ipvlan/ipvlan_main.c
>>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>>> @@ -9,24 +9,65 @@
>>>
>>>  #include "ipvlan.h"
>>>
>>> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
>>> +     {
>>> +             .hook     = ipvlan_nf_input,
>>> +             .pf       = NFPROTO_IPV4,
>>> +             .hooknum  = NF_INET_LOCAL_IN,
>>> +             .priority = INT_MAX,
>>> +     },
>>> +     {
>>> +             .hook     = ipvlan_nf_input,
>>> +             .pf       = NFPROTO_IPV6,
>>> +             .hooknum  = NF_INET_LOCAL_IN,
>>> +             .priority = INT_MAX,
>>> +     },
>>> +};
>>> +
>>> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
>>> +     .l3mdev_l3_rcv = ipvlan_l3_rcv,
>>> +};
>>> +
>>>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>>>  {
>>>       ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>>>  }
>>>
>>> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>>  {
>>>       struct ipvl_dev *ipvlan;
>>> +     int err = 0;
>>>
>>> +     ASSERT_RTNL();
>>>       if (port->mode != nval) {
>>> +             if (nval == IPVLAN_MODE_L3S) {
>>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>>> +                     if (!port->ipt_hook_added) {
>>> +                             err = _nf_register_hooks(ipvl_nfops,
>>> +                                                     ARRAY_SIZE(ipvl_nfops));
>>
>> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>>
> Do you mean per slave device?  It's registering it for a port (so only
> once) depending on the mode. If the mode != l3s it wont bother
> registering to keep current modes as they are.

My reading of the patch the register is called for each ipvlan newlink that uses l3s mode. Adding this debug patch 

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index d02be277e1db..3f733f1e18ae 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -46,6 +46,7 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
                        if (!port->ipt_hook_added) {
                                err = _nf_register_hooks(ipvl_nfops,
                                                        ARRAY_SIZE(ipvl_nfops));
+pr_warn("called _nf_register_hooks for dev %s: err %d\n", port->dev->name, err);
                                if (!err)
                                        port->ipt_hook_added = true;
                                else
@@ -54,9 +55,11 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
                } else {
                        port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
                        port->dev->l3mdev_ops = NULL;
-                       if (port->ipt_hook_added)
+                       if (port->ipt_hook_added) {
+pr_warn("calling _nf_unregister_hooks for dev %s\n", port->dev->name);
                                _nf_unregister_hooks(ipvl_nfops,
                                                     ARRAY_SIZE(ipvl_nfops));
+                       }
                        port->ipt_hook_added = false;
                }
                list_for_each_entry(ipvlan, &port->ipvlans, pnode) {



and building, installing and testing I see this:

$ ip link add ipvl1 link eth1 type ipvlan mode l3s
-->  called _nf_register_hooks for dev eth1: err 0

$ ip link add ipvl2 link eth1 type ipvlan mode l3s
--> no message generated

$ ip link add ipvl3 link eth2 type ipvlan mode l3s
--> called _nf_register_hooks for dev eth2: err 0


But there is more. If I delete all 3 ipvlan devices the nf_unregister is not called. Unload the ipvlan module and panic:

[  181.135061] BUG: unable to handle kernel paging request at ffffffffa002cfca
[  181.137710] IP: [<ffffffffa002cfca>] 0xffffffffa002cfca
[  181.139574] PGD 1a07067 PUD 1a08063 PMD 1387e4067 PTE 0
[  181.140964] Oops: 0010 [#1] SMP
[  181.141684] Modules linked in: 8021q garp mrp stp llc vrf [last unloaded: ipvlan]
[  181.143678] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc6+ #96
[  181.145092] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[  181.147340] task: ffff88013f196180 task.stack: ffff88013f19c000
[  181.148510] RIP: 0010:[<ffffffffa002cfca>]  [<ffffffffa002cfca>] 0xffffffffa002cfca
[  181.150044] RSP: 0018:ffff88013fc83bd0  EFLAGS: 00010a12
[  181.151102] RAX: ffff88013a781c88 RBX: ffff88013fc83c08 RCX: 0000000000000000
[  181.152460] RDX: ffff88013fc83c38 RSI: ffff88013ab15600 RDI: 0000000000000000
[  181.153781] RBP: ffff88013fc83bf8 R08: 0000000000004b61 R09: 0000000000000004
[  181.155107] R10: 0000000000000000 R11: ffffea00044d9c80 R12: ffffffff81a89510
[  181.156426] R13: ffff88013ab15600 R14: ffff88013fc83c38 R15: ffff88013ab15600
[  181.157742] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
[  181.159232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  181.160303] CR2: ffffffffa002cfca CR3: 000000013a5de000 CR4: 00000000000406e0
[  181.161588] Stack:
[  181.161954]  ffffffff813f8b47 ffff88013ab15600 ffff88013ab15600 ffff88013fc83c38
[  181.163353]  ffff88013874ac4e ffff88013fc83c28 ffffffff813f8b8c ffff88013a781c88
[  181.164722]  ffff88013ab15600 ffffffff81a88b00 ffff88013874ac4e ffff88013fc83c88
[  181.166094] Call Trace:
[  181.166532]  <IRQ>
[  181.166885]  [<ffffffff813f8b47>] ? nf_iterate+0x41/0x5b
[  181.167880]  [<ffffffff813f8b8c>] nf_hook_slow+0x2b/0x94
[  181.168803]  [<ffffffff81400c6e>] ip_local_deliver+0xa4/0xbf
[  181.169748]  [<ffffffff81400644>] ? xfrm4_policy_check.constprop.8+0x52/0x52
[  181.170910]  [<ffffffff81400a71>] ip_rcv_finish+0x2ed/0x34a
[  181.171841]  [<ffffffff81400f02>] ip_rcv+0x279/0x2fb
...


Also, another sequence:
$ ip link add ipvl1 link eth1 type ipvlan mode l3s
-->  called _nf_register_hooks for dev eth1: err 0

$ ip link add ipvl2 link eth1 type ipvlan mode l3s
--> no message generated

$ ip link set ipvl2 type ipvlan mode l3
--> calling _nf_unregister_hooks for dev eth1

that means the hooks are not there for ipvl1. I can remove the module with no panic.

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
  2016-09-14 18:04     ` David Ahern
@ 2016-09-14 18:47       ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 0 replies; 5+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-09-14 18:47 UTC (permalink / raw)
  To: David Ahern; +Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller

On Wed, Sep 14, 2016 at 11:04 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
[...]
>>>> +     ASSERT_RTNL();
>>>>       if (port->mode != nval) {
>>>> +             if (nval == IPVLAN_MODE_L3S) {
>>>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>>>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>>>> +                     if (!port->ipt_hook_added) {
>>>> +                             err = _nf_register_hooks(ipvl_nfops,
>>>> +                                                     ARRAY_SIZE(ipvl_nfops));
>>>
>>> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>>>
>> Do you mean per slave device?  It's registering it for a port (so only
>> once) depending on the mode. If the mode != l3s it wont bother
>> registering to keep current modes as they are.
>
> My reading of the patch the register is called for each ipvlan newlink that uses l3s mode. Adding this debug patch
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index d02be277e1db..3f733f1e18ae 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -46,6 +46,7 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>                         if (!port->ipt_hook_added) {
>                                 err = _nf_register_hooks(ipvl_nfops,
>                                                         ARRAY_SIZE(ipvl_nfops));
> +pr_warn("called _nf_register_hooks for dev %s: err %d\n", port->dev->name, err);
>                                 if (!err)
>                                         port->ipt_hook_added = true;
>                                 else
> @@ -54,9 +55,11 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>                 } else {
>                         port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
>                         port->dev->l3mdev_ops = NULL;
> -                       if (port->ipt_hook_added)
> +                       if (port->ipt_hook_added) {
> +pr_warn("calling _nf_unregister_hooks for dev %s\n", port->dev->name);
>                                 _nf_unregister_hooks(ipvl_nfops,
>                                                      ARRAY_SIZE(ipvl_nfops));
> +                       }
>                         port->ipt_hook_added = false;
>                 }
>                 list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
>
>
>
> and building, installing and testing I see this:
>
> $ ip link add ipvl1 link eth1 type ipvlan mode l3s
> -->  called _nf_register_hooks for dev eth1: err 0
>
> $ ip link add ipvl2 link eth1 type ipvlan mode l3s
> --> no message generated
>
> $ ip link add ipvl3 link eth2 type ipvlan mode l3s
> --> called _nf_register_hooks for dev eth2: err 0
>
The first newlink (l3s) will register the hook and subsequent newlink
calls will skip registration. This is why you did not see the message
for the ipvl2 link creation. However lpvl3 is a added on top of eth2
while the first two link were added on eth1. So this is new port
creation and hence the driver tried to register the hook again.

>
> But there is more. If I delete all 3 ipvlan devices the nf_unregister is not called. Unload the ipvlan module and panic:
>
> [  181.135061] BUG: unable to handle kernel paging request at ffffffffa002cfca
> [  181.137710] IP: [<ffffffffa002cfca>] 0xffffffffa002cfca
> [  181.139574] PGD 1a07067 PUD 1a08063 PMD 1387e4067 PTE 0
> [  181.140964] Oops: 0010 [#1] SMP
> [  181.141684] Modules linked in: 8021q garp mrp stp llc vrf [last unloaded: ipvlan]
> [  181.143678] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc6+ #96
> [  181.145092] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [  181.147340] task: ffff88013f196180 task.stack: ffff88013f19c000
> [  181.148510] RIP: 0010:[<ffffffffa002cfca>]  [<ffffffffa002cfca>] 0xffffffffa002cfca
> [  181.150044] RSP: 0018:ffff88013fc83bd0  EFLAGS: 00010a12
> [  181.151102] RAX: ffff88013a781c88 RBX: ffff88013fc83c08 RCX: 0000000000000000
> [  181.152460] RDX: ffff88013fc83c38 RSI: ffff88013ab15600 RDI: 0000000000000000
> [  181.153781] RBP: ffff88013fc83bf8 R08: 0000000000004b61 R09: 0000000000000004
> [  181.155107] R10: 0000000000000000 R11: ffffea00044d9c80 R12: ffffffff81a89510
> [  181.156426] R13: ffff88013ab15600 R14: ffff88013fc83c38 R15: ffff88013ab15600
> [  181.157742] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
> [  181.159232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  181.160303] CR2: ffffffffa002cfca CR3: 000000013a5de000 CR4: 00000000000406e0
> [  181.161588] Stack:
> [  181.161954]  ffffffff813f8b47 ffff88013ab15600 ffff88013ab15600 ffff88013fc83c38
> [  181.163353]  ffff88013874ac4e ffff88013fc83c28 ffffffff813f8b8c ffff88013a781c88
> [  181.164722]  ffff88013ab15600 ffffffff81a88b00 ffff88013874ac4e ffff88013fc83c88
> [  181.166094] Call Trace:
> [  181.166532]  <IRQ>
> [  181.166885]  [<ffffffff813f8b47>] ? nf_iterate+0x41/0x5b
> [  181.167880]  [<ffffffff813f8b8c>] nf_hook_slow+0x2b/0x94
> [  181.168803]  [<ffffffff81400c6e>] ip_local_deliver+0xa4/0xbf
> [  181.169748]  [<ffffffff81400644>] ? xfrm4_policy_check.constprop.8+0x52/0x52
> [  181.170910]  [<ffffffff81400a71>] ip_rcv_finish+0x2ed/0x34a
> [  181.171841]  [<ffffffff81400f02>] ip_rcv+0x279/0x2fb
> ...
>
Probably creating two ports (on two physical interfaces) is not a
common / use case but that doesn't mean kernel should crash either!
>
> Also, another sequence:
> $ ip link add ipvl1 link eth1 type ipvlan mode l3s
> -->  called _nf_register_hooks for dev eth1: err 0
>
> $ ip link add ipvl2 link eth1 type ipvlan mode l3s
> --> no message generated
>
> $ ip link set ipvl2 type ipvlan mode l3
> --> calling _nf_unregister_hooks for dev eth1
>
> that means the hooks are not there for ipvl1. I can remove the module with no panic.
>
tl;dr you found an issue for sure! The hook registration has to be
global than per port and I'll fix that in the next rev. Thank you for
taking to time to test this.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-09-14 18:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-12 18:01 [PATCHv2 next 3/3] ipvlan: Introduce l3s mode Mahesh Bandewar
2016-09-14  3:24 ` David Ahern
2016-09-14 16:30   ` Mahesh Bandewar (महेश बंडेवार)
2016-09-14 18:04     ` David Ahern
2016-09-14 18:47       ` Mahesh Bandewar (महेश बंडेवार)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.