linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Daniel Borkmann <daniel@iogearbox.net>,
	Mahesh Bandewar <maheshb@google.com>,
	David Ahern <dsa@cumulusnetworks.com>,
	Florian Westphal <fw@strlen.de>, Martynas Pumputis <m@lambda.lt>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 4.9 12/30] ipvlan, l3mdev: fix broken l3s mode wrt local routes
Date: Mon,  4 Feb 2019 11:36:50 +0100	[thread overview]
Message-ID: <20190204103607.503854833@linuxfoundation.org> (raw)
In-Reply-To: <20190204103605.271746870@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Daniel Borkmann <daniel@iogearbox.net>

[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]

While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/net/ipvlan/ipvlan_main.c |    6 +++---
 include/linux/netdevice.h        |    8 ++++++++
 include/net/l3mdev.h             |    3 ++-
 3 files changed, 13 insertions(+), 4 deletions(-)

--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -85,12 +85,12 @@ static int ipvlan_set_port_mode(struct i
 			err = ipvlan_register_nf_hook();
 			if (!err) {
 				mdev->l3mdev_ops = &ipvl_l3mdev_ops;
-				mdev->priv_flags |= IFF_L3MDEV_MASTER;
+				mdev->priv_flags |= IFF_L3MDEV_RX_HANDLER;
 			} else
 				goto fail;
 		} else if (port->mode == IPVLAN_MODE_L3S) {
 			/* Old mode was L3S */
-			mdev->priv_flags &= ~IFF_L3MDEV_MASTER;
+			mdev->priv_flags &= ~IFF_L3MDEV_RX_HANDLER;
 			ipvlan_unregister_nf_hook();
 			mdev->l3mdev_ops = NULL;
 		}
@@ -158,7 +158,7 @@ static void ipvlan_port_destroy(struct n
 
 	dev->priv_flags &= ~IFF_IPVLAN_MASTER;
 	if (port->mode == IPVLAN_MODE_L3S) {
-		dev->priv_flags &= ~IFF_L3MDEV_MASTER;
+		dev->priv_flags &= ~IFF_L3MDEV_RX_HANDLER;
 		ipvlan_unregister_nf_hook();
 		dev->l3mdev_ops = NULL;
 	}
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1368,6 +1368,7 @@ struct net_device_ops {
  * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
+ * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1398,6 +1399,7 @@ enum netdev_priv_flags {
 	IFF_RXFH_CONFIGURED		= 1<<25,
 	IFF_PHONY_HEADROOM		= 1<<26,
 	IFF_MACSEC			= 1<<27,
+	IFF_L3MDEV_RX_HANDLER		= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1427,6 +1429,7 @@ enum netdev_priv_flags {
 #define IFF_TEAM			IFF_TEAM
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
+#define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4244,6 +4247,11 @@ static inline bool netif_supports_nofcs(
 	return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_has_l3_rx_handler(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_L3MDEV_RX_HANDLER;
+}
+
 static inline bool netif_is_l3_master(const struct net_device *dev)
 {
 	return dev->priv_flags & IFF_L3MDEV_MASTER;
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -142,7 +142,8 @@ struct sk_buff *l3mdev_l3_rcv(struct sk_
 
 	if (netif_is_l3_slave(skb->dev))
 		master = netdev_master_upper_dev_get_rcu(skb->dev);
-	else if (netif_is_l3_master(skb->dev))
+	else if (netif_is_l3_master(skb->dev) ||
+		 netif_has_l3_rx_handler(skb->dev))
 		master = skb->dev;
 
 	if (master && master->l3mdev_ops->l3mdev_l3_rcv)



  parent reply	other threads:[~2019-02-04 10:43 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-04 10:36 [PATCH 4.9 00/30] 4.9.155-stable review Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 01/30] Fix "net: ipv4: do not handle duplicate fragments as overlapping" Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 02/30] fs: add the fsnotify call to vfs_iter_write Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 03/30] ipv6: Consider sk_bound_dev_if when binding a socket to an address Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 04/30] l2tp: copy 4 more bytes to linear part if necessary Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 05/30] net/mlx4_core: Add masking for a few queries on HCA caps Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 06/30] netrom: switch to sock timer API Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 07/30] net/rose: fix NULL ax25_cb kernel panic Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 08/30] ucc_geth: Reset BQL queue when stopping device Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 09/30] net/mlx5e: Allow MAC invalidation while spoofchk is ON Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 10/30] l2tp: remove l2specific_len dependency in l2tp_core Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 11/30] l2tp: fix reading optional fields of L2TPv3 Greg Kroah-Hartman
2019-02-04 10:36 ` Greg Kroah-Hartman [this message]
2019-02-04 10:36 ` [PATCH 4.9 13/30] CIFS: Do not count -ENODATA as failure for query directory Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 14/30] fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 15/30] ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 16/30] arm64: kaslr: ensure randomized quantities are clean also when kaslr is off Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 17/30] arm64: hyp-stub: Forbid kprobing of the hyp-stub Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 18/30] arm64: hibernate: Clean the __hyp_text to PoC after resume Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 19/30] gfs2: Revert "Fix loop in gfs2_rbm_find" Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 20/30] platform/x86: asus-nb-wmi: Map 0x35 to KEY_SCREENLOCK Greg Kroah-Hartman
2019-02-04 10:36 ` [PATCH 4.9 21/30] platform/x86: asus-nb-wmi: Drop mapping of 0x33 and 0x34 scan codes Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 22/30] mmc: sdhci-iproc: handle mmc_of_parse() errors during probe Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 23/30] kernel/exit.c: release ptraced tasks before zap_pid_ns_processes Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 24/30] mm, oom: fix use-after-free in oom_kill_process Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 25/30] mm: hwpoison: use do_send_sig_info() instead of force_sig() Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 26/30] mm: migrate: dont rely on __PageMovable() of newpage after unlocking it Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 27/30] cifs: Always resolve hostname before reconnecting Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 28/30] drivers: core: Remove glue dirs from sysfs earlier Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 29/30] fs: dont scan the inode cache before SB_BORN is set Greg Kroah-Hartman
2019-02-04 10:37 ` [PATCH 4.9 30/30] fanotify: fix handling of events on child sub-directory Greg Kroah-Hartman
2019-02-04 10:48   ` Amir Goldstein
2019-02-04 11:03     ` Greg Kroah-Hartman
2019-02-04 21:48 ` [PATCH 4.9 00/30] 4.9.155-stable review Guenter Roeck
2019-02-05  6:39 ` Naresh Kamboju
2019-02-05 11:38 ` Jon Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190204103607.503854833@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsa@cumulusnetworks.com \
    --cc=fw@strlen.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=m@lambda.lt \
    --cc=maheshb@google.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).