* [RFC PATCH 00/29] net: VRF support
@ 2015-02-05  1:34 David Ahern
  2015-02-05  1:34 ` [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison David Ahern
                   ` (36 more replies)
  0 siblings, 37 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Kernel patches are also available here:
    https://github.com/dsahern/linux.git vrf-3.19

iproute2 patches are also available here:
    https://github.com/dsahern/iproute2 vrf-3.19


Background
----------
The concept of VRFs (Virtual Routing and Forwarding) has been around for over
15 years. Support for VRFs in the Linux kernel has been an often requested
feature for almost as long. For a while support was available via an out of
tree patch [1]. Since network namespaces came along, the response to queries
about VRF support for Linux was 'use namespaces'. But as mentioned previously
[2] network namespaces are not a good match for VRFs. Of the list of problems
noted the big one is that namespaces do not scale efficiently to the number
of VRFs supported by networking gear (> 1000 VRFs). Networking vendors that
want to use Linux as the OS have to carry custom solutions to this problem --
be it userspace networking stacks, extensive kernel patches (to add VRF
support or bend the implementation of namespaces), and/or patches to many
open source components. The recent addition of switchdev support in the
kernel suggests that people expect the use of Linux as a switch networking
OS to increase. Hopefully the time is right to re-open the discussion on a
scalable VRF implementation for the Linux kernel.

The intent of this RFC is to get feedback on the overall idea - namely VRFs
as an integer id and the nesting of VRFs within a namespace. This set includes
changes only to core IPv4 code which shows the concept; changes to the rest
of the network stack are fairly repetitive.

This patch set has a number of similarities to the original VRF patch - most
notably VRF ids as an integer index and plumbing through iproute2 and
netlink. But this set is really a complete re-implementation of the feature,
integrating VRF within a namespace and leveraging existing support for
network namespaces.

Design
------
Namespaces provide excellent separation of the networking stack from the
netdevices and up. The intent of VRFs is to provide an additional,
logical separation at the L3 layer within a namespace.

   +----------------------------------------------------------+
   | Namespace foo                                            |
   |                         +---------------+                |
   |          +------+       | L3/L4 service |                |
   |          | lldp |       |   (VRF any)   |                |
   |          +------+       +---------------+                |
   |                                                          |
   |                             +-------------------------+  |
   |                             | VRF M                   |  |
   |  +---------------------+  +-------------------------+ |  |
   |  | VRF 1 (default)     |  | VRF N                   | |  |
   |  |  +---------------+  |  |    +---------------+    | |  |
   |  |  | L3/L4 service |  |  |    | L3/L4 service |    | |  |
   |  |  | (VRF unaware) |  |  |    | (VRF unaware) |    | |  |
   |  |  +---------------+  |  |    +---------------+    | |  |
   |  |                     |  |                         | |  |
   |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
   |  || FIB | | neighbor | |  |  | FIB | | neighbor |   | |  |
   |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
   |  |                     |  |                         |-+  |
   |  | {dev 1}  {dev 2}    |  | {dev 3} {dev 4} {dev 5} |    |
   |  +---------------------+  +-------------------------+    |
   +----------------------------------------------------------+

This is accomplished by enhancing the current namespace checks to a
broader network context that is both a namespace and a VRF id. The VRF
id is a tag applied to relevant structures, an integer between 1 and 4095
which allows for 4095 VRFs (could have 0 be the default VRF and then the
range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
is based on the genid scheme for versioning networking data which is a
32-bit integer. The VRF id is the lower 12 bits of the genid's.)
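
As a rough illustration of the 12-bit limit, the sketch below packs a VRF id
into the lower 12 bits of a 32-bit genid. All names here (VRF_BITS,
genid_pack, and so on) are hypothetical and not taken from the patches, and
placing the version counter in the upper 20 bits is an assumption; only the
"lower 12 bits" part comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the macros in the patch set may differ. */
#define VRF_BITS 12
#define VRF_MASK ((1u << VRF_BITS) - 1)	/* 0xfff */
#define VRF_MIN  1u
#define VRF_MAX  VRF_MASK		/* 4095 */

/* A VRF id is valid if it lies in the range 1..4095. */
static inline int vrf_is_valid(uint32_t vrf)
{
	return vrf >= VRF_MIN && vrf <= VRF_MAX;
}

/*
 * Combine a version counter with a VRF id. Assumption: the counter
 * occupies the upper 20 bits; the VRF id takes the lower 12 bits.
 */
static inline uint32_t genid_pack(uint32_t version, uint32_t vrf)
{
	assert(vrf_is_valid(vrf));
	return (version << VRF_BITS) | vrf;
}

/* Extract the VRF id from a packed genid. */
static inline uint32_t genid_vrf(uint32_t genid)
{
	return genid & VRF_MASK;
}

/* Extract the version counter from a packed genid. */
static inline uint32_t genid_version(uint32_t genid)
{
	return genid >> VRF_BITS;
}
```

With this layout a wider VRF id would shrink the space left for the version
counter, which is one way to read the remark that the limit is artificial.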

Netdevices, sk_buffs, sockets, and tasks are all tagged with a VRF id.
Network lookups (devices, sockets, addresses, routes, neighbors) require a
match of both network namespace and VRF id (or the special 'vrf any' tag;
more on that later).
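
The lookup rule can be sketched in user space as follows. This is only a
model under stated assumptions: struct net_sketch and struct net_ctx_sketch
are stand-ins for the kernel structures, and VRF_ANY is a hypothetical
sentinel value chosen outside the 1-4095 range; the patches may represent
'vrf any' differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's struct net; only its address matters here. */
struct net_sketch {
	int id;
};

/* Hypothetical sentinel for the 'vrf any' tag (outside 1..4095). */
#define VRF_ANY 0xffffu

/* Model of a network context: a namespace plus a VRF id. */
struct net_ctx_sketch {
	const struct net_sketch *net;
	uint32_t vrf;
};

/* Two VRF ids match if they are equal or either side is 'any'. */
static inline bool vrf_eq_or_any(uint32_t a, uint32_t b)
{
	return a == b || a == VRF_ANY || b == VRF_ANY;
}

/* A lookup hit requires the same namespace AND a matching VRF id. */
static inline bool ctx_match(const struct net_ctx_sketch *a,
			     const struct net_ctx_sketch *b)
{
	assert(a && b);
	return a->net == b->net && vrf_eq_or_any(a->vrf, b->vrf);
}
```

Note that 'any' only widens the VRF half of the check; a context in another
namespace never matches.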

Beyond the 4-byte tag in various data structures, there are no resources
allocated to a VRF so there is no need to create or destroy a VRF which is
in-line with the concept of keeping it lightweight for scalability. The
trade-off is that VRFs share the sysctl settings of the namespace they are
part of, as well as other per-namespace state such as MIB counters.

The VRF id of tasks defaults to 1 and is inherited from parent to child. It can
be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
to this file (if preferred this can be made a prctl to change the VRF id).
This allows services to be launched in a VRF context using ip, similar to
what is done for network namespaces.
    e.g., ip vrf exec 99 /usr/sbin/sshd

(or a simpler chvrf alias/command can be used to just write the VRF id
to the proc file.)
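
A chvrf-style helper could look roughly like the sketch below. It is a
minimal illustration, not code from the patches: the helper names are
hypothetical, the '/proc/<pid>/vrf' file exists only on a kernel carrying
this patch set, and writing the id as a decimal string followed by a newline
is an assumption about the accepted format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build "/proc/<pid>/vrf" into buf. Returns 0 on success, -1 if it does not fit. */
static int vrf_proc_path(char *buf, size_t len, long pid)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "/proc/%ld/vrf", pid);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Write a VRF id to the given file as a decimal string.
 * Assumption: the proc file accepts "<id>\n".
 * Returns 0 on success, -1 on any I/O error.
 */
static int vrf_write_id(const char *path, unsigned int vrf)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%u\n", vrf) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A wrapper would build the path for its own pid, call vrf_write_id(), and then
exec the target command so that the command inherits the new VRF id.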

The task's VRF id also affects viewing and modifying network configuration.
For example, 'ip addr show', 'ip route ls', 'ifconfig', 'arp -n', etc, only
show network data for the VRF associated with the task's VRF id; devices
are at the L2 layer so a command listing devices is not impacted by VRF id.

When a socket is created the VRF id is taken from the task. Socket-vrf
association for non-connected sockets can be changed using a setsockopt
(e.g., create a socket then change VRF id prior to calling bind or connect).

Network devices belong to a single VRF context which defaults to VRF 1.
They can be assigned to another VRF using IFLA_VRF attribute in link
messages. Similarly the VRF assignment is returned in the IFLA_VRF
attribute. The ip command has been modified to display the VRF id of a
device. L2 applications like lldp are not VRF aware and still work through
all network devices within the namespace.

On RX, skbs get their VRF context from the netdevice the packet is received
on. For TX, the VRF context for an skb is taken from the socket. The
intention is for L3/raw sockets to be able to set the VRF context of a
transmitted packet using cmsg (not coded in this patch set).

VRF aware apps (e.g., L3 VPNs) can have sockets in multiple VRFs for
forwarding packets.

The special 'ANY VRF' context allows a single instance of a daemon to
provide a service across all VRFs.
    e.g., ip vrf exec any /usr/sbin/sshd 

The 'any' context applies to listen sockets only; connected sockets are in
a VRF context. Child sockets accepted by the daemon acquire the VRF context
of the network device the connection originated on.
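
The listen/accept behaviour described above can be modelled as follows. The
structures and the VRF_ANY_SKETCH value are hypothetical stand-ins, not the
kernel types; the sketch only shows that a listener in the 'any' context
matches every ingress device, while the accepted child always ends up in the
concrete VRF of that device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel for the 'any' context (outside 1..4095). */
#define VRF_ANY_SKETCH 0xffffu

/* Stand-ins for a netdevice and a socket; only the VRF tag is modelled. */
struct dev_sketch {
	uint32_t vrf;
};

struct sock_sketch {
	uint32_t vrf;
	bool listening;
};

/* A listener takes a connection if it is 'any' or in the device's VRF. */
static bool listener_matches(const struct sock_sketch *lsk,
			     const struct dev_sketch *in_dev)
{
	if (!lsk->listening)
		return false;

	return lsk->vrf == VRF_ANY_SKETCH || lsk->vrf == in_dev->vrf;
}

/* The child socket is never 'any': it inherits the ingress device's VRF. */
static struct sock_sketch accept_child(const struct sock_sketch *lsk,
				       const struct dev_sketch *in_dev)
{
	struct sock_sketch child = { .vrf = in_dev->vrf, .listening = false };

	assert(listener_matches(lsk, in_dev));
	return child;
}
```

In terms of the example setup further down, one 'any' listener would accept
connections arriving on both eth1 (VRF 11) and eth5 (VRF 13), and each child
socket would carry the VRF id of its own ingress device.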

The 'ANY VRF' context can also be used to display all addresses, routes
or neighbors in the kernel cache. That is, 'ip addr show', 'ip route ls',
'ifconfig', 'arp -n', etc, show all network data for the namespace.


About this Patch Set
--------------------
This is not a complete conversion of the networking stack, only a small
sampling to test the waters. The only changes are to core IPv4 code [3],
which is sufficient to illustrate the fundamental concept. Changes from
struct net to net_ctx are very repetitive.

I'm sure there are a lot of oversights and bugs, but the intent here is
to solicit feedback on the overall idea.


Examples
--------
To illustrate the VRF patches consider a system with 18 NICs:
- eth0, eth17 are in default namespace (e.g., management namespace)

- eth1 - eth8 are in group1 namespace
  - eth1 - eth4 are in VRF 11
  - eth5 - eth8 are in VRF 13

- eth9 - eth16 are in group2 namespace
  - eth9 - eth12 are in VRF 21
  - eth13 - eth16 are in VRF 23

- Addresses assigned to each interface:
  - eth1: 1.1.1.1/24
  - eth2: 2.2.2.1/24
  - eth3: 3.3.3.1/24
  - eth4: 4.4.4.1/24
  - eth5: 1.1.1.1/24 (not a typo, duplicate address in different vrfs)
  - eth6: 6.6.6.1/24
  - eth7: 7.7.7.1/24
  - eth8: 8.8.8.1/24

- openlldpd is started in each namespace

1. device list is VRF agnostic
   - ifconfig -a, ip link show, /proc/net/dev
     --> default namespace shows only eth0 and eth17
     --> group1 namespace shows only eth1 - eth8
     --> group2 namespace shows only eth9 - eth16
         - ip shows vrf assignment of each link

    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
        link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff

2. address, route, neighbor list is VRF aware
   - ifconfig, ip addr show, ip route ls, /proc/net/route
     --> shows only addresses for VRF id of task unless id is 'any'

   in VRF 1:
   ifconfig eth1
   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 02:ab:cd:02:00:01  txqueuelen 1000  (Ethernet)
   ...

   No addresses are shown. But if the command is run in VRF 11 or VRF 'any',
   the address is listed:
     ip vrf exec 11 ip addr show dev eth1
     3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff
        inet 1.1.1.1/24 brd 1.1.1.255 scope global eth1
           valid_lft forever preferred_lft forever

3. start ssh in group1 namespace
   ip netns exec group1 ip vrf exec 11 /usr/sbin/sshd -d
   ssh to 1.1.1.1 via eth1

   ip netns exec group1 ip vrf exec 13 /usr/sbin/sshd -d
   ssh to 1.1.1.1 via eth5
   --> same namespace but different VRFs

4. One sshd instance handles all VRFs in group1 namespace
   ip netns exec group1 ip vrf exec any /usr/sbin/sshd

   --> ssh to any address in the namespace works

References
----------
[1] http://sourceforge.net/projects/linux-vrf

[2] http://www.spinics.net/lists/netdev/msg298368.html

[3] To build only enable core ipv4 code. Disable IPv6, netfilter, ipsec, etc.


David Ahern (29):
  net: Introduce net_ctx and macro for context comparison
  net: Flip net_device to use net_ctx
  net: Flip sock_common to net_ctx
  net: Add net_ctx macros for skbuffs
  net: Flip seq_net_private to net_ctx
  net: Flip fib_rules and fib_rules_ops to use net_ctx
  net: Flip inet_bind_bucket to net_ctx
  net: Flip fib_info to net_ctx
  net: Flip ip6_flowlabel to net_ctx
  net: Flip neigh structs to net_ctx
  net: Flip nl_info to net_ctx
  net: Add device lookups by net_ctx
  net: Convert function arg from struct net to struct net_ctx
  net: vrf: Introduce vrf header file
  net: vrf: Add vrf to net_ctx struct
  net: vrf: Set default vrf
  net: vrf: Add vrf context to task struct
  net: vrf: Plumbing for vrf context on a socket
  net: vrf: Add vrf context to skb
  net: vrf: Add vrf context to flow struct
  net: vrf: Add vrf context to genid's
  net: vrf: Set VRF id in various network structs
  net: vrf: Enable vrf checks
  net: vrf: Add support to get/set vrf context on a device
  net: vrf: Handle VRF any context
  net: vrf: Change single_open_net to pass net_ctx
  net: vrf: Add vrf checks and context to ipv4 proc files
  iproute2: vrf: Add vrf subcommand
  iproute2: Add vrf option to ip link command

 fs/proc/base.c                   |  94 +++++++++++++++++++++++++
 fs/proc/proc_net.c               |  22 +++++-
 include/linux/inetdevice.h       |  12 ++--
 include/linux/init_task.h        |   1 +
 include/linux/netdevice.h        |  44 +++++++++++-
 include/linux/sched.h            |   2 +
 include/linux/seq_file_net.h     |  10 +--
 include/linux/skbuff.h           |   5 ++
 include/net/addrconf.h           |  22 +++---
 include/net/arp.h                |   2 +-
 include/net/dst.h                |  16 ++---
 include/net/fib_rules.h          |  10 ++-
 include/net/flow.h               |  10 ++-
 include/net/inet6_hashtables.h   |  19 +++---
 include/net/inet_hashtables.h    |  60 ++++++++++------
 include/net/inet_sock.h          |   1 +
 include/net/inet_timewait_sock.h |   1 +
 include/net/ip.h                 |  10 +--
 include/net/ip6_fib.h            |   4 +-
 include/net/ip6_route.h          |  24 +++----
 include/net/ip_fib.h             |  38 +++++++----
 include/net/ipv6.h               |  14 +++-
 include/net/neighbour.h          |  93 +++++++++++++++++++++----
 include/net/net_namespace.h      |  39 +++++++++--
 include/net/netlink.h            |   5 +-
 include/net/route.h              |  46 +++++++------
 include/net/sock.h               |  21 ++++--
 include/net/tcp.h                |   1 +
 include/net/transp_v6.h          |   2 +-
 include/net/udp.h                |   8 +--
 include/net/vrf.h                |  36 ++++++++++
 include/net/xfrm.h               |  28 ++++----
 include/uapi/linux/if_link.h     |   1 +
 include/uapi/linux/in.h          |   1 +
 kernel/fork.c                    |   2 +
 net/core/dev.c                   |  95 +++++++++++++++++++++++---
 net/core/fib_rules.c             |  36 ++++++----
 net/core/flow.c                  |   5 +-
 net/core/neighbour.c             | 106 +++++++++++++++--------------
 net/core/rtnetlink.c             |  12 ++++
 net/core/skbuff.c                |  12 ++++
 net/core/sock.c                  |   2 +
 net/ipv4/af_inet.c               |  20 ++++--
 net/ipv4/arp.c                   |  76 ++++++++++++---------
 net/ipv4/datagram.c              |   6 +-
 net/ipv4/devinet.c               |  64 ++++++++++++------
 net/ipv4/fib_frontend.c          |  83 ++++++++++++++---------
 net/ipv4/fib_rules.c             |  12 ++--
 net/ipv4/fib_semantics.c         |  38 +++++++----
 net/ipv4/fib_trie.c              |  24 +++++--
 net/ipv4/icmp.c                  |  40 ++++++-----
 net/ipv4/igmp.c                  |  53 +++++++++------
 net/ipv4/inet_connection_sock.c  |  23 ++++---
 net/ipv4/inet_diag.c             |  13 ++--
 net/ipv4/inet_hashtables.c       |  42 +++++++-----
 net/ipv4/inet_timewait_sock.c    |   1 +
 net/ipv4/ip_input.c              |   6 +-
 net/ipv4/ip_options.c            |  20 +++---
 net/ipv4/ip_output.c             |  16 +++--
 net/ipv4/ip_sockglue.c           |  32 +++++++--
 net/ipv4/ipconfig.c              |   6 +-
 net/ipv4/ipmr.c                  |  53 +++++++++------
 net/ipv4/netfilter.c             |  13 ++--
 net/ipv4/ping.c                  |  41 +++++------
 net/ipv4/proc.c                  |  10 +--
 net/ipv4/raw.c                   |  48 ++++++++-----
 net/ipv4/route.c                 | 143 +++++++++++++++++++++++----------------
 net/ipv4/syncookies.c            |   6 +-
 net/ipv4/tcp_ipv4.c              |  57 +++++++++-------
 net/ipv4/tcp_minisocks.c         |   1 +
 net/ipv4/udp.c                   | 122 ++++++++++++++++++---------------
 net/ipv4/udp_diag.c              |  11 +--
 net/ipv4/xfrm4_policy.c          |  14 ++--
 net/netlink/af_netlink.c         |  12 ++++
 net/sctp/protocol.c              |  10 +--
 net/xfrm/xfrm_policy.c           |   9 +--
 76 files changed, 1415 insertions(+), 682 deletions(-)
 create mode 100644 include/net/vrf.h

-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 02/29] net: Flip net_device to use net_ctx David Ahern
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

For now the network context is just the namespace. A later patch adds
a vrf context.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/net_namespace.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 36faf4990c4b..b932f2a83865 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -134,11 +134,19 @@ struct net {
 	atomic_t		fnhe_genid;
 };
 
+struct net_ctx {
+#ifdef CONFIG_NET_NS
+	struct net *net;
+#endif
+};
+
 #include <linux/seq_file_net.h>
 
 /* Init's network namespace */
 extern struct net init_net;
 
+#define INIT_NET_CTX  { .net = &init_net }
+
 #ifdef CONFIG_NET_NS
 struct net *copy_net_ns(unsigned long flags, struct user_namespace *user_ns,
 			struct net *old_net);
@@ -202,6 +210,13 @@ int net_eq(const struct net *net1, const struct net *net2)
 	return net1 == net2;
 }
 
+static inline
+int net_ctx_eq(struct net_ctx *ctx1, struct net_ctx *ctx2)
+{
+	return net_eq(ctx1->net, ctx2->net);
+}
+
+
 void net_drop_ns(void *);
 
 #else
@@ -226,6 +241,12 @@ int net_eq(const struct net *net1, const struct net *net2)
 	return 1;
 }
 
+static inline
+int net_ctx_eq(struct net_ctx *ctx1, struct net_ctx *ctx2)
+{
+	return 1;
+}
+
 #define net_drop_ns NULL
 #endif
 
-- 
1.9.3 (Apple Git-50)
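
For readers who want to experiment outside the kernel, the following
user-space sketch models how a lookup might use net_ctx_eq() once structures
carry a net_ctx instead of a bare struct net pointer. It is an illustration
only: struct dev_sketch and dev_lookup() are hypothetical, struct net is
reduced to a dummy, and the comparison helper takes const pointers here for
convenience, unlike the version in the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Dummy stand-in for the kernel's struct net. */
struct net {
	int id;
};

/* As in the patch: for now the context is just the namespace. */
struct net_ctx {
	struct net *net;
};

static inline int net_eq(const struct net *net1, const struct net *net2)
{
	return net1 == net2;
}

static inline int net_ctx_eq(const struct net_ctx *ctx1,
			     const struct net_ctx *ctx2)
{
	return net_eq(ctx1->net, ctx2->net);
}

/* Hypothetical device record tagged with a network context. */
struct dev_sketch {
	const char *name;
	struct net_ctx net_ctx;
};

/* Find a device by name, considering only devices in the given context. */
static struct dev_sketch *dev_lookup(struct dev_sketch *devs, size_t n,
				     const struct net_ctx *ctx,
				     const char *name)
{
	size_t i;

	assert(ctx != NULL && name != NULL);
	for (i = 0; i < n; i++) {
		if (net_ctx_eq(&devs[i].net_ctx, ctx) &&
		    strcmp(devs[i].name, name) == 0)
			return &devs[i];
	}

	return NULL;
}
```

Because callers compare whole contexts rather than namespaces, a later patch
can add a VRF member to struct net_ctx and extend net_ctx_eq() without
touching lookup code of this shape.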


* [RFC PATCH 02/29] net: Flip net_device to use net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
  2015-02-05  1:34 ` [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05 13:47   ` Nicolas Dichtel
  2015-02-05  1:34 ` [RFC PATCH 03/29] net: Flip sock_common to net_ctx David Ahern
                   ` (34 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Enhances a net_device from the current namespace-only assignment to
a network context.

Define nd_net macro to handle existing code references. Add macros
for generating net_ctx struct given a net_device (needed because of
the way net references are done) and for comparing a given net_ctx
struct to the context of a net_device.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/netdevice.h | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 16251e96e6aa..55a221fec12b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1697,9 +1697,8 @@ struct net_device {
 	struct netpoll_info __rcu	*npinfo;
 #endif
 
-#ifdef CONFIG_NET_NS
-	struct net		*nd_net;
-#endif
+	struct net_ctx		net_ctx;
+#define nd_net net_ctx.net
 
 	/* mid-layer private */
 	union {
@@ -1845,6 +1844,18 @@ void dev_net_set(struct net_device *dev, struct net *net)
 #endif
 }
 
+/* get net_ctx from device */
+#define DEV_NET_CTX(dev)  { .net = dev_net((dev)) }
+
+static inline
+int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
+{
+	if (net_eq(dev_net(dev), ctx->net))
+		return 1;
+
+	return 0;
+}
+
 static inline bool netdev_uses_dsa(struct net_device *dev)
 {
 #if IS_ENABLED(CONFIG_NET_DSA)
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 03/29] net: Flip sock_common to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
  2015-02-05  1:34 ` [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison David Ahern
  2015-02-05  1:34 ` [RFC PATCH 02/29] net: Flip net_device to use net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 04/29] net: Add net_ctx macros for skbuffs David Ahern
                   ` (33 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Enhances sockets from the current namespace-only assignment to a
network context.

Define skc_net macro to handle existing code references.

Add macros for retrieving net_ctx struct given a sock struct (needed
because of the way net references are done) and for comparing a
given net_ctx struct to the context of a sock struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/sock.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 511ef7c8889b..e67347ed1555 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -146,7 +146,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
  *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
  *	@skc_prot: protocol handlers inside a network family
- *	@skc_net: reference to the network namespace of this socket
+ *	@skc_net_ctx: reference to network context of this socket
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
@@ -190,9 +190,8 @@ struct sock_common {
 		struct hlist_nulls_node skc_portaddr_node;
 	};
 	struct proto		*skc_prot;
-#ifdef CONFIG_NET_NS
-	struct net	 	*skc_net;
-#endif
+	struct net_ctx		skc_net_ctx;
+#define skc_net  skc_net_ctx.net
 
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr		skc_v6_daddr;
@@ -326,7 +325,7 @@ struct sock {
 #define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
-#define sk_net			__sk_common.skc_net
+#define sk_net			__sk_common.skc_net_ctx.net
 #define sk_v6_daddr		__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 
@@ -2197,6 +2196,14 @@ void sock_net_set(struct sock *sk, struct net *net)
 	write_pnet(&sk->sk_net, net);
 }
 
+#define SOCK_NET_CTX(sk)  { .net = sock_net((sk)) }
+
+static inline
+int sock_net_ctx_eq(struct sock *sk, struct net_ctx *ctx)
+{
+	return net_eq(sock_net(sk), ctx->net);
+}
+
 /*
  * Kernel sockets, f.e. rtnl or icmp_socket, are a part of a namespace.
  * They should not hold a reference to a namespace in order to allow
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 04/29] net: Add net_ctx macros for skbuffs
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (2 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 03/29] net: Flip sock_common to net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 05/29] net: Flip seq_net_private to net_ctx David Ahern
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

skb macros will be used later for determining a network context
from skbs.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/skbuff.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85ab7d72b54c..a5dfef469d07 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -665,6 +665,10 @@ struct sk_buff {
 	atomic_t		users;
 };
 
+#define SKB_NET_CTX_DEV(skb)  { .net = dev_net((skb)->dev) }
+#define SKB_NET_CTX_DST(skb)  { .net = dev_net(skb_dst((skb))->dev) }
+#define SKB_NET_CTX_SOCK(skb) { .net = sock_net((skb)->sk) }
+
 #ifdef __KERNEL__
 /*
  *	Handling routines are only of interest to the kernel
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 05/29] net: Flip seq_net_private to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (3 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 04/29] net: Add net_ctx macros for skbuffs David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 06/29] net: Flip fib_rules and fib_rules_ops to use net_ctx David Ahern
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Enhances seq files for networking from the current namespace-only
assignment to a network context.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 fs/proc/proc_net.c           |  2 +-
 include/linux/seq_file_net.h | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 1bde894bc624..4996f5e91a90 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -54,7 +54,7 @@ int seq_open_net(struct inode *ino, struct file *f,
 		return -ENOMEM;
 	}
 #ifdef CONFIG_NET_NS
-	p->net = net;
+	p->net_ctx.net = net;
 #endif
 	return 0;
 }
diff --git a/include/linux/seq_file_net.h b/include/linux/seq_file_net.h
index 32c89bbe24a2..b860d053a65e 100644
--- a/include/linux/seq_file_net.h
+++ b/include/linux/seq_file_net.h
@@ -7,9 +7,7 @@ struct net;
 extern struct net init_net;
 
 struct seq_net_private {
-#ifdef CONFIG_NET_NS
-	struct net *net;
-#endif
+	struct net_ctx net_ctx;
 };
 
 int seq_open_net(struct inode *, struct file *,
@@ -21,10 +19,14 @@ int single_release_net(struct inode *, struct file *);
 static inline struct net *seq_file_net(struct seq_file *seq)
 {
 #ifdef CONFIG_NET_NS
-	return ((struct seq_net_private *)seq->private)->net;
+	return ((struct seq_net_private *)seq->private)->net_ctx.net;
 #else
 	return &init_net;
 #endif
 }
+static inline struct net_ctx *seq_file_net_ctx(struct seq_file *seq)
+{
+	return &((struct seq_net_private *)seq->private)->net_ctx;
+}
 
 #endif
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 06/29] net: Flip fib_rules and fib_rules_ops to use net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (4 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 05/29] net: Flip seq_net_private to net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 07/29] net: Flip inet_bind_bucket to net_ctx David Ahern
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/fib_rules.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index e584de16e4c3..b02bd45e3e97 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -20,7 +20,8 @@ struct fib_rule {
 	/* 3 bytes hole, try to use */
 	u32			target;
 	struct fib_rule __rcu	*ctarget;
-	struct net		*fr_net;
+	struct net_ctx		fr_net_ctx;
+#define fr_net  fr_net_ctx.net
 
 	atomic_t		refcnt;
 	u32			pref;
@@ -75,7 +76,8 @@ struct fib_rules_ops {
 	const struct nla_policy	*policy;
 	struct list_head	rules_list;
 	struct module		*owner;
-	struct net		*fro_net;
+	struct net_ctx		fro_net_ctx;
+#define fro_net  fro_net_ctx.net
 	struct rcu_head		rcu;
 };
 
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 07/29] net: Flip inet_bind_bucket to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (5 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 06/29] net: Flip fib_rules and fib_rules_ops to use net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 08/29] net: Flip fib_info " David Ahern
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/inet_hashtables.h | 21 +++++++++++++++++----
 net/ipv4/inet_hashtables.c    |  2 +-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index dd1950a7e273..c9e8b7b7331a 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -76,9 +76,7 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)	-DaveM
  */
 struct inet_bind_bucket {
-#ifdef CONFIG_NET_NS
-	struct net		*ib_net;
-#endif
+	struct net_ctx		ib_net_ctx;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
@@ -90,7 +88,22 @@ struct inet_bind_bucket {
 
 static inline struct net *ib_net(struct inet_bind_bucket *ib)
 {
-	return read_pnet(&ib->ib_net);
+	return read_pnet(&ib->ib_net_ctx.net);
+}
+
+static inline
+void ib_net_ctx_set(struct inet_bind_bucket *ib, struct net_ctx *ctx)
+{
+	write_pnet(&ib->ib_net_ctx.net, hold_net(ctx->net));
+}
+
+static inline
+int ib_net_ctx_eq(struct inet_bind_bucket *ib, struct net_ctx *ctx)
+{
+	if (net_eq(ib_net(ib), ctx->net))
+		return 1;
+
+	return 0;
 }
 
 #define inet_bind_bucket_for_each(tb, head) \
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9111a4e22155..1485dac0ead5 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -61,7 +61,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb != NULL) {
-		write_pnet(&tb->ib_net, hold_net(net));
+		write_pnet(&tb->ib_net_ctx.net, hold_net(net));
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 08/29] net: Flip fib_info to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (6 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 07/29] net: Flip inet_bind_bucket to net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 09/29] net: Flip ip6_flowlabel " David Ahern
                   ` (28 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip_fib.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 5bd120e4bc0a..dca7f30be57f 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -98,7 +98,8 @@ struct fib_nh {
 struct fib_info {
 	struct hlist_node	fib_hash;
 	struct hlist_node	fib_lhash;
-	struct net		*fib_net;
+	struct net_ctx		fib_net_ctx;
+#define fib_net  fib_net_ctx.net
 	int			fib_treeref;
 	atomic_t		fib_clntref;
 	unsigned int		fib_flags;
@@ -122,6 +123,14 @@ struct fib_info {
 #define fib_dev		fib_nh[0].nh_dev
 };
 
+static inline
+int fib_net_ctx_eq(const struct fib_info *fi, const struct net_ctx *ctx)
+{
+	if (net_eq(fi->fib_net_ctx.net, ctx->net))
+		return 1;
+
+	return 0;
+}
 
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 struct fib_rule;
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 09/29] net: Flip ip6_flowlabel to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (7 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 08/29] net: Flip fib_info " David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 10/29] net: Flip neigh structs " David Ahern
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ipv6.h | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8027ca53e31f..2d025ed7a183 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -238,9 +238,20 @@ struct ip6_flowlabel {
 	} owner;
 	unsigned long		lastuse;
 	unsigned long		expires;
-	struct net		*fl_net;
+	struct net_ctx		fl_net_ctx;
+#define fl_net  fl_net_ctx.net
 };
 
+static inline
+int fl_net_ctx_eq(struct ip6_flowlabel *fl, struct net_ctx *ctx)
+{
+#ifdef CONFIG_NET_NS
+	return net_eq(fl->fl_net, ctx->net);
+#else
+	return 1;
+#endif
+}
+
 #define IPV6_FLOWINFO_MASK	cpu_to_be32(0x0FFFFFFF)
 #define IPV6_FLOWLABEL_MASK	cpu_to_be32(0x000FFFFF)
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 10/29] net: Flip neigh structs to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (8 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 09/29] net: Flip ip6_flowlabel " David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 11/29] net: Flip nl_info " David Ahern
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/neighbour.h | 44 ++++++++++++++++++++++++++++++++++++--------
 net/core/neighbour.c    |  6 +++---
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 76f708486aae..6228edd1e483 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -65,9 +65,7 @@ enum {
 };
 
 struct neigh_parms {
-#ifdef CONFIG_NET_NS
-	struct net *net;
-#endif
+	struct net_ctx net_ctx;
 	struct net_device *dev;
 	struct list_head list;
 	int	(*neigh_setup)(struct neighbour *);
@@ -167,9 +165,7 @@ struct neigh_ops {
 
 struct pneigh_entry {
 	struct pneigh_entry	*next;
-#ifdef CONFIG_NET_NS
-	struct net		*net;
-#endif
+	struct net_ctx		net_ctx;
 	struct net_device	*dev;
 	u8			flags;
 	u8			key[0];
@@ -281,9 +277,22 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms);
 static inline
 struct net *neigh_parms_net(const struct neigh_parms *parms)
 {
-	return read_pnet(&parms->net);
+	return read_pnet(&parms->net_ctx.net);
 }
 
+static inline
+int neigh_parms_net_ctx_eq(const struct neigh_parms *parms,
+			   const struct net_ctx *net_ctx)
+{
+#ifdef CONFIG_NET_NS
+	if (net_eq(neigh_parms_net(parms), net_ctx->net))
+		return 1;
+
+	return 0;
+#else
+	return 1;
+#endif
+}
 unsigned long neigh_rand_reach_time(unsigned long base);
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
@@ -298,7 +307,26 @@ int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *key,
 
 static inline struct net *pneigh_net(const struct pneigh_entry *pneigh)
 {
-	return read_pnet(&pneigh->net);
+	return read_pnet(&pneigh->net_ctx.net);
+}
+static inline
+void pneigh_net_ctx_set(struct pneigh_entry *pneigh,
+			const struct net_ctx *net_ctx)
+{
+	write_pnet(&pneigh->net_ctx.net, hold_net(net_ctx->net));
+}
+static inline
+int pneigh_net_ctx_eq(const struct pneigh_entry *pneigh,
+		      const struct net_ctx *net_ctx)
+{
+#ifdef CONFIG_NET_NS
+	if (net_eq(pneigh_net(pneigh), net_ctx->net))
+		return 1;
+
+	return 0;
+#else
+	return 1;
+#endif
 }
 
 void neigh_app_ns(struct neighbour *n);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 70fe9e10ac86..bd77804849cc 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -601,7 +601,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 	if (!n)
 		goto out;
 
-	write_pnet(&n->net, hold_net(net));
+	write_pnet(&n->net_ctx.net, hold_net(net));
 	memcpy(n->key, pkey, key_len);
 	n->dev = dev;
 	if (dev)
@@ -1464,7 +1464,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
 		dev_hold(dev);
 		p->dev = dev;
-		write_pnet(&p->net, hold_net(net));
+		write_pnet(&p->net_ctx.net, hold_net(net));
 		p->sysctl_table = NULL;
 
 		if (ops->ndo_neigh_setup && ops->ndo_neigh_setup(dev, p)) {
@@ -1523,7 +1523,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 
 	INIT_LIST_HEAD(&tbl->parms_list);
 	list_add(&tbl->parms.list, &tbl->parms_list);
-	write_pnet(&tbl->parms.net, &init_net);
+	write_pnet(&tbl->parms.net_ctx.net, &init_net);
 	atomic_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 11/29] net: Flip nl_info to net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (9 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 10/29] net: Flip neigh structs " David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 12/29] net: Add device lookups by net_ctx David Ahern
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/netlink.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index e010ee8da41d..587a6ef973e5 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 #include <linux/netlink.h>
 #include <linux/jiffies.h>
+#include <net/net_namespace.h>
 
 /* ========================================================================
  *         Netlink Messages and Attributes Interface (As Seen On TV)
@@ -221,7 +222,8 @@ struct nla_policy {
  */
 struct nl_info {
 	struct nlmsghdr		*nlh;
-	struct net		*nl_net;
+	struct net_ctx		nl_net_ctx;
+#define nl_net  nl_net_ctx.net
 	u32			portid;
 };
 
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 12/29] net: Add device lookups by net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (10 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 11/29] net: Flip nl_info " David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 13/29] net: Convert function arg from struct net to struct net_ctx David Ahern
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

For now the _ctx variants use only the namespace, so they mirror the
existing lookups. A later patch adds the VRF check.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
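Reviewer note, not part of the commit: a minimal userspace sketch of
the wrapper pattern this patch introduces. All names prefixed with
toy_ are hypothetical, and the structs are simplified stand-ins rather
than the kernel definitions. The _ctx variant only forwards ctx->net
to the namespace-scoped lookup, which is why the two currently return
the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; assumptions for illustration only. */
struct net_device { int ifindex; };
struct net { struct net_device *devs; size_t ndevs; };
struct net_ctx { struct net *net; };

/* Toy namespace-scoped lookup in the role of __dev_get_by_index(). */
static struct net_device *toy_dev_get_by_index(struct net *net, int ifindex)
{
	size_t i;

	for (i = 0; i < net->ndevs; i++) {
		if (net->devs[i].ifindex == ifindex)
			return &net->devs[i];
	}
	return NULL;
}

/* _ctx variant: a thin wrapper that uses only the namespace for now. */
static struct net_device *toy_dev_get_by_index_ctx(struct net_ctx *ctx,
						   int ifindex)
{
	assert(ctx != NULL);
	return toy_dev_get_by_index(ctx->net, ifindex);
}
```

Keeping the wrapper as a separate entry point means callers can be
converted now, and any additional filtering can later be added in one
place without touching them again.
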
 include/linux/netdevice.h |  3 +++
 net/core/dev.c            | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55a221fec12b..43bb40260bfa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2164,6 +2164,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 struct net_device *dev_get_by_name(struct net *net, const char *name);
 struct net_device *dev_get_by_name_rcu(struct net *net, const char *name);
 struct net_device *__dev_get_by_name(struct net *net, const char *name);
+struct net_device *__dev_get_by_name_ctx(struct net_ctx *ctx, const char *name);
 int dev_alloc_name(struct net_device *dev, const char *name);
 int dev_open(struct net_device *dev);
 int dev_close(struct net_device *dev);
@@ -2187,7 +2188,9 @@ int init_dummy_netdev(struct net_device *dev);
 
 struct net_device *dev_get_by_index(struct net *net, int ifindex);
 struct net_device *__dev_get_by_index(struct net *net, int ifindex);
+struct net_device *__dev_get_by_index_ctx(struct net_ctx *ctx, int ifindex);
 struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex);
+struct net_device *dev_get_by_index_rcu_ctx(struct net_ctx *ctx, int ifindex);
 int netdev_get_name(struct net *net, char *name, int ifindex);
 int dev_restart(struct net_device *dev);
 int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 1d564d68e31a..624335140857 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -684,6 +684,14 @@ struct net_device *__dev_get_by_name(struct net *net, const char *name)
 }
 EXPORT_SYMBOL(__dev_get_by_name);
 
+struct net_device *__dev_get_by_name_ctx(struct net_ctx *ctx, const char *name)
+{
+	struct net_device *dev = __dev_get_by_name(ctx->net, name);
+
+	return dev;
+}
+EXPORT_SYMBOL(__dev_get_by_name_ctx);
+
 /**
  *	dev_get_by_name_rcu	- find a device by its name
  *	@net: the applicable net namespace
@@ -759,6 +767,14 @@ struct net_device *__dev_get_by_index(struct net *net, int ifindex)
 }
 EXPORT_SYMBOL(__dev_get_by_index);
 
+struct net_device *__dev_get_by_index_ctx(struct net_ctx *ctx, int ifindex)
+{
+	struct net_device *dev = __dev_get_by_index(ctx->net, ifindex);
+
+	return dev;
+}
+EXPORT_SYMBOL(__dev_get_by_index_ctx);
+
 /**
  *	dev_get_by_index_rcu - find a device by its ifindex
  *	@net: the applicable net namespace
@@ -783,6 +799,25 @@ struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex)
 }
 EXPORT_SYMBOL(dev_get_by_index_rcu);
 
+/**
+ *	dev_get_by_index_rcu_ctx - find a device by its ifindex
+ *	@ctx: the applicable net context
+ *	@ifindex: index of device
+ *
+ *	Search for an interface by index. Returns %NULL if the device
+ *	is not found or a pointer to the device. The device has not
+ *	had its reference counter increased so the caller must be careful
+ *	about locking. The caller must hold RCU lock.
+ */
+
+struct net_device *dev_get_by_index_rcu_ctx(struct net_ctx *ctx, int ifindex)
+{
+	struct net_device *dev = dev_get_by_index_rcu(ctx->net, ifindex);
+
+	return dev;
+}
+EXPORT_SYMBOL(dev_get_by_index_rcu_ctx);
+
 
 /**
  *	dev_get_by_index - find a device by its ifindex
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 13/29] net: Convert function arg from struct net to struct net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (11 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 12/29] net: Add device lookups by net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 14/29] net: vrf: Introduce vrf header file David Ahern
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

A sweep of core and ipv4 code converting users of struct net. It is
difficult to break this into smaller patches. There is no functional
change.

Sysctl settings and MIB counters stay with the namespace. Almost all
other users of struct net are converted to struct net_ctx. This is a
stepping stone for follow-on patches which add a VRF id to the net
context.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
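Reviewer note, not part of the commit: a minimal userspace sketch of
what a converted function looks like. The structs are simplified
stand-ins, toy_lookup() is hypothetical, and sock_net_ctx_eq() here is
a reduced version that compares only the namespace pointer. The
per-namespace counter is still reached through ctx->net, while object
matching goes through the context comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; assumptions for illustration only. */
struct net { unsigned long lookups; };	/* plays the role of a MIB counter */
struct net_ctx { struct net *net; };
struct sock { struct net *sk_net; int port; };

/* Context comparison; in this sketch only the namespace is compared. */
static int sock_net_ctx_eq(const struct sock *sk, const struct net_ctx *ctx)
{
	return sk->sk_net == ctx->net;
}

/* Converted style: the caller passes a context, not a bare namespace. */
static struct sock *toy_lookup(struct net_ctx *ctx, struct sock *tbl,
			       size_t n, int port)
{
	size_t i;

	assert(ctx != NULL && ctx->net != NULL);
	ctx->net->lookups++;	/* counters stay with the namespace */

	for (i = 0; i < n; i++) {
		if (tbl[i].port == port && sock_net_ctx_eq(&tbl[i], ctx))
			return &tbl[i];
	}
	return NULL;
}
```

With the comparison hidden behind one helper, the match condition can
grow beyond the namespace without another sweep of the call sites.
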
 include/linux/inetdevice.h      |  12 ++---
 include/net/addrconf.h          |  22 ++++----
 include/net/arp.h               |   2 +-
 include/net/dst.h               |  16 +++---
 include/net/fib_rules.h         |   2 +-
 include/net/flow.h              |   3 +-
 include/net/inet6_hashtables.h  |  19 +++----
 include/net/inet_hashtables.h   |  35 ++++++-------
 include/net/ip.h                |  10 ++--
 include/net/ip6_fib.h           |   4 +-
 include/net/ip6_route.h         |  24 ++++-----
 include/net/ip_fib.h            |  25 +++++----
 include/net/neighbour.h         |   8 +--
 include/net/net_namespace.h     |   8 +--
 include/net/route.h             |  42 ++++++++--------
 include/net/transp_v6.h         |   2 +-
 include/net/udp.h               |   8 +--
 include/net/xfrm.h              |  28 +++++------
 net/core/dev.c                  |  18 +++----
 net/core/fib_rules.c            |  23 +++++----
 net/core/flow.c                 |   5 +-
 net/core/neighbour.c            |  94 +++++++++++++++++-----------------
 net/ipv4/af_inet.c              |  16 +++---
 net/ipv4/arp.c                  |  66 +++++++++++++-----------
 net/ipv4/datagram.c             |   3 +-
 net/ipv4/devinet.c              |  34 +++++++------
 net/ipv4/fib_frontend.c         |  61 +++++++++++-----------
 net/ipv4/fib_rules.c            |  10 ++--
 net/ipv4/fib_semantics.c        |  29 ++++++-----
 net/ipv4/icmp.c                 |  36 ++++++-------
 net/ipv4/igmp.c                 |  46 +++++++++--------
 net/ipv4/inet_connection_sock.c |  18 ++++---
 net/ipv4/inet_diag.c            |  13 ++---
 net/ipv4/inet_hashtables.c      |  39 +++++++-------
 net/ipv4/ip_input.c             |   6 ++-
 net/ipv4/ip_options.c           |  20 ++++----
 net/ipv4/ip_output.c            |  11 ++--
 net/ipv4/ip_sockglue.c          |  14 ++++--
 net/ipv4/ipconfig.c             |   6 ++-
 net/ipv4/ipmr.c                 |  50 ++++++++++--------
 net/ipv4/netfilter.c            |  12 ++---
 net/ipv4/ping.c                 |  39 +++++++-------
 net/ipv4/raw.c                  |  30 +++++------
 net/ipv4/route.c                | 104 +++++++++++++++++++++-----------------
 net/ipv4/syncookies.c           |   3 +-
 net/ipv4/tcp_ipv4.c             |  41 ++++++++-------
 net/ipv4/udp.c                  | 109 ++++++++++++++++++++++------------------
 net/ipv4/udp_diag.c             |  11 ++--
 net/ipv4/xfrm4_policy.c         |  12 ++---
 net/sctp/protocol.c             |   9 ++--
 net/xfrm/xfrm_policy.c          |   9 ++--
 51 files changed, 684 insertions(+), 583 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 0a21fbefdfbe..b3aa61d4253f 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -153,18 +153,18 @@ int unregister_inetaddr_notifier(struct notifier_block *nb);
 void inet_netconf_notify_devconf(struct net *net, int type, int ifindex,
 				 struct ipv4_devconf *devconf);
 
-struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref);
-static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
+struct net_device *__ip_dev_find(struct net_ctx *ctx, __be32 addr, bool devref);
+static inline struct net_device *ip_dev_find(struct net_ctx *ctx, __be32 addr)
 {
-	return __ip_dev_find(net, addr, true);
+	return __ip_dev_find(ctx, addr, true);
 }
 
 int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
-int devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
+int devinet_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *);
 void devinet_init(void);
-struct in_device *inetdev_by_index(struct net *, int);
+struct in_device *inetdev_by_index(struct net_ctx *, int);
 __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope);
-__be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst,
+__be32 inet_confirm_addr(struct net_ctx *ctx, struct in_device *in_dev, __be32 dst,
 			 __be32 local, int scope);
 struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix,
 				    __be32 mask);
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index d13573bb879e..a872e62b2003 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -56,15 +56,15 @@ struct prefix_info {
 int addrconf_init(void);
 void addrconf_cleanup(void);
 
-int addrconf_add_ifaddr(struct net *net, void __user *arg);
-int addrconf_del_ifaddr(struct net *net, void __user *arg);
-int addrconf_set_dstaddr(struct net *net, void __user *arg);
+int addrconf_add_ifaddr(struct net_ctx *ctx, void __user *arg);
+int addrconf_del_ifaddr(struct net_ctx *ctx, void __user *arg);
+int addrconf_set_dstaddr(struct net_ctx *ctx, void __user *arg);
 
-int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
+int ipv6_chk_addr(struct net_ctx *ctx, const struct in6_addr *addr,
 		  const struct net_device *dev, int strict);
 
 #if defined(CONFIG_IPV6_MIP6) || defined(CONFIG_IPV6_MIP6_MODULE)
-int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr);
+int ipv6_chk_home_addr(struct net_ctx *ctx, const struct in6_addr *addr);
 #endif
 
 bool ipv6_chk_custom_prefix(const struct in6_addr *addr,
@@ -73,11 +73,11 @@ bool ipv6_chk_custom_prefix(const struct in6_addr *addr,
 
 int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev);
 
-struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net,
+struct inet6_ifaddr *ipv6_get_ifaddr(struct net_ctx *ctx,
 				     const struct in6_addr *addr,
 				     struct net_device *dev, int strict);
 
-int ipv6_dev_get_saddr(struct net *net, const struct net_device *dev,
+int ipv6_dev_get_saddr(struct net_ctx *ctx, const struct net_device *dev,
 		       const struct in6_addr *daddr, unsigned int srcprefs,
 		       struct in6_addr *saddr);
 int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr *addr,
@@ -116,7 +116,7 @@ static inline int addrconf_finite_timeout(unsigned long timeout)
 int ipv6_addr_label_init(void);
 void ipv6_addr_label_cleanup(void);
 void ipv6_addr_label_rtnl_register(void);
-u32 ipv6_addr_label(struct net *net, const struct in6_addr *addr,
+u32 ipv6_addr_label(struct net_ctx *ctx, const struct in6_addr *addr,
 		    int type, int ifindex);
 
 /*
@@ -205,9 +205,9 @@ void ipv6_sock_ac_close(struct sock *sk);
 int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr);
 int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr);
 void ipv6_ac_destroy_dev(struct inet6_dev *idev);
-bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
+bool ipv6_chk_acast_addr(struct net_ctx *ctx, struct net_device *dev,
 			 const struct in6_addr *addr);
-bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
+bool ipv6_chk_acast_addr_src(struct net_ctx *ctx, struct net_device *dev,
 			     const struct in6_addr *addr);
 
 /* Device notifier */
@@ -215,7 +215,7 @@ int register_inet6addr_notifier(struct notifier_block *nb);
 int unregister_inet6addr_notifier(struct notifier_block *nb);
 int inet6addr_notifier_call_chain(unsigned long val, void *v);
 
-void inet6_netconf_notify_devconf(struct net *net, int type, int ifindex,
+void inet6_netconf_notify_devconf(struct net_ctx *ctx, int type, int ifindex,
 				  struct ipv6_devconf *devconf);
 
 /**
diff --git a/include/net/arp.h b/include/net/arp.h
index 73c49864076b..8189c7358af5 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -48,7 +48,7 @@ static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, u32
 
 void arp_init(void);
 int arp_find(unsigned char *haddr, struct sk_buff *skb);
-int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg);
+int arp_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg);
 void arp_send(int type, int ptype, __be32 dest_ip,
 	      struct net_device *dev, __be32 src_ip,
 	      const unsigned char *dest_hw,
diff --git a/include/net/dst.h b/include/net/dst.h
index a8ae4e760778..13ee371e8486 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -323,7 +323,7 @@ static inline void skb_dst_force(struct sk_buff *skb)
  *	so make some cleanups. (no accounting done)
  */
 static inline void __skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev,
-				   struct net *net)
+				   struct net_ctx *ctx)
 {
 	skb->dev = dev;
 
@@ -334,7 +334,7 @@ static inline void __skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev,
 	 */
 	skb_clear_hash_if_not_l4(skb);
 	skb_set_queue_mapping(skb, 0);
-	skb_scrub_packet(skb, !net_eq(net, dev_net(dev)));
+	skb_scrub_packet(skb, !dev_net_ctx_eq(dev, ctx));
 }
 
 /**
@@ -347,12 +347,12 @@ static inline void __skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev,
  *	Note: this accounting is not SMP safe.
  */
 static inline void skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev,
-				 struct net *net)
+				 struct net_ctx *ctx)
 {
 	/* TODO : stats should be SMP safe */
 	dev->stats.rx_packets++;
 	dev->stats.rx_bytes += skb->len;
-	__skb_tunnel_rx(skb, dev, net);
+	__skb_tunnel_rx(skb, dev, ctx);
 }
 
 /* Children define the path of the packet through the
@@ -485,7 +485,7 @@ enum {
 
 struct flowi;
 #ifndef CONFIG_XFRM
-static inline struct dst_entry *xfrm_lookup(struct net *net,
+static inline struct dst_entry *xfrm_lookup(struct net_ctx *net_ctx,
 					    struct dst_entry *dst_orig,
 					    const struct flowi *fl, struct sock *sk,
 					    int flags)
@@ -493,7 +493,7 @@ static inline struct dst_entry *xfrm_lookup(struct net *net,
 	return dst_orig;
 }
 
-static inline struct dst_entry *xfrm_lookup_route(struct net *net,
+static inline struct dst_entry *xfrm_lookup_route(struct net_ctx *ctx,
 						  struct dst_entry *dst_orig,
 						  const struct flowi *fl,
 						  struct sock *sk,
@@ -508,11 +508,11 @@ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
 }
 
 #else
-struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+struct dst_entry *xfrm_lookup(struct net_ctx *ctx, struct dst_entry *dst_orig,
 			      const struct flowi *fl, struct sock *sk,
 			      int flags);
 
-struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
+struct dst_entry *xfrm_lookup_route(struct net_ctx *ctx, struct dst_entry *dst_orig,
 				    const struct flowi *fl, struct sock *sk,
 				    int flags);
 
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index b02bd45e3e97..1a545b23494e 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -118,7 +118,7 @@ static inline u32 frh_get_table(struct fib_rule_hdr *frh, struct nlattr **nla)
 }
 
 struct fib_rules_ops *fib_rules_register(const struct fib_rules_ops *,
-					 struct net *);
+					 struct net_ctx *);
 void fib_rules_unregister(struct fib_rules_ops *);
 
 int fib_rules_lookup(struct fib_rules_ops *, struct flowi *, int flags,
diff --git a/include/net/flow.h b/include/net/flow.h
index 8109a159d1b3..07e7a58b9aac 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -205,6 +205,7 @@ static inline size_t flow_key_size(u16 family)
 #define FLOW_DIR_FWD	2
 
 struct net;
+struct net_ctx;
 struct sock;
 struct flow_cache_ops;
 
@@ -222,7 +223,7 @@ typedef struct flow_cache_object *(*flow_resolve_t)(
 		struct net *net, const struct flowi *key, u16 family,
 		u8 dir, struct flow_cache_object *oldobj, void *ctx);
 
-struct flow_cache_object *flow_cache_lookup(struct net *net,
+struct flow_cache_object *flow_cache_lookup(struct net_ctx *net_ctx,
 					    const struct flowi *key, u16 family,
 					    u8 dir, flow_resolve_t resolver,
 					    void *ctx);
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 9201afe083fa..7b8e7dd06cc8 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -46,21 +46,21 @@ int __inet6_hash(struct sock *sk, struct inet_timewait_sock *twp);
  *
  * The sockhash lock must be held as a reader here.
  */
-struct sock *__inet6_lookup_established(struct net *net,
+struct sock *__inet6_lookup_established(struct net_ctx *ctx,
 					struct inet_hashinfo *hashinfo,
 					const struct in6_addr *saddr,
 					const __be16 sport,
 					const struct in6_addr *daddr,
 					const u16 hnum, const int dif);
 
-struct sock *inet6_lookup_listener(struct net *net,
+struct sock *inet6_lookup_listener(struct net_ctx *ctx,
 				   struct inet_hashinfo *hashinfo,
 				   const struct in6_addr *saddr,
 				   const __be16 sport,
 				   const struct in6_addr *daddr,
 				   const unsigned short hnum, const int dif);
 
-static inline struct sock *__inet6_lookup(struct net *net,
+static inline struct sock *__inet6_lookup(struct net_ctx *ctx,
 					  struct inet_hashinfo *hashinfo,
 					  const struct in6_addr *saddr,
 					  const __be16 sport,
@@ -68,12 +68,12 @@ static inline struct sock *__inet6_lookup(struct net *net,
 					  const u16 hnum,
 					  const int dif)
 {
-	struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
+	struct sock *sk = __inet6_lookup_established(ctx, hashinfo, saddr,
 						sport, daddr, hnum, dif);
 	if (sk)
 		return sk;
 
-	return inet6_lookup_listener(net, hashinfo, saddr, sport,
+	return inet6_lookup_listener(ctx, hashinfo, saddr, sport,
 				     daddr, hnum, dif);
 }
 
@@ -84,29 +84,30 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
 					      int iif)
 {
 	struct sock *sk = skb_steal_sock(skb);
+	struct net_ctx ctx = SKB_NET_CTX_DST(skb);
 
 	if (sk)
 		return sk;
 
-	return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
+	return __inet6_lookup(&ctx, hashinfo,
 			      &ipv6_hdr(skb)->saddr, sport,
 			      &ipv6_hdr(skb)->daddr, ntohs(dport),
 			      iif);
 }
 
-struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
+struct sock *inet6_lookup(struct net_ctx *ctx, struct inet_hashinfo *hashinfo,
 			  const struct in6_addr *saddr, const __be16 sport,
 			  const struct in6_addr *daddr, const __be16 dport,
 			  const int dif);
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif)	\
+#define INET6_MATCH(__sk, __ctx, __saddr, __daddr, __ports, __dif)	\
 	(((__sk)->sk_portpair == (__ports))			&&	\
 	 ((__sk)->sk_family == AF_INET6)			&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))		&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))	&&	\
 	 (!(__sk)->sk_bound_dev_if	||				\
 	   ((__sk)->sk_bound_dev_if == (__dif))) 		&&	\
-	 net_eq(sock_net(__sk), (__net)))
+	 sock_net_ctx_eq((__sk), (__ctx)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c9e8b7b7331a..9ddc1b2309ce 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -230,7 +230,7 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 }
 
 struct inet_bind_bucket *
-inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
+inet_bind_bucket_create(struct kmem_cache *cachep, struct net_ctx *ctx,
 			struct inet_bind_hashbucket *head,
 			const unsigned short snum);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
@@ -267,19 +267,19 @@ int __inet_hash_nolisten(struct sock *sk, struct inet_timewait_sock *tw);
 void inet_hash(struct sock *sk);
 void inet_unhash(struct sock *sk);
 
-struct sock *__inet_lookup_listener(struct net *net,
+struct sock *__inet_lookup_listener(struct net_ctx *ctx,
 				    struct inet_hashinfo *hashinfo,
 				    const __be32 saddr, const __be16 sport,
 				    const __be32 daddr,
 				    const unsigned short hnum,
 				    const int dif);
 
-static inline struct sock *inet_lookup_listener(struct net *net,
+static inline struct sock *inet_lookup_listener(struct net_ctx *ctx,
 		struct inet_hashinfo *hashinfo,
 		__be32 saddr, __be16 sport,
 		__be32 daddr, __be16 dport, int dif)
 {
-	return __inet_lookup_listener(net, hashinfo, saddr, sport,
+	return __inet_lookup_listener(ctx, hashinfo, saddr, sport,
 				      daddr, ntohs(dport), dif);
 }
 
@@ -312,23 +312,23 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 				   (((__force __u64)(__be32)(__daddr)) << 32) | \
 				   ((__force __u64)(__be32)(__saddr)))
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif)	\
+#define INET_MATCH(__sk, __ctx, __cookie, __saddr, __daddr, __ports, __dif)	\
 	(((__sk)->sk_portpair == (__ports))			&&	\
 	 ((__sk)->sk_addrpair == (__cookie))			&&	\
 	 (!(__sk)->sk_bound_dev_if	||				\
 	   ((__sk)->sk_bound_dev_if == (__dif))) 		&& 	\
-	 net_eq(sock_net(__sk), (__net)))
+	 sock_net_ctx_eq((__sk), (__ctx)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
 	const int __name __deprecated __attribute__((unused))
 
-#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+#define INET_MATCH(__sk, __ctx, __cookie, __saddr, __daddr, __ports, __dif) \
 	(((__sk)->sk_portpair == (__ports))		&&		\
 	 ((__sk)->sk_daddr	== (__saddr))		&&		\
 	 ((__sk)->sk_rcv_saddr	== (__daddr))		&&		\
 	 (!(__sk)->sk_bound_dev_if	||				\
 	   ((__sk)->sk_bound_dev_if == (__dif))) 	&&		\
-	 net_eq(sock_net(__sk), (__net)))
+	 sock_net_ctx_eq((__sk), (__ctx)))
 #endif /* 64-bit arch */
 
 /*
@@ -337,37 +337,37 @@ static inline struct sock *inet_lookup_listener(struct net *net,
  *
  * Local BH must be disabled here.
  */
-struct sock *__inet_lookup_established(struct net *net,
+struct sock *__inet_lookup_established(struct net_ctx *ctx,
 				       struct inet_hashinfo *hashinfo,
 				       const __be32 saddr, const __be16 sport,
 				       const __be32 daddr, const u16 hnum,
 				       const int dif);
 
 static inline struct sock *
-	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
+	inet_lookup_established(struct net_ctx *ctx, struct inet_hashinfo *hashinfo,
 				const __be32 saddr, const __be16 sport,
 				const __be32 daddr, const __be16 dport,
 				const int dif)
 {
-	return __inet_lookup_established(net, hashinfo, saddr, sport, daddr,
+	return __inet_lookup_established(ctx, hashinfo, saddr, sport, daddr,
 					 ntohs(dport), dif);
 }
 
-static inline struct sock *__inet_lookup(struct net *net,
+static inline struct sock *__inet_lookup(struct net_ctx *ctx,
 					 struct inet_hashinfo *hashinfo,
 					 const __be32 saddr, const __be16 sport,
 					 const __be32 daddr, const __be16 dport,
 					 const int dif)
 {
 	u16 hnum = ntohs(dport);
-	struct sock *sk = __inet_lookup_established(net, hashinfo,
+	struct sock *sk = __inet_lookup_established(ctx, hashinfo,
 				saddr, sport, daddr, hnum, dif);
 
-	return sk ? : __inet_lookup_listener(net, hashinfo, saddr, sport,
+	return sk ? : __inet_lookup_listener(ctx, hashinfo, saddr, sport,
 					     daddr, hnum, dif);
 }
 
-static inline struct sock *inet_lookup(struct net *net,
+static inline struct sock *inet_lookup(struct net_ctx *ctx,
 				       struct inet_hashinfo *hashinfo,
 				       const __be32 saddr, const __be16 sport,
 				       const __be32 daddr, const __be16 dport,
@@ -376,7 +376,7 @@ static inline struct sock *inet_lookup(struct net *net,
 	struct sock *sk;
 
 	local_bh_disable();
-	sk = __inet_lookup(net, hashinfo, saddr, sport, daddr, dport, dif);
+	sk = __inet_lookup(ctx, hashinfo, saddr, sport, daddr, dport, dif);
 	local_bh_enable();
 
 	return sk;
@@ -389,11 +389,12 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
 {
 	struct sock *sk = skb_steal_sock(skb);
 	const struct iphdr *iph = ip_hdr(skb);
+	struct net_ctx ctx = SKB_NET_CTX_DST(skb);
 
 	if (sk)
 		return sk;
 	else
-		return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo,
+		return __inet_lookup(&ctx, hashinfo,
 				     iph->saddr, sport,
 				     iph->daddr, dport, inet_iif(skb));
 }
diff --git a/include/net/ip.h b/include/net/ip.h
index 14211eaff17f..b1516fa33ac4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -181,7 +181,7 @@ static inline __u8 ip_reply_arg_flowi_flags(const struct ip_reply_arg *arg)
 	return (arg->flags & IP_REPLY_ARG_NOSRCCHECK) ? FLOWI_FLAG_ANYSRC : 0;
 }
 
-void ip_send_unicast_reply(struct net *net, struct sk_buff *skb,
+void ip_send_unicast_reply(struct net_ctx *ctx, struct sk_buff *skb,
 			   const struct ip_options *sopt,
 			   __be32 daddr, __be32 saddr,
 			   const struct ip_reply_arg *arg,
@@ -523,11 +523,11 @@ static inline int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb)
 }
 
 void ip_options_fragment(struct sk_buff *skb);
-int ip_options_compile(struct net *net, struct ip_options *opt,
+int ip_options_compile(struct net_ctx *ctx, struct ip_options *opt,
 		       struct sk_buff *skb);
-int ip_options_get(struct net *net, struct ip_options_rcu **optp,
+int ip_options_get(struct net_ctx *ctx, struct ip_options_rcu **optp,
 		   unsigned char *data, int optlen);
-int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
+int ip_options_get_from_user(struct net_ctx *ctx, struct ip_options_rcu **optp,
 			     unsigned char __user *data, int optlen);
 void ip_options_undo(struct ip_options *opt);
 void ip_forward_options(struct sk_buff *skb);
@@ -539,7 +539,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset);
-int ip_cmsg_send(struct net *net, struct msghdr *msg,
+int ip_cmsg_send(struct net_ctx *ctx, struct msghdr *msg,
 		 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
 		  unsigned int optlen);
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 20e80fa7bbdd..b9b3df17b99d 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -293,7 +293,7 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 			      const struct in6_addr *daddr, int dst_len,
 			      const struct in6_addr *saddr, int src_len);
 
-void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
+void fib6_clean_all(struct net_ctx *ctx, int (*func)(struct rt6_info *, void *arg),
 		    void *arg);
 
 int fib6_add(struct fib6_node *root, struct rt6_info *rt,
@@ -302,7 +302,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info);
 
 void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info);
 
-void fib6_run_gc(unsigned long expires, struct net *net, bool force);
+void fib6_run_gc(unsigned long expires, struct net_ctx *ctx, bool force);
 
 void fib6_gc_cleanup(void);
 
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 1d09b46c1e48..9967de3772c3 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,31 +64,31 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
 
 void ip6_route_input(struct sk_buff *skb);
 
-struct dst_entry *ip6_route_output(struct net *net, const struct sock *sk,
+struct dst_entry *ip6_route_output(struct net_ctx *ctx, const struct sock *sk,
 				   struct flowi6 *fl6);
-struct dst_entry *ip6_route_lookup(struct net *net, struct flowi6 *fl6,
+struct dst_entry *ip6_route_lookup(struct net_ctx *ctx, struct flowi6 *fl6,
 				   int flags);
 
 int ip6_route_init(void);
 void ip6_route_cleanup(void);
 
-int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
+int ipv6_route_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg);
 
 int ip6_route_add(struct fib6_config *cfg);
 int ip6_ins_rt(struct rt6_info *);
 int ip6_del_rt(struct rt6_info *);
 
-int ip6_route_get_saddr(struct net *net, struct rt6_info *rt,
+int ip6_route_get_saddr(struct net_ctx *ctx, struct rt6_info *rt,
 			const struct in6_addr *daddr, unsigned int prefs,
 			struct in6_addr *saddr);
 
-struct rt6_info *rt6_lookup(struct net *net, const struct in6_addr *daddr,
+struct rt6_info *rt6_lookup(struct net_ctx *ctx, const struct in6_addr *daddr,
 			    const struct in6_addr *saddr, int oif, int flags);
 
 struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6);
 int icmp6_dst_gc(void);
 
-void fib6_force_start_gc(struct net *net);
+void fib6_force_start_gc(struct net_ctx *ctx);
 
 struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 				    const struct in6_addr *addr, bool anycast);
@@ -107,11 +107,11 @@ void rt6_purge_dflt_routers(struct net *net);
 int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
 		  const struct in6_addr *gwaddr);
 
-void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif,
+void ip6_update_pmtu(struct sk_buff *skb, struct net_ctx *ctx, __be32 mtu, int oif,
 		     u32 mark);
 void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu);
-void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark);
-void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif,
+void ip6_redirect(struct sk_buff *skb, struct net_ctx *ctx, int oif, u32 mark);
+void ip6_redirect_no_header(struct sk_buff *skb, struct net_ctx *ctx, int oif,
 			    u32 mark);
 void ip6_sk_redirect(struct sk_buff *skb, struct sock *sk);
 
@@ -120,14 +120,14 @@ struct netlink_callback;
 struct rt6_rtnl_dump_arg {
 	struct sk_buff *skb;
 	struct netlink_callback *cb;
-	struct net *net;
+	struct net_ctx *ctx;
 };
 
 int rt6_dump_route(struct rt6_info *rt, void *p_arg);
-void rt6_ifdown(struct net *net, struct net_device *dev);
+void rt6_ifdown(struct net_ctx *ctx, struct net_device *dev);
 void rt6_mtu_change(struct net_device *dev, unsigned int mtu);
 void rt6_remove_prefsrc(struct inet6_ifaddr *ifp);
-void rt6_clean_tohost(struct net *net, struct in6_addr *gateway);
+void rt6_clean_tohost(struct net_ctx *ctx, struct in6_addr *gateway);
 
 
 /*
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index dca7f30be57f..85f5ddacba8d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -175,19 +175,19 @@ struct fib_result_nl {
 #define FIB_TABLE_HASHSZ 2
 #endif
 
-__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh);
+__be32 fib_info_update_nh_saddr(struct net_ctx *ctx, struct fib_nh *nh);
 
-#define FIB_RES_SADDR(net, res)				\
+#define FIB_RES_SADDR(ctx, res)				\
 	((FIB_RES_NH(res).nh_saddr_genid ==		\
-	  atomic_read(&(net)->ipv4.dev_addr_genid)) ?	\
+	  atomic_read(&(ctx)->net->ipv4.dev_addr_genid)) ? \
 	 FIB_RES_NH(res).nh_saddr :			\
-	 fib_info_update_nh_saddr((net), &FIB_RES_NH(res)))
+	 fib_info_update_nh_saddr((ctx), &FIB_RES_NH(res)))
 #define FIB_RES_GW(res)			(FIB_RES_NH(res).nh_gw)
 #define FIB_RES_DEV(res)		(FIB_RES_NH(res).nh_dev)
 #define FIB_RES_OIF(res)		(FIB_RES_NH(res).nh_oif)
 
-#define FIB_RES_PREFSRC(net, res)	((res).fi->fib_prefsrc ? : \
-					 FIB_RES_SADDR(net, res))
+#define FIB_RES_PREFSRC(ctx, res)	((res).fi->fib_prefsrc ? : \
+					 FIB_RES_SADDR(ctx, res))
 
 struct fib_table {
 	struct hlist_node	tb_hlist;
@@ -228,9 +228,10 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id)
 	return fib_get_table(net, id);
 }
 
-static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
+static inline int fib_lookup(struct net_ctx *ctx, const struct flowi4 *flp,
 			     struct fib_result *res)
 {
+	struct net *net = ctx->net;
 	int err = -ENETUNREACH;
 
 	rcu_read_lock();
@@ -253,11 +254,13 @@ void __net_exit fib4_rules_exit(struct net *net);
 struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
+int __fib_lookup(struct net_ctx *ctx, struct flowi4 *flp, struct fib_result *res);
 
-static inline int fib_lookup(struct net *net, struct flowi4 *flp,
+static inline int fib_lookup(struct net_ctx *ctx, struct flowi4 *flp,
 			     struct fib_result *res)
 {
+	struct net *net = ctx->net;
+
 	if (!net->ipv4.fib_has_custom_rules) {
 		int err = -ENETUNREACH;
 
@@ -279,7 +282,7 @@ static inline int fib_lookup(struct net *net, struct flowi4 *flp,
 
 		return err;
 	}
-	return __fib_lookup(net, flp, res);
+	return __fib_lookup(ctx, flp, res);
 }
 
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
@@ -307,7 +310,7 @@ static inline int fib_num_tclassid_users(struct net *net)
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, int force);
-int fib_sync_down_addr(struct net *net, __be32 local);
+int fib_sync_down_addr(struct net_ctx *ctx, __be32 local);
 int fib_sync_up(struct net_device *dev);
 void fib_select_multipath(struct fib_result *res);
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 6228edd1e483..8cf9bc2236da 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -246,7 +246,7 @@ void neigh_table_init(int index, struct neigh_table *tbl);
 int neigh_table_clear(int index, struct neigh_table *tbl);
 struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
 			       struct net_device *dev);
-struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl, struct net *net,
+struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl, struct net_ctx *ctx,
 				     const void *pkey);
 struct neighbour *__neigh_create(struct neigh_table *tbl, const void *pkey,
 				 struct net_device *dev, bool want_ref);
@@ -297,12 +297,12 @@ unsigned long neigh_rand_reach_time(unsigned long base);
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 		    struct sk_buff *skb);
-struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net *net,
+struct pneigh_entry *pneigh_lookup(struct neigh_table *tbl, struct net_ctx *ctx,
 				   const void *key, struct net_device *dev,
 				   int creat);
-struct pneigh_entry *__pneigh_lookup(struct neigh_table *tbl, struct net *net,
+struct pneigh_entry *__pneigh_lookup(struct neigh_table *tbl, struct net_ctx *ctx,
 				     const void *key, struct net_device *dev);
-int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *key,
+int pneigh_delete(struct neigh_table *tbl, struct net_ctx *ctx, const void *key,
 		  struct net_device *dev);
 
 static inline struct net *pneigh_net(const struct pneigh_entry *pneigh)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index b932f2a83865..e7060b43570d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -368,9 +368,9 @@ static inline void unregister_net_sysctl_table(struct ctl_table_header *header)
 }
 #endif
 
-static inline int rt_genid_ipv4(struct net *net)
+static inline int rt_genid_ipv4(struct net_ctx *ctx)
 {
-	return atomic_read(&net->ipv4.rt_genid);
+	return atomic_read(&ctx->net->ipv4.rt_genid);
 }
 
 static inline void rt_genid_bump_ipv4(struct net *net)
@@ -400,9 +400,9 @@ static inline void rt_genid_bump_all(struct net *net)
 	rt_genid_bump_ipv6(net);
 }
 
-static inline int fnhe_genid(struct net *net)
+static inline int fnhe_genid(struct net_ctx *ctx)
 {
-	return atomic_read(&net->fnhe_genid);
+	return atomic_read(&ctx->net->fnhe_genid);
 }
 
 static inline void fnhe_genid_bump(struct net *net)
diff --git a/include/net/route.h b/include/net/route.h
index fe22d03afb6a..5f0b770225d7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -110,18 +110,18 @@ struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key(struct net *, struct flowi4 *flp);
-struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
+struct rtable *__ip_route_output_key(struct net_ctx *, struct flowi4 *flp);
+struct rtable *ip_route_output_flow(struct net_ctx *, struct flowi4 *flp,
 				    struct sock *sk);
-struct dst_entry *ipv4_blackhole_route(struct net *net,
+struct dst_entry *ipv4_blackhole_route(struct net_ctx *ctx,
 				       struct dst_entry *dst_orig);
 
-static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
+static inline struct rtable *ip_route_output_key(struct net_ctx *ctx, struct flowi4 *flp)
 {
-	return ip_route_output_flow(net, flp, NULL);
+	return ip_route_output_flow(ctx, flp, NULL);
 }
 
-static inline struct rtable *ip_route_output(struct net *net, __be32 daddr,
+static inline struct rtable *ip_route_output(struct net_ctx *ctx, __be32 daddr,
 					     __be32 saddr, u8 tos, int oif)
 {
 	struct flowi4 fl4 = {
@@ -130,10 +130,10 @@ static inline struct rtable *ip_route_output(struct net *net, __be32 daddr,
 		.daddr = daddr,
 		.saddr = saddr,
 	};
-	return ip_route_output_key(net, &fl4);
+	return ip_route_output_key(ctx, &fl4);
 }
 
-static inline struct rtable *ip_route_output_ports(struct net *net, struct flowi4 *fl4,
+static inline struct rtable *ip_route_output_ports(struct net_ctx *ctx, struct flowi4 *fl4,
 						   struct sock *sk,
 						   __be32 daddr, __be32 saddr,
 						   __be16 dport, __be16 sport,
@@ -145,10 +145,10 @@ static inline struct rtable *ip_route_output_ports(struct net *net, struct flowi
 			   daddr, saddr, dport, sport);
 	if (sk)
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-	return ip_route_output_flow(net, fl4, sk);
+	return ip_route_output_flow(ctx, fl4, sk);
 }
 
-static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4 *fl4,
+static inline struct rtable *ip_route_output_gre(struct net_ctx *ctx, struct flowi4 *fl4,
 						 __be32 daddr, __be32 saddr,
 						 __be32 gre_key, __u8 tos, int oif)
 {
@@ -159,7 +159,7 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
 	fl4->flowi4_tos = tos;
 	fl4->flowi4_proto = IPPROTO_GRE;
 	fl4->fl4_gre_key = gre_key;
-	return ip_route_output_key(net, fl4);
+	return ip_route_output_key(ctx, fl4);
 }
 
 int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
@@ -179,19 +179,19 @@ static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
 	return err;
 }
 
-void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
+void ipv4_update_pmtu(struct sk_buff *skb, struct net_ctx *ctx, u32 mtu, int oif,
 		      u32 mark, u8 protocol, int flow_flags);
 void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
-void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark,
+void ipv4_redirect(struct sk_buff *skb, struct net_ctx *ctx, int oif, u32 mark,
 		   u8 protocol, int flow_flags);
 void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
-unsigned int inet_addr_type(struct net *net, __be32 addr);
-unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
+unsigned int inet_addr_type(struct net_ctx *ctx, __be32 addr);
+unsigned int inet_dev_addr_type(struct net_ctx *ctx, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
-int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
+int ip_rt_ioctl(struct net_ctx *, unsigned int cmd, void __user *arg);
 void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
 
 struct in_ifaddr;
@@ -260,21 +260,21 @@ static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
 					      __be16 sport, __be16 dport,
 					      struct sock *sk)
 {
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	struct rtable *rt;
 
 	ip_route_connect_init(fl4, dst, src, tos, oif, protocol,
 			      sport, dport, sk);
 
 	if (!dst || !src) {
-		rt = __ip_route_output_key(net, fl4);
+		rt = __ip_route_output_key(&sk_ctx, fl4);
 		if (IS_ERR(rt))
 			return rt;
 		ip_rt_put(rt);
 		flowi4_update_output(fl4, oif, tos, fl4->daddr, fl4->saddr);
 	}
 	security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-	return ip_route_output_flow(net, fl4, sk);
+	return ip_route_output_flow(&sk_ctx, fl4, sk);
 }
 
 static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable *rt,
@@ -283,6 +283,8 @@ static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable
 					       struct sock *sk)
 {
 	if (sport != orig_sport || dport != orig_dport) {
+		struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+
 		fl4->fl4_dport = dport;
 		fl4->fl4_sport = sport;
 		ip_rt_put(rt);
@@ -290,7 +292,7 @@ static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable
 				     RT_CONN_FLAGS(sk), fl4->daddr,
 				     fl4->saddr);
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-		return ip_route_output_flow(sock_net(sk), fl4, sk);
+		return ip_route_output_flow(&sk_ctx, fl4, sk);
 	}
 	return rt;
 }
diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index b927413dde86..34a8685554f9 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -40,7 +40,7 @@ void ip6_datagram_recv_common_ctl(struct sock *sk, struct msghdr *msg,
 void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg,
 				    struct sk_buff *skb);
 
-int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
+int ip6_datagram_send_ctl(struct net_ctx *ctx, struct sock *sk, struct msghdr *msg,
 			  struct flowi6 *fl6, struct ipv6_txoptions *opt,
 			  int *hlimit, int *tclass, int *dontfrag);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 07f9b70962f6..2525f0e38b71 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,16 +246,16 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, unsigned int optlen,
 		       int (*push_pending_frames)(struct sock *));
-struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
+struct sock *udp4_lib_lookup(struct net_ctx *ctx, __be32 saddr, __be16 sport,
 			     __be32 daddr, __be16 dport, int dif);
-struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
+struct sock *__udp4_lib_lookup(struct net_ctx *ctx, __be32 saddr, __be16 sport,
 			       __be32 daddr, __be16 dport, int dif,
 			       struct udp_table *tbl);
-struct sock *udp6_lib_lookup(struct net *net,
+struct sock *udp6_lib_lookup(struct net_ctx *ctx,
 			     const struct in6_addr *saddr, __be16 sport,
 			     const struct in6_addr *daddr, __be16 dport,
 			     int dif);
-struct sock *__udp6_lib_lookup(struct net *net,
+struct sock *__udp6_lib_lookup(struct net_ctx *ctx,
 			       const struct in6_addr *saddr, __be16 sport,
 			       const struct in6_addr *daddr, __be16 dport,
 			       int dif, struct udp_table *tbl);
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index dc4865e90fe4..f5a0ebd4e211 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -286,15 +286,15 @@ struct xfrm_policy_afinfo {
 	unsigned short		family;
 	struct dst_ops		*dst_ops;
 	void			(*garbage_collect)(struct net *net);
-	struct dst_entry	*(*dst_lookup)(struct net *net, int tos,
+	struct dst_entry	*(*dst_lookup)(struct net_ctx *ctx, int tos,
 					       const xfrm_address_t *saddr,
 					       const xfrm_address_t *daddr);
-	int			(*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
+	int			(*get_saddr)(struct net_ctx *ctx, xfrm_address_t *saddr, xfrm_address_t *daddr);
 	void			(*decode_session)(struct sk_buff *skb,
 						  struct flowi *fl,
 						  int reverse);
 	int			(*get_tos)(const struct flowi *fl);
-	void			(*init_dst)(struct net *net,
+	void			(*init_dst)(struct net_ctx *ctx,
 					    struct xfrm_dst *dst);
 	int			(*init_path)(struct xfrm_dst *path,
 					     struct dst_entry *dst,
@@ -302,7 +302,7 @@ struct xfrm_policy_afinfo {
 	int			(*fill_dst)(struct xfrm_dst *xdst,
 					    struct net_device *dev,
 					    const struct flowi *fl);
-	struct dst_entry	*(*blackhole_route)(struct net *net, struct dst_entry *orig);
+	struct dst_entry	*(*blackhole_route)(struct net_ctx *ctx, struct dst_entry *orig);
 };
 
 int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo);
@@ -597,7 +597,7 @@ struct xfrm_mgr {
 	struct xfrm_policy	*(*compile_policy)(struct sock *sk, int opt, u8 *data, int len, int *dir);
 	int			(*new_mapping)(struct xfrm_state *x, xfrm_address_t *ipaddr, __be16 sport);
 	int			(*notify_policy)(struct xfrm_policy *x, int dir, const struct km_event *c);
-	int			(*report)(struct net *net, u8 proto, struct xfrm_selector *sel, xfrm_address_t *addr);
+	int			(*report)(struct net_ctx *ctx, u8 proto, struct xfrm_selector *sel, xfrm_address_t *addr);
 	int			(*migrate)(const struct xfrm_selector *sel,
 					   u8 dir, u8 type,
 					   const struct xfrm_migrate *m,
@@ -1428,43 +1428,43 @@ static inline void xfrm_sysctl_fini(struct net *net)
 
 void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto,
 			  struct xfrm_address_filter *filter);
-int xfrm_state_walk(struct net *net, struct xfrm_state_walk *walk,
+int xfrm_state_walk(struct net_ctx *ctx, struct xfrm_state_walk *walk,
 		    int (*func)(struct xfrm_state *, int, void*), void *);
-void xfrm_state_walk_done(struct xfrm_state_walk *walk, struct net *net);
-struct xfrm_state *xfrm_state_alloc(struct net *net);
+void xfrm_state_walk_done(struct xfrm_state_walk *walk, struct net_ctx *ctx);
+struct xfrm_state *xfrm_state_alloc(struct net_ctx *ctx);
 struct xfrm_state *xfrm_state_find(const xfrm_address_t *daddr,
 				   const xfrm_address_t *saddr,
 				   const struct flowi *fl,
 				   struct xfrm_tmpl *tmpl,
 				   struct xfrm_policy *pol, int *err,
 				   unsigned short family);
-struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark,
+struct xfrm_state *xfrm_stateonly_find(struct net_ctx *ctx, u32 mark,
 				       xfrm_address_t *daddr,
 				       xfrm_address_t *saddr,
 				       unsigned short family,
 				       u8 mode, u8 proto, u32 reqid);
-struct xfrm_state *xfrm_state_lookup_byspi(struct net *net, __be32 spi,
+struct xfrm_state *xfrm_state_lookup_byspi(struct net_ctx *ctx, __be32 spi,
 					      unsigned short family);
 int xfrm_state_check_expire(struct xfrm_state *x);
 void xfrm_state_insert(struct xfrm_state *x);
 int xfrm_state_add(struct xfrm_state *x);
 int xfrm_state_update(struct xfrm_state *x);
-struct xfrm_state *xfrm_state_lookup(struct net *net, u32 mark,
+struct xfrm_state *xfrm_state_lookup(struct net_ctx *ctx, u32 mark,
 				     const xfrm_address_t *daddr, __be32 spi,
 				     u8 proto, unsigned short family);
-struct xfrm_state *xfrm_state_lookup_byaddr(struct net *net, u32 mark,
+struct xfrm_state *xfrm_state_lookup_byaddr(struct net_ctx *ctx, u32 mark,
 					    const xfrm_address_t *daddr,
 					    const xfrm_address_t *saddr,
 					    u8 proto,
 					    unsigned short family);
 #ifdef CONFIG_XFRM_SUB_POLICY
 int xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src, int n,
-		   unsigned short family, struct net *net);
+		   unsigned short family, struct net_ctx *ctx);
 int xfrm_state_sort(struct xfrm_state **dst, struct xfrm_state **src, int n,
 		    unsigned short family);
 #else
 static inline int xfrm_tmpl_sort(struct xfrm_tmpl **dst, struct xfrm_tmpl **src,
-				 int n, unsigned short family, struct net *net)
+				 int n, unsigned short family, struct net_ctx *ctx)
 {
 	return -ENOSYS;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 624335140857..fa92d1046eeb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5394,10 +5394,10 @@ void netdev_adjacent_add_links(struct net_device *dev)
 {
 	struct netdev_adjacent *iter;
 
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	list_for_each_entry(iter, &dev->adj_list.upper, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_add(iter->dev, dev,
 					  &iter->dev->adj_list.lower);
@@ -5406,7 +5406,7 @@ void netdev_adjacent_add_links(struct net_device *dev)
 	}
 
 	list_for_each_entry(iter, &dev->adj_list.lower, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_add(iter->dev, dev,
 					  &iter->dev->adj_list.upper);
@@ -5419,10 +5419,10 @@ void netdev_adjacent_del_links(struct net_device *dev)
 {
 	struct netdev_adjacent *iter;
 
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	list_for_each_entry(iter, &dev->adj_list.upper, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_del(iter->dev, dev->name,
 					  &iter->dev->adj_list.lower);
@@ -5431,7 +5431,7 @@ void netdev_adjacent_del_links(struct net_device *dev)
 	}
 
 	list_for_each_entry(iter, &dev->adj_list.lower, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_del(iter->dev, dev->name,
 					  &iter->dev->adj_list.upper);
@@ -5444,10 +5444,10 @@ void netdev_adjacent_rename_links(struct net_device *dev, char *oldname)
 {
 	struct netdev_adjacent *iter;
 
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	list_for_each_entry(iter, &dev->adj_list.upper, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_del(iter->dev, oldname,
 					  &iter->dev->adj_list.lower);
@@ -5456,7 +5456,7 @@ void netdev_adjacent_rename_links(struct net_device *dev, char *oldname)
 	}
 
 	list_for_each_entry(iter, &dev->adj_list.lower, list) {
-		if (!net_eq(net,dev_net(iter->dev)))
+		if (!dev_net_ctx_eq(iter->dev, &dev_ctx))
 			continue;
 		netdev_adjacent_sysfs_del(iter->dev, oldname,
 					  &iter->dev->adj_list.upper);
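
The net/core/dev.c hunks above replace net_eq(net, dev_net(iter->dev)) with
dev_net_ctx_eq(iter->dev, &dev_ctx). The definitions of struct net_ctx and the
comparison helpers are not part of this excerpt, so the following is only a
minimal userspace sketch of what such a comparison might look like. It assumes
a context is a namespace pointer plus an integer VRF id, as the cover letter
describes. The field name vrf, the struct layouts, and count_matching() are
hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct net. */
struct net {
	int id;
};

/* Assumed layout: namespace pointer (ctx->net in the patch) plus VRF id. */
struct net_ctx {
	struct net *net;
	uint32_t vrf;	/* assumption: integer VRF id nested in the namespace */
};

/* Hypothetical stand-in for struct net_device; only the context is kept. */
struct net_device {
	struct net_ctx net_ctx;
};

/* Two contexts match only if both the namespace and the VRF id match. */
static inline bool net_ctx_eq(const struct net_ctx *a, const struct net_ctx *b)
{
	return a->net == b->net && a->vrf == b->vrf;
}

/* Device-side wrapper, mirroring how dev_net_ctx_eq() is called above. */
static inline bool dev_net_ctx_eq(const struct net_device *dev,
				  const struct net_ctx *ctx)
{
	return net_ctx_eq(&dev->net_ctx, ctx);
}

/*
 * Walk an array of devices the way the adjacency loops walk their lists:
 * skip every device whose context differs from the reference context.
 */
static inline size_t count_matching(const struct net_device *devs, size_t n,
				    const struct net_ctx *ctx)
{
	size_t i, hits = 0;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!dev_net_ctx_eq(&devs[i], ctx))
			continue;
		hits++;
	}
	return hits;
}
```

Under these assumptions, two devices in the same namespace but with different
VRF ids no longer compare equal, which is the behavioural difference from the
plain net_eq() check.
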
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 44706e81b2e0..b793196f9521 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -65,12 +65,12 @@ static void notify_rule_change(int event, struct fib_rule *rule,
 			       struct fib_rules_ops *ops, struct nlmsghdr *nlh,
 			       u32 pid);
 
-static struct fib_rules_ops *lookup_rules_ops(struct net *net, int family)
+static struct fib_rules_ops *lookup_rules_ops(struct net_ctx *ctx, int family)
 {
 	struct fib_rules_ops *ops;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ops, &net->rules_ops, list) {
+	list_for_each_entry_rcu(ops, &ctx->net->rules_ops, list) {
 		if (ops->family == family) {
 			if (!try_module_get(ops->owner))
 				ops = NULL;
@@ -126,7 +126,7 @@ static int __fib_rules_register(struct fib_rules_ops *ops)
 }
 
 struct fib_rules_ops *
-fib_rules_register(const struct fib_rules_ops *tmpl, struct net *net)
+fib_rules_register(const struct fib_rules_ops *tmpl, struct net_ctx *ctx)
 {
 	struct fib_rules_ops *ops;
 	int err;
@@ -136,7 +136,7 @@ fib_rules_register(const struct fib_rules_ops *tmpl, struct net *net)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&ops->rules_list);
-	ops->fro_net = net;
+	ops->fro_net = ctx->net;
 
 	err = __fib_rules_register(ops);
 	if (err) {
@@ -274,7 +274,8 @@ static int validate_rulemsg(struct fib_rule_hdr *frh, struct nlattr **tb,
 
 static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(skb);
+	struct net *net = sk_ctx.net;
 	struct fib_rule_hdr *frh = nlmsg_data(nlh);
 	struct fib_rules_ops *ops = NULL;
 	struct fib_rule *rule, *r, *last = NULL;
@@ -284,7 +285,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
-	ops = lookup_rules_ops(net, frh->family);
+	ops = lookup_rules_ops(&sk_ctx, frh->family);
 	if (ops == NULL) {
 		err = -EAFNOSUPPORT;
 		goto errout;
@@ -432,7 +433,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 
 static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(skb);
 	struct fib_rule_hdr *frh = nlmsg_data(nlh);
 	struct fib_rules_ops *ops = NULL;
 	struct fib_rule *rule, *tmp;
@@ -442,7 +443,7 @@ static int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
 
-	ops = lookup_rules_ops(net, frh->family);
+	ops = lookup_rules_ops(&sk_ctx, frh->family);
 	if (ops == NULL) {
 		err = -EAFNOSUPPORT;
 		goto errout;
@@ -644,14 +645,14 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(skb);
 	struct fib_rules_ops *ops;
 	int idx = 0, family;
 
 	family = rtnl_msg_family(cb->nlh);
 	if (family != AF_UNSPEC) {
 		/* Protocol specific dump request */
-		ops = lookup_rules_ops(net, family);
+		ops = lookup_rules_ops(&sk_ctx, family);
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
@@ -659,7 +660,7 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(ops, &net->rules_ops, list) {
+	list_for_each_entry_rcu(ops, &sk_ctx.net->rules_ops, list) {
 		if (idx < cb->args[0] || !try_module_get(ops->owner))
 			goto skip;
 
diff --git a/net/core/flow.c b/net/core/flow.c
index a0348fde1fdf..59021e0a50f1 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -189,9 +189,10 @@ static int flow_key_compare(const struct flowi *key1, const struct flowi *key2,
 }
 
 struct flow_cache_object *
-flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
+flow_cache_lookup(struct net_ctx *net_ctx, const struct flowi *key, u16 family, u8 dir,
 		  flow_resolve_t resolver, void *ctx)
 {
+	struct net *net = net_ctx->net;
 	struct flow_cache *fc = &net->xfrm.flow_cache_global;
 	struct flow_cache_percpu *fcp;
 	struct flow_cache_entry *fle, *tfle;
@@ -261,7 +262,7 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
 		flo = fle->object;
 		fle->object = NULL;
 	}
-	flo = resolver(net, key, family, dir, flo, ctx);
+	flo = resolver(net_ctx, key, family, dir, flo, ctx);
 	if (fle) {
 		fle->genid = atomic_read(&net->xfrm.flow_cache_genid);
 		if (!IS_ERR(flo))
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index bd77804849cc..93a7701a7ae7 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -423,7 +423,8 @@ struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
 }
 EXPORT_SYMBOL(neigh_lookup);
 
-struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl, struct net *net,
+struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl,
+				     struct net_ctx *ctx,
 				     const void *pkey)
 {
 	struct neighbour *n;
@@ -441,7 +442,7 @@ struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl, struct net *net,
 	     n != NULL;
 	     n = rcu_dereference_bh(n->next)) {
 		if (!memcmp(n->primary_key, pkey, key_len) &&
-		    net_eq(dev_net(n->dev), net)) {
+		    dev_net_ctx_eq(n->dev, ctx)) {
 			if (!atomic_inc_not_zero(&n->refcnt))
 				n = NULL;
 			NEIGH_CACHE_STAT_INC(tbl, hits);
@@ -553,14 +554,14 @@ static u32 pneigh_hash(const void *pkey, int key_len)
 }
 
 static struct pneigh_entry *__pneigh_lookup_1(struct pneigh_entry *n,
-					      struct net *net,
+					      struct net_ctx *ctx,
 					      const void *pkey,
 					      int key_len,
 					      struct net_device *dev)
 {
 	while (n) {
 		if (!memcmp(n->key, pkey, key_len) &&
-		    net_eq(pneigh_net(n), net) &&
+		    pneigh_net_ctx_eq(n, ctx) &&
 		    (n->dev == dev || !n->dev))
 			return n;
 		n = n->next;
@@ -569,18 +570,19 @@ static struct pneigh_entry *__pneigh_lookup_1(struct pneigh_entry *n,
 }
 
 struct pneigh_entry *__pneigh_lookup(struct neigh_table *tbl,
-		struct net *net, const void *pkey, struct net_device *dev)
+				     struct net_ctx *ctx,
+				     const void *pkey, struct net_device *dev)
 {
 	int key_len = tbl->key_len;
 	u32 hash_val = pneigh_hash(pkey, key_len);
 
 	return __pneigh_lookup_1(tbl->phash_buckets[hash_val],
-				 net, pkey, key_len, dev);
+				 ctx, pkey, key_len, dev);
 }
 EXPORT_SYMBOL_GPL(__pneigh_lookup);
 
 struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
-				    struct net *net, const void *pkey,
+				    struct net_ctx *ctx, const void *pkey,
 				    struct net_device *dev, int creat)
 {
 	struct pneigh_entry *n;
@@ -589,7 +591,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 
 	read_lock_bh(&tbl->lock);
 	n = __pneigh_lookup_1(tbl->phash_buckets[hash_val],
-			      net, pkey, key_len, dev);
+			      ctx, pkey, key_len, dev);
 	read_unlock_bh(&tbl->lock);
 
 	if (n || !creat)
@@ -601,7 +603,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 	if (!n)
 		goto out;
 
-	write_pnet(&n->net_ctx.net, hold_net(net));
+	pneigh_net_ctx_set(n, ctx);
 	memcpy(n->key, pkey, key_len);
 	n->dev = dev;
 	if (dev)
@@ -610,7 +612,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 	if (tbl->pconstructor && tbl->pconstructor(n)) {
 		if (dev)
 			dev_put(dev);
-		release_net(net);
+		release_net(ctx->net);
 		kfree(n);
 		n = NULL;
 		goto out;
@@ -626,7 +628,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 EXPORT_SYMBOL(pneigh_lookup);
 
 
-int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *pkey,
+int pneigh_delete(struct neigh_table *tbl, struct net_ctx *ctx, const void *pkey,
 		  struct net_device *dev)
 {
 	struct pneigh_entry *n, **np;
@@ -637,7 +639,7 @@ int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *pkey,
 	for (np = &tbl->phash_buckets[hash_val]; (n = *np) != NULL;
 	     np = &n->next) {
 		if (!memcmp(n->key, pkey, key_len) && n->dev == dev &&
-		    net_eq(pneigh_net(n), net)) {
+		    pneigh_net_ctx_eq(n, ctx)) {
 			*np = n->next;
 			write_unlock_bh(&tbl->lock);
 			if (tbl->pdestructor)
@@ -1436,13 +1438,13 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 EXPORT_SYMBOL(pneigh_enqueue);
 
 static inline struct neigh_parms *lookup_neigh_parms(struct neigh_table *tbl,
-						      struct net *net, int ifindex)
+						      struct net_ctx *ctx, int ifindex)
 {
 	struct neigh_parms *p;
 
 	list_for_each_entry(p, &tbl->parms_list, list) {
-		if ((p->dev && p->dev->ifindex == ifindex && net_eq(neigh_parms_net(p), net)) ||
-		    (!p->dev && !ifindex && net_eq(net, &init_net)))
+		if ((p->dev && p->dev->ifindex == ifindex && neigh_parms_net_ctx_eq(p, ctx)) ||
+		    (!p->dev && !ifindex && net_eq(ctx->net, &init_net)))
 			return p;
 	}
 
@@ -1453,7 +1455,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 				      struct neigh_table *tbl)
 {
 	struct neigh_parms *p;
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	const struct net_device_ops *ops = dev->netdev_ops;
 
 	p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL);
@@ -1464,11 +1466,11 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
 		dev_hold(dev);
 		p->dev = dev;
-		write_pnet(&p->net_ctx.net, hold_net(net));
+		write_pnet(&p->net_ctx.net, hold_net(dev_ctx.net));
 		p->sysctl_table = NULL;
 
 		if (ops->ndo_neigh_setup && ops->ndo_neigh_setup(dev, p)) {
-			release_net(net);
+			release_net(dev_ctx.net);
 			dev_put(dev);
 			kfree(p);
 			return NULL;
@@ -1615,7 +1617,7 @@ static struct neigh_table *neigh_find_table(int family)
 
 static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct ndmsg *ndm;
 	struct nlattr *dst_attr;
 	struct neigh_table *tbl;
@@ -1633,7 +1635,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex) {
-		dev = __dev_get_by_index(net, ndm->ndm_ifindex);
+		dev = __dev_get_by_index_ctx(&ctx, ndm->ndm_ifindex);
 		if (dev == NULL) {
 			err = -ENODEV;
 			goto out;
@@ -1648,7 +1650,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh)
 		goto out;
 
 	if (ndm->ndm_flags & NTF_PROXY) {
-		err = pneigh_delete(tbl, net, nla_data(dst_attr), dev);
+		err = pneigh_delete(tbl, &ctx, nla_data(dst_attr), dev);
 		goto out;
 	}
 
@@ -1673,7 +1675,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh)
 static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
 	int flags = NEIGH_UPDATE_F_ADMIN | NEIGH_UPDATE_F_OVERRIDE;
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct ndmsg *ndm;
 	struct nlattr *tb[NDA_MAX+1];
 	struct neigh_table *tbl;
@@ -1693,7 +1695,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 	ndm = nlmsg_data(nlh);
 	if (ndm->ndm_ifindex) {
-		dev = __dev_get_by_index(net, ndm->ndm_ifindex);
+		dev = __dev_get_by_index_ctx(&ctx, ndm->ndm_ifindex);
 		if (dev == NULL) {
 			err = -ENODEV;
 			goto out;
@@ -1716,7 +1718,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh)
 		struct pneigh_entry *pn;
 
 		err = -ENOBUFS;
-		pn = pneigh_lookup(tbl, net, dst, dev, 1);
+		pn = pneigh_lookup(tbl, &ctx, dst, dev, 1);
 		if (pn) {
 			pn->flags = ndm->ndm_flags;
 			err = 0;
@@ -1953,7 +1955,7 @@ static const struct nla_policy nl_ntbl_parm_policy[NDTPA_MAX+1] = {
 
 static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct neigh_table *tbl;
 	struct ndtmsg *ndtmsg;
 	struct nlattr *tb[NDTA_MAX+1];
@@ -2006,7 +2008,7 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh)
 		if (tbp[NDTPA_IFINDEX])
 			ifindex = nla_get_u32(tbp[NDTPA_IFINDEX]);
 
-		p = lookup_neigh_parms(tbl, net, ifindex);
+		p = lookup_neigh_parms(tbl, &ctx, ifindex);
 		if (p == NULL) {
 			err = -ENOENT;
 			goto errout_tbl_lock;
@@ -2083,7 +2085,7 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh)
 	err = -ENOENT;
 	if ((tb[NDTA_THRESH1] || tb[NDTA_THRESH2] ||
 	     tb[NDTA_THRESH3] || tb[NDTA_GC_INTERVAL]) &&
-	    !net_eq(net, &init_net))
+	    !net_eq(ctx.net, &init_net))
 		goto errout_tbl_lock;
 
 	if (tb[NDTA_THRESH1])
@@ -2108,7 +2110,7 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	int family, tidx, nidx = 0;
 	int tbl_skip = cb->args[0];
 	int neigh_skip = cb->args[1];
@@ -2134,7 +2136,7 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 		nidx = 0;
 		p = list_next_entry(&tbl->parms, list);
 		list_for_each_entry_from(p, &tbl->parms_list, list) {
-			if (!net_eq(neigh_parms_net(p), net))
+			if (!neigh_parms_net_ctx_eq(p, &ctx))
 				continue;
 
 			if (nidx < neigh_skip)
@@ -2252,7 +2254,7 @@ static void neigh_update_notify(struct neighbour *neigh)
 static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 			    struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct neighbour *n;
 	int rc, h, s_h = cb->args[1];
 	int idx, s_idx = idx = cb->args[2];
@@ -2267,7 +2269,7 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
 		     n != NULL;
 		     n = rcu_dereference_bh(n->next)) {
-			if (!net_eq(dev_net(n->dev), net))
+			if (!dev_net_ctx_eq(n->dev, &ctx))
 				continue;
 			if (idx < s_idx)
 				goto next;
@@ -2294,7 +2296,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 			     struct netlink_callback *cb)
 {
 	struct pneigh_entry *n;
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	int rc, h, s_h = cb->args[3];
 	int idx, s_idx = idx = cb->args[4];
 
@@ -2304,7 +2306,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		if (h > s_h)
 			s_idx = 0;
 		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
-			if (dev_net(n->dev) != net)
+			if (!dev_net_ctx_eq(n->dev, &ctx))
 				continue;
 			if (idx < s_idx)
 				goto next;
@@ -2432,7 +2434,7 @@ EXPORT_SYMBOL(__neigh_for_each_release);
 static struct neighbour *neigh_get_first(struct seq_file *seq)
 {
 	struct neigh_seq_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 	struct neigh_hash_table *nht = state->nht;
 	struct neighbour *n = NULL;
 	int bucket = state->bucket;
@@ -2442,7 +2444,7 @@ static struct neighbour *neigh_get_first(struct seq_file *seq)
 		n = rcu_dereference_bh(nht->hash_buckets[bucket]);
 
 		while (n) {
-			if (!net_eq(dev_net(n->dev), net))
+			if (!dev_net_ctx_eq(n->dev, ctx))
 				goto next;
 			if (state->neigh_sub_iter) {
 				loff_t fakep = 0;
@@ -2473,7 +2475,7 @@ static struct neighbour *neigh_get_next(struct seq_file *seq,
 					loff_t *pos)
 {
 	struct neigh_seq_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 	struct neigh_hash_table *nht = state->nht;
 
 	if (state->neigh_sub_iter) {
@@ -2485,7 +2487,7 @@ static struct neighbour *neigh_get_next(struct seq_file *seq,
 
 	while (1) {
 		while (n) {
-			if (!net_eq(dev_net(n->dev), net))
+			if (!dev_net_ctx_eq(n->dev, ctx))
 				goto next;
 			if (state->neigh_sub_iter) {
 				void *v = state->neigh_sub_iter(state, n, pos);
@@ -2534,7 +2536,7 @@ static struct neighbour *neigh_get_idx(struct seq_file *seq, loff_t *pos)
 static struct pneigh_entry *pneigh_get_first(struct seq_file *seq)
 {
 	struct neigh_seq_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 	struct neigh_table *tbl = state->tbl;
 	struct pneigh_entry *pn = NULL;
 	int bucket = state->bucket;
@@ -2542,7 +2544,7 @@ static struct pneigh_entry *pneigh_get_first(struct seq_file *seq)
 	state->flags |= NEIGH_SEQ_IS_PNEIGH;
 	for (bucket = 0; bucket <= PNEIGH_HASHMASK; bucket++) {
 		pn = tbl->phash_buckets[bucket];
-		while (pn && !net_eq(pneigh_net(pn), net))
+		while (pn && !pneigh_net_ctx_eq(pn, ctx))
 			pn = pn->next;
 		if (pn)
 			break;
@@ -2557,18 +2559,18 @@ static struct pneigh_entry *pneigh_get_next(struct seq_file *seq,
 					    loff_t *pos)
 {
 	struct neigh_seq_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 	struct neigh_table *tbl = state->tbl;
 
 	do {
 		pn = pn->next;
-	} while (pn && !net_eq(pneigh_net(pn), net));
+	} while (pn && !pneigh_net_ctx_eq(pn, ctx));
 
 	while (!pn) {
 		if (++state->bucket > PNEIGH_HASHMASK)
 			break;
 		pn = tbl->phash_buckets[state->bucket];
-		while (pn && !net_eq(pneigh_net(pn), net))
+		while (pn && !pneigh_net_ctx_eq(pn, ctx))
 			pn = pn->next;
 		if (pn)
 			break;
@@ -2832,14 +2834,14 @@ static struct neigh_parms *neigh_get_dev_parms_rcu(struct net_device *dev,
 	return NULL;
 }
 
-static void neigh_copy_dflt_parms(struct net *net, struct neigh_parms *p,
+static void neigh_copy_dflt_parms(struct net_ctx *ctx, struct neigh_parms *p,
 				  int index)
 {
 	struct net_device *dev;
 	int family = neigh_parms_family(p);
 
 	rcu_read_lock();
-	for_each_netdev_rcu(net, dev) {
+	for_each_netdev_rcu(ctx->net, dev) {
 		struct neigh_parms *dst_p =
 				neigh_get_dev_parms_rcu(dev, family);
 
@@ -2853,7 +2855,7 @@ static void neigh_proc_update(struct ctl_table *ctl, int write)
 {
 	struct net_device *dev = ctl->extra1;
 	struct neigh_parms *p = ctl->extra2;
-	struct net *net = neigh_parms_net(p);
+	struct net_ctx ctx = { .net = neigh_parms_net(p) };
 	int index = (int *) ctl->data - p->data;
 
 	if (!write)
@@ -2861,7 +2863,7 @@ static void neigh_proc_update(struct ctl_table *ctl, int write)
 
 	set_bit(index, p->data_state);
 	if (!dev) /* NULL dev means this is default value */
-		neigh_copy_dflt_parms(net, p, index);
+		neigh_copy_dflt_parms(&ctx, p, index);
 }
 
 static int neigh_proc_dointvec_zero_intmax(struct ctl_table *ctl, int write,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a44773c8346c..2627fff2b2d0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -423,7 +423,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
 	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 	unsigned short snum;
 	int chk_addr_ret;
 	int err;
@@ -447,7 +448,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			goto out;
 	}
 
-	chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+	chk_addr_ret = inet_addr_type(&sk_ctx, addr->sin_addr.s_addr);
 
 	/* Not specified by any standard per-se, however it breaks too
 	 * many applications when removed.  It is unfortunate since
@@ -838,7 +839,7 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
 	struct sock *sk = sock->sk;
 	int err = 0;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	switch (cmd) {
 	case SIOCGSTAMP:
@@ -850,12 +851,12 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 	case SIOCADDRT:
 	case SIOCDELRT:
 	case SIOCRTMSG:
-		err = ip_rt_ioctl(net, cmd, (void __user *)arg);
+		err = ip_rt_ioctl(&sk_ctx, cmd, (void __user *)arg);
 		break;
 	case SIOCDARP:
 	case SIOCGARP:
 	case SIOCSARP:
-		err = arp_ioctl(net, cmd, (void __user *)arg);
+		err = arp_ioctl(&sk_ctx, cmd, (void __user *)arg);
 		break;
 	case SIOCGIFADDR:
 	case SIOCSIFADDR:
@@ -868,7 +869,7 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 	case SIOCSIFPFLAGS:
 	case SIOCGIFPFLAGS:
 	case SIOCSIFFLAGS:
-		err = devinet_ioctl(net, cmd, (void __user *)arg);
+		err = devinet_ioctl(&sk_ctx, cmd, (void __user *)arg);
 		break;
 	default:
 		if (sk->sk_prot->ioctl)
@@ -1157,6 +1158,7 @@ int inet_sk_rebuild_header(struct sock *sk)
 	struct ip_options_rcu *inet_opt;
 	struct flowi4 *fl4;
 	int err;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	/* Route is OK, nothing to do. */
 	if (rt)
@@ -1170,7 +1172,7 @@ int inet_sk_rebuild_header(struct sock *sk)
 		daddr = inet_opt->opt.faddr;
 	rcu_read_unlock();
 	fl4 = &inet->cork.fl.u.ip4;
-	rt = ip_route_output_ports(sock_net(sk), fl4, sk, daddr, inet->inet_saddr,
+	rt = ip_route_output_ports(&sk_ctx, fl4, sk, daddr, inet->inet_saddr,
 				   inet->inet_dport, inet->inet_sport,
 				   sk->sk_protocol, RT_CONN_FLAGS(sk),
 				   sk->sk_bound_dev_if);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 205e1472aa78..b24773b275a9 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -225,6 +225,7 @@ static int arp_constructor(struct neighbour *neigh)
 	struct net_device *dev = neigh->dev;
 	struct in_device *in_dev;
 	struct neigh_parms *parms;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
@@ -233,7 +234,7 @@ static int arp_constructor(struct neighbour *neigh)
 		return -EINVAL;
 	}
 
-	neigh->type = inet_addr_type(dev_net(dev), addr);
+	neigh->type = inet_addr_type(&dev_ctx, addr);
 
 	parms = in_dev->arp_parms;
 	__neigh_parms_put(neigh->parms);
@@ -328,6 +329,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	__be32 target = *(__be32 *)neigh->primary_key;
 	int probes = atomic_read(&neigh->probes);
 	struct in_device *in_dev;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
@@ -338,7 +340,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
 	default:
 	case 0:		/* By default announce any local IP */
-		if (skb && inet_addr_type(dev_net(dev),
+		if (skb && inet_addr_type(&dev_ctx,
 					  ip_hdr(skb)->saddr) == RTN_LOCAL)
 			saddr = ip_hdr(skb)->saddr;
 		break;
@@ -346,7 +348,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 		if (!skb)
 			break;
 		saddr = ip_hdr(skb)->saddr;
-		if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) {
+		if (inet_addr_type(&dev_ctx, saddr) == RTN_LOCAL) {
 			/* saddr should be known to target */
 			if (inet_addr_onlink(in_dev, target, saddr))
 				break;
@@ -381,7 +383,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 
 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
 {
-	struct net *net = dev_net(in_dev->dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(in_dev->dev);
 	int scope;
 
 	switch (IN_DEV_ARP_IGNORE(in_dev)) {
@@ -412,7 +414,7 @@ static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)
 	default:
 		return 0;
 	}
-	return !inet_confirm_addr(net, in_dev, sip, tip, scope);
+	return !inet_confirm_addr(&dev_ctx, in_dev, sip, tip, scope);
 }
 
 static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev)
@@ -420,9 +422,10 @@ static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev)
 	struct rtable *rt;
 	int flag = 0;
 	/*unsigned long now; */
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
+	struct net *net = dev_ctx.net;
 
-	rt = ip_route_output(net, sip, tip, 0, 0);
+	rt = ip_route_output(&dev_ctx, sip, tip, 0, 0);
 	if (IS_ERR(rt))
 		return 1;
 	if (rt->dst.dev != dev) {
@@ -468,6 +471,7 @@ int arp_find(unsigned char *haddr, struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	__be32 paddr;
 	struct neighbour *n;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	if (!skb_dst(skb)) {
 		pr_debug("arp_find is called with dst==NULL\n");
@@ -476,7 +480,7 @@ int arp_find(unsigned char *haddr, struct sk_buff *skb)
 	}
 
 	paddr = rt_nexthop(skb_rtable(skb), ip_hdr(skb)->daddr);
-	if (arp_set_predefined(inet_addr_type(dev_net(dev), paddr), haddr,
+	if (arp_set_predefined(inet_addr_type(&dev_ctx, paddr), haddr,
 			       paddr, dev))
 		return 0;
 
@@ -731,7 +735,7 @@ static int arp_process(struct sk_buff *skb)
 	u16 dev_type = dev->type;
 	int addr_type;
 	struct neighbour *n;
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	bool is_garp = false;
 
 	/* arp_rcv below verifies the ARP header and verifies the device
@@ -835,7 +839,7 @@ static int arp_process(struct sk_buff *skb)
 	/* Special case: IPv4 duplicate address detection packet (RFC2131) */
 	if (sip == 0) {
 		if (arp->ar_op == htons(ARPOP_REQUEST) &&
-		    inet_addr_type(net, tip) == RTN_LOCAL &&
+		    inet_addr_type(&dev_ctx, tip) == RTN_LOCAL &&
 		    !arp_ignore(in_dev, sip, tip))
 			arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha,
 				 dev->dev_addr, sha);
@@ -869,7 +873,7 @@ static int arp_process(struct sk_buff *skb)
 			    (arp_fwd_proxy(in_dev, dev, rt) ||
 			     arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
 			     (rt->dst.dev != dev &&
-			      pneigh_lookup(&arp_tbl, net, &tip, dev, 0)))) {
+			      pneigh_lookup(&arp_tbl, &dev_ctx, &tip, dev, 0)))) {
 				n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
 				if (n)
 					neigh_release(n);
@@ -900,11 +904,11 @@ static int arp_process(struct sk_buff *skb)
 		   devices (strip is candidate)
 		 */
 		is_garp = arp->ar_op == htons(ARPOP_REQUEST) && tip == sip &&
-			  inet_addr_type(net, sip) == RTN_UNICAST;
+			  inet_addr_type(&dev_ctx, sip) == RTN_UNICAST;
 
 		if (n == NULL &&
 		    ((arp->ar_op == htons(ARPOP_REPLY)  &&
-		      inet_addr_type(net, sip) == RTN_UNICAST) || is_garp))
+		      inet_addr_type(&dev_ctx, sip) == RTN_UNICAST) || is_garp))
 			n = __neigh_lookup(&arp_tbl, &sip, dev, 1);
 	}
 
@@ -1005,9 +1009,10 @@ static int arp_req_set_proxy(struct net *net, struct net_device *dev, int on)
 	return -ENXIO;
 }
 
-static int arp_req_set_public(struct net *net, struct arpreq *r,
+static int arp_req_set_public(struct net_ctx *ctx, struct arpreq *r,
 		struct net_device *dev)
 {
+	struct net *net = ctx->net;
 	__be32 ip = ((struct sockaddr_in *)&r->arp_pa)->sin_addr.s_addr;
 	__be32 mask = ((struct sockaddr_in *)&r->arp_netmask)->sin_addr.s_addr;
 
@@ -1020,15 +1025,15 @@ static int arp_req_set_public(struct net *net, struct arpreq *r,
 			return -ENODEV;
 	}
 	if (mask) {
-		if (pneigh_lookup(&arp_tbl, net, &ip, dev, 1) == NULL)
+		if (pneigh_lookup(&arp_tbl, ctx, &ip, dev, 1) == NULL)
 			return -ENOBUFS;
 		return 0;
 	}
 
-	return arp_req_set_proxy(net, dev, 1);
+	return arp_req_set_proxy(ctx->net, dev, 1);
 }
 
-static int arp_req_set(struct net *net, struct arpreq *r,
+static int arp_req_set(struct net_ctx *ctx, struct arpreq *r,
 		       struct net_device *dev)
 {
 	__be32 ip;
@@ -1036,13 +1041,13 @@ static int arp_req_set(struct net *net, struct arpreq *r,
 	int err;
 
 	if (r->arp_flags & ATF_PUBL)
-		return arp_req_set_public(net, r, dev);
+		return arp_req_set_public(ctx, r, dev);
 
 	ip = ((struct sockaddr_in *)&r->arp_pa)->sin_addr.s_addr;
 	if (r->arp_flags & ATF_PERM)
 		r->arp_flags |= ATF_COM;
 	if (dev == NULL) {
-		struct rtable *rt = ip_route_output(net, ip, 0, RTO_ONLINK, 0);
+		struct rtable *rt = ip_route_output(ctx, ip, 0, RTO_ONLINK, 0);
 
 		if (IS_ERR(rt))
 			return PTR_ERR(rt);
@@ -1137,32 +1142,31 @@ static int arp_invalidate(struct net_device *dev, __be32 ip)
 	return err;
 }
 
-static int arp_req_delete_public(struct net *net, struct arpreq *r,
+static int arp_req_delete_public(struct net_ctx *ctx, struct arpreq *r,
 		struct net_device *dev)
 {
 	__be32 ip = ((struct sockaddr_in *) &r->arp_pa)->sin_addr.s_addr;
 	__be32 mask = ((struct sockaddr_in *)&r->arp_netmask)->sin_addr.s_addr;
 
 	if (mask == htonl(0xFFFFFFFF))
-		return pneigh_delete(&arp_tbl, net, &ip, dev);
+		return pneigh_delete(&arp_tbl, ctx, &ip, dev);
 
 	if (mask)
 		return -EINVAL;
 
-	return arp_req_set_proxy(net, dev, 0);
+	return arp_req_set_proxy(ctx->net, dev, 0);
 }
 
-static int arp_req_delete(struct net *net, struct arpreq *r,
+static int arp_req_delete(struct net_ctx *ctx, struct arpreq *r,
 			  struct net_device *dev)
 {
 	__be32 ip;
 
 	if (r->arp_flags & ATF_PUBL)
-		return arp_req_delete_public(net, r, dev);
-
+		return arp_req_delete_public(ctx, r, dev);
 	ip = ((struct sockaddr_in *)&r->arp_pa)->sin_addr.s_addr;
 	if (dev == NULL) {
-		struct rtable *rt = ip_route_output(net, ip, 0, RTO_ONLINK, 0);
+		struct rtable *rt = ip_route_output(ctx, ip, 0, RTO_ONLINK, 0);
 		if (IS_ERR(rt))
 			return PTR_ERR(rt);
 		dev = rt->dst.dev;
@@ -1177,8 +1181,9 @@ static int arp_req_delete(struct net *net, struct arpreq *r,
  *	Handle an ARP layer I/O control request.
  */
 
-int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int arp_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 {
+	struct net *net = ctx->net;
 	int err;
 	struct arpreq r;
 	struct net_device *dev = NULL;
@@ -1226,10 +1231,10 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 
 	switch (cmd) {
 	case SIOCDARP:
-		err = arp_req_delete(net, &r, dev);
+		err = arp_req_delete(ctx, &r, dev);
 		break;
 	case SIOCSARP:
-		err = arp_req_set(net, &r, dev);
+		err = arp_req_set(ctx, &r, dev);
 		break;
 	case SIOCGARP:
 		err = arp_req_get(&r, dev);
@@ -1247,11 +1252,12 @@ static int arp_netdev_event(struct notifier_block *this, unsigned long event,
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct netdev_notifier_change_info *change_info;
+	struct net *net = dev_net(dev);
 
 	switch (event) {
 	case NETDEV_CHANGEADDR:
 		neigh_changeaddr(&arp_tbl, dev);
-		rt_cache_flush(dev_net(dev));
+		rt_cache_flush(net);
 		break;
 	case NETDEV_CHANGE:
 		change_info = ptr;
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 90c0e8386116..7f93d6b92d0b 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -99,6 +99,7 @@ void ip4_datagram_release_cb(struct sock *sk)
 	struct dst_entry *dst;
 	struct flowi4 fl4;
 	struct rtable *rt;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	rcu_read_lock();
 
@@ -110,7 +111,7 @@ void ip4_datagram_release_cb(struct sock *sk)
 	inet_opt = rcu_dereference(inet->inet_opt);
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
-	rt = ip_route_output_ports(sock_net(sk), &fl4, sk, daddr,
+	rt = ip_route_output_ports(&sk_ctx, &fl4, sk, daddr,
 				   inet->inet_saddr, inet->inet_dport,
 				   inet->inet_sport, sk->sk_protocol,
 				   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index f0b4a31d7bd6..a0182f79f6bf 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -136,8 +136,9 @@ static void inet_hash_remove(struct in_ifaddr *ifa)
  *
  * If a caller uses devref=false, it should be protected by RCU, or RTNL
  */
-struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
+struct net_device *__ip_dev_find(struct net_ctx *ctx, __be32 addr, bool devref)
 {
+	struct net *net = ctx->net;
 	u32 hash = inet_addr_hash(net, addr);
 	struct net_device *result = NULL;
 	struct in_ifaddr *ifa;
@@ -147,7 +148,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 		if (ifa->ifa_local == addr) {
 			struct net_device *dev = ifa->ifa_dev->dev;
 
-			if (!net_eq(dev_net(dev), net))
+			if (!dev_net_ctx_eq(dev, ctx))
 				continue;
 			result = dev;
 			break;
@@ -520,13 +521,13 @@ static int inet_set_ifa(struct net_device *dev, struct in_ifaddr *ifa)
 /* Caller must hold RCU or RTNL :
  * We dont take a reference on found in_device
  */
-struct in_device *inetdev_by_index(struct net *net, int ifindex)
+struct in_device *inetdev_by_index(struct net_ctx *ctx, int ifindex)
 {
 	struct net_device *dev;
 	struct in_device *in_dev = NULL;
 
 	rcu_read_lock();
-	dev = dev_get_by_index_rcu(net, ifindex);
+	dev = dev_get_by_index_rcu_ctx(ctx, ifindex);
 	if (dev)
 		in_dev = rcu_dereference_rtnl(dev->ip_ptr);
 	rcu_read_unlock();
@@ -550,7 +551,7 @@ struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix,
 
 static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct nlattr *tb[IFA_MAX+1];
 	struct in_device *in_dev;
 	struct ifaddrmsg *ifm;
@@ -564,7 +565,7 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh)
 		goto errout;
 
 	ifm = nlmsg_data(nlh);
-	in_dev = inetdev_by_index(net, ifm->ifa_index);
+	in_dev = inetdev_by_index(&ctx, ifm->ifa_index);
 	if (in_dev == NULL) {
 		err = -ENODEV;
 		goto errout;
@@ -717,7 +718,7 @@ static void set_ifa_lifetime(struct in_ifaddr *ifa, __u32 valid_lft,
 		ifa->ifa_cstamp = ifa->ifa_tstamp;
 }
 
-static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
+static struct in_ifaddr *rtm_to_ifaddr(struct net_ctx *ctx, struct nlmsghdr *nlh,
 				       __u32 *pvalid_lft, __u32 *pprefered_lft)
 {
 	struct nlattr *tb[IFA_MAX+1];
@@ -736,7 +737,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh,
 	if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL)
 		goto errout;
 
-	dev = __dev_get_by_index(net, ifm->ifa_index);
+	dev = __dev_get_by_index_ctx(ctx, ifm->ifa_index);
 	err = -ENODEV;
 	if (dev == NULL)
 		goto errout;
@@ -820,7 +821,7 @@ static struct in_ifaddr *find_matching_ifa(struct in_ifaddr *ifa)
 
 static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SKB_NET_CTX_SOCK(skb);
 	struct in_ifaddr *ifa;
 	struct in_ifaddr *ifa_existing;
 	__u32 valid_lft = INFINITY_LIFE_TIME;
@@ -828,7 +829,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 	ASSERT_RTNL();
 
-	ifa = rtm_to_ifaddr(net, nlh, &valid_lft, &prefered_lft);
+	ifa = rtm_to_ifaddr(&ctx, nlh, &valid_lft, &prefered_lft);
 	if (IS_ERR(ifa))
 		return PTR_ERR(ifa);
 
@@ -881,8 +882,9 @@ static int inet_abc_len(__be32 addr)
 }
 
 
-int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int devinet_ioctl(struct net_ctx *net_ctx, unsigned int cmd, void __user *arg)
 {
+	struct net *net = net_ctx->net;
 	struct ifreq ifr;
 	struct sockaddr_in sin_orig;
 	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
@@ -1253,7 +1255,7 @@ static __be32 confirm_addr_indev(struct in_device *in_dev, __be32 dst,
  * - local: address, 0=autoselect the local address
  * - scope: maximum allowed scope value for the local address
  */
-__be32 inet_confirm_addr(struct net *net, struct in_device *in_dev,
+__be32 inet_confirm_addr(struct net_ctx *ctx, struct in_device *in_dev,
 			 __be32 dst, __be32 local, int scope)
 {
 	__be32 addr = 0;
@@ -1263,7 +1265,7 @@ __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev,
 		return confirm_addr_indev(in_dev, dst, local, scope);
 
 	rcu_read_lock();
-	for_each_netdev_rcu(net, dev) {
+	for_each_netdev_rcu(ctx->net, dev) {
 		in_dev = __in_dev_get_rcu(dev);
 		if (in_dev) {
 			addr = confirm_addr_indev(in_dev, dst, local, scope);
@@ -1532,7 +1534,8 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
 
 static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(skb->sk);
+	struct net *net = sk_ctx.net;
 	int h, s_h;
 	int idx, s_idx;
 	int ip_idx, s_ip_idx;
@@ -1854,7 +1857,8 @@ static int inet_netconf_get_devconf(struct sk_buff *in_skb,
 static int inet_netconf_dump_devconf(struct sk_buff *skb,
 				     struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(skb->sk);
+	struct net *net = sk_ctx.net;
 	int h, s_h;
 	int idx, s_idx;
 	struct net_device *dev;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 57be71dd6a9e..b068ab996cc3 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -150,10 +150,11 @@ static void fib_flush(struct net *net)
  * Find address type as if only "dev" was present in the system. If
  * on_dev is NULL then all interfaces are taken into consideration.
  */
-static inline unsigned int __inet_dev_addr_type(struct net *net,
+static inline unsigned int __inet_dev_addr_type(struct net_ctx *ctx,
 						const struct net_device *dev,
 						__be32 addr)
 {
+	struct net *net = ctx->net;
 	struct flowi4		fl4 = { .daddr = addr };
 	struct fib_result	res;
 	unsigned int ret = RTN_BROADCAST;
@@ -179,16 +180,17 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
-unsigned int inet_addr_type(struct net *net, __be32 addr)
+unsigned int inet_addr_type(struct net_ctx *ctx, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(ctx, NULL, addr);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
-unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
+unsigned int inet_dev_addr_type(struct net_ctx *ctx,
+				const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	return __inet_dev_addr_type(ctx, dev, addr);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
@@ -199,7 +201,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 	struct fib_result res;
 	struct rtable *rt;
 	struct flowi4 fl4;
-	struct net *net;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	int scope;
 
 	rt = skb_rtable(skb);
@@ -210,8 +212,6 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 	in_dev = __in_dev_get_rcu(dev);
 	BUG_ON(!in_dev);
 
-	net = dev_net(dev);
-
 	scope = RT_SCOPE_UNIVERSE;
 	if (!ipv4_is_zeronet(ip_hdr(skb)->saddr)) {
 		fl4.flowi4_oif = 0;
@@ -221,8 +221,8 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 		fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 		fl4.flowi4_scope = scope;
 		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
-		if (!fib_lookup(net, &fl4, &res))
-			return FIB_RES_PREFSRC(net, res);
+		if (!fib_lookup(&dev_ctx, &fl4, &res))
+			return FIB_RES_PREFSRC(&dev_ctx, res);
 	} else {
 		scope = RT_SCOPE_LINK;
 	}
@@ -245,7 +245,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	int ret, no_addr;
 	struct fib_result res;
 	struct flowi4 fl4;
-	struct net *net;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	bool dev_match;
 
 	fl4.flowi4_oif = 0;
@@ -259,8 +259,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 
 	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
-	net = dev_net(dev);
-	if (fib_lookup(net, &fl4, &res))
+	if (fib_lookup(&dev_ctx, &fl4, &res))
 		goto last_resort;
 	if (res.type != RTN_UNICAST &&
 	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
@@ -295,7 +294,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fl4.flowi4_oif = dev->ifindex;
 
 	ret = 0;
-	if (fib_lookup(net, &fl4, &res) == 0) {
+	if (fib_lookup(&dev_ctx, &fl4, &res) == 0) {
 		if (res.type == RTN_UNICAST)
 			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 	}
@@ -346,9 +345,10 @@ static int put_rtax(struct nlattr *mx, int len, int type, u32 value)
 	return len + nla_total_size(4);
 }
 
-static int rtentry_to_fib_config(struct net *net, int cmd, struct rtentry *rt,
-				 struct fib_config *cfg)
+static int rtentry_to_fib_config(struct net_ctx *ctx, int cmd,
+				 struct rtentry *rt, struct fib_config *cfg)
 {
+	struct net *net = ctx->net;
 	__be32 addr;
 	int plen;
 
@@ -437,7 +437,7 @@ static int rtentry_to_fib_config(struct net *net, int cmd, struct rtentry *rt,
 	if (rt->rt_gateway.sa_family == AF_INET && addr) {
 		cfg->fc_gw = addr;
 		if (rt->rt_flags & RTF_GATEWAY &&
-		    inet_addr_type(net, addr) == RTN_UNICAST)
+		    inet_addr_type(ctx, addr) == RTN_UNICAST)
 			cfg->fc_scope = RT_SCOPE_UNIVERSE;
 	}
 
@@ -478,8 +478,9 @@ static int rtentry_to_fib_config(struct net *net, int cmd, struct rtentry *rt,
  * Handle IP routing ioctl calls.
  * These are used to manipulate the routing tables
  */
-int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int ip_rt_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 {
+	struct net *net = ctx->net;
 	struct fib_config cfg;
 	struct rtentry rt;
 	int err;
@@ -494,7 +495,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 			return -EFAULT;
 
 		rtnl_lock();
-		err = rtentry_to_fib_config(net, cmd, &rt, &cfg);
+		err = rtentry_to_fib_config(ctx, cmd, &rt, &cfg);
 		if (err == 0) {
 			struct fib_table *tb;
 
@@ -534,7 +535,7 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
 	[RTA_FLOW]		= { .type = NLA_U32 },
 };
 
-static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
+static int rtm_to_fib_config(struct net_ctx *ctx, struct sk_buff *skb,
 			     struct nlmsghdr *nlh, struct fib_config *cfg)
 {
 	struct nlattr *attr;
@@ -559,7 +560,7 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
 
 	cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
 	cfg->fc_nlinfo.nlh = nlh;
-	cfg->fc_nlinfo.nl_net = net;
+	cfg->fc_nlinfo.nl_net = ctx->net;
 
 	if (cfg->fc_type > RTN_MAX) {
 		err = -EINVAL;
@@ -607,12 +608,13 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
 
 static int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(skb->sk);
+	struct net *net = sk_ctx.net;
 	struct fib_config cfg;
 	struct fib_table *tb;
 	int err;
 
-	err = rtm_to_fib_config(net, skb, nlh, &cfg);
+	err = rtm_to_fib_config(&sk_ctx, skb, nlh, &cfg);
 	if (err < 0)
 		goto errout;
 
@@ -629,12 +631,13 @@ static int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(skb);
+	struct net *net = sk_ctx.net;
 	struct fib_config cfg;
 	struct fib_table *tb;
 	int err;
 
-	err = rtm_to_fib_config(net, skb, nlh, &cfg);
+	err = rtm_to_fib_config(&sk_ctx, skb, nlh, &cfg);
 	if (err < 0)
 		goto errout;
 
@@ -897,19 +900,21 @@ void fib_del_ifaddr(struct in_ifaddr *ifa, struct in_ifaddr *iprim)
 			fib_magic(RTM_DELROUTE, RTN_BROADCAST, any, 32, prim);
 	}
 	if (!(ok & LOCAL_OK)) {
+		struct net_ctx dev_ctx = DEV_NET_CTX(dev);
+
 		fib_magic(RTM_DELROUTE, RTN_LOCAL, ifa->ifa_local, 32, prim);
 
 		/* Check, that this local address finally disappeared. */
 		if (gone &&
-		    inet_addr_type(dev_net(dev), ifa->ifa_local) != RTN_LOCAL) {
+		    inet_addr_type(&dev_ctx, ifa->ifa_local) != RTN_LOCAL) {
 			/* And the last, but not the least thing.
 			 * We must flush stray FIB entries.
 			 *
 			 * First of all, we scan fib_info list searching
 			 * for stray nexthop entries, then ignite fib_flush.
 			 */
-			if (fib_sync_down_addr(dev_net(dev), ifa->ifa_local))
-				fib_flush(dev_net(dev));
+			if (fib_sync_down_addr(&dev_ctx, ifa->ifa_local))
+				fib_flush(dev_ctx.net);
 		}
 	}
 #undef LOCAL_OK
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index d3db718be51d..60b14866661b 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -47,7 +47,7 @@ struct fib4_rule {
 #endif
 };
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
+int __fib_lookup(struct net_ctx *ctx, struct flowi4 *flp, struct fib_result *res)
 {
 	struct fib_lookup_arg arg = {
 		.result = res,
@@ -55,7 +55,8 @@ int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
 	};
 	int err;
 
-	err = fib_rules_lookup(net->ipv4.rules_ops, flowi4_to_flowi(flp), 0, &arg);
+	err = fib_rules_lookup(ctx->net->ipv4.rules_ops, flowi4_to_flowi(flp),
+			       0, &arg);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	if (arg.rule)
 		res->tclassid = ((struct fib4_rule *)arg.rule)->tclassid;
@@ -288,7 +289,7 @@ static size_t fib4_rule_nlmsg_payload(struct fib_rule *rule)
 
 static void fib4_rule_flush_cache(struct fib_rules_ops *ops)
 {
-	rt_cache_flush(ops->fro_net);
+	rt_cache_flush(ops->fro_net_ctx.net);
 }
 
 static const struct fib_rules_ops __net_initconst fib4_rules_ops_template = {
@@ -330,8 +331,9 @@ int __net_init fib4_rules_init(struct net *net)
 {
 	int err;
 	struct fib_rules_ops *ops;
+	struct net_ctx ctx = { .net = net };
 
-	ops = fib_rules_register(&fib4_rules_ops_template, net);
+	ops = fib_rules_register(&fib4_rules_ops_template, &ctx);
 	if (IS_ERR(ops))
 		return PTR_ERR(ops);
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1e2090ea663e..99af28c2fb6d 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -303,12 +303,13 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
 	struct hlist_head *head;
 	struct fib_info *fi;
 	unsigned int hash;
+	const struct net_ctx *nfi_ctx = &nfi->fib_net_ctx;
 
 	hash = fib_info_hashfn(nfi);
 	head = &fib_info_hash[hash];
 
 	hlist_for_each_entry(fi, head, fib_hash) {
-		if (!net_eq(fi->fib_net, nfi->fib_net))
+		if (!fib_net_ctx_eq(fi, nfi_ctx))
 			continue;
 		if (fi->fib_nhs != nfi->fib_nhs)
 			continue;
@@ -587,10 +588,9 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			struct fib_nh *nh)
 {
 	int err;
-	struct net *net;
+	struct net_ctx *net_ctx = &cfg->fc_nlinfo.nl_net_ctx;
 	struct net_device *dev;
 
-	net = cfg->fc_nlinfo.nl_net;
 	if (nh->nh_gw) {
 		struct fib_result res;
 
@@ -598,9 +598,9 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 
 			if (cfg->fc_scope >= RT_SCOPE_LINK)
 				return -EINVAL;
-			if (inet_addr_type(net, nh->nh_gw) != RTN_UNICAST)
+			if (inet_addr_type(net_ctx, nh->nh_gw) != RTN_UNICAST)
 				return -EINVAL;
-			dev = __dev_get_by_index(net, nh->nh_oif);
+			dev = __dev_get_by_index_ctx(net_ctx, nh->nh_oif);
 			if (!dev)
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
@@ -622,7 +622,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res);
+			err = fib_lookup(net_ctx, &fl4, &res);
 			if (err) {
 				rcu_read_unlock();
 				return err;
@@ -646,7 +646,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 
 		rcu_read_lock();
 		err = -ENODEV;
-		in_dev = inetdev_by_index(net, nh->nh_oif);
+		in_dev = inetdev_by_index(net_ctx, nh->nh_oif);
 		if (in_dev == NULL)
 			goto out;
 		err = -ENETDOWN;
@@ -748,8 +748,10 @@ static void fib_info_hash_move(struct hlist_head *new_info_hash,
 	fib_info_hash_free(old_laddrhash, bytes);
 }
 
-__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
+__be32 fib_info_update_nh_saddr(struct net_ctx *net_ctx, struct fib_nh *nh)
 {
+	struct net *net = net_ctx->net;
+
 	nh->nh_saddr = inet_select_addr(nh->nh_dev,
 					nh->nh_gw,
 					nh->nh_parent->fib_scope);
@@ -764,7 +766,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 	struct fib_info *fi = NULL;
 	struct fib_info *ofi;
 	int nhs = 1;
-	struct net *net = cfg->fc_nlinfo.nl_net;
+	struct net_ctx *net_ctx = &cfg->fc_nlinfo.nl_net_ctx;
+	struct net *net = net_ctx->net;
 
 	if (cfg->fc_type > RTN_MAX)
 		goto err_inval;
@@ -935,12 +938,12 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 	if (fi->fib_prefsrc) {
 		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
 		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
+			if (inet_addr_type(net_ctx, fi->fib_prefsrc) != RTN_LOCAL)
 				goto err_inval;
 	}
 
 	change_nexthops(fi) {
-		fib_info_update_nh_saddr(net, nexthop_nh);
+		fib_info_update_nh_saddr(net_ctx, nexthop_nh);
 	} endfor_nexthops(fi)
 
 link_it:
@@ -1087,7 +1090,7 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
  *   referring to it.
  * - device went down -> we must shutdown all nexthops going via it.
  */
-int fib_sync_down_addr(struct net *net, __be32 local)
+int fib_sync_down_addr(struct net_ctx *net_ctx, __be32 local)
 {
 	int ret = 0;
 	unsigned int hash = fib_laddr_hashfn(local);
@@ -1098,7 +1101,7 @@ int fib_sync_down_addr(struct net *net, __be32 local)
 		return 0;
 
 	hlist_for_each_entry(fi, head, fib_lhash) {
-		if (!net_eq(fi->fib_net, net))
+		if (!fib_net_ctx_eq(fi, net_ctx))
 			continue;
 		if (fi->fib_prefsrc == local) {
 			fi->fib_flags |= RTNH_F_DEAD;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 5e564014a0b7..f64de76f55ef 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -389,6 +389,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	struct ipcm_cookie ipc;
 	struct rtable *rt = skb_rtable(skb);
 	struct net *net = dev_net(rt->dst.dev);
+	struct net_ctx dev_ctx = { .net = net };
 	struct flowi4 fl4;
 	struct sock *sk;
 	struct inet_sock *inet;
@@ -426,7 +427,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_key(net, &fl4);
+	rt = ip_route_output_key(&dev_ctx, &fl4);
 	if (IS_ERR(rt))
 		goto out_unlock;
 	if (icmpv4_xrlim_allow(net, rt, &fl4, icmp_param->data.icmph.type,
@@ -437,7 +438,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	icmp_xmit_unlock(sk);
 }
 
-static struct rtable *icmp_route_lookup(struct net *net,
+static struct rtable *icmp_route_lookup(struct net_ctx *ctx,
 					struct flowi4 *fl4,
 					struct sk_buff *skb_in,
 					const struct iphdr *iph,
@@ -459,14 +460,14 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
-	rt = __ip_route_output_key(net, fl4);
+	rt = __ip_route_output_key(ctx, fl4);
 	if (IS_ERR(rt))
 		return rt;
 
 	/* No need to clone since we're just using its address. */
 	rt2 = rt;
 
-	rt = (struct rtable *) xfrm_lookup(net, &rt->dst,
+	rt = (struct rtable *) xfrm_lookup(ctx, &rt->dst,
 					   flowi4_to_flowi(fl4), NULL, 0);
 	if (!IS_ERR(rt)) {
 		if (rt != rt2)
@@ -480,8 +481,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	if (err)
 		goto relookup_failed;
 
-	if (inet_addr_type(net, fl4_dec.saddr) == RTN_LOCAL) {
-		rt2 = __ip_route_output_key(net, &fl4_dec);
+	if (inet_addr_type(ctx, fl4_dec.saddr) == RTN_LOCAL) {
+		rt2 = __ip_route_output_key(ctx, &fl4_dec);
 		if (IS_ERR(rt2))
 			err = PTR_ERR(rt2);
 	} else {
@@ -489,7 +490,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
 		unsigned long orefdst;
 
 		fl4_2.daddr = fl4_dec.saddr;
-		rt2 = ip_route_output_key(net, &fl4_2);
+		rt2 = ip_route_output_key(ctx, &fl4_2);
 		if (IS_ERR(rt2)) {
 			err = PTR_ERR(rt2);
 			goto relookup_failed;
@@ -507,7 +508,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	if (err)
 		goto relookup_failed;
 
-	rt2 = (struct rtable *) xfrm_lookup(net, &rt2->dst,
+	rt2 = (struct rtable *) xfrm_lookup(ctx, &rt2->dst,
 					    flowi4_to_flowi(&fl4_dec), NULL,
 					    XFRM_LOOKUP_ICMP);
 	if (!IS_ERR(rt2)) {
@@ -552,12 +553,14 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	__be32 saddr;
 	u8  tos;
 	u32 mark;
+	struct net_ctx dev_ctx;
 	struct net *net;
 	struct sock *sk;
 
 	if (!rt)
 		goto out;
 	net = dev_net(rt->dst.dev);
+	dev_ctx.net = net;
 
 	/*
 	 *	Find the original header. It is expected to be valid, of course.
@@ -641,7 +644,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 		rcu_read_lock();
 		if (rt_is_input_route(rt) &&
 		    net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr)
-			dev = dev_get_by_index_rcu(net, inet_iif(skb_in));
+			dev = dev_get_by_index_rcu_ctx(&dev_ctx, inet_iif(skb_in));
 
 		if (dev)
 			saddr = inet_select_addr(dev, 0, RT_SCOPE_LINK);
@@ -677,7 +680,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	ipc.ttl = 0;
 	ipc.tos = -1;
 
-	rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos, mark,
+	rt = icmp_route_lookup(&dev_ctx, &fl4, skb_in, iph, saddr, tos, mark,
 			       type, code, icmp_param);
 	if (IS_ERR(rt))
 		goto out_unlock;
@@ -750,11 +753,10 @@ static bool icmp_unreach(struct sk_buff *skb)
 {
 	const struct iphdr *iph;
 	struct icmphdr *icmph;
-	struct net *net;
+	struct net_ctx dev_ctx = SKB_NET_CTX_DST(skb);
+	struct net *net = dev_ctx.net;
 	u32 info = 0;
 
-	net = dev_net(skb_dst(skb)->dev);
-
 	/*
 	 *	Incomplete header ?
 	 * 	Only checks for the IP header, there should be an
@@ -828,7 +830,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 	 */
 
 	if (!net->ipv4.sysctl_icmp_ignore_bogus_error_responses &&
-	    inet_addr_type(net, iph->daddr) == RTN_BROADCAST) {
+	    inet_addr_type(&dev_ctx, iph->daddr) == RTN_BROADCAST) {
 		net_warn_ratelimited("%pI4 sent an invalid ICMP type %u, code %u error to a broadcast: %pI4 on %s\n",
 				     &ip_hdr(skb)->saddr,
 				     icmph->type, icmph->code,
@@ -1044,7 +1046,7 @@ void icmp_err(struct sk_buff *skb, u32 info)
 	struct icmphdr *icmph = (struct icmphdr *)(skb->data + offset);
 	int type = icmp_hdr(skb)->type;
 	int code = icmp_hdr(skb)->code;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
 
 	/*
 	 * Use ping_err to handle all icmp errors except those
@@ -1056,9 +1058,9 @@ void icmp_err(struct sk_buff *skb, u32 info)
 	}
 
 	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
-		ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ICMP, 0);
+		ipv4_update_pmtu(skb, &dev_ctx, info, 0, 0, IPPROTO_ICMP, 0);
 	else if (type == ICMP_REDIRECT)
-		ipv4_redirect(skb, net, 0, 0, IPPROTO_ICMP, 0);
+		ipv4_redirect(skb, &dev_ctx, 0, 0, IPPROTO_ICMP, 0);
 }
 
 /*
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 666cf364df86..86aa303a1cf7 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -324,7 +324,7 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, unsigned int mtu)
 	struct rtable *rt;
 	struct iphdr *pip;
 	struct igmpv3_report *pig;
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	struct flowi4 fl4;
 	int hlen = LL_RESERVED_SPACE(dev);
 	int tlen = dev->needed_tailroom;
@@ -341,7 +341,7 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, unsigned int mtu)
 	}
 	skb->priority = TC_PRIO_CONTROL;
 
-	rt = ip_route_output_ports(net, &fl4, NULL, IGMPV3_ALL_MCR, 0,
+	rt = ip_route_output_ports(&dev_ctx, &fl4, NULL, IGMPV3_ALL_MCR, 0,
 				   0, 0,
 				   IPPROTO_IGMP, 0, dev->ifindex);
 	if (IS_ERR(rt)) {
@@ -669,7 +669,7 @@ static int igmp_send_report(struct in_device *in_dev, struct ip_mc_list *pmc,
 	struct igmphdr *ih;
 	struct rtable *rt;
 	struct net_device *dev = in_dev->dev;
-	struct net *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	__be32	group = pmc ? pmc->multiaddr : 0;
 	struct flowi4 fl4;
 	__be32	dst;
@@ -682,7 +682,7 @@ static int igmp_send_report(struct in_device *in_dev, struct ip_mc_list *pmc,
 	else
 		dst = group;
 
-	rt = ip_route_output_ports(net, &fl4, NULL, dst, 0,
+	rt = ip_route_output_ports(&dev_ctx, &fl4, NULL, dst, 0,
 				   0, 0,
 				   IPPROTO_IGMP, 0, dev->ifindex);
 	if (IS_ERR(rt))
@@ -1503,23 +1503,23 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
 }
 
 /* RTNL is locked */
-static struct in_device *ip_mc_find_dev(struct net *net, struct ip_mreqn *imr)
+static struct in_device *ip_mc_find_dev(struct net_ctx *ctx, struct ip_mreqn *imr)
 {
 	struct net_device *dev = NULL;
 	struct in_device *idev = NULL;
 
 	if (imr->imr_ifindex) {
-		idev = inetdev_by_index(net, imr->imr_ifindex);
+		idev = inetdev_by_index(ctx, imr->imr_ifindex);
 		return idev;
 	}
 	if (imr->imr_address.s_addr) {
-		dev = __ip_dev_find(net, imr->imr_address.s_addr, false);
+		dev = __ip_dev_find(ctx, imr->imr_address.s_addr, false);
 		if (!dev)
 			return NULL;
 	}
 
 	if (!dev) {
-		struct rtable *rt = ip_route_output(net,
+		struct rtable *rt = ip_route_output(ctx,
 						    imr->imr_multiaddr.s_addr,
 						    0, 0, 0);
 		if (!IS_ERR(rt)) {
@@ -1860,7 +1860,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 	struct ip_mc_socklist *iml = NULL, *i;
 	struct in_device *in_dev;
 	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	int ifindex;
 	int count = 0;
 
@@ -1869,7 +1869,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 
 	rtnl_lock();
 
-	in_dev = ip_mc_find_dev(net, imr);
+	in_dev = ip_mc_find_dev(&sk_ctx, imr);
 
 	if (!in_dev) {
 		iml = NULL;
@@ -1935,13 +1935,13 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 	struct ip_mc_socklist *iml;
 	struct ip_mc_socklist __rcu **imlp;
 	struct in_device *in_dev;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	__be32 group = imr->imr_multiaddr.s_addr;
 	u32 ifindex;
 	int ret = -EADDRNOTAVAIL;
 
 	rtnl_lock();
-	in_dev = ip_mc_find_dev(net, imr);
+	in_dev = ip_mc_find_dev(&sk_ctx, imr);
 	if (!in_dev) {
 		ret = -ENODEV;
 		goto out;
@@ -1986,7 +1986,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 	struct in_device *in_dev = NULL;
 	struct inet_sock *inet = inet_sk(sk);
 	struct ip_sf_socklist *psl;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	int leavegroup = 0;
 	int i, j, rv;
 
@@ -1998,7 +1998,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 	imr.imr_multiaddr.s_addr = mreqs->imr_multiaddr;
 	imr.imr_address.s_addr = mreqs->imr_interface;
 	imr.imr_ifindex = ifindex;
-	in_dev = ip_mc_find_dev(net, &imr);
+	in_dev = ip_mc_find_dev(&sk_ctx, &imr);
 
 	if (!in_dev) {
 		err = -ENODEV;
@@ -2122,7 +2122,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 	struct in_device *in_dev;
 	struct inet_sock *inet = inet_sk(sk);
 	struct ip_sf_socklist *newpsl, *psl;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	int leavegroup = 0;
 
 	if (!ipv4_is_multicast(addr))
@@ -2136,7 +2136,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 	imr.imr_multiaddr.s_addr = msf->imsf_multiaddr;
 	imr.imr_address.s_addr = msf->imsf_interface;
 	imr.imr_ifindex = ifindex;
-	in_dev = ip_mc_find_dev(net, &imr);
+	in_dev = ip_mc_find_dev(&sk_ctx, &imr);
 
 	if (!in_dev) {
 		err = -ENODEV;
@@ -2209,7 +2209,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	struct in_device *in_dev;
 	struct inet_sock *inet = inet_sk(sk);
 	struct ip_sf_socklist *psl;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	if (!ipv4_is_multicast(addr))
 		return -EINVAL;
@@ -2219,7 +2219,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	imr.imr_multiaddr.s_addr = msf->imsf_multiaddr;
 	imr.imr_address.s_addr = msf->imsf_interface;
 	imr.imr_ifindex = 0;
-	in_dev = ip_mc_find_dev(net, &imr);
+	in_dev = ip_mc_find_dev(&sk_ctx, &imr);
 
 	if (!in_dev) {
 		err = -ENODEV;
@@ -2366,7 +2366,7 @@ void ip_mc_drop_socket(struct sock *sk)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct ip_mc_socklist *iml;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	if (inet->mc_list == NULL)
 		return;
@@ -2376,7 +2376,7 @@ void ip_mc_drop_socket(struct sock *sk)
 		struct in_device *in_dev;
 
 		inet->mc_list = iml->next_rcu;
-		in_dev = inetdev_by_index(net, iml->multi.imr_ifindex);
+		in_dev = inetdev_by_index(&sk_ctx, iml->multi.imr_ifindex);
 		(void) ip_mc_leave_src(sk, iml, in_dev);
 		if (in_dev != NULL)
 			ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
@@ -2442,7 +2442,8 @@ struct igmp_mc_iter_state {
 
 static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 {
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
+	struct net *net = ctx->net;
 	struct ip_mc_list *im = NULL;
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
 
@@ -2585,7 +2586,8 @@ struct igmp_mcf_iter_state {
 
 static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 {
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
+	struct net *net = ctx->net;
 	struct ip_sf_list *psf = NULL;
 	struct ip_mc_list *im = NULL;
 	struct igmp_mcf_iter_state *state = igmp_mcf_seq_private(seq);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 14d02ea905b6..b3580594d08a 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -95,7 +95,8 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct inet_bind_hashbucket *head;
 	struct inet_bind_bucket *tb;
 	int ret, attempts = 5;
-	struct net *net = sock_net(sk);
+	struct net_ctx net_ctx = SOCK_NET_CTX(sk);
+	struct net *net = net_ctx.net;
 	int smallest_size = -1, smallest_rover;
 	kuid_t uid = sock_i_uid(sk);
 
@@ -116,7 +117,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 					hashinfo->bhash_size)];
 			spin_lock(&head->lock);
 			inet_bind_bucket_for_each(tb, &head->chain)
-				if (net_eq(ib_net(tb), net) && tb->port == rover) {
+				if (ib_net_ctx_eq(tb, &net_ctx) && tb->port == rover) {
 					if (((tb->fastreuse > 0 &&
 					      sk->sk_reuse &&
 					      sk->sk_state != TCP_LISTEN) ||
@@ -170,7 +171,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 				hashinfo->bhash_size)];
 		spin_lock(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == snum)
+			if (ib_net_ctx_eq(tb, &net_ctx) && tb->port == snum)
 				goto tb_found;
 	}
 	tb = NULL;
@@ -204,7 +205,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 tb_not_found:
 	ret = 1;
 	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
-					net, head, snum)) == NULL)
+					&net_ctx, head, snum)) == NULL)
 		goto fail_unlock;
 	if (hlist_empty(&tb->owners)) {
 		if (sk->sk_reuse && sk->sk_state != TCP_LISTEN)
@@ -403,6 +404,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct ip_options_rcu *opt = inet_rsk(req)->opt;
 	struct net *net = sock_net(sk);
+	struct net_ctx ctx = { .net = net };
 	int flags = inet_sk_flowi_flags(sk);
 
 	flowi4_init_output(fl4, sk->sk_bound_dev_if, ireq->ir_mark,
@@ -412,7 +414,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->ir_rmt_addr,
 			   ireq->ir_loc_addr, ireq->ir_rmt_port, inet_sk(sk)->inet_sport);
 	security_req_classify_flow(req, flowi4_to_flowi(fl4));
-	rt = ip_route_output_flow(net, fl4, sk);
+	rt = ip_route_output_flow(&ctx, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
 	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
@@ -435,6 +437,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 	struct inet_sock *newinet = inet_sk(newsk);
 	struct ip_options_rcu *opt;
 	struct net *net = sock_net(sk);
+	struct net_ctx ctx = { .net = net };
 	struct flowi4 *fl4;
 	struct rtable *rt;
 
@@ -448,7 +451,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->ir_rmt_addr,
 			   ireq->ir_loc_addr, ireq->ir_rmt_port, inet_sk(sk)->inet_sport);
 	security_req_classify_flow(req, flowi4_to_flowi(fl4));
-	rt = ip_route_output_flow(net, fl4, sk);
+	rt = ip_route_output_flow(&ctx, fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
 	if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway)
@@ -898,13 +901,14 @@ static struct dst_entry *inet_csk_rebuild_route(struct sock *sk, struct flowi *f
 	__be32 daddr = inet->inet_daddr;
 	struct flowi4 *fl4;
 	struct rtable *rt;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	rcu_read_lock();
 	inet_opt = rcu_dereference(inet->inet_opt);
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
 	fl4 = &fl->u.ip4;
-	rt = ip_route_output_ports(sock_net(sk), fl4, sk, daddr,
+	rt = ip_route_output_ports(&sk_ctx, fl4, sk, daddr,
 				   inet->inet_saddr, inet->inet_dport,
 				   inet->inet_sport, sk->sk_protocol,
 				   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 81751f12645f..c4691505014b 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -296,17 +296,18 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *in_s
 	int err;
 	struct sock *sk;
 	struct sk_buff *rep;
-	struct net *net = sock_net(in_skb->sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(in_skb->sk);
+	struct net *net = sk_ctx.net;
 
 	err = -EINVAL;
 	if (req->sdiag_family == AF_INET) {
-		sk = inet_lookup(net, hashinfo, req->id.idiag_dst[0],
+		sk = inet_lookup(&sk_ctx, hashinfo, req->id.idiag_dst[0],
 				 req->id.idiag_dport, req->id.idiag_src[0],
 				 req->id.idiag_sport, req->id.idiag_if);
 	}
 #if IS_ENABLED(CONFIG_IPV6)
 	else if (req->sdiag_family == AF_INET6) {
-		sk = inet6_lookup(net, hashinfo,
+		sk = inet6_lookup(&sk_ctx, hashinfo,
 				  (struct in6_addr *)req->id.idiag_dst,
 				  req->id.idiag_dport,
 				  (struct in6_addr *)req->id.idiag_src,
@@ -842,7 +843,7 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 {
 	int i, num;
 	int s_i, s_num;
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx ctx = SOCK_NET_CTX(skb->sk);
 
 	s_i = cb->args[1];
 	s_num = num = cb->args[2];
@@ -862,7 +863,7 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 			sk_nulls_for_each(sk, node, &ilb->head) {
 				struct inet_sock *inet = inet_sk(sk);
 
-				if (!net_eq(sock_net(sk), net))
+				if (!sock_net_ctx_eq(sk, &ctx))
 					continue;
 
 				if (num < s_num) {
@@ -935,7 +936,7 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 			int res;
 			int state;
 
-			if (!net_eq(sock_net(sk), net))
+			if (!sock_net_ctx_eq(sk, &ctx))
 				continue;
 			if (num < s_num)
 				goto next_normal;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 1485dac0ead5..8b3d94ca634c 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -54,14 +54,14 @@ static unsigned int inet_sk_ehashfn(const struct sock *sk)
  * The bindhash mutex for snum's hash chain must be held here.
  */
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
-						 struct net *net,
+						 struct net_ctx *ctx,
 						 struct inet_bind_hashbucket *head,
 						 const unsigned short snum)
 {
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb != NULL) {
-		write_pnet(&tb->ib_net_ctx.net, hold_net(net));
+		write_pnet(&tb->ib_net_ctx.net, hold_net(ctx->net));
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
@@ -136,6 +136,7 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -146,13 +147,13 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
 		 * as that of the child socket. We have to look up or
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), sock_net(sk)) &&
+			if (ib_net_ctx_eq(tb, &sk_ctx) &&
 			    tb->port == port)
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port);
+						     &sk_ctx, head, port);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				return -ENOMEM;
@@ -166,14 +167,14 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
 }
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
-static inline int compute_score(struct sock *sk, struct net *net,
+static inline int compute_score(struct sock *sk, struct net_ctx *ctx,
 				const unsigned short hnum, const __be32 daddr,
 				const int dif)
 {
 	int score = -1;
 	struct inet_sock *inet = inet_sk(sk);
 
-	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
+	if (sock_net_ctx_eq(sk, ctx) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
 		__be32 rcv_saddr = inet->inet_rcv_saddr;
 		score = sk->sk_family == PF_INET ? 2 : 1;
@@ -199,12 +200,13 @@ static inline int compute_score(struct sock *sk, struct net *net,
  */
 
 
-struct sock *__inet_lookup_listener(struct net *net,
+struct sock *__inet_lookup_listener(struct net_ctx *ctx,
 				    struct inet_hashinfo *hashinfo,
 				    const __be32 saddr, __be16 sport,
 				    const __be32 daddr, const unsigned short hnum,
 				    const int dif)
 {
+	struct net *net = ctx->net;
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
 	unsigned int hash = inet_lhashfn(net, hnum);
@@ -217,7 +219,7 @@ struct sock *__inet_lookup_listener(struct net *net,
 	result = NULL;
 	hiscore = 0;
 	sk_nulls_for_each_rcu(sk, node, &ilb->head) {
-		score = compute_score(sk, net, hnum, daddr, dif);
+		score = compute_score(sk, ctx, hnum, daddr, dif);
 		if (score > hiscore) {
 			result = sk;
 			hiscore = score;
@@ -244,7 +246,7 @@ struct sock *__inet_lookup_listener(struct net *net,
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
 			result = NULL;
-		else if (unlikely(compute_score(result, net, hnum, daddr,
+		else if (unlikely(compute_score(result, ctx, hnum, daddr,
 				  dif) < hiscore)) {
 			sock_put(result);
 			goto begin;
@@ -268,12 +270,13 @@ void sock_gen_put(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sock_gen_put);
 
-struct sock *__inet_lookup_established(struct net *net,
+struct sock *__inet_lookup_established(struct net_ctx *ctx,
 				  struct inet_hashinfo *hashinfo,
 				  const __be32 saddr, const __be16 sport,
 				  const __be32 daddr, const u16 hnum,
 				  const int dif)
 {
+	struct net *net = ctx->net;
 	INET_ADDR_COOKIE(acookie, saddr, daddr);
 	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
 	struct sock *sk;
@@ -290,11 +293,11 @@ struct sock *__inet_lookup_established(struct net *net,
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
 		if (sk->sk_hash != hash)
 			continue;
-		if (likely(INET_MATCH(sk, net, acookie,
+		if (likely(INET_MATCH(sk, ctx, acookie,
 				      saddr, daddr, ports, dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
 				goto out;
-			if (unlikely(!INET_MATCH(sk, net, acookie,
+			if (unlikely(!INET_MATCH(sk, ctx, acookie,
 						 saddr, daddr, ports, dif))) {
 				sock_gen_put(sk);
 				goto begin;
@@ -329,7 +332,8 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 	int dif = sk->sk_bound_dev_if;
 	INET_ADDR_COOKIE(acookie, saddr, daddr);
 	const __portpair ports = INET_COMBINED_PORTS(inet->inet_dport, lport);
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 	unsigned int hash = inet_ehashfn(net, daddr, lport,
 					 saddr, inet->inet_dport);
 	struct inet_ehash_bucket *head = inet_ehash_bucket(hinfo, hash);
@@ -345,7 +349,7 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 		if (sk2->sk_hash != hash)
 			continue;
 
-		if (likely(INET_MATCH(sk2, net, acookie,
+		if (likely(INET_MATCH(sk2, &sk_ctx, acookie,
 					 saddr, daddr, ports, dif))) {
 			if (sk2->sk_state == TCP_TIME_WAIT) {
 				tw = inet_twsk(sk2);
@@ -485,7 +489,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	struct inet_bind_hashbucket *head;
 	struct inet_bind_bucket *tb;
 	int ret;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 	int twrefcnt = 1;
 
 	if (!snum) {
@@ -511,7 +516,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			 * unique enough.
 			 */
 			inet_bind_bucket_for_each(tb, &head->chain) {
-				if (net_eq(ib_net(tb), net) &&
+				if (ib_net_ctx_eq(tb, &sk_ctx) &&
 				    tb->port == port) {
 					if (tb->fastreuse >= 0 ||
 					    tb->fastreuseport >= 0)
@@ -525,7 +530,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			}
 
 			tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-					net, head, port);
+					&sk_ctx, head, port);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				break;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 3d4da2c16b6a..be5933f1f425 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -156,6 +156,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 	u8 protocol = ip_hdr(skb)->protocol;
 	struct sock *last = NULL;
 	struct net_device *dev = skb->dev;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	for (ra = rcu_dereference(ip_ra_chain); ra; ra = rcu_dereference(ra->next)) {
 		struct sock *sk = ra->sk;
@@ -166,7 +167,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 		if (sk && inet_sk(sk)->inet_num == protocol &&
 		    (!sk->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == dev->ifindex) &&
-		    net_eq(sock_net(sk), dev_net(dev))) {
+		    sock_net_ctx_eq(sk, &dev_ctx)) {
 			if (ip_is_fragment(ip_hdr(skb))) {
 				if (ip_defrag(skb, IP_DEFRAG_CALL_RA_CHAIN))
 					return true;
@@ -262,6 +263,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	struct ip_options *opt;
 	const struct iphdr *iph;
 	struct net_device *dev = skb->dev;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	/* It looks as overkill, because not all
 	   IP options require packet mangling.
@@ -279,7 +281,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	opt = &(IPCB(skb)->opt);
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 
-	if (ip_options_compile(dev_net(dev), opt, skb)) {
+	if (ip_options_compile(&dev_ctx, opt, skb)) {
 		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 5b3d91be2db0..b5e2f5860544 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -139,9 +139,10 @@ int __ip_options_echo(struct ip_options *dopt, struct sk_buff *skb,
 
 					if (soffset + 7 <= optlen) {
 						__be32 addr;
+						struct net_ctx ctx = SKB_NET_CTX_DST(skb);
 
 						memcpy(&addr, dptr+soffset-1, 4);
-						if (inet_addr_type(dev_net(skb_dst(skb)->dev), addr) != RTN_UNICAST) {
+						if (inet_addr_type(&ctx, addr) != RTN_UNICAST) {
 							dopt->ts_needtime = 1;
 							soffset += 8;
 						}
@@ -254,9 +255,10 @@ static void spec_dst_fill(__be32 *spec_dst, struct sk_buff *skb)
  * If opt == NULL, then skb->data should point to IP header.
  */
 
-int ip_options_compile(struct net *net,
+int ip_options_compile(struct net_ctx *net_ctx,
 		       struct ip_options *opt, struct sk_buff *skb)
 {
+	struct net *net = net_ctx->net;
 	__be32 spec_dst = htonl(INADDR_ANY);
 	unsigned char *pp_ptr = NULL;
 	struct rtable *rt = NULL;
@@ -399,7 +401,7 @@ int ip_options_compile(struct net *net,
 					{
 						__be32 addr;
 						memcpy(&addr, &optptr[optptr[2]-1], 4);
-						if (inet_addr_type(net, addr) == RTN_UNICAST)
+						if (inet_addr_type(net_ctx, addr) == RTN_UNICAST)
 							break;
 						if (skb)
 							timeptr = &optptr[optptr[2]+3];
@@ -516,13 +518,13 @@ static struct ip_options_rcu *ip_options_get_alloc(const int optlen)
 		       GFP_KERNEL);
 }
 
-static int ip_options_get_finish(struct net *net, struct ip_options_rcu **optp,
+static int ip_options_get_finish(struct net_ctx *net_ctx, struct ip_options_rcu **optp,
 				 struct ip_options_rcu *opt, int optlen)
 {
 	while (optlen & 3)
 		opt->opt.__data[optlen++] = IPOPT_END;
 	opt->opt.optlen = optlen;
-	if (optlen && ip_options_compile(net, &opt->opt, NULL)) {
+	if (optlen && ip_options_compile(net_ctx, &opt->opt, NULL)) {
 		kfree(opt);
 		return -EINVAL;
 	}
@@ -531,7 +533,7 @@ static int ip_options_get_finish(struct net *net, struct ip_options_rcu **optp,
 	return 0;
 }
 
-int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
+int ip_options_get_from_user(struct net_ctx *net_ctx, struct ip_options_rcu **optp,
 			     unsigned char __user *data, int optlen)
 {
 	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
@@ -542,10 +544,10 @@ int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
 		kfree(opt);
 		return -EFAULT;
 	}
-	return ip_options_get_finish(net, optp, opt, optlen);
+	return ip_options_get_finish(net_ctx, optp, opt, optlen);
 }
 
-int ip_options_get(struct net *net, struct ip_options_rcu **optp,
+int ip_options_get(struct net_ctx *net_ctx, struct ip_options_rcu **optp,
 		   unsigned char *data, int optlen)
 {
 	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
@@ -554,7 +556,7 @@ int ip_options_get(struct net *net, struct ip_options_rcu **optp,
 		return -ENOMEM;
 	if (optlen)
 		memcpy(opt->opt.__data, data, optlen);
-	return ip_options_get_finish(net, optp, opt, optlen);
+	return ip_options_get_finish(net_ctx, optp, opt, optlen);
 }
 
 void ip_forward_options(struct sk_buff *skb)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b50861b22b6b..855e003e43d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -383,6 +383,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 	rt = (struct rtable *)__sk_dst_check(sk, 0);
 	if (rt == NULL) {
 		__be32 daddr;
+		struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 		/* Use correct destination address if we have options. */
 		daddr = inet->inet_daddr;
@@ -393,7 +394,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 		 * keep trying until route appears or the connection times
 		 * itself out.
 		 */
-		rt = ip_route_output_ports(sock_net(sk), fl4, sk,
+		rt = ip_route_output_ports(&sk_ctx, fl4, sk,
 					   daddr, inet->inet_saddr,
 					   inet->inet_dport,
 					   inet->inet_sport,
@@ -1522,7 +1523,7 @@ static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = {
 	.uc_ttl		= -1,
 };
 
-void ip_send_unicast_reply(struct net *net, struct sk_buff *skb,
+void ip_send_unicast_reply(struct net_ctx *ctx, struct sk_buff *skb,
 			   const struct ip_options *sopt,
 			   __be32 daddr, __be32 saddr,
 			   const struct ip_reply_arg *arg,
@@ -1554,14 +1555,14 @@ void ip_send_unicast_reply(struct net *net, struct sk_buff *skb,
 	}
 
 	flowi4_init_output(&fl4, arg->bound_dev_if,
-			   IP4_REPLY_MARK(net, skb->mark),
+			   IP4_REPLY_MARK(ctx->net, skb->mark),
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
 			   ip_reply_arg_flowi_flags(arg),
 			   daddr, saddr,
 			   tcp_hdr(skb)->source, tcp_hdr(skb)->dest);
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_key(net, &fl4);
+	rt = ip_route_output_key(ctx, &fl4);
 	if (IS_ERR(rt))
 		return;
 
@@ -1572,7 +1573,7 @@ void ip_send_unicast_reply(struct net *net, struct sk_buff *skb,
 	sk->sk_priority = skb->priority;
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
-	sock_net_set(sk, net);
+	sock_net_set(sk, ctx->net);
 	__skb_queue_head_init(&sk->sk_write_queue);
 	sk->sk_sndbuf = sysctl_wmem_default;
 	err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 31d8c71986b4..8ab03f0431f5 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -219,7 +219,7 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
-int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
+int ip_cmsg_send(struct net_ctx *ctx, struct msghdr *msg, struct ipcm_cookie *ipc,
 		 bool allow_ipv6)
 {
 	int err, val;
@@ -249,7 +249,7 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
 		switch (cmsg->cmsg_type) {
 		case IP_RETOPTS:
 			err = cmsg->cmsg_len - CMSG_ALIGN(sizeof(struct cmsghdr));
-			err = ip_options_get(net, &ipc->opt, CMSG_DATA(cmsg),
+			err = ip_options_get(ctx, &ipc->opt, CMSG_DATA(cmsg),
 					     err < 40 ? err : 40);
 			if (err)
 				return err;
@@ -529,6 +529,8 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	int val = 0, err;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 
 	switch (optname) {
 	case IP_PKTINFO:
@@ -580,7 +582,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 
 		if (optlen > 40)
 			goto e_inval;
-		err = ip_options_get_from_user(sock_net(sk), &opt,
+		err = ip_options_get_from_user(&sk_ctx, &opt,
 					       optval, optlen);
 		if (err)
 			break;
@@ -736,7 +738,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 			break;
 		}
 
-		dev = dev_get_by_index(sock_net(sk), ifindex);
+		dev = dev_get_by_index(net, ifindex);
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
@@ -782,13 +784,13 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		}
 
 		if (!mreq.imr_ifindex) {
 			if (mreq.imr_address.s_addr == htonl(INADDR_ANY)) {
 				inet->mc_index = 0;
 				inet->mc_addr  = 0;
 				err = 0;
 				break;
 			}
-			dev = ip_dev_find(sock_net(sk), mreq.imr_address.s_addr);
+			dev = ip_dev_find(&sk_ctx, mreq.imr_address.s_addr);
 			if (dev)
 				mreq.imr_ifindex = dev->ifindex;
 		} else
 			if (dev)
 				mreq.imr_ifindex = dev->ifindex;
 		} else
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index b26376ef87f6..e25e3b67be76 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -329,11 +329,12 @@ set_sockaddr(struct sockaddr_in *sin, __be32 addr, __be16 port)
 
 static int __init ic_devinet_ioctl(unsigned int cmd, struct ifreq *arg)
 {
+	struct net_ctx ctx = { .net = &init_net };
 	int res;
 
 	mm_segment_t oldfs = get_fs();
 	set_fs(get_ds());
-	res = devinet_ioctl(&init_net, cmd, (struct ifreq __user *) arg);
+	res = devinet_ioctl(&ctx, cmd, (struct ifreq __user *) arg);
 	set_fs(oldfs);
 	return res;
 }
@@ -351,11 +352,12 @@ static int __init ic_dev_ioctl(unsigned int cmd, struct ifreq *arg)
 
 static int __init ic_route_ioctl(unsigned int cmd, struct rtentry *arg)
 {
+	struct net_ctx ctx = { .net = &init_net };
 	int res;
 
 	mm_segment_t oldfs = get_fs();
 	set_fs(get_ds());
-	res = ip_rt_ioctl(&init_net, cmd, (void __user *) arg);
+	res = ip_rt_ioctl(&ctx, cmd, (void __user *) arg);
 	set_fs(oldfs);
 	return res;
 }
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9d78427652d2..935f45f54862 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -127,7 +127,7 @@ static struct kmem_cache *mrt_cachep __read_mostly;
 static struct mr_table *ipmr_new_table(struct net *net, u32 id);
 static void ipmr_free_table(struct mr_table *mrt);
 
-static void ip_mr_forward(struct net *net, struct mr_table *mrt,
+static void ip_mr_forward(struct net_ctx *ctx, struct mr_table *mrt,
 			  struct sk_buff *skb, struct mfc_cache *cache,
 			  int local);
 static int ipmr_cache_report(struct mr_table *mrt,
@@ -244,11 +244,12 @@ static const struct fib_rules_ops __net_initconst ipmr_rules_ops_template = {
 
 static int __net_init ipmr_rules_init(struct net *net)
 {
+	struct net_ctx ctx = { .net = net };
 	struct fib_rules_ops *ops;
 	struct mr_table *mrt;
 	int err;
 
-	ops = fib_rules_register(&ipmr_rules_ops_template, net);
+	ops = fib_rules_register(&ipmr_rules_ops_template, &ctx);
 	if (IS_ERR(ops))
 		return PTR_ERR(ops);
 
@@ -710,9 +711,10 @@ static void ipmr_update_thresholds(struct mr_table *mrt, struct mfc_cache *cache
 	}
 }
 
-static int vif_add(struct net *net, struct mr_table *mrt,
+static int vif_add(struct net_ctx *ctx, struct mr_table *mrt,
 		   struct vifctl *vifc, int mrtsock)
 {
+	struct net *net = ctx->net;
 	int vifi = vifc->vifc_vifi;
 	struct vif_device *v = &mrt->vif_table[vifi];
 	struct net_device *dev;
@@ -764,7 +766,7 @@ static int vif_add(struct net *net, struct mr_table *mrt,
 				return -EADDRNOTAVAIL;
 			}
 		} else {
-			dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
+			dev = ip_dev_find(ctx, vifc->vifc_lcl_addr.s_addr);
 		}
 		if (!dev)
 			return -EADDRNOTAVAIL;
@@ -903,7 +905,7 @@ static struct mfc_cache *ipmr_cache_alloc_unres(void)
  *	A cache entry has gone into a resolved state from queued
  */
 
-static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt,
+static void ipmr_cache_resolve(struct net_ctx *ctx, struct mr_table *mrt,
 			       struct mfc_cache *uc, struct mfc_cache *c)
 {
 	struct sk_buff *skb;
@@ -927,9 +929,9 @@ static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt,
 				memset(&e->msg, 0, sizeof(e->msg));
 			}
 
-			rtnl_unicast(skb, net, NETLINK_CB(skb).portid);
+			rtnl_unicast(skb, ctx->net, NETLINK_CB(skb).portid);
 		} else {
-			ip_mr_forward(net, mrt, skb, c, 0);
+			ip_mr_forward(ctx, mrt, skb, c, 0);
 		}
 	}
 }
@@ -1121,7 +1123,7 @@ static int ipmr_mfc_delete(struct mr_table *mrt, struct mfcctl *mfc, int parent)
 	return -ENOENT;
 }
 
-static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
+static int ipmr_mfc_add(struct net_ctx *ctx, struct mr_table *mrt,
 			struct mfcctl *mfc, int mrtsock, int parent)
 {
 	bool found = false;
@@ -1190,7 +1192,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 	spin_unlock_bh(&mfc_unres_lock);
 
 	if (found) {
-		ipmr_cache_resolve(net, mrt, uc, c);
+		ipmr_cache_resolve(ctx, mrt, uc, c);
 		ipmr_cache_free(uc);
 	}
 	mroute_netlink_event(mrt, c, RTM_NEWROUTE);
@@ -1272,7 +1274,8 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 	int ret, parent = 0;
 	struct vifctl vif;
 	struct mfcctl mfc;
-	struct net *net = sock_net(sk);
+	struct net_ctx ctx = SOCK_NET_CTX(sk);
+	struct net *net = ctx.net;
 	struct mr_table *mrt;
 
 	if (sk->sk_type != SOCK_RAW ||
@@ -1324,7 +1327,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 			return -ENFILE;
 		rtnl_lock();
 		if (optname == MRT_ADD_VIF) {
-			ret = vif_add(net, mrt, &vif,
+			ret = vif_add(&ctx, mrt, &vif,
 				      sk == rtnl_dereference(mrt->mroute_sk));
 		} else {
 			ret = vif_delete(mrt, vif.vifc_vifi, 0, NULL);
@@ -1351,7 +1354,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 		if (optname == MRT_DEL_MFC || optname == MRT_DEL_MFC_PROXY)
 			ret = ipmr_mfc_delete(mrt, &mfc, parent);
 		else
-			ret = ipmr_mfc_add(net, mrt, &mfc,
+			ret = ipmr_mfc_add(&ctx, mrt, &mfc,
 					   sk == rtnl_dereference(mrt->mroute_sk),
 					   parent);
 		rtnl_unlock();
@@ -1687,7 +1690,7 @@ static inline int ipmr_forward_finish(struct sk_buff *skb)
  *	Processing handlers for ipmr_forward
  */
 
-static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
+static void ipmr_queue_xmit(struct net_ctx *ctx, struct mr_table *mrt,
 			    struct sk_buff *skb, struct mfc_cache *c, int vifi)
 {
 	const struct iphdr *iph = ip_hdr(skb);
@@ -1712,7 +1715,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 #endif
 
 	if (vif->flags & VIFF_TUNNEL) {
-		rt = ip_route_output_ports(net, &fl4, NULL,
+		rt = ip_route_output_ports(ctx, &fl4, NULL,
 					   vif->remote, vif->local,
 					   0, 0,
 					   IPPROTO_IPIP,
@@ -1721,7 +1724,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 			goto out_free;
 		encap = sizeof(struct iphdr);
 	} else {
-		rt = ip_route_output_ports(net, &fl4, NULL, iph->daddr, 0,
+		rt = ip_route_output_ports(ctx, &fl4, NULL, iph->daddr, 0,
 					   0, 0,
 					   IPPROTO_IPIP,
 					   RT_TOS(iph->tos), vif->link);
@@ -1800,7 +1803,7 @@ static int ipmr_find_vif(struct mr_table *mrt, struct net_device *dev)
 
 /* "local" means that we should preserve one skb (for local delivery) */
 
-static void ip_mr_forward(struct net *net, struct mr_table *mrt,
+static void ip_mr_forward(struct net_ctx *ctx, struct mr_table *mrt,
 			  struct sk_buff *skb, struct mfc_cache *cache,
 			  int local)
 {
@@ -1893,7 +1896,7 @@ static void ip_mr_forward(struct net *net, struct mr_table *mrt,
 				struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
 				if (skb2)
-					ipmr_queue_xmit(net, mrt, skb2, cache,
+					ipmr_queue_xmit(ctx, mrt, skb2, cache,
 							psend);
 			}
 			psend = ct;
@@ -1905,9 +1908,9 @@ static void ip_mr_forward(struct net *net, struct mr_table *mrt,
 			struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
 			if (skb2)
-				ipmr_queue_xmit(net, mrt, skb2, cache, psend);
+				ipmr_queue_xmit(ctx, mrt, skb2, cache, psend);
 		} else {
-			ipmr_queue_xmit(net, mrt, skb, cache, psend);
+			ipmr_queue_xmit(ctx, mrt, skb, cache, psend);
 			return;
 		}
 	}
@@ -1949,7 +1952,8 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
 int ip_mr_input(struct sk_buff *skb)
 {
 	struct mfc_cache *cache;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
+	struct net *net = dev_ctx.net;
 	int local = skb_rtable(skb)->rt_flags & RTCF_LOCAL;
 	struct mr_table *mrt;
 
@@ -2024,7 +2028,7 @@ int ip_mr_input(struct sk_buff *skb)
 	}
 
 	read_lock(&mrt_lock);
-	ip_mr_forward(net, mrt, skb, cache, local);
+	ip_mr_forward(&dev_ctx, mrt, skb, cache, local);
 	read_unlock(&mrt_lock);
 
 	if (local)
@@ -2046,6 +2050,7 @@ static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb,
 {
 	struct net_device *reg_dev = NULL;
 	struct iphdr *encap;
+	struct net_ctx dev_ctx;
 
 	encap = (struct iphdr *)(skb_transport_header(skb) + pimlen);
 	/*
@@ -2073,7 +2078,8 @@ static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb,
 	skb->protocol = htons(ETH_P_IP);
 	skb->ip_summed = CHECKSUM_NONE;
 
-	skb_tunnel_rx(skb, reg_dev, dev_net(reg_dev));
+	dev_ctx.net = dev_net(reg_dev);
+	skb_tunnel_rx(skb, reg_dev, &dev_ctx);
 
 	netif_rx(skb);
 
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 7ebd6e37875c..a10ab84b69d8 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -19,7 +19,7 @@
 /* route_me_harder function, used by iptable_nat, iptable_mangle + ip_queue */
 int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 {
-	struct net *net = dev_net(skb_dst(skb)->dev);
+	struct net_ctx ctx = SKB_NET_CTX_DST(skb);
 	const struct iphdr *iph = ip_hdr(skb);
 	struct rtable *rt;
 	struct flowi4 fl4 = {};
@@ -28,7 +28,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 	unsigned int hh_len;
 
 	if (addr_type == RTN_UNSPEC)
-		addr_type = inet_addr_type(net, saddr);
+		addr_type = inet_addr_type(&ctx, saddr);
 	if (addr_type == RTN_LOCAL || addr_type == RTN_UNICAST)
 		flags |= FLOWI_FLAG_ANYSRC;
 	else
@@ -43,7 +43,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 	fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_flags = flags;
-	rt = ip_route_output_key(net, &fl4);
+	rt = ip_route_output_key(&ctx, &fl4);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
@@ -59,7 +59,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 	    xfrm_decode_session(skb, flowi4_to_flowi(&fl4), AF_INET) == 0) {
 		struct dst_entry *dst = skb_dst(skb);
 		skb_dst_set(skb, NULL);
-		dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
+		dst = xfrm_lookup(&ctx, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
 		if (IS_ERR(dst))
 			return PTR_ERR(dst);
 		skb_dst_set(skb, dst);
@@ -173,10 +173,10 @@ static __sum16 nf_ip_checksum_partial(struct sk_buff *skb, unsigned int hook,
 	return csum;
 }
 
-static int nf_ip_route(struct net *net, struct dst_entry **dst,
+static int nf_ip_route(struct net_ctx *ctx, struct dst_entry **dst,
 		       struct flowi *fl, bool strict __always_unused)
 {
-	struct rtable *rt = ip_route_output_key(net, &fl->u.ip4);
+	struct rtable *rt = ip_route_output_key(ctx, &fl->u.ip4);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 	*dst = &rt->dst;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 2a3720fb5a5f..bca4f27502b0 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -167,9 +167,9 @@ void ping_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ping_unhash);
 
-static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
+static struct sock *ping_lookup(struct net_ctx *ctx, struct sk_buff *skb, u16 ident)
 {
-	struct hlist_nulls_head *hslot = ping_hashslot(&ping_table, net, ident);
+	struct hlist_nulls_head *hslot = ping_hashslot(&ping_table, ctx->net, ident);
 	struct sock *sk = NULL;
 	struct inet_sock *isk;
 	struct hlist_nulls_node *hnode;
@@ -297,7 +297,7 @@ EXPORT_SYMBOL_GPL(ping_close);
 /* Checks the bind address and possibly modifies sk->sk_bound_dev_if. */
 static int ping_check_bind_addr(struct sock *sk, struct inet_sock *isk,
 				struct sockaddr *uaddr, int addr_len) {
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	if (sk->sk_family == AF_INET) {
 		struct sockaddr_in *addr = (struct sockaddr_in *) uaddr;
 		int chk_addr_ret;
@@ -308,12 +308,12 @@ static int ping_check_bind_addr(struct sock *sk, struct inet_sock *isk,
 		pr_debug("ping_check_bind_addr(sk=%p,addr=%pI4,port=%d)\n",
 			 sk, &addr->sin_addr.s_addr, ntohs(addr->sin_port));
 
-		chk_addr_ret = inet_addr_type(net, addr->sin_addr.s_addr);
+		chk_addr_ret = inet_addr_type(&sk_ctx, addr->sin_addr.s_addr);
 
 		if (addr->sin_addr.s_addr == htonl(INADDR_ANY))
 			chk_addr_ret = RTN_LOCAL;
 
-		if ((net->ipv4.sysctl_ip_nonlocal_bind == 0 &&
+		if ((sk_ctx.net->ipv4.sysctl_ip_nonlocal_bind == 0 &&
 		    isk->freebind == 0 && isk->transparent == 0 &&
 		     chk_addr_ret != RTN_LOCAL) ||
 		    chk_addr_ret == RTN_MULTICAST ||
@@ -344,13 +344,13 @@ static int ping_check_bind_addr(struct sock *sk, struct inet_sock *isk,
 
 		rcu_read_lock();
 		if (addr->sin6_scope_id) {
-			dev = dev_get_by_index_rcu(net, addr->sin6_scope_id);
+			dev = dev_get_by_index_rcu_ctx(&sk_ctx, addr->sin6_scope_id);
 			if (!dev) {
 				rcu_read_unlock();
 				return -ENODEV;
 			}
 		}
-		has_addr = pingv6_ops.ipv6_chk_addr(net, &addr->sin6_addr, dev,
+		has_addr = pingv6_ops.ipv6_chk_addr(&sk_ctx, &addr->sin6_addr, dev,
 						    scoped);
 		rcu_read_unlock();
 
@@ -479,7 +479,7 @@ void ping_err(struct sk_buff *skb, int offset, u32 info)
 	struct inet_sock *inet_sock;
 	int type;
 	int code;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
 	struct sock *sk;
 	int harderr;
 	int err;
@@ -507,7 +507,7 @@ void ping_err(struct sk_buff *skb, int offset, u32 info)
 		 skb->protocol, type, code, ntohs(icmph->un.echo.id),
 		 ntohs(icmph->un.echo.sequence));
 
-	sk = ping_lookup(net, skb, ntohs(icmph->un.echo.id));
+	sk = ping_lookup(&dev_ctx, skb, ntohs(icmph->un.echo.id));
 	if (sk == NULL) {
 		pr_debug("no socket, dropping\n");
 		return;	/* No socket for error */
@@ -687,7 +687,8 @@ EXPORT_SYMBOL_GPL(ping_common_sendmsg);
 static int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			   size_t len)
 {
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 	struct flowi4 fl4;
 	struct inet_sock *inet = inet_sk(sk);
 	struct ipcm_cookie ipc;
@@ -736,7 +737,7 @@ static int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc, false);
+		err = ip_cmsg_send(&sk_ctx, msg, &ipc, false);
 		if (err)
 			return err;
 		if (ipc.opt)
@@ -783,7 +784,7 @@ static int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 			   inet_sk_flowi_flags(sk), faddr, saddr, 0, 0);
 
 	security_sk_classify_flow(sk, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_flow(net, &fl4, sk);
+	rt = ip_route_output_flow(&sk_ctx, &fl4, sk);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		rt = NULL;
@@ -829,7 +830,7 @@ static int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
-		icmp_out_count(sock_net(sk), user_icmph.type);
+		icmp_out_count(net, user_icmph.type);
 		return len;
 	}
 	return err;
@@ -953,7 +954,7 @@ EXPORT_SYMBOL_GPL(ping_queue_rcv_skb);
 bool ping_rcv(struct sk_buff *skb)
 {
 	struct sock *sk;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
 	struct icmphdr *icmph = icmp_hdr(skb);
 
 	/* We assume the packet has already been checked by icmp_rcv */
@@ -964,7 +965,7 @@ bool ping_rcv(struct sk_buff *skb)
 	/* Push ICMP header back */
 	skb_push(skb, skb->data - (u8 *)icmph);
 
-	sk = ping_lookup(net, skb, ntohs(icmph->un.echo.id));
+	sk = ping_lookup(&dev_ctx, skb, ntohs(icmph->un.echo.id));
 	if (sk != NULL) {
 		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
@@ -1007,7 +1008,7 @@ static struct sock *ping_get_first(struct seq_file *seq, int start)
 {
 	struct sock *sk;
 	struct ping_iter_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	for (state->bucket = start; state->bucket < PING_HTABLE_SIZE;
 	     ++state->bucket) {
@@ -1020,7 +1021,7 @@ static struct sock *ping_get_first(struct seq_file *seq, int start)
 			continue;
 
 		sk_nulls_for_each(sk, node, hslot) {
-			if (net_eq(sock_net(sk), net) &&
+			if (sock_net_ctx_eq(sk, ctx) &&
 			    sk->sk_family == state->family)
 				goto found;
 		}
@@ -1033,11 +1034,11 @@ static struct sock *ping_get_first(struct seq_file *seq, int start)
 static struct sock *ping_get_next(struct seq_file *seq, struct sock *sk)
 {
 	struct ping_iter_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	do {
 		sk = sk_nulls_next(sk);
-	} while (sk && (!net_eq(sock_net(sk), net)));
+	} while (sk && !sock_net_ctx_eq(sk, ctx));
 
 	if (!sk)
 		return ping_get_first(seq, state->bucket + 1);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 0bb68df5055d..c06dd58e538b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -119,13 +119,13 @@ void raw_unhash_sk(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(raw_unhash_sk);
 
-static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+static struct sock *__raw_v4_lookup(struct net_ctx *ctx, struct sock *sk,
 		unsigned short num, __be32 raddr, __be32 laddr, int dif)
 {
 	sk_for_each_from(sk) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		if (net_eq(sock_net(sk), net) && inet->inet_num == num	&&
+		if (sock_net_ctx_eq(sk, ctx) && inet->inet_num == num &&
 		    !(inet->inet_daddr && inet->inet_daddr != raddr) 	&&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
 		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
@@ -171,15 +171,14 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 	struct sock *sk;
 	struct hlist_head *head;
 	int delivered = 0;
-	struct net *net;
+	struct net_ctx ctx = SKB_NET_CTX_DEV(skb);
 
 	read_lock(&raw_v4_hashinfo.lock);
 	head = &raw_v4_hashinfo.ht[hash];
 	if (hlist_empty(head))
 		goto out;
 
-	net = dev_net(skb->dev);
-	sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
+	sk = __raw_v4_lookup(&ctx, __sk_head(head), iph->protocol,
 			     iph->saddr, iph->daddr,
 			     skb->dev->ifindex);
 
@@ -194,7 +193,7 @@ static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
 			if (clone)
 				raw_rcv(sk, clone);
 		}
-		sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol,
+		sk = __raw_v4_lookup(&ctx, sk_next(sk), iph->protocol,
 				     iph->saddr, iph->daddr,
 				     skb->dev->ifindex);
 	}
@@ -287,7 +286,7 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	int hash;
 	struct sock *raw_sk;
 	const struct iphdr *iph;
-	struct net *net;
+	struct net_ctx ctx = SKB_NET_CTX_DEV(skb);
 
 	hash = protocol & (RAW_HTABLE_SIZE - 1);
 
@@ -295,9 +294,8 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	raw_sk = sk_head(&raw_v4_hashinfo.ht[hash]);
 	if (raw_sk != NULL) {
 		iph = (const struct iphdr *)skb->data;
-		net = dev_net(skb->dev);
 
-		while ((raw_sk = __raw_v4_lookup(net, raw_sk, protocol,
+		while ((raw_sk = __raw_v4_lookup(&ctx, raw_sk, protocol,
 						iph->daddr, iph->saddr,
 						skb->dev->ifindex)) != NULL) {
 			raw_err(raw_sk, skb, info);
@@ -494,6 +492,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	u8  tos;
 	int err;
 	struct ip_options_data opt_copy;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	struct raw_frag_vec rfv;
 
 	err = -EMSGSIZE;
@@ -544,7 +543,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.oif = sk->sk_bound_dev_if;
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc, false);
+		err = ip_cmsg_send(&sk_ctx, msg, &ipc, false);
 		if (err)
 			goto out;
 		if (ipc.opt)
@@ -609,7 +608,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	}
 
 	security_sk_classify_flow(sk, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
+	rt = ip_route_output_flow(&sk_ctx, &fl4, sk);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		rt = NULL;
@@ -689,10 +688,11 @@ static int raw_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct sockaddr_in *addr = (struct sockaddr_in *) uaddr;
 	int ret = -EINVAL;
 	int chk_addr_ret;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	if (sk->sk_state != TCP_CLOSE || addr_len < sizeof(struct sockaddr_in))
 		goto out;
-	chk_addr_ret = inet_addr_type(sock_net(sk), addr->sin_addr.s_addr);
+	chk_addr_ret = inet_addr_type(&sk_ctx, addr->sin_addr.s_addr);
 	ret = -EADDRNOTAVAIL;
 	if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
 	    chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST)
@@ -938,11 +938,12 @@ static struct sock *raw_get_first(struct seq_file *seq)
 {
 	struct sock *sk;
 	struct raw_iter_state *state = raw_seq_private(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	for (state->bucket = 0; state->bucket < RAW_HTABLE_SIZE;
 			++state->bucket) {
 		sk_for_each(sk, &state->h->ht[state->bucket])
-			if (sock_net(sk) == seq_file_net(seq))
+			if (sock_net_ctx_eq(sk, ctx))
 				goto found;
 	}
 	sk = NULL;
@@ -953,12 +954,13 @@ static struct sock *raw_get_first(struct seq_file *seq)
 static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk)
 {
 	struct raw_iter_state *state = raw_seq_private(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	do {
 		sk = sk_next(sk);
 try_again:
 		;
-	} while (sk && sock_net(sk) != seq_file_net(seq));
+	} while (sk && !sock_net_ctx_eq(sk, ctx));
 
 	if (!sk && ++state->bucket < RAW_HTABLE_SIZE) {
 		sk = sk_head(&state->h->ht[state->bucket]);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0c63b2abd873..018e292ff145 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -428,7 +428,9 @@ static inline int ip_rt_proc_init(void)
 
 static inline bool rt_is_expired(const struct rtable *rth)
 {
-	return rth->rt_genid != rt_genid_ipv4(dev_net(rth->dst.dev));
+	struct net_ctx dev_ctx = DEV_NET_CTX(rth->dst.dev);
+
+	return rth->rt_genid != rt_genid_ipv4(&dev_ctx);
 }
 
 void rt_cache_flush(struct net *net)
@@ -625,6 +627,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 	unsigned int i;
 	int depth;
 	u32 hval = fnhe_hashfun(daddr);
+	struct net_ctx dev_ctx = DEV_NET_CTX(nh->nh_dev);
 
 	spin_lock_bh(&fnhe_lock);
 
@@ -671,7 +674,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 			fnhe->fnhe_next = hash->chain;
 			rcu_assign_pointer(hash->chain, fnhe);
 		}
-		fnhe->fnhe_genid = fnhe_genid(dev_net(nh->nh_dev));
+		fnhe->fnhe_genid = fnhe_genid(&dev_ctx);
 		fnhe->fnhe_daddr = daddr;
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
@@ -709,7 +712,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 	struct in_device *in_dev;
 	struct fib_result res;
 	struct neighbour *n;
-	struct net *net;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 
 	switch (icmp_hdr(skb)->code & 7) {
 	case ICMP_REDIR_NET:
@@ -729,7 +732,6 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 	if (!in_dev)
 		return;
 
-	net = dev_net(dev);
 	if (new_gw == old_gw || !IN_DEV_RX_REDIRECTS(in_dev) ||
 	    ipv4_is_multicast(new_gw) || ipv4_is_lbcast(new_gw) ||
 	    ipv4_is_zeronet(new_gw))
@@ -741,7 +743,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		if (IN_DEV_SEC_REDIRECTS(in_dev) && ip_fib_check_default(new_gw, dev))
 			goto reject_redirect;
 	} else {
-		if (inet_addr_type(net, new_gw) != RTN_UNICAST)
+		if (inet_addr_type(&dev_ctx, new_gw) != RTN_UNICAST)
 			goto reject_redirect;
 	}
 
@@ -750,7 +752,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		if (!(n->nud_state & NUD_VALID)) {
 			neigh_event_send(n, NULL);
 		} else {
-			if (fib_lookup(net, fl4, &res) == 0) {
+			if (fib_lookup(&dev_ctx, fl4, &res) == 0) {
 				struct fib_nh *nh = &FIB_RES_NH(res);
 
 				update_or_create_fnhe(nh, fl4->daddr, new_gw,
@@ -959,6 +961,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 {
 	struct dst_entry *dst = &rt->dst;
 	struct fib_result res;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dst->dev);
 
 	if (dst_metric_locked(dst, RTAX_MTU))
 		return;
@@ -974,7 +977,7 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 		return;
 
 	rcu_read_lock();
-	if (fib_lookup(dev_net(dst->dev), fl4, &res) == 0) {
+	if (fib_lookup(&dev_ctx, fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
 
 		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
@@ -993,7 +996,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	__ip_rt_update_pmtu(rt, &fl4, mtu);
 }
 
-void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
+void ipv4_update_pmtu(struct sk_buff *skb, struct net_ctx *ctx, u32 mtu,
 		      int oif, u32 mark, u8 protocol, int flow_flags)
 {
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
@@ -1001,11 +1004,11 @@ void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
 	struct rtable *rt;
 
 	if (!mark)
-		mark = IP4_REPLY_MARK(net, skb->mark);
+		mark = IP4_REPLY_MARK(ctx->net, skb->mark);
 
 	__build_flow_key(&fl4, NULL, iph, oif,
 			 RT_TOS(iph->tos), protocol, mark, flow_flags);
-	rt = __ip_route_output_key(net, &fl4);
+	rt = __ip_route_output_key(ctx, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_rt_update_pmtu(rt, &fl4, mtu);
 		ip_rt_put(rt);
@@ -1018,13 +1021,14 @@ static void __ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
 	struct flowi4 fl4;
 	struct rtable *rt;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
 
 	if (!fl4.flowi4_mark)
-		fl4.flowi4_mark = IP4_REPLY_MARK(sock_net(sk), skb->mark);
+		fl4.flowi4_mark = IP4_REPLY_MARK(sk_ctx.net, skb->mark);
 
-	rt = __ip_route_output_key(sock_net(sk), &fl4);
+	rt = __ip_route_output_key(&sk_ctx, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_rt_update_pmtu(rt, &fl4, mtu);
 		ip_rt_put(rt);
@@ -1038,6 +1042,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	struct rtable *rt;
 	struct dst_entry *odst = NULL;
 	bool new = false;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	bh_lock_sock(sk);
 
@@ -1055,7 +1060,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 
 	rt = (struct rtable *)odst;
 	if (odst->obsolete && odst->ops->check(odst, 0) == NULL) {
-		rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
+		rt = ip_route_output_flow(&sk_ctx, &fl4, sk);
 		if (IS_ERR(rt))
 			goto out;
 
@@ -1068,7 +1073,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 		if (new)
 			dst_release(&rt->dst);
 
-		rt = ip_route_output_flow(sock_net(sk), &fl4, sk);
+		rt = ip_route_output_flow(&sk_ctx, &fl4, sk);
 		if (IS_ERR(rt))
 			goto out;
 
@@ -1084,7 +1089,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_update_pmtu);
 
-void ipv4_redirect(struct sk_buff *skb, struct net *net,
+void ipv4_redirect(struct sk_buff *skb, struct net_ctx *ctx,
 		   int oif, u32 mark, u8 protocol, int flow_flags)
 {
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
@@ -1093,7 +1098,7 @@ void ipv4_redirect(struct sk_buff *skb, struct net *net,
 
 	__build_flow_key(&fl4, NULL, iph, oif,
 			 RT_TOS(iph->tos), protocol, mark, flow_flags);
-	rt = __ip_route_output_key(net, &fl4);
+	rt = __ip_route_output_key(ctx, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_do_redirect(rt, skb, &fl4, false);
 		ip_rt_put(rt);
@@ -1106,9 +1111,10 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk)
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
 	struct flowi4 fl4;
 	struct rtable *rt;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
-	rt = __ip_route_output_key(sock_net(sk), &fl4);
+	rt = __ip_route_output_key(&sk_ctx, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_do_redirect(rt, skb, &fl4, false);
 		ip_rt_put(rt);
@@ -1173,6 +1179,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
 		struct fib_result res;
 		struct flowi4 fl4;
 		struct iphdr *iph;
+		struct net_ctx dev_ctx = DEV_NET_CTX(rt->dst.dev);
 
 		iph = ip_hdr(skb);
 
@@ -1185,8 +1192,8 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
 		fl4.flowi4_mark = skb->mark;
 
 		rcu_read_lock();
-		if (fib_lookup(dev_net(rt->dst.dev), &fl4, &res) == 0)
-			src = FIB_RES_PREFSRC(dev_net(rt->dst.dev), res);
+		if (fib_lookup(&dev_ctx, &fl4, &res) == 0)
+			src = FIB_RES_PREFSRC(&dev_ctx, res);
 		else
 			src = inet_select_addr(rt->dst.dev,
 					       rt_nexthop(rt, iph->daddr),
@@ -1269,7 +1276,8 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 	if (daddr == fnhe->fnhe_daddr) {
 		struct rtable __rcu **porig;
 		struct rtable *orig;
-		int genid = fnhe_genid(dev_net(rt->dst.dev));
+		struct net_ctx dev_ctx = DEV_NET_CTX(rt->dst.dev);
+		int genid = fnhe_genid(&dev_ctx);
 
 		if (rt_is_input_route(rt))
 			porig = &fnhe->fnhe_rth_input;
@@ -1443,6 +1451,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 {
 	struct rtable *rth;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
 	u32 itag = 0;
 	int err;
 
@@ -1478,7 +1487,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	rth->dst.output = ip_rt_bug;
 
-	rth->rt_genid	= rt_genid_ipv4(dev_net(dev));
+	rth->rt_genid	= rt_genid_ipv4(&dev_ctx);
 	rth->rt_flags	= RTCF_MULTICAST;
 	rth->rt_type	= RTN_MULTICAST;
 	rth->rt_is_input= 1;
@@ -1548,6 +1557,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	unsigned int flags = 0;
 	bool do_cache;
 	u32 itag = 0;
+	struct net_ctx dev_ctx;
 
 	/* get a working reference to the output device */
 	out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
@@ -1608,7 +1618,8 @@ static int __mkroute_input(struct sk_buff *skb,
 		goto cleanup;
 	}
 
-	rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
+	dev_ctx.net = dev_net(rth->dst.dev);
+	rth->rt_genid = rt_genid_ipv4(&dev_ctx);
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
 	rth->rt_is_input = 1;
@@ -1666,7 +1677,8 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	u32		itag = 0;
 	struct rtable	*rth;
 	int		err = -EINVAL;
-	struct net    *net = dev_net(dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev);
+	struct net *net = dev_ctx.net;
 	bool do_cache;
 
 	/* IP on this device is disabled. */
@@ -1715,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
-	err = fib_lookup(net, &fl4, &res);
+	err = fib_lookup(&dev_ctx, &fl4, &res);
 	if (err != 0) {
 		if (!IN_DEV_FORWARD(in_dev))
 			err = -EHOSTUNREACH;
@@ -1782,7 +1794,7 @@ out:	return err;
 	rth->dst.tclassid = itag;
 #endif
 
-	rth->rt_genid = rt_genid_ipv4(net);
+	rth->rt_genid = rt_genid_ipv4(&dev_ctx);
 	rth->rt_flags 	= flags|RTCF_LOCAL;
 	rth->rt_type	= res.type;
 	rth->rt_is_input = 1;
@@ -1897,6 +1909,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	u16 type = res->type;
 	struct rtable *rth;
 	bool do_cache;
+	struct net_ctx dev_ctx = DEV_NET_CTX(dev_out);
 
 	in_dev = __in_dev_get_rcu(dev_out);
 	if (!in_dev)
@@ -1971,7 +1984,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 
 	rth->dst.output = ip_output;
 
-	rth->rt_genid = rt_genid_ipv4(dev_net(dev_out));
+	rth->rt_genid = rt_genid_ipv4(&dev_ctx);
 	rth->rt_flags	= flags;
 	rth->rt_type	= type;
 	rth->rt_is_input = 0;
@@ -2011,8 +2024,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
+struct rtable *__ip_route_output_key(struct net_ctx *ctx, struct flowi4 *fl4)
 {
+	struct net *net = ctx->net;
 	struct net_device *dev_out = NULL;
 	__u8 tos = RT_FL_TOS(fl4);
 	unsigned int flags = 0;
@@ -2051,7 +2065,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		    (ipv4_is_multicast(fl4->daddr) ||
 		     ipv4_is_lbcast(fl4->daddr))) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			dev_out = __ip_dev_find(net, fl4->saddr, false);
+			dev_out = __ip_dev_find(ctx, fl4->saddr, false);
 			if (dev_out == NULL)
 				goto out;
 
@@ -2076,14 +2090,14 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 
 		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			if (!__ip_dev_find(net, fl4->saddr, false))
+			if (!__ip_dev_find(ctx, fl4->saddr, false))
 				goto out;
 		}
 	}
 
 
 	if (fl4->flowi4_oif) {
-		dev_out = dev_get_by_index_rcu(net, fl4->flowi4_oif);
+		dev_out = dev_get_by_index_rcu_ctx(ctx, fl4->flowi4_oif);
 		rth = ERR_PTR(-ENODEV);
 		if (dev_out == NULL)
 			goto out;
@@ -2121,7 +2135,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		goto make_route;
 	}
 
-	if (fib_lookup(net, fl4, &res)) {
+	if (fib_lookup(ctx, fl4, &res)) {
 		res.fi = NULL;
 		res.table = NULL;
 		if (fl4->flowi4_oif) {
@@ -2177,7 +2191,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		fib_select_default(&res);
 
 	if (!fl4->saddr)
-		fl4->saddr = FIB_RES_PREFSRC(net, res);
+		fl4->saddr = FIB_RES_PREFSRC(ctx, res);
 
 	dev_out = FIB_RES_DEV(res);
 	fl4->flowi4_oif = dev_out->ifindex;
@@ -2232,7 +2246,7 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
 	.neigh_lookup		=	ipv4_neigh_lookup,
 };
 
-struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig)
+struct dst_entry *ipv4_blackhole_route(struct net_ctx *net_ctx, struct dst_entry *dst_orig)
 {
 	struct rtable *ort = (struct rtable *) dst_orig;
 	struct rtable *rt;
@@ -2253,7 +2267,7 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 		rt->rt_iif = ort->rt_iif;
 		rt->rt_pmtu = ort->rt_pmtu;
 
-		rt->rt_genid = rt_genid_ipv4(net);
+		rt->rt_genid = rt_genid_ipv4(net_ctx);
 		rt->rt_flags = ort->rt_flags;
 		rt->rt_type = ort->rt_type;
 		rt->rt_gateway = ort->rt_gateway;
@@ -2269,16 +2283,16 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
 	return rt ? &rt->dst : ERR_PTR(-ENOMEM);
 }
 
-struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
+struct rtable *ip_route_output_flow(struct net_ctx *ctx, struct flowi4 *flp4,
 				    struct sock *sk)
 {
-	struct rtable *rt = __ip_route_output_key(net, flp4);
+	struct rtable *rt = __ip_route_output_key(ctx, flp4);
 
 	if (IS_ERR(rt))
 		return rt;
 
 	if (flp4->flowi4_proto)
-		rt = (struct rtable *)xfrm_lookup_route(net, &rt->dst,
+		rt = (struct rtable *)xfrm_lookup_route(ctx, &rt->dst,
 							flowi4_to_flowi(flp4),
 							sk, 0);
 
@@ -2286,7 +2300,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
+static int rt_fill_info(struct net_ctx *ctx,  __be32 dst, __be32 src,
 			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
 			u32 seq, int event, int nowait, unsigned int flags)
 {
@@ -2367,8 +2381,8 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
 	if (rt_is_input_route(rt)) {
 #ifdef CONFIG_IP_MROUTE
 		if (ipv4_is_multicast(dst) && !ipv4_is_local_multicast(dst) &&
-		    IPV4_DEVCONF_ALL(net, MC_FORWARDING)) {
-			int err = ipmr_get_route(net, skb,
+		    IPV4_DEVCONF_ALL(ctx->net, MC_FORWARDING)) {
+			int err = ipmr_get_route(ctx->net, skb,
 						 fl4->saddr, fl4->daddr,
 						 r, nowait);
 			if (err <= 0) {
@@ -2401,7 +2415,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
 
 static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 {
-	struct net *net = sock_net(in_skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(in_skb);
 	struct rtmsg *rtm;
 	struct nlattr *tb[RTA_MAX+1];
 	struct rtable *rt = NULL;
@@ -2450,7 +2464,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	if (iif) {
 		struct net_device *dev;
 
-		dev = __dev_get_by_index(net, iif);
+		dev = __dev_get_by_index_ctx(&sk_ctx, iif);
 		if (dev == NULL) {
 			err = -ENODEV;
 			goto errout_free;
@@ -2467,7 +2481,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 		if (err == 0 && rt->dst.error)
 			err = -rt->dst.error;
 	} else {
-		rt = ip_route_output_key(net, &fl4);
+		rt = ip_route_output_key(&sk_ctx, &fl4);
 
 		err = 0;
 		if (IS_ERR(rt))
@@ -2481,13 +2495,13 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	err = rt_fill_info(net, dst, src, &fl4, skb,
+	err = rt_fill_info(&sk_ctx, dst, src, &fl4, skb,
 			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (err < 0)
 		goto errout_free;
 
-	err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
+	err = rtnl_unicast(skb, sk_ctx.net, NETLINK_CB(in_skb).portid);
 errout:
 	return err;
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 45fe60c5238e..14b7a772c7a9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -302,6 +302,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	struct rtable *rt;
 	__u8 rcv_wscale;
 	struct flowi4 fl4;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -372,7 +373,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 			   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
 			   ireq->ir_loc_addr, th->source, th->dest);
 	security_req_classify_flow(req, flowi4_to_flowi(&fl4));
-	rt = ip_route_output_key(sock_net(sk), &fl4);
+	rt = ip_route_output_key(&sk_ctx, &fl4);
 	if (IS_ERR(rt)) {
 		reqsk_free(req);
 		goto out;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad3e65bdd368..ceb5616a4273 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -341,9 +341,10 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 	__u32 seq, snd_una;
 	__u32 remaining;
 	int err;
-	struct net *net = dev_net(icmp_skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(icmp_skb);
+	struct net *net = dev_ctx.net;
 
-	sk = inet_lookup(net, &tcp_hashinfo, iph->daddr, th->dest,
+	sk = inet_lookup(&dev_ctx, &tcp_hashinfo, iph->daddr, th->dest,
 			iph->saddr, th->source, inet_iif(icmp_skb));
 	if (!sk) {
 		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
@@ -592,7 +593,8 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 	int genhash;
 	struct sock *sk1 = NULL;
 #endif
-	struct net *net;
+	struct net_ctx ctx = SKB_NET_CTX_DST(skb);
+	struct net *net = ctx.net;
 
 	/* Never send a reset in response to a reset. */
 	if (th->rst)
@@ -634,7 +636,7 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 		 * Incoming packet is checked with md5 hash with finding key,
 		 * no RST generated if md5 hash doesn't match.
 		 */
-		sk1 = __inet_lookup_listener(net,
+		sk1 = __inet_lookup_listener(&ctx,
 					     &tcp_hashinfo, ip_hdr(skb)->saddr,
 					     th->source, ip_hdr(skb)->daddr,
 					     ntohs(th->source), inet_iif(skb));
@@ -683,7 +685,7 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
 		arg.bound_dev_if = sk->sk_bound_dev_if;
 
 	arg.tos = ip_hdr(skb)->tos;
-	ip_send_unicast_reply(net, skb, &TCP_SKB_CB(skb)->header.h4.opt,
+	ip_send_unicast_reply(&ctx, skb, &TCP_SKB_CB(skb)->header.h4.opt,
 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
 			      &arg, arg.iov[0].iov_len);
 
@@ -718,7 +720,7 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 			];
 	} rep;
 	struct ip_reply_arg arg;
-	struct net *net = dev_net(skb_dst(skb)->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DST(skb);
 
 	memset(&rep.th, 0, sizeof(struct tcphdr));
 	memset(&arg, 0, sizeof(arg));
@@ -767,11 +769,11 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
 	if (oif)
 		arg.bound_dev_if = oif;
 	arg.tos = tos;
-	ip_send_unicast_reply(net, skb, &TCP_SKB_CB(skb)->header.h4.opt,
+	ip_send_unicast_reply(&dev_ctx, skb, &TCP_SKB_CB(skb)->header.h4.opt,
 			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
 			      &arg, arg.iov[0].iov_len);
 
-	TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
+	TCP_INC_STATS_BH(dev_ctx.net, TCP_MIB_OUTSEGS);
 }
 
 static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
@@ -1393,13 +1395,14 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct sock *nsk;
 	struct request_sock **prev;
+	struct net_ctx ctx = SOCK_NET_CTX(sk);
 	/* Find possible connection requests. */
 	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
 						       iph->saddr, iph->daddr);
 	if (req)
 		return tcp_check_req(sk, skb, req, prev, false);
 
-	nsk = inet_lookup_established(sock_net(sk), &tcp_hashinfo, iph->saddr,
+	nsk = inet_lookup_established(&ctx, &tcp_hashinfo, iph->saddr,
 			th->source, iph->daddr, th->dest, inet_iif(skb));
 
 	if (nsk) {
@@ -1495,6 +1498,7 @@ void tcp_v4_early_demux(struct sk_buff *skb)
 	const struct iphdr *iph;
 	const struct tcphdr *th;
 	struct sock *sk;
+	struct net_ctx dev_ctx = DEV_NET_CTX(skb->dev);
 
 	if (skb->pkt_type != PACKET_HOST)
 		return;
@@ -1508,7 +1512,7 @@ void tcp_v4_early_demux(struct sk_buff *skb)
 	if (th->doff < sizeof(struct tcphdr) / 4)
 		return;
 
-	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
+	sk = __inet_lookup_established(&dev_ctx, &tcp_hashinfo,
 				       iph->saddr, th->source,
 				       iph->daddr, ntohs(th->dest),
 				       skb->skb_iif);
@@ -1592,7 +1596,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	const struct tcphdr *th;
 	struct sock *sk;
 	int ret;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
+	struct net *net = dev_ctx.net;
 
 	if (skb->pkt_type != PACKET_HOST)
 		goto discard_it;
@@ -1726,7 +1731,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	}
 	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
 	case TCP_TW_SYN: {
-		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
+		struct sock *sk2 = inet_lookup_listener(&dev_ctx,
 							&tcp_hashinfo,
 							iph->saddr, th->source,
 							iph->daddr, th->dest,
@@ -1869,7 +1874,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct sock *sk = cur;
 	struct inet_listen_hashbucket *ilb;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	if (!sk) {
 		ilb = &tcp_hashinfo.listening_hash[st->bucket];
@@ -1913,7 +1918,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	}
 get_sk:
 	sk_nulls_for_each_from(sk, node) {
-		if (!net_eq(sock_net(sk), net))
+		if (!sock_net_ctx_eq(sk, ctx))
 			continue;
 		if (sk->sk_family == st->family) {
 			cur = sk;
@@ -1972,7 +1977,7 @@ static inline bool empty_bucket(const struct tcp_iter_state *st)
 static void *established_get_first(struct seq_file *seq)
 {
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 	void *rc = NULL;
 
 	st->offset = 0;
@@ -1988,7 +1993,7 @@ static void *established_get_first(struct seq_file *seq)
 		spin_lock_bh(lock);
 		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
 			if (sk->sk_family != st->family ||
-			    !net_eq(sock_net(sk), net)) {
+			    !sock_net_ctx_eq(sk, ctx)) {
 				continue;
 			}
 			rc = sk;
@@ -2005,7 +2010,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 	struct sock *sk = cur;
 	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	++st->num;
 	++st->offset;
@@ -2013,7 +2018,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 	sk = sk_nulls_next(sk);
 
 	sk_nulls_for_each_from(sk, node) {
-		if (sk->sk_family == st->family && net_eq(sock_net(sk), net))
+		if (sk->sk_family == st->family && sock_net_ctx_eq(sk, ctx))
 			return sk;
 	}
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 97ef1f8b7be8..1787dc8e5db3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -132,7 +132,7 @@ EXPORT_SYMBOL(udp_memory_allocated);
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
-static int udp_lib_lport_inuse(struct net *net, __u16 num,
+static int udp_lib_lport_inuse(struct net_ctx *ctx, __u16 num,
 			       const struct udp_hslot *hslot,
 			       unsigned long *bitmap,
 			       struct sock *sk,
@@ -145,7 +145,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
 	kuid_t uid = sock_i_uid(sk);
 
 	sk_nulls_for_each(sk2, node, &hslot->head) {
-		if (net_eq(sock_net(sk2), net) &&
+		if (sock_net_ctx_eq(sk2, ctx) &&
 		    sk2 != sk &&
 		    (bitmap || udp_sk(sk2)->udp_port_hash == num) &&
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
@@ -166,7 +166,7 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
  * Note: we still hold spinlock of primary hash chain, so no other writer
  * can insert/delete a socket with local_port == num
  */
-static int udp_lib_lport_inuse2(struct net *net, __u16 num,
+static int udp_lib_lport_inuse2(struct net_ctx *ctx, __u16 num,
 				struct udp_hslot *hslot2,
 				struct sock *sk,
 				int (*saddr_comp)(const struct sock *sk1,
@@ -179,7 +179,7 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num,
 
 	spin_lock(&hslot2->lock);
 	udp_portaddr_for_each_entry(sk2, node, &hslot2->head) {
-		if (net_eq(sock_net(sk2), net) &&
+		if (sock_net_ctx_eq(sk2, ctx) &&
 		    sk2 != sk &&
 		    (udp_sk(sk2)->udp_port_hash == num) &&
 		    (!sk2->sk_reuse || !sk->sk_reuse) &&
@@ -213,7 +213,8 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 	struct udp_hslot *hslot, *hslot2;
 	struct udp_table *udptable = sk->sk_prot->h.udp_table;
 	int    error = 1;
-	struct net *net = sock_net(sk);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
+	struct net *net = sk_ctx.net;
 
 	if (!snum) {
 		int low, high, remaining;
@@ -235,7 +236,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 			hslot = udp_hashslot(udptable, net, first);
 			bitmap_zero(bitmap, PORTS_PER_CHAIN);
 			spin_lock_bh(&hslot->lock);
-			udp_lib_lport_inuse(net, snum, hslot, bitmap, sk,
+			udp_lib_lport_inuse(&sk_ctx, snum, hslot, bitmap, sk,
 					    saddr_comp, udptable->log);
 
 			snum = first;
@@ -268,11 +269,11 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 			if (hslot->count < hslot2->count)
 				goto scan_primary_hash;
 
-			exist = udp_lib_lport_inuse2(net, snum, hslot2,
+			exist = udp_lib_lport_inuse2(&sk_ctx, snum, hslot2,
 						     sk, saddr_comp);
 			if (!exist && (hash2_nulladdr != slot2)) {
 				hslot2 = udp_hashslot2(udptable, hash2_nulladdr);
-				exist = udp_lib_lport_inuse2(net, snum, hslot2,
+				exist = udp_lib_lport_inuse2(&sk_ctx, snum, hslot2,
 							     sk, saddr_comp);
 			}
 			if (exist)
@@ -281,7 +282,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 				goto found;
 		}
 scan_primary_hash:
-		if (udp_lib_lport_inuse(net, snum, hslot, NULL, sk,
+		if (udp_lib_lport_inuse(&sk_ctx, snum, hslot, NULL, sk,
 					saddr_comp, 0))
 			goto fail_unlock;
 	}
@@ -336,14 +337,14 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal, hash2_nulladdr);
 }
 
-static inline int compute_score(struct sock *sk, struct net *net,
+static inline int compute_score(struct sock *sk, struct net_ctx *ctx,
 				__be32 saddr, unsigned short hnum, __be16 sport,
 				__be32 daddr, __be16 dport, int dif)
 {
 	int score;
 	struct inet_sock *inet;
 
-	if (!net_eq(sock_net(sk), net) ||
+	if (!sock_net_ctx_eq(sk, ctx) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
 	    ipv6_only_sock(sk))
 		return -1;
@@ -381,14 +382,14 @@ static inline int compute_score(struct sock *sk, struct net *net,
 /*
  * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, inet_num)
  */
-static inline int compute_score2(struct sock *sk, struct net *net,
+static inline int compute_score2(struct sock *sk, struct net_ctx *ctx,
 				 __be32 saddr, __be16 sport,
 				 __be32 daddr, unsigned int hnum, int dif)
 {
 	int score;
 	struct inet_sock *inet;
 
-	if (!net_eq(sock_net(sk), net) ||
+	if (!sock_net_ctx_eq(sk, ctx) ||
 	    ipv6_only_sock(sk))
 		return -1;
 
@@ -435,11 +436,12 @@ static unsigned int udp_ehashfn(struct net *net, const __be32 laddr,
 
 
 /* called with read_rcu_lock() */
-static struct sock *udp4_lib_lookup2(struct net *net,
+static struct sock *udp4_lib_lookup2(struct net_ctx *ctx,
 		__be32 saddr, __be16 sport,
 		__be32 daddr, unsigned int hnum, int dif,
 		struct udp_hslot *hslot2, unsigned int slot2)
 {
+	struct net *net = ctx->net;
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
 	int score, badness, matches = 0, reuseport = 0;
@@ -449,7 +451,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 	result = NULL;
 	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
-		score = compute_score2(sk, net, saddr, sport,
+		score = compute_score2(sk, ctx, saddr, sport,
 				      daddr, hnum, dif);
 		if (score > badness) {
 			result = sk;
@@ -477,7 +479,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
 			result = NULL;
-		else if (unlikely(compute_score2(result, net, saddr, sport,
+		else if (unlikely(compute_score2(result, ctx, saddr, sport,
 				  daddr, hnum, dif) < badness)) {
 			sock_put(result);
 			goto begin;
@@ -489,10 +491,11 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
  * harder than this. -DaveM
  */
-struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
+struct sock *__udp4_lib_lookup(struct net_ctx *ctx, __be32 saddr,
 		__be16 sport, __be32 daddr, __be16 dport,
 		int dif, struct udp_table *udptable)
 {
+	struct net *net = ctx->net;
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(dport);
@@ -509,7 +512,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		if (hslot->count < hslot2->count)
 			goto begin;
 
-		result = udp4_lib_lookup2(net, saddr, sport,
+		result = udp4_lib_lookup2(ctx, saddr, sport,
 					  daddr, hnum, dif,
 					  hslot2, slot2);
 		if (!result) {
@@ -519,7 +522,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 			if (hslot->count < hslot2->count)
 				goto begin;
 
-			result = udp4_lib_lookup2(net, saddr, sport,
+			result = udp4_lib_lookup2(ctx, saddr, sport,
 						  htonl(INADDR_ANY), hnum, dif,
 						  hslot2, slot2);
 		}
@@ -530,7 +533,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	result = NULL;
 	badness = 0;
 	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
-		score = compute_score(sk, net, saddr, hnum, sport,
+		score = compute_score(sk, ctx, saddr, hnum, sport,
 				      daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
@@ -559,7 +562,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
 			result = NULL;
-		else if (unlikely(compute_score(result, net, saddr, hnum, sport,
+		else if (unlikely(compute_score(result, ctx, saddr, hnum, sport,
 				  daddr, dport, dif) < badness)) {
 			sock_put(result);
 			goto begin;
@@ -575,27 +578,28 @@ static inline struct sock *__udp4_lib_lookup_skb(struct sk_buff *skb,
 						 struct udp_table *udptable)
 {
 	const struct iphdr *iph = ip_hdr(skb);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DST(skb);
 
-	return __udp4_lib_lookup(dev_net(skb_dst(skb)->dev), iph->saddr, sport,
+	return __udp4_lib_lookup(&dev_ctx, iph->saddr, sport,
 				 iph->daddr, dport, inet_iif(skb),
 				 udptable);
 }
 
-struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
+struct sock *udp4_lib_lookup(struct net_ctx *ctx, __be32 saddr, __be16 sport,
 			     __be32 daddr, __be16 dport, int dif)
 {
-	return __udp4_lib_lookup(net, saddr, sport, daddr, dport, dif, &udp_table);
+	return __udp4_lib_lookup(ctx, saddr, sport, daddr, dport, dif, &udp_table);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup);
 
-static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
+static inline bool __udp_is_mcast_sock(struct net_ctx *ctx, struct sock *sk,
 				       __be16 loc_port, __be32 loc_addr,
 				       __be16 rmt_port, __be32 rmt_addr,
 				       int dif, unsigned short hnum)
 {
 	struct inet_sock *inet = inet_sk(sk);
 
-	if (!net_eq(sock_net(sk), net) ||
+	if (!sock_net_ctx_eq(sk, ctx) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
 	    (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
 	    (inet->inet_dport != rmt_port && inet->inet_dport) ||
@@ -629,10 +633,12 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 	struct sock *sk;
 	int harderr;
 	int err;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
+	struct net *net = dev_ctx.net;
 
-	sk = __udp4_lib_lookup(net, iph->daddr, uh->dest,
+	sk = __udp4_lib_lookup(&dev_ctx, iph->daddr, uh->dest,
 			iph->saddr, uh->source, skb->dev->ifindex, udptable);
+
 	if (sk == NULL) {
 		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
@@ -893,6 +899,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
 	struct sk_buff *skb;
 	struct ip_options_data opt_copy;
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -962,7 +969,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	sock_tx_timestamp(sk, &ipc.tx_flags);
 
 	if (msg->msg_controllen) {
-		err = ip_cmsg_send(sock_net(sk), msg, &ipc,
+		err = ip_cmsg_send(&sk_ctx, msg, &ipc,
 				   sk->sk_family == AF_INET6);
 		if (err)
 			return err;
@@ -1013,7 +1020,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		rt = (struct rtable *)sk_dst_check(sk, 0);
 
 	if (rt == NULL) {
-		struct net *net = sock_net(sk);
+		struct net *net = sk_ctx.net;
 
 		fl4 = &fl4_stack;
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
@@ -1022,7 +1029,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				   faddr, saddr, dport, inet->inet_sport);
 
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-		rt = ip_route_output_flow(net, fl4, sk);
+		rt = ip_route_output_flow(&sk_ctx, fl4, sk);
 		if (IS_ERR(rt)) {
 			err = PTR_ERR(rt);
 			rt = NULL;
@@ -1657,12 +1664,13 @@ static void udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
  *
  *	Note: called only from the BH handler context.
  */
-static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
+static int __udp4_lib_mcast_deliver(struct net_ctx *ctx, struct sk_buff *skb,
 				    struct udphdr  *uh,
 				    __be32 saddr, __be32 daddr,
 				    struct udp_table *udptable,
 				    int proto)
 {
+	struct net *net = ctx->net;
 	struct sock *sk, *stack[256 / sizeof(struct sock *)];
 	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(uh->dest);
@@ -1683,7 +1691,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 	spin_lock(&hslot->lock);
 	sk_nulls_for_each_entry_offset(sk, node, &hslot->head, offset) {
-		if (__udp_is_mcast_sock(net, sk,
+		if (__udp_is_mcast_sock(ctx, sk,
 					uh->dest, daddr,
 					uh->source, saddr,
 					dif, hnum)) {
@@ -1754,7 +1762,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	unsigned short ulen;
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
+	struct net *net = dev_ctx.net;
 
 	/*
 	 *  Validate the packet.
@@ -1799,7 +1808,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	}
 
 	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
-		return __udp4_lib_mcast_deliver(net, skb, uh,
+		return __udp4_lib_mcast_deliver(&dev_ctx, skb, uh,
 						saddr, daddr, udptable, proto);
 
 	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
@@ -1866,7 +1875,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 /* We can only early demux multicast if there is a single matching socket.
  * If more than one socket found returns NULL
  */
-static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
+static struct sock *__udp4_lib_mcast_demux_lookup(struct net_ctx *ctx,
 						  __be16 loc_port, __be32 loc_addr,
 						  __be16 rmt_port, __be32 rmt_addr,
 						  int dif)
@@ -1874,7 +1883,7 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(loc_port);
-	unsigned int count, slot = udp_hashfn(net, hnum, udp_table.mask);
+	unsigned int count, slot = udp_hashfn(ctx->net, hnum, udp_table.mask);
 	struct udp_hslot *hslot = &udp_table.hash[slot];
 
 	/* Do not bother scanning a too big list */
@@ -1886,7 +1895,7 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	count = 0;
 	result = NULL;
 	sk_nulls_for_each_rcu(sk, node, &hslot->head) {
-		if (__udp_is_mcast_sock(net, sk,
+		if (__udp_is_mcast_sock(ctx, sk,
 					loc_port, loc_addr,
 					rmt_port, rmt_addr,
 					dif, hnum)) {
@@ -1906,7 +1915,7 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 		if (count != 1 ||
 		    unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
 			result = NULL;
-		else if (unlikely(!__udp_is_mcast_sock(net, result,
+		else if (unlikely(!__udp_is_mcast_sock(ctx, result,
 						       loc_port, loc_addr,
 						       rmt_port, rmt_addr,
 						       dif, hnum))) {
@@ -1922,7 +1931,7 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
  * break forwarding setups.  The chains here can be long so only check
  * if the first socket is an exact match and if not move on.
  */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
+static struct sock *__udp4_lib_demux_lookup(struct net_ctx *ctx,
 					    __be16 loc_port, __be32 loc_addr,
 					    __be16 rmt_port, __be32 rmt_addr,
 					    int dif)
@@ -1930,7 +1939,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 	struct sock *sk, *result;
 	struct hlist_nulls_node *node;
 	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
+	unsigned int hash2 = udp4_portaddr_hash(ctx->net, loc_addr, hnum);
 	unsigned int slot2 = hash2 & udp_table.mask;
 	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
 	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
@@ -1939,7 +1948,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 	rcu_read_lock();
 	result = NULL;
 	udp_portaddr_for_each_entry_rcu(sk, node, &hslot2->head) {
-		if (INET_MATCH(sk, net, acookie,
+		if (INET_MATCH(sk, ctx, acookie,
 			       rmt_addr, loc_addr, ports, dif))
 			result = sk;
 		/* Only check first socket in chain */
@@ -1949,7 +1958,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 	if (result) {
 		if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
 			result = NULL;
-		else if (unlikely(!INET_MATCH(sk, net, acookie,
+		else if (unlikely(!INET_MATCH(sk, ctx, acookie,
 					      rmt_addr, loc_addr,
 					      ports, dif))) {
 			sock_put(result);
@@ -1962,7 +1971,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = DEV_NET_CTX(skb->dev);
 	const struct iphdr *iph;
 	const struct udphdr *uh;
 	struct sock *sk;
@@ -1978,10 +1987,10 @@ void udp_v4_early_demux(struct sk_buff *skb)
 
 	if (skb->pkt_type == PACKET_BROADCAST ||
 	    skb->pkt_type == PACKET_MULTICAST)
-		sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
+		sk = __udp4_lib_mcast_demux_lookup(&dev_ctx, uh->dest, iph->daddr,
 						   uh->source, iph->saddr, dif);
 	else if (skb->pkt_type == PACKET_HOST)
-		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
+		sk = __udp4_lib_demux_lookup(&dev_ctx, uh->dest, iph->daddr,
 					     uh->source, iph->saddr, dif);
 	else
 		return;
@@ -2275,7 +2284,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 {
 	struct sock *sk;
 	struct udp_iter_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	for (state->bucket = start; state->bucket <= state->udp_table->mask;
 	     ++state->bucket) {
@@ -2287,7 +2296,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 
 		spin_lock_bh(&hslot->lock);
 		sk_nulls_for_each(sk, node, &hslot->head) {
-			if (!net_eq(sock_net(sk), net))
+			if (!sock_net_ctx_eq(sk, ctx))
 				continue;
 			if (sk->sk_family == state->family)
 				goto found;
@@ -2302,11 +2311,11 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 {
 	struct udp_iter_state *state = seq->private;
-	struct net *net = seq_file_net(seq);
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	do {
 		sk = sk_nulls_next(sk);
-	} while (sk && (!net_eq(sock_net(sk), net) || sk->sk_family != state->family));
+	} while (sk && (!sock_net_ctx_eq(sk, ctx) || sk->sk_family != state->family));
 
 	if (!sk) {
 		if (state->bucket <= state->udp_table->mask)
diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
index 4a000f1dd757..e702a04d2682 100644
--- a/net/ipv4/udp_diag.c
+++ b/net/ipv4/udp_diag.c
@@ -36,16 +36,17 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
 	int err = -EINVAL;
 	struct sock *sk;
 	struct sk_buff *rep;
-	struct net *net = sock_net(in_skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(in_skb);
+	struct net *net = sk_ctx.net;
 
 	if (req->sdiag_family == AF_INET)
-		sk = __udp4_lib_lookup(net,
+		sk = __udp4_lib_lookup(&sk_ctx,
 				req->id.idiag_src[0], req->id.idiag_sport,
 				req->id.idiag_dst[0], req->id.idiag_dport,
 				req->id.idiag_if, tbl);
 #if IS_ENABLED(CONFIG_IPV6)
 	else if (req->sdiag_family == AF_INET6)
-		sk = __udp6_lib_lookup(net,
+		sk = __udp6_lib_lookup(&sk_ctx,
 				(struct in6_addr *)req->id.idiag_src,
 				req->id.idiag_sport,
 				(struct in6_addr *)req->id.idiag_dst,
@@ -94,7 +95,7 @@ static void udp_dump(struct udp_table *table, struct sk_buff *skb, struct netlin
 		struct inet_diag_req_v2 *r, struct nlattr *bc)
 {
 	int num, s_num, slot, s_slot;
-	struct net *net = sock_net(skb->sk);
+	struct net_ctx sk_ctx = SKB_NET_CTX_SOCK(skb);
 
 	s_slot = cb->args[0];
 	num = s_num = cb->args[1];
@@ -113,7 +114,7 @@ static void udp_dump(struct udp_table *table, struct sk_buff *skb, struct netlin
 		sk_nulls_for_each(sk, node, &hslot->head) {
 			struct inet_sock *inet = inet_sk(sk);
 
-			if (!net_eq(sock_net(sk), net))
+			if (!sock_net_ctx_eq(sk, &sk_ctx))
 				continue;
 			if (num < s_num)
 				goto next;
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 6156f68a1e90..c892b6bb0383 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -18,7 +18,7 @@
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
-static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
+static struct dst_entry *__xfrm4_dst_lookup(struct net_ctx *ctx, struct flowi4 *fl4,
 					    int tos,
 					    const xfrm_address_t *saddr,
 					    const xfrm_address_t *daddr)
@@ -31,29 +31,29 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 	if (saddr)
 		fl4->saddr = saddr->a4;
 
-	rt = __ip_route_output_key(net, fl4);
+	rt = __ip_route_output_key(ctx, fl4);
 	if (!IS_ERR(rt))
 		return &rt->dst;
 
 	return ERR_CAST(rt);
 }
 
-static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos,
+static struct dst_entry *xfrm4_dst_lookup(struct net_ctx *ctx, int tos,
 					  const xfrm_address_t *saddr,
 					  const xfrm_address_t *daddr)
 {
 	struct flowi4 fl4;
 
-	return __xfrm4_dst_lookup(net, &fl4, tos, saddr, daddr);
+	return __xfrm4_dst_lookup(ctx, &fl4, tos, saddr, daddr);
 }
 
-static int xfrm4_get_saddr(struct net *net,
+static int xfrm4_get_saddr(struct net_ctx *ctx,
 			   xfrm_address_t *saddr, xfrm_address_t *daddr)
 {
 	struct dst_entry *dst;
 	struct flowi4 fl4;
 
-	dst = __xfrm4_dst_lookup(net, &fl4, 0, NULL, daddr);
+	dst = __xfrm4_dst_lookup(ctx, &fl4, 0, NULL, daddr);
 	if (IS_ERR(dst))
 		return -EHOSTUNREACH;
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 8f34b27d5775..d59affad3f01 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -359,8 +359,8 @@ static int sctp_v4_addr_valid(union sctp_addr *addr,
 /* Should this be available for binding?   */
 static int sctp_v4_available(union sctp_addr *addr, struct sctp_sock *sp)
 {
-	struct net *net = sock_net(&sp->inet.sk);
-	int ret = inet_addr_type(net, addr->v4.sin_addr.s_addr);
+	struct net_ctx sk_ctx = SOCK_NET_CTX(&sp->inet.sk);
+	int ret = inet_addr_type(&sk_ctx, addr->v4.sin_addr.s_addr);
 
 
 	if (addr->v4.sin_addr.s_addr != htonl(INADDR_ANY) &&
@@ -421,6 +421,7 @@ static sctp_scope_t sctp_v4_scope(union sctp_addr *addr)
 static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 				struct flowi *fl, struct sock *sk)
 {
+	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 	struct sctp_association *asoc = t->asoc;
 	struct rtable *rt;
 	struct flowi4 *fl4 = &fl->u.ip4;
@@ -447,7 +448,7 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	pr_debug("%s: dst:%pI4, src:%pI4 - ", __func__, &fl4->daddr,
 		 &fl4->saddr);
 
-	rt = ip_route_output_key(sock_net(sk), fl4);
+	rt = ip_route_output_key(&sk_ctx, fl4);
 	if (!IS_ERR(rt))
 		dst = &rt->dst;
 
@@ -498,7 +499,7 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 					     daddr->v4.sin_addr.s_addr,
 					     laddr->a.v4.sin_addr.s_addr);
 
-			rt = ip_route_output_key(sock_net(sk), fl4);
+			rt = ip_route_output_key(&sk_ctx, fl4);
 			if (!IS_ERR(rt)) {
 				dst = &rt->dst;
 				goto out_unlock;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index cee479bc655c..9984dc89f2e4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2184,7 +2184,7 @@ static struct dst_entry *make_blackhole(struct net *net, u16 family,
  * At the moment we eat a raw IP route. Mostly to speed up lookups
  * on interfaces with disabled IPsec.
  */
-struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+struct dst_entry *xfrm_lookup(struct net_ctx *ctx, struct dst_entry *dst_orig,
 			      const struct flowi *fl,
 			      struct sock *sk, int flags)
 {
@@ -2244,7 +2244,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
 		    !net->xfrm.policy_count[XFRM_POLICY_OUT])
 			goto nopol;
 
-		flo = flow_cache_lookup(net, fl, family, dir,
+		flo = flow_cache_lookup(ctx, fl, family, dir,
 					xfrm_bundle_lookup, &xflo);
 		if (flo == NULL)
 			goto nopol;
@@ -2443,7 +2443,8 @@ static inline int secpath_has_nontransport(const struct sec_path *sp, int k, int
 int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 			unsigned short family)
 {
-	struct net *net = dev_net(skb->dev);
+	struct net_ctx dev_ctx = SKB_NET_CTX_DEV(skb);
+	struct net *net = dev_ctx.net;
 	struct xfrm_policy *pol;
 	struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
 	int npols = 0;
@@ -2490,7 +2491,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	if (!pol) {
 		struct flow_cache_object *flo;
 
-		flo = flow_cache_lookup(net, &fl, family, fl_dir,
+		flo = flow_cache_lookup(&dev_ctx, &fl, family, fl_dir,
 					xfrm_policy_lookup, NULL);
 		if (IS_ERR_OR_NULL(flo))
 			pol = ERR_CAST(flo);
-- 
1.9.3 (Apple Git-50)

* [RFC PATCH 14/29] net: vrf: Introduce vrf header file
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (12 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 13/29] net: Convert function arg from struct net to struct net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05 13:44   ` Nicolas Dichtel
  2015-02-05  1:34 ` [RFC PATCH 15/29] net: vrf: Add vrf to net_ctx struct David Ahern
                   ` (22 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Add defines for the min and max vrf id and helpers for comparing and
validating vrf ids.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/vrf.h | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 000000000000..67bc2e465661
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,36 @@
+#ifndef _VRF_H_
+#define _VRF_H_
+
+#define VRF_BITS	12
+#define VRF_MIN		1
+#define VRF_MAX		((1 << VRF_BITS) - 1)
+#define VRF_MASK	VRF_MAX
+
+#define VRF_DEFAULT	1
+#define VRF_ANY		0xffff
+
+static inline
+int vrf_eq(__u32 vrf1, __u32 vrf2)
+{
+	return vrf1 == vrf2;
+}
+
+static inline
+int vrf_eq_or_any(__u32 vrf1, __u32 vrf2)
+{
+	return vrf1 == vrf2 || vrf1 == VRF_ANY || vrf2 == VRF_ANY;
+}
+
+static inline int vrf_is_valid(__u32 vrf)
+{
+	if ((vrf < VRF_MIN || vrf > VRF_MAX) && vrf != VRF_ANY)
+		return 0;
+
+	return 1;
+}
+
+static inline int vrf_is_any(__u32 vrf)
+{
+	return vrf == VRF_ANY;
+}
+#endif
-- 
1.9.3 (Apple Git-50)
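
Not part of the patch: a minimal userspace sketch of how the helpers in this header behave at the boundaries, assuming uint32_t as a stand-in for __u32. The function vrf_helpers_example() is hypothetical and exists only to walk through the cases; the constants and predicates are copied from the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in the patch. */
#define VRF_BITS	12
#define VRF_MIN		1
#define VRF_MAX		((1 << VRF_BITS) - 1)
#define VRF_DEFAULT	1
#define VRF_ANY		0xffff

/* Userspace stand-ins for the kernel helpers; uint32_t replaces __u32. */
static inline int vrf_eq_or_any(uint32_t vrf1, uint32_t vrf2)
{
	return vrf1 == vrf2 || vrf1 == VRF_ANY || vrf2 == VRF_ANY;
}

static inline int vrf_is_valid(uint32_t vrf)
{
	return (vrf >= VRF_MIN && vrf <= VRF_MAX) || vrf == VRF_ANY;
}

/* Hypothetical walk-through of the boundary cases. */
void vrf_helpers_example(void)
{
	assert(VRF_MAX == 4095);		/* 12 bits of id space */
	assert(vrf_is_valid(VRF_DEFAULT));
	assert(vrf_is_valid(VRF_MAX));
	assert(vrf_is_valid(VRF_ANY));		/* wildcard lies outside MIN..MAX */
	assert(!vrf_is_valid(0));		/* 0 is below VRF_MIN */
	assert(!vrf_is_valid(VRF_MAX + 1));
	assert(vrf_eq_or_any(5, VRF_ANY));	/* wildcard matches any id */
	assert(!vrf_eq_or_any(5, 6));
}
```

Note that VRF_ANY (0xffff) does not fit in VRF_BITS, so the wildcard can never collide with a real id but also cannot be stored in a 12-bit field.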

* [RFC PATCH 15/29] net: vrf: Add vrf to net_ctx struct
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (13 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 14/29] net: vrf: Introduce vrf header file David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 16/29] net: vrf: Set default vrf David Ahern
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Add vrf macros for accessing vrf in net_ctx references similar to what
exists for net, update helper functions and macros to set vrf context,
and handle initialization of vrf context for all existing net_ctx uses.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/netdevice.h        | 9 ++++++++-
 include/net/fib_rules.h          | 2 ++
 include/net/inet_sock.h          | 1 +
 include/net/inet_timewait_sock.h | 1 +
 include/net/ip_fib.h             | 1 +
 include/net/ipv6.h               | 1 +
 include/net/neighbour.h          | 9 +++++++++
 include/net/net_namespace.h      | 4 +++-
 include/net/netlink.h            | 1 +
 include/net/sock.h               | 4 +++-
 net/core/neighbour.c             | 2 +-
 11 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 43bb40260bfa..b6de06eda683 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1699,6 +1699,7 @@ struct net_device {
 
 	struct net_ctx		net_ctx;
 #define nd_net net_ctx.net
+#define nd_vrf net_ctx.vrf
 
 	/* mid-layer private */
 	union {
@@ -1845,7 +1846,13 @@ void dev_net_set(struct net_device *dev, struct net *net)
 }
 
 /* get net_ctx from device */
-#define DEV_NET_CTX(dev)  { .net = dev_net((dev)) }
+#define DEV_NET_CTX(dev)  { .net = dev_net((dev)), .vrf = (dev)->nd_vrf }
+
+static inline
+__u32 dev_vrf(const struct net_device *dev)
+{
+	return dev->nd_vrf;
+}
 
 static inline
 int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 1a545b23494e..0af67c3122f3 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -22,6 +22,7 @@ struct fib_rule {
 	struct fib_rule __rcu	*ctarget;
 	struct net_ctx		fr_net_ctx;
 #define fr_net  fr_net_ctx.net
+#define fr_vrf  fr_net_ctx.vrf
 
 	atomic_t		refcnt;
 	u32			pref;
@@ -78,6 +79,7 @@ struct fib_rules_ops {
 	struct module		*owner;
 	struct net_ctx		fro_net_ctx;
 #define fro_net  fro_net_ctx.net
+#define fro_vrf  fro_net_ctx.vrf
 	struct rcu_head		rcu;
 };
 
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index eb16c7beed1e..de59174d3124 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -77,6 +77,7 @@ struct inet_request_sock {
 #define ir_v6_rmt_addr		req.__req_common.skc_v6_daddr
 #define ir_v6_loc_addr		req.__req_common.skc_v6_rcv_saddr
 #define ir_iif			req.__req_common.skc_bound_dev_if
+#define ir_vrf			req.__req_common.skc_net_ctx.vrf
 
 	kmemcheck_bitfield_begin(flags);
 	u16			snd_wscale : 4,
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 6c566034e26d..c9f3bf6f8b24 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -116,6 +116,7 @@ struct inet_timewait_sock {
 #define tw_hash			__tw_common.skc_hash
 #define tw_prot			__tw_common.skc_prot
 #define tw_net			__tw_common.skc_net
+#define tw_vrf			__tw_common.skc_vrf
 #define tw_daddr        	__tw_common.skc_daddr
 #define tw_v6_daddr		__tw_common.skc_v6_daddr
 #define tw_rcv_saddr    	__tw_common.skc_rcv_saddr
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 85f5ddacba8d..577479d7f268 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -100,6 +100,7 @@ struct fib_info {
 	struct hlist_node	fib_lhash;
 	struct net_ctx		fib_net_ctx;
 #define fib_net  fib_net_ctx.net
+#define fib_vrf  fib_net_ctx.vrf
 	int			fib_treeref;
 	atomic_t		fib_clntref;
 	unsigned int		fib_flags;
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 2d025ed7a183..61f8b6df8bb9 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -240,6 +240,7 @@ struct ip6_flowlabel {
 	unsigned long		expires;
 	struct net_ctx		fl_net_ctx;
 #define fl_net  fl_net_ctx.net
+#define fl_vrf  fl_net_ctx.vrf
 };
 
 static inline
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 8cf9bc2236da..73d0938b085c 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -281,6 +281,15 @@ struct net *neigh_parms_net(const struct neigh_parms *parms)
 }
 
 static inline
+__u32 neigh_parms_vrf(const struct neigh_parms *parms)
+{
+	return parms->net_ctx.vrf;
+}
+
+#define NEIGH_PARMS_NET_CTX(p) \
+		{ .net = neigh_parms_net((p)), .vrf = neigh_parms_vrf((p)) }
+
+static inline
 int neigh_parms_net_ctx_eq(const struct neigh_parms *parms,
 			   const struct net_ctx *net_ctx)
 {
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index e7060b43570d..7cc7b0a1a20b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -27,6 +27,7 @@
 #include <net/netns/nftables.h>
 #include <net/netns/xfrm.h>
 #include <linux/ns_common.h>
+#include <net/vrf.h>
 
 struct user_namespace;
 struct proc_dir_entry;
@@ -138,6 +139,7 @@ struct net_ctx {
 #ifdef CONFIG_NET_NS
 	struct net *net;
 #endif
+	__u32 vrf;
 };
 
 #include <linux/seq_file_net.h>
@@ -145,7 +147,7 @@ struct net_ctx {
 /* Init's network namespace */
 extern struct net init_net;
 
-#define INIT_NET_CTX  { .net = &init_net }
+#define INIT_NET_CTX  { .net = &init_net, .vrf = VRF_DEFAULT }
 
 #ifdef CONFIG_NET_NS
 struct net *copy_net_ns(unsigned long flags, struct user_namespace *user_ns,
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 587a6ef973e5..82c4a2628106 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -224,6 +224,7 @@ struct nl_info {
 	struct nlmsghdr		*nlh;
 	struct net_ctx		nl_net_ctx;
 #define nl_net  nl_net_ctx.net
+#define nl_vrf nl_net_ctx.vrf
 	u32			portid;
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index e67347ed1555..a7cd250e9daf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -192,6 +192,7 @@ struct sock_common {
 	struct proto		*skc_prot;
 	struct net_ctx		skc_net_ctx;
 #define skc_net  skc_net_ctx.net
+#define skc_vrf  skc_net_ctx.vrf
 
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr		skc_v6_daddr;
@@ -326,6 +327,7 @@ struct sock {
 #define sk_bind_node		__sk_common.skc_bind_node
 #define sk_prot			__sk_common.skc_prot
 #define sk_net			__sk_common.skc_net_ctx.net
+#define sk_vrf			__sk_common.skc_net_ctx.vrf
 #define sk_v6_daddr		__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 
@@ -2196,7 +2198,7 @@ void sock_net_set(struct sock *sk, struct net *net)
 	write_pnet(&sk->sk_net, net);
 }
 
-#define SOCK_NET_CTX(sk)  { .net = sock_net((sk)) }
+#define SOCK_NET_CTX(sk)  { .net = sock_net((sk)), .vrf = (sk)->sk_vrf }
 
 static inline
 int sock_net_ctx_eq(struct sock *sk, struct net_ctx *ctx)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 93a7701a7ae7..d872ada6720a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2855,7 +2855,7 @@ static void neigh_proc_update(struct ctl_table *ctl, int write)
 {
 	struct net_device *dev = ctl->extra1;
 	struct neigh_parms *p = ctl->extra2;
-	struct net_ctx ctx = { .net = neigh_parms_net(p) };
+	struct net_ctx ctx = NEIGH_PARMS_NET_CTX(p);
 	int index = (int *) ctl->data - p->data;
 
 	if (!write)
-- 
1.9.3 (Apple Git-50)
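
Not part of the patch: a small userspace sketch of the pattern this change enables, where a context is built by value from an object and then compared as a (namespace, vrf) pair. The types struct sock_stub and the function net_ctx_match() are hypothetical; in particular, treating VRF_ANY as a wildcard in the comparison is an assumption here, since the comparison helpers themselves are not touched by this diff.

```c
#include <assert.h>
#include <stdint.h>

#define VRF_DEFAULT	1
#define VRF_ANY		0xffff

struct net { int id; };		/* opaque stand-in for the kernel's struct net */

struct net_ctx {
	struct net *net;
	uint32_t vrf;
};

struct sock_stub {		/* hypothetical stand-in for struct sock */
	struct net_ctx ctx;
};

/* Same shape as SOCK_NET_CTX(): an initializer carrying both fields. */
#define SOCK_STUB_NET_CTX(sk)  { .net = (sk)->ctx.net, .vrf = (sk)->ctx.vrf }

/* Assumed semantics: same namespace, and same VRF or a wildcard on either side. */
static inline int net_ctx_match(const struct net_ctx *a, const struct net_ctx *b)
{
	return a->net == b->net &&
	       (a->vrf == b->vrf || a->vrf == VRF_ANY || b->vrf == VRF_ANY);
}

/* Hypothetical usage: two sockets in one namespace but different VRFs. */
void net_ctx_example(void)
{
	struct net ns = { .id = 1 };
	struct sock_stub red = { .ctx = { .net = &ns, .vrf = 10 } };
	struct sock_stub blue = { .ctx = { .net = &ns, .vrf = 11 } };
	struct net_ctx red_ctx = SOCK_STUB_NET_CTX(&red);
	struct net_ctx blue_ctx = SOCK_STUB_NET_CTX(&blue);

	assert(!net_ctx_match(&red_ctx, &blue_ctx));
	assert(net_ctx_match(&red_ctx, &red_ctx));
}
```

Building the context by value keeps call sites to a single extra local, which is the same shape the converted callers in this series use.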

* [RFC PATCH 16/29] net: vrf: Set default vrf
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (14 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 15/29] net: vrf: Add vrf to net_ctx struct David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 17/29] net: vrf: Add vrf context to task struct David Ahern
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Set the default vrf to VRF_DEFAULT for devices, the neighbor table, and a
few other places.

If a device is moved from one namespace to another, reset its vrf id to
VRF_DEFAULT.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/dev.c       | 4 ++++
 net/core/neighbour.c | 1 +
 net/ipv4/fib_rules.c | 2 +-
 net/ipv4/ipconfig.c  | 4 ++--
 net/ipv4/ipmr.c      | 2 +-
 5 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index fa92d1046eeb..0d50b2c1944e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6817,6 +6817,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	dev_uc_init(dev);
 
 	dev_net_set(dev, &init_net);
+	dev->nd_vrf = VRF_DEFAULT;
 
 	dev->gso_max_size = GSO_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
@@ -7079,6 +7080,9 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	/* Actually switch the network namespace */
 	dev_net_set(dev, net);
 
+	/* reset vrf id since we changed namespaces */
+	dev->nd_vrf = VRF_DEFAULT;
+
 	/* If there is an ifindex conflict assign a new one */
 	if (__dev_get_by_index(net, dev->ifindex)) {
 		int iflink = (dev->iflink == dev->ifindex);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index d872ada6720a..f64e178738de 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1526,6 +1526,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 	INIT_LIST_HEAD(&tbl->parms_list);
 	list_add(&tbl->parms.list, &tbl->parms_list);
 	write_pnet(&tbl->parms.net_ctx.net, &init_net);
+	tbl->parms.net_ctx.vrf = VRF_DEFAULT;
 	atomic_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 60b14866661b..bb9399e2c1cb 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -331,7 +331,7 @@ int __net_init fib4_rules_init(struct net *net)
 {
 	int err;
 	struct fib_rules_ops *ops;
-	struct net_ctx ctx = { .net = net };
+	struct net_ctx ctx = { .net = net, .vrf = VRF_DEFAULT };
 
 	ops = fib_rules_register(&fib4_rules_ops_template, &ctx);
 	if (IS_ERR(ops))
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index e25e3b67be76..b0a5226faaef 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -329,7 +329,7 @@ set_sockaddr(struct sockaddr_in *sin, __be32 addr, __be16 port)
 
 static int __init ic_devinet_ioctl(unsigned int cmd, struct ifreq *arg)
 {
-	struct net_ctx ctx = { .net = &init_net };
+	struct net_ctx ctx = INIT_NET_CTX;
 	int res;
 
 	mm_segment_t oldfs = get_fs();
@@ -352,7 +352,7 @@ static int __init ic_dev_ioctl(unsigned int cmd, struct ifreq *arg)
 
 static int __init ic_route_ioctl(unsigned int cmd, struct rtentry *arg)
 {
-	struct net_ctx ctx = { .net = &init_net };
+	struct net_ctx ctx = INIT_NET_CTX;
 	int res;
 
 	mm_segment_t oldfs = get_fs();
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 935f45f54862..84d6efeeb072 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -244,7 +244,7 @@ static const struct fib_rules_ops __net_initconst ipmr_rules_ops_template = {
 
 static int __net_init ipmr_rules_init(struct net *net)
 {
-	struct net_ctx ctx = { .net = net };
+	struct net_ctx ctx = { .net = net, .vrf = VRF_DEFAULT };
 	struct fib_rules_ops *ops;
 	struct mr_table *mrt;
 	int err;
-- 
1.9.3 (Apple Git-50)
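
Not part of the patch: a minimal userspace sketch of the two rules in the commit message, assuming vrf ids only have meaning inside the namespace that owns the device. The struct net_device_stub type and the dev_stub_*() functions are hypothetical stand-ins for alloc_netdev_mqs() and dev_change_net_namespace().

```c
#include <assert.h>
#include <stdint.h>

#define VRF_DEFAULT	1

struct net { int id; };			/* stand-in for the kernel's struct net */

struct net_device_stub {		/* hypothetical, far smaller than net_device */
	struct net *net;
	uint32_t vrf;
};

/* Allocation: a device starts in the initial namespace and the default VRF. */
static inline void dev_stub_init(struct net_device_stub *dev, struct net *init_ns)
{
	dev->net = init_ns;
	dev->vrf = VRF_DEFAULT;
}

/* Namespace move: the old id is dropped rather than carried across. */
static inline void dev_stub_change_ns(struct net_device_stub *dev, struct net *ns)
{
	dev->net = ns;
	dev->vrf = VRF_DEFAULT;
}

/* Hypothetical usage: a device in VRF 42 is moved to another namespace. */
void dev_vrf_reset_example(void)
{
	struct net init_ns = { .id = 0 }, ns = { .id = 1 };
	struct net_device_stub dev;

	dev_stub_init(&dev, &init_ns);
	dev.vrf = 42;
	dev_stub_change_ns(&dev, &ns);
	assert(dev.net == &ns && dev.vrf == VRF_DEFAULT);
}
```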

* [RFC PATCH 17/29] net: vrf: Add vrf context to task struct
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (15 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 16/29] net: vrf: Set default vrf David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket David Ahern
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

The vrf context is passed from parent to child. It defaults to 1
(VRF_DEFAULT) and can be read and changed via /proc/<pid>/vrf. In time the
/proc write option can be removed in favor of a prctl; writing to a proc
file is a lot simpler at this point.

A task's vrf context is the default used for sockets created by the
task. This is addressed in the next patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 fs/proc/base.c            | 94 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  2 +
 kernel/fork.c             |  2 +
 4 files changed, 99 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7aeb0712..9e538075f7e5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -90,6 +90,7 @@
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
+#include <net/vrf.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -456,6 +457,97 @@ static int proc_pid_limits(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+static ssize_t vrf_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[16];
+	size_t len;
+	__u32 vrf = 0;
+	unsigned long flags;
+
+	if (!task)
+		return -ESRCH;
+
+	if (lock_task_sighand(task, &flags)) {
+		vrf = task->vrf;
+		unlock_task_sighand(task, &flags);
+	}
+
+	put_task_struct(task);
+
+	if (vrf == VRF_ANY)
+		len = snprintf(buffer, sizeof(buffer), "any\n");
+	else
+		len = snprintf(buffer, sizeof(buffer), "%u\n", vrf);
+
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t vrf_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[16], *pbuf;
+	__u32 vrf;
+	unsigned long flags;
+	int err = 0;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	pbuf = strstrip(buffer);
+	if (strcmp(pbuf, "any") == 0)
+		vrf = VRF_ANY;
+	else {
+		err = kstrtouint(pbuf, 0, &vrf);
+		if (err)
+			goto out;
+
+		if (!vrf_is_valid(vrf)) {
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	task = get_proc_task(file_inode(file));
+	if (!task) {
+		err = -ESRCH;
+		goto out;
+	}
+
+	task_lock(task);
+	if (!task->mm) {
+		err = -EINVAL;
+		goto err_task_lock;
+	}
+
+	if (!lock_task_sighand(task, &flags)) {
+		err = -ESRCH;
+		goto err_task_lock;
+	}
+
+	task->vrf = vrf;
+
+	unlock_task_sighand(task, &flags);
+err_task_lock:
+	task_unlock(task);
+	put_task_struct(task);
+out:
+	return err < 0 ? err : count;
+}
+
+static const struct file_operations proc_vrf_operations = {
+	.read		= vrf_read,
+	.write		= vrf_write,
+	.llseek		= generic_file_llseek,
+};
+
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
 static int proc_pid_syscall(struct seq_file *m, struct pid_namespace *ns,
 			    struct pid *pid, struct task_struct *task)
@@ -2628,6 +2720,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	REG("timers",	  S_IRUGO, proc_timers_operations),
 #endif
+	REG("vrf",   S_IRUGO|S_IWUSR, proc_vrf_operations),
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -2970,6 +3063,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 	REG("setgroups",  S_IRUGO|S_IWUSR, proc_setgroups_operations),
 #endif
+	REG("vrf",   S_IRUGO|S_IWUSR, proc_vrf_operations),
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 3037fc085e8e..3ae3a93d42ce 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -247,6 +247,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	.vrf = VRF_DEFAULT,						\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef98d2f..8b40ba202906 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1459,6 +1459,8 @@ struct task_struct {
 	struct files_struct *files;
 /* namespaces */
 	struct nsproxy *nsproxy;
+/* vrf context within a namespace */
+	__u32 vrf;
 /* signal handlers */
 	struct signal_struct *signal;
 	struct sighand_struct *sighand;
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2ddade9f1..a6f412da1378 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -352,6 +352,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->splice_pipe = NULL;
 	tsk->task_frag.page = NULL;
 
+	tsk->vrf = orig->vrf;
+
 	account_kernel_stack(ti, 1);
 
 	return tsk;
-- 
1.9.3 (Apple Git-50)
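
For readers who want to experiment with the input rules of the /proc/<pid>/vrf write handler in this patch, the following is a minimal userspace sketch of the same parsing steps: bounded copy, whitespace strip, the literal "any", otherwise an integer in any base accepted by base-0 parsing. It is not kernel code. EX_VRF_ANY, EX_VRF_MIN, EX_VRF_MAX and ex_parse_vrf() are hypothetical names and values chosen for illustration; the real VRF_ANY and vrf_is_valid() are defined elsewhere in the series.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: the real VRF_ANY value and the range checked
 * by vrf_is_valid() are not shown in this patch. */
#define EX_VRF_ANY UINT32_MAX
#define EX_VRF_MIN 1u
#define EX_VRF_MAX 4095u

/* Parse "any" or an integer vrf id, ignoring surrounding whitespace.
 * Returns 0 on success or a negative errno value. */
static int ex_parse_vrf(const char *in, uint32_t *out)
{
	char buf[16];
	char *p, *end;
	unsigned long val;
	size_t len;

	assert(in != NULL && out != NULL);

	/* Copy at most sizeof(buf) - 1 bytes, as the proc handler does. */
	len = strlen(in);
	if (len > sizeof(buf) - 1)
		len = sizeof(buf) - 1;
	memcpy(buf, in, len);
	buf[len] = '\0';

	/* Userspace equivalent of strstrip(): trim both ends. */
	p = buf;
	while (isspace((unsigned char)*p))
		p++;
	end = p + strlen(p);
	while (end > p && isspace((unsigned char)end[-1]))
		*--end = '\0';

	if (strcmp(p, "any") == 0) {
		*out = EX_VRF_ANY;
		return 0;
	}

	/* strtoul() would silently accept a leading '-'; reject it here. */
	if (*p == '\0' || *p == '-')
		return -EINVAL;

	errno = 0;
	val = strtoul(p, &end, 0);	/* base 0: decimal, 0x hex, 0 octal */
	if (errno == ERANGE || end == p || *end != '\0')
		return -EINVAL;
	if (val < EX_VRF_MIN || val > EX_VRF_MAX)
		return -EINVAL;

	*out = (uint32_t)val;
	return 0;
}
```

With these assumed limits, ex_parse_vrf("any\n", &v) stores EX_VRF_ANY, ex_parse_vrf("0x10", &v) stores 16, and inputs such as "-1" or "12abc" are rejected with -EINVAL.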


* [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (16 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 17/29] net: vrf: Add vrf context to task struct David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05 13:44   ` Nicolas Dichtel
  2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
                   ` (18 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

A socket inherits the vrf context of the task that creates it. The context
can be read or changed via a new socket option (IP_VRF_CONTEXT); it can only
be changed while the socket has no local or remote port set.
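
The socket rules in this patch can be sketched as a small userspace model: a socket takes its vrf from the creating task (as in sk_alloc()), and the IP_VRF_CONTEXT set path refuses a change once a local or remote port is set (as in do_ip_setsockopt()). The ex_task and ex_sock structures and the ex_* functions are hypothetical stand-ins for illustration only; they are not kernel types.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for task_struct and struct sock. */
struct ex_task {
	uint32_t vrf;
};

struct ex_sock {
	uint32_t sk_vrf;
	uint16_t sport;		/* non-zero once bound */
	uint16_t dport;		/* non-zero once connected */
};

/* Mirrors sk_alloc(): a new socket starts in the creating task's vrf. */
static void ex_sock_init(struct ex_sock *sk, const struct ex_task *creator)
{
	assert(sk != NULL && creator != NULL);
	sk->sk_vrf = creator->vrf;
	sk->sport = 0;
	sk->dport = 0;
}

/* Mirrors the IP_VRF_CONTEXT set path: refuse the change if either
 * port is already set, otherwise store the new vrf. */
static int ex_sock_set_vrf(struct ex_sock *sk, uint32_t vrf)
{
	if (sk->sport || sk->dport)
		return -EINVAL;
	sk->sk_vrf = vrf;
	return 0;
}

/* Mirrors the IP_VRF_CONTEXT get path. */
static uint32_t ex_sock_get_vrf(const struct ex_sock *sk)
{
	return sk->sk_vrf;
}
```

In this model a socket created by a task in vrf 5 reports 5, can be moved to vrf 9 while unbound, and returns -EINVAL for any further change after sport is set.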

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/uapi/linux/in.h         |  1 +
 net/core/sock.c                 |  2 ++
 net/ipv4/inet_connection_sock.c |  5 +++--
 net/ipv4/inet_hashtables.c      |  1 +
 net/ipv4/inet_timewait_sock.c   |  1 +
 net/ipv4/ip_output.c            |  1 +
 net/ipv4/ip_sockglue.c          | 14 ++++++++++++++
 net/ipv4/tcp_minisocks.c        |  1 +
 8 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index 589ced069e8a..77ac6fce6493 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -145,6 +145,7 @@ struct in_addr {
 #define MCAST_MSFILTER			48
 #define IP_MULTICAST_ALL		49
 #define IP_UNICAST_IF			50
+#define IP_VRF_CONTEXT			51
 
 #define MCAST_EXCLUDE	0
 #define MCAST_INCLUDE	1
diff --git a/net/core/sock.c b/net/core/sock.c
index 93c8b20c91e4..8a4ef8540e50 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1392,6 +1392,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		sk->sk_prot = sk->sk_prot_creator = prot;
 		sock_lock_init(sk);
 		sock_net_set(sk, get_net(net));
+		/* by default socket takes on vrf of task */
+		sk->sk_vrf = current->vrf;
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index b3580594d08a..3b8df03c69db 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -404,7 +404,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct ip_options_rcu *opt = inet_rsk(req)->opt;
 	struct net *net = sock_net(sk);
-	struct net_ctx ctx = { .net = net };
+	struct net_ctx ctx = { .net = net, .vrf = ireq->ir_vrf };
 	int flags = inet_sk_flowi_flags(sk);
 
 	flowi4_init_output(fl4, sk->sk_bound_dev_if, ireq->ir_mark,
@@ -437,7 +437,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 	struct inet_sock *newinet = inet_sk(newsk);
 	struct ip_options_rcu *opt;
 	struct net *net = sock_net(sk);
-	struct net_ctx ctx = { .net = net };
+	struct net_ctx ctx = { .net = net, .vrf = ireq->ir_vrf };
 	struct flowi4 *fl4;
 	struct rtable *rt;
 
@@ -681,6 +681,7 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 		newsk->sk_write_space = sk_stream_write_space;
 
 		newsk->sk_mark = inet_rsk(req)->ir_mark;
+		newsk->sk_vrf  = inet_rsk(req)->ir_vrf;
 
 		newicsk->icsk_retransmits = 0;
 		newicsk->icsk_backoff	  = 0;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8b3d94ca634c..71c31c81aea1 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,6 +62,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 
 	if (tb != NULL) {
 		write_pnet(&tb->ib_net_ctx.net, hold_net(ctx->net));
+		tb->ib_net_ctx.vrf = ctx->vrf;
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 6d592f8555fb..faec08993a46 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -196,6 +196,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
 		twsk_net_set(tw, hold_net(sock_net(sk)));
+		tw->tw_vrf	    = sk->sk_vrf;
 		/*
 		 * Because we use RCU lookups, we should not set tw_refcnt
 		 * to a non null value before everything is setup for this
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 855e003e43d8..126d6edea34e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1574,6 +1574,7 @@ void ip_send_unicast_reply(struct net_ctx *ctx, struct sk_buff *skb,
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
 	sock_net_set(sk, ctx->net);
+	sk->sk_vrf = ctx->vrf;
 	__skb_queue_head_init(&sk->sk_write_queue);
 	sk->sk_sndbuf = sysctl_wmem_default;
 	err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8ab03f0431f5..eeb51e935379 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -555,6 +555,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	case IP_MULTICAST_LOOP:
 	case IP_RECVORIGDSTADDR:
 	case IP_CHECKSUM:
+	case IP_VRF_CONTEXT:
 		if (optlen >= sizeof(int)) {
 			if (get_user(val, (int __user *) optval))
 				return -EFAULT;
@@ -1104,6 +1105,16 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		inet->min_ttl = val;
 		break;
 
+	case IP_VRF_CONTEXT:
+		/* VRF context can only be set on unbound, unconnected sockets */
+		if (inet->inet_sport || inet->inet_dport) {
+			err = -EINVAL;
+			break;
+		}
+		sk->sk_vrf = val;
+		err = 0;
+		break;
+
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -1411,6 +1422,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_MINTTL:
 		val = inet->min_ttl;
 		break;
+	case IP_VRF_CONTEXT:
+		val = sk->sk_vrf;
+		break;
 	default:
 		release_sock(sk);
 		return -ENOPROTOOPT;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bc9216dc9de1..f5b869799b14 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -283,6 +283,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 
 		tw->tw_transparent	= inet->transparent;
 		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
+		tw->tw_vrf		= sk->sk_vrf;
 		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
 		tcptw->tw_snd_nxt	= tp->snd_nxt;
 		tcptw->tw_rcv_wnd	= tcp_receive_window(tp);
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 19/29] net: vrf: Add vrf context to skb
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (17 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05 13:45   ` Nicolas Dichtel
  2015-02-06  3:54   ` Eric W. Biederman
  2015-02-05  1:34 ` [RFC PATCH 20/29] net: vrf: Add vrf context to flow struct David Ahern
                   ` (17 subsequent siblings)
  36 siblings, 2 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

On ingress, skbs inherit the vrf context from the net_device. On TX, skbs
inherit the vrf context from the socket originating the packet. Update the
skb-related net_ctx macros to set vrf.
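
As a rough illustration of the inheritance described above, the following userspace model tracks where an skb's vrf comes from at each step: the default at allocation, the device on ingress, the socket on ownership assignment. The px_* types and functions are hypothetical and only mimic the assignments made in this patch; PX_VRF_DEFAULT = 1 is an assumption based on the netlink comment in this patch that calls vrf 1 the default vrf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PX_VRF_DEFAULT 1u	/* assumed value of VRF_DEFAULT */

/* Hypothetical, reduced stand-ins for net_device, sock and sk_buff. */
struct px_dev {
	uint32_t nd_vrf;
};

struct px_sock {
	uint32_t sk_vrf;
};

struct px_skb {
	const struct px_dev *dev;
	const struct px_sock *sk;
	uint32_t vrf;
};

/* Mirrors __alloc_skb(): a fresh skb starts in the default vrf. */
static void px_skb_init(struct px_skb *skb)
{
	assert(skb != NULL);
	skb->dev = NULL;
	skb->sk = NULL;
	skb->vrf = PX_VRF_DEFAULT;
}

/* Mirrors the ingress path: the skb takes the receiving device's vrf. */
static void px_skb_receive(struct px_skb *skb, const struct px_dev *dev)
{
	skb->dev = dev;
	skb->vrf = dev->nd_vrf;
}

/* Mirrors skb_set_owner_w()/skb_set_owner_r(): the skb takes the
 * owning socket's vrf. */
static void px_skb_set_owner(struct px_skb *skb, const struct px_sock *sk)
{
	skb->sk = sk;
	skb->vrf = sk->sk_vrf;
}
```

In this model, an skb received on a device in vrf 7 carries vrf 7 until it is assigned to a socket, at which point it carries that socket's vrf.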

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/skbuff.h   |  7 ++++---
 include/net/sock.h       |  2 ++
 include/net/tcp.h        |  1 +
 net/core/dev.c           |  1 +
 net/core/fib_rules.c     |  2 ++
 net/core/neighbour.c     |  2 ++
 net/core/skbuff.c        | 12 ++++++++++++
 net/ipv4/devinet.c       |  2 ++
 net/ipv4/icmp.c          |  2 +-
 net/ipv4/ip_output.c     |  2 ++
 net/ipv4/syncookies.c    |  1 +
 net/ipv4/tcp_ipv4.c      |  3 ++-
 net/netlink/af_netlink.c | 12 ++++++++++++
 13 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5dfef469d07..bdbee41e8032 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -522,6 +522,7 @@ struct sk_buff {
 	};
 	struct sock		*sk;
 	struct net_device	*dev;
+	__u32			vrf;
 
 	/*
 	 * This is the control buffer. It is free to use for every
@@ -665,9 +666,9 @@ struct sk_buff {
 	atomic_t		users;
 };
 
-#define SKB_NET_CTX_DEV(skb)  { .net = dev_net((skb)->dev) }
-#define SKB_NET_CTX_DST(skb)  { .net = dev_net(skb_dst((skb))->dev) }
-#define SKB_NET_CTX_SOCK(skb) { .net = sock_net((skb)->sk) }
+#define SKB_NET_CTX_DEV(skb)  { .net = dev_net((skb)->dev),	     .vrf = (skb)->vrf }
+#define SKB_NET_CTX_DST(skb)  { .net = dev_net(skb_dst((skb))->dev), .vrf = (skb)->vrf }
+#define SKB_NET_CTX_SOCK(skb) { .net = sock_net((skb)->sk),	     .vrf = (skb)->vrf }
 
 #ifdef __KERNEL__
 /*
diff --git a/include/net/sock.h b/include/net/sock.h
index a7cd250e9daf..d3668b691f82 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1976,6 +1976,7 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	skb_orphan(skb);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	skb->vrf = sk->sk_vrf;
 	skb_set_hash_from_sk(skb, sk);
 	/*
 	 * We used to take a refcount on sk, but following operation
@@ -1990,6 +1991,7 @@ static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	skb_orphan(skb);
 	skb->sk = sk;
 	skb->destructor = sock_rfree;
+	skb->vrf = sk->sk_vrf;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 	sk_mem_charge(sk, skb->truesize);
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b8fdc6bab3f3..ed46170de42a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1155,6 +1155,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->ir_rmt_port = tcp_hdr(skb)->source;
 	ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
 	ireq->ir_mark = inet_request_mark(sk, skb);
+	ireq->ir_vrf = skb->vrf;
 }
 
 extern void tcp_openreq_init_rwin(struct request_sock *req,
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d50b2c1944e..d64f5b107dba 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3698,6 +3698,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 
 another_round:
 	skb->skb_iif = skb->dev->ifindex;
+	skb->vrf = skb->dev->nd_vrf;
 
 	__this_cpu_inc(softnet_data.processed);
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index b793196f9521..9a1a4a23b6f6 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -690,6 +690,8 @@ static void notify_rule_change(int event, struct fib_rule *rule,
 	if (skb == NULL)
 		goto errout;
 
+	skb->vrf = ops->fro_vrf;
+
 	err = fib_nl_fill_rule(skb, rule, pid, nlh->nlmsg_seq, event, 0, ops);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in fib_rule_nlmsg_size() */
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f64e178738de..0fbbe70be170 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2780,6 +2780,8 @@ static void __neigh_notify(struct neighbour *n, int type, int flags)
 	if (skb == NULL)
 		goto errout;
 
+	skb->vrf = n->dev->nd_vrf;
+
 	err = neigh_fill_info(skb, n, 0, 0, type, flags);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in neigh_nlmsg_size() */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a5bff2767f15..61a75e891342 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -251,6 +251,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->end = skb->tail + size;
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
+	skb->vrf = VRF_DEFAULT;
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
@@ -514,6 +515,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
+		skb->vrf = dev->nd_vrf;
 	}
 
 	return skb;
@@ -832,6 +834,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 #endif
 
+	new->vrf = old->vrf;
 }
 
 /*
@@ -864,6 +867,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	atomic_inc(&(skb_shinfo(skb)->dataref));
 	skb->cloned = 1;
 
+	n->vrf = skb->vrf;
+
 	return n;
 #undef C
 }
@@ -1057,6 +1062,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 		BUG();
 
 	copy_skb_header(n, skb);
+
+	n->vrf = skb->vrf;
+
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1120,6 +1128,8 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	}
 
 	copy_skb_header(n, skb);
+
+	n->vrf = skb->vrf;
 out:
 	return n;
 }
@@ -1294,6 +1304,8 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
+	n->vrf = skb->vrf;
+
 	return n;
 }
 EXPORT_SYMBOL(skb_copy_expand);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a0182f79f6bf..59de98a44508 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1603,6 +1603,8 @@ static void rtmsg_ifa(int event, struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	if (skb == NULL)
 		goto errout;
 
+	skb->vrf = ifa->ifa_dev->dev->nd_vrf;
+
 	err = inet_fill_ifaddr(skb, ifa, portid, seq, event, 0);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in inet_nlmsg_size() */
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f64de76f55ef..2d1e98e6ad14 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -389,7 +389,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	struct ipcm_cookie ipc;
 	struct rtable *rt = skb_rtable(skb);
 	struct net *net = dev_net(rt->dst.dev);
-	struct net_ctx dev_ctx = { .net = net };
+	struct net_ctx dev_ctx = { .net = net, .vrf = skb->vrf };
 	struct flowi4 fl4;
 	struct sock *sk;
 	struct inet_sock *inet;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 126d6edea34e..383bac145bf4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -471,6 +471,8 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	to->ipvs_property = from->ipvs_property;
 #endif
 	skb_copy_secmark(to, from);
+
+	to->vrf = from->vrf;
 }
 
 /*
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 14b7a772c7a9..7702e1f94174 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -340,6 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->ir_loc_addr	= ip_hdr(skb)->daddr;
 	ireq->ir_rmt_addr	= ip_hdr(skb)->saddr;
 	ireq->ir_mark		= inet_request_mark(sk, skb);
+	ireq->ir_vrf		= skb->vrf;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ceb5616a4273..24089b9534bf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1368,6 +1368,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		sk_nocaps_add(newsk, NETIF_F_GSO_MASK);
 	}
 #endif
+	newsk->sk_vrf = skb->vrf;
 
 	if (__inet_inherit_port(sk, newsk) < 0)
 		goto put_and_exit;
@@ -1395,7 +1396,7 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct sock *nsk;
 	struct request_sock **prev;
-	struct net_ctx ctx = { .net = sock_net(sk) };
+	struct net_ctx ctx = { .net = sock_net(sk), .vrf = skb->vrf };
 	/* Find possible connection requests. */
 	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
 						       iph->saddr, iph->daddr);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index a36777b7cfb6..bd613406e033 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1736,6 +1736,14 @@ static struct sk_buff *netlink_trim(struct sk_buff *skb, gfp_t allocation)
 	return skb;
 }
 
+/*
+ * kernel sockets are all in vrf 1 (default vrf). Transactions
+ * (e.g., add/delete address/route) are happening in other vrfs.
+ * Packets for transactions from userspace are funneled through the
+ * kernel sockets. Handle this case by resetting skb vrf after ownership
+ * assignment. rtnetlink based functions need to use skb->vrf for
+ * decisions which is set to the original userspace socket's vrf id.
+ */
 static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
 				  struct sock *ssk)
 {
@@ -1744,8 +1752,11 @@ static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
 
 	ret = -ECONNREFUSED;
 	if (nlk->netlink_rcv != NULL) {
+		__u32 vrf = skb->vrf;
 		ret = skb->len;
 		netlink_skb_set_owner_r(skb, sk);
+		/* use vrf from sending socket, not kernel's socket context */
+		skb->vrf = vrf;
 		NETLINK_CB(skb).sk = ssk;
 		netlink_deliver_tap_kernel(sk, ssk, skb);
 		nlk->netlink_rcv(skb);
@@ -2313,6 +2324,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	if (skb == NULL)
 		goto out;
 
+	skb->vrf = sk->sk_vrf;
 	NETLINK_CB(skb).portid	= nlk->portid;
 	NETLINK_CB(skb).dst_group = dst_group;
 	NETLINK_CB(skb).creds	= scm.creds;
-- 
1.9.3 (Apple Git-50)
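
The netlink_unicast_kernel() change in this patch saves skb->vrf before ownership assignment and restores it afterwards, so that rtnetlink handlers see the sender's vrf rather than the kernel socket's. A minimal userspace sketch of that save/restore pattern follows; the nl_* types and functions are hypothetical stand-ins for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for struct sock and struct sk_buff. */
struct nl_sock {
	uint32_t sk_vrf;
};

struct nl_skb {
	const struct nl_sock *sk;
	uint32_t vrf;
};

/* Mirrors netlink_sendmsg(): the request skb is tagged with the
 * sending socket's vrf. */
static void nl_skb_from_sender(struct nl_skb *skb, const struct nl_sock *ssk)
{
	assert(skb != NULL && ssk != NULL);
	skb->sk = NULL;
	skb->vrf = ssk->sk_vrf;
}

/* Mirrors ownership assignment, which overwrites skb->vrf with the
 * vrf of the new owner. */
static void nl_skb_set_owner_r(struct nl_skb *skb, const struct nl_sock *sk)
{
	skb->sk = sk;
	skb->vrf = sk->sk_vrf;
}

/* Mirrors the patched delivery path: remember the sender's vrf, hand
 * the skb to the kernel socket, then put the sender's vrf back. */
static void nl_deliver_to_kernel(struct nl_skb *skb,
				 const struct nl_sock *kernel_sk)
{
	uint32_t vrf = skb->vrf;

	nl_skb_set_owner_r(skb, kernel_sk);
	skb->vrf = vrf;
}
```

With a kernel socket in vrf 1 and a userspace sender in vrf 4, the skb ends up owned by the kernel socket but still carries vrf 4 when the receive handler would run.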


* [RFC PATCH 20/29] net: vrf: Add vrf context to flow struct
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (18 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 21/29] net: vrf: Add vrf context to genid's David Ahern
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/flow.h              |  7 ++++++-
 include/net/route.h             |  4 ++--
 net/ipv4/devinet.c              |  2 +-
 net/ipv4/fib_frontend.c         |  5 ++++-
 net/ipv4/fib_rules.c            |  2 ++
 net/ipv4/fib_semantics.c        |  1 +
 net/ipv4/icmp.c                 |  3 +++
 net/ipv4/inet_connection_sock.c |  4 ++--
 net/ipv4/ip_output.c            |  2 +-
 net/ipv4/ipmr.c                 |  2 ++
 net/ipv4/netfilter.c            |  1 +
 net/ipv4/ping.c                 |  2 +-
 net/ipv4/raw.c                  |  2 +-
 net/ipv4/route.c                | 23 +++++++++++++----------
 net/ipv4/syncookies.c           |  2 +-
 net/ipv4/udp.c                  |  3 ++-
 net/ipv4/xfrm4_policy.c         |  2 ++
 net/sctp/protocol.c             |  1 +
 18 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 07e7a58b9aac..6d35a8bfbe72 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -30,6 +30,7 @@ struct flowi_common {
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
 	__u32	flowic_secid;
+	__u32	flowic_vrf;
 };
 
 union flowi_uli {
@@ -66,6 +67,7 @@ struct flowi4 {
 #define flowi4_proto		__fl_common.flowic_proto
 #define flowi4_flags		__fl_common.flowic_flags
 #define flowi4_secid		__fl_common.flowic_secid
+#define flowi4_vrf		__fl_common.flowic_vrf
 
 	/* (saddr,daddr) must be grouped, same order as in IP header */
 	__be32			saddr;
@@ -81,7 +83,7 @@ struct flowi4 {
 #define fl4_gre_key		uli.gre_key
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
-static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
+static inline void flowi4_init_output(struct flowi4 *fl4, __u32 vrf, int oif,
 				      __u32 mark, __u8 tos, __u8 scope,
 				      __u8 proto, __u8 flags,
 				      __be32 daddr, __be32 saddr,
@@ -95,6 +97,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, int oif,
 	fl4->flowi4_proto = proto;
 	fl4->flowi4_flags = flags;
 	fl4->flowi4_secid = 0;
+	fl4->flowi4_vrf = vrf;
 	fl4->daddr = daddr;
 	fl4->saddr = saddr;
 	fl4->fl4_dport = dport;
@@ -122,6 +125,7 @@ struct flowi6 {
 #define flowi6_proto		__fl_common.flowic_proto
 #define flowi6_flags		__fl_common.flowic_flags
 #define flowi6_secid		__fl_common.flowic_secid
+#define flowi6_vrf		__fl_common.flowic_vrf
 	struct in6_addr		daddr;
 	struct in6_addr		saddr;
 	__be32			flowlabel;
@@ -165,6 +169,7 @@ struct flowi {
 #define flowi_proto	u.__fl_common.flowic_proto
 #define flowi_flags	u.__fl_common.flowic_flags
 #define flowi_secid	u.__fl_common.flowic_secid
+#define flowi_vrf	u.__fl_common.flowic_vrf
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 static inline struct flowi *flowi4_to_flowi(struct flowi4 *fl4)
diff --git a/include/net/route.h b/include/net/route.h
index 5f0b770225d7..a062df826c67 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -139,7 +139,7 @@ static inline struct rtable *ip_route_output_ports(struct net_ctx *ctx, struct f
 						   __be16 dport, __be16 sport,
 						   __u8 proto, __u8 tos, int oif)
 {
-	flowi4_init_output(fl4, oif, sk ? sk->sk_mark : 0, tos,
+	flowi4_init_output(fl4, ctx->vrf, oif, sk ? sk->sk_mark : 0, tos,
 			   RT_SCOPE_UNIVERSE, proto,
 			   sk ? inet_sk_flowi_flags(sk) : 0,
 			   daddr, saddr, dport, sport);
@@ -250,7 +250,7 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
-	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
+	flowi4_init_output(fl4, sk->sk_vrf, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 59de98a44508..02ffbfb8bfee 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -155,7 +155,7 @@ struct net_device *__ip_dev_find(struct net_ctx *ctx, __be32 addr, bool devref)
 		}
 	}
 	if (!result) {
-		struct flowi4 fl4 = { .daddr = addr };
+		struct flowi4 fl4 = { .daddr = addr, .flowi4_vrf = ctx->vrf };
 		struct fib_result res = { 0 };
 		struct fib_table *local;
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b068ab996cc3..f2a8a557a3d8 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -155,7 +155,7 @@ static inline unsigned int __inet_dev_addr_type(struct net_ctx *ctx,
 						__be32 addr)
 {
 	struct net *net = ctx->net;
-	struct flowi4		fl4 = { .daddr = addr };
+	struct flowi4		fl4 = { .daddr = addr, .flowi4_vrf = ctx->vrf };
 	struct fib_result	res;
 	unsigned int ret = RTN_BROADCAST;
 	struct fib_table *local_table;
@@ -221,6 +221,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 		fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 		fl4.flowi4_scope = scope;
 		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
+		fl4.flowi4_vrf = dev_ctx.vrf;
 		if (!fib_lookup(&dev_ctx, &fl4, &res))
 			return FIB_RES_PREFSRC(&dev_ctx, res);
 	} else {
@@ -258,6 +259,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	no_addr = idev->ifa_list == NULL;
 
 	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
+	fl4.flowi4_vrf = dev_ctx.vrf;
 
 	if (fib_lookup(&dev_ctx, &fl4, &res))
 		goto last_resort;
@@ -292,6 +294,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	if (rpf == 1)
 		goto e_rpf;
 	fl4.flowi4_oif = dev->ifindex;
+	fl4.flowi4_vrf = dev_vrf(dev);
 
 	ret = 0;
 	if (fib_lookup(&dev_ctx, &fl4, &res) == 0) {
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index bb9399e2c1cb..0dc8adf7b767 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -55,6 +55,8 @@ int __fib_lookup(struct net_ctx *ctx, struct flowi4 *flp, struct fib_result *res
 	};
 	int err;
 
+	flp->flowi4_vrf = ctx->vrf;
+
 	err = fib_rules_lookup(ctx->net->ipv4.rules_ops, flowi4_to_flowi(flp),
 			       0, &arg);
 #ifdef CONFIG_IP_ROUTE_CLASSID
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 99af28c2fb6d..9fc5487e66fe 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -617,6 +617,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 				.flowi4_scope = cfg->fc_scope + 1,
 				.flowi4_oif = nh->nh_oif,
 				.flowi4_iif = LOOPBACK_IFINDEX,
+				.flowi4_vrf = net_ctx->vrf,
 			};
 
 			/* It is not necessary, but requires a bit of thinking */
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 2d1e98e6ad14..9d4c38292fee 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -426,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_vrf = skb->vrf;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(&dev_ctx, &fl4);
 	if (IS_ERR(rt))
@@ -457,6 +458,7 @@ static struct rtable *icmp_route_lookup(struct net_ctx *ctx,
 	fl4->flowi4_mark = mark;
 	fl4->flowi4_tos = RT_TOS(tos);
 	fl4->flowi4_proto = IPPROTO_ICMP;
+	fl4->flowi4_vrf = skb_in->vrf;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
@@ -490,6 +492,7 @@ static struct rtable *icmp_route_lookup(struct net_ctx *ctx,
 		unsigned long orefdst;
 
 		fl4_2.daddr = fl4_dec.saddr;
+		fl4_2.flowi4_vrf = skb_in->vrf;
 		rt2 = ip_route_output_key(ctx, &fl4_2);
 		if (IS_ERR(rt2)) {
 			err = PTR_ERR(rt2);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 3b8df03c69db..ace32910667e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -407,7 +407,7 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 	struct net_ctx ctx = { .net = net, .vrf = ireq->ir_vrf };
 	int flags = inet_sk_flowi_flags(sk);
 
-	flowi4_init_output(fl4, sk->sk_bound_dev_if, ireq->ir_mark,
+	flowi4_init_output(fl4, ctx.vrf, sk->sk_bound_dev_if, ireq->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol,
 			   flags,
@@ -445,7 +445,7 @@ struct dst_entry *inet_csk_route_child_sock(struct sock *sk,
 
 	rcu_read_lock();
 	opt = rcu_dereference(newinet->inet_opt);
-	flowi4_init_output(fl4, sk->sk_bound_dev_if, inet_rsk(req)->ir_mark,
+	flowi4_init_output(fl4, ctx.vrf, sk->sk_bound_dev_if, inet_rsk(req)->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol, inet_sk_flowi_flags(sk),
 			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->ir_rmt_addr,
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 383bac145bf4..9b2d8d7ff6cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1556,7 +1556,7 @@ void ip_send_unicast_reply(struct net_ctx *ctx, struct sk_buff *skb,
 			daddr = replyopts.opt.opt.faddr;
 	}
 
-	flowi4_init_output(&fl4, arg->bound_dev_if,
+	flowi4_init_output(&fl4, skb->vrf, arg->bound_dev_if,
 			   IP4_REPLY_MARK(ctx->net, skb->mark),
 			   RT_TOS(arg->tos),
 			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 84d6efeeb072..a9e438c7aaa4 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -458,6 +458,7 @@ static netdev_tx_t reg_vif_xmit(struct sk_buff *skb, struct net_device *dev)
 		.flowi4_oif	= dev->ifindex,
 		.flowi4_iif	= skb->skb_iif ? : LOOPBACK_IFINDEX,
 		.flowi4_mark	= skb->mark,
+		.flowi4_vrf	= skb->vrf,
 	};
 	int err;
 
@@ -1934,6 +1935,7 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
 			       LOOPBACK_IFINDEX :
 			       skb->dev->ifindex),
 		.flowi4_mark = skb->mark,
+		.flowi4_vrf = skb->vrf,
 	};
 	struct mr_table *mrt;
 	int err;
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index a10ab84b69d8..c00ea581839a 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -43,6 +43,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 	fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_flags = flags;
+	fl4.flowi4_vrf = skb->vrf;
 	rt = ip_route_output_key(&ctx, &fl4);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index bca4f27502b0..e08f7ae8d8fe 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -779,7 +779,7 @@ static int ping_v4_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 	} else if (!ipc.oif)
 		ipc.oif = inet->uc_index;
 
-	flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos,
+	flowi4_init_output(&fl4, sk_ctx.vrf, ipc.oif, sk->sk_mark, tos,
 			   RT_SCOPE_UNIVERSE, sk->sk_protocol,
 			   inet_sk_flowi_flags(sk), faddr, saddr, 0, 0);
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c06dd58e538b..f3a349ea3dd8 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -591,7 +591,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	} else if (!ipc.oif)
 		ipc.oif = inet->uc_index;
 
-	flowi4_init_output(&fl4, ipc.oif, sk->sk_mark, tos,
+	flowi4_init_output(&fl4, sk_ctx.vrf, ipc.oif, sk->sk_mark, tos,
 			   RT_SCOPE_UNIVERSE,
 			   inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol,
 			   inet_sk_flowi_flags(sk) |
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 018e292ff145..8271c5b30322 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -503,7 +503,7 @@ EXPORT_SYMBOL(__ip_select_ident);
 
 static void __build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 			     const struct iphdr *iph,
-			     int oif, u8 tos,
+			     __u32 vrf, int oif, u8 tos,
 			     u8 prot, u32 mark, int flow_flags)
 {
 	if (sk) {
@@ -511,10 +511,11 @@ static void __build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 
 		oif = sk->sk_bound_dev_if;
 		mark = sk->sk_mark;
+		vrf = sk->sk_vrf;
 		tos = RT_CONN_FLAGS(sk);
 		prot = inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol;
 	}
-	flowi4_init_output(fl4, oif, mark, tos,
+	flowi4_init_output(fl4, vrf, oif, mark, tos,
 			   RT_SCOPE_UNIVERSE, prot,
 			   flow_flags,
 			   iph->daddr, iph->saddr, 0, 0);
@@ -529,7 +530,7 @@ static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;
 
-	__build_flow_key(fl4, sk, iph, oif, tos, prot, mark, 0);
+	__build_flow_key(fl4, sk, iph, skb->vrf, oif, tos, prot, mark, 0);
 }
 
 static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
@@ -542,7 +543,7 @@ static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
 	inet_opt = rcu_dereference(inet->inet_opt);
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
-	flowi4_init_output(fl4, sk->sk_bound_dev_if, sk->sk_mark,
+	flowi4_init_output(fl4, sk->sk_vrf, sk->sk_bound_dev_if, sk->sk_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol,
 			   inet_sk_flowi_flags(sk),
@@ -794,7 +795,7 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf
 
 	rt = (struct rtable *) dst;
 
-	__build_flow_key(&fl4, sk, iph, oif, tos, prot, mark, 0);
+	__build_flow_key(&fl4, sk, iph, skb->vrf, oif, tos, prot, mark, 0);
 	__ip_do_redirect(rt, skb, &fl4, true);
 }
 
@@ -1006,7 +1007,7 @@ void ipv4_update_pmtu(struct sk_buff *skb, struct net_ctx *ctx, u32 mtu,
 	if (!mark)
 		mark = IP4_REPLY_MARK(ctx->net, skb->mark);
 
-	__build_flow_key(&fl4, NULL, iph, oif,
+	__build_flow_key(&fl4, NULL, iph, skb->vrf, oif,
 			 RT_TOS(iph->tos), protocol, mark, flow_flags);
 	rt = __ip_route_output_key(ctx, &fl4);
 	if (!IS_ERR(rt)) {
@@ -1023,7 +1024,7 @@ static void __ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 	struct rtable *rt;
 	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
-	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
+	__build_flow_key(&fl4, sk, iph, skb->vrf, 0, 0, 0, 0, 0);
 
 	if (!fl4.flowi4_mark)
 		fl4.flowi4_mark = IP4_REPLY_MARK(sk_ctx.net, skb->mark);
@@ -1056,7 +1057,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 		goto out;
 	}
 
-	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
+	__build_flow_key(&fl4, sk, iph, skb->vrf, 0, 0, 0, 0, 0);
 
 	rt = (struct rtable *)odst;
 	if (odst->obsolete && odst->ops->check(odst, 0) == NULL) {
@@ -1096,7 +1097,7 @@ void ipv4_redirect(struct sk_buff *skb, struct net_ctx *ctx,
 	struct flowi4 fl4;
 	struct rtable *rt;
 
-	__build_flow_key(&fl4, NULL, iph, oif,
+	__build_flow_key(&fl4, NULL, iph, skb->vrf, oif,
 			 RT_TOS(iph->tos), protocol, mark, flow_flags);
 	rt = __ip_route_output_key(ctx, &fl4);
 	if (!IS_ERR(rt)) {
@@ -1113,7 +1114,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk)
 	struct rtable *rt;
 	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
-	__build_flow_key(&fl4, sk, iph, 0, 0, 0, 0, 0);
+	__build_flow_key(&fl4, sk, iph, skb->vrf, 0, 0, 0, 0, 0);
 	rt = __ip_route_output_key(&sk_ctx, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_do_redirect(rt, skb, &fl4, false);
@@ -1190,6 +1191,7 @@ void ip_rt_get_source(u8 *addr, struct sk_buff *skb, struct rtable *rt)
 		fl4.flowi4_oif = rt->dst.dev->ifindex;
 		fl4.flowi4_iif = skb->dev->ifindex;
 		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_vrf = skb->vrf;
 
 		rcu_read_lock();
 		if (fib_lookup(&dev_ctx, &fl4, &res) == 0)
@@ -1724,6 +1726,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	fl4.flowi4_iif = dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
+	fl4.flowi4_vrf = skb->vrf;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 	fl4.daddr = daddr;
 	fl4.saddr = saddr;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 7702e1f94174..916994d21f17 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -368,7 +368,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	 * hasn't changed since we received the original syn, but I see
 	 * no easy way to do this.
 	 */
-	flowi4_init_output(&fl4, sk->sk_bound_dev_if, ireq->ir_mark,
+	flowi4_init_output(&fl4, skb->vrf, sk->sk_bound_dev_if, ireq->ir_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE, IPPROTO_TCP,
 			   inet_sk_flowi_flags(sk),
 			   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1787dc8e5db3..1446c84428d8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1023,7 +1023,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		struct net *net = sk_ctx.net;
 
 		fl4 = &fl4_stack;
-		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+		flowi4_init_output(fl4, sk_ctx.vrf, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
 				   inet_sk_flowi_flags(sk),
 				   faddr, saddr, dport, inet->inet_sport);
@@ -1083,6 +1083,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	fl4->saddr = saddr;
 	fl4->fl4_dport = dport;
 	fl4->fl4_sport = inet->inet_sport;
+	fl4->flowi4_vrf = sk_ctx.vrf;
 	up->pending = AF_INET;
 
 do_append_data:
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index c892b6bb0383..660059d09872 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -28,6 +28,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net_ctx *ctx, struct flowi4 *
 	memset(fl4, 0, sizeof(*fl4));
 	fl4->daddr = daddr->a4;
 	fl4->flowi4_tos = tos;
+	fl4->flowi4_vrf = ctx->vrf;
 	if (saddr)
 		fl4->saddr = saddr->a4;
 
@@ -112,6 +113,7 @@ _decode_session4(struct sk_buff *skb, struct flowi *fl, int reverse)
 	memset(fl4, 0, sizeof(struct flowi4));
 	fl4->flowi4_mark = skb->mark;
 	fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
+	fl4->flowi4_vrf = skb->vrf;
 
 	if (!ip_is_fragment(iph)) {
 		switch (iph->protocol) {
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index d59affad3f01..11c1a58296d8 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -435,6 +435,7 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 	fl4->daddr  = daddr->v4.sin_addr.s_addr;
 	fl4->fl4_dport = daddr->v4.sin_port;
 	fl4->flowi4_proto = IPPROTO_SCTP;
+	fl4->flowi4_vrf = sk_ctx.vrf;
 	if (asoc) {
 		fl4->flowi4_tos = RT_CONN_FLAGS(asoc->base.sk);
 		fl4->flowi4_oif = asoc->base.sk->sk_bound_dev_if;
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 21/29] net: vrf: Add vrf context to genid's
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (19 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 20/29] net: vrf: Add vrf context to flow struct David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 22/29] net: vrf: Set VRF id in various network structs David Ahern
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

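The bottom 12 bits (VRF_BITS) of each genid hold the VRF id. Readers add
the VRF id to the per-namespace counter, and bumps add 1 << VRF_BITS so
the VRF bits are left untouched.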
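
A minimal userspace sketch of the encoding, not taken from the patch:
VRF_MASK is assumed to be the low VRF_BITS mask, and struct genid_counter,
genid_bump() and genid_read() are hypothetical stand-ins for the atomic_t
counters and their helpers. It uses unsigned arithmetic where the kernel
code uses atomic_t and int.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the commit message says the bottom 12 bits are the VRF id. */
#define VRF_BITS 12
#define VRF_MASK ((1u << VRF_BITS) - 1)

/* Hypothetical stand-in for a per-namespace generation counter. */
struct genid_counter {
	uint32_t base;	/* low VRF_BITS are expected to stay zero */
};

/* Bump the generation without disturbing the VRF bits. */
static inline void genid_bump(struct genid_counter *g)
{
	g->base += 1u << VRF_BITS;
}

/* Combine the shared counter with a VRF id to get a per-VRF genid. */
static inline uint32_t genid_read(const struct genid_counter *g, uint32_t vrf)
{
	assert(vrf <= VRF_MASK);	/* the id must fit in the low bits */
	return g->base + vrf;
}
```

With this layout two VRFs in the same namespace never see the same genid
for a given generation, and masking a genid with VRF_MASK recovers the
VRF id.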

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip_fib.h        |  2 +-
 include/net/net_namespace.h | 12 ++++++++----
 net/ipv4/devinet.c          | 12 ++++++++----
 net/ipv4/fib_frontend.c     |  8 +++++---
 net/ipv4/fib_semantics.c    |  2 +-
 net/ipv4/route.c            | 13 +++++++++++--
 6 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 577479d7f268..e6b823c0305e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -180,7 +180,7 @@ __be32 fib_info_update_nh_saddr(struct net_ctx *ctx, struct fib_nh *nh);
 
 #define FIB_RES_SADDR(ctx, res)				\
 	((FIB_RES_NH(res).nh_saddr_genid ==		\
-	  atomic_read(&(ctx)->net->ipv4.dev_addr_genid)) ? \
+	  (atomic_read(&(ctx)->net->ipv4.dev_addr_genid) + (ctx)->vrf)) ? \
 	 FIB_RES_NH(res).nh_saddr :			\
 	 fib_info_update_nh_saddr((ctx), &FIB_RES_NH(res)))
 #define FIB_RES_GW(res)			(FIB_RES_NH(res).nh_gw)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 7cc7b0a1a20b..d0a3414758f8 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -372,12 +372,14 @@ static inline void unregister_net_sysctl_table(struct ctl_table_header *header)
 
 static inline int rt_genid_ipv4(struct net_ctx *ctx)
 {
-	return atomic_read(&ctx->net->ipv4.rt_genid);
+	return atomic_read(&ctx->net->ipv4.rt_genid) + ctx->vrf;
 }
 
 static inline void rt_genid_bump_ipv4(struct net *net)
 {
-	atomic_inc(&net->ipv4.rt_genid);
+	int inc = 1 << VRF_BITS;
+
+	atomic_add(inc, &net->ipv4.rt_genid);
 }
 
 extern void (*__fib6_flush_trees)(struct net *net);
@@ -404,12 +406,14 @@ static inline void rt_genid_bump_all(struct net *net)
 
 static inline int fnhe_genid(struct net_ctx *ctx)
 {
-	return atomic_read(&ctx->net->fnhe_genid);
+	return atomic_read(&ctx->net->fnhe_genid) + ctx->vrf;
 }
 
 static inline void fnhe_genid_bump(struct net *net)
 {
-	atomic_inc(&net->fnhe_genid);
+	int inc = 1 << VRF_BITS;
+
+	atomic_add(inc, &net->fnhe_genid);
 }
 
 #endif /* __NET_NET_NAMESPACE_H */
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 02ffbfb8bfee..7c0c3bc17599 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1536,6 +1536,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net_ctx sk_ctx = SOCK_NET_CTX(skb->sk);
 	struct net *net = sk_ctx.net;
+	__u32 vrf = sk_ctx.vrf;
 	int h, s_h;
 	int idx, s_idx;
 	int ip_idx, s_ip_idx;
@@ -1549,11 +1550,12 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	s_ip_idx = ip_idx = cb->args[2];
 
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		int genid;
 		idx = 0;
 		head = &net->dev_index_head[h];
 		rcu_read_lock();
-		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
-			  net->dev_base_seq;
+		genid = atomic_read(&net->ipv4.dev_addr_genid) + vrf;
+		cb->seq = genid ^ net->dev_base_seq;
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
@@ -1861,6 +1863,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 {
 	struct net_ctx sk_ctx = SOCK_NET_CTX(skb->sk);
 	struct net *net = sk_ctx.net;
+	__u32 vrf = sk_ctx.vrf;
 	int h, s_h;
 	int idx, s_idx;
 	struct net_device *dev;
@@ -1871,11 +1874,12 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 	s_idx = idx = cb->args[1];
 
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		int genid;
 		idx = 0;
 		head = &net->dev_index_head[h];
 		rcu_read_lock();
-		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
-			  net->dev_base_seq;
+		genid = atomic_read(&net->ipv4.dev_addr_genid) + vrf;
+		cb->seq = genid ^ net->dev_base_seq;
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index f2a8a557a3d8..cba1e2c9c2ec 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1021,6 +1021,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
 	struct net_device *dev = ifa->ifa_dev->dev;
 	struct net *net = dev_net(dev);
+	int inc = 1 << VRF_BITS;
 
 	switch (event) {
 	case NETDEV_UP:
@@ -1028,12 +1029,12 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 		fib_sync_up(dev);
 #endif
-		atomic_inc(&net->ipv4.dev_addr_genid);
+		atomic_add(inc, &net->ipv4.dev_addr_genid);
 		rt_cache_flush(dev_net(dev));
 		break;
 	case NETDEV_DOWN:
 		fib_del_ifaddr(ifa, NULL);
-		atomic_inc(&net->ipv4.dev_addr_genid);
+		atomic_add(inc, &net->ipv4.dev_addr_genid);
 		if (ifa->ifa_dev->ifa_list == NULL) {
 			/* Last address was deleted from this interface.
 			 * Disable IP.
@@ -1052,6 +1053,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	int inc = 1 << VRF_BITS;
 
 	if (event == NETDEV_UNREGISTER) {
 		fib_disable_ip(dev, 2);
@@ -1071,7 +1073,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 		fib_sync_up(dev);
 #endif
-		atomic_inc(&net->ipv4.dev_addr_genid);
+		atomic_add(inc, &net->ipv4.dev_addr_genid);
 		rt_cache_flush(net);
 		break;
 	case NETDEV_DOWN:
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 9fc5487e66fe..a7d810cafada 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -756,7 +756,7 @@ __be32 fib_info_update_nh_saddr(struct net_ctx *net_ctx, struct fib_nh *nh)
 	nh->nh_saddr = inet_select_addr(nh->nh_dev,
 					nh->nh_gw,
 					nh->nh_parent->fib_scope);
-	nh->nh_saddr_genid = atomic_read(&net->ipv4.dev_addr_genid);
+	nh->nh_saddr_genid = atomic_read(&net->ipv4.dev_addr_genid) + net_ctx->vrf;
 
 	return nh->nh_saddr;
 }
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8271c5b30322..f980a42a995f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2706,10 +2706,19 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 
 static __net_init int rt_genid_init(struct net *net)
 {
+	int genid;
+
 	atomic_set(&net->ipv4.rt_genid, 0);
 	atomic_set(&net->fnhe_genid, 0);
-	get_random_bytes(&net->ipv4.dev_addr_genid,
-			 sizeof(net->ipv4.dev_addr_genid));
+
+again:
+	get_random_bytes(&genid, sizeof(genid));
+	genid &= ~VRF_MASK;
+	if (genid == 0)
+		goto again;
+
+	atomic_set(&net->ipv4.dev_addr_genid, genid);
+
 	return 0;
 }
 
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 22/29] net: vrf: Set VRF id in various network structs
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (20 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 21/29] net: vrf: Add vrf context to genid's David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 23/29] net: vrf: Enable vrf checks David Ahern
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

The VRF id comes from the passed-in network context, similar to the
namespace.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/inet_hashtables.h | 1 +
 include/net/neighbour.h       | 1 +
 net/core/fib_rules.c          | 3 +++
 net/core/neighbour.c          | 1 +
 net/ipv4/fib_frontend.c       | 3 +++
 net/ipv4/fib_semantics.c      | 1 +
 net/ipv4/icmp.c               | 1 +
 net/ipv4/ipmr.c               | 1 +
 net/ipv4/route.c              | 1 +
 9 files changed, 13 insertions(+)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9ddc1b2309ce..eec177ef0798 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -95,6 +95,7 @@ static inline
 void ib_net_ctx_set(struct inet_bind_bucket *ib, struct net_ctx *ctx)
 {
 	write_pnet(&ib->ib_net_ctx.net, hold_net(ctx->net));
+	ib->ib_net_ctx.vrf = ctx->vrf;
 }
 
 static inline
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 73d0938b085c..d9e2328ad60a 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -323,6 +323,7 @@ void pneigh_net_ctx_set(struct pneigh_entry *pneigh,
 			const struct net_ctx *net_ctx)
 {
 	write_pnet(&pneigh->net_ctx.net, hold_net(net_ctx->net));
+	pneigh->net_ctx.vrf = net_ctx->vrf;
 }
 static inline
 int pneigh_net_ctx_eq(const struct pneigh_entry *pneigh,
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 9a1a4a23b6f6..223a4004bdd0 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -32,6 +32,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 	r->table = table;
 	r->flags = flags;
 	r->fr_net = hold_net(ops->fro_net);
+	r->fr_vrf = ops->fro_vrf;
 
 	r->suppress_prefixlen = -1;
 	r->suppress_ifgroup = -1;
@@ -137,6 +138,7 @@ fib_rules_register(const struct fib_rules_ops *tmpl, struct net_ctx *ctx)
 
 	INIT_LIST_HEAD(&ops->rules_list);
 	ops->fro_net = ctx->net;
+	ops->fro_vrf = ctx->vrf;
 
 	err = __fib_rules_register(ops);
 	if (err) {
@@ -305,6 +307,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 		goto errout;
 	}
 	rule->fr_net = hold_net(net);
+	rule->fr_vrf = sk_ctx.vrf;
 
 	if (tb[FRA_PRIORITY])
 		rule->pref = nla_get_u32(tb[FRA_PRIORITY]);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 0fbbe70be170..e6c03d367f56 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1467,6 +1467,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 		dev_hold(dev);
 		p->dev = dev;
 		write_pnet(&p->net_ctx.net, hold_net(dev_ctx.net));
+		p->net_ctx.vrf = dev_ctx.vrf;
 		p->sysctl_table = NULL;
 
 		if (ops->ndo_neigh_setup && ops->ndo_neigh_setup(dev, p)) {
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index cba1e2c9c2ec..2f06b71bed53 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -357,6 +357,7 @@ static int rtentry_to_fib_config(struct net_ctx *ctx, int cmd,
 
 	memset(cfg, 0, sizeof(*cfg));
 	cfg->fc_nlinfo.nl_net = net;
+	cfg->fc_nlinfo.nl_vrf = ctx->vrf;
 
 	if (rt->rt_dst.sa_family != AF_INET)
 		return -EAFNOSUPPORT;
@@ -564,6 +565,7 @@ static int rtm_to_fib_config(struct net_ctx *ctx, struct sk_buff *skb,
 	cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid;
 	cfg->fc_nlinfo.nlh = nlh;
 	cfg->fc_nlinfo.nl_net = ctx->net;
+	cfg->fc_nlinfo.nl_vrf = ctx->vrf;
 
 	if (cfg->fc_type > RTN_MAX) {
 		err = -EINVAL;
@@ -714,6 +716,7 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		.fc_nlflags = NLM_F_CREATE | NLM_F_APPEND,
 		.fc_nlinfo = {
 			.nl_net = net,
+			.nl_vrf = dev_vrf(ifa->ifa_dev->dev),
 		},
 	};
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a7d810cafada..65d01c5b747e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -819,6 +819,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		fi->fib_metrics = (u32 *) dst_default_metrics;
 
 	fi->fib_net = hold_net(net);
+	fi->fib_vrf = net_ctx->vrf;
 	fi->fib_protocol = cfg->fc_protocol;
 	fi->fib_scope = cfg->fc_scope;
 	fi->fib_flags = cfg->fc_flags;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 9d4c38292fee..b7766a73e46d 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -564,6 +564,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 		goto out;
 	net = dev_net(rt->dst.dev);
 	dev_ctx.net = net;
+	dev_ctx.vrf = dev_vrf(rt->dst.dev);
 
 	/*
 	 *	Find the original header. It is expected to be valid, of course.
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a9e438c7aaa4..d00ba199a012 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2081,6 +2081,7 @@ static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb,
 	skb->ip_summed = CHECKSUM_NONE;
 
 	dev_ctx.net = dev_net(reg_dev);
+	dev_ctx.vrf = dev_vrf(reg_dev);
 	skb_tunnel_rx(skb, reg_dev, &dev_ctx);
 
 	netif_rx(skb);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f980a42a995f..d6c5f0a8ab17 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1621,6 +1621,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	}
 
 	dev_ctx.net = dev_net(rth->dst.dev);
+	dev_ctx.vrf = dev_vrf(rth->dst.dev);
 	rth->rt_genid = rt_genid_ipv4(&dev_ctx);
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 23/29] net: vrf: Enable vrf checks
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (21 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 22/29] net: vrf: Set VRF id in various network structs David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 24/29] net: vrf: Add support to get/set vrf context on a device David Ahern
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Add vrf comparison to all of the net_ctx_eq functions and a few other
places needed to enable vrf awareness.
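
A simplified userspace sketch of the two patterns, for illustration only:
vrf_eq() is assumed here to be plain integer equality (its real definition
lives elsewhere in the series and may treat a wildcard VRF differently),
and struct fake_dev and dev_filter_ctx() are hypothetical names that mimic
the lookup-then-filter shape of the *_ctx device lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins; only the fields needed for the comparison. */
struct net {
	int id;
};

struct net_ctx {
	const struct net *net;
	uint32_t vrf;
};

/* Assumption: plain equality, ignoring any "VRF any" handling. */
static inline bool vrf_eq(uint32_t a, uint32_t b)
{
	return a == b;
}

/* Contexts match only if both the namespace and the VRF id match. */
static inline bool net_ctx_eq(const struct net_ctx *a, const struct net_ctx *b)
{
	return a->net == b->net && vrf_eq(a->vrf, b->vrf);
}

/* Hypothetical device record carrying its own VRF id. */
struct fake_dev {
	const struct net *net;
	uint32_t vrf;
	int ifindex;
};

/* Take the result of a namespace-wide lookup and hide it if the VRF differs. */
static const struct fake_dev *dev_filter_ctx(const struct fake_dev *dev,
					     const struct net_ctx *ctx)
{
	assert(ctx != NULL);
	if (dev && !vrf_eq(dev->vrf, ctx->vrf))
		dev = NULL;
	return dev;
}
```

Callers that previously got a device from a namespace-only lookup now see
NULL when the device sits in a different VRF, so the existing -ENODEV
paths cover the mismatch.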

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/netdevice.h     |  3 ++-
 include/net/inet_hashtables.h |  3 ++-
 include/net/ip_fib.h          |  3 ++-
 include/net/ipv6.h            |  2 +-
 include/net/neighbour.h       |  6 ++++--
 include/net/net_namespace.h   |  2 +-
 include/net/sock.h            |  2 +-
 net/core/dev.c                |  9 +++++++++
 net/core/fib_rules.c          |  4 ++--
 net/core/neighbour.c          |  6 ++++--
 net/ipv4/arp.c                |  4 +++-
 net/ipv4/devinet.c            | 11 ++++++++++-
 net/ipv4/fib_frontend.c       |  2 +-
 net/ipv4/fib_semantics.c      |  5 +++++
 net/ipv4/igmp.c               |  7 +++++++
 net/ipv4/inet_hashtables.c    |  2 ++
 net/ipv4/ip_sockglue.c        |  4 ++++
 17 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b6de06eda683..f4a707263446 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1857,7 +1857,8 @@ __u32 dev_vrf(const struct net_device *dev)
 static inline
 int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
 {
-	if (net_eq(dev_net(dev), ctx->net))
+	if (net_eq(dev_net(dev), ctx->net) &&
+	    vrf_eq(dev_vrf(dev), ctx->vrf))
 		return 1;
 
 	return 0;
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index eec177ef0798..199809e46133 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -101,7 +101,8 @@ void ib_net_ctx_set(struct inet_bind_bucket *ib, struct net_ctx *ctx)
 static inline
 int ib_net_ctx_eq(struct inet_bind_bucket *ib, struct net_ctx *ctx)
 {
-	if (net_eq(ib_net(ib), ctx->net))
+	if (net_eq(ib_net(ib), ctx->net) &&
+	    vrf_eq(ib->ib_net_ctx.vrf, ctx->vrf))
 		return 1;
 
 	return 0;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e6b823c0305e..d49358bc342c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -127,7 +127,8 @@ struct fib_info {
 static inline
 int fib_net_ctx_eq(const struct fib_info *fi, const struct net_ctx *ctx)
 {
-	if (net_eq(fi->fib_net_ctx.net, ctx->net))
+	if (net_eq(fi->fib_net_ctx.net, ctx->net) &&
+	    vrf_eq(fi->fib_net_ctx.vrf, ctx->vrf))
 		return 1;
 
 	return 0;
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 61f8b6df8bb9..ba1d145d67fd 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -247,7 +247,7 @@ static inline
 int fl_net_ctx_eq(struct ip6_flowlabel *fl, struct net_ctx *ctx)
 {
 #ifdef CONFIG_NET_NS
-	return net_eq(fl->fl_net, ctx->net);
+	return net_eq(fl->fl_net, ctx->net) && vrf_eq(fl->fl_vrf, ctx->vrf);
 #else
 	return 1;
 #endif
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index d9e2328ad60a..f3527b25d612 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -294,7 +294,8 @@ int neigh_parms_net_ctx_eq(const struct neigh_parms *parms,
 			   const struct net_ctx *net_ctx)
 {
 #ifdef CONFIG_NET_NS
-	if (net_eq(neigh_parms_net(parms), net_ctx->net))
+	if (net_eq(neigh_parms_net(parms), net_ctx->net) &&
+	    vrf_eq(neigh_parms_vrf(parms), net_ctx->vrf))
 		return 1;
 
 	return 0;
@@ -330,7 +331,8 @@ int pneigh_net_ctx_eq(const struct pneigh_entry *pneigh,
 		      const struct net_ctx *net_ctx)
 {
 #ifdef CONFIG_NET_NS
-	if (net_eq(pneigh_net(pneigh), net_ctx->net))
+	if (net_eq(pneigh_net(pneigh), net_ctx->net) &&
+	    vrf_eq(pneigh->net_ctx.vrf, net_ctx->vrf))
 		return 1;
 
 	return 0;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index d0a3414758f8..7ae98b85cd21 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -215,7 +215,7 @@ int net_eq(const struct net *net1, const struct net *net2)
 static inline
 int net_ctx_eq(struct net_ctx *ctx1, struct net_ctx *ctx2)
 {
-	return net_eq(ctx1->net, ctx2->net);
+	return net_eq(ctx1->net, ctx2->net) && vrf_eq(ctx1->vrf, ctx2->vrf);
 }
 
 
diff --git a/include/net/sock.h b/include/net/sock.h
index d3668b691f82..a9b45fca4605 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2205,7 +2205,7 @@ void sock_net_set(struct sock *sk, struct net *net)
 static inline
 int sock_net_ctx_eq(struct sock *sk, struct net_ctx *ctx)
 {
-	return net_eq(sock_net(sk), ctx->net);
+	return net_eq(sock_net(sk), ctx->net) && vrf_eq(sk->sk_vrf, ctx->vrf);
 }
 
 /*
diff --git a/net/core/dev.c b/net/core/dev.c
index d64f5b107dba..adf575d6d267 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -688,6 +688,9 @@ struct net_device *__dev_get_by_name_ctx(struct net_ctx *ctx, const char *name)
 {
 	struct net_device *dev = __dev_get_by_name(ctx->net, name);
 
+	if (dev && !vrf_eq(dev_vrf(dev), ctx->vrf))
+		dev = NULL;
+
 	return dev;
 }
 EXPORT_SYMBOL(__dev_get_by_name_ctx);
@@ -771,6 +774,9 @@ struct net_device *__dev_get_by_index_ctx(struct net_ctx *ctx, int ifindex)
 {
 	struct net_device *dev = __dev_get_by_index(ctx->net, ifindex);
 
+	if (dev && !vrf_eq(dev_vrf(dev), ctx->vrf))
+		dev = NULL;
+
 	return dev;
 }
 EXPORT_SYMBOL(__dev_get_by_index_ctx);
@@ -814,6 +820,9 @@ struct net_device *dev_get_by_index_rcu_ctx(struct net_ctx *ctx, int ifindex)
 {
 	struct net_device *dev = dev_get_by_index_rcu(ctx->net, ifindex);
 
+	if (dev && !vrf_eq(dev_vrf(dev), ctx->vrf))
+		dev = NULL;
+
 	return dev;
 }
 EXPORT_SYMBOL(dev_get_by_index_rcu_ctx);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 223a4004bdd0..aea74e16360c 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -317,7 +317,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 
 		rule->iifindex = -1;
 		nla_strlcpy(rule->iifname, tb[FRA_IIFNAME], IFNAMSIZ);
-		dev = __dev_get_by_name(net, rule->iifname);
+		dev = __dev_get_by_name_ctx(&sk_ctx, rule->iifname);
 		if (dev)
 			rule->iifindex = dev->ifindex;
 	}
@@ -327,7 +327,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 
 		rule->oifindex = -1;
 		nla_strlcpy(rule->oifname, tb[FRA_OIFNAME], IFNAMSIZ);
-		dev = __dev_get_by_name(net, rule->oifname);
+		dev = __dev_get_by_name_ctx(&sk_ctx, rule->oifname);
 		if (dev)
 			rule->oifindex = dev->ifindex;
 	}
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index e6c03d367f56..46b7e8cc7c70 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2846,9 +2846,11 @@ static void neigh_copy_dflt_parms(struct net_ctx *ctx, struct neigh_parms *p,
 
 	rcu_read_lock();
 	for_each_netdev_rcu(ctx->net, dev) {
-		struct neigh_parms *dst_p =
-				neigh_get_dev_parms_rcu(dev, family);
+		struct neigh_parms *dst_p;
 
+		if (!vrf_eq(dev_vrf(dev), ctx->vrf))
+			continue;
+		dst_p = neigh_get_dev_parms_rcu(dev, family);
 		if (dst_p && !test_bit(index, dst_p->data_state))
 			dst_p->data[index] = p->data[index];
 	}
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index b24773b275a9..ed1453b9eeab 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1021,6 +1021,8 @@ static int arp_req_set_public(struct net_ctx *ctx, struct arpreq *r,
 	if (!dev && (r->arp_flags & ATF_COM)) {
 		dev = dev_getbyhwaddr_rcu(net, r->arp_ha.sa_family,
 				      r->arp_ha.sa_data);
+		if (dev && !vrf_eq(dev_vrf(dev), ctx->vrf))
+			dev = NULL;
 		if (!dev)
 			return -ENODEV;
 	}
@@ -1214,7 +1216,7 @@ int arp_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 	rtnl_lock();
 	if (r.arp_dev[0]) {
 		err = -ENODEV;
-		dev = __dev_get_by_name(net, r.arp_dev);
+		dev = __dev_get_by_name_ctx(ctx, r.arp_dev);
 		if (dev == NULL)
 			goto out;
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7c0c3bc17599..54afa816ff66 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -951,7 +951,7 @@ int devinet_ioctl(struct net_ctx *net_ctx, unsigned int cmd, void __user *arg)
 	rtnl_lock();
 
 	ret = -ENODEV;
-	dev = __dev_get_by_name(net, ifr.ifr_name);
+	dev = __dev_get_by_name_ctx(net_ctx, ifr.ifr_name);
 	if (!dev)
 		goto done;
 
@@ -1166,6 +1166,7 @@ __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
 	__be32 addr = 0;
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	__u32 vrf = dev_vrf(dev);
 
 	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
@@ -1192,6 +1193,8 @@ __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
 	   in dev_base list.
 	 */
 	for_each_netdev_rcu(net, dev) {
+		if (!vrf_eq(dev_vrf(dev), vrf))
+			continue;
 		in_dev = __in_dev_get_rcu(dev);
 		if (!in_dev)
 			continue;
@@ -1266,6 +1269,8 @@ __be32 inet_confirm_addr(struct net_ctx *ctx, struct in_device *in_dev,
 
 	rcu_read_lock();
 	for_each_netdev_rcu(ctx->net, dev) {
+		if (!vrf_eq(dev_vrf(dev), ctx->vrf))
+			continue;
 		in_dev = __in_dev_get_rcu(dev);
 		if (in_dev) {
 			addr = confirm_addr_indev(in_dev, dst, local, scope);
@@ -1561,6 +1566,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 				goto cont;
 			if (h > s_h || idx > s_idx)
 				s_ip_idx = 0;
+			if (!vrf_eq(dev_vrf(dev), vrf))
+				goto cont;
 			in_dev = __in_dev_get_rcu(dev);
 			if (!in_dev)
 				goto cont;
@@ -1883,6 +1890,8 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
+			if (!vrf_eq(dev_vrf(dev), vrf))
+				goto cont;
 			in_dev = __in_dev_get_rcu(dev);
 			if (!in_dev)
 				goto cont;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2f06b71bed53..8713618e2835 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -418,7 +418,7 @@ static int rtentry_to_fib_config(struct net_ctx *ctx, int cmd,
 		colon = strchr(devname, ':');
 		if (colon)
 			*colon = 0;
-		dev = __dev_get_by_name(net, devname);
+		dev = __dev_get_by_name_ctx(ctx, devname);
 		if (!dev)
 			return -ENODEV;
 		cfg->fc_oif = dev->ifindex;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 65d01c5b747e..0aa5990b1c02 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -929,6 +929,11 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		err = -ENODEV;
 		if (nh->nh_dev == NULL)
 			goto failure;
+		if (!vrf_eq(dev_vrf(nh->nh_dev), net_ctx->vrf)) {
+			dev_put(nh->nh_dev);
+			nh->nh_dev = NULL;
+			goto failure;
+		}
 	} else {
 		change_nexthops(fi) {
 			err = fib_check_nh(cfg, fi, nexthop_nh);
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 86aa303a1cf7..fddc3bbf6b8b 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2451,6 +2451,9 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 	for_each_netdev_rcu(net, state->dev) {
 		struct in_device *in_dev;
 
+		if (!vrf_eq(dev_vrf(state->dev), ctx->vrf))
+			continue;
+
 		in_dev = __in_dev_get_rcu(state->dev);
 		if (!in_dev)
 			continue;
@@ -2596,6 +2599,10 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 	state->im = NULL;
 	for_each_netdev_rcu(net, state->dev) {
 		struct in_device *idev;
+
+		if (!vrf_eq(dev_vrf(state->dev), ctx->vrf))
+			continue;
+
 		idev = __in_dev_get_rcu(state->dev);
 		if (unlikely(idev == NULL))
 			continue;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 71c31c81aea1..0dcde9839d66 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -220,6 +220,8 @@ struct sock *__inet_lookup_listener(struct net_ctx *ctx,
 	result = NULL;
 	hiscore = 0;
 	sk_nulls_for_each_rcu(sk, node, &ilb->head) {
+		if (!vrf_eq(sk->sk_vrf, ctx->vrf) && !vrf_is_any(sk->sk_vrf))
+			continue;
 		score = compute_score(sk, ctx, hnum, daddr, dif);
 		if (score > hiscore) {
 			result = sk;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index eeb51e935379..b5521f7b36b1 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -728,6 +728,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	{
 		struct net_device *dev = NULL;
 		int ifindex;
+		__u32 vrf;
 
 		if (optlen != sizeof(int))
 			goto e_inval;
@@ -743,7 +744,10 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
+		vrf = dev_vrf(dev);
 		dev_put(dev);
+		if (!vrf_eq(vrf, sk_ctx.vrf))
+			break;
 
 		err = -EINVAL;
 		if (sk->sk_bound_dev_if)
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 24/29] net: vrf: Add support to get/set vrf context on a device
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (22 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 23/29] net: vrf: Enable vrf checks David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 25/29] net: vrf: Handle VRF any context David Ahern
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/linux/netdevice.h    |  1 +
 include/uapi/linux/if_link.h |  1 +
 net/core/dev.c               | 28 ++++++++++++++++++++++++++++
 net/core/rtnetlink.c         | 12 ++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f4a707263446..7d983f005622 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2922,6 +2922,7 @@ int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int dev_set_mtu(struct net_device *, int);
+int dev_set_vrf(struct net_device *, __u32);
 void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0deee3eeddbf..0afdb50ee75c 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -147,6 +147,7 @@ enum {
 	IFLA_CARRIER_CHANGES,
 	IFLA_PHYS_SWITCH_ID,
 	IFLA_LINK_NETNSID,
+	IFLA_VRF,
 	__IFLA_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index adf575d6d267..d96d0d46dc6e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5882,6 +5882,34 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 }
 EXPORT_SYMBOL(dev_set_mtu);
 
+/**
+ *	dev_set_vrf - Change VRF
+ *	@dev: device
+ *	@new_vrf: new VRF
+ *
+ *	Change the VRF association for the network device.
+ */
+int dev_set_vrf(struct net_device *dev, __u32 new_vrf)
+{
+	if (!netif_device_present(dev))
+		return -ENODEV;
+
+	/* device needs to be taken down to drop routes */
+	if (dev->flags & IFF_UP)
+		return -EINVAL;
+
+	if (!vrf_is_valid(new_vrf))
+		return -EINVAL;
+
+	if (new_vrf == dev->nd_vrf)
+		return 0;
+
+	dev->nd_vrf = new_vrf;
+
+	return 0;
+}
+EXPORT_SYMBOL(dev_set_vrf);
+
 /**
  *	dev_set_group - Change group this device belongs to
  *	@dev: device
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 673cb4c6f391..bf41e63f87ae 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -866,6 +866,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_TXQLEN */
 	       + nla_total_size(4) /* IFLA_WEIGHT */
 	       + nla_total_size(4) /* IFLA_MTU */
+	       + nla_total_size(4) /* IFLA_VRF */
 	       + nla_total_size(4) /* IFLA_LINK */
 	       + nla_total_size(4) /* IFLA_MASTER */
 	       + nla_total_size(1) /* IFLA_CARRIER */
@@ -1031,6 +1032,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 		       netif_running(dev) ? dev->operstate : IF_OPER_DOWN) ||
 	    nla_put_u8(skb, IFLA_LINKMODE, dev->link_mode) ||
 	    nla_put_u32(skb, IFLA_MTU, dev->mtu) ||
+	    nla_put_u32(skb, IFLA_VRF, dev->nd_vrf) ||
 	    nla_put_u32(skb, IFLA_GROUP, dev->group) ||
 	    nla_put_u32(skb, IFLA_PROMISCUITY, dev->promiscuity) ||
 	    nla_put_u32(skb, IFLA_NUM_TX_QUEUES, dev->num_tx_queues) ||
@@ -1249,6 +1251,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
+	[IFLA_VRF]		= { .type = NLA_U32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1616,6 +1619,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_MODIFIED;
 	}
 
+	if (tb[IFLA_VRF]) {
+		err = dev_set_vrf(dev, nla_get_u32(tb[IFLA_VRF]));
+		if (err < 0)
+			goto errout;
+		status |= DO_SETLINK_MODIFIED;
+	}
+
 	if (tb[IFLA_GROUP]) {
 		dev_set_group(dev, nla_get_u32(tb[IFLA_GROUP]));
 		status |= DO_SETLINK_NOTIFY;
@@ -1911,6 +1921,8 @@ struct net_device *rtnl_create_link(struct net *net,
 
 	if (tb[IFLA_MTU])
 		dev->mtu = nla_get_u32(tb[IFLA_MTU]);
+	if (tb[IFLA_VRF])
+		dev->nd_vrf = nla_get_u32(tb[IFLA_VRF]);
 	if (tb[IFLA_ADDRESS]) {
 		memcpy(dev->dev_addr, nla_data(tb[IFLA_ADDRESS]),
 				nla_len(tb[IFLA_ADDRESS]));
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 25/29] net: vrf: Handle VRF any context
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (23 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 24/29] net: vrf: Add support to get/set vrf context on a device David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05 13:46   ` Nicolas Dichtel
  2015-02-05  1:34 ` [RFC PATCH 26/29] net: vrf: Change single_open_net to pass net_ctx David Ahern
                   ` (11 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

The "VRF any" context applies only to tasks and sockets. Devices are
associated with a single VRF, and skbs by extension are connected to
a single VRF.

Listen sockets and unconnected sockets can be opened in a "VRF any"
context, allowing a single daemon to provide service across all VRFs
in a namespace. Connected sockets must be in a specific VRF context.
Accepted sockets acquire the VRF context from the device on which the
packet arrived (via the skb).

"VRF any" context is also useful for tasks wanting to view L3/L4
data for all VRFs.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
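The matching rules described above can be sketched in standalone C. This
is only an illustration, not code from the series: the sentinel value
SKETCH_VRF_ANY and every sketch_* name are hypothetical, and the
symmetric behaviour of the "or any" comparison is an assumption (the
definition of vrf_eq_or_any() is not part of this patch), chosen because
it is consistent with how the helper is used for both sockets and
lookup contexts below.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel; the real VRF_ANY value is defined elsewhere. */
#define SKETCH_VRF_ANY 0xffffu

static bool sketch_vrf_is_any(uint32_t vrf)
{
	return vrf == SKETCH_VRF_ANY;
}

/* Assumed semantics: equal ids match, and "any" on either side matches. */
static bool sketch_vrf_eq_or_any(uint32_t a, uint32_t b)
{
	return a == b || sketch_vrf_is_any(a) || sketch_vrf_is_any(b);
}

/* Connected sockets need a specific VRF, so "any" is rejected. */
static int sketch_connect_check(uint32_t sk_vrf)
{
	return sketch_vrf_is_any(sk_vrf) ? -EINVAL : 0;
}

/*
 * A listener accepts a packet if its VRF matches the ingress device's
 * VRF or the listener is in the "any" context. The accepted (child)
 * socket takes the device's VRF, never "any".
 */
static bool sketch_listener_accepts(uint32_t listener_vrf, uint32_t dev_vrf,
				    uint32_t *child_vrf)
{
	if (listener_vrf != dev_vrf && !sketch_vrf_is_any(listener_vrf))
		return false;

	*child_vrf = dev_vrf;
	return true;
}
```

Under these assumptions a daemon listening in the "any" context accepts
a connection arriving on a device in VRF 9 and the resulting socket is
bound to VRF 9, while a listener in VRF 3 ignores that packet.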
 include/linux/netdevice.h     | 15 +++++++++++++++
 include/net/inet_hashtables.h |  4 +++-
 include/net/neighbour.h       | 29 +++++++++++++++++++++++++++++
 include/net/sock.h            |  2 +-
 net/core/dev.c                |  2 +-
 net/core/fib_rules.c          |  4 ++++
 net/core/neighbour.c          | 18 +++++++++---------
 net/ipv4/af_inet.c            |  4 ++++
 net/ipv4/arp.c                |  6 ++++++
 net/ipv4/datagram.c           |  3 +++
 net/ipv4/devinet.c            |  7 +++++--
 net/ipv4/fib_frontend.c       |  4 ++++
 net/ipv4/igmp.c               |  4 ++--
 net/ipv4/raw.c                |  9 +++++++++
 net/ipv4/udp.c                |  4 ++++
 15 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7d983f005622..a1de460b1b7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1864,6 +1864,21 @@ int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
 	return 0;
 }
 
+/*
+ * Same as dev_net_ctx_eq() except that a ctx with the 'any' vrf also
+ * counts as a match (devices are never assigned to the 'any' vrf).
+ */
+static inline
+int dev_net_ctx_eq_any(const struct net_device *dev, struct net_ctx *ctx)
+{
+	if (net_eq(dev_net(dev), ctx->net) &&
+	   (vrf_eq(dev->nd_vrf, ctx->vrf) || vrf_is_any(ctx->vrf))) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline bool netdev_uses_dsa(struct net_device *dev)
 {
 #if IS_ENABLED(CONFIG_NET_DSA)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 199809e46133..e4ba898af422 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -101,8 +101,10 @@ void ib_net_ctx_set(struct inet_bind_bucket *ib, struct net_ctx *ctx)
 static inline
 int ib_net_ctx_eq(struct inet_bind_bucket *ib, struct net_ctx *ctx)
 {
+	__u32 vrf = ib->ib_net_ctx.vrf;
+
 	if (net_eq(ib_net(ib), ctx->net) &&
-	    vrf_eq(ib->ib_net_ctx.vrf, ctx->vrf))
+	    (vrf_eq_or_any(vrf, ctx->vrf)))
 		return 1;
 
 	return 0;
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f3527b25d612..122a3acda83e 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -303,6 +303,21 @@ int neigh_parms_net_ctx_eq(const struct neigh_parms *parms,
 	return 1;
 #endif
 }
+static inline int neigh_parms_net_ctx_eq_any(const struct neigh_parms *parms,
+					     const struct net_ctx *net_ctx)
+{
+#ifdef CONFIG_NET_NS
+	if (net_eq(neigh_parms_net(parms), net_ctx->net) &&
+	    (vrf_eq(neigh_parms_vrf(parms), net_ctx->vrf) ||
+	     vrf_is_any(net_ctx->vrf))) {
+		return 1;
+	}
+
+	return 0;
+#else
+	return 1;
+#endif
+}
 unsigned long neigh_rand_reach_time(unsigned long base);
 
 void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
@@ -340,6 +355,20 @@ int pneigh_net_ctx_eq(const struct pneigh_entry *pneigh,
 	return 1;
 #endif
 }
+static inline
+int pneigh_net_ctx_eq_any(const struct pneigh_entry *pneigh,
+		      const struct net_ctx *net_ctx)
+{
+#ifdef CONFIG_NET_NS
+	if (net_eq(pneigh_net(pneigh), net_ctx->net) &&
+	    vrf_eq_or_any(pneigh->net_ctx.vrf, net_ctx->vrf))
+		return 1;
+
+	return 0;
+#else
+	return 1;
+#endif
+}
 
 void neigh_app_ns(struct neighbour *n);
 void neigh_for_each(struct neigh_table *tbl,
diff --git a/include/net/sock.h b/include/net/sock.h
index a9b45fca4605..6a880d04361e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2205,7 +2205,7 @@ void sock_net_set(struct sock *sk, struct net *net)
 static inline
 int sock_net_ctx_eq(struct sock *sk, struct net_ctx *ctx)
 {
-	return net_eq(sock_net(sk), ctx->net) && vrf_eq(sk->sk_vrf, ctx->vrf);
+	return net_eq(sock_net(sk), ctx->net) && vrf_eq_or_any(sk->sk_vrf, ctx->vrf);
 }
 
 /*
diff --git a/net/core/dev.c b/net/core/dev.c
index d96d0d46dc6e..0dae3cfd2890 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -688,7 +688,7 @@ struct net_device *__dev_get_by_name_ctx(struct net_ctx *ctx, const char *name)
 {
 	struct net_device *dev = __dev_get_by_name(ctx->net, name);
 
-	if (dev && !vrf_eq(dev_vrf(dev), ctx->vrf))
+	if (dev && !vrf_eq_or_any(dev_vrf(dev), ctx->vrf))
 		dev = NULL;
 
 	return dev;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index aea74e16360c..637a6738165e 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -301,6 +301,10 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	if (err < 0)
 		goto errout;
 
+	/* cannot create new rule for any vrf context */
+	err = -EINVAL;
+	if (vrf_is_any(sk_ctx.vrf))
+		goto errout;
 	rule = kzalloc(ops->rule_size, GFP_KERNEL);
 	if (rule == NULL) {
 		err = -ENOMEM;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 46b7e8cc7c70..d15f84de860d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -442,7 +442,7 @@ struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl,
 	     n != NULL;
 	     n = rcu_dereference_bh(n->next)) {
 		if (!memcmp(n->primary_key, pkey, key_len) &&
-		    dev_net_ctx_eq(n->dev, ctx)) {
+		    dev_net_ctx_eq_any(n->dev, ctx)) {
 			if (!atomic_inc_not_zero(&n->refcnt))
 				n = NULL;
 			NEIGH_CACHE_STAT_INC(tbl, hits);
@@ -2138,7 +2138,7 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
 		nidx = 0;
 		p = list_next_entry(&tbl->parms, list);
 		list_for_each_entry_from(p, &tbl->parms_list, list) {
-			if (!neigh_parms_net_ctx_eq(p, &ctx))
+			if (!neigh_parms_net_ctx_eq_any(p, &ctx))
 				continue;
 
 			if (nidx < neigh_skip)
@@ -2271,7 +2271,7 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
 		     n != NULL;
 		     n = rcu_dereference_bh(n->next)) {
-			if (!dev_net_ctx_eq(n->dev, &ctx))
+			if (!dev_net_ctx_eq_any(n->dev, &ctx))
 				continue;
 			if (idx < s_idx)
 				goto next;
@@ -2308,7 +2308,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 		if (h > s_h)
 			s_idx = 0;
 		for (n = tbl->phash_buckets[h], idx = 0; n; n = n->next) {
-			if (!dev_net_ctx_eq(n->dev, &ctx))
+			if (!dev_net_ctx_eq_any(n->dev, &ctx))
 				continue;
 			if (idx < s_idx)
 				goto next;
@@ -2446,7 +2446,7 @@ static struct neighbour *neigh_get_first(struct seq_file *seq)
 		n = rcu_dereference_bh(nht->hash_buckets[bucket]);
 
 		while (n) {
-			if (!dev_net_ctx_eq(n->dev, ctx))
+			if (!dev_net_ctx_eq_any(n->dev, ctx))
 				goto next;
 			if (state->neigh_sub_iter) {
 				loff_t fakep = 0;
@@ -2489,7 +2489,7 @@ static struct neighbour *neigh_get_next(struct seq_file *seq,
 
 	while (1) {
 		while (n) {
-			if (!dev_net_ctx_eq(n->dev, ctx))
+			if (!dev_net_ctx_eq_any(n->dev, ctx))
 				goto next;
 			if (state->neigh_sub_iter) {
 				void *v = state->neigh_sub_iter(state, n, pos);
@@ -2546,7 +2546,7 @@ static struct pneigh_entry *pneigh_get_first(struct seq_file *seq)
 	state->flags |= NEIGH_SEQ_IS_PNEIGH;
 	for (bucket = 0; bucket <= PNEIGH_HASHMASK; bucket++) {
 		pn = tbl->phash_buckets[bucket];
-		while (pn && !pneigh_net_ctx_eq(pn, ctx))
+		while (pn && !pneigh_net_ctx_eq_any(pn, ctx))
 			pn = pn->next;
 		if (pn)
 			break;
@@ -2566,13 +2566,13 @@ static struct pneigh_entry *pneigh_get_next(struct seq_file *seq,
 
 	do {
 		pn = pn->next;
-	} while (pn && !pneigh_net_ctx_eq(pn, ctx));
+	} while (pn && !pneigh_net_ctx_eq_any(pn, ctx));
 
 	while (!pn) {
 		if (++state->bucket > PNEIGH_HASHMASK)
 			break;
 		pn = tbl->phash_buckets[state->bucket];
-		while (pn && !pneigh_net_ctx_eq(pn, ctx))
+		while (pn && !pneigh_net_ctx_eq_any(pn, ctx))
 			pn = pn->next;
 		if (pn)
 			break;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2627fff2b2d0..a2b9a8ad0f76 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -565,6 +565,10 @@ int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	int err;
 	long timeo;
 
+	/* sockets must be set into a vrf context to connect */
+	if (vrf_is_any(sk->sk_vrf))
+		return -EINVAL;
+
 	if (addr_len < sizeof(uaddr->sa_family))
 		return -EINVAL;
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index ed1453b9eeab..4f52a5bce975 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1195,6 +1195,9 @@ int arp_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 	case SIOCSARP:
 		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
+		/* must set vrf context to modify arp cache */
+		if (vrf_is_any(ctx->vrf))
+			return -EINVAL;
 	case SIOCGARP:
 		err = copy_from_user(&r, arg, sizeof(struct arpreq));
 		if (err)
@@ -1215,6 +1218,9 @@ int arp_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 							   htonl(0xFFFFFFFFUL);
 	rtnl_lock();
 	if (r.arp_dev[0]) {
+		err = -EINVAL;
+		if (vrf_is_any(ctx->vrf))
+			goto out;
 		err = -ENODEV;
 		dev = __dev_get_by_name_ctx(ctx, r.arp_dev);
 		if (dev == NULL)
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 7f93d6b92d0b..40b3602bfc78 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -30,6 +30,9 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	int oif;
 	int err;
 
+	/* connected sockets must have a specific vrf context */
+	if (vrf_is_any(sk->sk_vrf))
+		return -EINVAL;
 
 	if (addr_len < sizeof(*usin))
 		return -EINVAL;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 54afa816ff66..d9e7140df915 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -942,6 +942,9 @@ int devinet_ioctl(struct net_ctx *net_ctx, unsigned int cmd, void __user *arg)
 		ret = -EINVAL;
 		if (sin->sin_family != AF_INET)
 			goto out;
+		/* cannot use vrf any for set */
+		if (vrf_is_any(net_ctx->vrf))
+			goto out;
 		break;
 	default:
 		ret = -EINVAL;
@@ -1566,7 +1569,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 				goto cont;
 			if (h > s_h || idx > s_idx)
 				s_ip_idx = 0;
-			if (!vrf_eq(dev_vrf(dev), vrf))
+			if (!vrf_eq_or_any(dev_vrf(dev), vrf))
 				goto cont;
 			in_dev = __in_dev_get_rcu(dev);
 			if (!in_dev)
@@ -1890,7 +1893,7 @@ static int inet_netconf_dump_devconf(struct sk_buff *skb,
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
-			if (!vrf_eq(dev_vrf(dev), vrf))
+			if (!vrf_eq_or_any(dev_vrf(dev), vrf))
 				goto cont;
 			in_dev = __in_dev_get_rcu(dev);
 			if (!in_dev)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 8713618e2835..b024afcbf0b9 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -495,6 +495,10 @@ int ip_rt_ioctl(struct net_ctx *ctx, unsigned int cmd, void __user *arg)
 		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
+		/* route table can only be manipulated in a vrf context */
+		if (vrf_is_any(ctx->vrf))
+			return -EINVAL;
+
 		if (copy_from_user(&rt, arg, sizeof(rt)))
 			return -EFAULT;
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index fddc3bbf6b8b..ba66840688c2 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2451,7 +2451,7 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 	for_each_netdev_rcu(net, state->dev) {
 		struct in_device *in_dev;
 
-		if (!vrf_eq(dev_vrf(state->dev), ctx->vrf))
+		if (!vrf_eq_or_any(dev_vrf(state->dev), ctx->vrf))
 			continue;
 
 		in_dev = __in_dev_get_rcu(state->dev);
@@ -2600,7 +2600,7 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 	for_each_netdev_rcu(net, state->dev) {
 		struct in_device *idev;
 
-		if (!vrf_eq(dev_vrf(state->dev), ctx->vrf))
+		if (!vrf_eq_or_any(dev_vrf(state->dev), ctx->vrf))
 			continue;
 
 		idev = __in_dev_get_rcu(state->dev);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f3a349ea3dd8..6d4be3fd2d01 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -591,6 +591,11 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	} else if (!ipc.oif)
 		ipc.oif = inet->uc_index;
 
+	/* out vrf cannot be set to VRF_ANY */
+	err = -EINVAL;
+	if (vrf_is_any(sk_ctx.vrf))
+		goto done;
+
 	flowi4_init_output(&fl4, sk_ctx.vrf, ipc.oif, sk->sk_mark, tos,
 			   RT_SCOPE_UNIVERSE,
 			   inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol,
@@ -690,6 +695,10 @@ static int raw_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	int chk_addr_ret;
 	struct net_ctx sk_ctx = SOCK_NET_CTX(sk);
 
+	/* any vrf socket cannot bind to an address or device */
+	if (vrf_is_any(sk->sk_vrf))
+		goto out;
+
 	if (sk->sk_state != TCP_CLOSE || addr_len < sizeof(struct sockaddr_in))
 		goto out;
 	chk_addr_ret = inet_addr_type(&sk_ctx, addr->sin_addr.s_addr);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1446c84428d8..2d7e2748a138 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -904,6 +904,10 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
 
+	/* out vrf cannot be set to VRF_ANY */
+	if (vrf_is_any(sk_ctx.vrf))
+		return -EINVAL;
+
 	/*
 	 *	Check the flags.
 	 */
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 26/29] net: vrf: Change single_open_net to pass net_ctx
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (24 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 25/29] net: vrf: Handle VRF any context David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 27/29] net: vrf: Add vrf checks and context to ipv4 proc files David Ahern
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 fs/proc/proc_net.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 4996f5e91a90..3745661b5370 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -56,6 +56,8 @@ int seq_open_net(struct inode *ino, struct file *f,
 #ifdef CONFIG_NET_NS
 	p->net_ctx.net = net;
 #endif
+	p->net_ctx.vrf = current->vrf;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(seq_open_net);
@@ -65,19 +67,32 @@ int single_open_net(struct inode *inode, struct file *file,
 {
 	int err;
 	struct net *net;
+	struct seq_net_private *p;
 
 	err = -ENXIO;
 	net = get_proc_net(inode);
 	if (net == NULL)
 		goto err_net;
 
-	err = single_open(file, show, net);
+	err = -ENOMEM;
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	if (p == NULL)
+		goto err_malloc;
+
+#ifdef CONFIG_NET_NS
+	p->net_ctx.net = net;
+#endif
+	p->net_ctx.vrf = current->vrf;
+
+	err = single_open(file, show, p);
 	if (err < 0)
 		goto err_open;
 
 	return 0;
 
 err_open:
+	kfree(p);
+err_malloc:
 	put_net(net);
 err_net:
 	return err;
@@ -99,7 +114,8 @@ EXPORT_SYMBOL_GPL(seq_release_net);
 int single_release_net(struct inode *ino, struct file *f)
 {
 	struct seq_file *seq = f->private_data;
-	put_net(seq->private);
+	put_net(seq_file_net(seq));
+	kfree(seq->private);
 	return single_release(ino, f);
 }
 EXPORT_SYMBOL_GPL(single_release_net);
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 27/29] net: vrf: Add vrf checks and context to ipv4 proc files
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (25 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 26/29] net: vrf: Change single_open_net to pass net_ctx David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 28/29] iproute2: vrf: Add vrf subcommand David Ahern
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv4/fib_trie.c | 24 ++++++++++++++++++------
 net/ipv4/proc.c     | 10 +++++-----
 net/ipv4/raw.c      |  7 ++++---
 net/ipv4/route.c    |  2 +-
 net/ipv4/tcp_ipv4.c | 15 ++++++++-------
 net/ipv4/udp.c      |  6 +++---
 6 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3daf0224ff2e..a3ff1100dc2a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1439,6 +1439,8 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			}
 			if (fi->fib_flags & RTNH_F_DEAD)
 				continue;
+			if (!vrf_eq(fi->fib_net_ctx.vrf, flp->flowi4_vrf))
+				continue;
 			for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
 				const struct fib_nh *nh = &fi->fib_nh[nhsel];
 
@@ -1738,6 +1740,7 @@ static int fn_trie_dump_fa(t_key key, int plen, struct list_head *fah,
 	int i, s_i;
 	struct fib_alias *fa;
 	__be32 xkey = htonl(key);
+	__u32 vrf = skb->sk->sk_vrf;
 
 	s_i = cb->args[5];
 	i = 0;
@@ -1750,6 +1753,10 @@ static int fn_trie_dump_fa(t_key key, int plen, struct list_head *fah,
 			continue;
 		}
 
+		if (!vrf_eq(fa->fa_info->fib_net_ctx.vrf, vrf) &&
+		    !vrf_is_any(vrf))
+			continue;
+
 		if (fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
 				  cb->nlh->nlmsg_seq,
 				  RTM_NEWROUTE,
@@ -2078,7 +2085,7 @@ static void fib_table_print(struct seq_file *seq, struct fib_table *tb)
 
 static int fib_triestat_seq_show(struct seq_file *seq, void *v)
 {
-	struct net *net = (struct net *)seq->private;
+	struct net *net = seq_file_net(seq);
 	unsigned int h;
 
 	seq_printf(seq,
@@ -2414,11 +2421,12 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 {
 	struct tnode *l = v;
 	struct leaf_info *li;
+	struct net_ctx *ctx = seq_file_net_ctx(seq);
 
 	if (v == SEQ_START_TOKEN) {
 		seq_printf(seq, "%-127s\n", "Iface\tDestination\tGateway "
 			   "\tFlags\tRefCnt\tUse\tMetric\tMask\t\tMTU"
-			   "\tWindow\tIRTT");
+			   "\tWindow\tIRTT\tvrf");
 		return 0;
 	}
 
@@ -2439,10 +2447,13 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 
 			seq_setwidth(seq, 127);
 
+			if (fi && !vrf_eq_or_any(fi->fib_vrf, ctx->vrf))
+				continue;
+
 			if (fi)
 				seq_printf(seq,
 					 "%s\t%08X\t%08X\t%04X\t%d\t%u\t"
-					 "%d\t%08X\t%d\t%u\t%u",
+					 "%d\t%08X\t%d\t%u\t%u\t%u",
 					 fi->fib_dev ? fi->fib_dev->name : "*",
 					 prefix,
 					 fi->fib_nh->nh_gw, flags, 0, 0,
@@ -2451,13 +2462,14 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 					 (fi->fib_advmss ?
 					  fi->fib_advmss + 40 : 0),
 					 fi->fib_window,
-					 fi->fib_rtt >> 3);
+					 fi->fib_rtt >> 3,
+					 fi->fib_vrf);
 			else
 				seq_printf(seq,
 					 "*\t%08X\t%08X\t%04X\t%d\t%u\t"
-					 "%d\t%08X\t%d\t%u\t%u",
+					 "%d\t%08X\t%d\t%u\t%u\t%u",
 					 prefix, 0, flags, 0, 0, 0,
-					 mask, 0, 0, 0);
+					 mask, 0, 0, 0, 0);
 
 			seq_pad(seq, '\n');
 		}
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 8f9cd200ce20..721dd600d722 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -51,7 +51,7 @@
  */
 static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
-	struct net *net = seq->private;
+	struct net *net = seq_file_net(seq);
 	unsigned int frag_mem;
 	int orphans, sockets;
 
@@ -319,7 +319,7 @@ static void icmpmsg_put(struct seq_file *seq)
 	int i, count;
 	unsigned short type[PERLINE];
 	unsigned long vals[PERLINE], val;
-	struct net *net = seq->private;
+	struct net *net = seq_file_net(seq);
 
 	count = 0;
 	for (i = 0; i < ICMPMSG_MIB_MAX; i++) {
@@ -341,7 +341,7 @@ static void icmpmsg_put(struct seq_file *seq)
 static void icmp_put(struct seq_file *seq)
 {
 	int i;
-	struct net *net = seq->private;
+	struct net *net = seq_file_net(seq);
 	atomic_long_t *ptr = net->mib.icmpmsg_statistics->mibs;
 
 	seq_puts(seq, "\nIcmp: InMsgs InErrors InCsumErrors");
@@ -371,7 +371,7 @@ static void icmp_put(struct seq_file *seq)
 static int snmp_seq_show(struct seq_file *seq, void *v)
 {
 	int i;
-	struct net *net = seq->private;
+	struct net *net = seq_file_net(seq);
 
 	seq_puts(seq, "Ip: Forwarding DefaultTTL");
 
@@ -455,7 +455,7 @@ static const struct file_operations snmp_seq_fops = {
 static int netstat_seq_show(struct seq_file *seq, void *v)
 {
 	int i;
-	struct net *net = seq->private;
+	struct net *net = seq_file_net(seq);
 
 	seq_puts(seq, "TcpExt:");
 	for (i = 0; snmp4_net_list[i].name != NULL; i++)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 6d4be3fd2d01..11e8313b5ea2 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -1027,14 +1027,15 @@ static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 	      srcp  = inet->inet_num;
 
 	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d\n",
+		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d %d\n",
 		i, src, srcp, dest, destp, sp->sk_state,
 		sk_wmem_alloc_get(sp),
 		sk_rmem_alloc_get(sp),
 		0, 0L, 0,
 		from_kuid_munged(seq_user_ns(seq), sock_i_uid(sp)),
 		0, sock_i_ino(sp),
-		atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
+		atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops),
+		sp->sk_vrf);
 }
 
 static int raw_seq_show(struct seq_file *seq, void *v)
@@ -1042,7 +1043,7 @@ static int raw_seq_show(struct seq_file *seq, void *v)
 	if (v == SEQ_START_TOKEN)
 		seq_printf(seq, "  sl  local_address rem_address   st tx_queue "
 				"rx_queue tr tm->when retrnsmt   uid  timeout "
-				"inode ref pointer drops\n");
+				"inode ref pointer drops vrf\n");
 	else
 		raw_sock_seq_show(seq, v, raw_seq_private(seq)->bucket);
 	return 0;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d6c5f0a8ab17..59af5016bf26 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -215,7 +215,7 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "%-127s\n",
 			   "Iface\tDestination\tGateway \tFlags\t\tRefCnt\tUse\t"
 			   "Metric\tSource\t\tMTU\tWindow\tIRTT\tTOS\tHHRef\t"
-			   "HHUptod\tSpecDst");
+			   "HHUptod\tSpecDst\tvrf");
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 24089b9534bf..249ce80d12d6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2215,7 +2215,7 @@ static void get_openreq4(const struct sock *sk, const struct request_sock *req,
 	long delta = req->expires - jiffies;
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %u %d %pK",
+		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %u %d %pK %d",
 		i,
 		ireq->ir_loc_addr,
 		ntohs(inet_sk(sk)->inet_sport),
@@ -2230,7 +2230,7 @@ static void get_openreq4(const struct sock *sk, const struct request_sock *req,
 		0,  /* non standard timer */
 		0, /* open_requests have no inode */
 		atomic_read(&sk->sk_refcnt),
-		req);
+		req, sk->sk_vrf);
 }
 
 static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i)
@@ -2272,7 +2272,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i)
 		rx_queue = max_t(int, tp->rcv_nxt - tp->copied_seq, 0);
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
-			"%08X %5u %8d %lu %d %pK %lu %lu %u %u %d",
+			"%08X %5u %8d %lu %d %pK %lu %lu %u %u %d %2d",
 		i, src, srcp, dest, destp, sk->sk_state,
 		tp->write_seq - tp->snd_una,
 		rx_queue,
@@ -2289,7 +2289,8 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i)
 		tp->snd_cwnd,
 		sk->sk_state == TCP_LISTEN ?
 		    (fastopenq ? fastopenq->max_qlen : 0) :
-		    (tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh));
+		    (tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh),
+		sk->sk_vrf);
 }
 
 static void get_timewait4_sock(const struct inet_timewait_sock *tw,
@@ -2305,10 +2306,10 @@ static void get_timewait4_sock(const struct inet_timewait_sock *tw,
 	srcp  = ntohs(tw->tw_sport);
 
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK",
+		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK %2d",
 		i, src, srcp, dest, destp, tw->tw_substate, 0, 0,
 		3, jiffies_delta_to_clock_t(delta), 0, 0, 0, 0,
-		atomic_read(&tw->tw_refcnt), tw);
+		atomic_read(&tw->tw_refcnt), tw, tw->tw_vrf);
 }
 
 #define TMPSZ 150
@@ -2322,7 +2323,7 @@ static int tcp4_seq_show(struct seq_file *seq, void *v)
 	if (v == SEQ_START_TOKEN) {
 		seq_puts(seq, "  sl  local_address rem_address   st tx_queue "
 			   "rx_queue tr tm->when retrnsmt   uid  timeout "
-			   "inode");
+			   "inode  vrf");
 		goto out;
 	}
 	st = seq->private;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2d7e2748a138..345d5a5b4489 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2422,7 +2422,7 @@ static void udp4_format_sock(struct sock *sp, struct seq_file *f,
 	__u16 srcp	  = ntohs(inet->inet_sport);
 
 	seq_printf(f, "%5d: %08X:%04X %08X:%04X"
-		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d",
+		" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d %d",
 		bucket, src, srcp, dest, destp, sp->sk_state,
 		sk_wmem_alloc_get(sp),
 		sk_rmem_alloc_get(sp),
@@ -2430,7 +2430,7 @@ static void udp4_format_sock(struct sock *sp, struct seq_file *f,
 		from_kuid_munged(seq_user_ns(f), sock_i_uid(sp)),
 		0, sock_i_ino(sp),
 		atomic_read(&sp->sk_refcnt), sp,
-		atomic_read(&sp->sk_drops));
+		atomic_read(&sp->sk_drops), sp->sk_vrf);
 }
 
 int udp4_seq_show(struct seq_file *seq, void *v)
@@ -2439,7 +2439,7 @@ int udp4_seq_show(struct seq_file *seq, void *v)
 	if (v == SEQ_START_TOKEN)
 		seq_puts(seq, "  sl  local_address rem_address   st tx_queue "
 			   "rx_queue tr tm->when retrnsmt   uid  timeout "
-			   "inode ref pointer drops");
+			   "inode ref pointer drops vrf");
 	else {
 		struct udp_iter_state *state = seq->private;
 
-- 
1.9.3 (Apple Git-50)


* [RFC PATCH 28/29] iproute2: vrf: Add vrf subcommand
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (26 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 27/29] net: vrf: Add vrf checks and context to ipv4 proc files David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  1:34 ` [RFC PATCH 29/29] iproute2: Add vrf option to ip link command David Ahern
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Add a vrf subcommand with an exec option to run a process in a specific
VRF context, similar to the ip netns subcommand.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 ip/Makefile    |   2 +-
 ip/ip.c        |   3 +-
 ip/ip_common.h |   1 +
 ip/ipvrf.c     | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+), 2 deletions(-)
 create mode 100644 ip/ipvrf.c

diff --git a/ip/Makefile b/ip/Makefile
index 2c742f305fef..4d44906802bd 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,4 +1,4 @@
-IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
+IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o ipvrf.o \
     rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o iptoken.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
diff --git a/ip/ip.c b/ip/ip.c
index 850a001756af..80d90a409541 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -48,7 +48,7 @@ static void usage(void)
 "       ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
 "                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm |\n"
-"                   netns | l2tp | fou | tcp_metrics | token | netconf }\n"
+"                   netns | vrf | l2tp | fou | tcp_metrics | token | netconf }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -h[uman-readable] | -iec |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | bridge | link } |\n"
@@ -93,6 +93,7 @@ static const struct cmd {
 	{ "mroute",	do_multiroute },
 	{ "mrule",	do_multirule },
 	{ "netns",	do_netns },
+	{ "vrf",	do_vrf },
 	{ "netconf",	do_ipnetconf },
 	{ "help",	do_help },
 	{ 0 }
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 89a495ea1074..499f9f34cd36 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -49,6 +49,7 @@ extern int do_multiaddr(int argc, char **argv);
 extern int do_multiroute(int argc, char **argv);
 extern int do_multirule(int argc, char **argv);
 extern int do_netns(int argc, char **argv);
+extern int do_vrf(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 extern int do_ipl2tp(int argc, char **argv);
 extern int do_ipfou(int argc, char **argv);
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
new file mode 100644
index 000000000000..df9b2e76b309
--- /dev/null
+++ b/ip/ipvrf.c
@@ -0,0 +1,109 @@
+#define _ATFILE_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <sys/inotify.h>
+#include <sys/mount.h>
+#include <sys/param.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <errno.h>
+#include <unistd.h>
+#include <ctype.h>
+
+#include "utils.h"
+#include "ip_common.h"
+
+static int usage(void)
+{
+	fprintf(stderr, "Usage: ip vrf exec ID cmd ...\n");
+	exit(-1);
+}
+
+static int vrf_exec(int argc, char **argv)
+{
+	const char *cmd, *id;
+	char vrf_path[MAXPATHLEN];
+	int fd;
+
+	if (argc < 1) {
+		fprintf(stderr, "No vrf id specified\n");
+		return -1;
+	}
+	if (argc < 2) {
+		fprintf(stderr, "No command specified\n");
+		return -1;
+	}
+
+	id = argv[0];
+	cmd = argv[1];
+	snprintf(vrf_path, sizeof(vrf_path), "/proc/%d/vrf", getpid());
+	fd = open(vrf_path, O_WRONLY);
+	if (fd < 0) {
+		fprintf(stderr, "Cannot open vrf file: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	if (write(fd, id, strlen(id)) < 0) {
+		fprintf(stderr, "Failed to set vrf id: %s\n",
+			strerror(errno));
+		close(fd);
+		return -1;
+	}
+	close(fd);
+
+	fflush(stdout);
+
+	if (batch_mode) {
+		int status;
+		pid_t pid;
+
+		pid = fork();
+		if (pid < 0) {
+			perror("fork");
+			exit(1);
+		}
+
+		if (pid != 0) {
+			/* Parent  */
+			if (waitpid(pid, &status, 0) < 0) {
+				perror("waitpid");
+				exit(1);
+			}
+
+			if (WIFEXITED(status)) {
+				/* ip must return the status of the child,
+				 * but do_cmd() will add a minus to this,
+				 * so let's add another one here to cancel it.
+				 */
+				return -WEXITSTATUS(status);
+			}
+
+			exit(1);
+		}
+	}
+
+	if (execvp(cmd, argv + 1) < 0)
+		fprintf(stderr, "exec of \"%s\" failed: %s\n",
+			cmd, strerror(errno));
+	_exit(1);
+}
+
+int do_vrf(int argc, char **argv)
+{
+	if (*argv == NULL)
+		return usage();
+
+	if (matches(*argv, "help") == 0)
+		return usage();
+
+	if (matches(*argv, "exec") == 0)
+		return vrf_exec(argc-1, argv+1);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip vrf help\".\n", *argv);
+	exit(-1);
+}
-- 
1.9.3 (Apple Git-50)
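
vrf_exec() above treats the write to /proc/<pid>/vrf as successful whenever
write() does not return a negative value. As a minimal sketch of a slightly
stricter variant, assuming a short write should also count as a failure:
write_vrf_id() and its path parameter are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the VRF id string to the given file. Returns 0 only if the
 * whole string was written; returns -1 with errno set otherwise.
 */
static int write_vrf_id(const char *path, const char *id)
{
	size_t len = strlen(id);
	ssize_t n;
	int fd, saved;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	n = write(fd, id, len);
	saved = errno;		/* close() may overwrite errno */
	close(fd);

	if (n < 0) {
		errno = saved;
		return -1;
	}
	if ((size_t)n != len) {
		errno = EIO;	/* short write: report as an I/O error */
		return -1;
	}
	return 0;
}
```

Taking the path as a parameter keeps the helper testable against any writable
file; in vrf_exec() the path would be the one built with snprintf().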


* [RFC PATCH 29/29] iproute2: Add vrf option to ip link command
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (27 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 28/29] iproute2: vrf: Add vrf subcommand David Ahern
@ 2015-02-05  1:34 ` David Ahern
  2015-02-05  5:17 ` [RFC PATCH 00/29] net: VRF support roopa
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-05  1:34 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm, David Ahern

Add option to ip link to change the vrf context on a netdevice.
e.g., ip link set dev eth4 vrf 99

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 bridge/link.c           | 3 +++
 include/linux/if_link.h | 2 ++
 ip/ipaddress.c          | 2 ++
 ip/iplink.c             | 9 +++++++++
 4 files changed, 16 insertions(+)

diff --git a/bridge/link.c b/bridge/link.c
index c8555f82d5b4..520e656f3bf8 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -146,6 +146,9 @@ int print_linkinfo(const struct sockaddr_nl *who,
 
 	print_link_flags(fp, ifi->ifi_flags);
 
+	if (tb[IFLA_VRF])
+		fprintf(fp, "vrf %u ", rta_getattr_u32(tb[IFLA_VRF]));
+
 	if (tb[IFLA_MTU])
 		fprintf(fp, "mtu %u ", rta_getattr_u32(tb[IFLA_MTU]));
 
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 167ec34bab73..c261d3040b88 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -146,6 +146,8 @@ enum {
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
 	IFLA_PHYS_SWITCH_ID,
+	IFLA_LINK_NETNSID,
+	IFLA_VRF,
 	__IFLA_MAX
 };
 
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index d5e863dd1f12..f4001e0ef8cb 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -625,6 +625,8 @@ int print_linkinfo(const struct sockaddr_nl *who,
 
 	if (tb[IFLA_MTU])
 		fprintf(fp, "mtu %u ", *(int*)RTA_DATA(tb[IFLA_MTU]));
+	if (tb[IFLA_VRF])
+		fprintf(fp, "vrf %u ", rta_getattr_u32(tb[IFLA_VRF]));
 	if (tb[IFLA_QDISC])
 		fprintf(fp, "qdisc %s ", rta_getattr_str(tb[IFLA_QDISC]));
 	if (tb[IFLA_MASTER]) {
diff --git a/ip/iplink.c b/ip/iplink.c
index c93d1dc3d5f6..0474293527c5 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -72,6 +72,7 @@ void iplink_usage(void)
 	fprintf(stderr, "	                  [ mtu MTU ]\n");
 	fprintf(stderr, "	                  [ netns PID ]\n");
 	fprintf(stderr, "	                  [ netns NAME ]\n");
+	fprintf(stderr, "	                  [ vrf ID ]\n");
 	fprintf(stderr, "			  [ alias NAME ]\n");
 	fprintf(stderr, "	                  [ vf NUM [ mac LLADDR ]\n");
 	fprintf(stderr, "				   [ vlan VLANID [ qos VLAN-QOS ] ]\n");
@@ -383,6 +384,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 	int mtu = -1;
 	int netns = -1;
 	int vf = -1;
+	int vrf = -1;
 	int numtxqueues = -1;
 	int numrxqueues = -1;
 	int dev_index = 0;
@@ -447,6 +449,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_PID, &netns, 4);
 			else
 				invarg("Invalid \"netns\" value\n", *argv);
+		} else if (strcmp(*argv, "vrf") == 0) {
+			NEXT_ARG();
+			if (vrf != -1)
+				duparg("vrf", *argv);
+			if (get_integer(&vrf, *argv, 0))
+				invarg("Invalid \"vrf\" value\n", *argv);
+			addattr_l(&req->n, sizeof(*req), IFLA_VRF, &vrf, 4);
 		} else if (strcmp(*argv, "multicast") == 0) {
 			NEXT_ARG();
 			req->i.ifi_change |= IFF_MULTICAST;
-- 
1.9.3 (Apple Git-50)
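
The iplink_parse() hunk above sends any value that get_integer() accepts;
whether the kernel rejects out-of-range ids is not shown in this patch. A
minimal sketch of a client-side range check, assuming the limits from
include/net/vrf.h in patch 14 and assuming that 'any' is not valid for a
device (devices belong to a single VRF). parse_link_vrf_id() is a
hypothetical helper, not an iproute2 function.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Limits as defined in include/net/vrf.h (patch 14 of this series). */
#define VRF_BITS	12
#define VRF_MIN		1
#define VRF_MAX		((1 << VRF_BITS) - 1)

/*
 * Parse a decimal VRF id for a netdevice. Returns 0 and stores the id
 * on success; returns -1 for non-numeric input, trailing garbage, or a
 * value outside [VRF_MIN, VRF_MAX].
 */
static int parse_link_vrf_id(const char *arg, unsigned int *vrf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -1;
	if (val < VRF_MIN || val > VRF_MAX)
		return -1;

	*vrf = (unsigned int)val;
	return 0;
}
```

With a check like this, "ip link set dev eth4 vrf 99" would pass, while ids
such as 0, 4096 or -1 would be rejected before a netlink message is built.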


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (28 preceding siblings ...)
  2015-02-05  1:34 ` [RFC PATCH 29/29] iproute2: Add vrf option to ip link command David Ahern
@ 2015-02-05  5:17 ` roopa
  2015-02-05 13:44 ` Nicolas Dichtel
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: roopa @ 2015-02-05  5:17 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, ebiederm, hannes

On 2/4/15, 5:34 PM, David Ahern wrote:
> Kernel patches are also available here:
>      https://github.com/dsahern/linux.git vrf-3.19
>
> iproute2 patches are also available here:
>      https://github.com/dsahern/iproute2 vrf-3.19
>
>
> Background
> ----------
> The concept of VRFs (Virtual Routing and Forwarding) has been around for over
> 15 years. Support for VRFs in the Linux kernel has been an often requested
> feature for almost as long. For a while support was available via an out of
> tree patch [1]. Since network namespaces came along, the response to queries
> about VRF support for Linux was 'use namespaces'. But as mentioned previously
> [2] network namespaces are not a good match for VRFs. Of the list of problems
> noted the big one is that namespaces do not scale efficiently to the number
> of VRFs supported by networking gear (> 1000 VRFs). Networking vendors that
> want to use Linux as the OS have to carry custom solutions to this problem --
> be it userspace networking stacks, extensive kernel patches (to add VRF
> support or bend the implementation of namespaces), and/or patches to many
> open source components. The recent addition of switchdev support in the
> kernel suggests that people expect the use of Linux as a switch networking
> OS to increase. Hopefully the time is right to re-open the discussion on a
> scalable VRF implementation for the Linux kernel.

Yes, we have been thinking about VRFs and have stumbled upon questions and
problems similar to the ones you list.
Thanks for the work and for putting up a proposal. I haven't looked at all of
your patches in detail, but we are certainly interested in working on a
possible VRF solution for Linux.
>
> The intent of this RFC is to get feedback on the overall idea - namely VRFs
> as integer id and the nesting of VRFs within a namespace. This set includes
> changes only to core IPv4 code which shows the concept; changes to the rest
> of the network stack are fairly repetitive.

I see that the changes look like a lot, but they are mostly adding the VRF
indirection.

We have been looking at ip rules (for the use cases with non-duplicate
IP addresses) and also at net namespaces. Currently net namespaces seem like
a good solution, but they provide stricter isolation than needed, and we will
need to punch holes or leak stuff across namespaces to make all use cases of
VRFs really work.

Your approach seems reasonable so far.

more on this later,

Thanks!


>
> This patch set has a number of similarities to the original VRF patch - most
> notably VRF ids as an integer index and plumbing through iproute2 and
> netlink. But this set is really a complete re-implementation of the feature,
> integrating VRF within a namespace and leveraging existing support for
> network namespaces.
>
> Design
> ------
> Namespaces provide excellent separation of the networking stack from the
> netdevices and up. The intent of VRFs is to provide an additional,
> logical separation at the L3 layer within a namespace.
>
>     +----------------------------------------------------------+
>     | Namespace foo                                            |
>     |                         +---------------+                |
>     |          +------+       | L3/L4 service |                |
>     |          | lldp |       |   (VRF any)   |                |
>     |          +------+       +---------------+                |
>     |                                                          |
>     |                             +-------------------------+  |
>     |                             | VRF M                   |  |
>     |  +---------------------+  +-------------------------+ |  |
>     |  | VRF 1 (default)     |  | VRF N                   | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |  | L3/L4 service |  |  |    | L3/L4 service |    | |  |
>     |  |  | (VRF unaware) |  |  |    | (VRF unaware) |    | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |                     |  |                         | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  || FIB | | neighbor | |  |  | FIB | | neighbor |   | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  |                     |  |                         |-+  |
>     |  | {dev 1}  {dev 2}    |  | {dev 3} {dev 4} {dev 5} |    |
>     |  +---------------------+  +-------------------------+    |
>     +----------------------------------------------------------+
>
> This is accomplished by enhancing the current namespace checks to a
> broader network context that is both a namespace and a VRF id. The VRF
> id is a tag applied to relevant structures, an integer between 1 and 4095
> which allows for 4095 VRFs (could have 0 be the default VRF and then the
> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
> is based on the genid scheme for versioning networking data which is a
> 32-bit integer. The VRF id is the lower 12 bits of the genid's.)
>
> Netdevices, sk_buffs, sockets, and tasks are all tagged with a VRF id.
> Network lookups (devices, sockets, addresses, routes, neighbors) require a
> match of both network namespace and VRF id (or the special 'vrf any' tag;
> more on that later).
>
> Beyond the 4-byte tag in various data structures, there are no resources
> allocated to a VRF so there is no need to create or destroy a VRF which is
> in-line with the concept of keeping it lightweight for scalability. The
> trade-off is that VRFs use the same sysctl settings as the namespace
> they are part of and, for example, MIB counters.
>
> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
> to this file (if preferred this can be made a prctl to change the VRF id).
> This allows services to be launched in a VRF context using ip, similar to
> what is done for network namespaces.
>      e.g., ip vrf exec 99 /usr/sbin/sshd
>
> (or a simpler chvrf alias/command can be used to just write the VRF id
> to the proc file.)
>
> The task's VRF id also affects viewing and modifying network configuration.
> For example, 'ip addr show', 'ip route ls', 'ifconfig', 'arp -n', etc, only
> show network data for the VRF associated with the task's VRF id; devices
> are at the L2 layer so a command listing devices is not impacted by VRF id.
>
> When a socket is created the VRF id is taken from the task. Socket-vrf
> association for non-connected sockets can be changed using a setsockopt
> (e.g., create a socket then change VRF id prior to calling bind or connect).
>
> Network devices belong to a single VRF context which defaults to VRF 1.
> They can be assigned to another VRF using IFLA_VRF attribute in link
> messages. Similarly the VRF assignment is returned in the IFLA_VRF
> attribute. The ip command has been modified to display the VRF id of a
> device. L2 applications like lldp are not VRF aware and still work
> through all network devices within the namespace.
>
> On RX skbs get their VRF context from the netdevice the packet is received
> on. For TX the VRF context for an skb is taken from the socket. The
> intention is for L3/raw sockets to be able to set the VRF context for a
> packet TX using cmsg (not coded in this patch set).
>
> VRF aware apps (e.g., L3 VPNs) can have sockets in multiple VRFs for
> forwarding packets.
>
> The special 'ANY VRF' context allows a single instance of a daemon to
> provide a service across all VRFs.
>      e.g., ip vrf exec any /usr/sbin/sshd
>
> The 'any' context applies to listen sockets only; connected sockets are in
> a VRF context. Child sockets accepted by the daemon acquire the VRF context
> of the network device the connection originated on.
>
> The 'ANY VRF' context can also be used to display all addresses, routes
> or neighbors in the kernel cache. That is, 'ip addr show', 'ip route ls',
> 'ifconfig', 'arp -n', etc, show all network data for the namespace.
>
>
> About this Patch Set
> --------------------
> This is not a complete conversion of the networking stack, only a small
> sampling to test the waters. Only changes are to core IPv4 code [3] which
> is sufficient to illustrate the fundamental concept. Changes from
> struct net to net_ctx are very repetitive.
>
> I'm sure there are a lot of oversights and bugs, but the intent here is
> to solicit feedback on the overall idea.
>
>
> Examples
> --------
> To illustrate the VRF patches consider a system with 18 NICs:
> - eth0, eth17 are in default namespace (e.g., management namespace)
>
> - eth1 - eth8 are in group1 namespace
>    - eth1 - eth4 are in VRF 11
>    - eth5 - eth8 are in VRF 13
>
> - eth9 - eth16 are in group2 namespace
>    - eth9 - eth12 are in VRF 21
>    - eth13 - eth16 are in VRF 23
>
> - Addresses assigned to each interface:
>    - eth1: 1.1.1.1/24
>    - eth2: 2.2.2.1/24
>    - eth3: 3.3.3.1/24
>    - eth4: 4.4.4.1/24
>    - eth5: 1.1.1.1/24 (not a typo, duplicate address in different vrfs)
>    - eth6: 6.6.6.1/24
>    - eth7: 7.7.7.1/24
>    - eth8: 8.8.8.1/24
>
> - openlldpd is started in each namespace
>
> 1. device list is VRF agnostic
>     - ifconfig -a, ip link show, /proc/net/dev
>       --> default namespace shows only eth0 and eth17
>       --> group1 namespace shows only eth1 - eth8
>       --> group2 namespace shows only eth9 - eth16
>           - ip shows vrf assignment of each link
>
>      3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
>          link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff
>
> 2. address, route, neighbor list is VRF aware
>     - ifconfig, ip addr show, ip route ls, /proc/net/route
>       --> shows only addresses for VRF id of task unless id is 'any'
>
>     in VRF 1:
>     ifconfig eth1
>     eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>          ether 02:ab:cd:02:00:01  txqueuelen 1000  (Ethernet)
>     ...
>
>     No addresses are shown. But if the command is run in VRF 11 or VRF 'any'
>       ip vrf exec 11 ip addr show dev eth1
>       3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP group default qlen 1000
>          link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff
>          inet 1.1.1.1/24 brd 1.1.1.255 scope global eth1
>             valid_lft forever preferred_lft forever
>
> 3. start ssh in group1 namespace
>     ip netns exec group1 ip vrf exec 11 /usr/sbin/sshd -d
>     ssh to 1.1.1.1 via eth1
>
>     ip netns exec group1 ip vrf exec 13 /usr/sbin/sshd -d
>     ssh to 1.1.1.1 via eth5
>     --> same namespace but different VRFs
>
> 4. One ssh instance handles VRFs in group1 namespace
>     ip netns exec group1 ip vrf exec any /usr/sbin/sshd
>
>     --> ssh to any address in the namespace works
>
> References
> ----------
> [1] http://sourceforge.net/projects/linux-vrf
>
> [2] http://www.spinics.net/lists/netdev/msg298368.html
>
> [3] To build only enable core ipv4 code. Disable IPv6, netfilter, ipsec, etc.
>
>
> David Ahern (29):
>    net: Introduce net_ctx and macro for context comparison
>    net: Flip net_device to use net_ctx
>    net: Flip sock_common to net_ctx
>    net: Add net_ctx macros for skbuffs
>    net: Flip seq_net_private to net_ctx
>    net: Flip fib_rules and fib_rules_ops to use net_ctx
>    net: Flip inet_bind_bucket to net_ctx
>    net: Flip fib_info to net_ctx
>    net: Flip ip6_flowlabel to net_ctx
>    net: Flip neigh structs to net_ctx
>    net: Flip nl_info to net_ctx
>    net: Add device lookups by net_ctx
>    net: Convert function arg from struct net to struct net_ctx
>    net: vrf: Introduce vrf header file
>    net: vrf: Add vrf to net_ctx struct
>    net: vrf: Set default vrf
>    net: vrf: Add vrf context to task struct
>    net: vrf: Plumbing for vrf context on a socket
>    net: vrf: Add vrf context to skb
>    net: vrf: Add vrf context to flow struct
>    net: vrf: Add vrf context to genid's
>    net: vrf: Set VRF id in various network structs
>    net: vrf: Enable vrf checks
>    net: vrf: Add support to get/set vrf context on a device
>    net: vrf: Handle VRF any context
>    net: vrf: Change single_open_net to pass net_ctx
>    net: vrf: Add vrf checks and context to ipv4 proc files
>    iproute2: vrf: Add vrf subcommand
>    iproute2: Add vrf option to ip link command
>
>   fs/proc/base.c                   |  94 +++++++++++++++++++++++++
>   fs/proc/proc_net.c               |  22 +++++-
>   include/linux/inetdevice.h       |  12 ++--
>   include/linux/init_task.h        |   1 +
>   include/linux/netdevice.h        |  44 +++++++++++-
>   include/linux/sched.h            |   2 +
>   include/linux/seq_file_net.h     |  10 +--
>   include/linux/skbuff.h           |   5 ++
>   include/net/addrconf.h           |  22 +++---
>   include/net/arp.h                |   2 +-
>   include/net/dst.h                |  16 ++---
>   include/net/fib_rules.h          |  10 ++-
>   include/net/flow.h               |  10 ++-
>   include/net/inet6_hashtables.h   |  19 +++---
>   include/net/inet_hashtables.h    |  60 ++++++++++------
>   include/net/inet_sock.h          |   1 +
>   include/net/inet_timewait_sock.h |   1 +
>   include/net/ip.h                 |  10 +--
>   include/net/ip6_fib.h            |   4 +-
>   include/net/ip6_route.h          |  24 +++----
>   include/net/ip_fib.h             |  38 +++++++----
>   include/net/ipv6.h               |  14 +++-
>   include/net/neighbour.h          |  93 +++++++++++++++++++++----
>   include/net/net_namespace.h      |  39 +++++++++--
>   include/net/netlink.h            |   5 +-
>   include/net/route.h              |  46 +++++++------
>   include/net/sock.h               |  21 ++++--
>   include/net/tcp.h                |   1 +
>   include/net/transp_v6.h          |   2 +-
>   include/net/udp.h                |   8 +--
>   include/net/vrf.h                |  36 ++++++++++
>   include/net/xfrm.h               |  28 ++++----
>   include/uapi/linux/if_link.h     |   1 +
>   include/uapi/linux/in.h          |   1 +
>   kernel/fork.c                    |   2 +
>   net/core/dev.c                   |  95 +++++++++++++++++++++++---
>   net/core/fib_rules.c             |  36 ++++++----
>   net/core/flow.c                  |   5 +-
>   net/core/neighbour.c             | 106 +++++++++++++++--------------
>   net/core/rtnetlink.c             |  12 ++++
>   net/core/skbuff.c                |  12 ++++
>   net/core/sock.c                  |   2 +
>   net/ipv4/af_inet.c               |  20 ++++--
>   net/ipv4/arp.c                   |  76 ++++++++++++---------
>   net/ipv4/datagram.c              |   6 +-
>   net/ipv4/devinet.c               |  64 ++++++++++++------
>   net/ipv4/fib_frontend.c          |  83 ++++++++++++++---------
>   net/ipv4/fib_rules.c             |  12 ++--
>   net/ipv4/fib_semantics.c         |  38 +++++++----
>   net/ipv4/fib_trie.c              |  24 +++++--
>   net/ipv4/icmp.c                  |  40 ++++++-----
>   net/ipv4/igmp.c                  |  53 +++++++++------
>   net/ipv4/inet_connection_sock.c  |  23 ++++---
>   net/ipv4/inet_diag.c             |  13 ++--
>   net/ipv4/inet_hashtables.c       |  42 +++++++-----
>   net/ipv4/inet_timewait_sock.c    |   1 +
>   net/ipv4/ip_input.c              |   6 +-
>   net/ipv4/ip_options.c            |  20 +++---
>   net/ipv4/ip_output.c             |  16 +++--
>   net/ipv4/ip_sockglue.c           |  32 +++++++--
>   net/ipv4/ipconfig.c              |   6 +-
>   net/ipv4/ipmr.c                  |  53 +++++++++------
>   net/ipv4/netfilter.c             |  13 ++--
>   net/ipv4/ping.c                  |  41 +++++------
>   net/ipv4/proc.c                  |  10 +--
>   net/ipv4/raw.c                   |  48 ++++++++-----
>   net/ipv4/route.c                 | 143 +++++++++++++++++++++++----------------
>   net/ipv4/syncookies.c            |   6 +-
>   net/ipv4/tcp_ipv4.c              |  57 +++++++++-------
>   net/ipv4/tcp_minisocks.c         |   1 +
>   net/ipv4/udp.c                   | 122 ++++++++++++++++++---------------
>   net/ipv4/udp_diag.c              |  11 +--
>   net/ipv4/xfrm4_policy.c          |  14 ++--
>   net/netlink/af_netlink.c         |  12 ++++
>   net/sctp/protocol.c              |  10 +--
>   net/xfrm/xfrm_policy.c           |   9 +--
>   76 files changed, 1415 insertions(+), 682 deletions(-)
>   create mode 100644 include/net/vrf.h
>


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (29 preceding siblings ...)
  2015-02-05  5:17 ` [RFC PATCH 00/29] net: VRF support roopa
@ 2015-02-05 13:44 ` Nicolas Dichtel
  2015-02-06  1:32   ` David Ahern
  2015-02-05 23:12 ` roopa
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:44 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
[snip]
> This is accomplished by enhancing the current namespace checks to a
> broader network context that is both a namespace and a VRF id. The VRF
> id is a tag applied to relevant structures, an integer between 1 and 4095
> which allows for 4095 VRFs (could have 0 be the default VRF and then the
> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
> is based on the genid scheme for versioning networking data which is a
> 32-bit integer. The VRF id is the lower 12 bits of the genid's.)
Would it be possible to avoid this artificial limit?
There could be scenarios with more than 4096 VRFs.
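
To make the trade-off behind the 12-bit limit concrete, here is a minimal
userspace sketch of the arithmetic. VRF_BITS, VRF_MAX and VRF_MASK follow
include/net/vrf.h from patch 14 (with an unsigned literal). genid_pack(),
genid_vrf() and genid_counter() are hypothetical helper names, and placing
the version counter in the upper 20 bits is an assumption based only on the
cover letter's statement that the VRF id is the lower 12 bits of the genid.

```c
#include <assert.h>
#include <stdint.h>

#define VRF_BITS	12
#define VRF_MAX		((1u << VRF_BITS) - 1)	/* 4095 */
#define VRF_MASK	VRF_MAX

/* Combine a version counter (upper bits) with a VRF id (lower 12 bits). */
static uint32_t genid_pack(uint32_t counter, uint32_t vrf)
{
	return (counter << VRF_BITS) | (vrf & VRF_MASK);
}

/* Recover the VRF id from a packed genid. */
static uint32_t genid_vrf(uint32_t genid)
{
	return genid & VRF_MASK;
}

/* Recover the version counter; only 32 - VRF_BITS = 20 bits survive. */
static uint32_t genid_counter(uint32_t genid)
{
	return genid >> VRF_BITS;
}
```

Under this layout every bit added to VRF_BITS is taken from the counter, so
widening the id range makes the counter wrap sooner.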

Do you plan to have a way to dump or monitor VRF via netlink?


* Re: [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket
  2015-02-05  1:34 ` [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket David Ahern
@ 2015-02-05 13:44   ` Nicolas Dichtel
  2015-02-06  1:18     ` David Ahern
  0 siblings, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:44 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
> Sockets inherit the vrf context of the task opening it. The context can
> be read/changed via a socket option (IP_VRF_CONTEXT).
What about using a common socket option (SO_VRF_CONTEXT) instead of an
IPv4-specific option?


* Re: [RFC PATCH 14/29] net: vrf: Introduce vrf header file
  2015-02-05  1:34 ` [RFC PATCH 14/29] net: vrf: Introduce vrf header file David Ahern
@ 2015-02-05 13:44   ` Nicolas Dichtel
  2015-02-06  0:52     ` David Ahern
  0 siblings, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:44 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
> Defines for min and max vrf id and helpers for examining
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>   include/net/vrf.h | 36 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 36 insertions(+)
>   create mode 100644 include/net/vrf.h
>
> diff --git a/include/net/vrf.h b/include/net/vrf.h
> new file mode 100644
> index 000000000000..67bc2e465661
> --- /dev/null
> +++ b/include/net/vrf.h
> @@ -0,0 +1,36 @@
> +#ifndef _VRF_H_
> +#define _VRF_H_
> +
> +#define VRF_BITS	12
> +#define VRF_MIN		1
> +#define VRF_MAX		((1 << VRF_BITS) - 1)
> +#define VRF_MASK	VRF_MAX
> +
> +#define VRF_DEFAULT	1
> +#define VRF_ANY		0xffff
It could be useful to expose this value to userland.


* Re: [RFC PATCH 19/29] net: vrf: Add vrf context to skb
  2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
@ 2015-02-05 13:45   ` Nicolas Dichtel
  2015-02-06  1:21     ` David Ahern
  2015-02-06  3:54   ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:45 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
> On ingress skb's inherit vrf context from the net_device. For TX skb's
> inherit the vrf context from the socket originating the packet. Update
> SKB related net_ctx macros to set vrf.
Is skb->vrf really needed?
Is it not possible to get the VRF from sockets, interfaces, etc., as is done
with netns?


* Re: [RFC PATCH 25/29] net: vrf: Handle VRF any context
  2015-02-05  1:34 ` [RFC PATCH 25/29] net: vrf: Handle VRF any context David Ahern
@ 2015-02-05 13:46   ` Nicolas Dichtel
  2015-02-06  1:23     ` David Ahern
  0 siblings, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:46 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
> VRF any context applies only to tasks and sockets. Devices are
> associated with a single VRF, and skb's by extension are connected to
> a single VRF.
>
> Listen sockets and unconnected sockets can be opened in a "VRF any"
> context allowing a single daemon to provide service across all VRFs
> in a namespace. Connected sockets must be in a specific vrf context.
> Accepted sockets acquire the VRF context from the device the packet
> enters (via the skb).
>
> "VRF any" context is also useful for tasks wanting to view L3/L4
> data for all VRFs.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
[snip]
> +static inline int neigh_parms_net_ctx_eq_any(const struct neigh_parms *parms,
> +					     const struct net_ctx *net_ctx)
> +{
> +#ifdef CONFIG_NET_NS
> +	if (net_eq(neigh_parms_net(parms), net_ctx->net) &&
> +	    (vrf_eq(neigh_parms_vrf(parms), net_ctx->vrf) ||
> +	     vrf_is_any(net_ctx->vrf))) {
> +		return 1;
> +	}
> +
> +	return 0;
> +#else
> +	return 1;
> +#endif
If I understand correctly, the way the patch is written, VRF can be used only
if CONFIG_NET_NS is enabled.
But if I'm not wrong, it could be independent. Am I right?
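
To illustrate the question, a minimal userspace sketch of a context check
where only the namespace comparison depends on CONFIG_NET_NS and the VRF
comparison always runs. This is not the code from the patch: struct net and
struct net_ctx are stand-ins, the vrf field type is an assumption, and
ctx_match_any() is a hypothetical name. vrf_eq(), vrf_is_any() and net_eq()
mirror the helper names used in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRF_ANY	0xffff	/* value from include/net/vrf.h in patch 14 */

struct net { int unused; };	/* stand-in for the kernel's struct net */

struct net_ctx {
	const struct net *net;
	uint32_t vrf;		/* assumed type for this sketch */
};

static bool vrf_eq(uint32_t a, uint32_t b)
{
	return a == b;
}

static bool vrf_is_any(uint32_t vrf)
{
	return vrf == VRF_ANY;
}

/* Only the namespace comparison is compiled out without CONFIG_NET_NS. */
static bool net_eq(const struct net *a, const struct net *b)
{
#ifdef CONFIG_NET_NS
	return a == b;
#else
	(void)a;
	(void)b;
	return true;
#endif
}

/* Match when namespaces agree and the lookup VRF is equal or 'any'. */
static bool ctx_match_any(const struct net_ctx *obj,
			  const struct net_ctx *lookup)
{
	return net_eq(obj->net, lookup->net) &&
	       (vrf_eq(obj->vrf, lookup->vrf) || vrf_is_any(lookup->vrf));
}
```

With this shape a kernel built without namespaces would still separate
objects by VRF id, which is the independence being asked about.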


* Re: [RFC PATCH 02/29] net: Flip net_device to use net_ctx
  2015-02-05  1:34 ` [RFC PATCH 02/29] net: Flip net_device to use net_ctx David Ahern
@ 2015-02-05 13:47   ` Nicolas Dichtel
  2015-02-06  0:45     ` David Ahern
  0 siblings, 1 reply; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-05 13:47 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

On 05/02/2015 02:34, David Ahern wrote:
> +static inline
> +int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
> +{
> +	if (net_eq(dev_net(dev), ctx->net))
> +		return 1;
Why not use net_ctx_eq()?


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (30 preceding siblings ...)
  2015-02-05 13:44 ` Nicolas Dichtel
@ 2015-02-05 23:12 ` roopa
  2015-02-06  2:19   ` David Ahern
  2015-02-06  6:10   ` Shmulik Ladkani
  2015-02-06  1:33 ` Stephen Hemminger
                   ` (4 subsequent siblings)
  36 siblings, 2 replies; 119+ messages in thread
From: roopa @ 2015-02-05 23:12 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, ebiederm, Dinesh Dutt, Vipin Kumar, Nicolas Dichtel, hannes

On 2/4/15, 5:34 PM, David Ahern wrote:
> Kernel patches are also available here:
>      https://github.com/dsahern/linux.git vrf-3.19
>
> iproute2 patches are also available here:
>      https://github.com/dsahern/iproute2 vrf-3.19
>
>
> Background
> ----------
> The concept of VRFs (Virtual Routing and Forwarding) has been around for over
> 15 years. Support for VRFs in the Linux kernel has been an often requested
> feature for almost as long. For a while support was available via an out of
> tree patch [1]. Since network namespaces came along, the response to queries
> about VRF support for Linux was 'use namespaces'. But as mentioned previously
> [2] network namespaces are not a good match for VRFs. Of the list of problems
> noted the big one is that namespaces do not scale efficiently to the number
> of VRFs supported by networking gear (> 1000 VRFs). Networking vendors that
> want to use Linux as the OS have to carry custom solutions to this problem --
> be it userspace networking stacks, extensive kernel patches (to add VRF
> support or bend the implementation of namespaces), and/or patches to many
> open source components. The recent addition of switchdev support in the
> kernel suggests that people expect the use of Linux as a switch networking
> OS to increase. Hopefully the time is right to re-open the discussion on a
> scalable VRF implementation for the Linux kernel.
>
> The intent of this RFC is to get feedback on the overall idea - namely VRFs
> as integer id and the nesting of VRFs within a namespace. This set includes
> changes only to core IPv4 code which shows the concept; changes to the rest
> of the network stack are fairly repetitive.
>
> This patch set has a number of similarities to the original VRF patch - most
> notably VRF ids as an integer index and plumbing through iproute2 and
> netlink. But this set is really a complete re-implementation of the feature,
> integrating VRF within a namespace and leveraging existing support for
> network namespaces.
>
> Design
> ------
> Namespaces provide excellent separation of the networking stack from the
> netdevices and up. The intent of VRFs is to provide an additional,
> logical separation at the L3 layer within a namespace.
>
>     +----------------------------------------------------------+
>     | Namespace foo                                            |
>     |                         +---------------+                |
>     |          +------+       | L3/L4 service |                |
>     |          | lldp |       |   (VRF any)   |                |
>     |          +------+       +---------------+                |
>     |                                                          |
>     |                             +-------------------------+  |
>     |                             | VRF M                   |  |
>     |  +---------------------+  +-------------------------+ |  |
>     |  | VRF 1 (default)     |  | VRF N                   | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |  | L3/L4 service |  |  |    | L3/L4 service |    | |  |
>     |  |  | (VRF unaware) |  |  |    | (VRF unaware) |    | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |                     |  |                         | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  || FIB | | neighbor | |  |  | FIB | | neighbor |   | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  |                     |  |                         |-+  |
>     |  | {dev 1}  {dev 2}    |  | {dev 3} {dev 4} {dev 5} |    |
>     |  +---------------------+  +-------------------------+    |
>     +----------------------------------------------------------+
>
> This is accomplished by enhancing the current namespace checks to a
> broader network context that is both a namespace and a VRF id. The VRF
> id is a tag applied to relevant structures, an integer between 1 and 4095
> which allows for 4095 VRFs (could have 0 be the default VRF and then the
> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
> is based on the genid scheme for versioning networking data which is a
> 32-bit integer. The VRF id is the lower 12 bits of the genid.)
>
> Netdevices, sk_buffs, sockets, and tasks are all tagged with a VRF id.
> Network lookups (devices, sockets, addresses, routes, neighbors) require a
> match of both network namespace and VRF id (or the special 'vrf any' tag;
> more on that later).
>
>
David,

Wondering if you have thought about some of the cases below in your
approach to vrfs?
- Leaking routes from one vrf to another
- On failure, a route lookup in one vrf falling back to the global vrf (this,
for example, can be done using throw if we used ip rules and route tables
to do the same).
- A route in one vrf pointing to a nexthop in another vrf

We have been playing with ip rules to implement vrfs. And the blocker 
today is that we cannot bind a socket to a vrf (routing tables in this 
case).

Thanks,
Roopa


* Re: [RFC PATCH 02/29] net: Flip net_device to use net_ctx
  2015-02-05 13:47   ` Nicolas Dichtel
@ 2015-02-06  0:45     ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  0:45 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:47 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
>> +static inline
>> +int dev_net_ctx_eq(const struct net_device *dev, struct net_ctx *ctx)
>> +{
>> +    if (net_eq(dev_net(dev), ctx->net))
>> +        return 1;
> Why not use net_ctx_eq()?

Right, will change it.
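
For reference, a minimal userspace sketch of what the suggested change could look like. The struct layouts, the `net_ctx` member name inside `struct net_device`, and the exact signature of `net_ctx_eq()` are assumptions made for illustration; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel types; the layouts are assumptions. */
struct net {
	int id;
};

struct net_ctx {
	const struct net *net;	/* network namespace */
	uint32_t vrf;		/* VRF id within that namespace */
};

struct net_device {
	struct net_ctx net_ctx;	/* hypothetical member name */
};

/* Two contexts match only if both namespace and VRF id match. */
static inline bool net_ctx_eq(const struct net_ctx *a, const struct net_ctx *b)
{
	return a->net == b->net && a->vrf == b->vrf;
}

/* The device check reduces to the generic context comparison. */
static inline bool dev_net_ctx_eq(const struct net_device *dev,
				  const struct net_ctx *ctx)
{
	return net_ctx_eq(&dev->net_ctx, ctx);
}
```

Routing the device check through one helper keeps the namespace-plus-VRF rule in a single place.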


* Re: [RFC PATCH 14/29] net: vrf: Introduce vrf header file
  2015-02-05 13:44   ` Nicolas Dichtel
@ 2015-02-06  0:52     ` David Ahern
  2015-02-06  8:53       ` Nicolas Dichtel
  0 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-06  0:52 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:44 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
>> Defines for min and max vrf id and helpers for examining
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> ---
>>   include/net/vrf.h | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>   create mode 100644 include/net/vrf.h
>>
>> diff --git a/include/net/vrf.h b/include/net/vrf.h
>> new file mode 100644
>> index 000000000000..67bc2e465661
>> --- /dev/null
>> +++ b/include/net/vrf.h
>> @@ -0,0 +1,36 @@
>> +#ifndef _VRF_H_
>> +#define _VRF_H_
>> +
>> +#define VRF_BITS    12
>> +#define VRF_MIN        1
>> +#define VRF_MAX        ((1 << VRF_BITS) - 1)
>> +#define VRF_MASK    VRF_MAX
>> +
>> +#define VRF_DEFAULT    1
>> +#define VRF_ANY        0xffff
> It could be useful to expose this value to userland.

Maybe. I was thinking VRF_ANY should stay kernel side only and have a 
sockopt value of -1 mean VRF_ANY.
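
A rough sketch of that mapping, assuming a helper that translates the user-supplied option value into the kernel-side id. The helper name `vrf_from_user()` is hypothetical; the defines are the ones quoted above.

```c
#include <errno.h>
#include <stdint.h>

#define VRF_BITS	12
#define VRF_MIN		1
#define VRF_MAX		((1 << VRF_BITS) - 1)
#define VRF_ANY		0xffff

/*
 * Hypothetical helper: translate a user-supplied sockopt value into a
 * kernel VRF id. -1 selects the "any" context; everything else must be
 * a valid VRF id, so VRF_ANY itself never has to be exposed to userland.
 */
static int vrf_from_user(int val, uint32_t *vrf)
{
	if (val == -1) {
		*vrf = VRF_ANY;
		return 0;
	}
	if (val < VRF_MIN || val > VRF_MAX)
		return -EINVAL;

	*vrf = (uint32_t)val;
	return 0;
}
```

With this shape, userspace passing 0xffff directly would be rejected as out of range, so the in-kernel value stays free to change.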

David


* Re: [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket
  2015-02-05 13:44   ` Nicolas Dichtel
@ 2015-02-06  1:18     ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  1:18 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:44 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
>> Sockets inherit the vrf context of the task opening it. The context can
>> be read/changed via a socket option (IP_VRF_CONTEXT).
> What about using a common socket option (SO_VRF_CONTEXT) instead of an ipv4
> specific option?

Makes sense.


* Re: [RFC PATCH 19/29] net: vrf: Add vrf context to skb
  2015-02-05 13:45   ` Nicolas Dichtel
@ 2015-02-06  1:21     ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  1:21 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:45 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
>> On ingress skb's inherit vrf context from the net_device. For TX skb's
>> inherit the vrf context from the socket originating the packet. Update
>> SKB related net_ctx macros to set vrf.
> Is this skb->vrf really needed?
> Is it not possible to get the vrf from sockets, interfaces, etc., as is done
> with netns?

One reason is for VRF_ANY context of sockets.
Another one is netlink messages getting funneled through a single kernel 
netlink socket; skb->vrf is used to retain vrf handling.

In time we might be able to make it go away.


* Re: [RFC PATCH 25/29] net: vrf: Handle VRF any context
  2015-02-05 13:46   ` Nicolas Dichtel
@ 2015-02-06  1:23     ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  1:23 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:46 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
>> VRF any context applies only to tasks and sockets. Devices are
>> associated with a single VRF, and skb's by extension are connected to
>> a single VRF.
>>
>> Listen sockets and unconnected sockets can be opened in a "VRF any"
>> context allowing a single daemon to provide service across all VRFs
>> in a namespace. Connected sockets must be in a specific vrf context.
>> Accepted sockets acquire the VRF context from the device the packet
>> enters (via the skb).
>>
>> "VRF any" context is also useful for tasks wanting to view L3/L4
>> data for all VRFs.
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> ---
> [snip]
>> +static inline int neigh_parms_net_ctx_eq_any(const struct neigh_parms *parms,
>> +                         const struct net_ctx *net_ctx)
>> +{
>> +#ifdef CONFIG_NET_NS
>> +    if (net_eq(neigh_parms_net(parms), net_ctx->net) &&
>> +        (vrf_eq(neigh_parms_vrf(parms), net_ctx->vrf) ||
>> +         vrf_is_any(net_ctx->vrf))) {
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +#else
>> +    return 1;
>> +#endif
> If I understand correctly, the way the patch is written, VRF can be used only if
> CONFIG_NET_NS is enabled.
> But unless I'm mistaken, the two could be independent. Am I right?
>

Yes. VRF can exist without namespaces. It became tedious to keep tracking
CONFIG_NET_NS for the RFC set. I would certainly do that for later
versions.
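
As a sketch of how the two could be decoupled, only the namespace half of the comparison would depend on CONFIG_NET_NS, while the VRF half is always evaluated. This is a userspace illustration under assumed type layouts; the generic name `net_ctx_eq_any()` is hypothetical and stands in for the per-object helpers in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VRF_ANY	0xffff

/* Stand-ins for the kernel types; the layouts are assumptions. */
struct net {
	int id;
};

struct net_ctx {
	const struct net *net;
	uint32_t vrf;
};

/* Only the namespace comparison compiles away without CONFIG_NET_NS. */
static inline bool net_eq(const struct net *a, const struct net *b)
{
#ifdef CONFIG_NET_NS
	return a == b;
#else
	(void)a;
	(void)b;
	return true;	/* single namespace: always equal */
#endif
}

static inline bool vrf_eq(uint32_t a, uint32_t b)
{
	return a == b;
}

static inline bool vrf_is_any(uint32_t vrf)
{
	return vrf == VRF_ANY;
}

/*
 * obj is the context of the object being matched, want is the caller's
 * context. The VRF check runs whether or not namespaces are built in.
 */
static inline bool net_ctx_eq_any(const struct net_ctx *obj,
				  const struct net_ctx *want)
{
	return net_eq(obj->net, want->net) &&
	       (vrf_eq(obj->vrf, want->vrf) || vrf_is_any(want->vrf));
}
```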


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05 13:44 ` Nicolas Dichtel
@ 2015-02-06  1:32   ` David Ahern
  2015-02-06  8:53     ` Nicolas Dichtel
  0 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-06  1:32 UTC (permalink / raw)
  To: nicolas.dichtel, netdev; +Cc: ebiederm

On 2/5/15 6:44 AM, Nicolas Dichtel wrote:
> On 05/02/2015 02:34, David Ahern wrote:
> [snip]
>> This is accomplished by enhancing the current namespace checks to a
>> broader network context that is both a namespace and a VRF id. The VRF
>> id is a tag applied to relevant structures, an integer between 1 and 4095
>> which allows for 4095 VRFs (could have 0 be the default VRF and then the
>> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
>> is based on the genid scheme for versioning networking data which is a
>> 32-bit integer. The VRF id is the lower 12 bits of the genid.)
> Would it be possible to avoid this artificial limit?
> There could be scenarios with more than 4096 VRFs.

As I recall, the genid was the only reason to put a limit on it. I know
of one product with a higher limit (16k I believe), but I figured this
was a reasonable starting point for the discussion.
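
To make the 12-bit limit concrete, here is a small sketch of carving the VRF id out of a 32-bit genid. VRF_BITS and VRF_MASK follow the defines in the vrf header patch; the helper names and the use of the upper 20 bits as a version counter are assumptions for illustration only.

```c
#include <stdint.h>

#define VRF_BITS	12
#define VRF_MAX		((1u << VRF_BITS) - 1)	/* 4095 */
#define VRF_MASK	VRF_MAX

/* The VRF id lives in the lower VRF_BITS bits of the genid. */
static inline uint32_t genid_vrf(uint32_t genid)
{
	return genid & VRF_MASK;
}

/* Assumed: the remaining upper bits carry the version counter. */
static inline uint32_t genid_version(uint32_t genid)
{
	return genid >> VRF_BITS;
}

/* Combine a version counter and a VRF id into one 32-bit genid. */
static inline uint32_t genid_make(uint32_t version, uint32_t vrf)
{
	return (version << VRF_BITS) | (vrf & VRF_MASK);
}
```

In this layout, raising VRF_BITS to 14 would allow 16k ids at the cost of two bits of version space.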

>
> Do you plan to have a way to dump or monitor VRF via netlink?

What do you mean? There is no creation / deletion event. Are you 
referring to monitoring device changes -- device moved from one network 
context (namespace, vrf) to another?

The VRF id can be added as an attribute to all relevant netlink 
notifications.


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (31 preceding siblings ...)
  2015-02-05 23:12 ` roopa
@ 2015-02-06  1:33 ` Stephen Hemminger
  2015-02-06  2:10   ` David Ahern
  2015-02-06 15:10 ` Nicolas Dichtel
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 119+ messages in thread
From: Stephen Hemminger @ 2015-02-06  1:33 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, ebiederm

On Wed,  4 Feb 2015 18:34:01 -0700
David Ahern <dsahern@gmail.com> wrote:

> Kernel patches are also available here:
>     https://github.com/dsahern/linux.git vrf-3.19
> 
> iproute2 patches are also available here:
>     https://github.com/dsahern/iproute2 vrf-3.19
> 
> 
> Background
> ----------
> The concept of VRFs (Virtual Routing and Forwarding) has been around for over
> 15 years. Support for VRFs in the Linux kernel has been an often requested
> feature for almost as long. For a while support was available via an out of
> tree patch [1]. Since network namespaces came along, the response to queries
> about VRF support for Linux was 'use namespaces'. But as mentioned previously
> [2] network namespaces are not a good match for VRFs. Of the list of problems
> noted the big one is that namespaces do not scale efficiently to the number
> of VRFs supported by networking gear (> 1000 VRFs). Networking vendors that
> want to use Linux as the OS have to carry custom solutions to this problem --
> be it userspace networking stacks, extensive kernel patches (to add VRF
> support or bend the implementation of namespaces), and/or patches to many
> open source components. The recent addition of switchdev support in the
> kernel suggests that people expect the use of Linux as a switch networking
> OS to increase. Hopefully the time is right to re-open the discussion on a
> scalable VRF implementation for the Linux kernel.
> 
> The intent of this RFC is to get feedback on the overall idea - namely VRFs
> as integer id and the nesting of VRFs within a namespace. This set includes
> changes only to core IPv4 code which shows the concept; changes to the rest
> of the network stack are fairly repetitive.
> 
> This patch set has a number of similarities to the original VRF patch - most
> notably VRF ids as an integer index and plumbing through iproute2 and
> netlink. But this set is really a complete re-implementation of the feature,
> integrating VRF within a namespace and leveraging existing support for
> network namespaces.
> 
> Design
> ------
> Namespaces provide excellent separation of the networking stack from the
> netdevices and up. The intent of VRFs is to provide an additional,
> logical separation at the L3 layer within a namespace.
> 
>    +----------------------------------------------------------+
>    | Namespace foo                                            |
>    |                         +---------------+                |
>    |          +------+       | L3/L4 service |                |
>    |          | lldp |       |   (VRF any)   |                |
>    |          +------+       +---------------+                |
>    |                                                          |
>    |                             +-------------------------+  |
>    |                             | VRF M                   |  |
>    |  +---------------------+  +-------------------------+ |  |
>    |  | VRF 1 (default)     |  | VRF N                   | |  |
>    |  |  +---------------+  |  |    +---------------+    | |  |
>    |  |  | L3/L4 service |  |  |    | L3/L4 service |    | |  |
>    |  |  | (VRF unaware) |  |  |    | (VRF unaware) |    | |  |
>    |  |  +---------------+  |  |    +---------------+    | |  |
>    |  |                     |  |                         | |  |
>    |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>    |  || FIB | | neighbor | |  |  | FIB | | neighbor |   | |  |
>    |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>    |  |                     |  |                         |-+  |
>    |  | {dev 1}  {dev 2}    |  | {dev 3} {dev 4} {dev 5} |    |
>    |  +---------------------+  +-------------------------+    |
>    +----------------------------------------------------------+
> 
> This is accomplished by enhancing the current namespace checks to a
> broader network context that is both a namespace and a VRF id. The VRF
> id is a tag applied to relevant structures, an integer between 1 and 4095
> which allows for 4095 VRFs (could have 0 be the default VRF and then the
> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
> is based on the genid scheme for versioning networking data which is a
> 32-bit integer. The VRF id is the lower 12 bits of the genid.)
> 
> Netdevices, sk_buffs, sockets, and tasks are all tagged with a VRF id.
> Network lookups (devices, sockets, addresses, routes, neighbors) require a
> match of both network namespace and VRF id (or the special 'vrf any' tag;
> more on that later).
> 
> Beyond the 4-byte tag in various data structures, there are no resources
> allocated to a VRF so there is no need to create or destroy a VRF which is
> in-line with the concept of keeping it lightweight for scalability. The
> trade-off is that VRFs use the same sysctl settings as the namespace
> they are part of and, for example, MIB counters.
> 
> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
> to this file (if preferred this can be made a prctl to change the VRF id).
> This allows services to be launched in a VRF context using ip, similar to
> what is done for network namespaces.
>     e.g., ip vrf exec 99 /usr/sbin/sshd
> 
> (or a simpler chvrf alias/command can be used to just write the VRF id
> to the proc file.)
> 
> The task's VRF id also affects viewing and modifying network configuration.
> For example, 'ip addr show', 'ip route ls', 'ifconfig', 'arp -n', etc, only
> show network data for the VRF associated with the task's VRF id; devices
> are at the L2 layer so a command listing devices is not impacted by VRF id.
> 
> When a socket is created the VRF id is taken from the task. Socket-vrf
> association for non-connected sockets can be changed using a setsockopt
> (e.g., create a socket then change VRF id prior to calling bind or connect).
> 
> Network devices belong to a single VRF context which defaults to VRF 1.
> They can be assigned to another VRF using IFLA_VRF attribute in link
> messages. Similarly the VRF assignment is returned in the IFLA_VRF
> attribute. The ip command has been modified to display the VRF id of a
> device. L2 applications like lldp are not VRF aware and still work
> through all network devices within the namespace.
> 
> On RX skbs get their VRF context from the netdevice the packet is received
> on. For TX the VRF context for an skb is taken from the socket. The
> intention is for L3/raw sockets to be able to set the VRF context for a
> packet TX using cmsg (not coded in this patch set).
> 
> VRF aware apps (e.g., L3 VPNs) can have sockets in multiple VRFs for
> forwarding packets.
> 
> The special 'ANY VRF' context allows a single instance of a daemon to
> provide a service across all VRFs.
>     e.g., ip vrf exec any /usr/sbin/sshd 
> 
> The 'any' context applies to listen sockets only; connected sockets are in
> a VRF context. Child sockets accepted by the daemon acquire the VRF context
> of the network device the connection originated on.
> 
> The 'ANY VRF' context can also be used to display all addresses, routes
> or neighbors in the kernel cache. That is, 'ip addr show', 'ip route ls',
> 'ifconfig', 'arp -n', etc, show all network data for the namespace.
> 
> 
> About this Patch Set
> --------------------
> This is not a complete conversion of the networking stack, only a small
> sampling to test the waters. The only changes are to core IPv4 code [3], which
> is sufficient to illustrate the fundamental concept. Changes from 
> struct net to net_ctx are very repetitive.
> 
> I'm sure there are a lot of oversights and bugs, but the intent here is
> to solicit feedback on the overall idea.
> 
> 
> Examples
> --------
> To illustrate the VRF patches consider a system with 18 NICs:
> - eth0, eth17 are in default namespace (e.g., management namespace)
> 
> - eth1 - eth8 are in group1 namespace
>   - eth1 - eth4 are in VRF 11
>   - eth5 - eth8 are in VRF 13
> 
> - eth9 - eth16 are in group2 namespace
>   - eth9 - eth12 are in VRF 21
>   - eth13 - eth16 are in VRF 23
> 
> - Addresses assigned to each interface:
>   - eth1: 1.1.1.1/24
>   - eth2: 2.2.2.1/24
>   - eth3: 3.3.3.1/24
>   - eth4: 4.4.4.1/24
>   - eth5: 1.1.1.1/24 (not a typo, duplicate address in different vrfs)
>   - eth6: 6.6.6.1/24
>   - eth7: 7.7.7.1/24
>   - eth8: 8.8.8.1/24
> 
> - openlldpd is started in each namespace
> 
> 1. device list is VRF agnostic
>    - ifconfig -a, ip link show, /proc/net/dev
>      --> default namespace shows only eth0 and eth17
>      --> group1 namespace shows only eth1 - eth8
>      --> group2 namespace shows only eth9 - eth16
>          - ip shows vrf assignment of each link
> 
>     3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
>         link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff
> 
> 2. address, route, neighbor list is VRF aware
>    - ifconfig, ip addr show, ip route ls, /proc/net/route
>      --> shows only addresses for VRF id of task unless id is 'any'
> 
>    in VRF 1:
>    ifconfig eth1
>    eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         ether 02:ab:cd:02:00:01  txqueuelen 1000  (Ethernet)
>    ...
> 
>    No addresses are shown. But if the command is run in VRF 11 or VRF 'any' 
>      ip vrf exec 11 ip addr show dev eth1
>      3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 vrf 11 qdisc pfifo_fast state UP group default qlen 1000
>         link/ether 02:ab:cd:02:00:01 brd ff:ff:ff:ff:ff:ff
>         inet 1.1.1.1/24 brd 1.1.1.255 scope global eth1
>            valid_lft forever preferred_lft forever
> 
> 3. start ssh in group1 namespace
>    ip netns exec group1 ip vrf exec 11 /usr/sbin/sshd -d
>    ssh to 1.1.1.1 via eth1
> 
>    ip netns exec group1 ip vrf exec 13 /usr/sbin/sshd -d
>    ssh to 1.1.1.1 via eth5
>    --> same namespace but different VRFs
> 
> 4. One ssh instance handles VRFs in group1 namespace
>    ip netns exec group1 ip vrf exec any /usr/sbin/sshd
> 
>    --> ssh to any address in the namespace works
> 
> References
> ----------
> [1] http://sourceforge.net/projects/linux-vrf
> 
> [2] http://www.spinics.net/lists/netdev/msg298368.html
> 
> [3] To build only enable core ipv4 code. Disable IPv6, netfilter, ipsec, etc.

It is still not clear how adding another level of abstraction
solves the scaling problem. Is it just because you can have one application
connect to multiple VRFs, so you don't need N routing daemons?


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  1:33 ` Stephen Hemminger
@ 2015-02-06  2:10   ` David Ahern
  2015-02-06  4:14     ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-06  2:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, ebiederm

On 2/5/15 6:33 PM, Stephen Hemminger wrote:
> It is still not clear how adding another level of abstraction
> solves the scaling problem. Is it just because you can have one application
> connect to multiple VRFs, so you don't need N routing daemons?
>
>

How do you provide a service in N VRFs? "Service" can be a daemon with a 
listen socket (e.g., bgpd, sshd) or a monitoring app (e.g., collectd, 
snmpd). For the current namespace only paradigm the options are:
1. replicate the process for each namespace (e.g., N instances of sshd, 
bgpd, collectd, snmpd, etc.)

2. a single process spawns a thread for each namespace

3. a single process opens a socket in each namespace

All of those options are rather heavyweight and the number of 'things' 
is linear with the number of VRFs. When multiplied by the number of 
services needed for a full-featured product the end result is a lot of 
wasted resources.

The idea here is to simplify things by allowing a single process to have 
a presence / provide a service in all VRFs within a namespace without 
the need to spawn a thread, socket or another process.

For example, 1 instance of a monitoring app can still see all of the 
netdevices in the namespace and in the VRF_ANY context can still report 
network configuration data. VRF unaware/agnostic L3/L4 apps (e.g., sshd) 
do not need to be modified and will be able to provide service through 
any interface. VRF aware apps (e.g., bgpd) might require modifications 
per the implementation of the VRF construct but they would be able to
provide service with a single instance.

Nesting VRFs within namespaces provides 2 synergistic constructs for 
building separation within a switch -- namespaces at the device layer 
(e.g., multiple virtual switches from a single physical switch) with 
VRFs at the L3 layer.

David


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05 23:12 ` roopa
@ 2015-02-06  2:19   ` David Ahern
  2015-02-09 16:38     ` roopa
  2015-02-10 10:43     ` Derek Fawcus
  2015-02-06  6:10   ` Shmulik Ladkani
  1 sibling, 2 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  2:19 UTC (permalink / raw)
  To: roopa; +Cc: netdev, ebiederm, Dinesh Dutt, Vipin Kumar, Nicolas Dichtel, hannes

On 2/5/15 4:12 PM, roopa wrote:
> Wondering if you have thought about some of the cases below in your
> approach to vrfs?
> - Leaking routes from one vrf to another

Can you give me an example of what you mean by this?

> - On failure, a route lookup in one vrf falling back to the global vrf (this,
> for example, can be done using throw if we used ip rules and route tables
> to do the same).
> - A route in one vrf pointing to a nexthop in another vrf

I have been more focused on the initial VRF infrastructure and have not 
spent too much time on these use cases or other route lookup features 
(e.g., allow an application to handle route lookup misses similar to arp 
misses, allow custom route lookup modules) that are needed to approach 
the feature richness provided by high end routers.

David


* Re: [RFC PATCH 19/29] net: vrf: Add vrf context to skb
  2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
  2015-02-05 13:45   ` Nicolas Dichtel
@ 2015-02-06  3:54   ` Eric W. Biederman
  2015-02-06  6:00     ` David Ahern
  1 sibling, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-02-06  3:54 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev

David Ahern <dsahern@gmail.com> writes:

> On ingress skb's inherit vrf context from the net_device. For TX skb's
> inherit the vrf context from the socket originating the packet. Update
> SKB related net_ctx macros to set vrf.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

As I recall, we had the conversation about modifying struct sk_buff when
implementing network namespaces, and it turned out to be a no go. It adds
too much overhead. As I recall, because skbs are on the fastpath, they are
things we want to make smaller.

> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>  include/linux/skbuff.h   |  7 ++++---
>  include/net/sock.h       |  2 ++
>  include/net/tcp.h        |  1 +
>  net/core/dev.c           |  1 +
>  net/core/fib_rules.c     |  2 ++
>  net/core/neighbour.c     |  2 ++
>  net/core/skbuff.c        | 12 ++++++++++++
>  net/ipv4/devinet.c       |  2 ++
>  net/ipv4/icmp.c          |  2 +-
>  net/ipv4/ip_output.c     |  2 ++
>  net/ipv4/syncookies.c    |  1 +
>  net/ipv4/tcp_ipv4.c      |  3 ++-
>  net/netlink/af_netlink.c | 12 ++++++++++++
>  13 files changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index a5dfef469d07..bdbee41e8032 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -522,6 +522,7 @@ struct sk_buff {
>  	};
>  	struct sock		*sk;
>  	struct net_device	*dev;
> +	__u32			vrf;
>  
>  	/*
>  	 * This is the control buffer. It is free to use for every


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  2:10   ` David Ahern
@ 2015-02-06  4:14     ` Eric W. Biederman
  2015-02-06  6:15       ` David Ahern
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-02-06  4:14 UTC (permalink / raw)
  To: David Ahern; +Cc: Stephen Hemminger, netdev

David Ahern <dsahern@gmail.com> writes:

> On 2/5/15 6:33 PM, Stephen Hemminger wrote:
>> It is still not clear how adding another level of abstraction
>> solves the scaling problem. Is it just because you can have one application
>> connect to multiple VRFs, so you don't need N routing daemons?
>>
>>
>
> How do you provide a service in N VRFs? "Service" can be a daemon with a listen
> socket (e.g., bgpd, sshd) or a monitoring app (e.g., collectd, snmpd). For the
> current namespace only paradigm the options are:
> 1. replicate the process for each namespace (e.g., N instances of sshd, bgpd,
> collectd, snmpd, etc.)
>
> 2. a single process spawns a thread for each namespace
>
> 3. a single process opens a socket in each namespace
>
> All of those options are rather heavyweight and the number of 'things' is linear
> with the number of VRFs. When multiplied by the number of services needed for a
> full-featured product the end result is a lot of wasted resources.

If all you want is a single listening socket there are other
implementation possibilities that are focused on solving just that
problem, and would be much more generally applicable.

> The idea here is to simplify things by allowing a single process to have a
> presence / provide a service in all VRFs within a namespace without the need to
> spawn a thread, socket or another process.
>
> For example, 1 instance of a monitoring app can still see all of the netdevices
> in the namespace and in the VRF_ANY context can still report network
> configuration data. VRF unaware/agnostic L3/L4 apps (e.g., sshd) do not need to
> be modified and will be able to provide service through any
> interface. 

*Blink*  sshd does not need to be modified????
Which insecure implementation on which planet?

You mean you are not interested in logging which ip and vrf pair a login
came from?  You are not interested in performing any reverse DNS
lookups?

I do believe you are strongly mistaken.  I can not imagine a case where
making it impossible to know where someone is coming from when they try
to login to any machine is at all desirable.

I think it is unrealistic to expect daemons in general to listen on all
interfaces and in all vrfs, and require trimming down the set of
interfaces inbound connections can come from with firewall rules.  That
just seems backwards.  Telling the daemons which interfaces/address to
listen on explicitly seems much more robust.

The objection about logging in-bound connections applies to every
listening daemon I can think of.  I can't see how you can possibly
seriously be proposing totally changing the networking environment of
applications and expecting those applications to work without changes.


I do think we can come up with an API that is much less awkward than what we
have today and that would allow minimal application changes, but I do not
believe we can get away with no application changes at all.

> VRF aware
> apps (e.g., bgpd) might require modifications per the implementation of the VRF
> construct but they would be able to provide service with a single
> instance.

A single service instance is all that is required with network
namespaces.

I do not see how code modifications that result in a slower network
stack can possibly solve any kind of scaling problem.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 19/29] net: vrf: Add vrf context to skb
  2015-02-06  3:54   ` Eric W. Biederman
@ 2015-02-06  6:00     ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  6:00 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev

On 2/5/15 8:54 PM, Eric W. Biederman wrote:
> David Ahern<dsahern@gmail.com>  writes:
>
>> >On ingress skb's inherit vrf context from the net_device. For TX skb's
>> >inherit the vrf context from the socket originating the packet. Update
>> >SKB related net_ctx macros to set vrf.
> Nacked-by: "Eric W. Biederman"<ebiederm@xmission.com>
>
> As I recall we had the conversation about modifying  struct sk_buff when
> implementing network namespaces and it turns out it is a no go.  It adds
> too much overhead.  As I recall skbs as they are on the fastpath are
> things we want to make smaller.
>

Ok, I will look into removing the need for this change.

David

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05 23:12 ` roopa
  2015-02-06  2:19   ` David Ahern
@ 2015-02-06  6:10   ` Shmulik Ladkani
  2015-02-09 15:54     ` roopa
  1 sibling, 1 reply; 119+ messages in thread
From: Shmulik Ladkani @ 2015-02-06  6:10 UTC (permalink / raw)
  To: roopa
  Cc: David Ahern, netdev, ebiederm, Dinesh Dutt, Vipin Kumar,
	Nicolas Dichtel, hannes, Eyal Birger

On Thu, 05 Feb 2015 15:12:57 -0800 roopa <roopa@cumulusnetworks.com> wrote:
> We have been playing with ip rules to implement vrfs. And the blocker 
> today is that we cannot bind a socket to a vrf (routing tables in this 
> case).

Hi Roopa,

One option would be using SO_MARK sockopt on that socket, and have an ip
rule which matches this mark to point to your table.
I don't know your exact use-cases, but you can play around with that
idea.

Regards,
Shmulik

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  4:14     ` Eric W. Biederman
@ 2015-02-06  6:15       ` David Ahern
  2015-02-06 15:08         ` Nicolas Dichtel
       [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
  0 siblings, 2 replies; 119+ messages in thread
From: David Ahern @ 2015-02-06  6:15 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen Hemminger, netdev

On 2/5/15 9:14 PM, Eric W. Biederman wrote:
> David Ahern <dsahern@gmail.com> writes:
>
>> On 2/5/15 6:33 PM, Stephen Hemminger wrote:
>>> It is still not clear how adding another level of abstraction
>>> solves the scaling problem. Is it just because you can have one application
>>> connect to multiple VRF's? so you don't need  N routing daemons?
>>>
>>>
>>
>> How do you provide a service in N VRFs? "Service" can be a daemon with a listen
>> socket (e.g., bgpd, sshd) or a monitoring app (e.g., collectd, snmpd). For the
>> current namespace only paradigm the options are:
>> 1. replicate the process for each namespace (e.g., N instances of sshd, bgpd,
>> collectd, snmpd, etc.)
>>
>> 2. a single process spawns a thread for each namespace
>>
>> 3. a single process opens a socket in each namespace
>>
>> All of those options are rather heavyweight and the number of 'things' is linear
>> with the number of VRFs. When multiplied by the number of services needed for a
>> full-featured product the end result is a lot of wasted resources.
>
> If all you want is a single listening socket there are other
> implementation possibilities that are focused on solving just that
> problem, and would be much more generally applicable.

These are examples of the higher level problem -- the current need for 
replicating processes/threads/sockets per namespace, not to mention the 
memory consumed by the creation of the namespace itself which is fairly 
high. In other words, the problem is more than just a listening socket 
of a single process.
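
For reference, option 3 above (one process, one socket per namespace) can be
sketched roughly as follows. This is an illustration under assumptions, not
code from the patch set: the helper name socket_in_netns() is hypothetical,
the namespace is assumed to be reachable through a path such as a file under
/var/run/netns/, and the caller is assumed to be single-threaded and to hold
CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a socket inside the network namespace referenced by ns_path,
 * then switch back to the original namespace.
 * Returns the socket fd on success or a negative errno value.
 * (Hypothetical helper; assumes a single-threaded caller.)
 */
static int socket_in_netns(const char *ns_path, int family, int type)
{
	int target, self, sock, err;

	target = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (target < 0)
		return -errno;

	/* Remember where we came from so we can return afterwards. */
	self = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (self < 0) {
		err = errno;
		close(target);
		return -err;
	}

	if (setns(target, CLONE_NEWNET) < 0) {
		err = errno;
		close(self);
		close(target);
		return -err;
	}

	/* The socket stays bound to the namespace it was created in. */
	sock = socket(family, type, 0);
	err = sock < 0 ? errno : 0;

	/* Always try to return to the original namespace. */
	if (setns(self, CLONE_NEWNET) < 0 && err == 0) {
		err = errno;
		close(sock);
	}

	close(self);
	close(target);
	return err ? -err : sock;
}
```

Calling this once per namespace gives N sockets for N VRFs, which is the
linear growth described above.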


>
>> The idea here is to simplify things by allowing a single process to have a
>> presence / provide a service in all VRFs within a namespace without the need to
>> spawn a thread, socket or another process.
>>
>> For example, 1 instance of a monitoring app can still see all of the netdevices
>> in the namespace and in the VRF_ANY context can still report network
>> configuration data. VRF unaware/agnostic L3/L4 apps (e.g., sshd) do not need to
>> be modified and will be able to provide service through any
>> interface.
>
> *Blink*  sshd does not need to be modified????
> Which insecure implementation on which planet?

That would be the current one -- in both cases. It is an example, Eric, 
(admittedly not a good one) that existing code does not *have* to be 
modified to run in a 'VRF any' context. It can be made VRF aware of course.

>
> You mean you are not interested in logging which ip and vrf pair a login
> came from?  You are not interested in performing any reverse DNS
> lookups?
>
> I do believe you are strongly mistaken.  I can not imagine a case where
> making it impossible to know where someone is coming from when they try
> to login to any machine is at all desirable.

Aren't you conflating two problems? Network namespaces do not require 
a separate DNS config for each namespace. A user may create 2+ network 
namespaces and have them share the same /etc/resolv.conf. Correct?

>
> I think it is unrealistic to expect daemons in general to listen on all
> interfaces and in all vrfs, and require trimming down the set of
> interfaces inbound connections can come from with firewall rules.  That
> just seems backwards.  Telling the daemons which interfaces/address to
> listen on explicitly seems much more robust.

Networking products with 1000+ interfaces? Physical, sub-interfaces, 
breakout ports, VLANs, SVIs, port channels, ...

>
> The objection about logging in-bound connections applies to every
> listening daemon I can think of.  I can't see how you can possibly
> seriously be proposing totally changing the networking environment of
> applications and expecting those applications to work without changes.

Nothing stops me from having xinetd launch /bin/bash as root for all 
connections to 666/tcp. Nothing about the Linux networking stack 
prevents someone from running telnet or ftp. i.e., the existing code base 
can be used in insecure ways.

Application code -- open source daemons -- can be modified to be VRF 
aware as needed. Kernel side VRF support would be made a CONFIG option 
that defaults off. The macros will ensure anything VRF related falls 
out, so server deployments would not be impacted.

>
>
> I do think we can come up with an API that is much less awkward than we
> have today, that would allow minimal application changes, but no
> application changes I do not believe.

Can we agree that no L2 apps should require a single line of code to be 
changed? If I create 4000 VRFs -- again an L3 construct -- not one L2 
application (socket based, monitoring, etc) should care. It should not 
have to be replicated or modified. For L3 and up they can be made VRF 
aware as needed, but that is an application problem.

>
>> VRF aware
>> apps (e.g., bgpd) might require modifications per the implementation of the VRF
>> construct but they would be able to provide service with a single
>> instance.
>
> A single service instance is all that is required with network
> namespaces.

N VRFs = N namespaces = N instances of every single process, where N is 
1024, 2048, 4096, and up. Someone has already done the analysis for 
quagga with 1024 instances showing what a huge waste of memory that is.

>
> I do not see how code modifications that result in a slower network
> stack can possibly solve any kind of scaling problem.

I'll see what I can do to remove the skb change. That is the only 
comment you have made about performance. Do you have other concerns 
about performance impacts of the higher level proposal -- s/struct 
net/struct net_ctx/ where net_ctx is a namespace and a VRF?

David

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  1:32   ` David Ahern
@ 2015-02-06  8:53     ` Nicolas Dichtel
  0 siblings, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-06  8:53 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

Le 06/02/2015 02:32, David Ahern a écrit :
> On 2/5/15 6:44 AM, Nicolas Dichtel wrote:
>> Le 05/02/2015 02:34, David Ahern a écrit :
>> [snip]
>>> This is accomplished by enhancing the current namespace checks to a
>>> broader network context that is both a namespace and a VRF id. The VRF
>>> id is a tag applied to relevant structures, an integer between 1 and 4095
>>> which allows for 4095 VRFs (could have 0 be the default VRF and then the
>>> range is 0-4095 = 4096 VRFs). (The limitation is arguably artificial. It
>>> is based on the genid scheme for versioning networking data which is a
>>> 32-bit integer. The VRF id is the lower 12 bits of the genid's.)
>> Would it be possible to avoid this artificial limit?
>> There could be scenarii with more than 4096 vrf.
>
> As I recall the genid was the only reason to put a limit on it. I know of one
> product with a higher limit (16k I believe), but I figured this was a reasonable
> start point for the discussion.
Sure. My point was to be able to extend this limit in the future.
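
To make the limit concrete, here is a small sketch of what "the VRF id is
the lower 12 bits of the genid" implies. The names and the exact packing are
assumptions for illustration; the patch set may lay the bits out differently.

```c
#include <stdint.h>

/* Assumed layout: [ 20-bit version counter | 12-bit VRF id ] */
#define VRF_BITS 12
#define VRF_MASK ((1u << VRF_BITS) - 1)	/* 0xfff, i.e. ids 0..4095 */

/* Combine a version counter and a VRF id into one 32-bit genid. */
static inline uint32_t genid_pack(uint32_t version, uint32_t vrf)
{
	return (version << VRF_BITS) | (vrf & VRF_MASK);
}

/* Extract the VRF id from the low bits. */
static inline uint32_t genid_vrf(uint32_t genid)
{
	return genid & VRF_MASK;
}

/* Extract the version counter from the high bits. */
static inline uint32_t genid_version(uint32_t genid)
{
	return genid >> VRF_BITS;
}
```

With this layout, raising the VRF limit means widening VRF_BITS, which in
turn shrinks the space left for the version counter.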

>>
>> Do you plan to have a way to dump or monitor VRF via netlink?
>
> What do you mean? There is no creation / deletion event. Are you referring to
> monitoring device changes -- device moved from one network context (namespace,
> vrf) to another?
I mean getting the list of existing vrf.

>
> The VRF id can be added as an attribute to all relevant netlink notifications.
I must think a bit more about this (is a VRF an "object" or an attribute?).

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 14/29] net: vrf: Introduce vrf header file
  2015-02-06  0:52     ` David Ahern
@ 2015-02-06  8:53       ` Nicolas Dichtel
  0 siblings, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-06  8:53 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm

Le 06/02/2015 01:52, David Ahern a écrit :
> On 2/5/15 6:44 AM, Nicolas Dichtel wrote:
>> Le 05/02/2015 02:34, David Ahern a écrit :
[snip]
>>> +#define VRF_ANY        0xffff
>> It could be useful to expose this value to userland.
>
> Maybe. I was thinking VRF_ANY should stay kernel side only and have a sockopt
> value of -1 mean VRF_ANY.
Better to have a define instead of a magic value (-1) ;-)
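
Something along these lines, as a rough sketch. VRF_ANY and the 1-4095 range
come from the patch set; VRF_SOCKOPT_ANY and vrf_from_sockopt() are
hypothetical names used only to illustrate the idea.

```c
#include <stdint.h>

#define VRF_ANY		0xffff	/* kernel-side value from the patch */
#define VRF_SOCKOPT_ANY	(-1)	/* hypothetical userspace-visible name */

/*
 * Translate a sockopt argument into a kernel VRF id.
 * Returns 0 on success, -1 if the value is out of range.
 */
static int vrf_from_sockopt(int val, uint32_t *vrf)
{
	if (val == VRF_SOCKOPT_ANY) {
		*vrf = VRF_ANY;
		return 0;
	}
	if (val < 1 || val > 4095)
		return -1;
	*vrf = (uint32_t)val;
	return 0;
}
```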

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  6:15       ` David Ahern
@ 2015-02-06 15:08         ` Nicolas Dichtel
       [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
  1 sibling, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-06 15:08 UTC (permalink / raw)
  To: David Ahern, Eric W. Biederman; +Cc: Stephen Hemminger, netdev

Le 06/02/2015 07:15, David Ahern a écrit :
[snip]
> Application code -- open source daemons -- can be modified to be VRF aware as
> needed. Kernel side VRF support would be made a CONFIG option that defaults off.
> The macros will ensure anything VRF related falls out, so server deployments
> would not be impacted.
Most distributions enable all kernel options.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (32 preceding siblings ...)
  2015-02-06  1:33 ` Stephen Hemminger
@ 2015-02-06 15:10 ` Nicolas Dichtel
  2015-02-06 20:50 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-06 15:10 UTC (permalink / raw)
  To: David Ahern, netdev; +Cc: ebiederm, Roopa Prabhu

Le 05/02/2015 02:34, David Ahern a écrit :
[snip]
>
> Design
> ------
> Namespaces provide excellent separation of the networking stack from the
> netdevices and up. The intent of VRFs is to provide an additional,
> logical separation at the L3 layer within a namespace.
>
>     +----------------------------------------------------------+
>     | Namespace foo                                            |
>     |                         +---------------+                |
>     |          +------+       | L3/L4 service |                |
>     |          | lldp |       |   (VRF any)   |                |
>     |          +------+       +---------------+                |
>     |                                                          |
>     |                             +-------------------------+  |
>     |                             | VRF M                   |  |
>     |  +---------------------+  +-------------------------+ |  |
>     |  | VRF 1 (default)     |  | VRF N                   | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |  | L3/L4 service |  |  |    | L3/L4 service |    | |  |
>     |  |  | (VRF unaware) |  |  |    | (VRF unaware) |    | |  |
>     |  |  +---------------+  |  |    +---------------+    | |  |
>     |  |                     |  |                         | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  || FIB | | neighbor | |  |  | FIB | | neighbor |   | |  |
>     |  |+-----+ +----------+ |  |  +-----+ +----------+   | |  |
>     |  |                     |  |                         |-+  |
>     |  | {dev 1}  {dev 2}    |  | {dev 3} {dev 4} {dev 5} |    |
>     |  +---------------------+  +-------------------------+    |
>     +----------------------------------------------------------+
>
Another question: how is the loopback managed?
If I understand correctly, there is only one loopback for all VRFs.
If I add an address on this loopback, is this address visible from all VRFs?
Can an app in VRF1 talk to an app in VRF2 via the loopback?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (33 preceding siblings ...)
  2015-02-06 15:10 ` Nicolas Dichtel
@ 2015-02-06 20:50 ` Eric W. Biederman
  2015-02-09  0:36   ` David Ahern
       [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
  2015-02-10  0:53 ` [RFC PATCH 00/29] net: VRF support Thomas Graf
  2016-05-25 16:04 ` Chenna
  36 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-02-06 20:50 UTC (permalink / raw)
  To: David Ahern
  Cc: Stephen Hemminger, netdev, Nicolas Dichtel, roopa, hannes,
	Dinesh Dutt, Vipin Kumar, Nicolas Dichtel, Shmulik Ladkani


David, looking at your patches and reading through your code, I think I
understand what you are proposing; let me see if I can sum it up.

Semantics:
   - The same as a network namespace.
   - Addition of a VRF-ANY context
   - No per VRF settings
   - No creation or removal operation.

Implementation:
   - Add a VRF-id to every affected data structure.

This implies that you see the current implementation of network
namespaces to be space inefficient, and that you think you can remove
this inefficiency by simply removing the settings and the associated
proc files. 

Given that you have chosen to keep the same semantics as network
namespaces for everything except for settings this raises the questions:
- Are the settings and their associated proc files what actually cause
  the size cost you see in network namespaces?
- Can we, instead of reimplementing network namespaces, optimize
  the current implementation?

We need measurements to answer either of those questions and I think
before proceeding we need to answer those questions.


Beyond that I want to point out that in general a data structure that
has a tag on every member is going to have a larger memory foot print
per entry, contain more entries, and by virtue of both of those be less
memory efficient and less time efficient to use.   So it is not clear
that an implementation that tags everything with a vrf-id will actually
be lighter weight.

Also there is a concern that placing tags in every data structure may be
significantly more error prone to implement (as it is one more thing to
keep track of), and that can impact the maintainability and the
correctness of the code for everyone.


The standard that was applied to the network namespace was that
it did not have any measurable performance impact when enabled.  The
measurements taken at the time did not show a slowdown when a 1Gig
interface was placed in a network namespace, compared to running an
unpatched kernel.

I suspect your extra layer of indirection to get to struct net in
addition to touching struct skb will give you a noticeable performance
impact.


I have another concern.  I don't think it is wise to have a data
structure modified two different ways to deal with network namespaces
and vrfs.  For maintainability and our own sanity we should pick which
version that we judge to be the most efficient implementation and go
with it.



The architecture I imagine for using network namespaces as vrfs, for
devices that perform layer 2 bridging and layer 3 routing, is:

port1 port2 port3 port4 port5 port6 port7 port8 port9 port10
  |     |     |     |     |     |     |     |     |     |
  +-----+-----+-----+-----+-----+-----+-----+-----+-----+
 /                   Link Aggregation                    \
+                                                         +
|                        Bridging                         |
+----------------------------+----------------------------+
                             |
                          cpu port
                             |
       +---------------------+---------------------+
      /     +---------------/ \---------------+     \
     /     /     +---------/   \---------+     \     \
    /     /     /     +---/     \---+     \     \     \
   /     /     /     /    |     |    \     \     \     \
  |     |     |     |     |     |     |     |     |     |
vlan1 vlan2 vlan3 vlan4 vlan5 vlan6 vlan7 vlan8 vlan9 vlan10
  |     |     |     |     |     |     |     |     |     |   
+-+-----+-----+-----+-----+-+ +-+-----+-----+-----+-----+-+ 
|    network namespace 1    | |    network namespace2     |
+---------------------------+ +---------------------------+

Traffic to and from the rest of the world comes through the
external ports.  

The traffic is then processed at layer two including link
aggregation, bridging and classifying which vlan the traffic
belongs in.

If the traffic needs to be routed it then comes up to the cpu port.
The cpu port looks at the tags on the traffic and places it into
the appropriate vlan device.

From the various vlans the traffic is then routed according
to the routing table of whichever network namespace the vlan
device is in.
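
As a toy model of the path just described (tag, then vlan device, then the
routing table of that device's namespace), assuming the layout in the
diagram above where vlan1-5 sit in namespace 1 and vlan6-10 in namespace 2.
The struct and function names are hypothetical and nothing here is kernel
code:

```c
#include <stddef.h>

/* One vlan device and the network namespace it has been moved into. */
struct vlan_dev {
	int vlan_id;
	int netns;
};

/* Mirrors the diagram: vlan1-5 in namespace 1, vlan6-10 in namespace 2. */
static const struct vlan_dev vlan_devs[] = {
	{ 1, 1 }, { 2, 1 }, { 3, 1 }, { 4, 1 }, { 5, 1 },
	{ 6, 2 }, { 7, 2 }, { 8, 2 }, { 9, 2 }, { 10, 2 },
};

/*
 * Return the namespace whose routing table handles a frame carrying
 * this VLAN tag, or -1 if no vlan device matches.
 */
static int netns_for_vlan(int vlan_id)
{
	size_t i;

	for (i = 0; i < sizeof(vlan_devs) / sizeof(vlan_devs[0]); i++)
		if (vlan_devs[i].vlan_id == vlan_id)
			return vlan_devs[i].netns;
	return -1;
}
```

Moving a vlan device to another namespace only changes its netns entry; the
L2 side never sees the difference.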

There are stateless offloads to this in modern hardware but this is a
reasonable model how all of this works semantically.

As such the vlan devices can be moved between network namespaces without
affecting any layer two monitoring or work that happens on the lower
level devices.  The practical restriction is that L2 and L3 need to be
handled on different network devices.

This split of network devices ensures that L2 code that works today
should not need any changes or in any way be concerned about network
namespaces or which ones the parent devices are in.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06 20:50 ` Eric W. Biederman
@ 2015-02-09  0:36   ` David Ahern
  2015-02-09 11:30     ` Derek Fawcus
       [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
  1 sibling, 1 reply; 119+ messages in thread
From: David Ahern @ 2015-02-09  0:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen Hemminger, netdev, Nicolas Dichtel, roopa, hannes,
	Dinesh Dutt, Vipin Kumar, Shmulik Ladkani

On 2/6/15 1:50 PM, Eric W. Biederman wrote:
>
> David looking at your patches and reading through your code I think I
> understand what you are proposing, let me see if I can sum it up.
>
> Semantics:
>     - The same as a network namespace.
>     - Addition of a VRF-ANY context
>     - No per VRF settings
>     - No creation or removal operation.

Yes. Plus what I see is an important feature: the ability to layer VRFs 
and network namespaces.

Network namespaces can be used to create smaller, logical switches 
within a single physical switch. (i.e., A network namespace is on par 
with what Cisco calls a VDC, virtual device context, for its Nexus 7000 
switches -- logical separation at the device layer front panel port).

Layering VRFs (L3 separation) within a network namespace (L1 separation) 
provides some nice end-user features.

>
> Implementation:
>     - Add a VRF-id to every affected data structure.
>
> This implies that you see the current implementation of network
> namespaces to be space inefficient, and that you think you can remove
> this inefficiency by simply removing the settings and the associated
> proc files.

Not exactly. I see the current namespace implementation as an excellent 
L1 separation construct, but not an L3 construct.

>
> Given that you have chosen to keep the same semantics as network
> namespaces for everything except for settings this raises the questions:
> - Are the settings and their associated proc files what actually cause
>    the size cost you see in network namespaces?
> - Can we, instead of reimplementing network namespaces, optimize
>    the current implementation?

The namespace memory consumption is a side problem to the bigger problem 
of how the isolation of namespaces affects processes (the need to have a 
presence in each namespace).

What I was targeting is a trade-off. To make an L3 separation efficient 
from a large scale* perspective one needs to give up something -- here 
it is per-VRF procfs settings. Replicating procfs tree for each 
namespace does have a high cost.

* scale here meaning VRFs from 1 to N, N for current products goes up to 
4000, though I know of one that has mentioned 16k VRFs.

>
> We need measurements to answer either of those questions and I think
> before proceeding we need to answer those questions.

agreed.


> Beyond that I want to point out that in general a data structure that
> has a tag on every member is going to have a larger memory foot print
> per entry, contain more entries, and by virtue of both of those be less
> memory efficient and less time efficient to use.   So it is not clear
> that an implementation that tags everything with a vrf-id will actually
> be lighter weight.

The memory hit for a network namespace is >100k (yes, CONFIG option 
dependent and that 100k is based on a 3.10 kernel which is higher than 
what was measured for a 3.4 kernel).

This proposal adds a 4-byte tag to netdevices, sockets and tasks (skb's 
are out per a prior email). So yes, there will be a point at which the 
number of netdevices (logical + physical), plus tasks and sockets, will 
make the memory hit of a VRF tag on par with namespace overhead. But the 
VRF tagging alleviates the need to replicate processes/multiple 
sockets/threads, so in the big picture I can't see how the overall hit to 
memory is higher with a VRF id tag.
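
A back-of-the-envelope version of that trade-off, assuming exactly 100,000
bytes per namespace (the figure above is ">100k" and config dependent) and a
4-byte tag with no padding. The function names are made up for this sketch,
and it deliberately ignores the per-process and per-socket replication cost,
which only makes the namespace side worse.

```c
#include <stdint.h>

/* Memory spent on namespaces when each VRF is its own namespace. */
static uint64_t namespace_overhead(uint64_t n_vrfs, uint64_t ns_bytes)
{
	return n_vrfs * ns_bytes;
}

/* Memory spent on tags when every netdevice, socket and task is tagged. */
static uint64_t tag_overhead(uint64_t n_objects, uint64_t tag_bytes)
{
	return n_objects * tag_bytes;
}

/* Number of tagged objects at which both approaches cost the same. */
static uint64_t break_even_objects(uint64_t n_vrfs, uint64_t ns_bytes,
				   uint64_t tag_bytes)
{
	return namespace_overhead(n_vrfs, ns_bytes) / tag_bytes;
}
```

Under those assumptions, 1024 VRFs as namespaces cost about 102 MB, and the
tags only reach that once there are roughly 25.6 million tagged objects.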

>
> Also there is a concern that placing tags in every data structure may be
> significantly more error prone to implement (as it is one more thing to
> keep track of), and that can impact the maintainability and the
> correctness of the code for everyone.

I don't agree with this. You have already done the groundwork here by 
plumbing through the namespace checks. Adding a vrf id has not proven to 
be a huge problem. The patch changes are highly repetitive because again 
it can leverage the namespace changes you have done.

> The standard that was applied to the network namespace was that
> it did not have any measurable performance impact when enabled.  The
> measurements taken at the time did not show a slowdown when a 1Gig
> interface was placed in a network namespace, compared to running an
> unpatched kernel.

Sure. I will build kernels at the commit id my patches are based on and 
one with my changes and do a comparison. Virtual machines on KVM 
emphasize the performance effects so I will compare a few netperf runs 
with and without my changes. On a newer 3.x kernel I typically see 
network throughput rates in 15 to 16 Gbps range (though H/W dependent), 
so this far exceeds the 1G rate. Does that sound reasonable?

>
> I suspect your extra layer of indirection to get to struct net in
> addition to touching struct skb will give you a noticeable performance
> impact.

I don't understand the 'extra layer of indirection' comment. I don't see 
the indirection, I see an extra comparison, i.e., from net_eq to net_eq + 
(vrf_eq || vrf_eq_any).

From a struct comparison it has gone from:

struct net_device {
     ...
     struct net              *nd_net;
     ...
};

to

struct net_ctx {
     struct net              *net;
     __u32                   vrf;
};

struct net_device {
     ...
     struct net_ctx          net_ctx;
     ...
};
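
To illustrate the "net_eq + (vrf_eq || vrf_eq_any)" comparison, here is a
userspace sketch with stand-in types. The helper bodies are my assumption of
what such checks would look like; the real macros in the patch set may be
written differently. VRF_ANY is the value from the vrf header patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VRF_ANY 0xffff	/* value from the vrf header patch */

struct net { int id; };	/* stand-in for the kernel's struct net */

struct net_ctx {
	struct net *net;
	uint32_t vrf;
};

/* Same namespace? */
static bool net_eq(const struct net *a, const struct net *b)
{
	return a == b;
}

/* Same VRF id? */
static bool vrf_eq(uint32_t a, uint32_t b)
{
	return a == b;
}

/* Is either side in the wildcard "any VRF" context? */
static bool vrf_eq_any(uint32_t a, uint32_t b)
{
	return a == VRF_ANY || b == VRF_ANY;
}

/* Namespace must match; VRF must match or one side must be VRF_ANY. */
static bool net_ctx_eq(const struct net_ctx *a, const struct net_ctx *b)
{
	return net_eq(a->net, b->net) &&
	       (vrf_eq(a->vrf, b->vrf) || vrf_eq_any(a->vrf, b->vrf));
}
```

The namespace pointer is embedded in the context rather than reached through
another pointer, so the added cost in this sketch is the integer comparison.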


>
>
> I have another concern.  I don't think it is wise to have a data
> structure modified two different ways to deal with network namespaces
> and vrfs.  For maintainability and our own sanity we should pick which
> version that we judge to be the most efficient implementation and go
> with it.

You lost me. What data structure is modified 2 different ways? VRFs are 
a sub-context of a namespace.

>
>
>
> The architecture I imagine for using network namespaces as vrfs for
> devices that perform layer 2 bridging and layer 3 routing.
>
> port1 port2 port3 port4 port5 port6 port7 port8 port9 port10
>    |     |     |     |     |     |     |     |     |     |
>    +-----+-----+-----+-----+-----+-----+-----+-----+-----+
>   /                   Link Aggregation                    \
> +                                                         +
> |                        Bridging                         |
> +----------------------------+----------------------------+
>                               |
>                            cpu port
>                               |
>         +---------------------+---------------------+
>        /     +---------------/ \---------------+     \
>       /     /     +---------/   \---------+     \     \
>      /     /     /     +---/     \---+     \     \     \
>     /     /     /     /    |     |    \     \     \     \
>    |     |     |     |     |     |     |     |     |     |
> vlan1 vlan2 vlan3 vlan4 vlan5 vlan6 vlan7 vlan8 vlan9 vlan10
>    |     |     |     |     |     |     |     |     |     |
> +-+-----+-----+-----+-----+-+ +-+-----+-----+-----+-----+-+
> |    network namespace 1    | |    network namespace2     |
> +---------------------------+ +---------------------------+
>
> Traffic to and from the rest of the world comes through the
> external ports.
>
> The traffic is then processed at layer two including link
> aggregation, bridging and classifying which vlan the traffic
> belongs in.
>
> If the traffic needs to be routed it then comes up to the cpu port.
> The cpu port looks at the tags on the traffic and places it into
> the appropriate vlan device.
>
>  From the various vlans the traffic is then routed according
> to the routing table of whichever network namespace the vlan
> device is in.
>
> There are stateless offloads to this in modern hardware but this is a
> reasonable model how all of this works semantically.
>
> As such the vlan devices can be moved between network namespaces without
> affecting any layer two monitoring or work that happens on the lower
> level devices.  The practical restriction is that L2 and L3 need to be
> handled on different network devices.
>
> This split of network devices ensures that L2 code that works today
> should not need any changes or in any way be concerned about network
> namespaces or that the parent devices are in.

9+ months ago I had considered something similar. I'll try to amend your 
picture to show my concept:


port1 port2 port3 port4 port5 port6 port7 port8 port9 port10

   |     |     |     |     |     |     |     |     |     |
   +-----+-----+-----+-----+-----+-----+-----+-----+-----+
  /                    Link Aggregation                   \
+                                                         +
|                         Bridging                        |
+-----------------------------+---------------------------+
                               |
+-----------------------------+----------------------------+
|default namespace            |                            |
| (init_net)              NIC driver                       |
|                             |                            |
|  +----+----+-----+-----+-------+----+----+----+----+     |
| eth1 eth2 eth3  eth4 eth5     eth6 eth7 eth8 eth9 eth10  |
|  |    |    |     |     |        |    |    |    |    |    |
+-----------------------------+----------------------------+
    |    |    |     |     |        |    |    |    |    |
+--+----+----+-----+-----+---+ +--+----+----+----+----+----+
|  |    |    |     |     |   | |  |    |    |    |    |    |
| seth1 |   seth3  |   seth5 | | seth6 |   seth8 |   seth10|
|      seth2      seth4      | |      seth7     seth9      |
|                            | |                           |
|    network namespace 1     | |    network namespace 2    |
+----------------------------+ +---------------------------+

Essentially netdevices for front panel ports exist in the default 
namespace (init_net). L2 processes, monitoring processes (collectd, snmp 
agents for devices, etc) and such would run here. From there "shadow 
devices" (the 's' on the eth pairs) are created for namespaces where the 
path between real and shadow is similar to how veth pairs work. In the 
end this approach seemed to be a rather complex solution playing a lot 
of games so I abandoned it in favor of the approach in this patch set -- 
adding a VRF id to a network context.

The patch diff might be large but almost all of it is converting the 
existing struct net passing and net_eq checks to the broader struct 
net_ctx and net_ctx_eq comparisons. Really the change rides on top of 
what you have done for namespaces.

As for your proposal with VLAN based tagging, I do not understand the 
packet path from driver for front panel ports to namespace based 
netdevices. The VLAN sorting, and hence VRF sorting, is done in H/W? So 
there are netdevices in init_net the driver uses and then VLAN devices 
in the namespaces -- so those would correspond to what I called a shadow 
device? If the packets are also VLAN tagged we have nested tagging -- 
one for the port and one for the VRF?

Also, doesn't the VLAN design limit number of VRFs to 4096? My current 
patch set might limit it to 4096 but fix the genid piece (IPv6 seems to 
have removed genid comparisons between 3.17 and 3.19 -- need to look 
into that) and it becomes a 32-bit tag which is a huge range for VRFs.

David

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-09  0:36   ` David Ahern
@ 2015-02-09 11:30     ` Derek Fawcus
  0 siblings, 0 replies; 119+ messages in thread
From: Derek Fawcus @ 2015-02-09 11:30 UTC (permalink / raw)
  To: David Ahern
  Cc: Eric W. Biederman, Stephen Hemminger, netdev, Nicolas Dichtel,
	roopa, hannes, Dinesh Dutt, Vipin Kumar, Shmulik Ladkani

On Sun, Feb 08, 2015 at 05:36:05pm -0700, David Ahern wrote:
> To make an L3 separation efficient from a large scale* perspective one
> needs to give up something
[snip]
> 
> * scale here meaning VRFs from 1 to N, N for current products goes up to
> 4000, though I know of that has mentioned 16k VRFs.

FWIW - There is at least one router vendor which allows for the creation of
       more than 4096 VRFs.  Some of its products have a specific limitation
       which may be less than the architectural maximum.

DF

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  6:10   ` Shmulik Ladkani
@ 2015-02-09 15:54     ` roopa
  2015-02-11  7:42       ` Shmulik Ladkani
  0 siblings, 1 reply; 119+ messages in thread
From: roopa @ 2015-02-09 15:54 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: David Ahern, netdev, ebiederm, Dinesh Dutt, Vipin Kumar,
	Nicolas Dichtel, hannes, Eyal Birger

On 2/5/15, 10:10 PM, Shmulik Ladkani wrote:
> On Thu, 05 Feb 2015 15:12:57 -0800 roopa <roopa@cumulusnetworks.com> wrote:
>> We have been playing with ip rules to implement vrfs. And the blocker
>> today is that we cannot bind a socket to a vrf (routing tables in this
>> case).
> Hi Roopa,
>
> One option would be using SO_MARK sockopt on that socket, and have an ip
> rule which matches this mark to point to your table.
> I don't know your exact use-cases, but you can play around with that
> idea.
Sorry for getting back late on this.
Yes, SO_MARK and 'ip rule fwmark' is an option to bind tx from a socket
to a table. But there are more things that will be needed on the rx side,
and at this point we are not considering netfilter marking of the
ingress packets, so we haven't been following this option.

Thanks.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  2:19   ` David Ahern
@ 2015-02-09 16:38     ` roopa
  2015-02-10 10:43     ` Derek Fawcus
  1 sibling, 0 replies; 119+ messages in thread
From: roopa @ 2015-02-09 16:38 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, ebiederm, Dinesh Dutt, Vipin Kumar, Nicolas Dichtel, hannes

On 2/5/15, 6:19 PM, David Ahern wrote:
> On 2/5/15 4:12 PM, roopa wrote:
>> Wondering if you have thought about some of the the below cases in your
>> approach to vrfs ?.
>> - Leaking routes from one vrf to another
>
> Can you give me an example of what you mean by this?

Sorry for the delay in my response.
I can't say I know a lot about types of vrfs and vrf route leaking today :).
Hoping that I will catch up someday ;)

But I think it is needed to deploy VRF-lite (when MPLS is not in the
picture), and there may be a need to leak routes from one VRF to another.
The leaking can be done with static routes or dynamically using a routing
protocol, I think (there is a lot of content on the web about vrf route
leaking).

I have been in discussions where namespaces are considered for vrfs,
and along the same lines as above there have been discussions on
having the ability to add a route in one vrf with a nexthop in another vrf.
>
>> - route lookup in one vrf on failure to fallback to the global vrf (This
>> for example can be done using throw if we used ip rules and route tables
>> to do the same).
>> - A route in one vrf pointing to a nexthop in another vrf
>
> I have been more focused on the initial VRF infrastructure and have 
> not spent too much time on these use cases or other route lookup 
> features (e.g., allow an application to handle route lookup misses 
> similar to arp misses, allow custom route lookup modules) that are 
> needed to approach the feature richness provided by high end routers.
Agreed. Thanks for opening up the vrf discussion!
It will be good to consider all types of vrf deployments with any vrf 
solution we consider for the kernel.


Thanks,
Roopa

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
       [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
@ 2015-02-09 20:48           ` Nicolas Dichtel
  2015-02-11  4:14           ` David Ahern
  1 sibling, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-09 20:48 UTC (permalink / raw)
  To: Eric W. Biederman, David Ahern
  Cc: Stephen Hemminger, netdev, roopa, hannes, Dinesh Dutt, Vipin Kumar

Le 06/02/2015 22:22, Eric W. Biederman a écrit :
>
>
> Having looked at this problem, I am currently convinced that network
> namespaces can be improved to the point where they can reasonably
> act as VRFS.
We are using netns this way at 6WIND.

>
> Further I think code maintenance argues that this VRF proposal is a bad
> direction to go.
>
> David Ahern <dsahern@gmail.com> writes:
>
>> On 2/5/15 9:14 PM, Eric W. Biederman wrote:
>>> David Ahern <dsahern@gmail.com> writes:
>>>
>>>> On 2/5/15 6:33 PM, Stephen Hemminger wrote:
>>>>> It is still not clear how adding another level of abstraction
>>>>> solves the scaling problem. Is it just because you can have one application
>>>>> connect to multiple VRF's? so you don't need  N routing daemons?
>
>>>> All of those options are rather heavyweight and the number of 'things' is linear
>>>> with the number of VRFs. When multiplied by the number of services needed for a
>>>> full-featured product the end result is a lot of wasted resources.
>>>
>>> If all you want is a single listening socket there are other
>>> implementation possibilities that are focused on solving just that
>>> problem, and would be much more generally applicable.
>>
>> These are examples of the higher level problem -- the current need for
>> replicating processes/threads/sockets per namespace, not to mention the memory
>> consumed by the creation of the namespace itself which is fairly high. i.e., The
>> problem is more than just a listening socket of a single process.
>
> Sometimes replication is simpler and more efficient, so I do not believe
> this is a fundamental design problem.
>
> That said.  Having N listening sockets is arguably a mis-feature of the
> berkely sockets layer, and is fixable by adding support for adding
> features for listening sockets to listen on more than one address.  So
> by adding an feature to teach a listening socket how to listen on
> additional addresses that is fixable.  SCTP and MPTCP have even done
> some work in that area, so it may just be a matter of generalizing
> earlier solutions.  More likely we would want to build on Nicolas
> Dichtels work on adding ids to other network namespaces and have
> our VRF any sockets listen on any network namespace that we an for.
I agree, it would be great to have this kind of feature. Any help to
achieve it is welcome :)

>
> Similarly we can build on Nicolas Dichtel's work of implementing in
> kernel ids for other network namespaces to provide proc files or
> netlink messages that report on multiple network namespaces at once.
> Assuming of course that such interfaces are shown to be worth
> implementing.
Same here. At the very least, we should give it a try to see where things
stand and which problems could block it.

>
> I believe that with small focused changes we can make the existing
> userspace API efficient to work with for programs that want to work
> with multiple network namespaces (or VRFs) at once.
Yes, some work remains in this area.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (34 preceding siblings ...)
  2015-02-06 20:50 ` Eric W. Biederman
@ 2015-02-10  0:53 ` Thomas Graf
  2015-02-10 20:54   ` David Ahern
  2016-05-25 16:04 ` Chenna
  36 siblings, 1 reply; 119+ messages in thread
From: Thomas Graf @ 2015-02-10  0:53 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, ebiederm

On 02/04/15 at 06:34pm, David Ahern wrote:
> Namespaces provide excellent separation of the networking stack from the
> netdevices and up. The intent of VRFs is to provide an additional,
> logical separation at the L3 layer within a namespace.

What you ask for seems to be L3 micro segmentation inside netns. I
would argue that we already support this through multiple routing
tables. I would prefer improving the existing architecture to cover
your use cases: Increase the number of supported tables, extend
routing rules as needed, ...

> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
> to this file (if preferred this can be made a prctl to change the VRF id).
> This allows services to be launched in a VRF context using ip, similar to
> what is done for network namespaces.
>     e.g., ip vrf exec 99 /usr/sbin/sshd

I think such a classification should occur through cgroups instead
of touching PIDs directly.

> Network devices belong to a single VRF context which defaults to VRF 1.
> They can be assigned to another VRF using IFLA_VRF attribute in link
> messages. Similarly the VRF assignment is returned in the IFLA_VRF
> attribute. The ip command has been modified to display the VRF id of a
> device. L2 applications like lldp are not VRF aware and still work through
> through all network devices within the namespace.

I believe that binding net_devices to VRFs is misleading and the
concept by itself is non-scalable. You do not want to create 10k
net_devices for your overlay of choice just to tie them to a
particular VRF. You want to store the VRF identifier as metadata and
have a stateless classifier include it in the VRF decision. See the
recent VXLAN-GBP work.

You could either map whatever selects the VRF to the mark or support it
natively in the routing rules classifier.

An obvious alternative is OVS. What you describe can be implemented in
a scalable manner using OVS and mark. I understand that OVS is not for
everybody but it gets a fundamental principle right: Scalability
demands programmability.

I don’t think we should be adding a new single purpose metadata field
to arbitrary structures for every new use case that comes up. We
should work on programmability which increases flexibility and allows
decoupling application interest from networking details.

> On RX skbs get their VRF context from the netdevice the packet is received
> on. For TX the VRF context for an skb is taken from the socket. The
> intention is for L3/raw sockets to be able to set the VRF context for a
> packet TX using cmsg (not coded in this patch set).

Specifying L3 context in cmsg seems very broken to me. We do not want
to bind applications any closer to underlying networking infrastructure.
In fact, we should do the opposite and decouple this completely.

> The 'any' context applies to listen sockets only; connected sockets are in
> a VRF context. Child sockets accepted by the daemon acquire the VRF context
> of the network device the connection originated on.

Linux considers an address local regardless of the interface the packet
was received on.  So you would accept the packet on any interface and
then bind it to the VRF of that interface even though the route for it
might be on a different interface.

This really belongs in routing rules from my perspective, which take
mark and the cgroup context into account.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-06  2:19   ` David Ahern
  2015-02-09 16:38     ` roopa
@ 2015-02-10 10:43     ` Derek Fawcus
  1 sibling, 0 replies; 119+ messages in thread
From: Derek Fawcus @ 2015-02-10 10:43 UTC (permalink / raw)
  To: David Ahern
  Cc: roopa, netdev, ebiederm, Dinesh Dutt, Vipin Kumar,
	Nicolas Dichtel, hannes

On Thu, Feb 05, 2015 at 07:19:41pm -0700, David Ahern wrote:
> On 2/5/15 4:12 PM, roopa wrote:
> >Wondering if you have thought about some of the the below cases in your
> >approach to vrfs ?.
> >- Leaking routes from one vrf to another
> 
> Can you give me an example of what you mean by this?

I believe the following sort of thing is being referred to:

   http://packetlife.net/blog/2010/mar/29/inter-vrf-routing-vrf-lite/
   http://packetlife.net/blog/2013/sep/26/vrf-export-maps/
   https://rekrowten.wordpress.com/2012/09/24/route-leak-between-vrfs-with-import-maps-and-export-maps/

A Google search for "bgp import OR export" brings up a bunch of hits.

DF

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-10  0:53 ` [RFC PATCH 00/29] net: VRF support Thomas Graf
@ 2015-02-10 20:54   ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-10 20:54 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, ebiederm

On 2/9/15 5:53 PM, Thomas Graf wrote:
> On 02/04/15 at 06:34pm, David Ahern wrote:
>> Namespaces provide excellent separation of the networking stack from the
>> netdevices and up. The intent of VRFs is to provide an additional,
>> logical separation at the L3 layer within a namespace.
>
> What you ask for seems to be L3 micro segmentation inside netns. I

I would not label it 'micro' but yes, an L3 separation within an L1 separation.

> would argue that we already support this through multiple routing
> tables. I would prefer improving the existing architecture to cover
> your use cases: Increase the number of supported tables, extend
> routing rules as needed, ...

I've seen that response for VRFs as well. I have not personally tried 
it, but from what I have read it does not work well. I think Roopa 
responded that Cumulus has spent time on that path and has hit some 
roadblocks.

>
>> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
>> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
>> to this file (if preferred this can be made a prctl to change the VRF id).
>> This allows services to be launched in a VRF context using ip, similar to
>> what is done for network namespaces.
>>      e.g., ip vrf exec 99 /usr/sbin/sshd
>
> I think such as classification should occur through cgroups instead
> of touching PIDs directly.

That is an interesting idea -- using cgroups for task labeling. It 
presents a creation / deletion event for VRFs which I was trying to 
avoid, and there will be some amount of overhead with a cgroup. I'll 
take a look at that option when I get some time.

As far as the current proposal goes, I am treating VRF as part of a network 
context. Today 'ip netns' is used to run a command in a specific network 
namespace; the proposal with the VRF layering is to add a vrf context 
within a namespace so in keeping with how 'ip netns' works the above 
syntax allows a user to supply both a network namespace + VRF for 
running a command.

>
>> Network devices belong to a single VRF context which defaults to VRF 1.
>> They can be assigned to another VRF using IFLA_VRF attribute in link
>> messages. Similarly the VRF assignment is returned in the IFLA_VRF
>> attribute. The ip command has been modified to display the VRF id of a
>> device. L2 applications like lldp are not VRF aware and still work through
>> through all network devices within the namespace.
>
> I believe that binding net_devices to VRFs is misleading and the
> concept by itself is non-scalable. You do not want to create 10k
> net_devices for your overlay of choice just to tie them to a
> particular VRF. You want to store the VRF identifier as metadata and
> have a stateless classifier included it in the VRF decision. See the
> recent VXLAN-GBP work.

I'll take a look when I get time.

I have not seen scalability issues creating 1,000+ net_devices. 
Certainly the 40k'ish memory per net_device is noticeable but I believe 
that can be improved (e.g., a number of entries can be moved under 
proper CONFIG_ checks). I do need to repeat the tests on newer kernels.

>
> You could either map whatever selects the VRF to the mark or support it
> natively in the routing rules classifier.
>
> An obvious alternative is OVS. What you describe can be implemented in
> a scalable matter using OVS and mark. I understand that OVS is not for
> everybody but it gets a fundamental principle right: Scalability
> demands for programmability.
>
> I don’t think we should be adding a new single purpose metadata field
> to arbitrary structures for every new use case that comes up. We
> should work on programmability which increases flexibility and allows
> decoupling application interest from networking details.
>
>> On RX skbs get their VRF context from the netdevice the packet is received
>> on. For TX the VRF context for an skb is taken from the socket. The
>> intention is for L3/raw sockets to be able to set the VRF context for a
>> packet TX using cmsg (not coded in this patch set).
>
> Specyfing L3 context in cmsg seems very broken to me. We do not want
> to bind applications any closer to underlying networking infrastructure.
> In fact, we should do the opposite and decouple this completely.

That suggestion is in line with what is done today for other L3 
parameters -- TOS, TTL, and a few others.

>
>> The 'any' context applies to listen sockets only; connected sockets are in
>> a VRF context. Child sockets accepted by the daemon acquire the VRF context
>> of the network device the connection originated on.
>
> Linux considers an address local regardless of the interface the packet
> was received on.  So you would accept the packet on any interface and
> then bind it to the VRF of that interface even though the route for it
> might be on a different interface.
>
> This really belongs into routing rules from my perspective which takes
> mark and the cgroup context into account.

Expanding the current network namespace checks to a networking context 
is a very simple and clean way of implementing VRFs versus cobbling 
together a 'VRF like' capability using marks, multiple tables, etc. (i.e., 
the existing capabilities). Further, the VRF tagging of net_devices 
seems to readily fit into the hardware offload and switchdev 
capabilities (e.g., add a ndo operation for setting the VRF tag on a 
device which passes it to the driver).

Big picture wise where is OCP and switchdev headed? Top-of-rack switches 
seem to be the first target, but after that? Will the kernel ever 
support MPLS? Will the kernel attain the richer feature set of high-end 
routers? If so, how does VRF support fit into the design? As I 
understand it a scalable VRF solution is a fundamental building block. 
Will a cobbled together solution of cgroups, marks, rules, multiple 
tables really work versus the simplicity of an expanded network context?

David

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: network namespace bloat
       [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
@ 2015-02-11  2:55     ` Eric Dumazet
  2015-02-11  3:18       ` Eric W. Biederman
  2015-02-11 17:09     ` network namespace bloat Nicolas Dichtel
  1 sibling, 1 reply; 119+ messages in thread
From: Eric Dumazet @ 2015-02-11  2:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, Stephen Hemminger, Nicolas Dichtel, roopa,
	Hannes Frederic Sowa, Dinesh Dutt, Vipin Kumar, Shmulik Ladkani,
	David Ahern, David S. Miller

On Tue, Feb 10, 2015 at 6:42 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:

>
> Those large hash tables impact creation speed as large memory
> allocations require more work from the memory allocators, and they
> affect reliability of as order > 0 pages are not reliabily available in
> the kernel.  So from a network namespace perspective I would really like
> to convert the per network namespace hash tables to hash tables that
> have a single instance across all network namespaces.
>

tcp_metric can fallback to vzalloc() after commit 976a702ac9eea ?

There is nothing preventing to use a single tcp_metric, a bit like
global TCP hash table.

We only have to convert the thing...

Note : At Google we do not save tcp metrics.
We have to use it only for FastOpen cookies eventually (For clients)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: network namespace bloat
  2015-02-11  2:55     ` network namespace bloat Eric Dumazet
@ 2015-02-11  3:18       ` Eric W. Biederman
  2015-02-19 19:49         ` David Miller
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-02-11  3:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Stephen Hemminger, Nicolas Dichtel, roopa,
	Hannes Frederic Sowa, Dinesh Dutt, Vipin Kumar, Shmulik Ladkani,
	David Ahern, David S. Miller

Eric Dumazet <edumazet@google.com> writes:

> On Tue, Feb 10, 2015 at 6:42 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
>>
>> Those large hash tables impact creation speed as large memory
>> allocations require more work from the memory allocators, and they
>> affect reliability of as order > 0 pages are not reliabily available in
>> the kernel.  So from a network namespace perspective I would really like
>> to convert the per network namespace hash tables to hash tables that
>> have a single instance across all network namespaces.
>>
>
> tcp_metric can fallback to vzalloc() after commit 976a702ac9eea ?

True, although vmalloc space is limited on some architectures.

> There is nothing preventing to use a single tcp_metric, a bit like
> global TCP hash table.
>
> We only have to convert the thing...

Thanks for the confirmation, that is what I figured was going on.

> Note : At Google we do not save tcp metrics.
> We have to use it only for FastOpen cookies eventually (For clients)

Interesting.  That makes it doubly desirable to not need to allocate
a per network namespace hash table.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
       [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
  2015-02-09 20:48           ` Nicolas Dichtel
@ 2015-02-11  4:14           ` David Ahern
  1 sibling, 0 replies; 119+ messages in thread
From: David Ahern @ 2015-02-11  4:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen Hemminger, netdev, Nicolas Dichtel, roopa, hannes,
	Dinesh Dutt, Vipin Kumar

On 2/6/15 2:22 PM, Eric W. Biederman wrote:
> I think you have also introduced a second layer of indirection and thus
> an extra cache-line miss with net_ctx.  At 60ns-100ns per cache line
> miss that could be significant.
>
> Overall the standard should be that there is no measurable performance
> overhead with something like this enabled.  At least at 1Gbps we were
> able to demonstrate there was not measuable performance impact with
> network namespaces before they were merged.
>
> Eric
>

Here's a quick look at performance impacts of this patch set.

Host:
     Fedora 21
     Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
     1 socket, 4 cores, 8 threads

VM:
     Fedora 21
     2 vcpus, cpu model is 'host,x2apic'
     1G RAM
     network: virtio + vhost with tap device connected to a bridge

Tests are between host OS and VM (RX: netperf in host; TX: netperf in 
guest; netserver is the reverse).

No tweaks have been done to the default Fedora settings. In particular 
all of the offloads that default to enabled on tap devices and virtio 
devices are left enabled. Specifically, these offloads are what 
accelerate the stream tests to the 40,000 Mbps range and hence really 
stress the overhead. Nor has any cpu pinning been done or any other 
attempts at optimizations.

Commands:
     netperf -l 10 -t TCP_STREAM -H <ip>
     netperf -l 10 -t TCP_RR -H <ip>

Results are the average of 3 runs:

                       pre-VRF          with VRF
                       TX      RX       TX     RX
TCP Stream (Mbps)   39503    40325    39856  38211
TCP RR (trans/sec)  46047    46512    47619  43032

* pre-VRF means commit id 7e8acbb69ee2 which is the commit id before 
this patch set
* with VRF = patches posted in this thread

The VM setup definitely pushes some limits and represents an extreme in 
performance comparisons. While the VRF patches do show a degradation in 
RX performance the delta is fairly small. As I mentioned before I can 
remove the vrf tagging to skbs which should help. Overall I have focused 
more on concept than the performance; I'm sure that delta can be reduced.
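
For reference, the size of the delta can be worked out from the averages in
the table above. The following is only a small sketch; the helper names
pct_change and print_delta are made up for this illustration and are not
part of netperf or the patch set.

```c
#include <stdio.h>

/* Relative change from 'before' to 'after', in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the TX and RX percentage deltas for one row of the table. */
static void print_delta(const char *name, double tx_pre, double rx_pre,
			double tx_vrf, double rx_vrf)
{
	printf("%-20s TX %+.2f%%  RX %+.2f%%\n", name,
	       pct_change(tx_pre, tx_vrf), pct_change(rx_pre, rx_vrf));
}
```

Feeding in the rows above, e.g. print_delta("TCP Stream (Mbps)", 39503,
40325, 39856, 38211) and print_delta("TCP RR (trans/sec)", 46047, 46512,
47619, 43032), gives roughly +0.9% TX and -5.2% RX for the stream test, and
roughly +3.4% TX and -7.5% RX for the request/response test.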

David

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-09 15:54     ` roopa
@ 2015-02-11  7:42       ` Shmulik Ladkani
  0 siblings, 0 replies; 119+ messages in thread
From: Shmulik Ladkani @ 2015-02-11  7:42 UTC (permalink / raw)
  To: roopa
  Cc: David Ahern, netdev, ebiederm, Dinesh Dutt, Vipin Kumar,
	Nicolas Dichtel, hannes, Eyal Birger

On Mon, 09 Feb 2015 07:54:50 -0800 roopa <roopa@cumulusnetworks.com> wrote:
> On 2/5/15, 10:10 PM, Shmulik Ladkani wrote:
> > On Thu, 05 Feb 2015 15:12:57 -0800 roopa <roopa@cumulusnetworks.com> wrote:
> >> We have been playing with ip rules to implement vrfs. And the blocker
> >> today is that we cannot bind a socket to a vrf (routing tables in this
> >> case).
> >
> > One option would be using SO_MARK sockopt on that socket, and have an ip
> > rule which matches this mark to point to your table.
> > I don't know your exact use-cases, but you can play around with that
> > idea.
> 
>   yes, SO_MARK and 'ip rule fwmark'  is an option to bind tx from a socket
> to a table. But, There are more things that will be needed on the rx side.
> and at this point we are not considering netfilter marking of the 
> ingress packets so haven't been following this option

In the past we've implemented small-scale L3 segmentation using multiple
tables, without using netfilter marking.

We've used 'iif' rules for rx (as application knows its interface-to-vrf
mapping, it may provision 'iif' rules to point to the appropriate table).
For locally originated traffic, SO_MARK and 'mark' rules were used.

An 'ingress-netdevice to mark' mapping would make such solution less
awkward, but one might claim such mapping is not generic as it leaks
application specific knowledge and logic into the network stack.

Also, the downside of a multiple-tables based solution is probably a
lack of scalability, as the number of ip rules in such a scheme grows
linearly with the number of L3 segments.

Regards,
Shmulik

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: network namespace bloat
       [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
  2015-02-11  2:55     ` network namespace bloat Eric Dumazet
@ 2015-02-11 17:09     ` Nicolas Dichtel
  1 sibling, 0 replies; 119+ messages in thread
From: Nicolas Dichtel @ 2015-02-11 17:09 UTC (permalink / raw)
  To: Eric W. Biederman, netdev
  Cc: Stephen Hemminger, roopa, hannes, Dinesh Dutt, Vipin Kumar,
	Shmulik Ladkani, David Ahern, Eric Dumazet, David S. Miller

Le 11/02/2015 03:42, Eric W. Biederman a écrit :
[snip]
>
> The next largest component appears to be all of the tunnel network
> devices that we allocate for compatibility reasons so that the old ioctl
> interfaces still work.
>
[snip]
>
>
> A knob (sysctl?) that controls the creation of the backwards
> compabitilty tunnel network devices seems desirable.  As in many
> instances those are just overhead today.
Note that these interfaces are also used as fallback devices, they catch
packets that don't match any configured tunnels.

See: http://thread.gmane.org/gmane.linux.network/249634/focus=249634

[snip]
>
> In my ideal world enough of these issues would be fixed that a
> new empty network namespace would consume less than 100KiB of memory.
+1


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: network namespace bloat
  2015-02-11  3:18       ` Eric W. Biederman
@ 2015-02-19 19:49         ` David Miller
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: David Miller @ 2015-02-19 19:49 UTC (permalink / raw)
  To: ebiederm
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 10 Feb 2015 21:18:13 -0600

> Eric Dumazet <edumazet@google.com> writes:
> 
>> On Tue, Feb 10, 2015 at 6:42 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>
>>>
>>> Those large hash tables impact creation speed as large memory
>>> allocations require more work from the memory allocators, and they
>>> affect reliability of as order > 0 pages are not reliabily available in
>>> the kernel.  So from a network namespace perspective I would really like
>>> to convert the per network namespace hash tables to hash tables that
>>> have a single instance across all network namespaces.
>>>
>>
>> tcp_metric can fallback to vzalloc() after commit 976a702ac9eea ?
> 
> True, although vmalloc space is limited on some architectures.
> 
>> There is nothing preventing to use a single tcp_metric, a bit like
>> global TCP hash table.
>>
>> We only have to convert the thing...
> 
> Thanks for the confirmation, that is what I figured was going on.
> 
>> Note : At Google we do not save tcp metrics.
>> We have to use it only for FastOpen cookies eventually (For clients)
> 
> Interesting.  That makes it doubly desirable to not need to allocate
> a per network namespace hash table.

As a first step we can make the tcp_metrics hash global, but in addition
to that an rhashtable conversion is probably in order as well.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction
  2015-02-19 19:49         ` David Miller
@ 2015-03-09 18:22           ` Eric W. Biederman
  2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
                               ` (7 more replies)
  0 siblings, 8 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:22 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


This is a small pile of patches that convert tcp_metrics from using a
hash table per network namespace to using a single hash table for all
network namespaces.

This is broken up into several patches so that each small step along
the way could be carefully scrutinized as I wrote it, and equally so
that each small step can be reviewed.

There are two minor cleanups included: the addition of a missing panic
when the tcp_metrics hash table can not be allocated during boot, and the
removal of the return code from tcp_metrics_flush_all.

The motivation for this change is that the tcp_metrics hash table at
128KiB is the single largest component of a freshly allocated network
namespace.

Eric W. Biederman (6):
      tcp_metrics: panic when tcp_metrics can not be allocated
      tcp_metrics: Mix the network namespace into the hash function.
      tcp_metrics: Add a field tcpm_net and verify it matches on lookup
      tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
      tcp_metrics: Rewrite tcp_metrics_flush_all
      tcp_metrics: Use a single hash table for all network namespaces.

 include/net/netns/ipv4.h |   2 -
 net/ipv4/tcp_metrics.c   | 118 +++++++++++++++++++++++++++++------------------
 2 files changed, 73 insertions(+), 47 deletions(-)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
@ 2015-03-09 18:27             ` Eric W. Biederman
  2015-03-09 18:50               ` Sergei Shtylyov
  2015-03-09 18:27             ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
                               ` (6 subsequent siblings)
  7 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:27 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


Panic so that in the unlikely event we have problems we will have a
clear place to start debugging instead of a mysterious NULL pointer
dereference later on.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index e5f41bd5ec1b..fdf4bdda971f 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1186,5 +1186,6 @@ cleanup_subsys:
 	unregister_pernet_subsys(&tcp_net_metrics_ops);
 
 cleanup:
+	panic("Could not allocate the tcp_metrics hash table\n");
 	return;
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function.
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
  2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
@ 2015-03-09 18:27             ` Eric W. Biederman
  2015-03-09 18:29             ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
                               ` (5 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:27 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


In preparation for using one hash table for all network namespaces mix
the network namespace into the hash value.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index fdf4bdda971f..70196c3c16a1 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -252,6 +252,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 	}
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -299,6 +300,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 		return NULL;
 
 	net = twsk_net(tw);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -347,6 +349,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		return NULL;
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
@@ -994,6 +997,7 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	if (!reply)
 		goto nla_put_failure;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
@@ -1070,6 +1074,7 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	if (ret < 0)
 		src = false;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	hb = net->ipv4.tcp_metrics_hash + hash;
 	pp = &hb->chain;
-- 
2.2.1
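
The idea in the patch above can be sketched in userspace C: XOR a
per-namespace value into the address hash before reducing it to a bucket
index, so that the same address pair in two namespaces tends to land in
different buckets of a shared table. This is only a sketch under stated
assumptions: hash_32_sketch() is a generic multiplicative hash standing in
for the kernel's hash_32(), and the hash_mix field is a hypothetical
replacement for whatever net_hash_mix() returns.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative hash folded to 'bits' bits; a stand-in for hash_32(). */
static uint32_t hash_32_sketch(uint32_t val, unsigned int bits)
{
	assert(bits > 0 && bits <= 32);
	return (uint32_t)(val * UINT32_C(0x61C88647)) >> (32 - bits);
}

/* Hypothetical per-namespace salt; the kernel uses net_hash_mix(net). */
struct net_sketch {
	uint32_t hash_mix;
};

/* Map an address hash to a bucket index in a table of 1 << hash_log buckets. */
static uint32_t metrics_bucket_sketch(uint32_t addr_hash,
				      const struct net_sketch *net,
				      unsigned int hash_log)
{
	uint32_t hash = addr_hash;

	hash ^= net->hash_mix;			/* fold the namespace in first */
	return hash_32_sketch(hash, hash_log);	/* then reduce to an index */
}
```

Mixing only spreads entries out. Two namespaces can still collide in one
bucket, which is why the next patch adds an explicit namespace check on
lookup.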

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
  2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
  2015-03-09 18:27             ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
@ 2015-03-09 18:29             ` Eric W. Biederman
  2015-03-09 20:25               ` Julian Anastasov
  2015-03-09 18:30             ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
                               ` (4 subsequent siblings)
  7 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:29 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


In preparation for using one tcp metrics hash table for all network
namespaces add a field tcpm_net to struct tcp_metrics_block, and verify
that field on all hash table lookups.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 70196c3c16a1..4ec02d6cab5b 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -40,6 +40,7 @@ struct tcp_fastopen_metrics {
 
 struct tcp_metrics_block {
 	struct tcp_metrics_block __rcu	*tcpm_next;
+	struct net			*tcpm_net;
 	struct inetpeer_addr		tcpm_saddr;
 	struct inetpeer_addr		tcpm_daddr;
 	unsigned long			tcpm_stamp;
@@ -183,6 +184,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 		if (!tm)
 			goto out_unlock;
 	}
+	tm->tcpm_net = net;
 	tm->tcpm_saddr = *saddr;
 	tm->tcpm_daddr = *daddr;
 
@@ -216,7 +218,8 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
-		if (addr_same(&tm->tcpm_saddr, saddr) &&
+		if ((tm->tcpm_net == net) &&
+		    addr_same(&tm->tcpm_saddr, saddr) &&
 		    addr_same(&tm->tcpm_daddr, daddr))
 			break;
 		depth++;
@@ -257,7 +260,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
-		if (addr_same(&tm->tcpm_saddr, &saddr) &&
+		if ((tm->tcpm_net == net) &&
+		    addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr))
 			break;
 	}
@@ -305,7 +309,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
-		if (addr_same(&tm->tcpm_saddr, &saddr) &&
+		if ((tm->tcpm_net == net) &&
+		    addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr))
 			break;
 	}
@@ -912,6 +917,8 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
 		     tm = rcu_dereference(tm->tcpm_next), col++) {
+			if (tm->tcpm_net != net)
+				continue;
 			if (col < s_col)
 				continue;
 			if (tcp_metrics_dump_info(skb, cb, tm) < 0) {
@@ -1003,7 +1010,8 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	rcu_read_lock();
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
-		if (addr_same(&tm->tcpm_daddr, &daddr) &&
+		if ((tm->tcpm_net == net) &&
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
 		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
 			ret = tcp_metrics_fill_info(msg, tm);
 			break;
@@ -1080,7 +1088,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	pp = &hb->chain;
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
-		if (addr_same(&tm->tcpm_daddr, &daddr) &&
+		if ((tm->tcpm_net == net) &&
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
 		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
 			*pp = tm->tcpm_next;
 			kfree_rcu(tm, rcu_head);
-- 
2.2.1
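
A minimal userspace sketch of the lookup rule this patch introduces: an
entry matches only if its owning namespace and both addresses match. The
struct and function names are hypothetical, addresses are reduced to plain
32-bit values, and the RCU dereferencing used by the real code is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct net_sketch { int id; };	/* hypothetical stand-in for struct net */

struct metrics_entry_sketch {
	struct metrics_entry_sketch *next;
	const struct net_sketch *net;	/* owning namespace, like tcpm_net */
	uint32_t saddr;
	uint32_t daddr;
};

/* Walk one hash chain and return the entry owned by 'net', or NULL. */
static struct metrics_entry_sketch *
chain_lookup_sketch(struct metrics_entry_sketch *head,
		    const struct net_sketch *net,
		    uint32_t saddr, uint32_t daddr)
{
	struct metrics_entry_sketch *e;

	assert(net != NULL);
	for (e = head; e; e = e->next) {
		if (e->net == net && e->saddr == saddr && e->daddr == daddr)
			return e;
	}
	return NULL;
}
```

Without the namespace comparison, two namespaces using the same address
pair would read and update each other's cached metrics once the table is
shared.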

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
                               ` (2 preceding siblings ...)
  2015-03-09 18:29             ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
@ 2015-03-09 18:30             ` Eric W. Biederman
  2015-03-09 18:30             ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
                               ` (3 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:30 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


tcp_metrics_flush_all always returns 0.  Remove the unnecessary return code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 4ec02d6cab5b..e98c6e9770c1 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1038,7 +1038,7 @@ out_free:
 
 #define deref_genl(p)	rcu_dereference_protected(p, lockdep_genl_is_held())
 
-static int tcp_metrics_flush_all(struct net *net)
+static void tcp_metrics_flush_all(struct net *net)
 {
 	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
 	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
@@ -1059,7 +1059,6 @@ static int tcp_metrics_flush_all(struct net *net)
 			tm = next;
 		}
 	}
-	return 0;
 }
 
 static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
@@ -1076,8 +1075,10 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ret = parse_nl_addr(info, &daddr, &hash, 1);
 	if (ret < 0)
 		return ret;
-	if (ret > 0)
-		return tcp_metrics_flush_all(net);
+	if (ret > 0) {
+		tcp_metrics_flush_all(net);
+		return 0;
+	}
 	ret = parse_nl_saddr(info, &saddr);
 	if (ret < 0)
 		src = false;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
                               ` (3 preceding siblings ...)
  2015-03-09 18:30             ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-09 18:30             ` Eric W. Biederman
  2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
                               ` (2 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:30 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


Rewrite tcp_metrics_flush_all so that it can cope with entries from
different network namespaces on its hash chain.

This is based on the logic in tcp_metrics_nl_cmd_del for deleting a
selection of entries from a tcp metrics hash chain.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index e98c6e9770c1..b85e0c79895f 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1046,18 +1046,19 @@ static void tcp_metrics_flush_all(struct net *net)
 	unsigned int row;
 
 	for (row = 0; row < max_rows; row++, hb++) {
+		struct tcp_metrics_block __rcu **pp;
 		spin_lock_bh(&tcp_metrics_lock);
-		tm = deref_locked_genl(hb->chain);
-		if (tm)
-			hb->chain = NULL;
-		spin_unlock_bh(&tcp_metrics_lock);
-		while (tm) {
-			struct tcp_metrics_block *next;
-
-			next = deref_genl(tm->tcpm_next);
-			kfree_rcu(tm, rcu_head);
-			tm = next;
+		pp = &hb->chain;
+		for (tm = deref_locked_genl(*pp); tm;
+		     tm = deref_locked_genl(*pp)) {
+			if (tm->tcpm_net == net) {
+				*pp = tm->tcpm_next;
+				kfree_rcu(tm, rcu_head);
+			} else {
+				pp = &tm->tcpm_next;
+			}
 		}
+		spin_unlock_bh(&tcp_metrics_lock);
 	}
 }
 
-- 
2.2.1
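
The pointer-to-pointer walk used in the rewrite above can be sketched in
plain userspace C. This is an illustration only: the names are
hypothetical, an integer id stands in for the struct net pointer, free()
replaces kfree_rcu(), and the spinlock is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct entry_sketch {
	struct entry_sketch *next;
	int net_id;		/* stand-in for tcpm_net */
};

/* Push a new entry on the front of the chain; returns 0 on success. */
static int chain_push_sketch(struct entry_sketch **head, int net_id)
{
	struct entry_sketch *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->net_id = net_id;
	e->next = *head;
	*head = e;
	return 0;
}

/* Unlink and free every entry owned by net_id; returns how many went away. */
static unsigned int chain_flush_net_sketch(struct entry_sketch **head, int net_id)
{
	struct entry_sketch **pp = head;
	struct entry_sketch *e;
	unsigned int removed = 0;

	assert(head != NULL);
	while ((e = *pp) != NULL) {
		if (e->net_id == net_id) {
			*pp = e->next;	/* unlink; pp now sees the successor */
			free(e);
			removed++;
		} else {
			pp = &e->next;	/* keep this entry and move on */
		}
	}
	return removed;
}
```

Because pp always points at the link that leads to the current entry,
removing the head of the chain needs no special case, and entries that
belong to other namespaces stay linked in their original order.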

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
                               ` (4 preceding siblings ...)
  2015-03-09 18:30             ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-09 18:31             ` Eric W. Biederman
  2015-03-09 18:43               ` Eric Dumazet
  2015-03-09 18:47               ` Eric Dumazet
  2015-03-09 20:09             ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction David Miller
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
  7 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 18:31 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern


Now that all of the operations are safe on a single hash table
across network namespaces, allocate a single global hash table
and update the code to use it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h |  2 --
 net/ipv4/tcp_metrics.c   | 63 ++++++++++++++++++++++++++++--------------------
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8f3a1a1a5a94..614a49be68a9 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -54,8 +54,6 @@ struct netns_ipv4 {
 	struct sock		*mc_autojoin_sk;
 
 	struct inet_peer_base	*peers;
-	struct tcpm_hash_bucket	*tcp_metrics_hash;
-	unsigned int		tcp_metrics_hash_log;
 	struct sock  * __percpu	*tcp_sk;
 	struct netns_frags	frags;
 #ifdef CONFIG_NETFILTER
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b85e0c79895f..785c4ab51fe9 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -92,6 +92,9 @@ struct tcpm_hash_bucket {
 	struct tcp_metrics_block __rcu	*chain;
 };
 
+static struct tcpm_hash_bucket	*tcp_metrics_hash;
+static unsigned int		tcp_metrics_hash_log;
+
 static DEFINE_SPINLOCK(tcp_metrics_lock);
 
 static void tcpm_suck_dst(struct tcp_metrics_block *tm,
@@ -172,7 +175,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	if (unlikely(reclaim)) {
 		struct tcp_metrics_block *oldest;
 
-		oldest = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain);
+		oldest = rcu_dereference(tcp_metrics_hash[hash].chain);
 		for (tm = rcu_dereference(oldest->tcpm_next); tm;
 		     tm = rcu_dereference(tm->tcpm_next)) {
 			if (time_before(tm->tcpm_stamp, oldest->tcpm_stamp))
@@ -191,8 +194,8 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	tcpm_suck_dst(tm, dst, true);
 
 	if (likely(!reclaim)) {
-		tm->tcpm_next = net->ipv4.tcp_metrics_hash[hash].chain;
-		rcu_assign_pointer(net->ipv4.tcp_metrics_hash[hash].chain, tm);
+		tm->tcpm_next = tcp_metrics_hash[hash].chain;
+		rcu_assign_pointer(tcp_metrics_hash[hash].chain, tm);
 	}
 
 out_unlock:
@@ -216,7 +219,7 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 	struct tcp_metrics_block *tm;
 	int depth = 0;
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if ((tm->tcpm_net == net) &&
 		    addr_same(&tm->tcpm_saddr, saddr) &&
@@ -256,9 +259,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if ((tm->tcpm_net == net) &&
 		    addr_same(&tm->tcpm_saddr, &saddr) &&
@@ -305,9 +308,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 
 	net = twsk_net(tw);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if ((tm->tcpm_net == net) &&
 		    addr_same(&tm->tcpm_saddr, &saddr) &&
@@ -355,7 +358,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
 	if (tm == TCP_METRICS_RECLAIM_PTR)
@@ -906,13 +909,13 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 			       struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
 	unsigned int row, s_row = cb->args[0];
 	int s_col = cb->args[1], col = s_col;
 
 	for (row = s_row; row < max_rows; row++, s_col = 0) {
 		struct tcp_metrics_block *tm;
-		struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash + row;
+		struct tcpm_hash_bucket *hb = tcp_metrics_hash + row;
 
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
@@ -1005,10 +1008,10 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto nla_put_failure;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if ((tm->tcpm_net == net) &&
 		    addr_same(&tm->tcpm_daddr, &daddr) &&
@@ -1040,8 +1043,8 @@ out_free:
 
 static void tcp_metrics_flush_all(struct net *net)
 {
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
-	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
+	struct tcpm_hash_bucket *hb = tcp_metrics_hash;
 	struct tcp_metrics_block *tm;
 	unsigned int row;
 
@@ -1085,8 +1088,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		src = false;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
-	hb = net->ipv4.tcp_metrics_hash + hash;
+	hash = hash_32(hash, tcp_metrics_hash_log);
+	hb = tcp_metrics_hash + hash;
 	pp = &hb->chain;
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
@@ -1142,6 +1145,9 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	size_t size;
 	unsigned int slots;
 
+	if (net != &init_net)
+		return 0;
+
 	slots = tcpmhash_entries;
 	if (!slots) {
 		if (totalram_pages >= 128 * 1024)
@@ -1150,14 +1156,14 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 			slots = 8 * 1024;
 	}
 
-	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
-	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
+	tcp_metrics_hash_log = order_base_2(slots);
+	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!net->ipv4.tcp_metrics_hash)
-		net->ipv4.tcp_metrics_hash = vzalloc(size);
+	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!tcp_metrics_hash)
+		tcp_metrics_hash = vzalloc(size);
 
-	if (!net->ipv4.tcp_metrics_hash)
+	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
 	return 0;
@@ -1167,17 +1173,22 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
 {
 	unsigned int i;
 
-	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
+	if (net != &init_net) {
+		tcp_metrics_flush_all(net);
+		return;
+	}
+
+	for (i = 0; i < (1U << tcp_metrics_hash_log) ; i++) {
 		struct tcp_metrics_block *tm, *next;
 
-		tm = rcu_dereference_protected(net->ipv4.tcp_metrics_hash[i].chain, 1);
+		tm = rcu_dereference_protected(tcp_metrics_hash[i].chain, 1);
 		while (tm) {
 			next = rcu_dereference_protected(tm->tcpm_next, 1);
 			kfree(tm);
 			tm = next;
 		}
 	}
-	kvfree(net->ipv4.tcp_metrics_hash);
+	kvfree(tcp_metrics_hash);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
@ 2015-03-09 18:43               ` Eric Dumazet
  2015-03-09 18:47               ` Eric Dumazet
  1 sibling, 0 replies; 119+ messages in thread
From: Eric Dumazet @ 2015-03-09 18:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

On Mon, 2015-03-09 at 13:31 -0500, Eric W. Biederman wrote:

>  
> +static struct tcpm_hash_bucket	*tcp_metrics_hash;
> +static unsigned int		tcp_metrics_hash_log;

static struct tcpm_hash_bucket *tcp_metrics_hash __read_mostly;
static unsigned int            tcp_metrics_hash_log __read_mostly;


> +
>  static DEFINE_SPINLOCK(tcp_metrics_lock);

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
  2015-03-09 18:43               ` Eric Dumazet
@ 2015-03-09 18:47               ` Eric Dumazet
  2015-03-09 19:35                 ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: Eric Dumazet @ 2015-03-09 18:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

On Mon, 2015-03-09 at 13:31 -0500, Eric W. Biederman wrote:

> @@ -1167,17 +1173,22 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
>  {
>  	unsigned int i;
>  
> -	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
> +	if (net != &init_net) {
> +		tcp_metrics_flush_all(net);
> +		return;
> +	}

Note it is _very_ unlikely (read: not possible) that
tcp_net_metrics_exit() will ever be called on init_net.

So I would remove all this code.

> +
> +	for (i = 0; i < (1U << tcp_metrics_hash_log) ; i++) {
>  		struct tcp_metrics_block *tm, *next;
>  
> -		tm = rcu_dereference_protected(net->ipv4.tcp_metrics_hash[i].chain, 1);
> +		tm = rcu_dereference_protected(tcp_metrics_hash[i].chain, 1);
>  		while (tm) {
>  			next = rcu_dereference_protected(tm->tcpm_next, 1);
>  			kfree(tm);
>  			tm = next;
>  		}
>  	}
> -	kvfree(net->ipv4.tcp_metrics_hash);
> +	kvfree(tcp_metrics_hash);
>  }
>  

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated
  2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
@ 2015-03-09 18:50               ` Sergei Shtylyov
  2015-03-11 19:22                 ` Sergei Shtylyov
  0 siblings, 1 reply; 119+ messages in thread
From: Sergei Shtylyov @ 2015-03-09 18:50 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern

Hello.

On 3/9/2015 9:27 PM, Eric W. Biederman wrote:

> Panic so that in the unlikely event we have problems we will have a
> clear place to start debugging instead of a mysterious NULL pointer
> dereference later on.

> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>   net/ipv4/tcp_metrics.c | 1 +
>   1 file changed, 1 insertion(+)

> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
> index e5f41bd5ec1b..fdf4bdda971f 100644
> --- a/net/ipv4/tcp_metrics.c
> +++ b/net/ipv4/tcp_metrics.c
> @@ -1186,5 +1186,6 @@ cleanup_subsys:
>   	unregister_pernet_subsys(&tcp_net_metrics_ops);
>
>   cleanup:
> +	panic("Could not allocate the tcp_metrics hash table\n");
>   	return;

    You can drop this *return* as well; it serves no purpose.

>   }

WBR, Sergei

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-09 18:47               ` Eric Dumazet
@ 2015-03-09 19:35                 ` Eric W. Biederman
  2015-03-09 20:21                   ` Eric Dumazet
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 19:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2015-03-09 at 13:31 -0500, Eric W. Biederman wrote:
>
>> @@ -1167,17 +1173,22 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
>>  {
>>  	unsigned int i;
>>  
>> -	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
>> +	if (net != &init_net) {
>> +		tcp_metrics_flush_all(net);
>> +		return;
>> +	}
>
> Note it is _very_ unlikely (read: not possible) that
> tcp_net_metrics_exit() will ever be called on init_net.
>
> So I would remove all this code.

If the line:
	ret = genl_register_family_with_ops(&tcp_metrics_nl_family,
					    tcp_metrics_nl_ops);

fails then this code will be called with net == &init_net.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
                               ` (5 preceding siblings ...)
  2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
@ 2015-03-09 20:09             ` David Miller
  2015-03-09 20:21               ` Eric W. Biederman
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
  7 siblings, 1 reply; 119+ messages in thread
From: David Miller @ 2015-03-09 20:09 UTC (permalink / raw)
  To: ebiederm
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Mon, 09 Mar 2015 13:22:52 -0500

> This is a small pile of patches that convert tcp_metrics from using a
> hash table per network namespace to using a single hash table for all
> network namespaces.
> 
> This is broken up into several patches so that each small step along
> the way could be carefully scrutinized as I wrote it, and equally so
> that each small step can be reviewed.
> 
> There are two minor cleanups included.  The addition of a missing panic
> when the tcp_metrics hash table can not be allocated during boot and the
> removal of the return code from tcp_metrics_flush_all
> 
> The motivation for this change is that the tcp_metrics hash table at
> 128KiB is the single largest component of a freshly allocated network
> namespace.

Looks like there is feedback for this series, so I'll let you address
those and submit a V2.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-09 19:35                 ` Eric W. Biederman
@ 2015-03-09 20:21                   ` Eric Dumazet
  0 siblings, 0 replies; 119+ messages in thread
From: Eric Dumazet @ 2015-03-09 20:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

On Mon, 2015-03-09 at 14:35 -0500, Eric W. Biederman wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
> > On Mon, 2015-03-09 at 13:31 -0500, Eric W. Biederman wrote:
> >
> >> @@ -1167,17 +1173,22 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
> >>  {
> >>  	unsigned int i;
> >>  
> >> -	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
> >> +	if (net != &init_net) {
> >> +		tcp_metrics_flush_all(net);
> >> +		return;
> >> +	}
> >
> > Note it is _very_ unlikely (read: not possible) that
> > tcp_net_metrics_exit() will ever be called on init_net.
> >
> > So I would remove all this code.
> 
> If the line:
> 	ret = genl_register_family_with_ops(&tcp_metrics_nl_family,
> 					    tcp_metrics_nl_ops);
> 
> fails then this code will be called with net == &init_net.


So, what is wrong with calling

tcp_metrics_flush_all(net)

even if net == &init_net?

The only code needed after the purge is

kvfree(tcp_metrics_hash);

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction
  2015-03-09 20:09             ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction David Miller
@ 2015-03-09 20:21               ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-09 20:21 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern

David Miller <davem@redhat.com> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Mon, 09 Mar 2015 13:22:52 -0500
>
>> This is a small pile of patches that convert tcp_metrics from using a
>> hash table per network namespace to using a single hash table for all
>> network namespaces.
>> 
>> This is broken up into several patches so that each small step along
>> the way could be carefully scrutinized as I wrote it, and equally so
>> that each small step can be reviewed.
>> 
>> There are two minor cleanups included.  The addition of a missing panic
>> when the tcp_metrics hash table can not be allocated during boot and the
>> removal of the return code from tcp_metrics_flush_all
>> 
>> The motivation for this change is that the tcp_metrics hash table at
>> 128KiB is the single largest component of a freshly allocated network
>> namespace.
>
> Looks like there is feedback for this series, so I'll let you address
> those and submit a V2.

Fair enough.


Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-09 18:29             ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
@ 2015-03-09 20:25               ` Julian Anastasov
  2015-03-10  6:59                 ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: Julian Anastasov @ 2015-03-09 20:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern


	Hello,

On Mon, 9 Mar 2015, Eric W. Biederman wrote:

> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
> index 70196c3c16a1..4ec02d6cab5b 100644
> --- a/net/ipv4/tcp_metrics.c
> +++ b/net/ipv4/tcp_metrics.c
> @@ -40,6 +40,7 @@ struct tcp_fastopen_metrics {
>  
>  struct tcp_metrics_block {
>  	struct tcp_metrics_block __rcu	*tcpm_next;
> +	struct net			*tcpm_net;

	Is it better if we have this field under
#ifdef CONFIG_NET_NS ? read_pnet and write_pnet do not
care when first argument is not declared for the
!CONFIG_NET_NS case.

>  	struct inetpeer_addr		tcpm_saddr;
>  	struct inetpeer_addr		tcpm_daddr;
>  	unsigned long			tcpm_stamp;
> @@ -183,6 +184,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
>  		if (!tm)
>  			goto out_unlock;
>  	}
> +	tm->tcpm_net = net;

	write_pnet(&tm->tcpm_net, net);

>  	tm->tcpm_saddr = *saddr;
>  	tm->tcpm_daddr = *daddr;
>  
> @@ -216,7 +218,8 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
>  
>  	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
>  	     tm = rcu_dereference(tm->tcpm_next)) {
> -		if (addr_same(&tm->tcpm_saddr, saddr) &&
> +		if ((tm->tcpm_net == net) &&
> +		    addr_same(&tm->tcpm_saddr, saddr) &&
>  		    addr_same(&tm->tcpm_daddr, daddr))

	net_eq(read_pnet(), net) can be checked last,
better to match the addresses first?

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-09 20:25               ` Julian Anastasov
@ 2015-03-10  6:59                 ` Eric W. Biederman
  2015-03-10  8:23                   ` Julian Anastasov
  2015-03-10 16:36                   ` David Miller
  0 siblings, 2 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-10  6:59 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

Julian Anastasov <ja@ssi.bg> writes:

> 	Hello,
>
> On Mon, 9 Mar 2015, Eric W. Biederman wrote:
>
>> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
>> index 70196c3c16a1..4ec02d6cab5b 100644
>> --- a/net/ipv4/tcp_metrics.c
>> +++ b/net/ipv4/tcp_metrics.c
>> @@ -40,6 +40,7 @@ struct tcp_fastopen_metrics {
>>  
>>  struct tcp_metrics_block {
>>  	struct tcp_metrics_block __rcu	*tcpm_next;
>> +	struct net			*tcpm_net;
>
> 	Is it better if we have this field under
> #ifdef CONFIG_NET_NS ? read_pnet and write_pnet do not
> care when first argument is not declared for the
> !CONFIG_NET_NS case.

I don't actually believe there are many people compiling out network
namespaces but there is no point in making it unnecessarily bad for them.

That said.  If we actually really care about struct net going away it
would be much better to globally replace struct net with a typedef that
looks something like:

#ifdef CONFIG_NET_NS
struct net_ref {
       struct net *net;
};
#else
struct net_ref {
};
#endif
typedef struct net_ref net_t;

That would remove the need for write_pnet and read_pnet, make it
impossible to forget net_eq and make network namespace arguments to
functions also boil away at compile time if the network namespace code
was not enabled.

That was the original design and I forget why we didn't do that with
struct net.  But we did not.

In this specific case an extra member of tcp_metrics_block isn't going
to hurt so I don't think write_pnet and read_pnet are particularly
interesting.   net_eq definitely seems worthwhile.

>>  	struct inetpeer_addr		tcpm_saddr;
>>  	struct inetpeer_addr		tcpm_daddr;
>>  	unsigned long			tcpm_stamp;
>> @@ -183,6 +184,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
>>  		if (!tm)
>>  			goto out_unlock;
>>  	}
>> +	tm->tcpm_net = net;
>
> 	write_pnet(&tm->tcpm_net, net);
>
>>  	tm->tcpm_saddr = *saddr;
>>  	tm->tcpm_daddr = *daddr;
>>  
>> @@ -216,7 +218,8 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
>>  
>>  	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
>>  	     tm = rcu_dereference(tm->tcpm_next)) {
>> -		if (addr_same(&tm->tcpm_saddr, saddr) &&
>> +		if ((tm->tcpm_net == net) &&
>> +		    addr_same(&tm->tcpm_saddr, saddr) &&
>>  		    addr_same(&tm->tcpm_daddr, daddr))
>
> 	net_eq(read_pnet(), net) can be checked last,
> better to match the addresses first?

Not hugely better but failing faster is always a good idea if you get
stuck with a long hash chain.
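
As an illustration only (not part of the posted patch), the ordering
question can be sketched in user space.  The types below are simplified,
hypothetical stand-ins for the kernel structures, and chain_lookup() is an
invented name.  With && short-circuiting, the comparison most likely to
differ goes first, so a mismatching entry is rejected after one test:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel definitions. */
struct net { int id; };

struct addr {
	unsigned char bytes[16];
	unsigned short family;
};

struct metrics_block {
	struct metrics_block *next;
	struct net *net;
	struct addr saddr;
	struct addr daddr;
};

static inline bool addr_same(const struct addr *a, const struct addr *b)
{
	return a->family == b->family &&
	       memcmp(a->bytes, b->bytes, sizeof(a->bytes)) == 0;
}

static inline bool net_eq(const struct net *a, const struct net *b)
{
	return a == b;
}

/*
 * Walk one hash chain.  The addresses are compared first and the
 * namespace last: while every entry on the chain shares one namespace,
 * the namespace test can never reject an entry, so it is the least
 * useful test to run first.
 */
static inline struct metrics_block *
chain_lookup(struct metrics_block *head, const struct net *net,
	     const struct addr *saddr, const struct addr *daddr)
{
	struct metrics_block *tm;

	assert(saddr != NULL && daddr != NULL);
	for (tm = head; tm; tm = tm->next) {
		if (addr_same(&tm->saddr, saddr) &&
		    addr_same(&tm->daddr, daddr) &&
		    net_eq(tm->net, net))
			return tm;
	}
	return NULL;
}
```

Once a chain can hold entries from several namespaces the best order
depends on the workload, so this is a sketch of the idea rather than a
recommendation.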

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10  6:59                 ` Eric W. Biederman
@ 2015-03-10  8:23                   ` Julian Anastasov
  2015-03-11  0:58                     ` Eric W. Biederman
  2015-03-10 16:36                   ` David Miller
  1 sibling, 1 reply; 119+ messages in thread
From: Julian Anastasov @ 2015-03-10  8:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern


	Hello,

On Tue, 10 Mar 2015, Eric W. Biederman wrote:

> Julian Anastasov <ja@ssi.bg> writes:
> 
> >>  struct tcp_metrics_block {
> >>  	struct tcp_metrics_block __rcu	*tcpm_next;
> >> +	struct net			*tcpm_net;
> >
> > 	Is it better if we have this field under
> > #ifdef CONFIG_NET_NS ? read_pnet and write_pnet do not
> > care when first argument is not declared for the
> > !CONFIG_NET_NS case.
> 
> I don't actually believe there are many people compiling out network
> namespaces but there is no point in making it unnecessarily bad for them.

	Small routers. Also, we can move tcpm_net after
tcpm_daddr to help machines with 32-byte cache line.
It would be good if the fields and their accesses are in this order:
tcpm_next, tcpm_daddr, tcpm_saddr and finally tcpm_net.
tcpm_daddr is the leading key, it was tcpm_addr originally.
If such change is not suitable for this patchset I can try it
later after your changes.

> That said.  If we actually really care about struct net going away it
> would be much better to globally replace struct net with a typedef that
> looks something like:
> 
> #ifdef CONFIG_NET_NS
> struct net_ref {
>        struct net *net;
> };
> #else
> struct net_ref {
> };
> #endif
> typedef struct net_ref net_t;
> 
> That would remove the need for write_pnet and read_pnet, make it
> impossible to forget net_eq and make network namespace arguments to
> functions also boil away at compile time if the network namespace code
> was not enabled.
> 
> That was the original design and I forget why we didn't do that with
> struct net.  But we did not.

	Not sure. Only macros help to avoid ifdefs here
and there in the code...

> In this specific case an extra member of tcp_metrics_block isn't going
> to hurt so I don't think write_pnet and read_pnet are particularly
> interesting.   net_eq definitely seems worthwhile.

	Let's use them, even neighbour.c uses them.
Your patchset shows the direction we are going: the net field
becomes part of the structures and common tables are used.
Macros like write_pnet and read_pnet look like the only
way to save memory on small systems. And net_eq should
be optimized by compilers too.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10  6:59                 ` Eric W. Biederman
  2015-03-10  8:23                   ` Julian Anastasov
@ 2015-03-10 16:36                   ` David Miller
  2015-03-10 17:06                     ` Eric W. Biederman
  1 sibling, 1 reply; 119+ messages in thread
From: David Miller @ 2015-03-10 16:36 UTC (permalink / raw)
  To: ebiederm
  Cc: ja, edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes,
	ddutt, vipin, shmulik.ladkani, dsahern

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 10 Mar 2015 01:59:48 -0500

> If we actually really care about struct net going away it would be
> much better to globally replace struct net with a typedef that looks
> something like:
> 
> #ifdef CONFIG_NET_NS
> struct net_ref {
>        struct net *net;
> };
> #else
> struct net_ref {
> };
> #endif
> typedef struct net_ref net_t;
> 
> That would remove the need for write_pnet and read_pnet, make it
> impossible to forget net_eq and make network namespace arguments to
> functions also boil away at compile time if the network namespace code
> was not enabled.
> 
> That was the original design and I forget why we didn't do that with
> struct net.  But we did not.

This keeps the ifdefs out of foo.c code, so I like it.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10 16:36                   ` David Miller
@ 2015-03-10 17:06                     ` Eric W. Biederman
  2015-03-10 17:29                       ` David Miller
  0 siblings, 1 reply; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-10 17:06 UTC (permalink / raw)
  To: David Miller
  Cc: ja, edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes,
	ddutt, vipin, shmulik.ladkani, dsahern

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 10 Mar 2015 01:59:48 -0500
>
>> If we actually really care about struct net going away it would be
>> much better to globally replace struct net with a typedef that looks
>> something like:
>> 
>> #ifdef CONFIG_NET_NS
>> struct net_ref {
>>        struct net *net;
>> };
>> #else
>> struct net_ref {
>> };
>> #endif
>> typedef struct net_ref net_t;
>> 
>> That would remove the need for write_pnet and read_pnet, make it
>> impossible to forget net_eq and make network namespace arguments to
>> functions also boil away at compile time if the network namespace code
>> was not enabled.
>> 
>> That was the original design and I forget why we didn't do that with
>> struct net.  But we did not.
>
> This keeps the ifdefs out of foo.c code, so I like it.

Alright.  It does wind up requiring things like:

static inline struct net *to_struct_net(net_t net)
{
#ifdef CONFIG_NET_NS
        return net.net;
#else
        return &init_net;
#endif        
}

static inline net_t to_net_t(struct net *net)
{
	net_t result;
#ifdef CONFIG_NET_NS
	result.net = net;
#endif        
	return result;
}

static inline struct netns_ipv4 *ipv4_net(net_t net)
{
        return &to_struct_net(net)->ipv4;
}

But we are used to those from dealing with network devices,
so it should be no big deal.

I will start playing with it to see how much work that is to implement.

Any suggestions on a better name for to_struct_net()? Perhaps
global_net()?  Although arguably anything that to_struct_net feels
awkward for should have its own xxx_net type.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10 17:06                     ` Eric W. Biederman
@ 2015-03-10 17:29                       ` David Miller
  2015-03-10 17:56                         ` Eric W. Biederman
  0 siblings, 1 reply; 119+ messages in thread
From: David Miller @ 2015-03-10 17:29 UTC (permalink / raw)
  To: ebiederm
  Cc: ja, edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes,
	ddutt, vipin, shmulik.ladkani, dsahern

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 10 Mar 2015 12:06:32 -0500

> David Miller <davem@davemloft.net> writes:
> 
>> From: ebiederm@xmission.com (Eric W. Biederman)
>> Date: Tue, 10 Mar 2015 01:59:48 -0500
>>
>>> If we actually really care about struct net going away it would be
>>> much better to globally replace struct net with a typedef that looks
>>> something like:
>>> 
>>> #ifdef CONFIG_NET_NS
>>> struct net_ref {
>>>        struct net *net;
>>> };
>>> #else
>>> struct net_ref {
>>> };
>>> #endif
>>> typedef struct net_ref net_t;
>>> 
>>> That would remove the need for write_pnet and read_pnet, make it
>>> impossible to forget net_eq and make network namespace arguments to
>>> functions also boil away at compile time if the network namespace code
>>> was not enabled.
>>> 
>>> That was the original design and I forget why we didn't do that with
>>> struct net.  But we did not.
>>
>> This keeps the ifdefs out of foo.c code, so I like it.
> 
> Alright.  It does wind up requiring things like:

Another approach is to use a macro for the instantiation of a "struct
net *" member.

It could evaluate to "struct { } x;" when NETNS is disabled.

Then you don't need all the special accessors, read_pnet() and
write_pnet() are sufficient.
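
A minimal user-space sketch of that idea, under stated assumptions:
NET_MEMBER, DEMO_NET_NS, struct demo_obj and the demo_obj_*() helpers are
hypothetical names invented for this illustration, and the accessor
macros only mimic what read_pnet() and write_pnet() do.  The empty
struct in the disabled branch relies on a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct net. */
struct net { int id; };
struct net init_net = { 0 };

/* DEMO_NET_NS plays the role of CONFIG_NET_NS in this sketch. */
#ifndef DEMO_NET_NS
#define DEMO_NET_NS 1
#endif

#if DEMO_NET_NS
#define NET_MEMBER(name)	struct net *name
#define read_pnet(pnet)		(*(pnet))
#define write_pnet(pnet, net)	((void)(*(pnet) = (net)))
#else
/* Empty struct: a GNU C extension that occupies no storage. */
#define NET_MEMBER(name)	struct { } name
#define read_pnet(pnet)		(&init_net)
#define write_pnet(pnet, net)	((void)(net))
#endif

/* The structure definition itself needs no #ifdef. */
struct demo_obj {
	int value;
	NET_MEMBER(owner);
};

static inline void demo_obj_init(struct demo_obj *obj, struct net *net,
				 int value)
{
	assert(obj != NULL);
	obj->value = value;
	write_pnet(&obj->owner, net);
}

static inline struct net *demo_obj_net(const struct demo_obj *obj)
{
	return read_pnet(&obj->owner);
}
```

Building with -DDEMO_NET_NS=0 makes the member vanish and every read
return &init_net, while the code that uses struct demo_obj stays
unchanged.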

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10 17:29                       ` David Miller
@ 2015-03-10 17:56                         ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-10 17:56 UTC (permalink / raw)
  To: David Miller
  Cc: ja, edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes,
	ddutt, vipin, shmulik.ladkani, dsahern

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 10 Mar 2015 12:06:32 -0500
>
>> David Miller <davem@davemloft.net> writes:
>> 
>>> From: ebiederm@xmission.com (Eric W. Biederman)
>>> Date: Tue, 10 Mar 2015 01:59:48 -0500
>>>
>>>> If we actually really care about struct net going away it would be
>>>> much better to globally replace struct net with a typedef that looks
>>>> something like:
>>>> 
>>>> #ifdef CONFIG_NET_NS
>>>> struct net_ref {
>>>>        struct net *net;
>>>> };
>>>> #else
>>>> struct net_ref {
>>>> };
>>>> #endif
>>>> typedef struct net_ref net_t;
>>>> 
>>>> That would remove the need for write_pnet and read_pnet, make it
>>>> impossible to forget net_eq and make network namespace arguments to
>>>> functions also boil away at compile time if the network namespace code
>>>> was not enabled.
>>>> 
>>>> That was the original design and I forget why we didn't do that with
>>>> struct net.  But we did not.
>>>
>>> This keeps the ifdefs out of foo.c code, so I like it.
>> 
>> Alright.  It does wind up requiring things like:
>
> Another approach is to use a macro for the instantiation of a "struct
> net *" member.
>
> It could evaluate to "struct { } x;" when NETNS is disabled.
>
> Then you don't need all the special accessors, read_pnet() and
> write_pnet() are sufficient.

Looking at read_pnet we always wrap it in a special accessor that
is appropriate for its type.

But yes all that is really missing from read_pnet and write_pnet
fundamentally is a typedef to make the structure definitions not
require ifdefs.

The part that spooks me from using them is essentially:
net_eq(read_pnet(&tm->tcpm_net), net)

Which is long and ugly and ick.

The classic solution to that is of course to make it:

net_eq(tm_net(tm), net)

Which doesn't look too bad.  And is just a small wrapper around
read_pnet.

And that I can implement without much work.

The invasive bit would be writing something that would require us to use
net_eq to compare network namespaces.  But I can live without that for
now.
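
For illustration, the small wrapper could look roughly like the
user-space sketch below.  The struct is cut down to the one field that
matters here, and read_pnet() and net_eq() are simplified stand-ins that
assume namespaces are compiled in; none of this is the code from the
eventual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types and helpers. */
struct net { int id; };

static inline struct net *read_pnet(struct net * const *pnet)
{
	return *pnet;
}

static inline bool net_eq(const struct net *a, const struct net *b)
{
	return a == b;
}

/* Reduced to the namespace field; the real struct has many more. */
struct tcp_metrics_block {
	struct net *tcpm_net;
};

/* Type-specific accessor: hides read_pnet() from the call sites. */
static inline struct net *tm_net(const struct tcp_metrics_block *tm)
{
	assert(tm != NULL);
	return read_pnet(&tm->tcpm_net);
}
```

A call site then reads net_eq(tm_net(tm), net) and never touches the
field directly.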


Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-10  8:23                   ` Julian Anastasov
@ 2015-03-11  0:58                     ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11  0:58 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern

Julian Anastasov <ja@ssi.bg> writes:

> 	Hello,
>
> On Tue, 10 Mar 2015, Eric W. Biederman wrote:
>
>> Julian Anastasov <ja@ssi.bg> writes:
>> 
>> >>  struct tcp_metrics_block {
>> >>  	struct tcp_metrics_block __rcu	*tcpm_next;
>> >> +	struct net			*tcpm_net;
>> >
>> > 	Is it better if we have this field under
>> > #ifdef CONFIG_NET_NS ? read_pnet and write_pnet do not
>> > care when first argument is not declared for the
>> > !CONFIG_NET_NS case.
>> 
>> I don't actually believe there are many people compiling out network
>> namespaces but there is no point in making it unnecessarily bad for them.
>
> 	Small routers. Also, we can move tcpm_net after
> tcpm_daddr to help machines with 32-byte cache line.
> It would be good if fields and access is in this order:
> tcpm_next, tcpm_daddr, tcpm_saddr and finally tcpm_net.
> tcpm_daddr is the leading key, it was tcpm_addr originally.
> If such change is not suitable for this patchset I can try it
> later after your changes.

I think optimizing the order of the fields and their alignment for
machines with a 32-byte cache line is beyond the scope of my patchset.

My gut says if you are a small router you might want to just turn this
off, because you would not be using many tcp sockets.

For a small machine doing tcp and benefitting from this cache I think
figuring out rhashtables for this data structure has merit.   I don't
know if rhashtables are ready for general use yet.

For machines with 64-byte cache lines placing saddr, daddr, net and next
all in the first 4 fields should be sufficient.

For machines with 32-byte cache lines you are in trouble because even
without struct net in there you need two cache lines as saddr and daddr
are both 20 bytes long.  It is just a mess.
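
The arithmetic can be checked with a small user-space sketch.  The
structures below are hypothetical stand-ins (a 16-byte address plus a
16-bit family, which pads to 20 bytes on common ABIs), and the sketch
assumes each object starts on a cache-line boundary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 20-byte address key: 16 + 2 bytes, padded to 20. */
struct demo_addr {
	uint32_t words[4];
	uint16_t family;
};

/* Stand-in for the leading fields of a metrics block. */
struct demo_block {
	struct demo_block *next;
	struct demo_addr saddr;
	struct demo_addr daddr;
	void *net;
};

/* Number of cache lines of size 'line' covered by bytes [start, end). */
static inline size_t lines_spanned(size_t start, size_t end, size_t line)
{
	assert(line > 0 && end > start);
	return (end - 1) / line - start / line + 1;
}

/* Cache lines touched when reading next, saddr, daddr and net. */
static inline size_t lookup_lines(size_t line)
{
	size_t end = offsetof(struct demo_block, net) + sizeof(void *);

	return lines_spanned(0, end, line);
}
```

The two keys alone occupy 40 bytes, so lines_spanned(0, 40, 32) is 2,
while the four leading fields fit inside one 64-byte line on both 32-bit
and 64-bit layouts.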

>> That said.  If we actually really care about struct net going away it
>> would be much better to globally replace struct net with a typedef that
>> looks something like:
>> 
>> #ifdef CONFIG_NET_NS
>> struct net_ref {
>>        struct net *net;
>> };
>> #else
>> struct net_ref {
>> };
>> #endif
>> typedef struct net_ref net_t;
>> 
>> That would remove the need for write_pnet and read_pnet, make it
>> impossible to forget net_eq and make network namespace arguments to
>> functions also boil away at compile time if the network namespace code
>> was not enabled.
>> 
>> That was the original design and I forget why we didn't do that with
>> struct net.  But we did not.
>
> 	Not sure. Only macros help to avoid ifdefs here
> and there in the code...
>
>> In this specific case an extra member of tcp_metrics_block isn't going
>> to hurt so I don't think write_pnet and read_pnet are particularly
>> interesting.   net_eq definitely seems worthwhile.
>
> 	Lets use them, even neighbour.c uses them.
> Your patchset shows the direction we go, net field to be
> part of the structures and using common tables.
> Macros like write_pnet and read_pnet look the only
> way to save memory on small systems. And net_eq should
> be optimized by compilers too.

net_eq definitely will.  And I have figured out how to have
write_pnet and read_pnet be palatable to me.

I wish I could force the use of net_eq when the network namespace
field might be compiled out but that seems one step too far for now.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2
  2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
                               ` (6 preceding siblings ...)
  2015-03-09 20:09             ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction David Miller
@ 2015-03-11 16:33             ` Eric W. Biederman
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
                                 ` (8 more replies)
  7 siblings, 9 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:33 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


This is a small pile of patches that convert tcp_metrics from using a
hash table per network namespace to using a single hash table for all
network namespaces.

This is broken up into several patches so that each small step along
the way could be carefully scrutinized as I wrote it, and equally so
that each small step can be reviewed.

There are several cleanups included in this series: the death of
hold_net and release_net; the addition of a new type, possible_net_t, to
allow compiling out struct net * pointers in structures without ugly
ifdefs; the addition of panic calls during boot where we cannot handle
failure and where not trying simplifies the code; and the removal of the
return code from tcp_metrics_flush_all.

The motivation for this change is that the tcp_metrics hash table at
128KiB is one of the largest components of a freshly allocated network
namespace.
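
As a rough user-space sketch of the overall approach (not the code in
these patches): one table is shared by every namespace, the namespace is
mixed into the bucket hash, and the lookup verifies the namespace of each
entry.  All names below are hypothetical, and the multiplicative mixing
step is a stand-in chosen for the example, not the kernel's hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS	4
#define TABLE_SIZE	(1u << TABLE_BITS)

/* Hypothetical stand-ins; not the kernel definitions. */
struct net { int id; };

struct entry {
	struct entry *next;
	struct net *net;	/* owning namespace, checked on lookup */
	uint32_t daddr;
	unsigned long value;
};

/* One table for all namespaces instead of one table per namespace. */
static struct entry *table[TABLE_SIZE];

/* Mix the namespace pointer into the hash so namespaces spread out. */
static inline unsigned int bucket_of(const struct net *net, uint32_t daddr)
{
	uint64_t h = (uint64_t)(uintptr_t)net ^ daddr;

	h *= 0x9E3779B97F4A7C15ull;	/* illustrative mixing constant */
	return (unsigned int)(h >> (64 - TABLE_BITS));
}

static inline void insert(struct entry *e)
{
	unsigned int b;

	assert(e != NULL);
	b = bucket_of(e->net, e->daddr);
	e->next = table[b];
	table[b] = e;
}

/* Entries from other namespaces may share a chain, so verify the owner. */
static inline struct entry *lookup(const struct net *net, uint32_t daddr)
{
	struct entry *e;

	for (e = table[bucket_of(net, daddr)]; e; e = e->next) {
		if (e->daddr == daddr && e->net == net)
			return e;
	}
	return NULL;
}
```

Two namespaces can then hold entries for the same destination address
without seeing each other's data, and a new namespace costs no table
memory of its own.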

Eric W. Biederman (6):
      tcp_metrics: panic when tcp_metrics can not be allocated
      tcp_metrics: Mix the network namespace into the hash function.
      tcp_metrics: Add a field tcpm_net and verify it matches on lookup
      tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
      tcp_metrics: Rewrite tcp_metrics_flush_all
      tcp_metrics: Use a single hash table for all network namespaces.

 include/net/netns/ipv4.h |   2 -
 net/ipv4/tcp_metrics.c   | 118 +++++++++++++++++++++++++++++------------------
 2 files changed, 73 insertions(+), 47 deletions(-)

Eric W. Biederman (8):
      net: Kill hold_net release_net
      net: Introduce possible_net_t
      tcp_metrics: panic when tcp_metrics_init fails.
      tcp_metrics: Mix the network namespace into the hash function.
      tcp_metrics: Add a field tcpm_net and verify it matches on lookup
      tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
      tcp_metrics: Rewrite tcp_metrics_flush_all
      tcp_metrics: Use a single hash table for all network namespaces.

 include/linux/netdevice.h            |   9 +--
 include/net/cfg80211.h               |   4 +-
 include/net/fib_rules.h              |   1 -
 include/net/genetlink.h              |   4 +-
 include/net/inet_hashtables.h        |   4 +-
 include/net/ip_vs.h                  |   8 +-
 include/net/neighbour.h              |   8 +-
 include/net/net_namespace.h          |  52 ++++---------
 include/net/netfilter/nf_conntrack.h |   5 +-
 include/net/netns/ipv4.h             |   2 -
 include/net/sock.h                   |   6 +-
 include/net/xfrm.h                   |   8 +-
 net/9p/trans_fd.c                    |   4 +-
 net/core/dev.c                       |   2 -
 net/core/fib_rules.c                 |   8 +-
 net/core/neighbour.c                 |   9 +--
 net/core/net_namespace.c             |  11 ---
 net/core/sock.c                      |   1 -
 net/ipv4/fib_semantics.c             |   3 +-
 net/ipv4/inet_hashtables.c           |   3 +-
 net/ipv4/inet_timewait_sock.c        |   4 +-
 net/ipv4/ipmr.c                      |   4 +-
 net/ipv4/tcp_metrics.c               | 137 +++++++++++++++++++----------------
 net/ipv6/addrlabel.c                 |  11 +--
 net/ipv6/ip6_flowlabel.c             |   3 +-
 net/ipv6/ip6mr.c                     |   4 +-
 net/openvswitch/datapath.c           |   4 +-
 net/openvswitch/datapath.h           |   4 +-
 net/packet/internal.h                |   4 +-
 29 files changed, 121 insertions(+), 206 deletions(-)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
@ 2015-03-11 16:35               ` Eric W. Biederman
  2015-03-11 16:55                 ` Eric Dumazet
                                   ` (2 more replies)
  2015-03-11 16:36               ` [PATCH net-next 2/8] net: Introduce possible_net_t Eric W. Biederman
                                 ` (7 subsequent siblings)
  8 siblings, 3 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:35 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


hold_net and release_net were an idea that didn't work.  Kill them;
it is long past due.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/netdevice.h     |  3 +--
 include/net/fib_rules.h       |  1 -
 include/net/net_namespace.h   | 29 -----------------------------
 include/net/sock.h            |  2 +-
 net/core/dev.c                |  2 --
 net/core/fib_rules.c          |  8 ++------
 net/core/neighbour.c          |  9 ++-------
 net/core/net_namespace.c      | 11 -----------
 net/core/sock.c               |  1 -
 net/ipv4/fib_semantics.c      |  3 +--
 net/ipv4/inet_hashtables.c    |  3 +--
 net/ipv4/inet_timewait_sock.c |  4 ++--
 net/ipv6/addrlabel.c          |  5 +----
 net/ipv6/ip6_flowlabel.c      |  3 +--
 net/openvswitch/datapath.c    |  4 +---
 15 files changed, 13 insertions(+), 75 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 45413784a3b1..b214ba2ebbce 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1863,8 +1863,7 @@ static inline
 void dev_net_set(struct net_device *dev, struct net *net)
 {
 #ifdef CONFIG_NET_NS
-	release_net(dev->nd_net);
-	dev->nd_net = hold_net(net);
+	dev->nd_net = net;
 #endif
 }
 
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index e584de16e4c3..ef94dc379e6f 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -98,7 +98,6 @@ static inline void fib_rule_get(struct fib_rule *rule)
 static inline void fib_rule_put_rcu(struct rcu_head *head)
 {
 	struct fib_rule *rule = container_of(head, struct fib_rule, rcu);
-	release_net(rule->fr_net);
 	kfree(rule);
 }
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 2cb9acb618e9..6fd76650d36f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -49,11 +49,6 @@ struct net {
 	atomic_t		count;		/* To decided when the network
 						 *  namespace should be shut down.
 						 */
-#ifdef NETNS_REFCNT_DEBUG
-	atomic_t		use_count;	/* To track references we
-						 * destroy on demand
-						 */
-#endif
 	spinlock_t		rules_mod_lock;
 
 	struct list_head	list;		/* list of network namespaces */
@@ -234,30 +229,6 @@ int net_eq(const struct net *net1, const struct net *net2)
 #endif
 
 
-#ifdef NETNS_REFCNT_DEBUG
-static inline struct net *hold_net(struct net *net)
-{
-	if (net)
-		atomic_inc(&net->use_count);
-	return net;
-}
-
-static inline void release_net(struct net *net)
-{
-	if (net)
-		atomic_dec(&net->use_count);
-}
-#else
-static inline struct net *hold_net(struct net *net)
-{
-	return net;
-}
-
-static inline void release_net(struct net *net)
-{
-}
-#endif
-
 #ifdef CONFIG_NET_NS
 
 static inline void write_pnet(struct net **pnet, struct net *net)
diff --git a/include/net/sock.h b/include/net/sock.h
index 250822cc1e02..40437e0a94a8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2201,7 +2201,7 @@ static inline void sk_change_net(struct sock *sk, struct net *net)
 
 	if (!net_eq(current_net, net)) {
 		put_net(current_net);
-		sock_net_set(sk, hold_net(net));
+		sock_net_set(sk, net);
 	}
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 962ee9d71964..39fe369b46ad 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6841,8 +6841,6 @@ void free_netdev(struct net_device *dev)
 {
 	struct napi_struct *p, *n;
 
-	release_net(dev_net(dev));
-
 	netif_free_tx_queues(dev);
 #ifdef CONFIG_SYSFS
 	kvfree(dev->_rx);
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 44706e81b2e0..0b204626f784 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -31,7 +31,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 	r->pref = pref;
 	r->table = table;
 	r->flags = flags;
-	r->fr_net = hold_net(ops->fro_net);
+	r->fr_net = ops->fro_net;
 
 	r->suppress_prefixlen = -1;
 	r->suppress_ifgroup = -1;
@@ -116,7 +116,6 @@ static int __fib_rules_register(struct fib_rules_ops *ops)
 		if (ops->family == o->family)
 			goto errout;
 
-	hold_net(net);
 	list_add_tail_rcu(&ops->list, &net->rules_ops);
 	err = 0;
 errout:
@@ -163,9 +162,7 @@ static void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
 static void fib_rules_put_rcu(struct rcu_head *head)
 {
 	struct fib_rules_ops *ops = container_of(head, struct fib_rules_ops, rcu);
-	struct net *net = ops->fro_net;
 
-	release_net(net);
 	kfree(ops);
 }
 
@@ -303,7 +300,7 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 		err = -ENOMEM;
 		goto errout;
 	}
-	rule->fr_net = hold_net(net);
+	rule->fr_net = net;
 
 	if (tb[FRA_PRIORITY])
 		rule->pref = nla_get_u32(tb[FRA_PRIORITY]);
@@ -423,7 +420,6 @@ static int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr* nlh)
 	return 0;
 
 errout_free:
-	release_net(rule->fr_net);
 	kfree(rule);
 errout:
 	rules_ops_put(ops);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ad07990e943d..0e8b32efc031 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -591,7 +591,7 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 	if (!n)
 		goto out;
 
-	write_pnet(&n->net, hold_net(net));
+	write_pnet(&n->net, net);
 	memcpy(n->key, pkey, key_len);
 	n->dev = dev;
 	if (dev)
@@ -600,7 +600,6 @@ struct pneigh_entry * pneigh_lookup(struct neigh_table *tbl,
 	if (tbl->pconstructor && tbl->pconstructor(n)) {
 		if (dev)
 			dev_put(dev);
-		release_net(net);
 		kfree(n);
 		n = NULL;
 		goto out;
@@ -634,7 +633,6 @@ int pneigh_delete(struct neigh_table *tbl, struct net *net, const void *pkey,
 				tbl->pdestructor(n);
 			if (n->dev)
 				dev_put(n->dev);
-			release_net(pneigh_net(n));
 			kfree(n);
 			return 0;
 		}
@@ -657,7 +655,6 @@ static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev)
 					tbl->pdestructor(n);
 				if (n->dev)
 					dev_put(n->dev);
-				release_net(pneigh_net(n));
 				kfree(n);
 				continue;
 			}
@@ -1428,11 +1425,10 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
 		dev_hold(dev);
 		p->dev = dev;
-		write_pnet(&p->net, hold_net(net));
+		write_pnet(&p->net, net);
 		p->sysctl_table = NULL;
 
 		if (ops->ndo_neigh_setup && ops->ndo_neigh_setup(dev, p)) {
-			release_net(net);
 			dev_put(dev);
 			kfree(p);
 			return NULL;
@@ -1472,7 +1468,6 @@ EXPORT_SYMBOL(neigh_parms_release);
 
 static void neigh_parms_destroy(struct neigh_parms *parms)
 {
-	release_net(neigh_parms_net(parms));
 	kfree(parms);
 }
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index cb5290b8c428..e5e96b0f6717 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -236,10 +236,6 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
 
-#ifdef NETNS_REFCNT_DEBUG
-	atomic_set(&net->use_count, 0);
-#endif
-
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
 		if (error < 0)
@@ -294,13 +290,6 @@ out_free:
 
 static void net_free(struct net *net)
 {
-#ifdef NETNS_REFCNT_DEBUG
-	if (unlikely(atomic_read(&net->use_count) != 0)) {
-		pr_emerg("network namespace not free! Usage: %d\n",
-			 atomic_read(&net->use_count));
-		return;
-	}
-#endif
 	kfree(rcu_access_pointer(net->gen));
 	kmem_cache_free(net_cachep, net);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 726e1f99aa8d..cb5cf93683c8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1455,7 +1455,6 @@ void sk_release_kernel(struct sock *sk)
 
 	sock_hold(sk);
 	sock_release(sk->sk_socket);
-	release_net(sock_net(sk));
 	sock_net_set(sk, get_net(&init_net));
 	sock_put(sk);
 }
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c6d267442dac..66c1e4fbf884 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -213,7 +213,6 @@ static void free_fib_info_rcu(struct rcu_head *head)
 		rt_fibinfo_free(&nexthop_nh->nh_rth_input);
 	} endfor_nexthops(fi);
 
-	release_net(fi->fib_net);
 	if (fi->fib_metrics != (u32 *) dst_default_metrics)
 		kfree(fi->fib_metrics);
 	kfree(fi);
@@ -814,7 +813,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 	} else
 		fi->fib_metrics = (u32 *) dst_default_metrics;
 
-	fi->fib_net = hold_net(net);
+	fi->fib_net = net;
 	fi->fib_protocol = cfg->fc_protocol;
 	fi->fib_scope = cfg->fc_scope;
 	fi->fib_flags = cfg->fc_flags;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 9111a4e22155..f6a12b97d12b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -61,7 +61,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb != NULL) {
-		write_pnet(&tb->ib_net, hold_net(net));
+		write_pnet(&tb->ib_net, net);
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
@@ -79,7 +79,6 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket
 {
 	if (hlist_empty(&tb->owners)) {
 		__hlist_del(&tb->node);
-		release_net(ib_net(tb));
 		kmem_cache_free(cachep, tb);
 	}
 }
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 6d592f8555fb..e0d5c8327159 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -98,7 +98,7 @@ void inet_twsk_free(struct inet_timewait_sock *tw)
 #ifdef SOCK_REFCNT_DEBUG
 	pr_debug("%s timewait_sock %p released\n", tw->tw_prot->name, tw);
 #endif
-	release_net(twsk_net(tw));
+	twsk_net(tw);
 	kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
 	module_put(owner);
 }
@@ -195,7 +195,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat
 		tw->tw_ipv6only	    = 0;
 		tw->tw_transparent  = inet->transparent;
 		tw->tw_prot	    = sk->sk_prot_creator;
-		twsk_net_set(tw, hold_net(sock_net(sk)));
+		twsk_net_set(tw, sock_net(sk));
 		/*
 		 * Because we use RCU lookups, we should not set tw_refcnt
 		 * to a non null value before everything is setup for this
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index e43e79d0a612..59c793040498 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -129,9 +129,6 @@ static const __net_initconst struct ip6addrlbl_init_table
 /* Object management */
 static inline void ip6addrlbl_free(struct ip6addrlbl_entry *p)
 {
-#ifdef CONFIG_NET_NS
-	release_net(p->lbl_net);
-#endif
 	kfree(p);
 }
 
@@ -241,7 +238,7 @@ static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
 	newp->label = label;
 	INIT_HLIST_NODE(&newp->list);
 #ifdef CONFIG_NET_NS
-	newp->lbl_net = hold_net(net);
+	newp->lbl_net = net;
 #endif
 	atomic_set(&newp->refcnt, 1);
 	return newp;
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index f45d6db50a45..457303886fd4 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -100,7 +100,6 @@ static void fl_free(struct ip6_flowlabel *fl)
 	if (fl) {
 		if (fl->share == IPV6_FL_S_PROCESS)
 			put_pid(fl->owner.pid);
-		release_net(fl->fl_net);
 		kfree(fl->opt);
 		kfree_rcu(fl, rcu);
 	}
@@ -403,7 +402,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq,
 		}
 	}
 
-	fl->fl_net = hold_net(net);
+	fl->fl_net = net;
 	fl->expires = jiffies;
 	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires);
 	if (err)
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 5bae7243c577..096c6276e6b9 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -203,7 +203,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
 
 	ovs_flow_tbl_destroy(&dp->table);
 	free_percpu(dp->stats_percpu);
-	release_net(ovs_dp_get_net(dp));
 	kfree(dp->ports);
 	kfree(dp);
 }
@@ -1501,7 +1500,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (dp == NULL)
 		goto err_free_reply;
 
-	ovs_dp_set_net(dp, hold_net(sock_net(skb->sk)));
+	ovs_dp_set_net(dp, sock_net(skb->sk));
 
 	/* Allocate table. */
 	err = ovs_flow_tbl_init(&dp->table);
@@ -1575,7 +1574,6 @@ err_destroy_percpu:
 err_destroy_table:
 	ovs_flow_tbl_destroy(&dp->table);
 err_free_dp:
-	release_net(ovs_dp_get_net(dp));
 	kfree(dp);
 err_free_reply:
 	kfree_skb(reply);
-- 
2.2.1

* [PATCH net-next 2/8] net: Introduce possible_net_t
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
@ 2015-03-11 16:36               ` Eric W. Biederman
  2015-03-11 16:38               ` [PATCH net-next 3/8] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
                                 ` (6 subsequent siblings)
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:36 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


Having to say
> #ifdef CONFIG_NET_NS
> 	struct net *net;
> #endif

in structures is a little bit wordy and a little bit error prone.

Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
>       struct net *net;
> #endif
> } possible_net_t;

And then in a header say:

> 	possible_net_t net;

Which is cleaner and easier to use and easier to test, as the
possible_net_t is always there no matter what the compile options.

Further, this allows read_pnet and write_pnet to be functions in all
cases, which is better at catching typos.

This change adds possible_net_t, updates the definitions of read_pnet
and write_pnet, updates the optional struct net * fields that
write_pnet is used on to have the type possible_net_t, and finally
fixes up the broken users of read_pnet and write_pnet.
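
As an illustration only, the pattern can be tried outside the kernel
with a small userspace sketch. The stand-in struct net, the init_net
object and the example_entry structure below are assumptions made for
this example; only possible_net_t, read_pnet and write_pnet mirror the
names used by this patch. CONFIG_NET_NS is defined by hand here;
without it the struct is empty, which relies on a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>

#define CONFIG_NET_NS 1		/* remove to mimic a build without namespaces */

/* Userspace stand-ins, assumed for this sketch only. */
struct net { int id; };
struct net init_net = { 0 };

/* The field is present only when namespaces are compiled in. */
typedef struct {
#ifdef CONFIG_NET_NS
	struct net *net;
#endif
} possible_net_t;

/* Always a real function, so a wrong argument type fails to compile. */
static inline void write_pnet(possible_net_t *pnet, struct net *net)
{
	assert(pnet != NULL);
#ifdef CONFIG_NET_NS
	pnet->net = net;
#else
	(void)net;
#endif
}

static inline struct net *read_pnet(const possible_net_t *pnet)
{
	assert(pnet != NULL);
#ifdef CONFIG_NET_NS
	return pnet->net;
#else
	return &init_net;
#endif
}

/* Hypothetical structure: the member needs no #ifdef of its own. */
struct example_entry {
	possible_net_t net;
	int value;
};

static inline struct net *example_net(const struct example_entry *e)
{
	return read_pnet(&e->net);
}
```

The point of the sketch is that example_entry and example_net compile
unchanged in both configurations; only the three definitions at the
top know about CONFIG_NET_NS.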

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/netdevice.h            |  8 ++------
 include/net/cfg80211.h               |  4 +---
 include/net/genetlink.h              |  4 +---
 include/net/inet_hashtables.h        |  4 +---
 include/net/ip_vs.h                  |  8 ++++----
 include/net/neighbour.h              |  8 ++------
 include/net/net_namespace.h          | 23 +++++++++++++----------
 include/net/netfilter/nf_conntrack.h |  5 ++---
 include/net/sock.h                   |  4 +---
 include/net/xfrm.h                   |  8 ++------
 net/9p/trans_fd.c                    |  4 ++--
 net/ipv4/ipmr.c                      |  4 +---
 net/ipv6/addrlabel.c                 |  8 ++------
 net/ipv6/ip6mr.c                     |  4 +---
 net/openvswitch/datapath.h           |  4 +---
 net/packet/internal.h                |  4 +---
 16 files changed, 37 insertions(+), 67 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b214ba2ebbce..d2a678f983c0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1720,9 +1720,7 @@ struct net_device {
 	struct netpoll_info __rcu	*npinfo;
 #endif
 
-#ifdef CONFIG_NET_NS
-	struct net		*nd_net;
-#endif
+	possible_net_t			nd_net;
 
 	/* mid-layer private */
 	union {
@@ -1862,9 +1860,7 @@ struct net *dev_net(const struct net_device *dev)
 static inline
 void dev_net_set(struct net_device *dev, struct net *net)
 {
-#ifdef CONFIG_NET_NS
-	dev->nd_net = net;
-#endif
+	write_pnet(&dev->nd_net, net);
 }
 
 static inline bool netdev_uses_dsa(struct net_device *dev)
diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 64e09e1e8099..f977abec07f6 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -3183,10 +3183,8 @@ struct wiphy {
 	const struct ieee80211_ht_cap *ht_capa_mod_mask;
 	const struct ieee80211_vht_cap *vht_capa_mod_mask;
 
-#ifdef CONFIG_NET_NS
 	/* the network namespace this phy lives in currently */
-	struct net *_net;
-#endif
+	possible_net_t _net;
 
 #ifdef CONFIG_CFG80211_WEXT
 	const struct iw_handler_def *wext;
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 0574abd3db86..a9af1cc8c1bc 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -92,9 +92,7 @@ struct genl_info {
 	struct genlmsghdr *	genlhdr;
 	void *			userhdr;
 	struct nlattr **	attrs;
-#ifdef CONFIG_NET_NS
-	struct net *		_net;
-#endif
+	possible_net_t		_net;
 	void *			user_ptr[2];
 	struct sock *		dst_sk;
 };
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index dd1950a7e273..bcd64756e5fe 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -76,9 +76,7 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)	-DaveM
  */
 struct inet_bind_bucket {
-#ifdef CONFIG_NET_NS
-	struct net		*ib_net;
-#endif
+	possible_net_t		ib_net;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 20fd23398537..4e3731ee4eac 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -47,13 +47,13 @@ static inline struct net *skb_net(const struct sk_buff *skb)
 	 * Start with the most likely hit
 	 * End with BUG
 	 */
-	if (likely(skb->dev && skb->dev->nd_net))
+	if (likely(skb->dev && dev_net(skb->dev)))
 		return dev_net(skb->dev);
 	if (skb_dst(skb) && skb_dst(skb)->dev)
 		return dev_net(skb_dst(skb)->dev);
 	WARN(skb->sk, "Maybe skb_sknet should be used in %s() at line:%d\n",
 		      __func__, __LINE__);
-	if (likely(skb->sk && skb->sk->sk_net))
+	if (likely(skb->sk && sock_net(skb->sk)))
 		return sock_net(skb->sk);
 	pr_err("There is no net ptr to find in the skb in %s() line:%d\n",
 		__func__, __LINE__);
@@ -71,11 +71,11 @@ static inline struct net *skb_sknet(const struct sk_buff *skb)
 #ifdef CONFIG_NET_NS
 #ifdef CONFIG_IP_VS_DEBUG
 	/* Start with the most likely hit */
-	if (likely(skb->sk && skb->sk->sk_net))
+	if (likely(skb->sk && sock_net(skb->sk)))
 		return sock_net(skb->sk);
 	WARN(skb->dev, "Maybe skb_net should be used instead in %s() line:%d\n",
 		       __func__, __LINE__);
-	if (likely(skb->dev && skb->dev->nd_net))
+	if (likely(skb->dev && dev_net(skb->dev)))
 		return dev_net(skb->dev);
 	pr_err("There is no net ptr to find in the skb in %s() line:%d\n",
 		__func__, __LINE__);
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index d48b8ec8b5f4..e7bdf5170802 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -65,9 +65,7 @@ enum {
 };
 
 struct neigh_parms {
-#ifdef CONFIG_NET_NS
-	struct net *net;
-#endif
+	possible_net_t net;
 	struct net_device *dev;
 	struct list_head list;
 	int	(*neigh_setup)(struct neighbour *);
@@ -167,9 +165,7 @@ struct neigh_ops {
 
 struct pneigh_entry {
 	struct pneigh_entry	*next;
-#ifdef CONFIG_NET_NS
-	struct net		*net;
-#endif
+	possible_net_t		net;
 	struct net_device	*dev;
 	u8			flags;
 	u8			key[0];
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 6fd76650d36f..c2bedcc3ab43 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -229,24 +229,27 @@ int net_eq(const struct net *net1, const struct net *net2)
 #endif
 
 
+typedef struct {
 #ifdef CONFIG_NET_NS
+	struct net *net;
+#endif
+} possible_net_t;
 
-static inline void write_pnet(struct net **pnet, struct net *net)
+static inline void write_pnet(possible_net_t *pnet, struct net *net)
 {
-	*pnet = net;
+#ifdef CONFIG_NET_NS
+	pnet->net = net;
+#endif
 }
 
-static inline struct net *read_pnet(struct net * const *pnet)
+static inline struct net *read_pnet(const possible_net_t *pnet)
 {
-	return *pnet;
-}
-
+#ifdef CONFIG_NET_NS
+	return pnet->net;
 #else
-
-#define write_pnet(pnet, net)	do { (void)(net);} while (0)
-#define read_pnet(pnet)		(&init_net)
-
+	return &init_net;
 #endif
+}
 
 #define for_each_net(VAR)				\
 	list_for_each_entry(VAR, &net_namespace_list, list)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 74f271a172dd..095433b8a8b0 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -95,9 +95,8 @@ struct nf_conn {
 	/* Timer function; drops refcnt when it goes off. */
 	struct timer_list timeout;
 
-#ifdef CONFIG_NET_NS
-	struct net *ct_net;
-#endif
+	possible_net_t ct_net;
+
 	/* all members below initialized via memset */
 	u8 __nfct_init_offset[0];
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 40437e0a94a8..1d3bbd6bdc71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,9 +190,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_portaddr_node;
 	};
 	struct proto		*skc_prot;
-#ifdef CONFIG_NET_NS
-	struct net	 	*skc_net;
-#endif
+	possible_net_t		skc_net;
 
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr		skc_v6_daddr;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index dc4865e90fe4..d0ac7d7be8a7 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -126,9 +126,7 @@ struct xfrm_state_walk {
 
 /* Full description of state of transformer. */
 struct xfrm_state {
-#ifdef CONFIG_NET_NS
-	struct net		*xs_net;
-#endif
+	possible_net_t		xs_net;
 	union {
 		struct hlist_node	gclist;
 		struct hlist_node	bydst;
@@ -522,9 +520,7 @@ struct xfrm_policy_queue {
 };
 
 struct xfrm_policy {
-#ifdef CONFIG_NET_NS
-	struct net		*xp_net;
-#endif
+	possible_net_t		xp_net;
 	struct hlist_node	bydst;
 	struct hlist_node	byidx;
 
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 80d08f6664cb..3e3d82d8ff70 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -940,7 +940,7 @@ p9_fd_create_tcp(struct p9_client *client, const char *addr, char *args)
 	sin_server.sin_family = AF_INET;
 	sin_server.sin_addr.s_addr = in_aton(addr);
 	sin_server.sin_port = htons(opts.port);
-	err = __sock_create(read_pnet(&current->nsproxy->net_ns), PF_INET,
+	err = __sock_create(current->nsproxy->net_ns, PF_INET,
 			    SOCK_STREAM, IPPROTO_TCP, &csocket, 1);
 	if (err) {
 		pr_err("%s (%d): problem creating socket\n",
@@ -988,7 +988,7 @@ p9_fd_create_unix(struct p9_client *client, const char *addr, char *args)
 
 	sun_server.sun_family = PF_UNIX;
 	strcpy(sun_server.sun_path, addr);
-	err = __sock_create(read_pnet(&current->nsproxy->net_ns), PF_UNIX,
+	err = __sock_create(current->nsproxy->net_ns, PF_UNIX,
 			    SOCK_STREAM, 0, &csocket, 1);
 	if (err < 0) {
 		pr_err("%s (%d): problem creating socket\n",
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9d78427652d2..5b188832800f 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -73,9 +73,7 @@
 
 struct mr_table {
 	struct list_head	list;
-#ifdef CONFIG_NET_NS
-	struct net		*net;
-#endif
+	possible_net_t		net;
 	u32			id;
 	struct sock __rcu	*mroute_sk;
 	struct timer_list	ipmr_expire_timer;
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 59c793040498..3cc50e2d3bf5 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -29,9 +29,7 @@
  * Policy Table
  */
 struct ip6addrlbl_entry {
-#ifdef CONFIG_NET_NS
-	struct net *lbl_net;
-#endif
+	possible_net_t lbl_net;
 	struct in6_addr prefix;
 	int prefixlen;
 	int ifindex;
@@ -237,9 +235,7 @@ static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
 	newp->addrtype = addrtype;
 	newp->label = label;
 	INIT_HLIST_NODE(&newp->list);
-#ifdef CONFIG_NET_NS
-	newp->lbl_net = net;
-#endif
+	write_pnet(&newp->lbl_net, net);
 	atomic_set(&newp->refcnt, 1);
 	return newp;
 }
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 34b682617f50..4b9315aa273e 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -56,9 +56,7 @@
 
 struct mr6_table {
 	struct list_head	list;
-#ifdef CONFIG_NET_NS
-	struct net		*net;
-#endif
+	possible_net_t		net;
 	u32			id;
 	struct sock		*mroute6_sk;
 	struct timer_list	ipmr_expire_timer;
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 3ece94563079..4ec4a480b147 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -84,10 +84,8 @@ struct datapath {
 	/* Stats. */
 	struct dp_stats_percpu __percpu *stats_percpu;
 
-#ifdef CONFIG_NET_NS
 	/* Network namespace ref. */
-	struct net *net;
-#endif
+	possible_net_t net;
 
 	u32 user_features;
 };
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a30399..fe6e20caea1d 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -74,9 +74,7 @@ extern struct mutex fanout_mutex;
 #define PACKET_FANOUT_MAX	256
 
 struct packet_fanout {
-#ifdef CONFIG_NET_NS
-	struct net		*net;
-#endif
+	possible_net_t		net;
 	unsigned int		num_members;
 	u16			id;
 	u8			type;
-- 
2.2.1

* [PATCH net-next 3/8] tcp_metrics: panic when tcp_metrics_init fails.
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
  2015-03-11 16:36               ` [PATCH net-next 2/8] net: Introduce possible_net_t Eric W. Biederman
@ 2015-03-11 16:38               ` Eric W. Biederman
  2015-03-11 16:38               ` [PATCH net-next 4/8] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
                                 ` (5 subsequent siblings)
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:38 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


There is no practical way to clean up during boot, so
just panic if there is a problem initializing tcp_metrics.

That will at least give us a clear place to start debugging
if something does go wrong.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index e5f41bd5ec1b..4206b14d956d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1175,16 +1175,10 @@ void __init tcp_metrics_init(void)
 
 	ret = register_pernet_subsys(&tcp_net_metrics_ops);
 	if (ret < 0)
-		goto cleanup;
+		panic("Could not allocate the tcp_metrics hash table\n");
+
 	ret = genl_register_family_with_ops(&tcp_metrics_nl_family,
 					    tcp_metrics_nl_ops);
 	if (ret < 0)
-		goto cleanup_subsys;
-	return;
-
-cleanup_subsys:
-	unregister_pernet_subsys(&tcp_net_metrics_ops);
-
-cleanup:
-	return;
+		panic("Could not register tcp_metrics generic netlink\n");
 }
-- 
2.2.1

* [PATCH net-next 4/8] tcp_metrics: Mix the network namespace into the hash function.
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (2 preceding siblings ...)
  2015-03-11 16:38               ` [PATCH net-next 3/8] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
@ 2015-03-11 16:38               ` Eric W. Biederman
  2015-03-11 16:40               ` [PATCH net-next 5/8] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
                                 ` (4 subsequent siblings)
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:38 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


In preparation for using one hash table for all network namespaces,
mix the network namespace into the hash value.
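
A rough userspace sketch of the idea follows. The mix field, the
multiplier and the helper names hash_32_sketch and metrics_bucket are
assumptions for illustration and are not taken from the kernel; the
only step that corresponds to this patch is the XOR before the final
fold.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct net; "mix" is a hypothetical per-namespace value
 * playing the role of what net_hash_mix() would return. */
struct net { uint32_t mix; };

/* Multiplicative fold of a 32-bit value down to "bits" bits.  The
 * constant is an assumption for this sketch. */
static uint32_t hash_32_sketch(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * UINT32_C(0x61C88647)) >> (32 - bits);
}

/* Bucket index for an address hash looked up in a given namespace. */
static uint32_t metrics_bucket(uint32_t addr_hash, const struct net *net,
			       unsigned int hash_log)
{
	uint32_t hash = addr_hash;

	hash ^= net->mix;	/* fold the namespace into the key */
	return hash_32_sketch(hash, hash_log);
}
```

With the namespace folded in, identical addresses seen in different
namespaces tend to spread over different buckets instead of piling up
on one chain once the table is shared.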

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 4206b14d956d..fbb42f44501e 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -252,6 +252,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 	}
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -299,6 +300,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 		return NULL;
 
 	net = twsk_net(tw);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -347,6 +349,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		return NULL;
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
@@ -994,6 +997,7 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	if (!reply)
 		goto nla_put_failure;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
@@ -1070,6 +1074,7 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	if (ret < 0)
 		src = false;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	hb = net->ipv4.tcp_metrics_hash + hash;
 	pp = &hb->chain;
-- 
2.2.1

* [PATCH net-next 5/8] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (3 preceding siblings ...)
  2015-03-11 16:38               ` [PATCH net-next 4/8] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
@ 2015-03-11 16:40               ` Eric W. Biederman
  2015-03-11 16:41               ` [PATCH net-next 6/8] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
                                 ` (3 subsequent siblings)
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:40 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


In preparation for using one tcp metrics hash table for all network
namespaces, add a field tcpm_net to struct tcp_metrics_block, and
verify that field on all hash table lookups.

Make the field tcpm_net of type possible_net_t so it takes no space
when network namespaces are disabled.

Further add a function tm_net to read that field so we can be
efficient when network namespaces are disabled and concise
the rest of the time.
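
The lookup rule can be sketched in userspace as below. The
metrics_entry structure and chain_lookup helper are hypothetical
simplifications (one IPv4 destination, no RCU, no source address); the
part that mirrors this patch is the extra net_eq() term in the match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct net { int id; };		/* stand-in for the kernel's struct net */

/* Hypothetical entry, loosely modelled on struct tcp_metrics_block. */
struct metrics_entry {
	struct metrics_entry *next;
	const struct net *net;	/* plays the role of tcpm_net */
	uint32_t daddr;
	int rtt;
};

static int net_eq(const struct net *a, const struct net *b)
{
	return a == b;
}

/* Walk one chain; an entry matches only if both the address and the
 * owning namespace agree. */
static struct metrics_entry *chain_lookup(struct metrics_entry *head,
					  uint32_t daddr,
					  const struct net *net)
{
	struct metrics_entry *tm;

	assert(net != NULL);
	for (tm = head; tm; tm = tm->next) {
		if (tm->daddr == daddr && net_eq(tm->net, net))
			break;
	}
	return tm;		/* NULL when nothing matched */
}
```

Without the namespace check, two namespaces that talk to the same
destination address would see each other's cached metrics as soon as
they share a chain.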

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index fbb42f44501e..461c3d2e1ca4 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -40,6 +40,7 @@ struct tcp_fastopen_metrics {
 
 struct tcp_metrics_block {
 	struct tcp_metrics_block __rcu	*tcpm_next;
+	possible_net_t			tcpm_net;
 	struct inetpeer_addr		tcpm_saddr;
 	struct inetpeer_addr		tcpm_daddr;
 	unsigned long			tcpm_stamp;
@@ -52,6 +53,11 @@ struct tcp_metrics_block {
 	struct rcu_head			rcu_head;
 };
 
+static inline struct net *tm_net(struct tcp_metrics_block *tm)
+{
+	return read_pnet(&tm->tcpm_net);
+}
+
 static bool tcp_metric_locked(struct tcp_metrics_block *tm,
 			      enum tcp_metric_index idx)
 {
@@ -183,6 +189,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 		if (!tm)
 			goto out_unlock;
 	}
+	write_pnet(&tm->tcpm_net, net);
 	tm->tcpm_saddr = *saddr;
 	tm->tcpm_daddr = *daddr;
 
@@ -217,7 +224,8 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, saddr) &&
-		    addr_same(&tm->tcpm_daddr, daddr))
+		    addr_same(&tm->tcpm_daddr, daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 		depth++;
 	}
@@ -258,7 +266,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
-		    addr_same(&tm->tcpm_daddr, &daddr))
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 	}
 	tcpm_check_stamp(tm, dst);
@@ -306,7 +315,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
-		    addr_same(&tm->tcpm_daddr, &daddr))
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 	}
 	return tm;
@@ -912,6 +922,8 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
 		     tm = rcu_dereference(tm->tcpm_next), col++) {
+			if (!net_eq(tm_net(tm), net))
+				continue;
 			if (col < s_col)
 				continue;
 			if (tcp_metrics_dump_info(skb, cb, tm) < 0) {
@@ -1004,7 +1016,8 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
-		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
+		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
+		    net_eq(tm_net(tm), net)) {
 			ret = tcp_metrics_fill_info(msg, tm);
 			break;
 		}
@@ -1081,7 +1094,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
-		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
+		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
+		    net_eq(tm_net(tm), net)) {
 			*pp = tm->tcpm_next;
 			kfree_rcu(tm, rcu_head);
 			found = true;
-- 
2.2.1

* [PATCH net-next 6/8] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (4 preceding siblings ...)
  2015-03-11 16:40               ` [PATCH net-next 5/8] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
@ 2015-03-11 16:41               ` Eric W. Biederman
  2015-03-11 16:43               ` [PATCH net-next 7/8] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
                                 ` (2 subsequent siblings)
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:41 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


tcp_metrics_flush_all always returns 0.  Remove the unnecessary return code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 461c3d2e1ca4..0d07e14f2ca5 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1043,7 +1043,7 @@ out_free:
 
 #define deref_genl(p)	rcu_dereference_protected(p, lockdep_genl_is_held())
 
-static int tcp_metrics_flush_all(struct net *net)
+static void tcp_metrics_flush_all(struct net *net)
 {
 	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
 	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
@@ -1064,7 +1064,6 @@ static int tcp_metrics_flush_all(struct net *net)
 			tm = next;
 		}
 	}
-	return 0;
 }
 
 static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
@@ -1081,8 +1080,10 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ret = parse_nl_addr(info, &daddr, &hash, 1);
 	if (ret < 0)
 		return ret;
-	if (ret > 0)
-		return tcp_metrics_flush_all(net);
+	if (ret > 0) {
+		tcp_metrics_flush_all(net);
+		return 0;
+	}
 	ret = parse_nl_saddr(info, &saddr);
 	if (ret < 0)
 		src = false;
-- 
2.2.1

* [PATCH net-next 7/8] tcp_metrics: Rewrite tcp_metrics_flush_all
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (5 preceding siblings ...)
  2015-03-11 16:41               ` [PATCH net-next 6/8] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-11 16:43               ` Eric W. Biederman
  2015-03-11 16:43               ` [PATCH net-next 8/8] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:43 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


Rewrite tcp_metrics_flush_all so that it can cope with entries from
different network namespaces on its hash chain.

This is based on the logic in tcp_metrics_nl_cmd_del for deleting a
selection of entries from a tcp metrics hash chain.
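
The pointer-to-pointer walk used for that selective removal can be
sketched in userspace as follows. The names metrics_entry, chain_push
and chain_flush_net are hypothetical, locking is omitted, and plain
free() stands in for the kfree_rcu() the kernel code uses.

```c
#include <assert.h>
#include <stdlib.h>

struct net { int id; };		/* stand-in for the kernel's struct net */

struct metrics_entry {		/* hypothetical, reduced to what is needed */
	struct metrics_entry *next;
	const struct net *net;
};

/* Push a new entry for "net" onto the front of the chain. */
static int chain_push(struct metrics_entry **head, const struct net *net)
{
	struct metrics_entry *tm = malloc(sizeof(*tm));

	if (!tm)
		return -1;
	tm->net = net;
	tm->next = *head;
	*head = tm;
	return 0;
}

/* Unlink and free every entry owned by "net"; entries belonging to
 * other namespaces stay on the chain.  Returns the number removed. */
static int chain_flush_net(struct metrics_entry **head, const struct net *net)
{
	struct metrics_entry **pp = head;
	struct metrics_entry *tm;
	int removed = 0;

	assert(head != NULL);
	while ((tm = *pp) != NULL) {
		if (tm->net == net) {
			*pp = tm->next;	/* unlink; pp now names the successor */
			free(tm);
			removed++;
		} else {
			pp = &tm->next;	/* keep this entry and move on */
		}
	}
	return removed;
}
```

Because pp always points at the link that leads to the current entry,
removing the head and removing a middle entry are the same operation,
so no special case is needed for the first element.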

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 0d07e14f2ca5..baccb070427d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1051,18 +1051,19 @@ static void tcp_metrics_flush_all(struct net *net)
 	unsigned int row;
 
 	for (row = 0; row < max_rows; row++, hb++) {
+		struct tcp_metrics_block __rcu **pp;
 		spin_lock_bh(&tcp_metrics_lock);
-		tm = deref_locked_genl(hb->chain);
-		if (tm)
-			hb->chain = NULL;
-		spin_unlock_bh(&tcp_metrics_lock);
-		while (tm) {
-			struct tcp_metrics_block *next;
-
-			next = deref_genl(tm->tcpm_next);
-			kfree_rcu(tm, rcu_head);
-			tm = next;
+		pp = &hb->chain;
+		for (tm = deref_locked_genl(*pp); tm;
+		     tm = deref_locked_genl(*pp)) {
+			if (net_eq(tm_net(tm), net)) {
+				*pp = tm->tcpm_next;
+				kfree_rcu(tm, rcu_head);
+			} else {
+				pp = &tm->tcpm_next;
+			}
 		}
+		spin_unlock_bh(&tcp_metrics_lock);
 	}
 }
 
-- 
2.2.1

* [PATCH net-next 8/8] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (6 preceding siblings ...)
  2015-03-11 16:43               ` [PATCH net-next 7/8] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-11 16:43               ` Eric W. Biederman
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
  8 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 16:43 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


Now that all of the operations are safe on a single hash table across
network namespaces, allocate a single global hash table and update the
code to use it.
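
As a rough userspace illustration of the "allocate once, share everywhere"
idea (not the kernel code itself): struct ns, init_ns, table, table_log and
ns_init below are hypothetical stand-ins for struct net, init_net and the new
file-scope variables, and calloc() stands in for the kzalloc()/vzalloc() pair.
The fixed table size is purely illustrative.

```c
#include <assert.h>
#include <stdlib.h>

struct ns { int id; };			/* stand-in for struct net */
struct bucket { void *chain; };		/* stand-in for tcpm_hash_bucket */

static struct ns init_ns;		/* stand-in for init_net */
static struct bucket *table;		/* one table shared by all namespaces */
static unsigned int table_log;

/* Per-namespace init hook: only the initial namespace allocates the
 * shared table; every other namespace returns success without allocating. */
static int ns_init(struct ns *ns)
{
	if (ns != &init_ns)
		return 0;

	table_log = 13;			/* 8192 buckets, illustrative only */
	table = calloc((size_t)1 << table_log, sizeof(*table));
	return table ? 0 : -1;
}
```

With this shape, creating another namespace costs nothing for the table;
the price is that every lookup and flush has to filter by owner.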

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h |  2 --
 net/ipv4/tcp_metrics.c   | 66 ++++++++++++++++++++++--------------------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8f3a1a1a5a94..614a49be68a9 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -54,8 +54,6 @@ struct netns_ipv4 {
 	struct sock		*mc_autojoin_sk;
 
 	struct inet_peer_base	*peers;
-	struct tcpm_hash_bucket	*tcp_metrics_hash;
-	unsigned int		tcp_metrics_hash_log;
 	struct sock  * __percpu	*tcp_sk;
 	struct netns_frags	frags;
 #ifdef CONFIG_NETFILTER
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index baccb070427d..366728cbee4a 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -97,6 +97,9 @@ struct tcpm_hash_bucket {
 	struct tcp_metrics_block __rcu	*chain;
 };
 
+static struct tcpm_hash_bucket	*tcp_metrics_hash __read_mostly;
+static unsigned int		tcp_metrics_hash_log __read_mostly;
+
 static DEFINE_SPINLOCK(tcp_metrics_lock);
 
 static void tcpm_suck_dst(struct tcp_metrics_block *tm,
@@ -177,7 +180,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	if (unlikely(reclaim)) {
 		struct tcp_metrics_block *oldest;
 
-		oldest = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain);
+		oldest = rcu_dereference(tcp_metrics_hash[hash].chain);
 		for (tm = rcu_dereference(oldest->tcpm_next); tm;
 		     tm = rcu_dereference(tm->tcpm_next)) {
 			if (time_before(tm->tcpm_stamp, oldest->tcpm_stamp))
@@ -196,8 +199,8 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	tcpm_suck_dst(tm, dst, true);
 
 	if (likely(!reclaim)) {
-		tm->tcpm_next = net->ipv4.tcp_metrics_hash[hash].chain;
-		rcu_assign_pointer(net->ipv4.tcp_metrics_hash[hash].chain, tm);
+		tm->tcpm_next = tcp_metrics_hash[hash].chain;
+		rcu_assign_pointer(tcp_metrics_hash[hash].chain, tm);
 	}
 
 out_unlock:
@@ -221,7 +224,7 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 	struct tcp_metrics_block *tm;
 	int depth = 0;
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, saddr) &&
 		    addr_same(&tm->tcpm_daddr, daddr) &&
@@ -261,9 +264,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr) &&
@@ -310,9 +313,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 
 	net = twsk_net(tw);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr) &&
@@ -360,7 +363,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
 	if (tm == TCP_METRICS_RECLAIM_PTR)
@@ -911,13 +914,13 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 			       struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
 	unsigned int row, s_row = cb->args[0];
 	int s_col = cb->args[1], col = s_col;
 
 	for (row = s_row; row < max_rows; row++, s_col = 0) {
 		struct tcp_metrics_block *tm;
-		struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash + row;
+		struct tcpm_hash_bucket *hb = tcp_metrics_hash + row;
 
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
@@ -1010,10 +1013,10 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto nla_put_failure;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
 		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
@@ -1045,8 +1048,8 @@ out_free:
 
 static void tcp_metrics_flush_all(struct net *net)
 {
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
-	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
+	struct tcpm_hash_bucket *hb = tcp_metrics_hash;
 	struct tcp_metrics_block *tm;
 	unsigned int row;
 
@@ -1090,8 +1093,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		src = false;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
-	hb = net->ipv4.tcp_metrics_hash + hash;
+	hash = hash_32(hash, tcp_metrics_hash_log);
+	hb = tcp_metrics_hash + hash;
 	pp = &hb->chain;
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
@@ -1147,6 +1150,9 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	size_t size;
 	unsigned int slots;
 
+	if (!net_eq(net, &init_net))
+		return 0;
+
 	slots = tcpmhash_entries;
 	if (!slots) {
 		if (totalram_pages >= 128 * 1024)
@@ -1155,14 +1161,14 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 			slots = 8 * 1024;
 	}
 
-	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
-	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
+	tcp_metrics_hash_log = order_base_2(slots);
+	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!net->ipv4.tcp_metrics_hash)
-		net->ipv4.tcp_metrics_hash = vzalloc(size);
+	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!tcp_metrics_hash)
+		tcp_metrics_hash = vzalloc(size);
 
-	if (!net->ipv4.tcp_metrics_hash)
+	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
 	return 0;
@@ -1170,19 +1176,7 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 
 static void __net_exit tcp_net_metrics_exit(struct net *net)
 {
-	unsigned int i;
-
-	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
-		struct tcp_metrics_block *tm, *next;
-
-		tm = rcu_dereference_protected(net->ipv4.tcp_metrics_hash[i].chain, 1);
-		while (tm) {
-			next = rcu_dereference_protected(tm->tcpm_next, 1);
-			kfree(tm);
-			tm = next;
-		}
-	}
-	kvfree(net->ipv4.tcp_metrics_hash);
+	tcp_metrics_flush_all(net);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
@ 2015-03-11 16:55                 ` Eric Dumazet
  2015-03-11 17:34                   ` Eric W. Biederman
  2015-03-11 17:07                 ` Eric Dumazet
  2015-03-11 17:10                 ` Eric Dumazet
  2 siblings, 1 reply; 119+ messages in thread
From: Eric Dumazet @ 2015-03-11 16:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:
> hold_net and release_net were an idea that didn't work.  Kill them;
> it is long past due.

Care to describe a bit more what is not working ?

Thanks !

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
  2015-03-11 16:55                 ` Eric Dumazet
@ 2015-03-11 17:07                 ` Eric Dumazet
  2015-03-11 17:08                   ` Eric Dumazet
  2015-03-11 17:10                 ` Eric Dumazet
  2 siblings, 1 reply; 119+ messages in thread
From: Eric Dumazet @ 2015-03-11 17:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:

> @@ -163,9 +162,7 @@ static void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
>  static void fib_rules_put_rcu(struct rcu_head *head)
>  {
>  	struct fib_rules_ops *ops = container_of(head, struct fib_rules_ops, rcu);
> -	struct net *net = ops->fro_net;
>  
> -	release_net(net);
>  	kfree(ops);
>  }


Looks like this function is no longer needed; the caller can instead do

-	call_rcu(&rule->rcu, fib_rule_put_rcu);
+	kfree_rcu(rule, rcu);

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 17:07                 ` Eric Dumazet
@ 2015-03-11 17:08                   ` Eric Dumazet
  0 siblings, 0 replies; 119+ messages in thread
From: Eric Dumazet @ 2015-03-11 17:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

On Wed, 2015-03-11 at 10:07 -0700, Eric Dumazet wrote:
> On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:
> 
> > @@ -163,9 +162,7 @@ static void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
> >  static void fib_rules_put_rcu(struct rcu_head *head)
> >  {
> >  	struct fib_rules_ops *ops = container_of(head, struct fib_rules_ops, rcu);
> > -	struct net *net = ops->fro_net;
> >  
> > -	release_net(net);
> >  	kfree(ops);
> >  }
> 
> 
> Looks like this function is no longer needed; the caller can instead do
> 
> -	call_rcu(&rule->rcu, fib_rule_put_rcu);
> +	kfree_rcu(rule, rcu);
> 
> 


Same remark for fib_rule_put_rcu()

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
  2015-03-11 16:55                 ` Eric Dumazet
  2015-03-11 17:07                 ` Eric Dumazet
@ 2015-03-11 17:10                 ` Eric Dumazet
  2015-03-11 17:36                   ` Eric W. Biederman
  2 siblings, 1 reply; 119+ messages in thread
From: Eric Dumazet @ 2015-03-11 17:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:

> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 6d592f8555fb..e0d5c8327159 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -98,7 +98,7 @@ void inet_twsk_free(struct inet_timewait_sock *tw)
>  #ifdef SOCK_REFCNT_DEBUG
>  	pr_debug("%s timewait_sock %p released\n", tw->tw_prot->name, tw);
>  #endif
> -	release_net(twsk_net(tw));
> +	twsk_net(tw);
>  	kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
>  	module_put(owner);
>  }


Not clear why you left this line ?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 16:55                 ` Eric Dumazet
@ 2015-03-11 17:34                   ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 17:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:
>> hold_net and release_net were an idea that didn't work.  Kill them;
>> it is long past due.
>
> Care to describe a bit more what is not working ?

Long ago and far away, when I was first starting the network namespace
implementation, hold_net and release_net were based on a misunderstanding
of what dev_hold and dev_put do.  I thought they were just for
debugging.  I cloned the concept and then later realized it added no
benefit to network namespaces.  The code has been disabled for more years
than I care to remember, and it is just dead weight now.

So it makes sense to remove the definitions now.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/8] net: Kill hold_net release_net
  2015-03-11 17:10                 ` Eric Dumazet
@ 2015-03-11 17:36                   ` Eric W. Biederman
  0 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-11 17:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, edumazet, netdev, stephen, nicolas.dichtel, roopa,
	hannes, ddutt, vipin, shmulik.ladkani, dsahern, Julian Anastasov

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Wed, 2015-03-11 at 11:35 -0500, Eric W. Biederman wrote:
>
>> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
>> index 6d592f8555fb..e0d5c8327159 100644
>> --- a/net/ipv4/inet_timewait_sock.c
>> +++ b/net/ipv4/inet_timewait_sock.c
>> @@ -98,7 +98,7 @@ void inet_twsk_free(struct inet_timewait_sock *tw)
>>  #ifdef SOCK_REFCNT_DEBUG
>>  	pr_debug("%s timewait_sock %p released\n", tw->tw_prot->name, tw);
>>  #endif
>> -	release_net(twsk_net(tw));
>> +	twsk_net(tw);
>>  	kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
>>  	module_put(owner);
>>  }
>
>
> Not clear why you left this line ?

I goofed and then I overlooked the goof.

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated
  2015-03-09 18:50               ` Sergei Shtylyov
@ 2015-03-11 19:22                 ` Sergei Shtylyov
  0 siblings, 0 replies; 119+ messages in thread
From: Sergei Shtylyov @ 2015-03-11 19:22 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern

Hello.

On 03/09/2015 09:50 PM, Sergei Shtylyov wrote:

>> Panic so that in the unlikely event we have problems we will have a
>> clear place to start debugging instead of a mysterious NULL pointer
>> dereference later on.

>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>   net/ipv4/tcp_metrics.c | 1 +
>>   1 file changed, 1 insertion(+)

>> diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
>> index e5f41bd5ec1b..fdf4bdda971f 100644
>> --- a/net/ipv4/tcp_metrics.c
>> +++ b/net/ipv4/tcp_metrics.c
>> @@ -1186,5 +1186,6 @@ cleanup_subsys:
>>       unregister_pernet_subsys(&tcp_net_metrics_ops);
>>
>>   cleanup:
>> +    panic("Could not allocate the tcp_metrics hash table\n");
>>       return;

>     You can drop this *return* as well, it serves no purpose.

    Ah, it used to have a purpose before your patch... but now it surely doesn't.

>>   }

WBR, Sergei

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3
  2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
                                 ` (7 preceding siblings ...)
  2015-03-11 16:43               ` [PATCH net-next 8/8] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
@ 2015-03-13  5:04               ` Eric W. Biederman
  2015-03-13  5:04                 ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
                                   ` (6 more replies)
  8 siblings, 7 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:04 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


This is a small pile of patches that convert tcp_metrics from using a
hash table per network namespace to using a single hash table for all
network namespaces.

This is broken up into several patches so that each small step along
the way could be carefully scrutinized as I wrote it, and equally so
that each small step can be reviewed.

There are several cleanups included in this series: the addition of
panic calls during boot, where we cannot handle failure and not trying
simplifies the code, and the removal of the unused return code from
tcp_metrics_flush_all.

The motivation for this change is that the tcp_metrics hash table at
128KiB is one of the largest components of a freshly allocated network
namespace.
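
The 128KiB figure can be checked with a little arithmetic.  The sketch
below assumes the default of 16 * 1024 slots on larger machines (the diff
context in this thread only shows the 8 * 1024 branch) and an 8-byte bucket,
i.e. a single chain pointer on a 64-bit build.  log2_ceil and table_bytes are
hypothetical userspace helpers; log2_ceil stands in for order_base_2().

```c
#include <assert.h>
#include <stddef.h>

/* Smallest n such that (1 << n) >= slots; a userspace stand-in for the
 * kernel's order_base_2(). */
static unsigned int log2_ceil(unsigned long slots)
{
	unsigned int n = 0;

	assert(slots <= (1UL << 30));	/* keep the shift below well defined */
	while ((1UL << n) < slots)
		n++;
	return n;
}

/* Bytes needed for a table of 'slots' buckets, rounded up to a power of
 * two, when each bucket is 'bucket_bytes' wide. */
static size_t table_bytes(unsigned long slots, size_t bucket_bytes)
{
	return bucket_bytes << log2_ceil(slots);
}
```

Under those assumptions table_bytes(16 * 1024, 8) is 131072 bytes, which is
the 128KiB per namespace that the series removes.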

I am resending because the previous version I sent has suffered bitrot;
I have respun the patches so that they apply.  I believe I have addressed
all of the review concerns except optimal behavior on little machines
with 32-byte cache lines, which is beyond me as even the current code
has bad behavior in that case.

Eric W. Biederman (6):
      tcp_metrics: panic when tcp_metrics_init fails.
      tcp_metrics: Mix the network namespace into the hash function.
      tcp_metrics: Add a field tcpm_net and verify it matches on lookup
      tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
      tcp_metrics: Rewrite tcp_metrics_flush_all
      tcp_metrics: Use a single hash table for all network namespaces.

 include/net/netns/ipv4.h |   2 -
 net/ipv4/tcp_metrics.c   | 137 +++++++++++++++++++++++++----------------------
 2 files changed, 73 insertions(+), 66 deletions(-)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails.
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
@ 2015-03-13  5:04                 ` Eric W. Biederman
  2015-03-13  5:05                 ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:04 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


There is no practical way to clean up during boot, so
just panic if there is a problem initializing tcp_metrics.

That will at least give us a clear place to start debugging
if something does go wrong.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index e5f41bd5ec1b..4206b14d956d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1175,16 +1175,10 @@ void __init tcp_metrics_init(void)
 
 	ret = register_pernet_subsys(&tcp_net_metrics_ops);
 	if (ret < 0)
-		goto cleanup;
+		panic("Could not allocate the tcp_metrics hash table\n");
+
 	ret = genl_register_family_with_ops(&tcp_metrics_nl_family,
 					    tcp_metrics_nl_ops);
 	if (ret < 0)
-		goto cleanup_subsys;
-	return;
-
-cleanup_subsys:
-	unregister_pernet_subsys(&tcp_net_metrics_ops);
-
-cleanup:
-	return;
+		panic("Could not register tcp_metrics generic netlink\n");
 }
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function.
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
  2015-03-13  5:04                 ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
@ 2015-03-13  5:05                 ` Eric W. Biederman
  2015-03-13  5:05                 ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:05 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


In preparation for using one hash table for all network namespaces,
mix the network namespace into the hash value.
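
A minimal userspace sketch of the mixing step, for illustration only:
fold32 is a hypothetical stand-in for hash_32() (the multiplier is an
illustrative constant, not necessarily the kernel's), and ns_salt stands in
for whatever net_hash_mix() returns for a namespace.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical multiplicative fold of a 32-bit value down to 'bits' bits,
 * standing in for the kernel's hash_32(). */
static uint32_t fold32(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * UINT32_C(0x9E3779B1)) >> (32 - bits);
}

/* XOR a per-namespace salt into the address hash before folding, which is
 * the order the patch uses: mix first, then reduce to a bucket index. */
static uint32_t bucket_for(uint32_t addr_hash, uint32_t ns_salt,
			   unsigned int table_bits)
{
	return fold32(addr_hash ^ ns_salt, table_bits);
}
```

Mixing before the fold means the same address pair in two namespaces will
usually land in different buckets, but not always, so it only spreads load.
Correctness still depends on comparing the owning namespace on lookup.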

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 4206b14d956d..fbb42f44501e 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -252,6 +252,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 	}
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -299,6 +300,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 		return NULL;
 
 	net = twsk_net(tw);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
@@ -347,6 +349,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		return NULL;
 
 	net = dev_net(dst->dev);
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
@@ -994,6 +997,7 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	if (!reply)
 		goto nla_put_failure;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
@@ -1070,6 +1074,7 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	if (ret < 0)
 		src = false;
 
+	hash ^= net_hash_mix(net);
 	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 	hb = net->ipv4.tcp_metrics_hash + hash;
 	pp = &hb->chain;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
  2015-03-13  5:04                 ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
  2015-03-13  5:05                 ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
@ 2015-03-13  5:05                 ` Eric W. Biederman
  2015-03-13  5:06                 ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:05 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


In preparation for using one tcp metrics hash table for all network
namespaces add a field tcpm_net to struct tcp_metrics_block, and
verify that field on all hash table lookups.

Make the field tcpm_net of type possible_net_t so it takes no space
when network namespaces are disabled.

Further add a function tm_net to read that field so we can be
efficient when network namespaces are disabled and concise
the rest of the time.
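
The idea can be sketched in userspace roughly as follows.  Everything here
is a hypothetical stand-in: USE_NS models CONFIG_NET_NS, owner_t models
possible_net_t, owner_read/owner_write model read_pnet/write_pnet, and the
addresses are plain integers rather than struct inetpeer_addr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Define USE_NS as 0 to model a build without namespace support.  The
 * member-less struct that results relies on a GNU C extension. */
#ifndef USE_NS
#define USE_NS 1
#endif

struct ns { int id; };
static struct ns default_ns = { 0 };	/* the only namespace when USE_NS == 0 */

typedef struct {
#if USE_NS
	struct ns *ns;
#endif
} owner_t;

static inline void owner_write(owner_t *o, struct ns *ns)
{
#if USE_NS
	o->ns = ns;
#else
	(void)o;
	(void)ns;
#endif
}

static inline struct ns *owner_read(const owner_t *o)
{
#if USE_NS
	return o->ns;
#else
	(void)o;
	return &default_ns;
#endif
}

struct metrics {
	struct metrics *next;
	owner_t owner;
	uint32_t saddr, daddr;
};

/* Walk one chain; an entry matches only if both addresses and the owning
 * namespace agree.  Returns NULL when nothing matches. */
static struct metrics *lookup(struct metrics *chain, uint32_t saddr,
			      uint32_t daddr, const struct ns *ns)
{
	struct metrics *m;

	for (m = chain; m; m = m->next) {
		if (m->saddr == saddr && m->daddr == daddr &&
		    owner_read(&m->owner) == ns)
			break;
	}
	return m;
}
```

Going through an accessor keeps the callers identical in both
configurations; only the two helpers know whether the field exists.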

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index fbb42f44501e..461c3d2e1ca4 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -40,6 +40,7 @@ struct tcp_fastopen_metrics {
 
 struct tcp_metrics_block {
 	struct tcp_metrics_block __rcu	*tcpm_next;
+	possible_net_t			tcpm_net;
 	struct inetpeer_addr		tcpm_saddr;
 	struct inetpeer_addr		tcpm_daddr;
 	unsigned long			tcpm_stamp;
@@ -52,6 +53,11 @@ struct tcp_metrics_block {
 	struct rcu_head			rcu_head;
 };
 
+static inline struct net *tm_net(struct tcp_metrics_block *tm)
+{
+	return read_pnet(&tm->tcpm_net);
+}
+
 static bool tcp_metric_locked(struct tcp_metrics_block *tm,
 			      enum tcp_metric_index idx)
 {
@@ -183,6 +189,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 		if (!tm)
 			goto out_unlock;
 	}
+	write_pnet(&tm->tcpm_net, net);
 	tm->tcpm_saddr = *saddr;
 	tm->tcpm_daddr = *daddr;
 
@@ -217,7 +224,8 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, saddr) &&
-		    addr_same(&tm->tcpm_daddr, daddr))
+		    addr_same(&tm->tcpm_daddr, daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 		depth++;
 	}
@@ -258,7 +266,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
-		    addr_same(&tm->tcpm_daddr, &daddr))
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 	}
 	tcpm_check_stamp(tm, dst);
@@ -306,7 +315,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
-		    addr_same(&tm->tcpm_daddr, &daddr))
+		    addr_same(&tm->tcpm_daddr, &daddr) &&
+		    net_eq(tm_net(tm), net))
 			break;
 	}
 	return tm;
@@ -912,6 +922,8 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
 		     tm = rcu_dereference(tm->tcpm_next), col++) {
+			if (!net_eq(tm_net(tm), net))
+				continue;
 			if (col < s_col)
 				continue;
 			if (tcp_metrics_dump_info(skb, cb, tm) < 0) {
@@ -1004,7 +1016,8 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
-		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
+		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
+		    net_eq(tm_net(tm), net)) {
 			ret = tcp_metrics_fill_info(msg, tm);
 			break;
 		}
@@ -1081,7 +1094,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
-		    (!src || addr_same(&tm->tcpm_saddr, &saddr))) {
+		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
+		    net_eq(tm_net(tm), net)) {
 			*pp = tm->tcpm_next;
 			kfree_rcu(tm, rcu_head);
 			found = true;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
                                   ` (2 preceding siblings ...)
  2015-03-13  5:05                 ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
@ 2015-03-13  5:06                 ` Eric W. Biederman
  2015-03-13  5:07                 ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:06 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


tcp_metrics_flush_all always returns 0.  Remove the unnecessary return code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 461c3d2e1ca4..0d07e14f2ca5 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1043,7 +1043,7 @@ out_free:
 
 #define deref_genl(p)	rcu_dereference_protected(p, lockdep_genl_is_held())
 
-static int tcp_metrics_flush_all(struct net *net)
+static void tcp_metrics_flush_all(struct net *net)
 {
 	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
 	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
@@ -1064,7 +1064,6 @@ static int tcp_metrics_flush_all(struct net *net)
 			tm = next;
 		}
 	}
-	return 0;
 }
 
 static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
@@ -1081,8 +1080,10 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	ret = parse_nl_addr(info, &daddr, &hash, 1);
 	if (ret < 0)
 		return ret;
-	if (ret > 0)
-		return tcp_metrics_flush_all(net);
+	if (ret > 0) {
+		tcp_metrics_flush_all(net);
+		return 0;
+	}
 	ret = parse_nl_saddr(info, &saddr);
 	if (ret < 0)
 		src = false;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 119+ messages in thread

* [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
                                   ` (3 preceding siblings ...)
  2015-03-13  5:06                 ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-13  5:07                 ` Eric W. Biederman
  2015-03-13  5:07                 ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
  2015-03-13  5:57                 ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 David Miller
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:07 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


Rewrite tcp_metrics_flush_all so that it can cope with entries from
different network namespaces on its hash chain.

This is based on the logic in tcp_metrics_nl_cmd_del for deleting
a selection of entries from a tcp metrics hash chain.
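
The pointer-to-pointer walk can be sketched in plain userspace C as below.
This is an illustration under simplifying assumptions, not the kernel code:
struct entry, flush_owner and push are hypothetical names, an integer owner
id stands in for the namespace, there is no locking, and free() is immediate
where the patch defers the free with kfree_rcu().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tcp_metrics_block: just the chain link and
 * an integer owner id standing in for the namespace. */
struct entry {
	struct entry *next;
	int owner;
};

/* Unlink and free every entry owned by 'owner'; returns how many were
 * freed.  'pp' always points at the link leading to the current entry, so
 * the chain head and interior nodes share one code path. */
static unsigned int flush_owner(struct entry **head, int owner)
{
	struct entry **pp = head;
	struct entry *e;
	unsigned int freed = 0;

	while ((e = *pp) != NULL) {
		if (e->owner == owner) {
			*pp = e->next;	/* bypass e; pp stays where it is */
			free(e);
			freed++;
		} else {
			pp = &e->next;	/* keep e; move on to its link */
		}
	}
	return freed;
}

/* Push a new entry at the head of the chain (helper for experiments). */
static void push(struct entry **head, int owner)
{
	struct entry *e = malloc(sizeof(*e));

	assert(e != NULL);
	e->owner = owner;
	e->next = *head;
	*head = e;
}
```

Entries belonging to other owners stay linked in their original order, which
is what lets one chain hold entries from several namespaces.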

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/ipv4/tcp_metrics.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 0d07e14f2ca5..baccb070427d 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1051,18 +1051,19 @@ static void tcp_metrics_flush_all(struct net *net)
 	unsigned int row;
 
 	for (row = 0; row < max_rows; row++, hb++) {
+		struct tcp_metrics_block __rcu **pp;
 		spin_lock_bh(&tcp_metrics_lock);
-		tm = deref_locked_genl(hb->chain);
-		if (tm)
-			hb->chain = NULL;
-		spin_unlock_bh(&tcp_metrics_lock);
-		while (tm) {
-			struct tcp_metrics_block *next;
-
-			next = deref_genl(tm->tcpm_next);
-			kfree_rcu(tm, rcu_head);
-			tm = next;
+		pp = &hb->chain;
+		for (tm = deref_locked_genl(*pp); tm;
+		     tm = deref_locked_genl(*pp)) {
+			if (net_eq(tm_net(tm), net)) {
+				*pp = tm->tcpm_next;
+				kfree_rcu(tm, rcu_head);
+			} else {
+				pp = &tm->tcpm_next;
+			}
 		}
+		spin_unlock_bh(&tcp_metrics_lock);
 	}
 }
 
-- 
2.2.1
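
For readers following the flush rewrite in this patch, here is a minimal
userspace sketch of the same pointer-to-pointer walk. It is not the kernel
code: struct entry, the integer owner field, flush_owner(), push() and
count() are hypothetical stand-ins, free() takes the place of kfree_rcu(),
and locking and RCU are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tcp_metrics_block: an integer owner id
 * replaces the struct net pointer compared with net_eq() in the patch. */
struct entry {
	int owner;
	int value;
	struct entry *next;
};

/* Unlink and free every entry on the chain that belongs to owner.
 * Entries of other owners stay linked in their original order. */
static void flush_owner(struct entry **chain, int owner)
{
	struct entry **pp = chain;	/* address of the link being examined */
	struct entry *e;

	while ((e = *pp) != NULL) {
		if (e->owner == owner) {
			*pp = e->next;	/* bypass e; pp stays where it is */
			free(e);	/* the kernel defers this via kfree_rcu() */
		} else {
			pp = &e->next;	/* keep e and advance to its link */
		}
	}
}

/* Prepend a new entry and return the new chain head. */
static struct entry *push(struct entry *head, int owner, int value)
{
	struct entry *e = malloc(sizeof(*e));

	assert(e != NULL);
	e->owner = owner;
	e->value = value;
	e->next = head;
	return e;
}

/* Count the entries on a chain. */
static size_t count(const struct entry *e)
{
	size_t n = 0;

	for (; e; e = e->next)
		n++;
	return n;
}
```

Calling flush_owner(&chain, 1) on a chain that interleaves owners 1 and 2
leaves only the owner-2 entries, which is the property the patch needs once
several namespaces share a chain. Because pp always holds the address of the
link being examined, removing the head needs no special case.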


* [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces.
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
                                   ` (4 preceding siblings ...)
  2015-03-13  5:07                 ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
@ 2015-03-13  5:07                 ` Eric W. Biederman
  2015-03-13  5:57                 ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 David Miller
  6 siblings, 0 replies; 119+ messages in thread
From: Eric W. Biederman @ 2015-03-13  5:07 UTC (permalink / raw)
  To: David Miller
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, Julian Anastasov


Now that all of the operations are safe on a single hash table
across network namespaces, allocate a single global hash table
and update the code to use it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h |  2 --
 net/ipv4/tcp_metrics.c   | 66 ++++++++++++++++++++++--------------------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8f3a1a1a5a94..614a49be68a9 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -54,8 +54,6 @@ struct netns_ipv4 {
 	struct sock		*mc_autojoin_sk;
 
 	struct inet_peer_base	*peers;
-	struct tcpm_hash_bucket	*tcp_metrics_hash;
-	unsigned int		tcp_metrics_hash_log;
 	struct sock  * __percpu	*tcp_sk;
 	struct netns_frags	frags;
 #ifdef CONFIG_NETFILTER
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index baccb070427d..366728cbee4a 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -97,6 +97,9 @@ struct tcpm_hash_bucket {
 	struct tcp_metrics_block __rcu	*chain;
 };
 
+static struct tcpm_hash_bucket	*tcp_metrics_hash __read_mostly;
+static unsigned int		tcp_metrics_hash_log __read_mostly;
+
 static DEFINE_SPINLOCK(tcp_metrics_lock);
 
 static void tcpm_suck_dst(struct tcp_metrics_block *tm,
@@ -177,7 +180,7 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	if (unlikely(reclaim)) {
 		struct tcp_metrics_block *oldest;
 
-		oldest = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain);
+		oldest = rcu_dereference(tcp_metrics_hash[hash].chain);
 		for (tm = rcu_dereference(oldest->tcpm_next); tm;
 		     tm = rcu_dereference(tm->tcpm_next)) {
 			if (time_before(tm->tcpm_stamp, oldest->tcpm_stamp))
@@ -196,8 +199,8 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 	tcpm_suck_dst(tm, dst, true);
 
 	if (likely(!reclaim)) {
-		tm->tcpm_next = net->ipv4.tcp_metrics_hash[hash].chain;
-		rcu_assign_pointer(net->ipv4.tcp_metrics_hash[hash].chain, tm);
+		tm->tcpm_next = tcp_metrics_hash[hash].chain;
+		rcu_assign_pointer(tcp_metrics_hash[hash].chain, tm);
 	}
 
 out_unlock:
@@ -221,7 +224,7 @@ static struct tcp_metrics_block *__tcp_get_metrics(const struct inetpeer_addr *s
 	struct tcp_metrics_block *tm;
 	int depth = 0;
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, saddr) &&
 		    addr_same(&tm->tcpm_daddr, daddr) &&
@@ -261,9 +264,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr) &&
@@ -310,9 +313,9 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 
 	net = twsk_net(tw);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_saddr, &saddr) &&
 		    addr_same(&tm->tcpm_daddr, &daddr) &&
@@ -360,7 +363,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 
 	net = dev_net(dst->dev);
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&saddr, &daddr, net, hash);
 	if (tm == TCP_METRICS_RECLAIM_PTR)
@@ -911,13 +914,13 @@ static int tcp_metrics_nl_dump(struct sk_buff *skb,
 			       struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
 	unsigned int row, s_row = cb->args[0];
 	int s_col = cb->args[1], col = s_col;
 
 	for (row = s_row; row < max_rows; row++, s_col = 0) {
 		struct tcp_metrics_block *tm;
-		struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash + row;
+		struct tcpm_hash_bucket *hb = tcp_metrics_hash + row;
 
 		rcu_read_lock();
 		for (col = 0, tm = rcu_dereference(hb->chain); tm;
@@ -1010,10 +1013,10 @@ static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto nla_put_failure;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hash = hash_32(hash, tcp_metrics_hash_log);
 	ret = -ESRCH;
 	rcu_read_lock();
-	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	for (tm = rcu_dereference(tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
 		if (addr_same(&tm->tcpm_daddr, &daddr) &&
 		    (!src || addr_same(&tm->tcpm_saddr, &saddr)) &&
@@ -1045,8 +1048,8 @@ out_free:
 
 static void tcp_metrics_flush_all(struct net *net)
 {
-	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
-	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
+	unsigned int max_rows = 1U << tcp_metrics_hash_log;
+	struct tcpm_hash_bucket *hb = tcp_metrics_hash;
 	struct tcp_metrics_block *tm;
 	unsigned int row;
 
@@ -1090,8 +1093,8 @@ static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		src = false;
 
 	hash ^= net_hash_mix(net);
-	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
-	hb = net->ipv4.tcp_metrics_hash + hash;
+	hash = hash_32(hash, tcp_metrics_hash_log);
+	hb = tcp_metrics_hash + hash;
 	pp = &hb->chain;
 	spin_lock_bh(&tcp_metrics_lock);
 	for (tm = deref_locked_genl(*pp); tm; tm = deref_locked_genl(*pp)) {
@@ -1147,6 +1150,9 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	size_t size;
 	unsigned int slots;
 
+	if (!net_eq(net, &init_net))
+		return 0;
+
 	slots = tcpmhash_entries;
 	if (!slots) {
 		if (totalram_pages >= 128 * 1024)
@@ -1155,14 +1161,14 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 			slots = 8 * 1024;
 	}
 
-	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
-	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
+	tcp_metrics_hash_log = order_base_2(slots);
+	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!net->ipv4.tcp_metrics_hash)
-		net->ipv4.tcp_metrics_hash = vzalloc(size);
+	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!tcp_metrics_hash)
+		tcp_metrics_hash = vzalloc(size);
 
-	if (!net->ipv4.tcp_metrics_hash)
+	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
 	return 0;
@@ -1170,19 +1176,7 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 
 static void __net_exit tcp_net_metrics_exit(struct net *net)
 {
-	unsigned int i;
-
-	for (i = 0; i < (1U << net->ipv4.tcp_metrics_hash_log) ; i++) {
-		struct tcp_metrics_block *tm, *next;
-
-		tm = rcu_dereference_protected(net->ipv4.tcp_metrics_hash[i].chain, 1);
-		while (tm) {
-			next = rcu_dereference_protected(tm->tcpm_next, 1);
-			kfree(tm);
-			tm = next;
-		}
-	}
-	kvfree(net->ipv4.tcp_metrics_hash);
+	tcp_metrics_flush_all(net);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
-- 
2.2.1
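
As a rough illustration of why one table can serve every namespace, the
sketch below mixes a per-namespace value into the bucket hash and also
compares the namespace on lookup. It is a userspace approximation, not the
kernel implementation: struct ns, hash_mix, struct metric, bucket_of(),
metric_add() and metric_lookup() are hypothetical names, the multiplier is
an arbitrary odd constant rather than whatever hash_32() uses, and locking
and RCU are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_LOG  4
#define TABLE_SIZE (1u << TABLE_LOG)

/* Hypothetical stand-in for struct net; hash_mix plays the role of the
 * value returned by net_hash_mix(). */
struct ns {
	uint32_t hash_mix;
};

struct metric {
	const struct ns *ns;	/* owning namespace, checked on lookup */
	uint32_t daddr;
	int rtt;
	struct metric *next;
};

/* One table shared by all namespaces. */
static struct metric *table[TABLE_SIZE];

/* Mix the namespace into the key, then keep the top TABLE_LOG bits of a
 * multiplicative hash (arbitrary odd constant, chosen for illustration). */
static unsigned int bucket_of(const struct ns *ns, uint32_t daddr)
{
	uint32_t h = daddr ^ ns->hash_mix;

	return (uint32_t)(h * 0x9E3779B1u) >> (32 - TABLE_LOG);
}

/* Insert a new entry at the head of its bucket chain. */
static struct metric *metric_add(const struct ns *ns, uint32_t daddr, int rtt)
{
	unsigned int b = bucket_of(ns, daddr);
	struct metric *m = malloc(sizeof(*m));

	assert(m != NULL);
	m->ns = ns;
	m->daddr = daddr;
	m->rtt = rtt;
	m->next = table[b];
	table[b] = m;
	return m;
}

/* Mixing only spreads entries across buckets; correctness comes from
 * comparing the namespace as well as the address. */
static struct metric *metric_lookup(const struct ns *ns, uint32_t daddr)
{
	struct metric *m;

	for (m = table[bucket_of(ns, daddr)]; m; m = m->next)
		if (m->daddr == daddr && m->ns == ns)
			return m;
	return NULL;
}
```

Two namespaces that store the same destination address get separate
entries even when they land in the same bucket, and a namespace that never
stored that address sees nothing. That is the guarantee the earlier patches
in this series establish before the per-namespace tables are removed here.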


* Re: [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3
  2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
                                   ` (5 preceding siblings ...)
  2015-03-13  5:07                 ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
@ 2015-03-13  5:57                 ` David Miller
  6 siblings, 0 replies; 119+ messages in thread
From: David Miller @ 2015-03-13  5:57 UTC (permalink / raw)
  To: ebiederm
  Cc: edumazet, netdev, stephen, nicolas.dichtel, roopa, hannes, ddutt,
	vipin, shmulik.ladkani, dsahern, ja

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 13 Mar 2015 00:04:07 -0500

> This is a small pile of patches that convert tcp_metrics from using a
> hash table per network namespace to using a single hash table for all
> network namespaces.

Looks great, series applied, thanks Eric!


* Re: [RFC PATCH 00/29] net: VRF support
  2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
                   ` (35 preceding siblings ...)
  2015-02-10  0:53 ` [RFC PATCH 00/29] net: VRF support Thomas Graf
@ 2016-05-25 16:04 ` Chenna
  2016-05-25 19:04   ` David Ahern
  36 siblings, 1 reply; 119+ messages in thread
From: Chenna @ 2016-05-25 16:04 UTC (permalink / raw)
  To: netdev

David Ahern <dsahern <at> gmail.com> writes:

> 
> Kernel patches are also available here:
>     https://github.com/dsahern/linux.git vrf-3.19
> 
> iproute2 patches are also available here:
>     https://github.com/dsahern/iproute2 vrf-3.19
> 


Hello David,

Do we have a similar support package for the 3.10 kernel?

Thanks
-Chenna


* Re: [RFC PATCH 00/29] net: VRF support
  2016-05-25 16:04 ` Chenna
@ 2016-05-25 19:04   ` David Ahern
  0 siblings, 0 replies; 119+ messages in thread
From: David Ahern @ 2016-05-25 19:04 UTC (permalink / raw)
  To: Chenna, netdev

On 5/25/16 10:04 AM, Chenna wrote:
> David Ahern <dsahern <at> gmail.com> writes:
>
>>
>> Kernel patches are also available here:
>>     https://github.com/dsahern/linux.git vrf-3.19
>>
>> iproute2 patches are also available here:
>>     https://github.com/dsahern/iproute2 vrf-3.19
>>
>
>
> Hello David,
>
> Do we have a similar support package for the 3.10 kernel?

The VRF patches referenced above were not accepted upstream. An 
alternative implementation was accepted for the 4.3 kernel with various 
updates in all of the kernel versions since.

Users that want the VRF implementation in an older kernel (e.g., 3.10) 
will need to backport the kernel patches. Top of tree iproute2 can be 
used as is with older kernels.
