From: David Ahern <dsahern@gmail.com>
To: Thomas Graf <tgraf@suug.ch>
Cc: netdev@vger.kernel.org, ebiederm@xmission.com
Subject: Re: [RFC PATCH 00/29] net: VRF support
Date: Tue, 10 Feb 2015 13:54:05 -0700
Message-ID: <54DA6FED.5020907@gmail.com>
In-Reply-To: <20150210005344.GA6293@casper.infradead.org>

On 2/9/15 5:53 PM, Thomas Graf wrote:
> On 02/04/15 at 06:34pm, David Ahern wrote:
>> Namespaces provide excellent separation of the networking stack from the
>> netdevices and up. The intent of VRFs is to provide an additional,
>> logical separation at the L3 layer within a namespace.
>
> What you ask for seems to be L3 micro segmentation inside netns. I

I would not label it 'micro', but yes: an L3 separation within an L1 separation.

> would argue that we already support this through multiple routing
> tables. I would prefer improving the existing architecture to cover
> your use cases: Increase the number of supported tables, extend
> routing rules as needed, ...

I've seen that response for VRFs as well. I have not personally tried 
it, but from what I have read it does not work well. I think Roopa 
responded that Cumulus has spent time on that path and has hit some 
roadblocks.

>
>> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
>> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
>> to this file (if preferred this can be made a prctl to change the VRF id).
>> This allows services to be launched in a VRF context using ip, similar to
>> what is done for network namespaces.
>>      e.g., ip vrf exec 99 /usr/sbin/sshd
>
> I think such a classification should occur through cgroups instead
> of touching PIDs directly.

That is an interesting idea -- using cgroups for task labeling. It 
introduces a creation/deletion event for VRFs, which I was trying to 
avoid, and there will be some amount of overhead with a cgroup. I'll 
take a look at that option when I get some time.

As for the current proposal, I am treating VRF as part of a network 
context. Today 'ip netns' is used to run a command in a specific network 
namespace; the proposal layers a VRF context within a namespace, so in 
keeping with how 'ip netns' works, the above syntax allows a user to 
supply both a network namespace and a VRF when running a command.

>
>> Network devices belong to a single VRF context which defaults to VRF 1.
>> They can be assigned to another VRF using IFLA_VRF attribute in link
>> messages. Similarly the VRF assignment is returned in the IFLA_VRF
>> attribute. The ip command has been modified to display the VRF id of a
>> device. L2 applications like lldp are not VRF aware and still work
>> through all network devices within the namespace.
>
> I believe that binding net_devices to VRFs is misleading and the
> concept by itself is non-scalable. You do not want to create 10k
> net_devices for your overlay of choice just to tie them to a
> particular VRF. You want to store the VRF identifier as metadata and
> have a stateless classifier include it in the VRF decision. See the
> recent VXLAN-GBP work.

I'll take a look when I get time.

I have not seen scalability issues creating 1,000+ net_devices. 
Certainly the roughly 40 KB of memory per net_device is noticeable, but 
I believe that can be improved (e.g., a number of entries can be moved 
under proper CONFIG_ checks). I do need to repeat the tests on newer 
kernels.

>
> You could either map whatever selects the VRF to the mark or support it
> natively in the routing rules classifier.
>
> An obvious alternative is OVS. What you describe can be implemented in
> a scalable manner using OVS and mark. I understand that OVS is not for
> everybody but it gets a fundamental principle right: Scalability
> demands for programmability.
>
> I don’t think we should be adding a new single purpose metadata field
> to arbitrary structures for every new use case that comes up. We
> should work on programmability which increases flexibility and allows
> decoupling application interest from networking details.
>
>> On RX skbs get their VRF context from the netdevice the packet is received
>> on. For TX the VRF context for an skb is taken from the socket. The
>> intention is for L3/raw sockets to be able to set the VRF context for a
>> packet TX using cmsg (not coded in this patch set).
>
> Specifying L3 context in cmsg seems very broken to me. We do not want
> to bind applications any closer to underlying networking infrastructure.
> In fact, we should do the opposite and decouple this completely.

That suggestion is in line with what is done today for other L3 
parameters -- TOS, TTL, and a few others.

>
>> The 'any' context applies to listen sockets only; connected sockets are in
>> a VRF context. Child sockets accepted by the daemon acquire the VRF context
>> of the network device the connection originated on.
>
> Linux considers an address local regardless of the interface the packet
> was received on.  So you would accept the packet on any interface and
> then bind it to the VRF of that interface even though the route for it
> might be on a different interface.
>
> This really belongs into routing rules from my perspective which takes
> mark and the cgroup context into account.

Expanding the current network namespace checks to a networking context 
is a very simple and clean way of implementing VRFs, versus cobbling 
together a 'VRF-like' capability using marks, multiple tables, etc. 
(i.e., the existing capabilities). Further, the VRF tagging of 
net_devices seems to fit readily into the hardware offload and 
switchdev capabilities (e.g., add an ndo operation for setting the VRF 
tag on a device, which passes it to the driver).

Big picture: where are OCP and switchdev headed? Top-of-rack switches 
seem to be the first target, but after that? Will the kernel ever 
support MPLS? Will the kernel attain the richer feature set of high-end 
routers? If so, how does VRF support fit into the design? As I 
understand it, a scalable VRF solution is a fundamental building block. 
Will a cobbled-together solution of cgroups, marks, rules, and multiple 
tables really work, versus the simplicity of an expanded network context?

David

Thread overview: 119+ messages
2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
2015-02-05  1:34 ` [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison David Ahern
2015-02-05  1:34 ` [RFC PATCH 02/29] net: Flip net_device to use net_ctx David Ahern
2015-02-05 13:47   ` Nicolas Dichtel
2015-02-06  0:45     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 03/29] net: Flip sock_common to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 04/29] net: Add net_ctx macros for skbuffs David Ahern
2015-02-05  1:34 ` [RFC PATCH 05/29] net: Flip seq_net_private to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 06/29] net: Flip fib_rules and fib_rules_ops to use net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 07/29] net: Flip inet_bind_bucket to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 08/29] net: Flip fib_info " David Ahern
2015-02-05  1:34 ` [RFC PATCH 09/29] net: Flip ip6_flowlabel " David Ahern
2015-02-05  1:34 ` [RFC PATCH 10/29] net: Flip neigh structs " David Ahern
2015-02-05  1:34 ` [RFC PATCH 11/29] net: Flip nl_info " David Ahern
2015-02-05  1:34 ` [RFC PATCH 12/29] net: Add device lookups by net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 13/29] net: Convert function arg from struct net to struct net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 14/29] net: vrf: Introduce vrf header file David Ahern
2015-02-05 13:44   ` Nicolas Dichtel
2015-02-06  0:52     ` David Ahern
2015-02-06  8:53       ` Nicolas Dichtel
2015-02-05  1:34 ` [RFC PATCH 15/29] net: vrf: Add vrf to net_ctx struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 16/29] net: vrf: Set default vrf David Ahern
2015-02-05  1:34 ` [RFC PATCH 17/29] net: vrf: Add vrf context to task struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket David Ahern
2015-02-05 13:44   ` Nicolas Dichtel
2015-02-06  1:18     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
2015-02-05 13:45   ` Nicolas Dichtel
2015-02-06  1:21     ` David Ahern
2015-02-06  3:54   ` Eric W. Biederman
2015-02-06  6:00     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 20/29] net: vrf: Add vrf context to flow struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 21/29] net: vrf: Add vrf context to genid's David Ahern
2015-02-05  1:34 ` [RFC PATCH 22/29] net: vrf: Set VRF id in various network structs David Ahern
2015-02-05  1:34 ` [RFC PATCH 23/29] net: vrf: Enable vrf checks David Ahern
2015-02-05  1:34 ` [RFC PATCH 24/29] net: vrf: Add support to get/set vrf context on a device David Ahern
2015-02-05  1:34 ` [RFC PATCH 25/29] net: vrf: Handle VRF any context David Ahern
2015-02-05 13:46   ` Nicolas Dichtel
2015-02-06  1:23     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 26/29] net: vrf: Change single_open_net to pass net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 27/29] net: vrf: Add vrf checks and context to ipv4 proc files David Ahern
2015-02-05  1:34 ` [RFC PATCH 28/29] iproute2: vrf: Add vrf subcommand David Ahern
2015-02-05  1:34 ` [RFC PATCH 29/29] iproute2: Add vrf option to ip link command David Ahern
2015-02-05  5:17 ` [RFC PATCH 00/29] net: VRF support roopa
2015-02-05 13:44 ` Nicolas Dichtel
2015-02-06  1:32   ` David Ahern
2015-02-06  8:53     ` Nicolas Dichtel
2015-02-05 23:12 ` roopa
2015-02-06  2:19   ` David Ahern
2015-02-09 16:38     ` roopa
2015-02-10 10:43     ` Derek Fawcus
2015-02-06  6:10   ` Shmulik Ladkani
2015-02-09 15:54     ` roopa
2015-02-11  7:42       ` Shmulik Ladkani
2015-02-06  1:33 ` Stephen Hemminger
2015-02-06  2:10   ` David Ahern
2015-02-06  4:14     ` Eric W. Biederman
2015-02-06  6:15       ` David Ahern
2015-02-06 15:08         ` Nicolas Dichtel
     [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
2015-02-09 20:48           ` Nicolas Dichtel
2015-02-11  4:14           ` David Ahern
2015-02-06 15:10 ` Nicolas Dichtel
2015-02-06 20:50 ` Eric W. Biederman
2015-02-09  0:36   ` David Ahern
2015-02-09 11:30     ` Derek Fawcus
     [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
2015-02-11  2:55     ` network namespace bloat Eric Dumazet
2015-02-11  3:18       ` Eric W. Biederman
2015-02-19 19:49         ` David Miller
2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
2015-03-09 18:50               ` Sergei Shtylyov
2015-03-11 19:22                 ` Sergei Shtylyov
2015-03-09 18:27             ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-09 18:29             ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-09 20:25               ` Julian Anastasov
2015-03-10  6:59                 ` Eric W. Biederman
2015-03-10  8:23                   ` Julian Anastasov
2015-03-11  0:58                     ` Eric W. Biederman
2015-03-10 16:36                   ` David Miller
2015-03-10 17:06                     ` Eric W. Biederman
2015-03-10 17:29                       ` David Miller
2015-03-10 17:56                         ` Eric W. Biederman
2015-03-09 18:30             ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-09 18:30             ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-09 18:43               ` Eric Dumazet
2015-03-09 18:47               ` Eric Dumazet
2015-03-09 19:35                 ` Eric W. Biederman
2015-03-09 20:21                   ` Eric Dumazet
2015-03-09 20:09             ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction David Miller
2015-03-09 20:21               ` Eric W. Biederman
2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
2015-03-11 16:55                 ` Eric Dumazet
2015-03-11 17:34                   ` Eric W. Biederman
2015-03-11 17:07                 ` Eric Dumazet
2015-03-11 17:08                   ` Eric Dumazet
2015-03-11 17:10                 ` Eric Dumazet
2015-03-11 17:36                   ` Eric W. Biederman
2015-03-11 16:36               ` [PATCH net-next 2/8] net: Introduce possible_net_t Eric W. Biederman
2015-03-11 16:38               ` [PATCH net-next 3/8] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
2015-03-11 16:38               ` [PATCH net-next 4/8] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-11 16:40               ` [PATCH net-next 5/8] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-11 16:41               ` [PATCH net-next 6/8] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-11 16:43               ` [PATCH net-next 7/8] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-11 16:43               ` [PATCH net-next 8/8] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
2015-03-13  5:04                 ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
2015-03-13  5:05                 ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-13  5:05                 ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-13  5:06                 ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-13  5:07                 ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-13  5:07                 ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-13  5:57                 ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 David Miller
2015-02-11 17:09     ` network namespace bloat Nicolas Dichtel
2015-02-10  0:53 ` [RFC PATCH 00/29] net: VRF support Thomas Graf
2015-02-10 20:54   ` David Ahern [this message]
2016-05-25 16:04 ` Chenna
2016-05-25 19:04   ` David Ahern
