Re: Numen with reference to vmap

From: "Serguei Bezverkhi (sbezverk)" <sbezverk@cisco.com>
To: Arturo Borrero Gonzalez <arturo@netfilter.org>,
	Phil Sutter <phil@nwl.cc>
Cc: "netfilter-devel@vger.kernel.org" <netfilter-devel@vger.kernel.org>
Subject: Re: Numen with reference to vmap
Date: Wed, 4 Dec 2019 21:05:31 +0000
Message-ID: <8F45EB6F-B2DD-4498-9F5A-34D6CEB36F6A@cisco.com>
In-Reply-To: <92609998-F3BF-42A5-AF95-A75AAE941C27@cisco.com>

Here are the code-generated nftables rules for the NAT portion of the k8s proxy. They probably do not cover all cases, but on a normal k8s cluster they should be sufficient. I would appreciate reviews and suggestions for optimization. Thank you very much.
Serguei

table ip ipv4table {
	chain nat-preroutin {
		type nat hook prerouting priority filter; policy accept;
		jump k8s-nat-services
	}

	chain nat-output {
		type nat hook output priority filter; policy accept;
		jump k8s-nat-services
	}

	chain nat-postrouting {
		type nat hook postrouting priority filter; policy accept;
		jump k8s-nat-postrouting
	}

	chain k8s-nat-mark-drop {
		meta mark set 0x00008000
	}

	chain k8s-nat-services {
		ip saddr != 57.112.0.0/12 ip daddr 57.142.221.21 tcp dport 80 meta mark set 0x00004000
		ip daddr 57.142.221.21 tcp dport 80 jump KUBE-SVC-57XVOCFNTLTR3Q27
		ip saddr != 57.112.0.0/12 ip daddr 57.142.35.114 tcp dport 15443 meta mark set 0x00004000
		ip daddr 57.142.35.114 tcp dport 15443 jump KUBE-SVC-S4S242M2WNFIAT6Y
		ip daddr 57.131.151.19 tcp dport 8989 jump KUBE-SVC-MUPXPVK4XAZHSWAR
		ip daddr 192.168.80.104 tcp dport 8989 meta mark set 0x00004000
		fib saddr type != local ip daddr 192.168.80.104 tcp dport 8989 iifname != "bridge*" jump KUBE-SVC-MUPXPVK4XAZHSWAR
		fib daddr type local ip daddr 192.168.80.104 tcp dport 8989 jump KUBE-SVC-MUPXPVK4XAZHSWAR
	}

	chain k8s-nat-nodeports {
		tcp dport 30725 meta mark set 0x00004000 jump KUBE-SVC-S4S242M2WNFIAT6Y
	}

	chain k8s-nat-postrouting {
		meta mark 0x00004000 masquerade random,persistent
	}

	chain KUBE-SVC-S4S242M2WNFIAT6Y {
		jump KUBE-SEP-CUAZ6PSSTEDPJ43V
	}

	chain KUBE-SVC-57XVOCFNTLTR3Q27 {
		numgen random mod 2 vmap { 0 : jump KUBE-SEP-FS3FUULGZPVD4VYB, 1 : jump KUBE-SEP-MMFZROQSLQ3DKOQA }
	}

	chain KUBE-SVC-MUPXPVK4XAZHSWAR {
		jump KUBE-SEP-LO6TEVOI6GV524F3
	}

	chain KUBE-SEP-CUAZ6PSSTEDPJ43V {
		ip saddr 57.112.0.244 meta mark set 0x00004000
		dnat to 57.112.0.244:15443 fully-random
	}

	chain KUBE-SEP-FS3FUULGZPVD4VYB {
		ip saddr 57.112.0.247 meta mark set 0x00004000
		dnat to 57.112.0.247:8080 fully-random
	}

	chain KUBE-SEP-MMFZROQSLQ3DKOQA {
		ip saddr 57.112.0.248 meta mark set 0x00004000
		dnat to 57.112.0.248:8080 fully-random
	}

	chain KUBE-SEP-LO6TEVOI6GV524F3 {
		ip saddr 57.112.0.250 meta mark set 0x00004000
		dnat to 57.112.0.250:38989 fully-random
	}
}

On 2019-12-04, 12:49 PM, "Serguei Bezverkhi (sbezverk)" <sbezverk@cisco.com> wrote:

    Hello @Phil,

    Just to confirm,

    If I do,

    numgen random mod 3 vmap { 0 : jump endpoint1, 1 : jump endpoint2, 2 : jump endpoint3 }

    Then, if a 4th endpoint appears, I replace the previous rule with:

    numgen random mod 4 vmap { 0 : jump endpoint1, 1 : jump endpoint2, 2 : jump endpoint3, 3 : jump endpoint4 }

    That should do the trick for load balancing, right?
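
A minimal sketch of how a rule generator might rebuild this rule whenever the endpoint list changes. The function name numgen_vmap_rule and its list-of-chain-names input are assumptions made for illustration and are not taken from the actual proxy code. It only builds the rule text; loading it with nft is left out.

```python
def numgen_vmap_rule(endpoint_chains):
    """Build the load-balancing rule text for a service chain.

    endpoint_chains: ordered list of endpoint chain names (hypothetical input).
    """
    if not endpoint_chains:
        raise ValueError("a service needs at least one endpoint chain")
    if len(endpoint_chains) == 1:
        # A single endpoint needs no random selection, only a plain jump.
        return f"jump {endpoint_chains[0]}"
    entries = ", ".join(
        f"{index} : jump {chain}" for index, chain in enumerate(endpoint_chains)
    )
    # The modulus always equals the number of map entries, so every value
    # numgen can produce has exactly one matching verdict.
    return f"numgen random mod {len(endpoint_chains)} vmap {{ {entries} }}"


three = numgen_vmap_rule(["endpoint1", "endpoint2", "endpoint3"])
four = numgen_vmap_rule(["endpoint1", "endpoint2", "endpoint3", "endpoint4"])
print(three)
print(four)
```

Deriving the modulus from the list length keeps the modulus and the map keys in step, which is the property the question above depends on.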

    @Arturo

    I am not planning to use "dnat numgen random { 0-49 : <ip>:<port> }".

    Each endpoint will have its own chain, which will dnat to the endpoint's IP and its endpoint-specific target port. The load balancing will be done in the service chain, between multiple endpoint chains.
    See example above. Does it make sense?
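
A rough sketch of the per-endpoint chain idea, rendering one chain per endpoint so that each can carry its own target port. The helper name endpoint_chain, its parameters, and the second chain name and address in the usage lines are hypothetical. The rule layout and the 0x00004000 mark are copied from the ruleset posted in this thread.

```python
def endpoint_chain(name, ip, port, mark="0x00004000"):
    """Render one endpoint chain as nft ruleset text (illustrative only)."""
    lines = [
        f"chain {name} {{",
        # Mark traffic whose source is the endpoint itself, as the posted
        # ruleset does, so the postrouting chain can masquerade it.
        f"\tip saddr {ip} meta mark set {mark}",
        # Each endpoint carries its own address and port.
        f"\tdnat to {ip}:{port} fully-random",
        "}",
    ]
    return "\n".join(lines)


# Two endpoints of one service listening on different ports.
# The second name and address are made up for this example.
first = endpoint_chain("KUBE-SEP-FS3FUULGZPVD4VYB", "57.112.0.247", 8080)
second = endpoint_chain("KUBE-SEP-EXAMPLE", "57.112.0.249", 9090)
print(first)
print(second)
```

Because the port lives inside the endpoint chain, the service chain only has to choose between chains and never needs to know the ports.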

    Thank you
    Serguei

    On 2019-12-04, 12:31 PM, "Arturo Borrero Gonzalez" <arturo@netfilter.org> wrote:

        On 12/4/19 4:56 PM, Phil Sutter wrote:
        > OK, static load-balancing between two services - no big deal. :)
        > 
        > What happens if config changes? I.e., if one of the endpoints goes down
        > or a third one is added? (That's the thing we're discussing right now,
        > aren't we?)

        If a non-anonymous map were allowed for random numgen, only the elements would
        need to be adjusted:

        dnat numgen random mod 100 map { 0-49 : 1.1.1.1, 50-99 : 2.2.2.2 }

        You could always use mod 100 (or 10000 if you want) and just play with the map
        probabilities by updating map elements. This is a valid use case I think.
        The mod number can just be the max number of allowed endpoints per service in
        kubernetes.
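
To illustrate the fixed-modulus idea above, here is a small sketch that splits the range 0..99 into contiguous slices proportional to per-endpoint weights and formats them as map elements. The names weighted_ranges and map_body and the weights dictionary are assumptions made for illustration. The sketch only computes the element text; it does not assume that nft accepts a named map here.

```python
def weighted_ranges(weights, modulus=100):
    """Split 0..modulus-1 into contiguous ranges proportional to weights.

    weights: dict mapping backend address to a positive integer weight
    (hypothetical input). Returns a list of (low, high, address) tuples.
    """
    if not weights or any(weight <= 0 for weight in weights.values()):
        raise ValueError("weights must be non-empty and positive")
    total = sum(weights.values())
    ranges = []
    low = 0
    cumulative = 0
    for address, weight in weights.items():
        cumulative += weight
        # Integer arithmetic on the running total leaves no gaps or overlaps,
        # and the last range always ends at modulus - 1.
        high = cumulative * modulus // total - 1
        if high >= low:  # a very small weight may round down to no slice
            ranges.append((low, high, address))
            low = high + 1
    return ranges


def map_body(ranges):
    """Format ranges as the body of an nft map, e.g. { 0-49 : 1.1.1.1 }."""
    parts = []
    for low, high, address in ranges:
        key = str(low) if low == high else f"{low}-{high}"
        parts.append(f"{key} : {address}")
    return "{ " + ", ".join(parts) + " }"


equal = map_body(weighted_ranges({"1.1.1.1": 1, "2.2.2.2": 1}))
print("dnat numgen random mod 100 map " + equal)
```

With equal weights for the two addresses this reproduces the 0-49 and 50-99 split from the example rule. Adding or removing an endpoint changes only the element list, while the modulus stays fixed.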

        @Phil,

        I'm not sure if the typeof() thingy will work in this case, since the integer
        length would depend on the mod value used.
        What about introducing something like an explicit u128 integer datatype. Perhaps
        it's useful for other use cases too...

        @Serguei,

        kubernetes implements a complex chain of mechanisms to deal with traffic. What
        happens if endpoints for a given svc have different ports? I don't know if
        that's supported or not, but then this approach wouldn't work either: you can't
        use dnat numgen random { 0-49 : <ip>:<port> }.

        Also, we have the masquerade/drop thing going on too, which needs to be dealt
        with and which is currently done by yet another chain jump + packet mark.

        I'm not sure what stage of development you are at, but this is my
        suggestion: try not to over-optimize in the first iteration. Just get a
        working nft ruleset with the few optimizations that make sense and are easy to
        use (and understand). For iteration #2 we can do better optimizations, including
        patching missing features we may have in nftables.
        I really want a ruleset with very few rules, but we are still comparing with
        the iptables ruleset. I suggest we leave the hard optimization for a later point
        when we are comparing nft vs nft rulesets.