* Operation not supported when adding jump command
@ 2019-11-25 18:55 Serguei Bezverkhi (sbezverk)
  2019-11-26 12:21 ` Florian Westphal
  2019-12-03 23:50 ` Duncan Roe
  0 siblings, 2 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-25 18:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel

Hello Pablo,

Please see the table, chains, rules and sets I program below. When I try to add a jump from input-net or input-local to services, it fails with "Operation not supported". I would appreciate it if somebody could help me understand why:

sudo nft add rule ipv4table input-net jump services
Error: Could not process rule: Operation not supported
add rule ipv4table input-net jump services
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


table ip ipv4table {
	set no-endpoint-svc-ports {
		type inet_service
		elements = { 8080, 8989 }
	}

	set no-endpoint-svc-addrs {
		type ipv4_addr
		flags interval
		elements = { 10.1.1.1, 10.1.1.2 }
	}

	chain input-net {
		type nat hook prerouting priority filter; policy accept;
	}

	chain input-local {
		type nat hook output priority filter; policy accept;
	}

	chain services {
		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports reject with tcp reset
		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports reject with icmp type net-unreachable
	}
}

Thank you
Serguei


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-25 18:55 Operation not supported when adding jump command Serguei Bezverkhi (sbezverk)
@ 2019-11-26 12:21 ` Florian Westphal
  2019-11-26 14:30   ` Serguei Bezverkhi (sbezverk)
  2019-12-03 23:50 ` Duncan Roe
  1 sibling, 1 reply; 34+ messages in thread
From: Florian Westphal @ 2019-11-26 12:21 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk); +Cc: Pablo Neira Ayuso, netfilter-devel

Serguei Bezverkhi (sbezverk) <sbezverk@cisco.com> wrote:
> Hello Pablo,
> 
> Please see the table, chains, rules and sets I program below. When I try to add a jump from input-net or input-local to services, it fails with "Operation not supported". I would appreciate it if somebody could help me understand why:
> 
> sudo nft add rule ipv4table input-net jump services
> Error: Could not process rule: Operation not supported
> add rule ipv4table input-net jump services
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

iirc "reject" only works in input/forward/output hooks.
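
For illustration, a minimal sketch of the table from the original post with
the jumps placed on hooks where reject is accepted. The switch to "type
filter" base chains is an assumption made for this sketch (nat chains only
see the first packet of each connection); the set and chain names are the
ones from the original post.

```
table ip ipv4table {
	set no-endpoint-svc-ports {
		type inet_service
		elements = { 8080, 8989 }
	}

	set no-endpoint-svc-addrs {
		type ipv4_addr
		flags interval
		elements = { 10.1.1.1, 10.1.1.2 }
	}

	# assumption: filter-type base chain on the input hook
	chain input-net {
		type filter hook input priority filter; policy accept;
		jump services
	}

	# assumption: filter-type base chain on the output hook
	chain input-local {
		type filter hook output priority filter; policy accept;
		jump services
	}

	chain services {
		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports reject with tcp reset
		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports reject with icmp type net-unreachable
	}
}
```

Traffic that is routed through the host rather than delivered locally would
need the same jump from a chain on the forward hook.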

* Re: Operation not supported when adding jump command
  2019-11-26 12:21 ` Florian Westphal
@ 2019-11-26 14:30   ` Serguei Bezverkhi (sbezverk)
  2019-11-26 14:52     ` Florian Westphal
  2019-11-26 15:38     ` Pablo Neira Ayuso
  0 siblings, 2 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-26 14:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel

Hello Florian,

Thank you very much for your reply. Once I changed the chain to the input hook, the rule worked. It seems iptables does allow the same rule configuration, see below:

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 57.131.151.19/32 -p tcp -m comment --comment "default/portal:portal has no endpoints" -m tcp --dport 8989 -j REJECT --reject-with icmp-port-unreachable

This config is from a working Kubernetes cluster, for a service which has no endpoints. Do you know whether this change in behavior was a design decision or is a bug?

Thank you
Serguei


* Re: Operation not supported when adding jump command
  2019-11-26 14:30   ` Serguei Bezverkhi (sbezverk)
@ 2019-11-26 14:52     ` Florian Westphal
  2019-11-26 15:38     ` Pablo Neira Ayuso
  1 sibling, 0 replies; 34+ messages in thread
From: Florian Westphal @ 2019-11-26 14:52 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel

Serguei Bezverkhi (sbezverk) <sbezverk@cisco.com> wrote:
> Hello Florian,
> 
> Thank you very much for your reply. Once I changed the chain to the input hook, the rule worked. It seems iptables does allow the same rule configuration, see below:
> 
> -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
> -A KUBE-SERVICES -d 57.131.151.19/32 -p tcp -m comment --comment "default/portal:portal has no endpoints" -m tcp --dport 8989 -j REJECT --reject-with icmp-port-unreachable

No idea how this could work:

iptables -t nat -A PREROUTING -j REJECT
iptables: Invalid argument. Run `dmesg' for more information.
dmesg | tail -1
x_tables: ip_tables: REJECT target: only valid in filter

That check has been there since beginning of git history.

* Re: Operation not supported when adding jump command
  2019-11-26 14:30   ` Serguei Bezverkhi (sbezverk)
  2019-11-26 14:52     ` Florian Westphal
@ 2019-11-26 15:38     ` Pablo Neira Ayuso
  2019-11-26 15:47       ` Serguei Bezverkhi (sbezverk)
  1 sibling, 1 reply; 34+ messages in thread
From: Pablo Neira Ayuso @ 2019-11-26 15:38 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk); +Cc: Florian Westphal, netfilter-devel

On Tue, Nov 26, 2019 at 02:30:02PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> Hello Florian,
>
> Thank you very much for your reply. Once I changed the chain to the input hook, the rule worked. It seems iptables does allow the same rule configuration, see below:
>
> -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
> -A KUBE-SERVICES -d 57.131.151.19/32 -p tcp -m comment --comment "default/portal:portal has no endpoints" -m tcp --dport 8989 -j REJECT --reject-with icmp-port-unreachable

static struct xt_target reject_tg_reg __read_mostly = {
        .name           = "REJECT",
        .family         = NFPROTO_IPV4,
        .target         = reject_tg,
        .targetsize     = sizeof(struct ipt_reject_info),
        .table          = "filter",
        .hooks          = (1 << NF_INET_LOCAL_IN) | (1 << NF_INET_FORWARD) |
                          (1 << NF_INET_LOCAL_OUT),
        .checkentry     = reject_tg_check,
        .me             = THIS_MODULE,
};

* Re: Operation not supported when adding jump command
  2019-11-26 15:38     ` Pablo Neira Ayuso
@ 2019-11-26 15:47       ` Serguei Bezverkhi (sbezverk)
  2019-11-26 15:51         ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-26 15:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Hello,

I totally get that it is not possible in theory, but as a matter of fact it somehow works in Kubernetes. Maybe in some cases this check is not enforced, I do not know. If you are interested in investigating it further, please let me know; as I said, I have a cluster with these 2 rules configured.

Thank you
Serguei

* Re: Operation not supported when adding jump command
  2019-11-26 15:47       ` Serguei Bezverkhi (sbezverk)
@ 2019-11-26 15:51         ` Phil Sutter
  2019-11-26 18:47           ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-26 15:51 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel

Hi Serguei,

On Tue, Nov 26, 2019 at 03:47:49PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> I totally get that it is not possible in theory, but as a matter of fact it somehow works in Kubernetes. Maybe in some cases this check is not enforced, I do not know. If you are interested in investigating it further, please let me know; as I said, I have a cluster with these 2 rules configured.

In another case I noticed that user-defined chains are a way to
circumvent these types of functional restrictions. If that's good or bad
is up to you to decide. ;)

Regarding the desired functionality, I guess you're wandering the
sinkhole-filled plains of undefined behaviour.

Cheers, Phil

* Re: Operation not supported when adding jump command
  2019-11-26 15:51         ` Phil Sutter
@ 2019-11-26 18:47           ` Serguei Bezverkhi (sbezverk)
  2019-11-26 19:27             ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-26 18:47 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel

OK, I guess I will work around it by using chains on the input and output hooks, even though it will raise some eyebrows in the k8s networking community.

I have a second issue I am struggling to solve with nftables. Here is a service exposed on TCP port 80 which has 2 corresponding backends listening on container port 8080.

!
! Backend 1
!
-A KUBE-SEP-FS3FUULGZPVD4VYB -s 57.112.0.247/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-FS3FUULGZPVD4VYB -p tcp -m tcp -j DNAT --to-destination 57.112.0.247:8080
!
! Backend 2
!
-A KUBE-SEP-MMFZROQSLQ3DKOQA -s 57.112.0.248/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-MMFZROQSLQ3DKOQA -p tcp -m tcp -j DNAT --to-destination 57.112.0.248:8080
!
! Service
!
-A KUBE-SERVICES -d 57.142.221.21/32 -p tcp -m comment --comment "default/app:http-web cluster IP" -m tcp --dport 80 -j KUBE-SVC-57XVOCFNTLTR3Q27
!
! Load balancing between 2 backends
!
-A KUBE-SVC-57XVOCFNTLTR3Q27 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-FS3FUULGZPVD4VYB
-A KUBE-SVC-57XVOCFNTLTR3Q27 -j KUBE-SEP-MMFZROQSLQ3DKOQA

I am looking for the nftables equivalent of the load balancing part. Also, in this case there is a double DNAT translation: the destination port from 80 to 8080, and the destination IP to 57.112.0.247 or 57.112.0.248.
Can it be expressed in a single nft dnat statement with vmaps or sets?

Thank you
Serguei


* Re: Operation not supported when adding jump command
  2019-11-26 18:47           ` Serguei Bezverkhi (sbezverk)
@ 2019-11-26 19:27             ` Phil Sutter
  2019-11-26 21:20               ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-26 19:27 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel

Hi,

On Tue, Nov 26, 2019 at 06:47:09PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> OK, I guess I will work around it by using chains on the input and output hooks, even though it will raise some eyebrows in the k8s networking community.
> 
> I have a second issue I am struggling to solve with nftables. Here is a service exposed on TCP port 80 which has 2 corresponding backends listening on container port 8080.
> 
> !
> ! Backend 1
> !
> -A KUBE-SEP-FS3FUULGZPVD4VYB -s 57.112.0.247/32 -j KUBE-MARK-MASQ
> -A KUBE-SEP-FS3FUULGZPVD4VYB -p tcp -m tcp -j DNAT --to-destination 57.112.0.247:8080
> !
> ! Backend 2
> !
> -A KUBE-SEP-MMFZROQSLQ3DKOQA -s 57.112.0.248/32 -j KUBE-MARK-MASQ
> -A KUBE-SEP-MMFZROQSLQ3DKOQA -p tcp -m tcp -j DNAT --to-destination 57.112.0.248:8080
> !
> ! Service
> !
> -A KUBE-SERVICES -d 57.142.221.21/32 -p tcp -m comment --comment "default/app:http-web cluster IP" -m tcp --dport 80 -j KUBE-SVC-57XVOCFNTLTR3Q27
> !
> ! Load balancing between 2 backends
> !
> -A KUBE-SVC-57XVOCFNTLTR3Q27 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-FS3FUULGZPVD4VYB
> -A KUBE-SVC-57XVOCFNTLTR3Q27 -j KUBE-SEP-MMFZROQSLQ3DKOQA
> 
> I am looking for the nftables equivalent of the load balancing part. Also, in this case there is a double DNAT translation: the destination port from 80 to 8080, and the destination IP to 57.112.0.247 or 57.112.0.248.
> Can it be expressed in a single nft dnat statement with vmaps or sets?

Regarding xt_statistic replacement, I once identified the equivalent of
'-m statistic --mode random --probability 0.5' would be 'numgen random
mod 0x2 < 0x1'.

Keeping both target address and port in a single map for *NAT statements
is not possible AFAIK.

If I'm not mistaken, you might be able to hook up a vmap together with
the numgen expression above like so:

| numgen random mod 0x2 vmap { \
|	0x0: jump KUBE-SEP-FS3FUULGZPVD4VYB, \
|	0x1: jump KUBE-SEP-MMFZROQSLQ3DKOQA }

Pure speculation, though. :)
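
Put together as a loadable ruleset, the idea might look like the sketch
below. It is untested; the table name "kubesketch", the base chain and its
"priority dstnat" setting are assumptions made for this illustration, while
the addresses, ports and KUBE-* chain names come from the iptables rules
quoted above. The KUBE-MARK-MASQ rules are left out.

```
table ip kubesketch {
	# one chain per backend, each doing address and port translation
	chain KUBE-SEP-FS3FUULGZPVD4VYB {
		ip protocol tcp dnat to 57.112.0.247:8080
	}

	chain KUBE-SEP-MMFZROQSLQ3DKOQA {
		ip protocol tcp dnat to 57.112.0.248:8080
	}

	# pick one of the two backends at random
	chain KUBE-SVC-57XVOCFNTLTR3Q27 {
		numgen random mod 2 vmap { 0 : jump KUBE-SEP-FS3FUULGZPVD4VYB, 1 : jump KUBE-SEP-MMFZROQSLQ3DKOQA }
	}

	# assumption: nat base chain on prerouting dispatching to the service
	chain prerouting {
		type nat hook prerouting priority dstnat; policy accept;
		ip daddr 57.142.221.21 tcp dport 80 jump KUBE-SVC-57XVOCFNTLTR3Q27
	}
}
```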

Cheers, Phil

* Re: Operation not supported when adding jump command
  2019-11-26 19:27             ` Phil Sutter
@ 2019-11-26 21:20               ` Serguei Bezverkhi (sbezverk)
  2019-11-26 22:15                 ` Phil Sutter
  2019-11-27 10:11                 ` Arturo Borrero Gonzalez
  0 siblings, 2 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-26 21:20 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel

Hello Phil,

It almost worked. Check this out:
sudo nft list table ipv4table
table ip ipv4table {
	set no-endpoint-svc-ports {
		type inet_service
		elements = { 8080, 8989 }
	}

	set no-endpoint-svc-addrs {
		type ipv4_addr
		flags interval
		elements = { 10.1.1.1, 10.1.1.2}
	}

	chain input-net {
		type nat hook input priority filter; policy accept;
		jump services
	}

	chain input-local {
		type nat hook output priority filter; policy accept;
		jump services
	}

	chain services {
		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports reject with tcp reset
		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports reject with icmp type net-unreachable
	}

	chain svc1-endpoint-1 {
		ip protocol tcp dnat to 12.1.1.1:8080
	}

	chain svc1-endpoint-2 {
		ip protocol tcp dnat to 12.1.1.2:8080
	}

	chain svc2-endpoint-1 {
		ip protocol tcp dnat to 12.1.1.3:8090
	}

	chain svc2-endpoint-2 {
		ip protocol tcp dnat to 12.1.1.4:8090
	}

	chain svc1 {
	}

	chain svc2 {
	}

	chain prerouting {
		type nat hook prerouting priority filter; policy accept;
		ip daddr 1.1.1.1 tcp dport 88 numgen random mod 2 vmap { 0 : jump svc1-endpoint-1, 1 : jump svc1-endpoint-2 }
		ip daddr 2.2.2.2 tcp dport 99 numgen random mod 2 vmap { 0 : jump svc2-endpoint-1, 1 : jump svc2-endpoint-2 }
	}
}

Ideally I need to apply this rule, "numgen random mod 2 vmap { 0 : jump svc1-endpoint-1, 1 : jump svc1-endpoint-2 }", to the svc1 and svc2 chains to load balance between the services' endpoints, but when I do that it fails with "Operation not supported".
In contrast, it lets me apply this rule to the prerouting chain.

This split, with reject supported only in input/forward/output and numgen only in prerouting, is not ideal: a packet from a client of a service without registered endpoints will need to go through all checks in the prerouting chain before it reaches the input chain and gets its reject back.

Thank you very much for your help.
Serguei

* Re: Operation not supported when adding jump command
  2019-11-26 21:20               ` Serguei Bezverkhi (sbezverk)
@ 2019-11-26 22:15                 ` Phil Sutter
  2019-11-27 10:11                 ` Arturo Borrero Gonzalez
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Sutter @ 2019-11-26 22:15 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel

Hi,

On Tue, Nov 26, 2019 at 09:20:20PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> It almost worked. Check this out:
> sudo nft list table ipv4table
> table ip ipv4table {
> 	set no-endpoint-svc-ports {
> 		type inet_service
> 		elements = { 8080, 8989 }
> 	}
> 
> 	set no-endpoint-svc-addrs {
> 		type ipv4_addr
> 		flags interval
> 		elements = { 10.1.1.1, 10.1.1.2}
> 	}
> 
> 	chain input-net {
> 		type nat hook input priority filter; policy accept;
> 		jump services
> 	}
> 
> 	chain input-local {
> 		type nat hook output priority filter; policy accept;
> 		jump services
> 	}
> 
> 	chain services {
> 		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports reject with tcp reset
> 		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports reject with icmp type net-unreachable
> 	}
> 
> 	chain svc1-endpoint-1 {
> 		ip protocol tcp dnat to 12.1.1.1:8080
> 	}
> 
> 	chain svc1-endpoint-2 {
> 		ip protocol tcp dnat to 12.1.1.2:8080
> 	}
> 
> 	chain svc2-endpoint-1 {
> 		ip protocol tcp dnat to 12.1.1.3:8090
> 	}
> 
> 	chain svc2-endpoint-2 {
> 		ip protocol tcp dnat to 12.1.1.4:8090
> 	}
> 
> 	chain svc1 {
> 	}
> 
> 	chain svc2 {
> 	}
> 
> 	chain prerouting {
> 		type nat hook prerouting priority filter; policy accept;
> 		ip daddr 1.1.1.1 tcp dport 88 numgen random mod 2 vmap { 0 : jump svc1-endpoint-1, 1 : jump svc1-endpoint-2 }
> 		ip daddr 2.2.2.2 tcp dport 99 numgen random mod 2 vmap { 0 : jump svc2-endpoint-1, 1 : jump svc2-endpoint-2 }
> 	}}
> 
> Ideally I need to apply  this rule " numgen random mod 2 vmap { 0 : jump svc1-endpoint-1, 1 : jump svc1-endpoint-2 }" to svc1 and svc2 chains to load balance between services' endpoints but when I do that it fails with Unsupported operation.
> In contrast it let me apply this rule to prerouting chain.

I don't see where you jump to svc1/svc2 so this is a bit of guesswork.
Anyway, please keep in mind that dnat is only supported in nat-type chains
(on the prerouting or output hooks).

> This split, with reject supported only in input/forward/output and numgen only in prerouting, is not ideal: a packet from a client of a service without registered endpoints will need to go through all checks in the prerouting chain before it reaches the input chain and gets its reject back.

As said, it is dnat which is limited to prerouting. Numgen itself works
everywhere. If there is a known criterion identifying a client without
registered endpoint, you could match on that and 'accept' early in
prerouting. This will make the packet go to input/forward directly
without traversing the remaining prerouting rules.
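
As a sketch of that early accept, assuming the two sets from the ruleset
quoted above are what identifies services without endpoints (an assumption
made for this illustration, not something confirmed here):

```
table ip ipv4table {
	set no-endpoint-svc-ports {
		type inet_service
		elements = { 8080, 8989 }
	}

	set no-endpoint-svc-addrs {
		type ipv4_addr
		flags interval
		elements = { 10.1.1.1, 10.1.1.2 }
	}

	chain prerouting {
		type nat hook prerouting priority filter; policy accept;
		# assumed early exit: stop evaluating this chain for services
		# that have no endpoints, so the packet moves on to input/forward
		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports accept
		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports accept
		# the per-service numgen/vmap rules would follow here
	}
}
```

The reject rules themselves would still sit in chains on the
input/forward/output hooks.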

Cheers, Phil

* Re: Operation not supported when adding jump command
  2019-11-26 21:20               ` Serguei Bezverkhi (sbezverk)
  2019-11-26 22:15                 ` Phil Sutter
@ 2019-11-27 10:11                 ` Arturo Borrero Gonzalez
  2019-11-27 11:57                   ` Phil Sutter
  2019-11-27 14:36                   ` Serguei Bezverkhi (sbezverk)
  1 sibling, 2 replies; 34+ messages in thread
From: Arturo Borrero Gonzalez @ 2019-11-27 10:11 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Phil Sutter, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

On 11/26/19 10:20 PM, Serguei Bezverkhi (sbezverk) wrote:
>     On Tue, Nov 26, 2019 at 06:47:09PM +0000, Serguei Bezverkhi (sbezverk) wrote:
>     > OK, I guess I will work around it by using chains on the input and output hooks, even though it will raise some eyebrows in the k8s networking community.
>     > 

@Serguei, thanks for reaching out about this topic.

I'm using k8s a lot lately and would be interested in knowing more about what
you are trying to do with kubernetes and nftables.

In any case, if somebody in kubernetes is planning to introduce nft for
kube-proxy or another component, I would suggest the generated ruleset is
validated here to really benefit from nftables. Is this what you are doing?

Recently I had the chance to attend a talk by @Laura (in CC) about the iptables
ruleset generated by docker and kube-proxy. Such rulesets are the opposite of
something meant to scale and perform well. Then people compare such rulesets
with other networking setups... an unfair comparison.

Worth mentioning at this point this PoC too:

https://github.com/zevenet/kube-nftlb

Trying to mimic 1:1 what iptables was doing is a mistake from my point of view.
I believe you are aware of this already :-)

>     
>     Keeping both target address and port in a single map for *NAT statements
>     is not possible AFAIK.

@Phil, I think it is possible! examples in the wiki:

https://wiki.nftables.org/wiki-nftables/index.php/Multiple_NATs_using_nftables_maps

It would be something like:

% nft add rule nat prerouting dnat \
      tcp dport map { 1000 : 1.1.1.1, 2000 : 2.2.2.2, 3000 : 3.3.3.3} \
      : tcp dport map { 1000 : 1234, 2000 : 2345, 3000 : 3456 }
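
Written as a ruleset file, the two-map form above might look like the sketch
below. It is untested; the table and chain declarations and the "priority
dstnat" setting are assumptions added so the fragment can be loaded on its
own with "nft -f", and the map contents are the ones from the example.

```
table ip nat {
	chain prerouting {
		type nat hook prerouting priority dstnat; policy accept;
		# the first map selects the new destination address, the second the
		# new destination port, both keyed on the original TCP port
		dnat to tcp dport map { 1000 : 1.1.1.1, 2000 : 2.2.2.2, 3000 : 3.3.3.3 } : tcp dport map { 1000 : 1234, 2000 : 2345, 3000 : 3456 }
	}
}
```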


>     
>     If I'm not mistaken, you might be able to hook up a vmap together with
>     the numgen expression above like so:
>     
>     | numgen random mod 0x2 vmap { \
>     |	0x0: jump KUBE-SEP-FS3FUULGZPVD4VYB, \
>     |	0x1: jump KUBE-SEP-MMFZROQSLQ3DKOQA }
>     
>     Pure speculation, though. :)
>     

This works indeed. Just added the example to the wiki:

https://wiki.nftables.org/wiki-nftables/index.php/Load_balancing#Round_Robin



* Re: Operation not supported when adding jump command
  2019-11-27 10:11                 ` Arturo Borrero Gonzalez
@ 2019-11-27 11:57                   ` Phil Sutter
  2019-11-27 14:36                   ` Serguei Bezverkhi (sbezverk)
  1 sibling, 0 replies; 34+ messages in thread
From: Phil Sutter @ 2019-11-27 11:57 UTC (permalink / raw)
  To: Arturo Borrero Gonzalez
  Cc: Serguei Bezverkhi (sbezverk),
	Pablo Neira Ayuso, Florian Westphal, netfilter-devel,
	Laura Garcia

Hi Arturo,

On Wed, Nov 27, 2019 at 11:11:32AM +0100, Arturo Borrero Gonzalez wrote:
> On 11/26/19 10:20 PM, Serguei Bezverkhi (sbezverk) wrote:
> >     On Tue, Nov 26, 2019 at 06:47:09PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> >     > OK, I guess I will work around it by using chains on the input and output hooks, even though it will raise some eyebrows in the k8s networking community.
> >     > 
> 
> @Serguei, thanks for reaching out about this topic.
> 
> I'm using k8s a lot lately and would be interested in knowing more about what
> you are trying to do with kubernetes and nftables.
> 
> In any case, if somebody in kubernetes is planning to introduce nft for
> kube-proxy or another component, I would suggest the generated ruleset is
> validated here to really benefit from nftables. Is this what you are doing?
> 
> Recently I had the chance to attend a talk by @Laura (in CC) about the iptables
> ruleset generated by docker and kube-proxy. Such rulesets are the opposite of
> something meant to scale and perform well. Then people compare such rulesets
> with other networking setups... an unfair comparison.
> 
> Worth mentioning at this point this PoC too:
> 
> https://github.com/zevenet/kube-nftlb
> 
> Trying to mimic 1:1 what iptables was doing is a mistake from my point of view.
> I believe you are aware of this already :-)
> 
> >     
> >     Keeping both target address and port in a single map for *NAT statements
> >     is not possible AFAIK.
> 
> @Phil, I think it is possible! examples in the wiki:
> 
> https://wiki.nftables.org/wiki-nftables/index.php/Multiple_NATs_using_nftables_maps
> 
> It would be something like:
> 
> % nft add rule nat prerouting dnat \
>       tcp dport map { 1000 : 1.1.1.1, 2000 : 2.2.2.2, 3000 : 3.3.3.3} \
>       : tcp dport map { 1000 : 1234, 2000 : 2345, 3000 : 3456 }

Ah, thanks! Using two maps didn't come to mind.

> >     If I'm not mistaken, you might be able to hook up a vmap together with
> >     the numgen expression above like so:
> >     
> >     | numgen random mod 0x2 vmap { \
> >     |	0x0: jump KUBE-SEP-FS3FUULGZPVD4VYB, \
> >     |	0x1: jump KUBE-SEP-MMFZROQSLQ3DKOQA }
> >     
> >     Pure speculation, though. :)
> >     
> 
> This works indeed. Just added the example to the wiki:
> 
> https://wiki.nftables.org/wiki-nftables/index.php/Load_balancing#Round_Robin

Thanks, Phil

* Re: Operation not supported when adding jump command
  2019-11-27 10:11                 ` Arturo Borrero Gonzalez
  2019-11-27 11:57                   ` Phil Sutter
@ 2019-11-27 14:36                   ` Serguei Bezverkhi (sbezverk)
  2019-11-27 15:08                     ` Phil Sutter
  1 sibling, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-27 14:36 UTC (permalink / raw)
  To: Arturo Borrero Gonzalez
  Cc: Phil Sutter, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hello Arturo,

Thanks a lot for your reply. My ultimate goal is to develop a kube-proxy which builds nftables rules instead of iptables rules. In addition, the goal is to use direct API calls to netlink without any external dependencies and, of course, to leverage nftables' advanced features to achieve the best performance.

I am in the process of identifying gaps in the functionality available in the github.com/google/nftables and github.com/sbezverk/nftableslib libraries. For example, yesterday I found out that neither of these libraries supports "numgen", which is a mandatory feature for load balancing between a service's multiple endpoints. I will have to add it to both to be able to move forward.
I take the iptables rules from a working cluster and try to write code which would program nftables the same way (with optimizations). Once that is done, it can be arranged into a controller that listens for svc/endpoints events and programs nftables accordingly.

I am looking for people interested in the same topic to discuss different approaches, as was done yesterday with Phil, and to select the best approach to make nftables shine.

Please let me know if you are interested in further discussions.

Thank you
Serguei

* Re: Operation not supported when adding jump command
  2019-11-27 14:36                   ` Serguei Bezverkhi (sbezverk)
@ 2019-11-27 15:08                     ` Phil Sutter
  2019-11-27 15:35                       ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-27 15:08 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi Serguei,

On Wed, Nov 27, 2019 at 02:36:07PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> Thanks a lot for your reply, my ultimate goal is to develop kube-proxy which is building  nftables rules instead of iptables, in addition the goal is to use direct API calls to netlink without any external dependencies and of course to try to leverage nftables' advanced features to achieve the best performance.
> 
> I am in the process of identifying gaps in functionality available in github.com/google/nftables and github.com/sbezverk/nftableslib libraries, example yesterday I found out that neither of these libraries supports "numgen", which would be a mandatory feature to support load balancing between service's multiple end points.  I will have to add it to both to be able to move forward.
> I use iptables from a working cluster and try to build a code which would program nftables the same way (with optimization). Once it is done, then it can be arranged into a controller listening for svc/endpoints and program  into nftables accordingly.
> 
> I am looking for people interested in the same topic to be able to discuss different approaches, like it was done yesterday with Phil and select the best approach to make nftables to shine (
> 
> Please let me know if you are interested in further discussions.

Yes, we're definitely interested in further discussion/cooperation. You're
using the JSON API for nftableslib, right?

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-27 15:08                     ` Phil Sutter
@ 2019-11-27 15:35                       ` Serguei Bezverkhi (sbezverk)
  2019-11-27 16:06                         ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-27 15:35 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi Phil,

No, I do not; nftableslib talks directly to netlink.

nftableslib offers an API which allows creating tables/chains/rules and exposes an interface which looks similar to k8s client-go. Checking https://github.com/sbezverk/nftableslib/blob/master/cmd/e2e/e2e.go will give you a good idea of how it operates.

The reason for going in this direction is performance. For relatively static applications like a firewall, the JSON approach is great, but for applications like kube-proxy, where hundreds or even thousands of service/endpoint events happen, I do not believe JSON is the right approach. When I talked to the API machinery folks, I was given 5k events per second as a target.

Thank you
Serguei


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-27 15:35                       ` Serguei Bezverkhi (sbezverk)
@ 2019-11-27 16:06                         ` Phil Sutter
  2019-11-27 16:50                           ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-27 16:06 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi,

On Wed, Nov 27, 2019 at 03:35:04PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> No, I do not, nftableslib talks directly talk to netlink connection.
> 
> nftableslib offers an API which allows create tables/chains/rules and exposes an interface which looks similar to k8s client-go.  If you check https://github.com/sbezverk/nftableslib/blob/master/cmd/e2e/e2e.go
> 
> It will give you a good idea how it operates.
> 
> The reason for going in this direction is  performance, for a relatively static applications like a firewall, json approach is great, but for applications like a kube-proxy where hundreds or even thousands of service/endpoint events happen, I do not believe json is a right approach. When I talked to api machinery folks I was given 5k events per second as a target.

So you're bypassing both libnftables and libnftnl. Those 5k events per
second are a benchmark, not an expected load, right?

While you're obviously searching for the most performance, the drawback
is complexity. Using JSON (and thereby libnftables and libnftnl as
backends) a task like utilizing numgen expression is relatively simple.

A problem you won't get rid of with the move from iptables to nftables
is concurrent use: The "let's insert our rules on top" approach to
dealing with an existing ruleset or other users is obviously not the
best one. I guess you're aiming at dedicated applications where this is
not an issue but for "general purpose" applications I guess a k8s
backend communicating with firewalld would be a good approach of
customizing host's firewall setup without stepping onto others' toes.

Back to topic: are you creating a static ruleset based on the iptables
one you got, for simple comparison tests, or are you already past that? If
not, I guess it would be a good basis for high level ruleset
optimization discussions.

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-27 16:06                         ` Phil Sutter
@ 2019-11-27 16:50                           ` Serguei Bezverkhi (sbezverk)
  2019-11-27 17:22                             ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-27 16:50 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi,

According to the API folks, kube-proxy must sustain a test of about 5k events per second, otherwise it will never see a production environment. Implementing the numgen expression is relatively simple thanks to "nft --debug all"; once it's done, a user can use it as easily as with JSON.

Regarding concurrent usage, since my primary goal is kube-proxy I do not really care at this moment, as a k8s cluster is not an application you co-locate in production with other applications potentially altering the host's tables. I agree firewalld might be an interesting and more generic alternative, but seeing how quickly things are done in k8s, maybe it will be done by the end of the 21st century.

Once I get the filter chain portion into the code, I will share a link to the repo so you can review it.

Thanks a lot for this discussion, it is very useful.
Serguei


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-27 16:50                           ` Serguei Bezverkhi (sbezverk)
@ 2019-11-27 17:22                             ` Phil Sutter
  2019-11-28  1:22                               ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-27 17:22 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi,

On Wed, Nov 27, 2019 at 04:50:56PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> According to api folks kube-proxy must sustain 5k or about test otherwise it will never see production environment. Implementing of numgen expression is relatively simple, thanks to "nft --debug all" once it's done, a user can use it as easily as  with json __
> 
> Regarding concurrent usage, since my primary goal is kube-proxy I do not really care at this moment, as k8s cluster is not an application you co-locate in production with some other applications potentially altering host's tables. I agree firewalld might be interesting and more generic alternative, but seeing how quickly things are done in k8s,  maybe it will be done by the end of 21st century __

I agree, in dedicated setup there's no need for compromises. I guess if
you manage to reduce ruleset changes to mere set element modifications,
you could outperform iptables in that regard. Run-time performance of
the resulting ruleset will obviously benefit from set/map use as there
are much fewer rules to traverse for each packet.
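
To illustrate the idea of reducing ruleset changes to set element
modifications, here is a minimal Python sketch. It assumes a verdict map keyed
by address and port; the function name, the table name "t", the map name "m"
and the chain "do_reject" are placeholders chosen for illustration, not part of
any existing tool. The code only builds nft command strings and does not talk
to the kernel.

```python
from typing import List, Set, Tuple

Key = Tuple[str, int]  # (service IP, service port)


def element_updates(old: Set[Key], new: Set[Key],
                    table: str = "t", map_name: str = "m",
                    verdict: str = "goto do_reject") -> List[str]:
    """Return nft commands that turn the element set `old` into `new`.

    Only map elements are touched; no rule is added or removed.
    """
    cmds = []
    removed = sorted(old - new)
    added = sorted(new - old)
    if removed:
        # Deleting a map element only needs its key.
        keys = ", ".join(f"{ip} . {port}" for ip, port in removed)
        cmds.append(f"delete element ip {table} {map_name} {{ {keys} }}")
    if added:
        elems = ", ".join(f"{ip} . {port} : {verdict}" for ip, port in added)
        cmds.append(f"add element ip {table} {map_name} {{ {elems} }}")
    return cmds
```

The returned lines could be written to a file and loaded with "nft -f", so
that all element changes of one update are applied together.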

> Once I get filter chain portion in the code I will share a link to repo so you could review.

Thanks! I'm also interested in seeing whether there are any
inconveniences due to nftables limitations. Maybe some problems are
easier solved on kernel-side.

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-27 17:22                             ` Phil Sutter
@ 2019-11-28  1:22                               ` Serguei Bezverkhi (sbezverk)
  2019-11-28  9:10                                 ` Laura Garcia
  2019-11-28 13:08                                 ` Phil Sutter
  0 siblings, 2 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-28  1:22 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hello Phil,

Please see below the list of nftables rules the code generates to mimic only the filter chain portion of kube-proxy.

Here is the location of the code programming these rules:
https://github.com/sbezverk/nftableslib-samples/blob/master/proxy/mimic-filter/mimic-filter.go

Most of the rules are static and will be programmed just once when the proxy comes up, with the exception of the 2 rules in the k8s-filter-services chain. The reference to the list of ports can change. Ideally it would be great to express these two rules with a single rule and a vmap, where the key must be the service's IP AND the service port, as it is possible to have a single service IP associated with several ports, some of which have an endpoint and some of which do not. So far I could not figure it out. I appreciate your thoughts/suggestions/criticism. If you could file an issue for anything you feel needs to be discussed, that would be great.


sudo nft list table ipv4table
table ip ipv4table {
	set svc1-no-endpoints {
		type inet_service
		elements = { 8989 }
	}

	chain filter-input {
		type filter hook input priority filter; policy accept;
		ct state new jump k8s-filter-services
		jump k8s-filter-firewall
	}

	chain filter-output {
		type filter hook output priority filter; policy accept;
		ct state new jump k8s-filter-services
		jump k8s-filter-firewall
	}

	chain filter-forward {
		type filter hook forward priority filter; policy accept;
		jump k8s-filter-forward
		ct state new jump k8s-filter-services
	}

	chain k8s-filter-ext-services {
	}

	chain k8s-filter-firewall {
		meta mark 0x00008000 drop
	}

	chain k8s-filter-services {
		ip daddr 192.168.80.104 tcp dport @svc1-no-endpoints reject with icmp type host-unreachable
		ip daddr 57.131.151.19 tcp dport @svc1-no-endpoints reject with icmp type host-unreachable
	}

	chain k8s-filter-forward {
		ct state invalid drop
		meta mark 0x00004000 accept
		ip saddr 57.112.0.0/12 ct state established,related accept
		ip daddr 57.112.0.0/12 ct state established,related accept
	}
}

Thank you
Serguei


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28  1:22                               ` Serguei Bezverkhi (sbezverk)
@ 2019-11-28  9:10                                 ` Laura Garcia
  2019-11-28 11:58                                   ` Serguei Bezverkhi (sbezverk)
  2019-11-28 13:08                                 ` Phil Sutter
  1 sibling, 1 reply; 34+ messages in thread
From: Laura Garcia @ 2019-11-28  9:10 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Phil Sutter, Arturo Borrero Gonzalez, Pablo Neira Ayuso,
	Florian Westphal, netfilter-devel

Hi, I guess we had a very similar conversation with the sig-network guys.

Please see below some comments.

On Thu, Nov 28, 2019 at 2:22 AM Serguei Bezverkhi (sbezverk)
<sbezverk@cisco.com> wrote:
>
> Hello Phil,
>
> Please see below the list of nftables rules the code generate to mimic only filter chain portion of kube proxy.
>
> Here is the location of code programming these rules.
> https://github.com/sbezverk/nftableslib-samples/blob/master/proxy/mimic-filter/mimic-filter.go
>
> Most of rules are static, will be programed  just once when proxy comes up, with the exception is 2 rules in k8s-filter-services chain. The reference to the list of ports can change. Ideally it would be great to express these two rules with a single rule and a vmap, where the key must be service's ip AND service port, as it is possible to have a single service IP that can be associated with several ports and some of these ports might have an endpoint and some do not. So far I could not figure it out. Appreciate your thought/suggestions/critics. If you could file an issue for anything you feel needs to be discussed, that would be great.
>
>
> sudo nft list table ipv4table
> table ip ipv4table {
>         set svc1-no-endpoints {
>                 type inet_service
>                 elements = { 8989 }
>         }
>
>         chain filter-input {
>                 type filter hook input priority filter; policy accept;
>                 ct state new jump k8s-filter-services
>                 jump k8s-filter-firewall
>         }
>
>         chain filter-output {
>                 type filter hook output priority filter; policy accept;
>                 ct state new jump k8s-filter-services
>                 jump k8s-filter-firewall
>         }
>
>         chain filter-forward {
>                 type filter hook forward priority filter; policy accept;
>                 jump k8s-filter-forward
>                 ct state new jump k8s-filter-services
>         }
>
>         chain k8s-filter-ext-services {
>         }
>
>         chain k8s-filter-firewall {
>                 meta mark 0x00008000 drop
>         }
>
>         chain k8s-filter-services {
>                 ip daddr 192.168.80.104 tcp dport @svc1-no-endpoints reject with icmp type host-unreachable
>                 ip daddr 57.131.151.19 tcp dport @svc1-no-endpoints reject with icmp type host-unreachable
>         }
>

Here you're going to have the same problems as with iptables: lack of
scalability and complexity during rule removal. In nftlb we create
maps, and with the rules staying the same you only have to take care of
inserting and removing elements in them.

Some extensive examples here:

https://github.com/zevenet/nftlb/tree/master/tests

Regarding the ip : port natting, it is not possible to use 2 maps
because you need a numgen for each one and they will come up with
different numbers.
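
The effect described above can be shown with a small Python simulation. This
is only a sketch: it models "one random number per map" with two independent
draws and compares it with a single draw used for both address and port. The
backend addresses and ports and the function names are made up for the
example.

```python
import random
from typing import List, Tuple

Backend = Tuple[str, int]  # (endpoint IP, endpoint port)


def pick_two_draws(backends: List[Backend], rng: random.Random) -> Backend:
    """Address and port are looked up with two independent random numbers."""
    ip = backends[rng.randrange(len(backends))][0]
    port = backends[rng.randrange(len(backends))][1]
    return ip, port


def pick_one_draw(backends: List[Backend], rng: random.Random) -> Backend:
    """Address and port are both taken from the same random number."""
    return backends[rng.randrange(len(backends))]
```

With two backends, pick_two_draws() can return the address of one backend
combined with the port of the other, while pick_one_draw() always returns a
pair that actually exists.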

Cheers.

>         chain k8s-filter-forward {
>                 ct state invalid drop
>                 meta mark 0x00004000 accept
>                 ip saddr 57.112.0.0/12 ct state established,related accept
>                 ip daddr 57.112.0.0/12 ct state established,related accept
>         }
> }
>
> Thank you
> Serguei
>
> On 2019-11-27, 12:22 PM, "n0-1@orbyte.nwl.cc on behalf of Phil Sutter" <n0-1@orbyte.nwl.cc on behalf of phil@nwl.cc> wrote:
>
>     Hi,
>
>     On Wed, Nov 27, 2019 at 04:50:56PM +0000, Serguei Bezverkhi (sbezverk) wrote:
>     > According to api folks kube-proxy must sustain 5k or about test otherwise it will never see production environment. Implementing of numgen expression is relatively simple, thanks to "nft --debug all" once it's done, a user can use it as easily as  with json __
>     >
>     > Regarding concurrent usage, since my primary goal is kube-proxy I do not really care at this moment, as k8s cluster is not an application you co-locate in production with some other applications potentially altering host's tables. I agree firewalld might be interesting and more generic alternative, but seeing how quickly things are done in k8s,  maybe it will be done by the end of 21st century __
>
>     I agree, in dedicated setup there's no need for compromises. I guess if
>     you manage to reduce ruleset changes to mere set element modifications,
>     you could outperform iptables in that regard. Run-time performance of
>     the resulting ruleset will obviously benefit from set/map use as there
>     are much fewer rules to traverse for each packet.
>
>     > Once I get filter chain portion in the code I will share a link to repo so you could review.
>
>     Thanks! I'm also interested in seeing whether there are any
>     inconveniences due to nftables limitations. Maybe some problems are
>     easier solved on kernel-side.
>
>     Cheers, Phil
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28  9:10                                 ` Laura Garcia
@ 2019-11-28 11:58                                   ` Serguei Bezverkhi (sbezverk)
  0 siblings, 0 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-28 11:58 UTC (permalink / raw)
  To: Laura Garcia
  Cc: Phil Sutter, Arturo Borrero Gonzalez, Pablo Neira Ayuso,
	Florian Westphal, netfilter-devel

Hello Laura,

Thank you for your comments and the link. Maybe I have not reached that point yet, but I do not see the complexity in rule removal. Maybe it is because in the case of JSON the rule's handle is not reported back to the caller? In my case, since a rule gets created individually and directly, I get back the uint64 rule handle, which can easily be associated with a service or endpoint; the same goes for set/map/vmap. For any further changes to the service, like removal or adding/deleting endpoints, the rule handle (or handles) will be available.

I do not have the code ready for this part yet, but I have done a rule update using the rule's handle for other things and it worked.
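
As a rough sketch of the bookkeeping described above (written in Python for
brevity, not the actual nftableslib code), a registry could remember which
rule handles belong to which service. The class and method names are
hypothetical, and the handle values are plain integers standing in for the
uint64 handles returned when the rules are created.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ServiceKey = Tuple[str, int]  # (service IP, service port)


class HandleRegistry:
    """Remembers which rule handles were created for which service."""

    def __init__(self) -> None:
        self._handles: Dict[ServiceKey, List[int]] = defaultdict(list)

    def record(self, svc: ServiceKey, handle: int) -> None:
        """Store the handle returned when a rule for `svc` was created."""
        self._handles[svc].append(handle)

    def release(self, svc: ServiceKey) -> List[int]:
        """Forget a service and return the handles of the rules to delete."""
        return self._handles.pop(svc, [])
```

When a service goes away, the handles returned by release() identify exactly
the rules to delete, the same way "nft delete rule ... handle N" does on the
command line.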

Thanks again for your feedback. Once I have the code for rules management, I will ask you to review it, if you do not mind.
Serguei


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28  1:22                               ` Serguei Bezverkhi (sbezverk)
  2019-11-28  9:10                                 ` Laura Garcia
@ 2019-11-28 13:08                                 ` Phil Sutter
  2019-11-28 13:34                                   ` Serguei Bezverkhi (sbezverk)
  2019-11-28 14:51                                   ` Serguei Bezverkhi (sbezverk)
  1 sibling, 2 replies; 34+ messages in thread
From: Phil Sutter @ 2019-11-28 13:08 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi Serguei,

On Thu, Nov 28, 2019 at 01:22:17AM +0000, Serguei Bezverkhi (sbezverk) wrote:
> Please see below the list of nftables rules the code generate to mimic only filter chain portion of kube proxy.
> 
> Here is the location of code programming these rules. 
> https://github.com/sbezverk/nftableslib-samples/blob/master/proxy/mimic-filter/mimic-filter.go
> 
> Most of rules are static, will be programed  just once when proxy comes up, with the exception is 2 rules in k8s-filter-services chain. The reference to the list of ports can change. Ideally it would be great to express these two rules with a single rule and a vmap, where the key must be service's ip AND service port, as it is possible to have a single service IP that can be associated with several ports and some of these ports might have an endpoint and some do not. So far I could not figure it out. Appreciate your thought/suggestions/critics. If you could file an issue for anything you feel needs to be discussed, that would be great.

What about something like this:

| table ip t {
| 	map m {
| 		type ipv4_addr . inet_service : verdict
| 		elements = { 192.168.80.104 . 8989 : goto do_reject }
| 	}
| 
| 	chain c {
| 		ip daddr . tcp dport vmap @m
| 	}
| 
| 	chain do_reject {
| 		reject with icmp type host-unreachable
| 	}
| }

For unknown reasons reject statement can't be used directly in a verdict
map, but the do_reject chain hack works.
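
The lookup behaviour of such a map with a concatenated key can be modelled in
a few lines of Python. This is only an illustration of the semantics, not of
how the kernel implements it; the type alias and function name are made up,
and port 8080 in the usage note is a hypothetical second port on the same
service IP.

```python
from typing import Dict, Optional, Tuple

# Key is the concatenation "ip daddr . tcp dport"; value is the verdict.
VerdictMap = Dict[Tuple[str, int], str]


def lookup(vmap: VerdictMap, daddr: str, dport: int) -> Optional[str]:
    """Return the verdict for a packet, or None when no element matches.

    In nftables a failed vmap lookup means the rule does not apply and
    evaluation continues with the next rule.
    """
    return vmap.get((daddr, dport))
```

With the single element from the map above, a packet to 192.168.80.104 port
8989 gets "goto do_reject", while a packet to the same address on port 8080
matches nothing. This is the per-port distinction on one service IP that was
asked for.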

> sudo nft list table ipv4table
> table ip ipv4table {
> 	set svc1-no-endpoints {
> 		type inet_service
> 		elements = { 8989 }
> 	}
> 
> 	chain filter-input {
> 		type filter hook input priority filter; policy accept;
> 		ct state new jump k8s-filter-services
> 		jump k8s-filter-firewall
> 	}
> 
> 	chain filter-output {
> 		type filter hook output priority filter; policy accept;
> 		ct state new jump k8s-filter-services
> 		jump k8s-filter-firewall
> 	}

Same ruleset for input and output? Seems weird given the daddr-based
filtering in k8s-filter-services.

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28 13:08                                 ` Phil Sutter
@ 2019-11-28 13:34                                   ` Serguei Bezverkhi (sbezverk)
  2019-11-28 14:51                                   ` Serguei Bezverkhi (sbezverk)
  1 sibling, 0 replies; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-28 13:34 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hello Phil,

Thanks a lot for your suggestions, I will refactor using this approach.

Best regards
Serguei

On 2019-11-28, 8:08 AM, "n0-1@orbyte.nwl.cc on behalf of Phil Sutter" <n0-1@orbyte.nwl.cc on behalf of phil@nwl.cc> wrote:

    Hi Serguei,
    
    On Thu, Nov 28, 2019 at 01:22:17AM +0000, Serguei Bezverkhi (sbezverk) wrote:
    > Please see below the list of nftables rules the code generates to mimic only the filter chain portion of kube-proxy.
    > 
    > Here is the location of code programming these rules. 
    > https://github.com/sbezverk/nftableslib-samples/blob/master/proxy/mimic-filter/mimic-filter.go
    > 
    > Most of the rules are static and will be programmed just once when the proxy comes up, with the exception of 2 rules in the k8s-filter-services chain. The reference to the list of ports can change. Ideally, these two rules would be expressed as a single rule with a vmap whose key is the service's IP AND the service port, since a single service IP can be associated with several ports, some of which have an endpoint and some of which do not. So far I could not figure it out. I would appreciate your thoughts, suggestions and criticism. If you could file an issue for anything you feel needs to be discussed, that would be great.
    
    What about something like this:
    
    | table ip t {
    | 	map m {
    | 		type ipv4_addr . inet_service : verdict
    | 		elements = { 192.168.80.104 . 8989 : goto do_reject }
    | 	}
    | 
    | 	chain c {
    | 		ip daddr . tcp dport vmap @m
    | 	}
    | 
    | 	chain do_reject {
    | 		reject with icmp type host-unreachable
    | 	}
    | }
    
    For unknown reasons, a reject statement can't be used directly in a verdict
    map, but the do_reject chain hack works.

This is exactly what I was looking for; I just never knew you could combine address and port in the key.
    
    > sudo nft list table ipv4table
    > table ip ipv4table {
    > 	set svc1-no-endpoints {
    > 		type inet_service
    > 		elements = { 8989 }
    > 	}
    > 
    > 	chain filter-input {
    > 		type filter hook input priority filter; policy accept;
    > 		ct state new jump k8s-filter-services
    > 		jump k8s-filter-firewall
    > 	}
    > 
    > 	chain filter-output {
    > 		type filter hook output priority filter; policy accept;
    > 		ct state new jump k8s-filter-services
    > 		jump k8s-filter-firewall
    > 	}
    
    Same ruleset for input and output? Seems weird given the daddr-based
    filtering in k8s-filter-services.
    
I will review the k8s filter input/output chains once more to check whether I got something wrong.

    Cheers, Phil
    


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28 13:08                                 ` Phil Sutter
  2019-11-28 13:34                                   ` Serguei Bezverkhi (sbezverk)
@ 2019-11-28 14:51                                   ` Serguei Bezverkhi (sbezverk)
  2019-11-28 15:15                                     ` Phil Sutter
  1 sibling, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-28 14:51 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi Phil,

Quick question: it appears that we do not yet support combining two types into a key, so I need to add it quickly; your help would be appreciated. Here is the sequence I get when creating such a map:
sudo nft --debug all add map ipv4table no-endpoint-services   { type  ipv4_addr . inet_service : verdict \; }

----------------	------------------
| 02 00 00 00  |	|  extra header  |
|00014|--|00001|	|len |flags| type|
| 69 70 76 34  |	|      data      |	 i p v 4
| 74 61 62 6c  |	|      data      |	 t a b l
| 65 00 00 00  |	|      data      |	 e      
|00025|--|00002|	|len |flags| type|
| 6e 6f 2d 65  |	|      data      |	 n o - e
| 6e 64 70 6f  |	|      data      |	 n d p o
| 69 6e 74 2d  |	|      data      |	 i n t -
| 73 65 72 76  |	|      data      |	 s e r v
| 69 63 65 73  |	|      data      |	 i c e s
| 00 00 00 00  |	|      data      |	        
|00008|--|00003|	|len |flags| type|   NFTA_SET_FLAGS
| 00 00 00 08  |	|      data      |	 NFT_SET_MAP                       = 0x8      

|00008|--|00004|	|len |flags| type|   NFTA_SET_KEY_TYPE                 = 0x4
| 00 00 01 cd  |	|      data      |	        

|00008|--|00005|	|len |flags| type|   NFTA_SET_KEY_LEN                  = 0x5
| 00 00 00 08  |	|      data      |	        

|00008|--|00006|	|len |flags| type|   NFTA_SET_DATA_TYPE                = 0x6  Verdict
| ff ff ff 00  |	|      data      |	        

|00008|--|00007|	|len |flags| type|   NFTA_SET_DATA_LEN                 = 0x7
| 00 00 00 00  |	|      data      |	        

|00008|--|00010|	|len |flags| type|   NFTA_SET_ID                       = 0xa
| 00 00 00 01  |	|      data      |	        
|00016|--|00013|	|len |flags| type|
| 00 04 00 00  |	|      data      |	        
| 00 00 01 04  |	|      data      |	        
| 00 00 00 00  |	|      data      |	        
----------------	------------------

Almost everything is clear except 2 points: how the key type value "00 00 01 cd" is generated, and why the key length is 8 and not 6.

Thanks a lot
Serguei

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28 14:51                                   ` Serguei Bezverkhi (sbezverk)
@ 2019-11-28 15:15                                     ` Phil Sutter
  2019-11-29 20:13                                       ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-28 15:15 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi,

On Thu, Nov 28, 2019 at 02:51:36PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> Quick question: it appears that we do not yet support combining two types into a key, so I need to add it quickly; your help would be appreciated. Here is the sequence I get when creating such a map:
> sudo nft --debug all add map ipv4table no-endpoint-services   { type  ipv4_addr . inet_service : verdict \; }
> 
[...]
> 
> Almost everything is clear except 2 points: how the key type value "00 00 01 cd" is generated, and why the key length is 8 and not 6.

I've been through that recently when implementing among match support in
iptables-nft (which uses an anonymous set with concatenated elements
internally). Please have a look at the relevant code here:

https://git.netfilter.org/iptables/tree/iptables/nft.c#n999

I guess this helps clarify how set flags are created and how to pad
element data.
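
As a rough illustration of the two points asked about (a sketch, not an
excerpt from the linked nft.c), the value 0x1cd and the key length 8 can be
reproduced as follows. The numeric type IDs, the 6-bit shift and the 4-byte
register size are assumptions based on nftables' datatype enumeration; the
helper names concat_type and concat_len are made up for this example.

```python
# Sketch: deriving the key type and key length of a concatenated set key.
# All constants below are assumptions, chosen to match nftables' datatype
# enumeration and the debug output quoted earlier in the thread.
TYPE_BITS = 6            # bits reserved per component type
TYPE_IPADDR = 7          # ipv4_addr, 4 bytes
TYPE_INET_SERVICE = 13   # inet_service, 2 bytes
REG_SIZE = 4             # each component is padded to a 32-bit register


def concat_type(*types: int) -> int:
    """Pack component types left to right, TYPE_BITS bits each."""
    key_type = 0
    for t in types:
        key_type = (key_type << TYPE_BITS) | t
    return key_type


def concat_len(*sizes: int) -> int:
    """Sum component sizes after rounding each up to REG_SIZE bytes."""
    return sum(-(-size // REG_SIZE) * REG_SIZE for size in sizes)


# ipv4_addr . inet_service
print(hex(concat_type(TYPE_IPADDR, TYPE_INET_SERVICE)))  # 0x1cd
print(concat_len(4, 2))                                   # 8
```

Under these assumptions the 2-byte port occupies a full 4-byte slot, which
is why the key length is 8 rather than 6, and element data needs the same
padding.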

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-28 15:15                                     ` Phil Sutter
@ 2019-11-29 20:13                                       ` Serguei Bezverkhi (sbezverk)
  2019-11-30  0:04                                         ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-11-29 20:13 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hello,

@Phil, thanks so much for the concat suggestion. Any more points for optimization? If not, I will move on to the nat portion of the k8s iptables rules.
Here are the rules generated with the refactored code:
table ip ipv4table {
	map no-endpoints-services {
		type ipv4_addr . inet_service : verdict
		elements = { 57.131.151.19 . 8989 : jump k8s-filter-do-reject,
			     192.168.80.104 . 8989 : jump k8s-filter-do-reject }
	}

	chain filter-input {
		type filter hook input priority filter; policy accept;
		ct state new jump k8s-filter-services
		jump k8s-filter-firewall
	}

	chain filter-output {
		type filter hook output priority filter; policy accept;
		ct state new jump k8s-filter-services
		jump k8s-filter-firewall
	}

	chain filter-forward {
		type filter hook forward priority filter; policy accept;
		jump k8s-filter-forward
		ct state new jump k8s-filter-services
	}

	chain k8s-filter-firewall {
		meta mark 0x00008000 drop
	}

	chain k8s-filter-services {
		ip daddr . tcp dport vmap @no-endpoints-services
	}

	chain k8s-filter-forward {
		ct state invalid drop
		meta mark 0x00004000 accept
		ip saddr 57.112.0.0/12 ct state established,related accept
		ip daddr 57.112.0.0/12 ct state established,related accept
	}

	chain k8s-filter-do-reject {
		reject with icmp type host-unreachable
	}
}

Thank you
Serguei

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-29 20:13                                       ` Serguei Bezverkhi (sbezverk)
@ 2019-11-30  0:04                                         ` Phil Sutter
  2019-12-03 18:43                                           ` Serguei Bezverkhi (sbezverk)
  0 siblings, 1 reply; 34+ messages in thread
From: Phil Sutter @ 2019-11-30  0:04 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi Serguei,

On Fri, Nov 29, 2019 at 08:13:21PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> @Phil, thanks so much for the concat suggestion. Any more points for optimization? If not, I will move on to the nat portion of the k8s iptables rules.

Looks fine to me. I don't like the mark-based verdicts, but to validate
those we need to see where the marks are set.

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-30  0:04                                         ` Phil Sutter
@ 2019-12-03 18:43                                           ` Serguei Bezverkhi (sbezverk)
  2019-12-04 10:36                                             ` Phil Sutter
  0 siblings, 1 reply; 34+ messages in thread
From: Serguei Bezverkhi (sbezverk) @ 2019-12-03 18:43 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hello Phil,

I started working on the nat portion, and here is an iptables rule that is a bit concerning.

-A KUBE-SERVICES -d 192.168.80.104/32 -p tcp -m comment --comment "default/portal:portal external IP" -m tcp --dport 8989 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-MUPXPVK4XAZHSWAR

I can address "addrtype" with nftables "fib" and "iif type local", but I am not sure about "physdev"; I would appreciate any suggestions.

Thank you
Serguei

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-11-25 18:55 Operation not supported when adding jump command Serguei Bezverkhi (sbezverk)
  2019-11-26 12:21 ` Florian Westphal
@ 2019-12-03 23:50 ` Duncan Roe
  2019-12-04  1:13   ` [PATCH nft] doc: Clarify conditions under which a reject verdict is permissible Duncan Roe
  2019-12-06  2:37   ` [PATCH nft v2] " Duncan Roe
  1 sibling, 2 replies; 34+ messages in thread
From: Duncan Roe @ 2019-12-03 23:50 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk); +Cc: Pablo Neira Ayuso, netfilter-devel

On Mon, Nov 25, 2019 at 06:55:41PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> Hello Pablo,
>
> Please see below the table/chain/rules/sets I program. When I try to add a jump from input-net or input-local to services, it fails with "Operation not supported". I would appreciate it if somebody could help me understand why:
>
> sudo nft add rule ipv4table input-net jump services
> Error: Could not process rule: Operation not supported
> add rule ipv4table input-net jump services
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
> table ip ipv4table {
> 	set no-endpoint-svc-ports {
> 		type inet_service
> 		elements = { 8080, 8989 }
> 	}
>
> 	set no-endpoint-svc-addrs {
> 		type ipv4_addr
> 		flags interval
> 		elements = { 10.1.1.1, 10.1.1.2 }
> 	}
>
> 	chain input-net {
> 		type nat hook prerouting priority filter; policy accept;
> 	}
>
> 	chain input-local {
> 		type nat hook output priority filter; policy accept;
> 	}
>
> 	chain services {
> 		ip daddr @no-endpoint-svc-addrs tcp dport @no-endpoint-svc-ports reject with tcp reset
> 		ip daddr @no-endpoint-svc-addrs udp dport @no-endpoint-svc-ports reject with icmp type net-unreachable
> 	}
> }
>
> Thank you
> Serguei
>
Hi Serguei,

The reason it fails is, from *man nft*:

> This statement [reject] is only valid in the input, forward and output chains,
> and user-defined chains which are only called from those chains.

(I inserted the bit in square brackets).

The wording could perhaps be clarified: what it really means to say is

Reject is only valid in base chains using the input, forward or output
hooks, and user-defined chains which are only called from those chains.

Put that way, you can see why your command is rejected.
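
The rule as reworded can be modelled with a small Python check. This is a
toy illustration of the documented condition only, not how the kernel
validates rulesets; the reject_allowed helper and its dictionary layout are
assumptions made for the example.

```python
# Toy check of the documented condition: a chain may contain reject only
# if every base chain that can reach it uses the input, forward or
# output hook.
REJECT_HOOKS = {"input", "forward", "output"}


def reject_allowed(chain, hooks, jumps):
    """hooks: base chain name -> hook name.
    jumps: chain name -> set of chains it jumps (or gotos) to."""

    def reachable(start):
        # Depth-first walk over jump targets, starting at a base chain.
        seen, stack = set(), [start]
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(jumps.get(cur, ()))
        return seen

    return all(
        hook in REJECT_HOOKS
        for base, hook in hooks.items()
        if chain in reachable(base)
    )


# The failing command: a prerouting base chain jumping to "services".
print(reject_allowed(
    "services",
    {"input-net": "prerouting", "input-local": "output"},
    {"input-net": {"services"}},
))  # False

# The later filter ruleset: reached only from an input-hook base chain.
print(reject_allowed(
    "k8s-filter-do-reject",
    {"filter-input": "input"},
    {"filter-input": {"k8s-filter-services"},
     "k8s-filter-services": {"k8s-filter-do-reject"}},
))  # True
```

In this model the check fails as soon as the jump from the prerouting base
chain is added, which matches the point at which the command was refused.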

Cheers ... Duncan.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH nft] doc: Clarify conditions under which a reject verdict is permissible
  2019-12-03 23:50 ` Duncan Roe
@ 2019-12-04  1:13   ` Duncan Roe
  2019-12-06  2:37   ` [PATCH nft v2] " Duncan Roe
  1 sibling, 0 replies; 34+ messages in thread
From: Duncan Roe @ 2019-12-04  1:13 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, sbezverk

A phrase like "input chain" is a throwback to xtables documentation.
In nft, chains are containers for rules. They do have a type, but what's
important here is which hook each uses.

There may be other instances of this throwback elsewhere in the manual.

Signed-off-by: Duncan Roe <duncan_roe@optusnet.com.au>
---
 doc/statements.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/doc/statements.txt b/doc/statements.txt
index 3b82436..4ff7d05 100644
--- a/doc/statements.txt
+++ b/doc/statements.txt
@@ -171,8 +171,9 @@ ____
 
 A reject statement is used to send back an error packet in response to the
 matched packet otherwise it is equivalent to drop so it is a terminating
-statement, ending rule traversal. This statement is only valid in the input,
-forward and output chains, and user-defined chains which are only called from
+statement, ending rule traversal. This statement is only valid in base chains
+using the input,
+forward or output hooks, and user-defined chains which are only called from
 those chains.
 
 .different ICMP reject variants are meant for use in different table families
-- 
2.14.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: Operation not supported when adding jump command
  2019-12-03 18:43                                           ` Serguei Bezverkhi (sbezverk)
@ 2019-12-04 10:36                                             ` Phil Sutter
  0 siblings, 0 replies; 34+ messages in thread
From: Phil Sutter @ 2019-12-04 10:36 UTC (permalink / raw)
  To: Serguei Bezverkhi (sbezverk)
  Cc: Arturo Borrero Gonzalez, Pablo Neira Ayuso, Florian Westphal,
	netfilter-devel, Laura Garcia

Hi,

On Tue, Dec 03, 2019 at 06:43:19PM +0000, Serguei Bezverkhi (sbezverk) wrote:
> I started working on the nat portion, and here is an iptables rule that is a bit concerning.
> 
> -A KUBE-SERVICES -d 192.168.80.104/32 -p tcp -m comment --comment "default/portal:portal external IP" -m tcp --dport 8989 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-MUPXPVK4XAZHSWAR
> 
> I can address "addrtype" with nftables "fib" and "iif type local", but I am not sure about "physdev"; I would appreciate any suggestions.

I think you can use 'meta iiftype != "bridge"' in this case.

Cheers, Phil

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH nft v2] doc: Clarify conditions under which a reject verdict is permissible
  2019-12-03 23:50 ` Duncan Roe
  2019-12-04  1:13   ` [PATCH nft] doc: Clarify conditions under which a reject verdict is permissible Duncan Roe
@ 2019-12-06  2:37   ` Duncan Roe
  2019-12-06  6:55     ` Florian Westphal
  1 sibling, 1 reply; 34+ messages in thread
From: Duncan Roe @ 2019-12-06  2:37 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, sbezverk

A phrase like "input chain" is a throwback to xtables documentation.
In nft, chains are containers for rules. They do have a type, but what's
important here is which hook each uses.

v2: Show hook names in bold
Signed-off-by: Duncan Roe <duncan_roe@optusnet.com.au>
---
 doc/statements.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/doc/statements.txt b/doc/statements.txt
index 3b82436..ced311c 100644
--- a/doc/statements.txt
+++ b/doc/statements.txt
@@ -171,8 +171,9 @@ ____
 
 A reject statement is used to send back an error packet in response to the
 matched packet otherwise it is equivalent to drop so it is a terminating
-statement, ending rule traversal. This statement is only valid in the input,
-forward and output chains, and user-defined chains which are only called from
+statement, ending rule traversal. This statement is only valid in base chains
+using the *input*,
+*forward* or *output* hooks, and user-defined chains which are only called from
 those chains.
 
 .different ICMP reject variants are meant for use in different table families
-- 
2.14.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH nft v2] doc: Clarify conditions under which a reject verdict is permissible
  2019-12-06  2:37   ` [PATCH nft v2] " Duncan Roe
@ 2019-12-06  6:55     ` Florian Westphal
  0 siblings, 0 replies; 34+ messages in thread
From: Florian Westphal @ 2019-12-06  6:55 UTC (permalink / raw)
  To: Duncan Roe; +Cc: pablo, netfilter-devel, sbezverk

Duncan Roe <duncan_roe@optusnet.com.au> wrote:
> A phrase like "input chain" is a throwback to xtables documentation.
> In nft, chains are containers for rules. They do have a type, but what's
> important here is which hook each uses.

Applied, thanks Duncan.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-12-06  6:55 UTC | newest]

Thread overview: 34+ messages
2019-11-25 18:55 Operation not supported when adding jump command Serguei Bezverkhi (sbezverk)
2019-11-26 12:21 ` Florian Westphal
2019-11-26 14:30   ` Serguei Bezverkhi (sbezverk)
2019-11-26 14:52     ` Florian Westphal
2019-11-26 15:38     ` Pablo Neira Ayuso
2019-11-26 15:47       ` Serguei Bezverkhi (sbezverk)
2019-11-26 15:51         ` Phil Sutter
2019-11-26 18:47           ` Serguei Bezverkhi (sbezverk)
2019-11-26 19:27             ` Phil Sutter
2019-11-26 21:20               ` Serguei Bezverkhi (sbezverk)
2019-11-26 22:15                 ` Phil Sutter
2019-11-27 10:11                 ` Arturo Borrero Gonzalez
2019-11-27 11:57                   ` Phil Sutter
2019-11-27 14:36                   ` Serguei Bezverkhi (sbezverk)
2019-11-27 15:08                     ` Phil Sutter
2019-11-27 15:35                       ` Serguei Bezverkhi (sbezverk)
2019-11-27 16:06                         ` Phil Sutter
2019-11-27 16:50                           ` Serguei Bezverkhi (sbezverk)
2019-11-27 17:22                             ` Phil Sutter
2019-11-28  1:22                               ` Serguei Bezverkhi (sbezverk)
2019-11-28  9:10                                 ` Laura Garcia
2019-11-28 11:58                                   ` Serguei Bezverkhi (sbezverk)
2019-11-28 13:08                                 ` Phil Sutter
2019-11-28 13:34                                   ` Serguei Bezverkhi (sbezverk)
2019-11-28 14:51                                   ` Serguei Bezverkhi (sbezverk)
2019-11-28 15:15                                     ` Phil Sutter
2019-11-29 20:13                                       ` Serguei Bezverkhi (sbezverk)
2019-11-30  0:04                                         ` Phil Sutter
2019-12-03 18:43                                           ` Serguei Bezverkhi (sbezverk)
2019-12-04 10:36                                             ` Phil Sutter
2019-12-03 23:50 ` Duncan Roe
2019-12-04  1:13   ` [PATCH nft] doc: Clarify conditions under which a reject verdict is permissible Duncan Roe
2019-12-06  2:37   ` [PATCH nft v2] " Duncan Roe
2019-12-06  6:55     ` Florian Westphal
