* Inability to IPVS DR with nft dnat since 9971a514ed26
@ 2019-03-27  6:26 Simon Kirby
  2019-03-27  9:30 ` Florian Westphal
  0 siblings, 1 reply; 7+ messages in thread
From: Simon Kirby @ 2019-03-27  6:26 UTC
  To: Florian Westphal; +Cc: netdev, netfilter-devel, lvs-devel

Hello!

We have been successfully using nft dnat and IPVS in DR mode on 4.9, 4.14
kernels, but since upgrading to 4.19, such rules now appear to miss the
IPVS input hook and instead appear to hit localhost (and "tcpdump -ni lo"
shows the packets) instead of being forwarded to a real server.

I bisected this to 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
nf_nat: add nat type hooks to nat core").

It should be pretty easy to see this with a minimal setup:

/etc/nftables.conf:

table ip nat {
	chain prerouting {
		type nat hook prerouting priority 0;

		ip daddr $ext_ip dnat to $vip
	}
	chain postrouting {
		type nat hook postrouting priority 100;

		# In theory this hook is no longer needed since this commit,
		# but we also need to do some unrelated snatting.
	}
}

/etc/sysctl.conf:
	
net.ipv4.conf.all.accept_local = 1
net.ipv4.vs.conntrack = 1

IPVS DR setup:

ipvsadm -A -t $vip:80 -s wrr
ipvsadm -a -t $vip:80 -r $real_ip:80 -g -w 100

On the real server, the vip has to be bound to lo or similar and
net.ipv4.conf.all.arp_announce=2 and net.ipv4.conf.all.arp_ignore=1 as
usual for DR, with a symmetric gateway setup (with accept_local above).
Actually, a real server isn't needed to show the issue here, just another
neighbor to route at.

When it works, the inbound frame (TCP connection to $ext_ip:80) should be
dnatted and then L2-routed (like a static route) to the MAC of $real_ip,
and sent out that interface. Since this commit, it hits lo instead.

Any ideas on what is going wrong here?

Note that we originally ended up using nftables here because it let
us do one more thing: hairpin NAT _with_ IPVS all on the same host with
"type nat hook input priority -99" and applying snat there. The ability
to specify hook priorities made this possible. I haven't yet checked
whether this is still working, though.

Simon-

* Re: Inability to IPVS DR with nft dnat since 9971a514ed26
  2019-03-27  6:26 Inability to IPVS DR with nft dnat since 9971a514ed26 Simon Kirby
@ 2019-03-27  9:30 ` Florian Westphal
  2019-03-27 15:34   ` Simon Kirby
  2021-12-03  8:34   ` Simon Kirby
  0 siblings, 2 replies; 7+ messages in thread
From: Florian Westphal @ 2019-03-27  9:30 UTC
  To: Simon Kirby; +Cc: Florian Westphal, netdev, netfilter-devel, lvs-devel

Simon Kirby <sim@hostway.ca> wrote:
> We have been successfully using nft dnat and IPVS in DR mode on 4.9, 4.14
> kernels, but since upgrading to 4.19, such rules now appear to miss the
> IPVS input hook and instead appear to hit localhost (and "tcpdump -ni lo"
> shows the packets) instead of being forwarded to a real server.
> 
> I bisected this to 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> nf_nat: add nat type hooks to nat core").
> 
> It should be pretty easy to see this with a minimal setup:
> 
> /etc/nftables.conf:
> 
> table ip nat {
>         chain prerouting {
>                 type nat hook prerouting priority 0;
> 
> 		ip daddr $ext_ip dnat to $vip
> 	}
> 	chain postrouting {
> 		type nat hook postrouting priority 100;
> 
> 		# In theory this hook no longer needed since this commit,
> 		# but we also need to do some unrelated snatting.
> 	}
> }
> 
> /etc/sysctl.conf:
> 	
> net.ipv4.conf.all.accept_local = 1
> net.ipv4.vs.conntrack = 1
> 
> IPVS DR setup:
> 
> ipvsadm -A -t $vip:80 -s wrr
> ipvsadm -a -t $vip:80 -r $real_ip:80 -g -w 100

I have a hard time figuring out how to expand $ext_ip, $vip and $real_ip,
and where to place those addresses on the nft machine.

* Re: Inability to IPVS DR with nft dnat since 9971a514ed26
  2019-03-27  9:30 ` Florian Westphal
@ 2019-03-27 15:34   ` Simon Kirby
  2021-12-03  8:34   ` Simon Kirby
  1 sibling, 0 replies; 7+ messages in thread
From: Simon Kirby @ 2019-03-27 15:34 UTC
  To: Florian Westphal; +Cc: netdev, netfilter-devel, lvs-devel

On Wed, Mar 27, 2019 at 10:30:27AM +0100, Florian Westphal wrote:

> > I bisected this to 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> > nf_nat: add nat type hooks to nat core").
> > 
> > It should be pretty easy to see this with a minimal setup:
> > 
> > /etc/nftables.conf:
> > 
> > table ip nat {
> >     chain prerouting {
> > 		type nat hook prerouting priority 0;
> > 
> > 		ip daddr $ext_ip dnat to $vip
> > 	}
> > 	chain postrouting {
> > 		type nat hook postrouting priority 100;
> > 
> > 		# In theory this hook no longer needed since this commit,
> > 		# but we also need to do some unrelated snatting.
> > 	}
> > }
> > 
> > /etc/sysctl.conf:
> > 	
> > net.ipv4.conf.all.accept_local = 1
> > net.ipv4.vs.conntrack = 1
> > 
> > IPVS DR setup:
> > 
> > ipvsadm -A -t $vip:80 -s wrr
> > ipvsadm -a -t $vip:80 -r $real_ip:80 -g -w 100
> 
> I have a hard time figuring out how to expand $ext_ip, $vip and $real_ip,
> and where to place those addresses on the nft machine.

$ext_ip is something reachable from the "outside"; it just has to be
something which can get to the nft box that isn't the real server or the
same host. We have a public IP in this case.

$vip is something that is on the local LAN "behind" the nft box. In our
case this is an rfc1918 IP address.

$real_ip is on the same subnet as the $vip and is just a way for IPVS to
resolve the neighbor of one of the real servers in order to forward the
packet. With this example configuration, IPVS is basically equivalent to:

ip route add $vip via $real_ip

Except that it hooks the input path because $vip is expected to be bound
locally...and normally you have multiple real servers and some algorithm
selected for balancing. So, I guess I didn't mention that, and you also
need to bind $vip to the nft box, and also to the real server if you
want it to actually be able to respond.

"LVS-HOWTO" has info on how to set up LVS-DR. The only difference here is
that we're using it in a relatively new (2009) configuration where "DR"
(Direct Routing) mode is actually symmetric: the real servers reply back
through the nft box instead of directly to a separate router. This lets
NAT actually work since it can see traffic in both directions.

Simon-

* Re: Inability to IPVS DR with nft dnat since 9971a514ed26
  2019-03-27  9:30 ` Florian Westphal
  2019-03-27 15:34   ` Simon Kirby
@ 2021-12-03  8:34   ` Simon Kirby
  2021-12-03  9:40     ` Pablo Neira Ayuso
  2021-12-03 21:48       ` Julian Anastasov
  1 sibling, 2 replies; 7+ messages in thread
From: Simon Kirby @ 2021-12-03  8:34 UTC
  To: Florian Westphal; +Cc: netdev, netfilter-devel, lvs-devel

On Wed, Mar 27, 2019 at 10:30:27AM +0100, Florian Westphal wrote:

> Simon Kirby <sim@hostway.ca> wrote:
> > We have been successfully using nft dnat and IPVS in DR mode on 4.9, 4.14
> > kernels, but since upgrading to 4.19, such rules now appear to miss the
> > IPVS input hook and instead appear to hit localhost (and "tcpdump -ni lo"
> > shows the packets) instead of being forwarded to a real server.
> > 
> > I bisected this to 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> > nf_nat: add nat type hooks to nat core").
> > 
> > It should be pretty easy to see this with a minimal setup:
> > 
> > /etc/nftables.conf:
> > 
> > table ip nat {
> >         chain prerouting {
> >                 type nat hook prerouting priority 0;
> > 
> > 		ip daddr $ext_ip dnat to $vip
> > 	}
> > 	chain postrouting {
> > 		type nat hook postrouting priority 100;
> > 
> > 		# In theory this hook no longer needed since this commit,
> > 		# but we also need to do some unrelated snatting.
> > 	}
> > }
> > 
> > /etc/sysctl.conf:
> > 	
> > net.ipv4.conf.all.accept_local = 1
> > net.ipv4.vs.conntrack = 1
> > 
> > IPVS DR setup:
> > 
> > ipvsadm -A -t $vip:80 -s wrr
> > ipvsadm -a -t $vip:80 -r $real_ip:80 -g -w 100
> 
> I have a hard time figuring out how to expand $ext_ip, $vip and $real_ip,
> and where to place those addresses on the nft machine.

I had some time to set up some test VMs for this, which I can post if
you'd like (several GB), or I can tarball up the configs.

Our setup still doesn't work in 5.15, and we have some LVS servers held
back on 4.14 kernels, the last stable branch on which this works.

LVS expects the VIPs to route to loopback in order to reach the ipvs
hook, and since 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
nf_nat: add nat type hooks to nat core"), the nftrace output changes to
show that the ip_vs_dr_xmit packet is oif "lo" rather than "enp1s0".

With perf probes, I found that the reason the outbound device is changing
is that there is an nft hook that ends up calling ip_route_me_harder().

This function is not called prior to this change, but we can make it be
called even on 4.14 by hooking nat output (with no rules) or route output
with any rule that modifies the packet, such as "mark set 1".

We just didn't happen to hook this previously, so it worked for us, but
after this change, all hooks (including output) are always applied.
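
As a rough sketch of those two 4.14 reproducers, the following dry-run
script only prints nft commands and does not execute them. The table name
"route_test" and the priorities are assumptions chosen for illustration;
the nat table is the one from the original report.

```shell
#!/bin/sh
# Dry-run sketch: prints nft commands instead of executing them.
# "route_test" is a hypothetical table name; priorities are assumed values.
show() { printf '%s\n' "$*"; }

# Variant 1: an empty nat output chain is enough to register the hook.
nat_output_hook() {
    show "nft add chain ip nat output '{ type nat hook output priority -100; }'"
}

# Variant 2: a route-type output chain with a rule that sets the packet mark.
route_output_hook() {
    show "nft add table ip route_test"
    show "nft add chain ip route_test output '{ type route hook output priority -150; }'"
    show "nft add rule ip route_test output meta mark set 1"
}

nat_output_hook
route_output_hook
```

Either variant on its own should be enough to see ip_route_me_harder()
show up in the perf output below.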

# perf probe -a 'ip_route_me_harder%return retval=$retval'
# perf record -g -e probe:ip_route_me_harder__return -aR sleep 4
(send a test connection)
# perf script
swapper     0 [000]  1654.547622: probe:ip_route_me_harder__return: (ffffffff819ac910 <- ffffffffa002b8f6) retval=0x0
        ffffffff810564b0 kretprobe_trampoline+0x0 (vmlinux-4.14.252)
        ffffffffa0084090 nft_nat_ipv4_local_fn+0x10 ([nft_chain_nat_ipv4])
        ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
        ffffffffa004af2b ip_vs_dr_xmit+0x18b ([ip_vs])
        ffffffffa003fb2e ip_vs_in+0x58e ([ip_vs])
        ffffffffa00400d1 ip_vs_local_request4+0x21 ([ip_vs])
        ffffffffa00400e9 ip_vs_remote_request4+0x9 ([ip_vs])
        ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
        ffffffff8195c48b ip_local_deliver+0x7b (vmlinux-4.14.252)
        ffffffff8195c0b8 ip_rcv_finish+0x1f8 (vmlinux-4.14.252)
        ffffffff8195c7b7 ip_rcv+0x2e7 (vmlinux-4.14.252)
        ffffffff818dc113 __netif_receive_skb_core+0x883 (vmlinux-4.14.252)
(pruned a bit)

On 5.15, the trace is similar, but nft_nat_ipv4_local_fn is gone
(nft_nat_do_chain is inlined).

nftrace output through "nft monitor trace" shows the output interface
changing between filter output and nat postrouting:

...
trace id 32904fd3 ip filter output packet: oif "enp1s0" @ll,0,112 0x5254009039555254002ace280800 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
trace id 32904fd3 ip filter output verdict continue
trace id 32904fd3 ip filter output policy accept
trace id 32904fd3 ip nat postrouting packet: iif "enp1s0" oif "lo" ether saddr 52:54:00:2a:ce:28 ether daddr 52:54:00:90:39:55 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
trace id 32904fd3 ip nat postrouting verdict continue
trace id 32904fd3 ip nat postrouting policy accept

On 4.14 without hooking nat output, the oif for nat postrouting remains
unchanged ("enp1s0").

If we avoid the nftables dnat rule and connect directly to the LVS VIP,
it still works on newer kernels, I suspect because nf_nat_ipv4_fn()
doesn't match. If we dnat directly to the DR VIP without LVS, it works
because the dest is not loopback, as expected. It's the combination of
these two that used to work, but now doesn't.

Our specific use case here is that we're doing the dnat from public to
rfc1918 space, and the rfc1918 LVS VIPs support some hairpinning cases.

Any ideas?

Simon-

* Re: Inability to IPVS DR with nft dnat since 9971a514ed26
  2021-12-03  8:34   ` Simon Kirby
@ 2021-12-03  9:40     ` Pablo Neira Ayuso
  2021-12-03 21:48       ` Julian Anastasov
  1 sibling, 0 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2021-12-03  9:40 UTC
  To: Simon Kirby; +Cc: Florian Westphal, netdev, netfilter-devel, lvs-devel

Hi,

On Fri, Dec 03, 2021 at 12:34:52AM -0800, Simon Kirby wrote:
> On Wed, Mar 27, 2019 at 10:30:27AM +0100, Florian Westphal wrote:
> 
> > Simon Kirby <sim@hostway.ca> wrote:
> > > We have been successfully using nft dnat and IPVS in DR mode on 4.9, 4.14
> > > kernels, but since upgrading to 4.19, such rules now appear to miss the
> > > IPVS input hook and instead appear to hit localhost (and "tcpdump -ni lo"
> > > shows the packets) instead of being forwarded to a real server.
> > > 
> > > I bisected this to 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> > > nf_nat: add nat type hooks to nat core").
> > > 
> > > It should be pretty easy to see this with a minimal setup:
> > > 
> > > /etc/nftables.conf:
> > > 
> > > table ip nat {
> > >         chain prerouting {
> > >                 type nat hook prerouting priority 0;

This priority number does not look correct. It should be -100, which
is NF_IP_PRI_NAT_DST. In recent nftables versions you can use:

        ... priority dstnat;
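
As a quick reference, the symbolic priority names that nft accepts for the
ip family map to numbers roughly as in this sketch. The helper name
nft_prio is hypothetical; the values are the standard ones as far as I
know, so please check them against the nft man page for your version.

```shell
#!/bin/sh
# Maps nft's symbolic priority names (ip family) to numeric values.
# nft_prio is an illustrative helper, not part of nftables.
nft_prio() {
    case "$1" in
        raw)      printf '%s\n' -300 ;;
        mangle)   printf '%s\n' -150 ;;
        dstnat)   printf '%s\n' -100 ;;   # NF_IP_PRI_NAT_DST
        filter)   printf '%s\n' 0 ;;
        security) printf '%s\n' 50 ;;
        srcnat)   printf '%s\n' 100 ;;    # NF_IP_PRI_NAT_SRC
        *)        echo "unknown priority name: $1" >&2; return 1 ;;
    esac
}

for name in dstnat filter srcnat; do
    printf '%-8s %s\n' "$name" "$(nft_prio "$name")"
done
```

With these values, "priority 0" in the prerouting chain corresponds to the
filter priority rather than dstnat.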

> > > 		ip daddr $ext_ip dnat to $vip

Why do you need DNAT in this case? In IPVS DR mode, the virtual server
and the load balancer already own the same IP address.

> > > 	}
> > > 	chain postrouting {
> > > 		type nat hook postrouting priority 100;
> > > 
> > > 		# In theory this hook no longer needed since this commit,
> > > 		# but we also need to do some unrelated snatting.
> > > 	}

Your configuration is also missing the input/output nat hooks, which
also need to be registered manually. Otherwise, NAT and locally
generated traffic might break.

In kernel 4.14 and below, all of the NAT hooks in nftables need to
be manually registered in order for NAT to work.

> > > }
> > > 
> > > /etc/sysctl.conf:
> > > 	
> > > net.ipv4.conf.all.accept_local = 1
> > > net.ipv4.vs.conntrack = 1
> > > 
> > > IPVS DR setup:
> > > 
> > > ipvsadm -A -t $vip:80 -s wrr
> > > ipvsadm -a -t $vip:80 -r $real_ip:80 -g -w 100
> > 
> > I have a hard time figuring out how to expand $ext_ip, $vip and $real_ip,
> > and where to place those addresses on the nft machine.
> 
> I had some time to set up some test VMs for this, which I can post if
> you'd like (several GB), or I can tarball up the configs.
> 
> Our setup still doesn't work in 5.15, and we have some LVS servers held
> up on 4.14 kernels that are the last working stable branch.
> 
> LVS expects the VIPs to route to loopback in order to reach the ipvs
> hook, and since 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> nf_nat: add nat type hooks to nat core"), the nftrace output changes to
> show that the ipvs_vs_dr_xmit packet is oif "lo" rather than "enp1s0".
> 
> With perf probes, I found that the reason the outbound device is changing
> is that there is an nft hook that ends up calling ip_route_me_harder().

This is called from the local_out path for NAT. Is the $vip owned by your
load balancer? If so, the route lookup is correct, since it points to the
address that your load balancer owns.

> This function is not called prior to this change, but we can make it be
> called even on 4.14 by hooking nat output (with no rules) or route output
> with anything modifying, such as "mark set 1".
> 
> We just didn't happen to hook this previously, so it worked for us, but
> after this change, all hooks (including output) are always applied.
> 
> # perf probe -a 'ip_route_me_harder%return retval=$retval'
> # perf record -g -e probe:ip_route_me_harder__return -aR sleep 4
> (send a test connection)
> # perf script
> swapper     0 [000]  1654.547622: probe:ip_route_me_harder__return: (ffffffff819ac910 <- ffffffffa002b8f6) retval=0x0
>         ffffffff810564b0 kretprobe_trampoline+0x0 (vmlinux-4.14.252)
>         ffffffffa0084090 nft_nat_ipv4_local_fn+0x10 ([nft_chain_nat_ipv4])
>         ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
>         ffffffffa004af2b ip_vs_dr_xmit+0x18b ([ip_vs])
>         ffffffffa003fb2e ip_vs_in+0x58e ([ip_vs])
>         ffffffffa00400d1 ip_vs_local_request4+0x21 ([ip_vs])
>         ffffffffa00400e9 ip_vs_remote_request4+0x9 ([ip_vs])
>         ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
>         ffffffff8195c48b ip_local_deliver+0x7b (vmlinux-4.14.252)
>         ffffffff8195c0b8 ip_rcv_finish+0x1f8 (vmlinux-4.14.252)
>         ffffffff8195c7b7 ip_rcv+0x2e7 (vmlinux-4.14.252)
>         ffffffff818dc113 __netif_receive_skb_core+0x883 (vmlinux-4.14.252)
> (pruned a bit)
> 
> On 5.15, the trace is similar, but nft_nat_ipv4_local_fn is gone
> (nft_nat_do_chain is inlined).
> 
> nftrace output through "nft monitor trace" shows it changing the packet
> dest between filter output and nat postrouting:
> 
> ...
> trace id 32904fd3 ip filter output packet: oif "enp1s0" @ll,0,112 0x5254009039555254002ace280800 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
> trace id 32904fd3 ip filter output verdict continue
> trace id 32904fd3 ip filter output policy accept
> trace id 32904fd3 ip nat postrouting packet: iif "enp1s0" oif "lo" ether saddr 52:54:00:2a:ce:28 ether daddr 52:54:00:90:39:55 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
> trace id 32904fd3 ip nat postrouting verdict continue
> trace id 32904fd3 ip nat postrouting policy accept
> 
> On 4.14 without hooking nat output, the oif for nat postrouting remains
> unchanged ("enp1s0").
> 
> If we avoid the nftables dnat rule and connect directly to the LVS VIP,
> it still works on newer kernels, I suspect because nf_nat_ipv4_fn()
> doesn't match. If we dnat directly to the DR VIP without LVS, it works
> because the dest is not loopback, as expected. It's the combination of
> these two that used to work, but now doesn't.

Is your load balancer owning the IP address that you use to dnat?

> Our specific use case here is that we're doing the dnat from public to
> rfc1918 space, and the rfc1918 LVS VIPs support some hairpinning cases.
> 
> Any ideas?
> 
> Simon-

* Re: Inability to IPVS DR with nft dnat since 9971a514ed26
  2021-12-03  8:34   ` Simon Kirby
@ 2021-12-03 21:48       ` Julian Anastasov
  2021-12-03 21:48       ` Julian Anastasov
  1 sibling, 0 replies; 7+ messages in thread
From: Julian Anastasov @ 2021-12-03 21:48 UTC
  To: Simon Kirby; +Cc: Florian Westphal, netdev, netfilter-devel, lvs-devel


	Hello,

On Fri, 3 Dec 2021, Simon Kirby wrote:

> I had some time to set up some test VMs for this, which I can post if
> you'd like (several GB), or I can tarball up the configs.
> 
> Our setup still doesn't work in 5.15, and we have some LVS servers held
> up on 4.14 kernels that are the last working stable branch.
> 
> LVS expects the VIPs to route to loopback in order to reach the ipvs
> hook, and since 9971a514ed2697e542f3984a6162eac54bb1da98 ("netfilter:
> nf_nat: add nat type hooks to nat core"), the nftrace output changes to
> show that the ipvs_vs_dr_xmit packet is oif "lo" rather than "enp1s0".
> 
> With perf probes, I found that the reason the outbound device is changing
> is that there is an nft hook that ends up calling ip_route_me_harder().

	Yes, this call is supposed to route locally generated
packets after daddr is translated by Netfilter. But IPVS uses
the LOCAL_OUT hook to post packets to real servers. If you use
the DR method, daddr is not changed (remains the VIP) but the packet's
route points to the real server (different from the VIP). Any rerouting
will assign the wrong route.

	The code that compares tuple.dst.u3.ip with
tuple.src.u3.ip (for !dir) in nf_nat_ipv4_local_fn() is also present
in old kernels, so I'm not sure how you escaped it. The
only possible way is if net.ipv4.vs.conntrack is 0, because
in this case ip_vs_send_or_cont() calls ip_vs_notrack() to set
IP_CT_UNTRACKED and ct becomes NULL (an untracked skb is skipped
by NAT).

> This function is not called prior to this change, but we can make it be
> called even on 4.14 by hooking nat output (with no rules) or route output
> with anything modifying, such as "mark set 1".

	In this case it hits the mangle code (ipt_mangle_out).

> We just didn't happen to hook this previously, so it worked for us, but
> after this change, all hooks (including output) are always applied.
> 
> # perf probe -a 'ip_route_me_harder%return retval=$retval'
> # perf record -g -e probe:ip_route_me_harder__return -aR sleep 4
> (send a test connection)
> # perf script
> swapper     0 [000]  1654.547622: probe:ip_route_me_harder__return: (ffffffff819ac910 <- ffffffffa002b8f6) retval=0x0
>         ffffffff810564b0 kretprobe_trampoline+0x0 (vmlinux-4.14.252)
>         ffffffffa0084090 nft_nat_ipv4_local_fn+0x10 ([nft_chain_nat_ipv4])
>         ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
>         ffffffffa004af2b ip_vs_dr_xmit+0x18b ([ip_vs])
>         ffffffffa003fb2e ip_vs_in+0x58e ([ip_vs])
>         ffffffffa00400d1 ip_vs_local_request4+0x21 ([ip_vs])
>         ffffffffa00400e9 ip_vs_remote_request4+0x9 ([ip_vs])
>         ffffffff8193e793 nf_hook_slow+0x43 (vmlinux-4.14.252)
>         ffffffff8195c48b ip_local_deliver+0x7b (vmlinux-4.14.252)
>         ffffffff8195c0b8 ip_rcv_finish+0x1f8 (vmlinux-4.14.252)
>         ffffffff8195c7b7 ip_rcv+0x2e7 (vmlinux-4.14.252)
>         ffffffff818dc113 __netif_receive_skb_core+0x883 (vmlinux-4.14.252)
> (pruned a bit)
> 
> On 5.15, the trace is similar, but nft_nat_ipv4_local_fn is gone
> (nft_nat_do_chain is inlined).
> 
> nftrace output through "nft monitor trace" shows it changing the packet
> dest between filter output and nat postrouting:
> 
> ...
> trace id 32904fd3 ip filter output packet: oif "enp1s0" @ll,0,112 0x5254009039555254002ace280800 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
> trace id 32904fd3 ip filter output verdict continue
> trace id 32904fd3 ip filter output policy accept
> trace id 32904fd3 ip nat postrouting packet: iif "enp1s0" oif "lo" ether saddr 52:54:00:2a:ce:28 ether daddr 52:54:00:90:39:55 ip saddr 192.168.7.1 ip daddr 10.99.99.10 ip dscp 0x04 ip ecn not-ect ip ttl 63 ip id 5753 ip length 60 tcp sport 58620 tcp dport 80 tcp flags == syn tcp window 64240
> trace id 32904fd3 ip nat postrouting verdict continue
> trace id 32904fd3 ip nat postrouting policy accept
> 
> On 4.14 without hooking nat output, the oif for nat postrouting remains
> unchanged ("enp1s0").

	Is net.ipv4.vs.conntrack set in 4.14 ?

> If we avoid the nftables dnat rule and connect directly to the LVS VIP,
> it still works on newer kernels, I suspect because nf_nat_ipv4_fn()
> doesn't match. If we dnat directly to the DR VIP without LVS, it works

	The problem is that the DNAT rule schedules translation
which is detected by this check:

	if (ct->tuplehash[dir].tuple.dst.u3.ip !=
	    ct->tuplehash[!dir].tuple.src.u3.ip) {
		err = ip_route_me_harder(state->net, state->sk, skb, RTN_UNSPEC);

	But this happens only if ct is not NULL (net.ipv4.vs.conntrack=1).
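
	To illustrate the check, here is a toy model in shell (not kernel
code). The function name and the example addresses are made up; the first
argument stands in for "ct is not NULL".

```shell
#!/bin/sh
# Toy model of the LOCAL_OUT NAT check quoted above; not kernel code.
# $1: 1 if the skb has a conntrack entry (ct != NULL), 0 if untracked
# $2: destination of the original-direction tuple
# $3: source of the reply-direction tuple
local_out_reroutes() {
    has_ct=$1
    orig_dst=$2
    reply_src=$3
    [ "$has_ct" = 1 ] || return 1      # untracked skb is skipped by NAT
    [ "$orig_dst" != "$reply_src" ]    # differs only when DNAT was scheduled
}

describe() {
    if local_out_reroutes "$@"; then echo "reroute"; else echo "keep route"; fi
}

describe 1 203.0.113.10 10.99.99.10   # dnat rule + vs.conntrack=1
describe 1 10.99.99.10  10.99.99.10   # client connects directly to the VIP
describe 0 203.0.113.10 10.99.99.10   # dnat rule, but untracked skb
```

	Only the first case would end up in ip_route_me_harder(), which
matches the combination you describe as broken.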

> because the dest is not loopback, as expected. It's the combination of
> these two that used to work, but now doesn't.
> 
> Our specific use case here is that we're doing the dnat from public to
> rfc1918 space, and the rfc1918 LVS VIPs support some hairpinning cases.
> 
> Any ideas?

	As nf_nat_ipv4_local_fn is just for LOCAL_OUT, an additional
skb->dev check can help to skip the code when the packet comes from
the network (not from the local stack):

	if (ret != NF_ACCEPT || skb->dev)
		return ret;

	But I'm not sure whether such a hack breaks something.
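
	The effect of that guard can be modelled like this (a toy sketch,
not kernel code). It assumes skb->dev is NULL for locally generated
packets and still set for packets that came from the network; the empty
string stands in for NULL and the function name is hypothetical.

```shell
#!/bin/sh
# Toy model of the proposed guard; not kernel code.
# $1: verdict returned by the NAT core (e.g. NF_ACCEPT, NF_DROP)
# $2: skb->dev as a device name, empty string standing in for NULL
reaches_reroute_check() {
    verdict=$1
    skb_dev=$2
    # Mirrors: if (ret != NF_ACCEPT || skb->dev) return ret;
    [ "$verdict" = NF_ACCEPT ] && [ -z "$skb_dev" ]
}

for c in "NF_ACCEPT:" "NF_ACCEPT:enp1s0" "NF_DROP:"; do
    if reaches_reroute_check "${c%%:*}" "${c#*:}"; then
        echo "verdict=${c%%:*} dev='${c#*:}': reroute check runs"
    else
        echo "verdict=${c%%:*} dev='${c#*:}': returns early"
    fi
done
```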

	A second option is to check whether daddr/dport actually changed
in our call to nf_nat_ipv4_fn(), but it is more complex.
It would catch the case where the packet was already DNAT-ed in PRE_ROUTING,
was already routed locally, and is now passed on LOCAL_OUT
by IPVS for a second DNAT+rerouting, which is not wanted by IPVS.
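
	A toy sketch of that second option (again not kernel code; the
names and addresses are invented for illustration) compares the
destination before and after a single NAT call:

```shell
#!/bin/sh
# Toy model of the second option; addresses are hypothetical examples.
# True only if this NAT call itself rewrote daddr or dport.
changed_in_call() {
    before=$1    # "addr:port" on entry to the NAT call
    after=$2     # "addr:port" after the NAT call
    [ "$before" != "$after" ]
}

report() {
    hook=$1
    if changed_in_call "$2" "$3"; then
        echo "$hook: daddr/dport changed in this call"
    else
        echo "$hook: daddr/dport unchanged in this call"
    fi
}

report PRE_ROUTING 203.0.113.10:80 10.99.99.10:80
report LOCAL_OUT   10.99.99.10:80  10.99.99.10:80
```

	On LOCAL_OUT only the "changed" result would justify calling
ip_route_me_harder(), so the packet already translated in PRE_ROUTING
would keep the route chosen by IPVS.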

Regards

--
Julian Anastasov <ja@ssi.bg>

