All of lore.kernel.org
 help / color / mirror / Atom feed
* Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
@ 2021-10-12 13:28 Eugene Crosser
  2021-10-13  9:22 ` Florian Westphal
  2021-10-13 12:28 ` Lahav Schlesinger
  0 siblings, 2 replies; 13+ messages in thread
From: Eugene Crosser @ 2021-10-12 13:28 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, Lahav Schlesinger, David Ahern


[-- Attachment #1.1: Type: text/plain, Size: 4593 bytes --]

Hello all,


Commit mentioned in the subject was intended to avoid creation of stray
conntrack entries when input interface is enslaved in a VRF, and thus
prerouting conntrack hook is called twice: once in the context of the
original input interface, and once in the context of the VRF interface.
Solution was to nuke conntrack related data associated with the skb when
it enters VRF context.



However this breaks netfilter operation. Imagine a use case when
conntrack zone must be assigned based on the (original, "real") input
interface, rather than VRF interface (that can enslave multiple "real"
interfaces, that would become indistinguishable). One could create
netfilter rules similar to these:



        chain rawprerouting {

                type filter hook prerouting priority raw;

                iif realiface1 ct zone set 1 return

                iif realiface2 ct zone set 2 return

        }



This works before the mentioned commit, but not after: zone assignment
is "forgotten", and any subsequent NAT or filtering that is dependent on
the conntrack zone does not work.



There is a reproducer script at the bottom of this message that
demonstrates the difference in behaviour.



Maybe a better solution for stray conntrack entries would be to
introduce finer control in netfilter? One possible idea would be to
implement both "track" and "notrack" targets; then a working
configuration would look like this:



        chain rawprerouting {

                type filter hook prerouting priority raw;

                iif realiface1 ct zone set 1 notrack

                iif realiface2 ct zone set 2 notrack

                iif vrfmaster track

        }



so in the original input interface context, zone is assigned, but
conntrack processing itself is shortcircuited. When the packet enters
VRF context, conntracking is reenabled, so one entry is created, in the
zone assigned at an earlier stage.



This is just an idea, I don't have enough knowledge to judge how
workable is it.



For reference, this is a thread about the issue in netfilter-devel:
https://marc.info/?t=163310182600001&r=1&w=2



Thank you,



Eugene



==========

#!/bin/sh



# This script demonstrates unexpected change of nftables behaviour

# caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack

# connection on VRF rcv"

#
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1

#

# Before the commit, it was possible to assign conntrack zone to a

# packet (or mark it for `notracking`) in the prerouting chanin, raw

# priority, based on the `iif` (interface from which the packet

# arrived).

# After the change, # if the interface is enslaved in a VRF, such

# assignment is lost. Instead, assignment based on the `iif` matching

# the VRF master interface is honored. Thus it is impossible to

# distinguish packets based on the original interface.

#

# This script demonstrates this change of behaviour: conntrack zone 1

# or 2 is assigned depending on the match with the original interface

# or the vrf master interface. It can be observed that conntrack entry

# appears in different zone in the kernel versions before and after

# the commit. Additionaly, the script produces netfilter trace files

# that can be used for debugging the issue.



IPIN=172.30.30.1

IPOUT=172.30.30.2

PFXL=30



ip li sh vein >/dev/null 2>&1 && ip li del vein

ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf

nft list table testct >/dev/null 2>&1 && nft delete table testct



ip li add vein type veth peer veout

ip li add tvrf type vrf table 9876

ip li set veout master tvrf

ip li set vein up

ip li set veout up

ip li set tvrf up

/sbin/sysctl -w net.ipv4.conf.veout.accept_local=1

/sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0

ip addr add $IPIN/$PFXL dev vein

ip addr add $IPOUT/$PFXL dev veout



nft -f - <<__END__

table testct {

	chain rawpre {

		type filter hook prerouting priority raw;

		iif { veout, tvrf } meta nftrace set 1

		iif veout ct zone set 1 return

		iif tvrf ct zone set 2 return

		notrack

	}

	chain rawout {

		type filter hook output priority raw;

		notrack

	}

}

__END__



uname -rv

conntrack -F

stdbuf -o0 nft monitor trace >nftrace.`uname -r`.txt &

monpid=$!

ping -W 1 -c 1 -I vein $IPOUT

conntrack -L

sleep 1

kill -15 $monpid

wait


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-12 13:28 Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour Eugene Crosser
@ 2021-10-13  9:22 ` Florian Westphal
  2021-10-15 21:04   ` Florian Westphal
  2021-10-13 12:28 ` Lahav Schlesinger
  1 sibling, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2021-10-13  9:22 UTC (permalink / raw)
  To: Eugene Crosser; +Cc: netdev, netfilter-devel, Lahav Schlesinger, David Ahern

Eugene Crosser <crosser@average.org> wrote:
> Maybe a better solution for stray conntrack entries would be to
> introduce finer control in netfilter? One possible idea would be to
> implement both "track" and "notrack" targets; then a working
> configuration would look like this:

'track' is hard to implement correctly because of RELATED traffic.

E.g. 'tcp dport 22 track' won't work correctly because icmp pmtu
won't be handled.

I'd suggest to try a conditional nf_ct_reset that keeps the conntrack
entry if its in another zone.

I can't think of another solution at the moment, the existing behaviour
of resetting conntrack entry for postrouting/output is too old,
otherwise the better solution IMO would be to keep that entry around on
egress if a NAT rewrite has been done. This would avoid the 'double snat'
problem that the 'reset on ingress' tries to solve.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-12 13:28 Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour Eugene Crosser
  2021-10-13  9:22 ` Florian Westphal
@ 2021-10-13 12:28 ` Lahav Schlesinger
  2021-10-13 12:58   ` Florian Westphal
  1 sibling, 1 reply; 13+ messages in thread
From: Lahav Schlesinger @ 2021-10-13 12:28 UTC (permalink / raw)
  To: Eugene Crosser; +Cc: netdev, netfilter-devel, David Ahern

On Tue, Oct 12, 2021 at 03:28:39PM +0200, Eugene Crosser wrote:
> CAUTION: External E-Mail - Use caution with links and attachments
>

> Date: Tue, 12 Oct 2021 15:28:39 +0200
> From: Eugene Crosser <crosser@average.org>
> To: netdev@vger.kernel.org
> Cc: netfilter-devel@vger.kernel.org, Lahav Schlesinger
>  <lschlesinger@drivenets.com>, David Ahern <dsahern@kernel.org>
> Subject: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb
>  conntrack connection on VRF rcv" breaks expected netfilter behaviour
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
>  Thunderbird/78.13.0
>
> Hello all,
>
>
> Commit mentioned in the subject was intended to avoid creation of stray
> conntrack entries when input interface is enslaved in a VRF, and thus
> prerouting conntrack hook is called twice: once in the context of the
> original input interface, and once in the context of the VRF interface.
> Solution was to nuke conntrack related data associated with the skb when
> it enters VRF context.
>
>
>
> However this breaks netfilter operation. Imagine a use case when
> conntrack zone must be assigned based on the (original, "real") input
> interface, rather than VRF interface (that can enslave multiple "real"
> interfaces, that would become indistinguishable). One could create
> netfilter rules similar to these:
>
>
>
>         chain rawprerouting {
>
>                 type filter hook prerouting priority raw;
>
>                 iif realiface1 ct zone set 1 return
>
>                 iif realiface2 ct zone set 2 return
>
>         }
>
>
>
> This works before the mentioned commit, but not after: zone assignment
> is "forgotten", and any subsequent NAT or filtering that is dependent on
> the conntrack zone does not work.
>
>
>
> There is a reproducer script at the bottom of this message that
> demonstrates the difference in behaviour.
>
>
>
> Maybe a better solution for stray conntrack entries would be to
> introduce finer control in netfilter? One possible idea would be to
> implement both "track" and "notrack" targets; then a working
> configuration would look like this:
>
>
>
>         chain rawprerouting {
>
>                 type filter hook prerouting priority raw;
>
>                 iif realiface1 ct zone set 1 notrack
>
>                 iif realiface2 ct zone set 2 notrack
>
>                 iif vrfmaster track
>
>         }
>
>
>
> so in the original input interface context, zone is assigned, but
> conntrack processing itself is shortcircuited. When the packet enters
> VRF context, conntracking is reenabled, so one entry is created, in the
> zone assigned at an earlier stage.
>
>
>
> This is just an idea, I don't have enough knowledge to judge how
> workable is it.
>
>
>
> For reference, this is a thread about the issue in netfilter-devel:
> https://marc.info/?t=163310182600001&r=1&w=2
>
>
>
> Thank you,
>
>
>
> Eugene
>
>
>
> ==========
>
> #!/bin/sh
>
>
>
> # This script demonstrates unexpected change of nftables behaviour
>
> # caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack
>
> # connection on VRF rcv"
>
> #
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1
>
> #
>
> # Before the commit, it was possible to assign conntrack zone to a
>
> # packet (or mark it for `notracking`) in the prerouting chanin, raw
>
> # priority, based on the `iif` (interface from which the packet
>
> # arrived).
>
> # After the change, # if the interface is enslaved in a VRF, such
>
> # assignment is lost. Instead, assignment based on the `iif` matching
>
> # the VRF master interface is honored. Thus it is impossible to
>
> # distinguish packets based on the original interface.
>
> #
>
> # This script demonstrates this change of behaviour: conntrack zone 1
>
> # or 2 is assigned depending on the match with the original interface
>
> # or the vrf master interface. It can be observed that conntrack entry
>
> # appears in different zone in the kernel versions before and after
>
> # the commit. Additionaly, the script produces netfilter trace files
>
> # that can be used for debugging the issue.
>
>
>
> IPIN=172.30.30.1
>
> IPOUT=172.30.30.2
>
> PFXL=30
>
>
>
> ip li sh vein >/dev/null 2>&1 && ip li del vein
>
> ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf
>
> nft list table testct >/dev/null 2>&1 && nft delete table testct
>
>
>
> ip li add vein type veth peer veout
>
> ip li add tvrf type vrf table 9876
>
> ip li set veout master tvrf
>
> ip li set vein up
>
> ip li set veout up
>
> ip li set tvrf up
>
> /sbin/sysctl -w net.ipv4.conf.veout.accept_local=1
>
> /sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0
>
> ip addr add $IPIN/$PFXL dev vein
>
> ip addr add $IPOUT/$PFXL dev veout
>
>
>
> nft -f - <<__END__
>
> table testct {
>
> 	chain rawpre {
>
> 		type filter hook prerouting priority raw;
>
> 		iif { veout, tvrf } meta nftrace set 1
>
> 		iif veout ct zone set 1 return
>
> 		iif tvrf ct zone set 2 return
>
> 		notrack
>
> 	}
>
> 	chain rawout {
>
> 		type filter hook output priority raw;
>
> 		notrack
>
> 	}
>
> }
>
> __END__
>
>
>
> uname -rv
>
> conntrack -F
>
> stdbuf -o0 nft monitor trace >nftrace.`uname -r`.txt &
>
> monpid=$!
>
> ping -W 1 -c 1 -I vein $IPOUT
>
> conntrack -L
>
> sleep 1
>
> kill -15 $monpid
>
> wait
>



Hi Eugene, I apologie for the late response (was on vacation).

The call to nf_reset_ct() I added was to match the existing call in the
egress flow, which I didn't want to change in order to not break
existing behaviour (which I unintentionally still did :-)).

Seems like any combination of calling nf_reset_ct() will lead to
something breaking. So continuing on what Florian suggested, another
possibility is to make the calls to nf_reset_ct() in both ingress and egress
flow configurable (procfs or new flags to RTM_NEWLINK).

One benefit of this is that disabling nf_reset_ct() on the egress flow will
mean no port SNAT will take place when SNAT rule is installed on a VRF
(as I described in my original commit), which can break applications
that depend on using a specific source port.

I'll need to think about it some more though, I don't have any better solution
right now.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-13 12:28 ` Lahav Schlesinger
@ 2021-10-13 12:58   ` Florian Westphal
  0 siblings, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2021-10-13 12:58 UTC (permalink / raw)
  To: Lahav Schlesinger; +Cc: Eugene Crosser, netdev, netfilter-devel, David Ahern

Lahav Schlesinger <lschlesinger@drivenets.com> wrote:
> The call to nf_reset_ct() I added was to match the existing call in the
> egress flow, which I didn't want to change in order to not break
> existing behaviour (which I unintentionally still did :-)).
> 
> Seems like any combination of calling nf_reset_ct() will lead to
> something breaking. So continuing on what Florian suggested, another
> possibility is to make the calls to nf_reset_ct() in both ingress and egress
> flow configurable (procfs or new flags to RTM_NEWLINK).
> 
> One benefit of this is that disabling nf_reset_ct() on the egress flow will
> mean no port SNAT will take place when SNAT rule is installed on a VRF
> (as I described in my original commit), which can break applications
> that depend on using a specific source port.

Looking at the original change, eb63ecc1706b3e094d0f57438b6c2067cfc299f2
"net: vrf: Drop conntrack data after pass through VRF device on Tx",
I wonder if thats not the real cause of the problem.

=========================
Locally originated traffic in a VRF fails in the presence of a POSTROUTING
rule. For example,

$ iptables -t nat -A POSTROUTING -s 11.1.1.0/24  -j
MASQUERADE
$ ping -I red -c1 11.1.1.3
ping: Warning: source address might
be selected on device other than red.
PING 11.1.1.3 (11.1.1.3)
from 11.1.1.2 red: 56(84) bytes of data.
ping: sendmsg: Operation not permitted
=========================

I think we first need selftest scripts that re-creates the three scenarios
the one reported by Eugene, the one outlined above and the double-PAT one Lahav
fixed before any code changes are tested.

Its tempting to just change the nf_ct_reset() done on egress to be
conditional on the ct->status snat bit & drop support for double-snat.

Given Lahavs patch, double-snat probably never worked to begin with?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-13  9:22 ` Florian Westphal
@ 2021-10-15 21:04   ` Florian Westphal
  2021-10-16 18:51     ` David Ahern
  2021-10-19 21:41     ` Jakub Kicinski
  0 siblings, 2 replies; 13+ messages in thread
From: Florian Westphal @ 2021-10-15 21:04 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eugene Crosser, netdev, netfilter-devel, Lahav Schlesinger, David Ahern

Florian Westphal <fw@strlen.de> wrote:
> Eugene Crosser <crosser@average.org> wrote:
> > Maybe a better solution for stray conntrack entries would be to
> > introduce finer control in netfilter? One possible idea would be to
> > implement both "track" and "notrack" targets; then a working
> > configuration would look like this:
> 
> 'track' is hard to implement correctly because of RELATED traffic.
> 
> E.g. 'tcp dport 22 track' won't work correctly because icmp pmtu
> won't be handled.
> 
> I'd suggest to try a conditional nf_ct_reset that keeps the conntrack
> entry if its in another zone.
> 
> I can't think of another solution at the moment, the existing behaviour
> of resetting conntrack entry for postrouting/output is too old,
> otherwise the better solution IMO would be to keep that entry around on
> egress if a NAT rewrite has been done. This would avoid the 'double snat'
> problem that the 'reset on ingress' tries to solve.

I'm working on this.

Eugene, I think it makes sense if you send a formal revert, a proper
fix for snat+vrf needs more work.

I think this is fixable but it will likely be not acceptable for net
tree.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-15 21:04   ` Florian Westphal
@ 2021-10-16 18:51     ` David Ahern
  2021-10-18 14:34       ` Florian Westphal
  2021-10-19 21:41     ` Jakub Kicinski
  1 sibling, 1 reply; 13+ messages in thread
From: David Ahern @ 2021-10-16 18:51 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eugene Crosser, netdev, netfilter-devel, Lahav Schlesinger, David Ahern

On 10/15/21 3:04 PM, Florian Westphal wrote:
> I'm working on this.
> 
> Eugene, I think it makes sense if you send a formal revert, a proper
> fix for snat+vrf needs more work.
> 
> I think this is fixable but it will likely be not acceptable for net
> tree.

Thanks for jumping in, Florian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-16 18:51     ` David Ahern
@ 2021-10-18 14:34       ` Florian Westphal
  2021-10-18 18:14         ` David Ahern
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2021-10-18 14:34 UTC (permalink / raw)
  To: David Ahern
  Cc: Florian Westphal, Eugene Crosser, netdev, netfilter-devel,
	Lahav Schlesinger, David Ahern

David Ahern <dsahern@gmail.com> wrote:

Hi David

TL;DR:
What is your take on the correct/expected behaviour with vrf+conntrack+nat?

> Thanks for jumping in, Florian.

NP.

Sorry for the wall of text below.

You can fast-forward to 'Possible solutions' if you want, but they key
question for me at the moment is the one above.

I've just submitted a selftest patch to nf.git that adds test
cases for the problem reported by Eugene + two masquerade/snat test
cases.

With net/net-next, first test fails and the other two work, after
revert its vice versa.

To get all three working there are a couple of possible solutions to
this but so far I don't have anything that is void of side effects.

It assumes revert of the problematic commit, i.e. no nf_reset_ct in
ingress path from VRF driver.

First, a summary of VRF+nf+conntrack interaction and where problems are.

The VRF driver invokes netfilter for output+postrouting hooks,
with outdev set to the vrf device. Afterwards, ip stack calls those
hooks again, with outdev set to lower device.

This is a problem when conntrack is used with IP masquerading.
With all nf_reset_ct() in vrf driver removed following will happen:

1. Conntrack only, no nat, locally generated traffic.

vrf driver calls output/postrouting hooks.
output creates new conntrack object and attaches it to skb.
postrouting confirms entry and places it into state table.

When hooks are called a second time by IP stack, no new conntrack lookup is
done because skb already has one attached to it.

2. When SNAT is used, this works as well, second iteration doesn't
   do connection tracking and re-uses nat settings done in iteration 1.

However there are caveats:
a) NAT rules that use something like '-o eth0' won't have any effect.
b) IP address matching in round 2 is 'broken', as the second round deals
with a rewritten skb (iph saddr already updated).

This is because when the hooks are invoked a second time, there already
is a NAT binding attached to the entry. This happens regardless of a
matching '-o vrfdevice' nat rule or not; when first round did not match
a nat rule, nat engine attaches a 'nat null binding'.

3) When Masquerade is used, things don't work at all.

This is because of nf_nat_oif_changed() triggering errnously in the nat
engine.  When masquerade is hit, the output interface index gets stored
in the conntrack entry.  When the interface index changes, its assumed
that a routing change took place and the connection has been broken
(masquerade picks saddr based on the output interface).

Therefore, NF_DROP gets returned.

In VRF case, this triggers because we see the skb twice, once with
ifindex == vrf and once with lower/physdev.

I suspect that this is what lead eb63ecc1706b3e094d0f57438b6c2067cfc299f2
(net: vrf: Drop conntrack data after pass through VRF device on Tx),
this added nf_reset() calls to the tx path.

This changes things as follows:

1. Conntrack only, no nat:
same as before, expect that the second round does a new conntrack lookup.
It will find the entry created by first iteration and use that.

2. With nat:
If first round has no matching nat rule, things work:
Second round will find existing entry and use it.
NAT rules on second iteration have no effect, just like before.

If first round had a matching nat rule, then the packet gets rewritten.
This means that the second round will NOT find an existing entry, and
conntrack tracks the flow a second time, this time with the post-nat
source address.

Because of this, conntrack will also detect a tuple collision when it
tries to insert the 'new flow incarnation', because the reply direction
of the 'first round flow' collides with the original direction of the
second iteration. This forces allocation of a new source port, so source
port translation is done.

This in turn breaks in reverse direction, because incoming/reply packet
only hits the connection tracking engine once, i.e. only the source
port translation is reversed.

To fix this, Lahav also added nf_reset_ct() to ingress processing to
undo the first round nat transformation as well.

... and that in turn breaks 'notrack', 'mark' or 'ct zone' assignments
done based on the incoming/lower device -- the nf_reset_ct zaps those
settings before round 2 so they have no effect anymore.

Possible solutions
==================

Taking a few steps back and ignoring 'backwards compat' for now, I think
that conntrack should process the flow only once.  VRF doesn't transform the
packets in any way and the only reason for the extra NF_HOOK() calls appear to
be so that users can match on the incoming/outgoing vrf interface.

a)
For locally generated packets, the most simple fix would be to mark
skb->nfct as 'untracked', and clear it again instead of nf_reset_ct().

This avoids the need to change anyting on conntrack/nat side.
The downside is that it becomes impossible to add nat mappings based
on '-o vrf', because conntrack gets bypassed in round 1.

For forwarding case (where OUTPUT hooks are not hit and
ingress path has attached skb->nfct), we would need to find a different
way to bypass conntrack.  One solution (least-LOC) is to nf_reset() and
then mark skb as untracked.  IP(6)CB should have FORWARD flag set that
can be used to detect this condition.

b)
make the nf_ct_reset calls in vrf tx path conditional.
Its possible to detect when a NAT rule was hit via ct->status bits.

When the NF_HOOK() calls invoked via VRF found a SNAT/MASQ rule,
don't reset the conntrack entry.

Downside 1: the second invocation is done with 'incorrect' ip source
address, OTOH that already happens at this time.

Downside 2: We need to alter conntrack/nat to avoid the 'oif has
changed' logic from kicking in.

I don't see other solutions at the moment.

For INPUT, users can also match the lower device via inet_sdif()
(original ifindex stored in IP(6)CB), but we don't have that
for output, and its not easy to add something because IPCB isn't
preserved between rounds 1 & 2.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-18 14:34       ` Florian Westphal
@ 2021-10-18 18:14         ` David Ahern
  2021-10-19 11:49           ` Florian Westphal
  0 siblings, 1 reply; 13+ messages in thread
From: David Ahern @ 2021-10-18 18:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eugene Crosser, netdev, netfilter-devel, Lahav Schlesinger, David Ahern

On 10/18/21 8:34 AM, Florian Westphal wrote:
> David Ahern <dsahern@gmail.com> wrote:
> 
> Hi David
> 
> TL;DR:
> What is your take on the correct/expected behaviour with vrf+conntrack+nat?
> 
>> Thanks for jumping in, Florian.
> 
> NP.
> 
> Sorry for the wall of text below.
> 
> You can fast-forward to 'Possible solutions' if you want, but they key
> question for me at the moment is the one above.
> 
> I've just submitted a selftest patch to nf.git that adds test
> cases for the problem reported by Eugene + two masquerade/snat test
> cases.
> 
> With net/net-next, first test fails and the other two work, after
> revert its vice versa.
> 
> To get all three working there are a couple of possible solutions to
> this but so far I don't have anything that is void of side effects.
> 
> It assumes revert of the problematic commit, i.e. no nf_reset_ct in
> ingress path from VRF driver.
> 
> First, a summary of VRF+nf+conntrack interaction and where problems are.
> 
> The VRF driver invokes netfilter for output+postrouting hooks,
> with outdev set to the vrf device. Afterwards, ip stack calls those
> hooks again, with outdev set to lower device.
> 
> This is a problem when conntrack is used with IP masquerading.
> With all nf_reset_ct() in vrf driver removed following will happen:
> 
> 1. Conntrack only, no nat, locally generated traffic.
> 
> vrf driver calls output/postrouting hooks.
> output creates new conntrack object and attaches it to skb.
> postrouting confirms entry and places it into state table.
> 
> When hooks are called a second time by IP stack, no new conntrack lookup is
> done because skb already has one attached to it.
> 
> 2. When SNAT is used, this works as well, second iteration doesn't
>    do connection tracking and re-uses nat settings done in iteration 1.
> 
> However there are caveats:
> a) NAT rules that use something like '-o eth0' won't have any effect.
> b) IP address matching in round 2 is 'broken', as the second round deals
> with a rewritten skb (iph saddr already updated).
> 
> This is because when the hooks are invoked a second time, there already
> is a NAT binding attached to the entry. This happens regardless of a
> matching '-o vrfdevice' nat rule or not; when first round did not match
> a nat rule, nat engine attaches a 'nat null binding'.
> 
> 3) When Masquerade is used, things don't work at all.
> 
> This is because of nf_nat_oif_changed() triggering errnously in the nat
> engine.  When masquerade is hit, the output interface index gets stored
> in the conntrack entry.  When the interface index changes, its assumed
> that a routing change took place and the connection has been broken
> (masquerade picks saddr based on the output interface).
> 
> Therefore, NF_DROP gets returned.
> 
> In VRF case, this triggers because we see the skb twice, once with
> ifindex == vrf and once with lower/physdev.
> 
> I suspect that this is what lead eb63ecc1706b3e094d0f57438b6c2067cfc299f2
> (net: vrf: Drop conntrack data after pass through VRF device on Tx),
> this added nf_reset() calls to the tx path.
> 
> This changes things as follows:
> 
> 1. Conntrack only, no nat:
> same as before, expect that the second round does a new conntrack lookup.
> It will find the entry created by first iteration and use that.
> 
> 2. With nat:
> If first round has no matching nat rule, things work:
> Second round will find existing entry and use it.
> NAT rules on second iteration have no effect, just like before.
> 
> If first round had a matching nat rule, then the packet gets rewritten.
> This means that the second round will NOT find an existing entry, and
> conntrack tracks the flow a second time, this time with the post-nat
> source address.
> 
> Because of this, conntrack will also detect a tuple collision when it
> tries to insert the 'new flow incarnation', because the reply direction
> of the 'first round flow' collides with the original direction of the
> second iteration. This forces allocation of a new source port, so source
> port translation is done.
> 
> This in turn breaks in reverse direction, because incoming/reply packet
> only hits the connection tracking engine once, i.e. only the source
> port translation is reversed.
> 
> To fix this, Lahav also added nf_reset_ct() to ingress processing to
> undo the first round nat transformation as well.
> 
> ... and that in turn breaks 'notrack', 'mark' or 'ct zone' assignments
> done based on the incoming/lower device -- the nf_reset_ct zaps those
> settings before round 2 so they have no effect anymore.
> 
> Possible solutions
> ==================
> 
> Taking a few steps back and ignoring 'backwards compat' for now, I think
> that conntrack should process the flow only once.  VRF doesn't transform the
> packets in any way and the only reason for the extra NF_HOOK() calls appear to
> be so that users can match on the incoming/outgoing vrf interface.
> 
> a)
> For locally generated packets, the most simple fix would be to mark
> skb->nfct as 'untracked', and clear it again instead of nf_reset_ct().
> 
> This avoids the need to change anyting on conntrack/nat side.
> The downside is that it becomes impossible to add nat mappings based
> on '-o vrf', because conntrack gets bypassed in round 1.
> 
> For forwarding case (where OUTPUT hooks are not hit and
> ingress path has attached skb->nfct), we would need to find a different
> way to bypass conntrack.  One solution (least-LOC) is to nf_reset() and
> then mark skb as untracked.  IP(6)CB should have FORWARD flag set that
> can be used to detect this condition.
> 
> b)
> make the nf_ct_reset calls in vrf tx path conditional.
> Its possible to detect when a NAT rule was hit via ct->status bits.
> 
> When the NF_HOOK() calls invoked via VRF found a SNAT/MASQ rule,
> don't reset the conntrack entry.
> 
> Downside 1: the second invocation is done with 'incorrect' ip source
> address, OTOH that already happens at this time.
> 
> Downside 2: We need to alter conntrack/nat to avoid the 'oif has
> changed' logic from kicking in.
> 
> I don't see other solutions at the moment.
> 
> For INPUT, users can also match the lower device via inet_sdif()
> (original ifindex stored in IP(6)CB), but we don't have that
> for output, and its not easy to add something because IPCB isn't
> preserved between rounds 1 & 2.
> 

Thanks for the detailed summary and possible solutions.

NAT/MASQ rules with VRF were not really thought about during
development; it was not a use case (or use cases) Cumulus or other NOS
vendors cared about. Community users were popping up fairly early and
patches would get sent, but no real thought about how to handle both
sets of rules - VRF device and port devices.

What about adding an attribute on the VRF device to declare which side
to take -- rules against the port device or rules against the VRF device
and control the nf resets based on it?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-18 18:14         ` David Ahern
@ 2021-10-19 11:49           ` Florian Westphal
  2021-10-19 13:21             ` Eugene Crosser
  2021-10-19 14:34             ` David Ahern
  0 siblings, 2 replies; 13+ messages in thread
From: Florian Westphal @ 2021-10-19 11:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Florian Westphal, Eugene Crosser, netdev, netfilter-devel,
	Lahav Schlesinger, David Ahern

David Ahern <dsahern@gmail.com> wrote:
> Thanks for the detailed summary and possible solutions.
> 
> NAT/MASQ rules with VRF were not really thought about during
> development; it was not a use case (or use cases) Cumulus or other NOS
> vendors cared about. Community users were popping up fairly early and
> patches would get sent, but no real thought about how to handle both
> sets of rules - VRF device and port devices.
> 
> What about adding an attribute on the VRF device to declare which side
> to take -- rules against the port device or rules against the VRF device
> and control the nf resets based on it?

This would need a way to suppress the NF_HOOK invocation from the
normal IP path.  Any idea on how to do that?  AFAICS there is no way to
get to the vrf device at that point, so no way to detect the toggle.

Or did you mean to only suppress the 2nd conntrack round?

For packets that get forwarded we'd always need to run those in the vrf
context, afaics, because doing an nf_reset() may create a new conntrack
entry (if flow has DNAT, then incoming address has been reversed
already, so it won't match existing REPLY entry in the conntrack table anymore).

For locally generated packets, we could skip conntrack for VRF context
via 'skb->_nfct = UNTRACKED' + nf_reset_ct before xmit to lower device,
and for lower device by eliding the reset entirely.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-19 11:49           ` Florian Westphal
@ 2021-10-19 13:21             ` Eugene Crosser
  2021-10-19 14:34             ` David Ahern
  1 sibling, 0 replies; 13+ messages in thread
From: Eugene Crosser @ 2021-10-19 13:21 UTC (permalink / raw)
  To: Florian Westphal, David Ahern
  Cc: netdev, netfilter-devel, Lahav Schlesinger, David Ahern


[-- Attachment #1.1: Type: text/plain, Size: 4039 bytes --]

On 19/10/2021 13:49, Florian Westphal wrote:
> David Ahern <dsahern@gmail.com> wrote:
>> Thanks for the detailed summary and possible solutions.
>>
>> NAT/MASQ rules with VRF were not really thought about during
>> development; it was not a use case (or use cases) Cumulus or other NOS
>> vendors cared about. Community users were popping up fairly early and
>> patches would get sent, but no real thought about how to handle both
>> sets of rules - VRF device and port devices.
>>
>> What about adding an attribute on the VRF device to declare which side
>> to take -- rules against the port device or rules against the VRF device
>> and control the nf resets based on it?
> 
> This would need a way to suppress the NF_HOOK invocation from the
> normal IP path.  Any idea on how to do that?  AFAICS there is no way to
> get to the vrf device at that point, so no way to detect the toggle.
> 
> Or did you mean to only suppress the 2nd conntrack round?
> 
> For packets that get forwarded we'd always need to run those in the vrf
> context, afaics, because doing an nf_reset() may create a new conntrack
> entry (if flow has DNAT, then incoming address has been reversed
> already, so it won't match existing REPLY entry in the conntrack table anymore).
> 
> For locally generated packets, we could skip conntrack for VRF context
> via 'skb->_nfct = UNTRACKED' + nf_reset_ct before xmit to lower device,
> and for lower device by eliding the reset entirely.

I think that I have SNAT (at least) working fine with VRFs, without the
commit. What I do is I set notrack at vrf prerouting callback. Could it
be the "proper" way to go? I don't know if I am breaking anything else
though. Here is my reproducer script. SNAT works on kernels without the
"reset conntrack" commit.

(Sorry my Thunderbird inserts extra newlines :( )

========
#!/bin/sudo /bin/bash



for i in 1 2; do

	ip li sh src$i >/dev/null 2>&1 && ip li set src$i nomaster \
		&& ip li del src$i

	ip li sh sink$i >/dev/null 2>&1 && ip li del sink$i

	ip li sh vrf$i >/dev/null 2>&1 && ip li del vrf$i

	ip r flush table 10$i

	ip netns list | grep -q ns$i && ip netns del ns$i

done

nft list table testnat >/dev/null 2>&1 && nft delete table testnat



case $1 in

        clean)  echo "cleaned up"; exit 0;;

esac



sysctl -w net.ipv4.ip_forward=1



for i in 1 2; do

	ip netns add ns$i

	ip netns exec ns$i ip li set lo up



	ip li add vrf$i type vrf table 10$i

	ip r add vrf vrf$i unreachable default metric 4278198272

	ip li add src$i type veth peer wayout netns ns$i

	ip li set src$i master vrf$i

	ip a add 172.31.$i.1/32 dev src$i

	ip li set src$i up

	ip li set vrf$i up

	#/sbin/sysctl -w net.ipv4.conf.src$i.accept_local=1

	ip netns exec ns$i ip a add 172.31.$i.2/24 dev wayout

	ip netns exec ns$i ip li set wayout up

	ip netns exec ns$i ip r add default via 172.31.$i.1



	ip li add sink$i type veth peer wayin netns ns$i

	ip netns exec ns$i ip li set wayin up

	ip netns exec ns$i /sbin/sysctl -w net.ipv4.conf.wayin.rp_filter=0

	ip li set sink$i up

done



ip r add 172.31.1.0/24 dev sink1 table 102

ip r add 172.31.2.0/24 dev sink2 table 101



ip r add 100.64.0.0/24 dev sink1 table 102



nft -f - <<__END__

table testnat {

        chain rawpre {

                type filter hook prerouting priority raw;

                #iif { src1, src2 } meta nftrace set 1

                iif { src1, src2 } ct zone set 1 return

                notrack

        }

        chain rawout {

                type filter hook output priority raw;

                notrack

        }

	chain natpost {

		type nat hook postrouting priority srcnat;

		oif sink2 snat ip to 100.64.0.2

	}

}

__END__



conntrack -F

ip netns exec ns2 tcpdump -lni wayin arp or icmp &

tdpid=$!

sleep 1

ip netns exec ns1 ping -W 1 -c 1 172.31.2.2

conntrack -L

sleep 1

kill $tdpid

wait

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-19 11:49           ` Florian Westphal
  2021-10-19 13:21             ` Eugene Crosser
@ 2021-10-19 14:34             ` David Ahern
  2021-10-19 14:46               ` Florian Westphal
  1 sibling, 1 reply; 13+ messages in thread
From: David Ahern @ 2021-10-19 14:34 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eugene Crosser, netdev, netfilter-devel, Lahav Schlesinger, David Ahern

On 10/19/21 5:49 AM, Florian Westphal wrote:
> David Ahern <dsahern@gmail.com> wrote:
>> Thanks for the detailed summary and possible solutions.
>>
>> NAT/MASQ rules with VRF were not really thought about during
>> development; it was not a use case (or use cases) Cumulus or other NOS
>> vendors cared about. Community users were popping up fairly early and
>> patches would get sent, but no real thought about how to handle both
>> sets of rules - VRF device and port devices.
>>
>> What about adding an attribute on the VRF device to declare which side
>> to take -- rules against the port device or rules against the VRF device
>> and control the nf resets based on it?
> 
> This would need a way to suppress the NF_HOOK invocation from the
> normal IP path.  Any idea on how to do that?  AFAICS there is no way to
> get to the vrf device at that point, so no way to detect the toggle.
> 
> Or did you mean to only suppress the 2nd conntrack round?

My thought was that the newly inserted nf_reset_ct fixed one use case
and breaks another, so the new attribute would control that call.

> 
> For packets that get forwarded we'd always need to run those in the vrf
> context, afaics, because doing an nf_reset() may create a new conntrack
> entry (if flow has DNAT, then incoming address has been reversed
> already, so it won't match existing REPLY entry in the conntrack table anymore).
> 
> For locally generated packets, we could skip conntrack for VRF context
> via 'skb->_nfct = UNTRACKED' + nf_reset_ct before xmit to lower device,
> and for lower device by eliding the reset entirely.
> 

ok.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-19 14:34             ` David Ahern
@ 2021-10-19 14:46               ` Florian Westphal
  0 siblings, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2021-10-19 14:46 UTC (permalink / raw)
  To: David Ahern
  Cc: Florian Westphal, Eugene Crosser, netdev, netfilter-devel,
	Lahav Schlesinger, David Ahern

David Ahern <dsahern@gmail.com> wrote:
> On 10/19/21 5:49 AM, Florian Westphal wrote:
> > David Ahern <dsahern@gmail.com> wrote:
> >> Thanks for the detailed summary and possible solutions.
> >>
> >> NAT/MASQ rules with VRF were not really thought about during
> >> development; it was not a use case (or use cases) Cumulus or other NOS
> >> vendors cared about. Community users were popping up fairly early and
> >> patches would get sent, but no real thought about how to handle both
> >> sets of rules - VRF device and port devices.
> >>
> >> What about adding an attribute on the VRF device to declare which side
> >> to take -- rules against the port device or rules against the VRF device
> >> and control the nf resets based on it?
> > 
> > This would need a way to suppress the NF_HOOK invocation from the
> > normal IP path.  Any idea on how to do that?  AFAICS there is no way to
> > get to the vrf device at that point, so no way to detect the toggle.
> > 
> > Or did you mean to only suppress the 2nd conntrack round?
> 
> My thought was that the newly inserted nf_reset_ct fixed one use case
> and breaks another, so the new attribute would control that call.

Right, but the 'new nf_reset_ct' are there to undo the 2nd nat
transformation done on round 2.

So, no round 2, no second nat transformation & no need for the new
nf_ct_reset().

I dislike the idea of treating locally originating flows different
from forwarded ones.

Treating them the same causes asymmetry of ingress&egress, i.e.
ingress means 'traverse conntrack for lower device' whereas egress means
'traverse conntrack via vrf device'.

I could hack the nat core & the conntrack commit hook to skip
functionality if the outdev is a vrf device -- that should in theory
result in consistent semantics, i.e. conntrack only runs in lower device
context.

I'll give that a shot unless someone has a better idea.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour
  2021-10-15 21:04   ` Florian Westphal
  2021-10-16 18:51     ` David Ahern
@ 2021-10-19 21:41     ` Jakub Kicinski
  1 sibling, 0 replies; 13+ messages in thread
From: Jakub Kicinski @ 2021-10-19 21:41 UTC (permalink / raw)
  To: Florian Westphal, David Ahern
  Cc: Eugene Crosser, netdev, netfilter-devel, Lahav Schlesinger

On Fri, 15 Oct 2021 23:04:48 +0200 Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> > 'track' is hard to implement correctly because of RELATED traffic.
> > 
> > E.g. 'tcp dport 22 track' won't work correctly because icmp pmtu
> > won't be handled.
> > 
> > I'd suggest to try a conditional nf_ct_reset that keeps the conntrack
> > entry if its in another zone.
> > 
> > I can't think of another solution at the moment, the existing behaviour
> > of resetting conntrack entry for postrouting/output is too old,
> > otherwise the better solution IMO would be to keep that entry around on
> > egress if a NAT rewrite has been done. This would avoid the 'double snat'
> > problem that the 'reset on ingress' tries to solve.  
> 
> I'm working on this.
> 
> Eugene, I think it makes sense if you send a formal revert, a proper
> fix for snat+vrf needs more work.

If this is still the plan can we get some acks on the revert please?

https://patchwork.kernel.org/project/netdevbpf/patch/20211018182250.23093-2-crosser@average.org/

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-10-19 21:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-12 13:28 Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour Eugene Crosser
2021-10-13  9:22 ` Florian Westphal
2021-10-15 21:04   ` Florian Westphal
2021-10-16 18:51     ` David Ahern
2021-10-18 14:34       ` Florian Westphal
2021-10-18 18:14         ` David Ahern
2021-10-19 11:49           ` Florian Westphal
2021-10-19 13:21             ` Eugene Crosser
2021-10-19 14:34             ` David Ahern
2021-10-19 14:46               ` Florian Westphal
2021-10-19 21:41     ` Jakub Kicinski
2021-10-13 12:28 ` Lahav Schlesinger
2021-10-13 12:58   ` Florian Westphal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.