* ECMP routing: problematic selection of outgoing interface
@ 2018-05-16  1:51 Hirotaka Yamamoto
From: Hirotaka Yamamoto @ 2018-05-16  1:51 UTC
  To: netdev

Hi,

Recently I have built a highly-available network using an ECMP
route connected to two isolated L2 switches as follows.

Router-- ToR switch 1  ---- Linux
     |   192.168.11.1/24     |  eth0: 192.168.11.2/24
     |                       |  eth1: 192.168.12.2/24
     +-- ToR switch 2  ------+
         192.168.12.1/24

The (default) route has been configured with:

    $ sudo ip route add default \
           nexthop via 192.168.11.1 \
           nexthop via 192.168.12.1

Then I found that Linux chooses the wrong outgoing device for some
destination/source address pairs, like this:

    $ ip route get 12.34.56.78 from 192.168.12.2
    12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0
                                                 # dev should be "eth1"

As a consequence, programs like SSH or curl do not work for such
destinations, because the routers drop packets whose source address
does not belong to the link they arrive on.

Unbound sockets also suffer from this problem.  My guess is that
Linux chooses a source address first, and then picks a wrong outgoing
device.
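
A toy model may help picture why only some address pairs are affected.
It assumes (this is not stated anywhere in this thread) that the nexthop
of a multipath route is picked by hashing the source/destination pair,
without regard to which interface owns the source address.  The function
name pick_nexthop and the octet-sum "hash" are made up for illustration;
the kernel's real hash is different.

```shell
#!/bin/sh
# Toy model only: pick one of two nexthops from the parity of the sum of
# all address octets. This is NOT the kernel's algorithm.

# pick_nexthop SRC DST: prints the chosen gateway and device.
pick_nexthop() {
    sum=0
    for octet in $(echo "$1.$2" | tr . ' '); do
        sum=$((sum + octet))
    done
    if [ $((sum % 2)) -eq 0 ]; then
        echo "via 192.168.11.1 dev eth0"
    else
        echo "via 192.168.12.1 dev eth1"
    fi
}

# The choice depends only on the address pair, so a source address that
# lives on eth1 can still be sent out of eth0 (and vice versa).
pick_nexthop 192.168.12.2 12.34.56.78
pick_nexthop 192.168.11.2 12.34.56.78
```

In this model both source addresses end up on the "other" device for
12.34.56.78, which is the kind of mismatch reported above.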

Although I believe this is a bug in Linux, I found a possibly relevant
comment in function ip_route_output_key_hash_rcu at net/ipv4/route.c:

    /* I removed check for oif == dev_out->oif here.
       It was wrong for two reasons:
       1. ip_dev_find(net, saddr) can return wrong iface, if saddr
          is assigned to multiple interfaces.
       2. Moreover, we are allowed to send packets with saddr
          of another iface. --ANK
     */

Given point 2 of that comment, I wonder whether this behavior might be
intended.

So, my questions are:

1. Is this intended or not?
2. If this is intended, how can I make programs work in this ECMP network?


I have created a simple script to reproduce the problem (attached below).
The script creates a dedicated network namespace "testns" and configures
an ECMP route in it.

So far, I can reproduce the problem with these Linux versions:
    - 4.17-rc5          (Upstream)
    - 4.15.0-20-generic (Ubuntu 18.04)
    - 4.14.32-coreos    (CoreOS)
    - 4.13.0-37-generic (Ubuntu 16.04 HWE)
    - 4.4.0-116-generic (Ubuntu 16.04)

Note that the problem is not limited to the default route.
Any route configured as ECMP can cause the problem.

- ymmt

#!/bin/sh -e

NS=testns

BR1=testbr1
VETH1=testveth1
BR2=testbr2
VETH2=testveth2
LINKS="$VETH1 $VETH2 $BR1 $BR2"

NET1=192.168.11.xx/24
NET2=192.168.12.xx/24
IPNS="ip netns exec $NS ip"

clean() {
    for l in $LINKS; do
        if ip -o link show $l >/dev/null 2>&1; then
            ip link del $l
        fi
    done

    if ip netns list | grep -q $NS; then
        ip netns del $NS
    fi
}
trap clean INT QUIT TERM HUP PIPE 0

make_address() {
    local net addr
    net=$1
    addr=$2

    echo $net | sed "s/xx/$addr/"
}

cidr2ip() {
    echo $1 | cut -d / -f 1
}

GW1=$(make_address $NET1 1)
GW2=$(make_address $NET2 1)
ADDR1=$(make_address $NET1 2)
ADDR2=$(make_address $NET2 2)

setup_veth() {
    local br veth dest
    br=$1
    veth=$2
    dest=$3

    ip link add $br type bridge
    ip link add $veth type veth peer name ${veth}_
    ip link set $br up
    ip link set $veth master $br up
    ip link set ${veth}_ netns $NS name $dest up
}

setup() {
    ip netns add $NS
    $IPNS link set lo up

    setup_veth $BR1 $VETH1 eth0
    setup_veth $BR2 $VETH2 eth1

    local gw1 gw2
    ip addr add $GW1 dev $BR1
    ip addr add $GW2 dev $BR2
    $IPNS addr add $ADDR1 dev eth0
    $IPNS addr add $ADDR2 dev eth1

    $IPNS route add 0.0.0.0/0 nexthop via $(cidr2ip $GW1) nexthop via $(cidr2ip $GW2)
}

test_route_from() {
    local dest dev from r rdev
    dest=$1
    dev=$2
    from=$3
    r=$($IPNS -o route get $dest from $from)
    rdev=$(echo $r | sed -nr 's/^.*dev (eth[[:digit:]]+).*/\1/p')
    if [ "$dev" != "$rdev" ]; then
        echo "WRONG dev/from pair: ip -o route get $dest from $from:"
        printf "%s\n" "$r"
        return
    fi
}

test_route() {
    test_route_from "$1" eth0 $(cidr2ip $ADDR1)
    test_route_from "$1" eth1 $(cidr2ip $ADDR2)
}

run_tests() {
    test_route 12.34.56.78
    test_route 216.58.200.160
    test_route 216.58.200.161
    test_route 216.58.200.162
    test_route 216.58.200.163
    test_route 216.58.200.164
    test_route 52.85.149.10
    test_route 52.85.149.11
    test_route 52.85.149.12
    test_route 52.85.149.13
    test_route 52.85.149.14
}

# main
setup
run_tests
# "read -p" is not POSIX and is rejected by dash, so print the prompt separately.
printf "Press enter to finish"
read ret


* Re: ECMP routing: problematic selection of outgoing interface
@ 2018-05-16 13:01 Andrew Lunn
From: Andrew Lunn @ 2018-05-16 13:01 UTC
  To: Hirotaka Yamamoto; +Cc: netdev

On Wed, May 16, 2018 at 01:51:36AM +0000, Hirotaka Yamamoto wrote:
> Hi,
> 
> Recently I have built a highly-available network using an ECMP
> route connected to two isolated L2 switches as follows.
> 
> Router-- ToR switch 1  ---- Linux
>      |   192.168.11.1/24     |  eth0: 192.168.11.2/24
>      |                       |  eth1: 192.168.12.2/24
>      +-- ToR switch 2  ------+
>          192.168.12.1/24
> 
> The (default) route has been configured with:
> 
>     $ sudo ip route add default \
>            nexthop via 192.168.11.1 \
>            nexthop via 192.168.12.1
> 
> Then I found that Linux chooses a wrong outgoing device for some
> destination/source address pairs like this:
> 
>     $ ip route get 12.34.56.78 from 192.168.12.2
>     12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0
>                                                  # dev should be "eth1"
> 
> As a consequence, programs like SSH or curl do not work for such
> destinations because routers drop packets having strange source
> addresses.

Hi Hirotaka

I assume you add the 192.168.11.1 and 192.168.12.1 to the interfaces
using global scope? Global scope means the IP addresses are valid
everywhere. All routers should know how to route packets to these IP
addresses. So a host is free to pick any of its global scope IP
addresses and use them. The outgoing interface should not matter,
since all routers downstream of it should have routes for the global
scope IP addresses.

It sounds like your router is doing reverse path filtering. It looks
up the source address in its routing table and throws the packet away
if it did not arrive on the interface that route points out of.
If you don't trust your network, this makes sense. It helps to
stop a host spoofing another host, by sending packets with a spoofed
IP address. But you probably want to do reverse path filtering on the
gateway which borders between the networks you do trust and those you
don't.
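
The check described above can be sketched as a small shell model.  This
is only an illustration under simplifying assumptions: every prefix is a
/24, the router is a single box, and the port names port1/port2 and the
ROUTES table are hypothetical, not taken from the actual setup.

```shell
#!/bin/sh
# Toy model of strict reverse path filtering, limited to /24 prefixes.
# ROUTES is a hypothetical router table: "<first three octets> <port>".
ROUTES="192.168.11 port1
192.168.12 port2"

# rpf_check SRC_ADDR IN_PORT: prints "accept" or "drop".
rpf_check() {
    src_net=${1%.*}   # strip the last octet: 192.168.12.2 -> 192.168.12
    in_port=$2
    verdict=drop      # no matching route back to the source: drop
    while read -r net port; do
        # Accept only if the route to the source points out of the
        # port the packet came in on.
        if [ "$net" = "$src_net" ] && [ "$port" = "$in_port" ]; then
            verdict=accept
        fi
    done <<EOF
$ROUTES
EOF
    echo "$verdict"
}

rpf_check 192.168.12.2 port2   # source matches the incoming port
rpf_check 192.168.12.2 port1   # source belongs behind the other port
```

The second call corresponds to a packet with source 192.168.12.2 leaving
via eth0, which such a filter would discard.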

	Andrew


* RE: ECMP routing: problematic selection of outgoing interface
@ 2018-05-16 16:16 Hirotaka Yamamoto
From: Hirotaka Yamamoto @ 2018-05-16 16:16 UTC
  To: Andrew Lunn; +Cc: netdev

Hi Andrew,

> I assume you add the 192.168.11.1 and 192.168.12.1 to the interfaces
> using global scope? Global scope means the IP addresses are valid
> everywhere. All routers should know how to route packets to these IP
> addresses. So a host is free to pick any of its global scope IP

Yes their scopes are global,

> It sounds like your router is doing reverse path filtering. It is
> checking its routing table for the source address, and throwing the
> packets away if they don't come in the interface the route points out
> of.

and yes the routers do reverse path filtering.

Now I understand that this is intended and in fact legitimate behavior.

So it seems that one thing I can do is ask the networking people to accept
these packets.  Another option that has come to mind is to change the scope
of the eth0/eth1 addresses to link and assign a global, routable address to
a dummy interface, so that Linux chooses the dummy interface's address as
the source.
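
A rough sketch of what the dummy-interface option might look like.  It
only prints the ip commands and does not run them.  The interface name
dummy0 and the address in DUMMY_ADDR are placeholders, and the routers
would still need a route back to that address, which is not covered
here.  Deleting an address also drops routes that depend on it, so the
sketch re-adds the ECMP route at the end.

```shell
#!/bin/sh
# Print (do not run) the commands for the dummy-interface idea.
# DUMMY_ADDR is a hypothetical routable address; dummy0 is an assumed name.
DUMMY_ADDR=${DUMMY_ADDR:-10.0.0.2/32}

dummy_plan() {
    echo "ip link add dummy0 type dummy"
    echo "ip link set dummy0 up"
    echo "ip addr add $DUMMY_ADDR dev dummy0"
    # Re-add the per-link addresses with link scope, so they are no
    # longer candidates as the source for traffic sent via a gateway.
    for pair in eth0:192.168.11.2/24 eth1:192.168.12.2/24; do
        dev=${pair%%:*}
        addr=${pair#*:}
        echo "ip addr del $addr dev $dev"
        echo "ip addr add $addr dev $dev scope link"
    done
    echo "ip route replace default nexthop via 192.168.11.1 nexthop via 192.168.12.1"
}

dummy_plan
```

After reviewing the printed plan, it could be applied with something like
"sh plan.sh | sudo sh" (plan.sh being whatever name the sketch is saved
under), ideally inside a test namespace first.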

I'm going to evaluate these options.  Thank you!

- ymmt
